My company has a suite of mature (roots go back 15+ years), computationally intensive applications (simulations) which we use to do consulting work. I work on the simulation/consulting side of the house, not the software development side. I recently learned about a particular feature of the architecture which concerns me: our large (200-500MB, sometimes more) set of input data is first loaded completely into memory from the database. Then, the simulation engine accesses the input data from memory as it goes.
This struck me as "premature optimization" and/or an old-fashioned idea, as I believe there are database servers out there that can be quite fast. The architecture also causes another problem: our simulation database can only grow so big (a little over a GB) before a sim cannot run at all. The chief architect is a scientist, not a software developer, and somewhat hard-headed at that, so I am seeking comments on this question from the community.
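To make the tradeoff concrete, here is a minimal sketch (hypothetical, not our actual code) contrasting the two access patterns, assuming a SQLite database with a table named `inputs` standing in for our simulation input store:

```python
import sqlite3

# Hypothetical stand-in for the simulation input database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputs (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO inputs (value) VALUES (?)",
                 [(float(i),) for i in range(1000)])

# Approach 1: load everything up front (what our engine does).
# Memory use scales with the whole dataset, but every later access is fast.
all_rows = conn.execute("SELECT id, value FROM inputs").fetchall()
total_eager = sum(value for _id, value in all_rows)

# Approach 2: stream rows from the database as the simulation needs them.
# Memory use stays bounded, at the cost of database round-trips during the run.
total_lazy = 0.0
for _id, value in conn.execute("SELECT id, value FROM inputs"):
    total_lazy += value

assert total_eager == total_lazy  # same data, different memory/latency profile
```

The first approach trades memory for predictable access latency during the run; the second trades per-access latency for a working set that no longer limits how large the input database can be.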
Is our architecture hurting us? Are there any papers, write-ups, or benchmarks on the subject, i.e. the tradeoffs between reading from the database as you go versus working from a full copy in memory? Any comments or insight would be appreciated. I would like to get schooled up before I take the issue up with my company.