We have a Python application that performs a DFS over data in a database. Right now the data is small, so I can hold everything in a Python container, but the data is going to grow and there may come a time when it no longer fits in RAM. On the other hand, querying the database on every iteration is not a smart solution, is it?

But the question is more general. The solution most probably lies in some smart caching, but I don't have any experience with that.

What are the techniques for dealing with large data when you need to access it frequently (say, while running an algorithm)?
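Since the question mentions caching without a concrete starting point, here is a minimal sketch of memoizing database lookups with `functools.lru_cache`. The `GRAPH` dict and `fetch_children` function are hypothetical stand-ins for the real database query; the call counter only exists to make the caching effect visible.

```python
from functools import lru_cache

# Hypothetical stand-ins for the real database: in the actual
# application, fetch_children would run a query. The counter shows
# how many times the "database" is actually hit.
CALLS = {"n": 0}
GRAPH = {1: (2, 3), 2: (4,), 3: (4,), 4: ()}

@lru_cache(maxsize=10_000)  # bounds memory: least-recently-used entries are evicted
def fetch_children(node_id):
    CALLS["n"] += 1          # each increment represents one real query
    return GRAPH[node_id]

def dfs(start):
    """Iterative DFS that pulls neighbours through the cache."""
    seen, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(fetch_children(node))
    return order

order = dfs(1)  # first run fills the cache; repeated runs hit it for free
```

A bounded `lru_cache` is a middle ground between "everything in RAM" and "query on every iteration": hot nodes stay resident while cold ones are evicted and re-fetched on demand.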

One technique for improving the performance of database operations (RDBMS or NoSQL) is running most of the algorithm server-side, in the form of a stored procedure or a database-specific batch script. There are usually also quite a few engine-specific optimization techniques available, such as indexes, data clustering, etc. Your question is too vague and abstract for any practically useful recommendations. –  xmojmr Mar 7 at 12:39

@xmojmr I assume that by server-side you mean running an SQL script instead of doing it in Python or another programming language, but that is nearly impossible in my case (I have some problem-specific optimizations that cannot be expressed sensibly in SQL), and the performance wasn't good either (I tried). –  khajvah Mar 7 at 14:13

Yes, by server-side I meant running the data-processing logic as close as possible to the physical data storage, eliminating network I/O; that is, inside the database server, using the server's scripting language. For this to be efficient, the problem needs to be translated into relational algebra using sets and set operators (that is what SQL servers are good at). You may need to generate a few problem-specific stored procedures and drive them from your Python code. Server-side performance should be better than client-side performance. Try again. –  xmojmr Mar 7 at 15:06

@xmojmr Oh, by the performance difference I meant keeping the data in RAM and running Python, as opposed to running an SQL script. Thanks for the suggestion; I will consider your solution. –  khajvah Mar 7 at 17:41

@xmojmr The reasons I didn't include more information about the specific database and the specific problem are: 1. the technology we currently use might change if we find something better, and 2. I wanted to keep this question general, answering the question: "How do algorithms deal with data that is impossible to keep in RAM?" –  khajvah Mar 7 at 17:57

1 Answer

There is an entire field called external memory algorithms, and there are many textbooks about it.

An example of an external memory algorithm is External Sorting.
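To make the idea concrete, here is a minimal sketch of external merge sort in Python: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then stream-merge the runs with `heapq.merge`. The `chunk_size` here is tiny purely for illustration; a real implementation would size chunks to available RAM.

```python
import heapq
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temp file and rewind it for reading."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in sorted_chunk)
    f.seek(0)
    return f

def external_sort(items, chunk_size=1000):
    """Sort an iterable that may not fit in RAM.

    Only chunk_size items are held in memory at once; the k-way merge
    keeps just one pending line per run file.
    """
    run_files, chunk = [], []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:  # spill the final partial chunk
        run_files.append(_spill(sorted(chunk)))
    # int() tolerates the trailing newline on each line
    runs = (map(int, f) for f in run_files)
    return list(heapq.merge(*runs))

result = external_sort([5, 3, 8, 1, 9, 2, 7, 4, 6], chunk_size=3)
```

The same sort/spill/merge pattern underlies most external memory algorithms: do as much work as possible on an in-memory block, then combine blocks with sequential I/O.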

A relational database management system is not always the best solution in such cases. Some NoSQL ("Not Only SQL") database management systems have been designed to perform better in certain scenarios.

For in-memory depth-first search there are also several different algorithms and implementations to choose from.
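One implementation choice matters in particular for Python: a recursive DFS fails with `RecursionError` once the search path exceeds the interpreter's recursion limit (see `sys.getrecursionlimit()`, typically 1000), while an explicit-stack version has no such depth bound. A sketch, using a deliberately deep chain graph as test data:

```python
import sys

def dfs_iterative(graph, start):
    """Depth-first traversal with an explicit stack instead of recursion,
    so it handles paths far deeper than Python's recursion limit."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        yield node
        # reversed() so children are explored in their listed order
        stack.extend(reversed(graph.get(node, ())))

# A chain 0 -> 1 -> ... -> 5000, far deeper than the default recursion limit
deep = {i: [i + 1] for i in range(5000)}
visited = list(dfs_iterative(deep, 0))
```

The generator form also pairs naturally with external storage: the caller consumes nodes one at a time, so the traversal never needs the full result set in memory.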
