How do you diagnose and fix memory leaks involving Django and Scikit-learn?
I'm working on a Django management command that trains several text classifiers implemented using scikit-learn. I'm using all the tricks I know to plug Django memory leaks, including:
- Setting DEBUG = False
- Using .iterator() for queryset iteration
- Using .defer() so huge column values aren't loaded unnecessarily
- Clearing cached querysets by periodically calling MyModel.objects.update()
- Manually invoking gc.collect() to speed up garbage collection
These techniques have solved all my memory-leak problems with long-running Django processes in the past. However, I'm still seeing a massive memory leak here, which suggests the problem is with scikit-learn (or how I'm using it) rather than with Django.
My process looks basically like:
import gc
import os
import tempfile
from base64 import b64encode

import joblib
from django.conf import settings
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

from myapp.models import Document, SavedClassifier  # app path simplified here

tmp_debug = settings.DEBUG
settings.DEBUG = False
try:
    documents = Document.objects.all().defer('text')
    for document in documents.iterator():
        classifier = Pipeline([
            ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
            ('tfidf', TfidfTransformer()),
            ('clf', OneVsRestClassifier(LinearSVC())),
        ])
        x_train = document.training_vector
        y_train = document.classification_index
        classifier.fit(x_train, y_train)

        # Serialize the fitted pipeline and store it base64-encoded on the model.
        obj, _ = SavedClassifier.objects.get_or_create(document=document)
        _, fn = tempfile.mkstemp()
        joblib.dump(classifier, fn, compress=9)
        with open(fn, 'rb') as f:
            obj.classifier = b64encode(f.read())
        os.remove(fn)
        obj.save()

        # Flush Django's cached querysets and force a collection pass.
        Document.objects.update()
        gc.collect()
finally:
    settings.DEBUG = tmp_debug
Each document object holds several pages of text in its "text" field. I have about 50 Document records, and after processing just 5 of them the script is consuming about 4GB of memory, with usage climbing steadily the whole time.
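To pin down which step inside the loop is responsible for the growth, I'm planning to log resident memory around the fit and dump calls, roughly like the sketch below (it assumes psutil is installed; log_rss is a helper name of my own):

import psutil

_process = psutil.Process()  # the current management-command process

def log_rss(label):
    # Print the resident set size in MB so per-step growth is visible.
    rss_mb = _process.memory_info().rss / (1024 * 1024)
    print('%s: %.1f MB' % (label, rss_mb))

# Called inside the loop, e.g.:
#     log_rss('before fit')
#     classifier.fit(x_train, y_train)
#     log_rss('after fit')
#     joblib.dump(classifier, fn, compress=9)
#     log_rss('after dump')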
Is there any way I can diagnose and fix this memory leak, short of running my script once for each document?
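One idea is to diff tracemalloc snapshots between iterations to see which allocation sites keep growing, something like the sketch below, though I'm not sure it will attribute memory held inside native extension code:

import tracemalloc

from myapp.models import Document  # app path assumed

tracemalloc.start()
previous = tracemalloc.take_snapshot()

for document in Document.objects.all().defer('text').iterator():
    # ... existing training / saving code for this document ...
    current = tracemalloc.take_snapshot()
    # Show the ten call sites whose allocations grew the most this iteration.
    for stat in current.compare_to(previous, 'lineno')[:10]:
        print(stat)
    previous = current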