How do you diagnose and fix memory leaks involving Django and Scikit-learn?
I'm working on a Django management command that trains several text classifiers implemented using scikit-learn. I'm using all the tricks I know to plug Django memory leaks, including:
- Setting DEBUG = False
- Using .iterator() for queryset iteration
- Using .defer() so huge column values aren't loaded unnecessarily
- Clearing cached querysets by periodically calling MyModel.objects.update()
- Manually invoking gc.collect() to speed up garbage collection
These techniques have solved all my memory-leak problems with long-running Django processes in the past. However, I'm still seeing a massive memory leak here, which suggests the problem is with scikit-learn (or how I'm using it) rather than with Django.
My process looks basically like:
import gc
import os
import tempfile
from base64 import b64encode

import joblib
from django.conf import settings
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

from myapp.models import Document, SavedClassifier  # app path simplified here

tmp_debug = settings.DEBUG
settings.DEBUG = False
try:
    documents = Document.objects.all().defer('text')
    for document in documents.iterator():
        classifier = Pipeline([
            ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
            ('tfidf', TfidfTransformer()),
            ('clf', OneVsRestClassifier(LinearSVC())),
        ])
        x_train = document.training_vector
        y_train = document.classification_index
        classifier.fit(x_train, y_train)

        # Serialize the fitted pipeline and store it base64-encoded on the model.
        obj, _ = SavedClassifier.objects.get_or_create(document=document)
        _, fn = tempfile.mkstemp()
        joblib.dump(classifier, fn, compress=9)
        with open(fn, 'rb') as f:
            obj.classifier = b64encode(f.read())
        os.remove(fn)
        obj.save()

        # Flush Django's cached querysets and force a collection pass.
        Document.objects.update()
        gc.collect()
finally:
    settings.DEBUG = tmp_debug
Each document object holds several pages of text in its "text" field. I have about 50 Document records, and after processing just 5 of them the script is consuming about 4GB of memory, with usage climbing steadily the whole time.
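To pin down which step inside the loop is responsible for the growth, I'm planning to log resident memory around the fit and dump calls, roughly like the sketch below (it assumes psutil is installed; log_rss is a helper name of my own):

import psutil

_process = psutil.Process()  # the current management-command process

def log_rss(label):
    # Print the resident set size in MB so per-step growth is visible.
    rss_mb = _process.memory_info().rss / (1024 * 1024)
    print('%s: %.1f MB' % (label, rss_mb))

# Called inside the loop, e.g.:
#     log_rss('before fit')
#     classifier.fit(x_train, y_train)
#     log_rss('after fit')
#     joblib.dump(classifier, fn, compress=9)
#     log_rss('after dump')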
Is there any way I can diagnose and fix this memory leak, short of running my script once for each document?
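One idea is to diff tracemalloc snapshots between iterations to see which allocation sites keep growing, something like the sketch below, though I'm not sure it will attribute memory held inside native extension code:

import tracemalloc

from myapp.models import Document  # app path assumed

tracemalloc.start()
previous = tracemalloc.take_snapshot()

for document in Document.objects.all().defer('text').iterator():
    # ... existing training / saving code for this document ...
    current = tracemalloc.take_snapshot()
    # Show the ten call sites whose allocations grew the most this iteration.
    for stat in current.compare_to(previous, 'lineno')[:10]:
        print(stat)
    previous = current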