
I'm trying to parse and insert a big CSV file into MongoDB, but when the file exceeds 100,000 rows I get a bad response from the server. The files I need to insert are usually above 200,000 rows.

I've tried both a bulk insert (insertMany) and a Babyparse (Papaparse) streaming approach to insert the file row by row, but with poor results.

Node api:

router.post('/csv-upload/:id', multipartMiddleware, function(req, res) {

    // Post variables
    var fileId = req.params.id;
    var csv = req.files.files.path;

    // create a queue object with concurrency 5
    var q = async.queue(function(row, callback) {
        var entry = new Entry(row);
        entry.save();
        callback();
    }, 5);

    baby.parseFiles(csv, {
        header: true, // Includes header in JSON
        skipEmptyLines: true,
        fastMode: true,
        step: function(results, parser) {
            results.data[0].id = fileId;

            q.push(results.data[0], function (err) {
                if (err) {throw err};
            });
        },
        complete: function(results, file) {
            console.log("Parsing complete:", results, file);
            q.drain = function() {
                console.log('All items have been processed');
                res.send("Completed!");
            };
        }
    });
});

This streaming approach results in: POST SERVER net::ERR_EMPTY_RESPONSE

Not sure if I'm using the async.queue correctly though.
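I suspect the worker callback should only fire once the save has actually finished, and that q.drain needs to be assigned before rows are pushed, roughly like this (just a guess, untested):

var q = async.queue(function(row, callback) {
    var entry = new Entry(row);
    entry.save(function(err) {
        // only signal the queue once the save has completed
        callback(err);
    });
}, 5);

// assign drain before any rows are pushed
q.drain = function() {
    console.log('All items have been processed');
    res.send("Completed!");
};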

Is there a better and more efficient way to do this OR am I doing something wrong?

Express Server:

// Dependencies
var express = require('express');
var path = require('path');
var bodyParser = require('body-parser');
var routes = require('./server/routes');
var mongoose = require("mongoose");
var babel = require("babel-core/register");
var compression = require('compression');
var PORT = process.env.PORT || 3000;
// Include the cluster module
var cluster = require('cluster');

mongoose.connect(process.env.MONGOLAB_URI || 'mongodb://localhost/routes');

// Code to run if we're in the master process
if (cluster.isMaster) {

    // Count the machine's CPUs
    var cpuCount = require('os').cpus().length;

    // Create a worker for each CPU
    for (var i = 0; i < cpuCount; i += 1) {
        cluster.fork();
    }

// Code to run if we're in a worker process
} else {
    // Express
    var app = express();

    app.use(bodyParser.json({limit: '50mb'}));
    app.use(bodyParser.urlencoded({limit: '50mb', extended: true}));

    // Compress responses
    app.use(compression());

    // Used for production build
    app.use(express.static(path.join(__dirname, 'public')));

    routes(app);

    // Routes
    app.use('/api', require('./server/routes/api'));

    app.all('/*', function(req, res) {
        res.sendFile(path.join(__dirname, 'public/index.html'));
    });

    // Start server
    app.listen(PORT, function() {
        console.log('Server ' + cluster.worker.id + ' running on ' + PORT);
    });
}
  • This solution I did for another question might be of help to you - stackoverflow.com/questions/32386118/… - the OP reported loading 700k records from CSV in about 1 minute. Commented Jan 28, 2017 at 20:39
  • Thank you @Robbie! I will take a look at that. Commented Jan 31, 2017 at 14:14
  • That thread helped me a lot, @Robbie. Thank you. Commented Feb 9, 2017 at 17:12
  • Glad it helped you. Commented Feb 9, 2017 at 18:22

1 Answer


Handling the import:

Great question. In my experience, by far the fastest way to insert a CSV into Mongo is via the command line:

mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline 

I don't believe mongoose has a way of calling mongoimport (someone correct me if I'm wrong).

But it's simple enough to call via node directly:

var exec = require('child_process').exec;
var cmd = 'mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline';

exec(cmd, function(error, stdout, stderr) {
  // do whatever you need during the callback
});

The above will have to be modified to be dynamic, but it should be self-explanatory.
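For example, a rough sketch of a dynamic wrapper (the function name and arguments here are placeholders, and anything that comes from user input should be sanitised before being put into a shell command):

var exec = require('child_process').exec;

// rough sketch - dbName, collectionName and filePath are placeholders
// that you would fill in from your own request / config
function importCsv(dbName, collectionName, filePath, done) {
    var cmd = 'mongoimport -d ' + dbName +
              ' -c ' + collectionName +
              ' --type csv --file ' + filePath +
              ' --headerline';

    exec(cmd, function(error, stdout, stderr) {
        // hand the result back to the caller
        done(error, stdout, stderr);
    });
}

// e.g. importCsv('db_name', 'collection_name', req.files.files.path, function(err) { ... });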

Handling the upload:

Uploading the file from a front-end client is another challenge.

Most browsers will time out if you make a request to a server and don't get a response within 60 seconds (which is probably what you are referring to above).

One solution would be to open a socket connection (search for socket.io on npm for details). This will create a persistent connection to the server and won't be subject to the timeout restrictions.
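A very rough sketch of the idea (here server would be the http server returned by app.listen, and runImport and the event names are made-up placeholders):

var io = require('socket.io')(server);

io.on('connection', function(socket) {
    socket.on('start-import', function(data) {
        // kick off the long-running parse/import, then notify the client when it finishes
        runImport(data, function(err) {
            socket.emit('import-finished', { error: err ? err.message : null });
        });
    });
});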

If uploading is not an issue, and the timeout is due to the slow parsing/inserting, then you may not have to worry about this once you implement the above.

Other considerations:

I'm not sure exactly what you need to send back to the user, or what parsing needs to take place. But that can either be done outside of the normal request/response cycle, or handled over a socket connection if it's needed within a single request/response cycle.


4 Comments

Thank you for your reply. That sounds interesting. Uploading the file is not a problem right now. Do you have any idea how mongoimport will handle the CSV? The file will always look something like this: version task concept price 1 1 3 1 1 1 2 2 1 1 1 1 etc etc etc...
To see how it handles the import I would just test with a sample CSV. Put the CSV on your server and run the mongoimport command mentioned above referencing your sample CSV. Then open mongo via the command line and view the data that was inserted. Also, the mongo documentation is a great reference: docs.mongodb.com/manual/reference/program/mongoimport
Thanks. Another problem with this approach is that I need to add the id dynamically, because it is not included in the file.
How are you using the id? Is each id a different collection, or is the id a field you add within the CSV? If it's the former, use the id as the collection name; if it's the latter, you could do a simple Mongoose update in the callback of the mongoimport exec command.
