C++ file parsing code

Question

I'm processing about 3.5gb/15minutes, and I was wondering if I could get this down to a better time...

Note that this is for a very specific CSV file, so I am not concerned about generality - only speed of implementation and optimization.

int main(int argc, char *argv[]) {
    std::string filename;
    std::cout << "Please type the file name: " << std::endl;
    std::cin >> filename;

    std::string ticker;
    std::cout << "Please enter the ticker: " << std::endl;
    std::cin >> ticker;

    std::ifstream instream(filename.c_str());
    std::string ask_filename = ticker + "_ASK.NIT";
    std::ofstream askstream(ask_filename.c_str());

    std::string bid_filename = ticker + "_BID.NIT";

    std::ofstream bidstream(bid_filename.c_str());

    std::string line;
    std::getline(instream,line);

    while(std::getline(instream,line))
    {
        std::stringstream lineStream(line);
        std::string cell;
        std::string new_line;
        std::vector<std::string> my_str_vec;
        while(std::getline(lineStream,cell,','))
        {
            my_str_vec.push_back(cell);
            //new_line.append(cell.append(";"));
        }
        // works on date
        std::string my_date = my_str_vec[0];
        std::string::iterator my_iter;
        std::string processed_date = "";
        for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
            if(std::isalnum(*my_iter) || *my_iter == ' ')
                processed_date.append(1,(*my_iter));
        }

        my_str_vec[0] = processed_date;


        std::vector<std::string>::iterator my_vec_iter;
        for(my_vec_iter = my_str_vec.begin() + 1; my_vec_iter != my_str_vec.end(); ++my_vec_iter) {
            std::string my_semicol = ";";
            *my_vec_iter = my_semicol.append(*my_vec_iter);
        }

        askstream << my_str_vec[0] << my_str_vec[1] << my_str_vec[3] << std::endl;
        bidstream << my_str_vec[0] << my_str_vec[2] << my_str_vec[4] << std::endl;
    }

    askstream.close();
    bidstream.close();


    return 0;
}

Loki Astari · Accepted Answer · 2011-09-04 19:47:45Z

First off streams and specifically the stream operators >> and << are designed for robustness not speed. They make the code harder to write incorrectly (and provide some local specific conversions) that make expensive in terms of using (consider changing to C fscanf() if speed is paramount).

So if speed is you objective streams is not the best choice.

Assuming we are staying with streams (Ask another question about how to write this in C for speed):

Dead Code

Second delete dead code (it confuses a review) and costs in creation.

    std::string new_line;

Small Inefficiencies

There is no need to create the cell object each time through the loop create it once outside the loop. (Personally I would leave it in the loop, but if you are going for speed then this is an obvious simple save).

When scanning the date you are making a copy of the value:

    std::string my_date = my_str_vec[0];

No need just make a reference (you can even use a const reference for safety):

    std::string const&   my_date = my_str_vec[0];
          //    ^^^^^^

Use Reserve

Building the string character by character may not be the most efficient way of getting the date (as the string processed_date may be resized on each insert).

    for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
        if(std::isalnum(*my_iter) || *my_iter == ' ')
            processed_date.append(1,(*my_iter));
    }

Use reserve so you know that it is not re-sized internally, then use a standard algorithm for optimal looping:

processed_date.reserve(my_date.size());  // Note: not resize()
std::remove_copy(my_date.begin(), my_date.end(), 
                 std::back_inserter(processed_date),
                 [](char const& my_char) {return !(std::isalnum(*my_char) || *my_char == ' ');}
                );
// Note if you don't have C++0x then lamda is easily replaced by a functor.

Standard algorithms.

Your code will be optimal when you use the standard algorithms. Its worth learning what is available.

http://www.sgi.com/tech/stl/table_of_contents.html
See section 5

I showed you how to copy all but the characters you want. But there is one to remove all elements in-place. I'll let you experiment.

Mike Dunlavey · Answer 2 · 2011-09-05 01:36:10Z

As a rule of thumb, parsing in a compiled language should be I/O bound. In other words, it shouldn't take much longer to parse the file than it takes to just copy it by any method.

Here's how I optimize code. I'm not looking for "slow routines". I want a much finer level of granularity.

I'm looking for individual statements or function call sites, that are on the call stack a substantial fraction of the time, because any such call, if it could be replaced by some code that did the same action in a lot less time, would reduce total running time by roughly that fraction.

It's extremely easy to find them. All you gotta do is pause the program a few times, and each time get out your magnifying glass and ask what it's doing and why.

Don't stop after you've found and fixed just one problem. Since you've reduced the total time, other problems now take a larger percentage of the time, and are thus easier to find. So, the more you do it, the more you find, and each speedup compounds on the previous ones. That's how you really make it fast.

Loki Astari · Answer 3 · 2011-09-04 22:15:43Z

Untest version using fscanf()

#include <iostream>
#include <stdio.h>

struct ValidDate
{
    bool operator()(char c) const { return !(std::isalnum(c) || c == ' ');}
};

int main(int argc, char *argv[])
{
    std::string filename;
    std::cout << "Please type the file name: " << std::endl;
    std::cin >> filename;

    std::string ticker;
    std::cout << "Please enter the ticker: " << std::endl;
    std::cin >> ticker;

    std::string ask_filename = ticker + "_ASK.NIT";
    std::string bid_filename = ticker + "_BID.NIT";

    FILE*   instream    = fopen(filename.c_str(), "r");
    FILE*   askstream   = fopen(ask_filename.c_str(), "w");
    FILE*   bidstream   = fopen(bid_filename.c_str(), "w");

    char    part[5][1000];
    char    term;
    int     dateLen;
    while(feof(instream))
    {
        term    = '\n';
        dateLen = 0;
        fscanf(instream, "%999[^,\n]%n", part[0], &dateLen);fscanf(instream,"%c", &term);
        // Note you do need two fscanf() in case there is an empty cell with no data
        if (term == ',')    {fscanf(instream, "%999[^,\n]", part[1]);fscanf(instream,"%c", &term);}
        if (term == ',')    {fscanf(instream, "%999[^,\n]", part[2]);fscanf(instream,"%c", &term);}
        if (term == ',')    {fscanf(instream, "%999[^,\n]", part[3]);fscanf(instream,"%c", &term);}
        if (term == ',')    {fscanf(instream, "%999[^,\n]", part[4]);fscanf(instream,"%c", &term);}
        // Read the remainder of the line
        // Actually need to use two fscans (just in case there is only a '\n' left)
        if (term == ',')   {fscanf(instream, "%*[^\n]"); fscanf(instream, "%c", &term);}

        if (term != '\n')
        {    abort(); // something very wrong happened
                      // like data was too big for a part buffer
        }

        // If we have reached the EOF then 
        // something went wrong with one of the reads (probably the first)
        // and we should not continue.
        if (feof(instream))
        {   break;
        }

        char* newLast   = std::remove_if(static_cast<char*>(part[0]), static_cast<char*>(part[0]) + dateLen, ValidDate());
        *newLast    = '\0';

        fprintf(askstream, "%s;%s;%s;\n", part[0], part[1], part[3]);
        fprintf(bidstream, "%s;%s;%s;\n", part[0], part[2], part[4]);
    }

    fclose(askstream);
    fclose(bidstream);
}

dex black · Answer 4 · 2011-09-07 04:51:47Z

Remember that std::string is allowed to have embedded NULL chars. Eliminate all unnecessary copying. e.g.

my_str_vec.push_back(cell);

std::stringstream lineStream(line);

std::string my_date = my_str_vec[0];

Use iterators everywhere.

typedef std::vector<std::string::iterator> field_positions;
field_positions field;
std::back_insert_iterator<field_positions> biitfp(field)
bool found = true;

for (std::string::iterator it = date; it != line.end(); ++it) {
  if (found) { *biitfp = it; found = false; }
  found = ',' == *it;
  if (found) *it = 0; // overwrite commas with NULL
}
*biitfp = it; // save the end point too see field.back() below

do everything in place in the copy of the line you already have

std::string::iterator it1 = field[0].begin(), it2 = field[0].begin();
for (;0 != *it2; ++it2)
{
    // overwrite bad char with good char
    if (std::isalnum(*my_iter) || *my_iter == ' ') *it1++ = *it2;
}
while (it1 != it2) *it1 = 0; // overwrite the left over chars in the field with NULL

int f = 0;
for (field_positions::const_iterator fpit = field.begin(), fpend = field.end();
     fpit != fpend; ++fpit, ++f)
{
    for (std::string::const_iterator it = *fpit; 0 != *it && it != field.back(); ++it)
    {
        switch (f) { // single character stream output is FAST
        case 0: askstream << *it; bidstream << *it; break;
        case 1: case 3: askstream << *it; break;
        case 2: case 4: bidstream << *it; break;
        default: break;
        }
    }
    askstream << ';';
    bidstream << ';';
}
askstream << std::endl; bidstream << std::endl;

asked	1 year ago
viewed	582 times
active	1 year ago

C++ file parsing code

4 Answers

Dead Code

Small Inefficiencies

Use Reserve

Standard algorithms.

Your Answer

Not the answer you're looking for? Browse other questions tagged c++ or ask your own question.

C++ file parsing code

4 Answers

Dead Code

Small Inefficiencies

Use Reserve

Standard algorithms.

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c++ or ask your own question.

Related