I'm currently enjoying an exercise I'm working on for fun: writing optimized code that searches each line of a (very large) file for particular strings, counts how many occur, and produces a text file with those counts.
The data format is very consistent, so when I call IndexOf()
on the lines of my logs, I'm searching against lines that all conform to this:
2011-05-13 00:00:00 195.249.159.77 GET /blahblah/blah/lol.png - 80 - 141.166.254.22 Mozilla/5.0+(Windows+NT+6.1;+rv:2.0b12)+Gecko/20100101+Firefox/4.0b12 http://lolcats.com/content/styles.css?x=a9d6c00 lolcats.com
(hence the specific indexes)
private static void Main()
{
    var sw = new Stopwatch();
    var ht = new Hashtable();
    var ts = new TimeSpan();
    // Not used yet; the next step is scanning the whole directory.
    var files = Directory.GetFiles(Directory.GetCurrentDirectory());
    var tempPath = Path.GetTempFileName();
    var tempOutput = new StreamWriter(tempPath);
    const int iterations = 5;

    for (int i = 0; i < iterations; i++)
    {
        sw.Start();
        using (var sr = new StreamReader("4.log"))
        {
            for (string line; (line = sr.ReadLine()) != null;)
            {
                if (line.Length > 27)
                {
                    // The timestamp and server IP take up the first 27+ characters,
                    // so the method token can only start at index 27 or later.
                    var lineHit = line.IndexOf("get", 27, 13, StringComparison.OrdinalIgnoreCase);
                    if (lineHit > -1)
                    {
                        var relevantDataIndex = line.IndexOf(" - ", 27,
                            StringComparison.OrdinalIgnoreCase);
                        var relevantData = line.Substring(lineHit + 4, relevantDataIndex - (lineHit + 4));
                        if (ht.Contains(relevantData))
                        {
                            ht[relevantData] = (int) ht[relevantData] + 1;
                        }
                        else
                        {
                            ht.Add(relevantData, 1);
                        }
                    }
                }
            }
        }
        sw.Stop();
        ts += sw.Elapsed;
        Console.WriteLine("time elapsed: {0}", sw.Elapsed);
        if (i == iterations - 1)
        {
            tempOutput.WriteLine("Count: " + ht.Keys.Count + Environment.NewLine);
            foreach (DictionaryEntry entry in ht)
            {
                tempOutput.WriteLine("{0}, {1}", entry.Key, entry.Value);
            }
        }
        else
        {
            ht.Clear();
        }
        sw.Reset();
    }

    // TimeSpan.Seconds only returns the seconds component (0-59);
    // TotalSeconds stays correct for runs longer than a minute.
    var average = ts.TotalSeconds / iterations;
    tempOutput.Close();
    using (var output = new StreamWriter("outputWithOneSubstring.txt"))
    {
        output.WriteLine("Iterations: {0}{1}Average time / Iteration: {2}", iterations, Environment.NewLine,
            average);
    }
    using (Stream input = File.OpenRead(tempPath))
    using (Stream output = new FileStream("outputWithOneSubstring.txt", FileMode.Append, FileAccess.Write,
        FileShare.None))
    {
        input.CopyTo(output);
    }
}
I want to extract the path in the log (after the GET request) and count how many times each one occurs. With the code above I can read a very large file and accurately count all GETs in about 2 seconds.
I'd like to do this faster. The next step in the exercise is to scan the directory and run it across all the files.
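For the multi-file step, one possible shape (a sketch, not a definitive implementation: the `LogScanner` class, the `countLines` delegate, and the assumption that every matching file shares the same format are all mine) is to enumerate files lazily and merge per-file counts into one total:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LogScanner
{
    // Merges per-file counts into one total dictionary.
    // Directory.EnumerateFiles streams file names one at a time
    // instead of materializing the whole array up front.
    public static Dictionary<string, int> ScanDirectory(string dir, string pattern,
        Func<IEnumerable<string>, Dictionary<string, int>> countLines)
    {
        var total = new Dictionary<string, int>();
        foreach (var file in Directory.EnumerateFiles(dir, pattern))
        {
            foreach (var pair in countLines(File.ReadLines(file)))
            {
                int n;
                total.TryGetValue(pair.Key, out n);
                total[pair.Key] = n + pair.Value;
            }
        }
        return total;
    }
}
```

Keeping the per-line counting as a delegate means the same scan loop works unchanged whatever counting strategy ends up fastest.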
Performance-wise, what I've thought about so far are asynchronous methods in C# and swapping my Hashtable for a Dictionary.
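For the Hashtable-to-Dictionary swap, a minimal sketch of what the counting loop might become (the `PathCounter` class and `CountPaths` name are hypothetical; the index math is taken from the code above):

```csharp
using System;
using System.Collections.Generic;

static class PathCounter
{
    // Counts GET paths with a typed Dictionary<string, int>.
    // TryGetValue does one lookup where Contains + indexer costs two,
    // and the typed values avoid the boxing a Hashtable forces on ints.
    public static Dictionary<string, int> CountPaths(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            if (line.Length <= 27) continue;
            var hit = line.IndexOf("get", 27, 13, StringComparison.OrdinalIgnoreCase);
            if (hit < 0) continue;
            var end = line.IndexOf(" - ", 27, StringComparison.Ordinal);
            var path = line.Substring(hit + 4, end - (hit + 4));
            int n;
            counts.TryGetValue(path, out n);
            counts[path] = n + 1;
        }
        return counts;
    }
}
```

Fed with `File.ReadLines("4.log")`, this also drops the manual StreamReader bookkeeping from the main loop.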
Could anyone give me some critique and/or suggestions? All are welcome. I've written this hastily, so there may be bugs (and a need for refactoring), but that can be cleaned up afterward, so it's not so important. I'd simply be very grateful for critique of the current code and suggestions to make it as lightning-fast as possible.