I'm currently enjoying an exercise I'm working on for fun: writing optimized code that searches each line of a (very large) file for particular strings, counts how many occur, and produces a text file with those counts.
The data format is very consistent, so when I call IndexOf()
on the lines of my logs, I'm searching against lines that all conform to this:
2011-05-13 00:00:00 195.249.159.77 GET /blahblah/blah/lol.png - 80 - 141.166.254.22 Mozilla/5.0+(Windows+NT+6.1;+rv:2.0b12)+Gecko/20100101+Firefox/4.0b12 http://lolcats.com/content/styles.css?x=a9d6c00 lolcats.com
(hence the specific indexes)
private static void Main()
{
    var sw = new Stopwatch();
    var ht = new Hashtable();
    var ts = new TimeSpan();
    // Not used yet; the next step is scanning the whole directory.
    var files = Directory.GetFiles(Directory.GetCurrentDirectory());
    var tempPath = Path.GetTempFileName();
    var tempOutput = new StreamWriter(tempPath);
    const int iterations = 5;

    for (int i = 0; i < iterations; i++)
    {
        sw.Start();
        using (var sr = new StreamReader("4.log"))
        {
            for (string line; (line = sr.ReadLine()) != null;)
            {
                if (line.Length > 27)
                {
                    // The timestamp and server IP take up the first 27+ characters,
                    // so the method token can only start at index 27 or later.
                    var lineHit = line.IndexOf("get", 27, 13, StringComparison.OrdinalIgnoreCase);
                    if (lineHit > -1)
                    {
                        var relevantDataIndex = line.IndexOf(" - ", 27,
                            StringComparison.OrdinalIgnoreCase);
                        var relevantData = line.Substring(lineHit + 4, relevantDataIndex - (lineHit + 4));
                        if (ht.Contains(relevantData))
                        {
                            ht[relevantData] = (int) ht[relevantData] + 1;
                        }
                        else
                        {
                            ht.Add(relevantData, 1);
                        }
                    }
                }
            }
        }
        sw.Stop();
        ts += sw.Elapsed;
        Console.WriteLine("time elapsed: {0}", sw.Elapsed);
        if (i == iterations - 1)
        {
            tempOutput.WriteLine("Count: " + ht.Keys.Count + Environment.NewLine);
            foreach (DictionaryEntry entry in ht)
            {
                tempOutput.WriteLine("{0}, {1}", entry.Key, entry.Value);
            }
        }
        else
        {
            ht.Clear();
        }
        sw.Reset();
    }

    // TimeSpan.Seconds only returns the seconds component (0-59);
    // TotalSeconds stays correct for runs longer than a minute.
    var average = ts.TotalSeconds / iterations;
    tempOutput.Close();
    using (var output = new StreamWriter("outputWithOneSubstring.txt"))
    {
        output.WriteLine("Iterations: {0}{1}Average time / Iteration: {2}", iterations, Environment.NewLine,
            average);
    }
    using (Stream input = File.OpenRead(tempPath))
    using (Stream output = new FileStream("outputWithOneSubstring.txt", FileMode.Append, FileAccess.Write,
        FileShare.None))
    {
        input.CopyTo(output);
    }
}
I want to extract the path in the log (after the GET request) and count how many times each one occurs. With the code above I can read a very large file and accurately count all GETs in about 2 seconds.
I'd like to do this faster. The next step in the exercise is to scan the directory and run it across all the files.
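For the multi-file step, one possible shape (a sketch, not a definitive implementation: the `LogScanner` class, the `countLines` delegate, and the assumption that every matching file shares the same format are all mine) is to enumerate files lazily and merge per-file counts into one total:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LogScanner
{
    // Merges per-file counts into one total dictionary.
    // Directory.EnumerateFiles streams file names one at a time
    // instead of materializing the whole array up front.
    public static Dictionary<string, int> ScanDirectory(string dir, string pattern,
        Func<IEnumerable<string>, Dictionary<string, int>> countLines)
    {
        var total = new Dictionary<string, int>();
        foreach (var file in Directory.EnumerateFiles(dir, pattern))
        {
            foreach (var pair in countLines(File.ReadLines(file)))
            {
                int n;
                total.TryGetValue(pair.Key, out n);
                total[pair.Key] = n + pair.Value;
            }
        }
        return total;
    }
}
```

Keeping the per-line counting as a delegate means the same scan loop works unchanged whatever counting strategy ends up fastest.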
Performance-wise, what I've thought about so far are asynchronous methods in C# and swapping my Hashtable for a Dictionary.
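For the Hashtable-to-Dictionary swap, a minimal sketch of what the counting loop might become (the `PathCounter` class and `CountPaths` name are hypothetical; the index math is taken from the code above):

```csharp
using System;
using System.Collections.Generic;

static class PathCounter
{
    // Counts GET paths with a typed Dictionary<string, int>.
    // TryGetValue does one lookup where Contains + indexer costs two,
    // and the typed values avoid the boxing a Hashtable forces on ints.
    public static Dictionary<string, int> CountPaths(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            if (line.Length <= 27) continue;
            var hit = line.IndexOf("get", 27, 13, StringComparison.OrdinalIgnoreCase);
            if (hit < 0) continue;
            var end = line.IndexOf(" - ", 27, StringComparison.Ordinal);
            var path = line.Substring(hit + 4, end - (hit + 4));
            int n;
            counts.TryGetValue(path, out n);
            counts[path] = n + 1;
        }
        return counts;
    }
}
```

Fed with `File.ReadLines("4.log")`, this also drops the manual StreamReader bookkeeping from the main loop.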
Could anyone give me some critique and/or suggestions? All are welcome. I've written this hastily, so there may be bugs (and a need for refactoring), but that can be cleaned up afterward, so it's not so important. I'd simply be very grateful for critique of the current code and suggestions to make it as lightning-fast as possible.