I'm trying to implement an algorithm in Python that searches for multiple keys across ten huge files (16 million rows each). I have a sorted file with 62 million keys, and I scan each of the ten files in the dataset to look up every key and its value.
Sorted key file:
a
b
c
d
...
Each dataset file appears like this:
a 1
d 10
g 3
...
Output (if a key does not appear in a file, its value for that file is set to 0):
<sum_vals>,a,<val_1_file>,<val_2_file>,...,<val_10_file>
<sum_vals>,b,<val_1_file>,<val_2_file>,...,<val_10_file>
....
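To make the output format concrete, here is a minimal sketch of how one output line could be assembled from the per-file values. The helper name format_row is hypothetical and only used for illustration; it is not part of my script.

```python
def format_row(key, values):
    """Build '<sum_vals>,<key>,<val_1_file>,...,<val_n_file>'.

    `values` holds one number per dataset file, with 0 for files
    in which the key does not appear.
    """
    fields = [str(sum(values)), key] + [str(v) for v in values]
    return ",".join(fields)


# Hypothetical example: key "a" found in files 1 and 3, missing in file 2.
print(format_row("a", [1, 0, 3]))
```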
Here is my Python algorithm. I scan the file containing the keys, and for each key I look up its value in every dataset file (dataset is an array containing the file names). For each file, the output line gets 0 if the key is not present and <val_in_file> if it matches. Everything works fine on a small set of training data, but it takes an extremely long time to run on the real data.
import os, sys

files = ["20140601","20140602","20140603","20140604","20140605","20140606","20140607","20140608","20140609","20140610"]

def printToFile(text):
    f = open('processed/20140601','a')
    f.write(text)
    f.close()

def getKey(line):
    data = line.split(" ")
    return data[0] + " " + data[1]

def getPageSource(line):
    data = line.split(" ")
    return data[0]

def getPageTitle(line):
    data = line.split(" ")
    return data[1]

def getPageClicks(line):
    data = line.split(" ")
    return data[2]

def searchWindow(dataset, key):
    isHere = 0
    total = 0
    acc = getPageTitle(key)
    for f in dataset:
        line_with_keyword = next((line for line in open("dataset/"+f) if key in line), None)
        if line_with_keyword is not None:
            isHere = 1
            click = getPageClicks(line_with_keyword)
        if(isHere == 1):
            acc = acc.strip('\n') + "," + click
            total += int(click)
            isHere = 0
        else:
            acc = acc.strip('\n') + "," + str(0)
            total += 0
            isHere = 0
    printToFile(str(total)+","+acc.strip('\n')+","+"\n")

with open("processed/sorted_keys") as inSortedKey:
    for line in inSortedKey:
        searchWindow(files, getKey(line).strip("\n"))
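The code above rescans every dataset file from the start for each key, so the work grows with (number of keys) x (number of files) x (lines per file). One possible alternative, sketched here under assumptions, is to read each dataset file once into a dictionary and then make a single pass over the key file. The names load_counts and write_report are hypothetical. The sketch assumes each dataset line looks like "<source> <title> <clicks>", that all ten dictionaries fit in memory (which may not hold for 16 million rows per file), and it matches keys exactly, whereas the "key in line" test above is a substring match. It also omits the trailing comma the original output has.

```python
def load_counts(path):
    """Read one dataset file into a dict: "<source> <title>" -> clicks."""
    counts = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 3:  # skip malformed or empty lines
                counts[parts[0] + " " + parts[1]] = int(parts[2])
    return counts


def write_report(key_path, dataset_paths, out_path):
    """Write '<sum>,<title>,<val_1>,...,<val_n>' for every key, in key order."""
    # Each dataset file is read exactly once.
    tables = [load_counts(p) for p in dataset_paths]
    # Both files stay open for the whole pass.
    with open(key_path) as keys, open(out_path, "w") as out:
        for line in keys:
            parts = line.split()
            if len(parts) < 2:
                continue
            key = parts[0] + " " + parts[1]
            # Dictionary lookup replaces the per-key file scan; missing keys give 0.
            values = [table.get(key, 0) for table in tables]
            fields = [str(sum(values)), parts[1]] + [str(v) for v in values]
            out.write(",".join(fields) + "\n")
```

With the paths from my script, a call might look like write_report("processed/sorted_keys", ["dataset/" + f for f in files], "processed/20140601"). If memory is the limit, the same idea could be applied one dataset file at a time.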
printToFile, how does that look? And how heavy are the getPageTitle() and getPageClicks()? – holroy yesterday
inSortedKey and keep it open until you are finished. Opening/closing files is expensive... – holroy yesterday
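The suggestion about keeping the file open could look roughly like the sketch below: the output file is opened once and every line goes through the same handle, instead of reopening the file for each write as printToFile does. The function name write_lines is hypothetical.

```python
def write_lines(lines, out_path):
    """Append all lines to out_path using a single open file handle."""
    with open(out_path, "a") as out:  # opened once, closed once at the end
        for text in lines:
            out.write(text)
```

Passing a generator as lines would let the rows be produced one at a time without building the whole output in memory.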