I have a file containing over 500,000 lines. The lines look like the following:
0-0 0-1 1-2 1-3 2-4 3-5
0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7
0-9 1-8 2-14 3-7 5-6 4-7 5-8 6-10 7-11
For each pair, the first number is the index of a word on line n in text a, and the second number is the index of a word on the same line n in text b. It is also worth pointing out that the same word in text a may be connected to multiple words in text b; for example, on the line at index 0, the word at position 0 in text a is connected to the words at positions 0 and 1 in text b. Now I want to extract the information from these lines so that it is easy to retrieve which word in text a is connected to which word in text b. My idea is to use dictionaries, as in the following code:
# Suppose the file has been opened as f
for line in f.readlines():
    # Create a dictionary to hold the results for this line
    dict_st = dict()
    # Split the line to get items like '0-0', '0-1', etc.
    items = line.split()
    for item in align_spl:
        # Split each item at the hyphen to get the two indices (still strings)
        als = item.split('-')
        # Fill the dictionary: text-a index -> list of text-b indices
        if not dict_st.has_key(int(als[0])):
            dict_st[int(als[0])] = [int(als[1])]
        else:
            dict_st[int(als[0])].append(int(als[1]))
After all the information about word correspondences across the texts has been extracted, I print the words that are aligned to each other. This method is very slow, especially since I have to repeat it for more than 500,000 sentences. I was wondering if there is a faster way to extract this information. Thank you.
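To make the intended result concrete, here is a minimal sketch of the same per-line extraction written with collections.defaultdict and with the file iterated lazily instead of through readlines(). The function names parse_alignment and parse_file are hypothetical, chosen only for this illustration, and whether this variant is noticeably faster on 500,000 lines is an assumption that would need to be measured.

```python
from collections import defaultdict


def parse_alignment(line):
    """Map each text-a word index to the list of text-b indices on one line.

    parse_alignment is a hypothetical helper name used for illustration.
    """
    mapping = defaultdict(list)
    for item in line.split():
        # partition splits once at the first hyphen: '2-14' -> ('2', '-', '14')
        src, _, tgt = item.partition('-')
        # defaultdict creates the empty list on first access, so no key check is needed
        mapping[int(src)].append(int(tgt))
    return mapping


def parse_file(path):
    """Yield one mapping per line of the file at path (hypothetical helper)."""
    with open(path) as f:
        # Iterating over f reads one line at a time instead of loading them all
        for line in f:
            yield parse_alignment(line)
```

For the first sample line, dict(parse_alignment("0-0 0-1 1-2 1-3 2-4 3-5")) gives {0: [0, 1], 1: [2, 3], 2: [4], 3: [5]}, so looking up index 0 returns both aligned positions in text b.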
has_key. if int(als[0]) not in dict_st: works fine – gnibbler 8 mins ago
align_spl? – gnibbler 4 mins ago