
I have a file containing over 500,000 lines. The lines look like this:

0-0 0-1 1-2 1-3 2-4 3-5
0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7
0-9 1-8 2-14 3-7 5-6 4-7 5-8 6-10 7-11

In each pair, the first number is the index of a word on line n in text a, and the second number is the index of a word on the same line n in text b. It is also worth pointing out that the same word in text a may be connected to multiple words in text b; for example, on the line at index 0, the word at position 0 in text a is connected to the words at positions 0 and 1 in text b. I want to extract this information so that it is easy to retrieve which word in text a is connected to which word in text b. My idea is to use dictionaries, as in the following code:

#suppose that I have opened the file as f
for line in f.readlines():
    #I create a dictionary to save my results
    dict_st=dict()
    #I split the line so to get items like '0-0', '0-1', etc.
    items=line.split()  
    for item in align_spl:
        #I split each item at the hyphen so to get the two digits that are now string.
        als=item.split('-')
        #I fill the dictionary
        if dict_st.has_key(int(als[0]))==False:
            dict_st[int(als[0])]=[int(als[1])]
        else: dict_st[int(als[0])].append(int(als[1]))

After all the information about word correspondences across the texts has been extracted, I print the words that are aligned to each other. This method is very slow, especially since I have to repeat it for more than 500,000 sentences. I was wondering if there is a faster way to extract this information. Thank you.

Don't use has_key. if int(als[0]) not in dict_st: works fine – gnibbler 8 mins ago
What is align_spl? – gnibbler 4 mins ago

1 Answer

Hi, I am not sure that this is what you need.

If you need a dictionary for each line:

for line in f.readlines():
    dict_st=dict()
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

If you need a dictionary for the whole file:

dict_st={}
for line in f.readlines():
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

I have used a set instead of a list to prevent repeated values. If you need these repeats, please use a list:

dict_st={}
for line in f.readlines():
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, []).append(v)
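
To illustrate the difference between the set and list variants, here is a small sketch on a made-up line in which the pair 0-1 occurs twice. The sample data in the question contains no such repeat, so the input string and the names with_set and with_list are assumptions for illustration only:

```python
# Hypothetical input: the pair 0-1 appears twice on purpose.
line = "0-0 0-1 0-1 1-2"

with_set = {}
with_list = {}
for item in line.split():
    k, v = map(int, item.split('-'))
    with_set.setdefault(k, set()).add(v)    # a repeated value is stored once
    with_list.setdefault(k, []).append(v)   # a repeated value is stored each time

print(with_set)
print(with_list)
```

With the set, key 0 ends up with two values; with the list, it ends up with three, because the second 0-1 is appended again.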
using defaultdict(set) would be neater. Also for line in f: doesn't need to read the whole file into memory at once – gnibbler 2 mins ago
Yes, sorry, you are right. I copied this line and did not notice readlines() – oleg 59 secs ago
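
The two suggestions from the comments (defaultdict(set), and iterating over the file object instead of calling readlines()) could be combined roughly as follows. This is only a sketch: the function names parse_alignment and parse_alignments are made up here, and io.StringIO stands in for the real file, which is assumed to be opened in text mode.

```python
import io
from collections import defaultdict


def parse_alignment(line):
    """Map each word index in text a to the set of linked indices in text b."""
    links = defaultdict(set)
    for item in line.split():
        a, b = map(int, item.split('-'))
        links[a].add(b)
    return links


def parse_alignments(f):
    """Yield one mapping per line, reading the file lazily."""
    for line in f:  # no readlines(): only one line is held at a time
        yield parse_alignment(line)


# Stand-in for the real file, using the first two sample lines from the question.
sample = io.StringIO(
    "0-0 0-1 1-2 1-3 2-4 3-5\n"
    "0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7\n"
)
alignments = list(parse_alignments(sample))
print(sorted(alignments[0][0]))
```

Because parse_alignments is a generator, each line's mapping can be processed and discarded before the next one is built, which should keep memory use flat for a file with 500,000 lines.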
