Creating dictionaries in a faster way - Python

Question

I have a file containing over 500,000 lines. The lines look like this:

0-0 0-1 1-2 1-3 2-4 3-5
0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7
0-9 1-8 2-14 3-7 5-6 4-7 5-8 6-10 7-11

For each pair, the first number is the index of a word on line n in text a, and the second number is the index of a word on the same line n in text b. It is also worth pointing out that the same word in text a may be connected to multiple words in text b; for example, on the line at index 0, the word at position 0 in text a is connected to the words at positions 0 and 1 in text b. I want to extract this information so that it is easy to retrieve which word in text a is connected to which word in text b. My idea is to use dictionaries, as in the following code:

#suppose that I have opened the file as f
for line in f.readlines():
    #I create a dictionary to save my results
    dict_st=dict()
    #I split the line so to get items like '0-0', '0-1', etc.
    items=line.split()  
    for item in items:
        #I split each item at the hyphen so to get the two digits that are now string.
        als=item.split('-')
        #I fill the dictionary
        if dict_st.has_key(int(als[0]))==False:
            dict_st[int(als[0])]=[int(als[1])]
        else: dict_st[int(als[0])].append(int(als[1]))

After all the information about word correspondences across the texts has been extracted, I print the words that are aligned to each other. This method is very slow, especially since I have to repeat it for more than 500,000 sentences. I was wondering if there is a faster way to extract this information. Thank you.

Don't use has_key. if int(als[0]) not in dict_st: works fine
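
A minimal sketch of the membership test suggested in this comment, assuming Python 3 (where dict.has_key no longer exists). The sample string is the first line of the question's data; converting the key to int once, into a local name, is an extra tidy-up and not part of the comment.

```python
# Sample line taken from the question's data
line = "0-0 0-1 1-2 1-3 2-4 3-5"

dict_st = {}
for item in line.split():
    a, b = item.split('-')
    key = int(a)  # convert once instead of on every lookup
    if key not in dict_st:
        # first time this text-a index is seen: start a new list
        dict_st[key] = [int(b)]
    else:
        dict_st[key].append(int(b))

print(dict_st)
```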

oleg · Answer 1 · 2013-06-13 11:38:58Z

Hi, I am not sure that this is what you need.

If you need a dictionary for each line:

for line in f.readlines():
    dict_st=dict()
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

If you need a dictionary for the whole file:

dict_st={}
for line in f.readlines():
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

I have used set instead of list to prevent repeated values. If you need these repeats, please use list:

dict_st={}
for line in f.readlines():
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, []).append(v)

Using defaultdict(set) would be neater. Also, for line in f: doesn't need to read the whole file into memory at once.
Yes, sorry, you are right. I copied this line and did not notice readlines().
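
The two suggestions in the comments above (defaultdict(set), and iterating over the file object instead of calling readlines()) could be combined roughly as follows. This is only a sketch: the helper name parse_alignments is hypothetical, and io.StringIO stands in for the opened file f so the example runs on its own.

```python
from collections import defaultdict
import io


def parse_alignments(lines):
    """Yield one dict per line, mapping a text-a index to the set of text-b indices."""
    for line in lines:  # iterating lazily; no readlines() needed
        dict_st = defaultdict(set)
        for item in line.split():
            k, v = map(int, item.split('-'))
            dict_st[k].add(v)  # the set is created on first access
        yield dict_st


# Stand-in for the real file object (assumption: f yields lines like these)
f = io.StringIO(
    "0-0 0-1 1-2 1-3 2-4 3-5\n"
    "0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7\n"
)

results = list(parse_alignments(f))
print(dict(results[0]))
```

Because the function is a generator, each line's dictionary can be consumed and discarded before the next line is read, which keeps memory use flat for a 500,000-line file. Swap set for list (and add for append) if repeated values must be kept.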

