Going through some old code, I found this function.

Given a string carrier and integer nsyms, it consumes nsyms distinct characters from the carrier string and returns a dictionary containing consumed characters and their respective counts, and the remainder of the carrier string.

def enc_consume(carrier, nsyms):
  chrs = defaultdict(int)

  i = 0  # see this?

  for i,c in enumerate(carrier):
    # `i` is not used in the loop
    chrs[c] += 1
    if len(chrs) == nsyms:
      break

  return chrs, carrier[i+1:]  # but it's used here

How would you rewrite this? I found the assignment i = 0 confusing, since it was followed by a for i..., which will of course do the assignment as well.

share|improve this question
1  
Yes, the i = 0 is effectively pointless because the name is reassigned on the next line. – Luke Sawczak 15 hours ago
    
It's not "effectively pointless", because if the assignment wasn't made, i would be out of scope by the time it is used in return. It's a local variable and its scope is limited to the for statement. After the for, it doesn't exist anymore. – kyrill 15 hours ago
3  
It does still exist (just tried your function in Python 2.7 and 3.5, removing that assignment line, for my sanity). Perhaps Python should automatically trash for loop variables, but it doesn't. – Luke Sawczak 13 hours ago
    
You're right! I must have thought it was Go or something... In that case, this whole answer is pointless, since the assignment was the only thing I found confusing. – kyrill 12 hours ago
1  
Yes, and originally I did accept that answer. But I prefer to iterate over the elements of a sequence instead of incrementing a counter and retrieving an element in each iteration. Especially in interpreted languages, it can be quite inefficient. – kyrill 12 hours ago
up vote 7 down vote accepted

I would avoid using indexes altogether and rely on iter to advance character by character and remember what's left in the string:

def enc_consume(carrier, nsyms):
    characters = defaultdict(int)
    stream = iter(carrier)

    for char in stream:
        characters[char] += 1
        if len(characters) >= nsyms:
             break

    return characters, ''.join(stream)
share|improve this answer
1  
Very elegant solution! As per looks. I wondered what the performance implications of ''.join(stream) were, since the carrier string can be thousands of characters long. A quick test with a string of length 65536 advanced by 10 characters shows it takes ~2000 times longer than returning a substring. However, to make the substring, you need the current position. That means you cannot avoid using an index, or can you? Yes, using the __length_hint__ method of the iterator. It's called "hint", but for underlying sequences with __len__ method, it returns exact length of the remaining part. – kyrill 13 hours ago
    
@kyrill Just tried it, with len(a) of 65536: %timeit enc_consume(a, 10) 10 loops, best of 3: 49 ms per loop and %timeit enc_consume2(a, 10) 10 loops, best of 3: 40.2 ms per loop. Version 2 being the one with join. – Mathias Ettinger 13 hours ago
    
What interpreter? I use CPython 3.5.1 on linux and the join-version is definitely order of magnitude slower. – kyrill 13 hours ago
    
@kyrill This was IPython 5.1.0 running Python 3.6.0 on windows. When I get the chance, I'll try on your setup as well. – Mathias Ettinger 13 hours ago
    
@kyrill Silly me, a don't have 10 different characters... So yeah, pretty much crap on speed. – Mathias Ettinger 12 hours ago
def enc_consume(carrier, nsyms):
    first_unused_idx = 0
    char_cnt = defaultdict(int)
    while first_unused_idx < len(carrier) and len(char_cnt) < nsyms:
        char_cnt[carrier[first_unused_idx]] += 1
        first_unused_idx += 1 
    return char_cnt, carrier[first_unused_idx:]

I'd suggest using a while loop here because it would make declaring the counter outside the loop natural and the number of iterations is not known beforehand (which is a common use case for a while loop).

share|improve this answer

Instead of a defaultdict(int), consider using a Counter instead. The code in this function would otherwise be identical, but the Counter would more clearly convey your intentions. Furthermore, Counter methods such as .most_common() might be useful to the caller.

share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.