Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

Here is the problem description (from an algorithm course: https://www.coursera.org/learn/algorithm-design-analysis):

The goal of this problem is to implement a variant of the 2-SUM algorithm (covered in the Week 6 lecture on hash table applications).

The file contains 1 million integers, both positive and negative (there might be some repetitions!). This is your array of integers, with the \$i\$th row of the file specifying the \$i\$th entry of the array.

Your task is to compute the number of target values \$t\$ in the interval [-10000,10000] (inclusive) such that there are distinct numbers \$x,y\$ in the input file that satisfy \$x+y=t\$.

I wrote the following Python program. It runs already 25 minutes and just calculated 6200 of 20000 numbers. This is too slow.

import sys

def has_two_sum(n,num_set,nums):
  res = any(((n-x) in num_set) and 2*x !=n for x in nums)
  return res

with open(sys.argv[1], 'r') as fp:
  nums = [int(line) for line in fp]
  num_set = set(nums)

two_sum = sum(1  for n in range(-10000,10001) if has_two_sum(n,num_set,nums))
print(two_sum)
share|improve this question
    
You're asking for help/providing partial solutions to a Coursera algorithms course. I'm pretty sure this violates the Coursera TOS... at the very least, you should probably cite your source for the problem if you're going to the trouble of putting the description in a quote block. – apnorton yesterday
    
Oops. I will cite it. Thanks a lot !!! – David Michael Gang yesterday
up vote 7 down vote accepted

Your code runs instantly on my system with 1 million randomly generated numbers. are you sure you are searching for n - x in num_set and not in nums?

Anyway, a couple of Python style pointers:

  1. You can iterate over the set directly, no need to pass both the set and the list:

    def has_two_sum(n, num_set):
        res = any(((n-x) in num_set) and 2*x != n for x in num_set)
        return res
    
  2. Not much point in storing a value to immediuately return it:

    def has_two_sum(n, num_set):
        return any(((n-x) in num_set) and 2*x != n for x in num_set)
    
  3. Boolean values are converted to 0 or 1 when used in an integer setting, so your adding of values is typically written as:

    two_sum = sum(has_two_sum(n, num_set) for n in range(-10000, 10001))
    
  4. If you are going to have a function that contains a single liner, you better give it a great, descriptive name. I don't think that has_two_sum really fits the bill. Perhaps has_two_items_that_sum_to_n or has_two_that_sum_to_n? Alternatively, I think my preferred option here would be to get rid of the function altogether and let code speak by itself, writing the whole thing as two nested comprehensions:

    two_sum = sum(any(n - x in num_set and 2*x != n for x in num_set)
                  for n in range(-10000, 10001))
    

    There is a school of thought that considers nested comprehensions to be confusing (e.g. the Google Python style guide doesn't allow them), but I think this case is simple enough, YMMV.

share|improve this answer
3  
If you already got rid of the need to keep num as a list, a set comprehension should build it faster than the detour over the list. – Graipher yesterday
    
Hi, I get much worse timings. I forgot to say that my integers are in the size of 11 digits. I also use python 3. Maybe this makes a difference – David Michael Gang yesterday
1  
Are you on a 32 bit system? If Python has to use its long, unlimited precision integers, then yes, things will get substantially slower. 11 digits is more than 2**31 - 1, but 2**63 - 1 is 18 digits, so on a 64 bit build of Python this should be a non issue. Python 2 or 3 I think should have no major effect on this, although my timings were on Python 2. – Jaime yesterday

To get a grasp of what to speedup I ran a profiler on your code. To get the 1 million integers I made a function to generate them, with the following:

def generate_random_numbers(amount=1000000):
    import random
    n = [str(int(random.normalvariate(0, amount))) for _ in range(amount)]
    with open('data', 'w') as f:
        f.write('\n'.join(n))

Yes I did put the sigma as one million too. I then copied your code into a function, removed the file part, and ran the profiler over it. Resulting in the following code:

import cProfile

def read_nums():
    with open('data', 'r') as fp:
        return [int(line) for line in fp]

def has_two_sum(n,num_set,nums):
    res = any(((n-x) in num_set) and 2*x !=n for x in nums)
    return res

def two_sum(nums):
    nums = [int(line) for line in nums]
    num_set = set(nums)
    return sum(1  for n in range(-10000,10001) if has_two_sum(n,num_set,nums))

nums = read_nums()
cProfile.run('two_sum({!r})'.format(nums))

Which resulted in the following profile of your code:

         166366 function calls in 6.966 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.362    0.362    0.900    0.900 2sum.py:11(two_sum)
        1    0.395    0.395    0.395    0.395 2sum.py:12(<listcomp>)
    20002    0.015    0.000    0.139    0.000 2sum.py:14(<genexpr>)
    20001    0.025    0.000    0.124    0.000 2sum.py:7(has_two_sum)
   106356    0.052    0.000    0.052    0.000 2sum.py:8(<genexpr>)
        1    0.216    0.216    1.116    1.116 <string>:1(<module>)
    20001    0.047    0.000    0.095    0.000 {built-in method builtins.any}
        1    5.851    5.851    6.966    6.966 {built-in method builtins.exec}
        1    0.004    0.004    0.143    0.143 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Which is not 25 minuets for 6200 of the 20000. Instead it's a lot less. It says it took 6.966, but cumtime of two_sums is 0.900. I then used a timer on it, which says it runs in ~0.899s. Using:

print(timeit('two_sum({!r})'.format(nums), 'from __main__ import two_sum', number=100)/100)

So performance wise, I'd say it's not Python's fault, and would be very hard to optimize. Instead I'd change your code to be a 'three line' function, but format the comprehensions to be more readable.

  • I'd remove un-needed brackets
  • Use a set compression.
  • Remove odd whitespace. ((1 for ...)
def two_sum(nums):
    nums = {int(n) for n in nums}
    return sum(
        1
        for n in range(-10000, 10001)
        if any(n - x in nums and 2 * x != n for x in nums)
    )
share|improve this answer
    
Since you profiled the code, is there an advantage in using x + x != n over 2 * x != n? – Mathias Ettinger yesterday
    
@MathiasEttinger I ran the profile code on the original twice and it had a 10% difference (0.896 - 0.784) in speed, when I changed it to use + it went to 0.804. Honestly I don't know if it's any good... – Joe Wallis yesterday
    
Ok, nice to know. It was just out of curiosity since the use case was pretty close to the toy example in this doc. – Mathias Ettinger yesterday

You aren't handling duplicates correctly, by my interpretation of the problem. Based on the sample code in the course notes, I interpret "distinct" to mean that you may not use the same entry twice, but you may use the same number twice if it occurs more than once in the file. That is, if the input file contains

3
3

… then I would expect it to be able to form a target sum of 6. Your program reports that 6 cannot be formed.

share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.