Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

Here's my script:

def index(seq, item):
    """Returns the index of an item in an increasingly-sorted
       sequence (if exists) otherwise returns the index where
       it would be inserted, using binary search.

       seq: a sequence.

       precondition: seq is sorted increasingly.
       """
    # Updates the sequence with the item using a
    # set to avoid duplicates.
    seq = set(seq)
    seq.update([item])

    # Converts the sequence to a list and re-sorts it. 
    sorted_seq = sorted(seq)

    middle = len(sorted_seq) // 2
    middle_value = sorted_seq[middle]

    first_half = sorted_seq[:middle]
    second_half = sorted_seq[middle:]

    if item == middle_value:
        return middle

    elif item < middle_value:
        return index(first_half, item)

    elif item > middle_value:
        return index(second_half, item) + len(first_half)

How can it be refactored and optimized?

Notes:

  • I'm a hobbyist and beginner (in programming in general, and Python in particular).

  • I'm new to binary search I just understand the basic idea.

share|improve this question
up vote 2 down vote accepted
seq = set(seq)
seq.update([item])

# Converts the sequence to a list and re-sorts it. 
sorted_seq = sorted(seq)

In worst case, sorted() will take \$\Theta(n \log n)\$, where \$n = \$ len(seq). Even worse, if you did not found the value, you recur to half of the original sequence, and so on.

So, instead of presorting the sequence, just proceed to doing the actual search; if the caller fails to input a sorted list, it's not your fault.

I have this in mind, runs in worst case logarithmic time and behaves exactly as your implementation:

def binary_search(lst, value):
    count = len(lst)
    first = 0

    while count:
        it = first
        step = count // 2
        it += step
        if value > lst[it]:
            it += 1
            first = it
            count -= step + 1
        else:
            count = step

    return first


def main():
    lst = [1, 2, 5, 10, 11, 17, 20]

    for value in range(0, 25):
        print("OP:", index(lst, value))
        print("cr:", binary_search(lst, value))
        print("---")

if __name__ == "__main__":
    main()

Hope that helps.

share|improve this answer
    
Thank you! That's far better than my algorithm, and not buggy as it. – Mahmud Muhammad Naguib 19 hours ago
    
Can we use it = first+step instead of initializing it with first and incrementing with step? I've got the total idea and the logic, but can you, please, explain the logic behind count -= step + 1 to me? – Mahmud Muhammad Naguib 19 hours ago
seq = set(seq)

This will cause index calculations to be incorrect. Consider this invocation:

index([1, 1, 2, 3], 2)

This should return 2 but your function returns 1. As a caller of your index method, I would expect the returned index to have item in seq.


seq.update([item])

This is a bug. It's causing print(index([1, 1, 2, 3], 15)) to return 3, when it should return -1 or some other sentinel value to denote absence of an item in seq.

EDIT: The function's documentation clearly says that index will return the final index after insertion if item does not exist in seq.


sorted_seq = sorted(seq)

This is unnecessary. Validate/assume your input seq to be sorted. This should be the standard expectation from any caller invoking your index function.


Be careful about:

first_half = sorted_seq[:middle]
second_half = sorted_seq[middle:]

Python fortunately will not create copies of the list but other languages may create copies of the original list for this operation.


After you have removed seq.update([item]), you will need to check for the base of your recursion, namely when seq is empty, causing

middle_value = sorted_seq[middle]

to fail.


As an aside, try implementing this as an iterative algorithm.

  1. It will run faster
  2. The recursive version will crash for large seq.

Totally aside, have a look at Binary Search tutorial on TopCoder Its a pretty good article.

share|improve this answer
    
Thank you! I've stated in the docstring otherwise returns the index where it would be inserted, theerfore I've used seq.update([item]). But Using seq = set(seq) is buggy as you've mentioned. Can you explain, please, what you mean by Python fortunately will not create copies of the list but other languages may create copies of the original list for this operation., ? – Mahmud Muhammad Naguib yesterday
    
@MahmudMuhammadNaguib I was pointing at the memory requirements of your algorithm. In each recursive step of your algorithm, you probably do not want to create fresh copies of the first and second halves of the arrays. Python slices do this by default but in a language other than python, you would probably want to pass the list by reference along with high/end and low/start pointers/indexes that mark the boundaries of the sequence to be searched for. This would incur the cost of 3 pointers in each recursive step, making extra memory requirements O(1). – j4nu5 yesterday
    
Thanks! I suggest you remove sorted_seq = sorted(seq) and the section about it. After inserting item the sequence won't be sorted anymore, so we have to re-sort it. – Mahmud Muhammad Naguib 20 hours ago
    
Please do not do that. As the accepted answer suggests, that would make the time complexity O(n lg n). Beats the point of binary search, which is supposed to run in O(lg n). Instead this should be your binary search criteria: "Find the index of the leftmost element in seq which is >= item". This will take care of duplicates and missing elements. – j4nu5 17 hours ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.