
Motivation: I have many large read-only dicts mapping long string keys to tuples of two floats, and they are taking up a large amount of heap space.

Proposed solution: an n×3 numpy array whose first column holds the hash of each key, sorted by that column. Only `__contains__` and `__getitem__` are required for my needs.

I realize that this solution is O(log n) rather than O(1), but I assume that isn't a big problem for my application: there aren't many lookups per request, and my primary constraint is memory.

I'm not sure how well `searchsorted` performs here, though, or whether I should adjust the strides so that the first column is contiguous. Perhaps I should be using a record array instead.
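One way to sketch the contiguity idea (illustrative names and sample data, not a drop-in replacement): keep the hashes in their own contiguous int64 array, parallel to an (n, 2) float array of values. Note that a record-array field view would actually be *strided*, not contiguous, whereas two parallel arrays give a truly contiguous key column, and storing hashes as int64 avoids casting them to float64.

```python
import numpy as np

# Sample stand-in data; the real dicts map long strings to float pairs.
d = {'foo': (1.0, 2.0), 'bar': (3.0, 4.0), 'baz': (5.0, 6.0)}

# Sort (hash, v0, v1) rows by hash, then split into two parallel arrays:
# a contiguous int64 key column and an (n, 2) float64 value table.
rows = sorted((hash(k), v[0], v[1]) for k, v in d.items())
keys = np.array([r[0] for r in rows], dtype=np.int64)     # contiguous
vals = np.array([r[1:] for r in rows], dtype=np.float64)  # shape (n, 2)

# Lookup: binary search the key column, then check for an exact match.
ix = np.searchsorted(keys, hash('foo'))
found = ix < len(keys) and keys[ix] == hash('foo')
```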

import numpy

class FCD(object):
  """Read-only mapping backed by an array sorted on the key-hash column."""

  def __init__(self, d):
    data = []
    for key, value in d.items():
      data.append((hash(key), value[0], value[1]))
    data.sort()
    # numpy infers float64 here, so hashes are stored as floats;
    # hash magnitudes above 2**53 would lose precision.
    self.data = numpy.array(data)

  def __getitem__(self, key):
    hkey = hash(key)
    # Binary search the sorted hash column for the insertion point.
    ix = numpy.searchsorted(self.data[:, 0], hkey)
    if ix < len(self.data) and self.data[ix][0] == hkey:
      return self.data[ix][1:]
    raise KeyError(key)

  def __contains__(self, key):
    hkey = hash(key)
    ix = numpy.searchsorted(self.data[:, 0], hkey)
    return ix < len(self.data) and self.data[ix][0] == hkey
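To make the memory claim concrete, here is a rough, self-contained sketch of the array's footprint versus the dict's (the value of `n` and the key format are arbitrary, and exact sizes are CPython/NumPy implementation details that vary by version and platform):

```python
import sys
import numpy as np

n = 100_000
d = {'key%06d' % i: (float(i), float(i)) for i in range(n)}

# The proposed layout: one (n, 3) float64 array, sorted by the hash column.
table = np.array(sorted((hash(k), a, b) for k, (a, b) in d.items()))

array_bytes = table.nbytes        # exactly 3 * 8 * n for the payload
dict_bytes = sys.getsizeof(d)     # the hash table alone; excludes the
                                  # string keys and value tuples, which
                                  # the dict also keeps alive
```

The array side is exact; the dict side is an underestimate, since each string key and each 2-tuple is a separate heap object on top of `sys.getsizeof(d)`.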
I am a little confused as to why you do not just use a dict here. Can you prove that your solution is more memory efficient? The lookups will certainly be slower. –  ebarr Apr 13 at 12:27
