Bencode “Parser” in Python 3

Question

class BencodeObject:

def __init__(self,filename):
    with open(filename,'rb') as diskbytes:
        self.loads(diskbytes)

def _isdictionary(self,bits): return(bits == b"d")
def _isnumber(self,bits): return(bits > b"/" and bits < b":")
def _iscolon(self,bits): return(bits == b":")
def _isint(self,bits): return(bits == b"i")
def _islist(self,bits): return(bits == b"l")
def _isend(self,bits): return(bits == b"e")
def _isnull(self,bits): return(bits == b"")

def loads(self,bcbuffer):
    tokens = []
    null = b""
    numbuffer = null # a buffer for storing number bytes
    last = []
    while True:
        atom = bcbuffer.read(1)
        if self._isdictionary(atom): last.append(atom)
        elif self._isnumber(atom): 
            numbuffer += atom
        elif self._iscolon(atom):                              
            tokens.append(bcbuffer.read(int(numbuffer)))
            numbuffer = null 
        elif self._isint(atom): last.append(atom)
        elif self._islist(atom): last.append(atom)
        elif self._isend(atom): 
            char = last.pop()
            if self._isint(char):
                tokens.append(int(numbuffer))
                numbuffer = null
        else: break
    self.bytestrings = tokens
    self.trackerURL = str(self.bytestrings[1],"utf-8")

Looking for whatever input I can get.

Winston Ewert · Answer 1 · 2011-11-02 14:43:01Z

class BencodeObject:

Putting object inside Class names is generally not a good idea, because it doesn't tell me anything.

def __init__(self,filename):
    with open(filename,'rb') as diskbytes:
        self.loads(diskbytes)

def _isdictionary(self,bits): return(bits == b"d")

There is no need for the parens around the expression. They just clutter things

def _isnumber(self,bits): return(bits > b"/" and bits < b":")

That's a confusing way to express this. Instead use bits >= b'0' and bits <= '9'

def _iscolon(self,bits): return(bits == b":")
def _isint(self,bits): return(bits == b"i")
def _islist(self,bits): return(bits == b"l")
def _isend(self,bits): return(bits == b"e")
def _isnull(self,bits): return(bits == b"")

Its not clear how useful these functions are. You call them once, and calling them isn't shorter then just checking the string for equality in the first place. Also the pythons style guide recommends using underscores_to_seperate_words. As it is I keep reading isend as i send.

def loads(self,bcbuffer):
    tokens = []
    null = b""
    numbuffer = null # a buffer for storing number bytes

I'm not sure a constant for an empty string is neccessary. If you insist I suggest it should be a global constant.

    last = []

Last suggests, well, last. But this is more of a stack. I think it could have a better name

    while True:
        atom = bcbuffer.read(1)
        if self._isdictionary(atom): last.append(atom)

Mixing between on-the-same-line, and on-the-next lines within a string of if-elifs looks ugly and makes it harder to read. I'm generally not a fan of putting the body on the same line as the if expression. But it really doesn't work when you mix both styles.

        elif self._isnumber(atom): 
            numbuffer += atom
        elif self._iscolon(atom):                              
            tokens.append(bcbuffer.read(int(numbuffer)))
            numbuffer = null 
        elif self._isint(atom): last.append(atom)
        elif self._islist(atom): last.append(atom)

You last.append(atom) a lot. Consider or to combine the conditions

        elif self._isend(atom): 
            char = last.pop()
            if self._isint(char):
                tokens.append(int(numbuffer))
                numbuffer = null
        else: break

You don't distinguish between finding invalid characters and the end of the file. In both case you just quit. I'd recommend throwing an exception if you find an invalid character.

    self.bytestrings = tokens

But the tokens also include numbers. So storing them in bytestrings is kinda odd.

    self.trackerURL = str(self.bytestrings[1],"utf-8")

Your loop follows the common pattern of building data structures over several iterations of the loop and then consuming them. Don't do that. It ends up being hard to follow the logic of the function because I've got to figure out what happens over several iterations of the loop. Instead, each iteration of the loop should handle a logical unit.

Here is how I'd approach the problem:

def read_number(bcbuffer, terminator):
    read_byte = lambda: bcbuffer.read(1)
    bytes_iter = iter(read_byte, terminator)
    text = ''.join(bytes_iter)
    return int(text)


def parse_tokens(self, bcbuffer):
    tokens = []

    while True:
        atom = bcbuffer.read()
        if atom == b'i':
            tokens.append( read_number(bcbuffer, b'e') )
        elif atom >= b'0' and atom <= b'9':
            length = read_number(bcbuffer, b':')
            tokens.append( bcbuffer.read(length) )
        elif atom in b"lde":
            pass # ignore the structure which we don't care about
        elif atom == b'':
            break
        else:
            raise IOError('Did not expect %s' % atom)

   return tokens

Notice that I don't have a collection of variables to store partial results. Because each iteration of the loop handles a single "token" I don't need to store data other then my result across the iterations. It is also much clearer, I think, what is going on in the parsing.

I've not tested this code in anyway, use at own risk. I've ignored the possiblity of invalid bencodes. read_number will end up in an infinite loop if it cannot find the terminator. non-digits before the terminator will cause int to throw a ValueError. Various methods to handle those exist. I'd probably wrap file.read() to throw an exception when trying to read past the end of the file.

Can you elaborate a little more on the “class name” comment at the top? Perhaps I am missing some Pythonic OOP concept. Or just being a dumb dumb. :)
@veryfoolish, Your class's name is BencodeObject. Everything in python is an object. So its kinda useless to point that out in your class name. A better class named might be BencodeExtractor. My point is just that Object is a useless word that means almost nothing. So don't put it in your class names

asked	1 year ago
viewed	294 times
active	1 year ago

Bencode “Parser” in Python 3

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged python or ask your own question.

Bencode “Parser” in Python 3

1 Answer

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python or ask your own question.

Related