Word Square generation in Python

Question

For reasons unknown I've recently taken up generating word squares, or more accurately double word squares. Below you can see my implementation in Python, in about 40 lines. The code uses this word list. It takes around 10 ms for dim=2 (2x2 word squares) and around 13 seconds for dim=3 (3x3 squares). For dim=4 it explodes to something like 3.5 hours in CPython, and causes a MemoryError in PyPy. For dim=20 and dim=21 it returns 0 solutions in a few seconds, but for dim=18 and dim=19 it causes a MemoryError in both PyPy and CPython.

So, I am looking for improvements that will allow me to explore values of dim > 4, and also for ways of avoiding the MemoryErrors for values of dim below 20. Small improvements of a few percent, improvements in how things are expressed, and large improvements in the algorithm are all welcome.

import time
start = time.clock()

dim = 3 #the dimension of our square
posmax = dim**2 #maximum positions on a dim*dim square

words = open("word.list").read().splitlines()
words = set([w for w in words if len(w)==dim])
print 'Words: %s' % len(words)

prefs = {}
for w in words:
    for i in xrange(0,dim):
        prefs[w[:i]] = prefs.get(w[:i], set())
        prefs[w[:i]].add(w[i])

sq, options = ['' for i in xrange(dim)], {}

for i in prefs: 
    for j in prefs:
        options[(i,j)] = [(i+o, j+o) 
            for o in prefs[i] & prefs[j]]

schedule = [(p/dim, p%dim) for p in xrange(posmax)]

def addone(square, isquare, position=0):
    if position == posmax: yield square
    else:
        x,y = schedule[position]
        square2, isquare2 = square[:], isquare[:]

        for o in options[(square[x], isquare[y])]:
            square2[x], isquare2[y] = o
            for s in addone(square2, isquare2, position+1):
                yield s

print sum(1 for s in addone(sq, sq[:]))

print (time.clock() - start)

Would you like to find all word squares or just the first one?
It's currently geared towards finding all squares. Finding just one would be acceptable if finding all weren't feasible.

Paul Crowley · Answer 1 · 2012-06-23 14:59:43Z


Am working on a few minor improvements, but I think the major saving to be had is in removing the line

square2, isquare2 = square[:], isquare[:]

Instead, do

sofar = square[x], isquare[y]
for o in options[sofar]:
    square[x], isquare[y] = o
    for s in addone(square, isquare, position+1):
        yield s
square[x], isquare[y] = sofar
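
To see the in-place variant in context, here is a minimal, self-contained sketch. It is written in Python 3 syntax (an assumption, since the question uses Python 2), and the helper names `build_options` and `solve`, as well as the toy word list, are made up for illustration. The one extra detail is that a finished square is copied before it is yielded, because the shared lists keep changing afterwards.

```python
def build_options(words, dim):
    """Map (row_prefix, column_prefix) to the extended prefix pairs."""
    prefs = {}
    for word in words:
        for i in range(dim):
            prefs.setdefault(word[:i], set()).add(word[i])
    return {
        (i, j): [(i + o, j + o) for o in sorted(prefs[i] & prefs[j])]
        for i in prefs
        for j in prefs
    }


def addone(square, isquare, options, dim, position=0):
    """Yield every completed square, mutating the two lists in place."""
    if position == dim * dim:
        yield list(square)  # copy: the shared list is modified later
        return
    x, y = divmod(position, dim)
    sofar = square[x], isquare[y]
    for o in options[sofar]:
        square[x], isquare[y] = o
        yield from addone(square, isquare, options, dim, position + 1)
    square[x], isquare[y] = sofar  # undo, so the caller sees its own state


def solve(words, dim):
    """Hypothetical wrapper: filter the words by length and collect all squares."""
    words = {w for w in words if len(w) == dim}
    options = build_options(words, dim)
    return list(addone([""] * dim, [""] * dim, options, dim))


squares = solve(["ab", "cd", "ac", "bd"], 2)
print(len(squares))
```

If the caller only counts the results, as the question does, the copy in the base case could be dropped again.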

edited Jun 23 '12 at 14:59

answered Jun 23 '12 at 14:52


great idea Paul! Gave a nice ~10% boost. – Alexandros Marinos Jun 23 '12 at 18:33

Winston Ewert · Answer 2 · 2012-06-23 16:53:31Z

import time
start = time.clock()

dim = 3 #the dimension of our square
posmax = dim**2 #maximum positions on a dim*dim square

Python convention is to have constants be in ALL_CAPS

words = open("word.list").read().splitlines()

Actually, a file iterates over its lines so you can do words = list(open("words.list"))

words = set([w for w in words if len(w)==dim])

I'd make it a generator rather than a list and combine the previous two lines

print 'Words: %s' % len(words)

It is generally preferred to do any actual logic inside a function. It's a bit faster and cleaner.

prefs = {}
for w in words:

I'd suggest spelling out word

    for i in xrange(0,dim):
        prefs[w[:i]] = prefs.get(w[:i], set())
        prefs[w[:i]].add(w[i])

Actually, you can do prefs.setdefault(w[:i],set()).add(w[i]) for the same effect.
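
As a quick check that the two spellings build the same table, here is a small sketch in Python 3 syntax (an assumption, the question uses Python 2). The function names and the sample words are hypothetical.

```python
def build_prefs_get(words):
    """Original form: fetch the set (or a new one), store it, then add."""
    prefs = {}
    for word in words:
        for i in range(len(word)):
            prefs[word[:i]] = prefs.get(word[:i], set())
            prefs[word[:i]].add(word[i])
    return prefs


def build_prefs_setdefault(words):
    """Suggested form: setdefault returns the stored set in one step."""
    prefs = {}
    for word in words:
        for i in range(len(word)):
            prefs.setdefault(word[:i], set()).add(word[i])
    return prefs


sample = ["cat", "car", "dog"]
assert build_prefs_get(sample) == build_prefs_setdefault(sample)
```

A `collections.defaultdict(set)` would be another way to get the same effect, at the cost of the table silently growing on lookups of unknown prefixes.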

sq, options = ['' for i in xrange(dim)], {}

You can do sq, options = [''] * dim, {} for the same effect

for i in prefs:
    for j in prefs:
        options[(i,j)] = [(i+o, j+o)
            for o in prefs[i] & prefs[j]]

schedule = [(p/dim, p%dim) for p in xrange(posmax)]

This can be written as schedule = map(divmod, xrange(posmax))

def addone(square, isquare, position=0):
    #for r in square: print r #prints all square-states

Don't leave dead code as comments, kill it!

    if position == posmax: yield square

I'd put the yield on the next line, I think it's easier to read especially if you have an else condition

    else:
        x,y = schedule[position]
        square2, isquare2 = square[:], isquare[:]

In the one line you don't have a space after the comma, in the next line you do. I suggest always including the space.

        for o in options[(square[x], isquare[y])]:
            square2[x], isquare2[y] = o
            for s in addone(square2, isquare2, position+1):
                yield s

print sum(1 for s in addone(sq, sq[:]))

print (time.clock() - start)

Minor corrections: schedule = map(divmod, xrange(POS_MAX), [DIM]*POS_MAX). Also, list(open("words.list")) included the \n at the end of each word, which needed some working around. In the end I did it in 65 chars precisely: words = set(w[:-1] for w in open("words.list") if len(w)==DIM+1)
@AlexandrosMarinos, OK, my bad. I'd use [divmod(x, DIM) for x in xrange(POS_MAX)]. I'd also use w.strip() instead of w[:-1], but that's only IMO.
w.strip() is cleaner, but it'd take me over the 65 char limit.. choices, choices :)
@AlexandrosMarinos, for what it's worth: the official Python style guide recommends a limit of 79 characters.
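
For anyone following this thread, the variants discussed here can be compared side by side. This is a sketch in Python 3 syntax (an assumption, the thread uses Python 2, where `p/dim` is integer division and `xrange` exists). `load_words` is a hypothetical helper, and the `io.StringIO` object only stands in for the real word file.

```python
import io

DIM = 3
POS_MAX = DIM ** 2

# Three spellings of the row-major schedule; note // for integer division.
by_division = [(p // DIM, p % DIM) for p in range(POS_MAX)]
by_map = list(map(divmod, range(POS_MAX), [DIM] * POS_MAX))
by_comprehension = [divmod(p, DIM) for p in range(POS_MAX)]
assert by_division == by_map == by_comprehension


def load_words(lines, dim):
    """Keep words of length dim, stripping the trailing newline first."""
    return {w.strip() for w in lines if len(w.strip()) == dim}


# Stand-in for open("word.list"): iterating a file also yields lines with "\n".
fake_file = io.StringIO("cat\ndog\nhorse\nox\n")
words = load_words(fake_file, DIM)
```

Stripping before measuring the length avoids the `DIM+1` adjustment, and it also copes with a last line that has no trailing newline.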

Paul Crowley · Answer 3 · 2012-06-23 18:41:39Z


I tried a different schedule; it made only a very small difference to the time, though!

schedule = [(b-y,y)
    for b in range(DIM*2)
    for y in range(min(b+1, DIM))
    if b-y < DIM]

assert len(schedule) == POSMAX
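
The schedule above walks the square along its anti-diagonals. Because the solver extends row and column prefixes one letter at a time, any schedule presumably has to visit a cell only after the cell to its left and the cell above it. The sketch below (Python 3 syntax, with the hypothetical helper names `diagonal_schedule` and `respects_prefix_order`) checks that property.

```python
def diagonal_schedule(dim):
    """Visit cells along anti-diagonals, where b = x + y is the diagonal index."""
    return [(b - y, y)
            for b in range(dim * 2)
            for y in range(min(b + 1, dim))
            if b - y < dim]


def respects_prefix_order(schedule):
    """True if each cell comes after its left and upper neighbours."""
    seen = set()
    for x, y in schedule:
        if x > 0 and (x - 1, y) not in seen:
            return False
        if y > 0 and (x, y - 1) not in seen:
            return False
        seen.add((x, y))
    return True


schedule = diagonal_schedule(3)
assert len(schedule) == 3 ** 2
assert respects_prefix_order(schedule)
```

A check like this could be used to filter candidate schedules before timing them, since an ordering that fails it would look up prefixes of the wrong length.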

answered Jun 23 '12 at 18:41


schedules are so fascinating.. I've tried a few, but the timing remains roughly the same. I do have a suspicion that a good schedule could make a difference, but I also fear that they make no difference. I was thinking of trying to generate random schedules then time them and see if anything strange comes up, or even do some genetic algos to discover interesting schedules if the random stuff points to significant gains.. – Alexandros Marinos Jun 23 '12 at 19:54

so bizarre.. schedule = [tuple(reversed(divmod(x, DIM))) for x in xrange(POS_MAX)] (effectively the basic schedule but with x and y reversed) seems to be the best performing schedule on a 3x3 out of 24 possibilities. It is a slight improvement on the order of 100ms but for the life of me can't figure out why. – Alexandros Marinos Jun 24 '12 at 11:38

Paul Crowley · Answer 4 · 2012-06-23 20:21:44Z

We fill column by column with the current schedule. I tried adding a check whether we're going to be able to put anything in the next row before filling the rest of the column, but it results in a slight slowdown.

if x+1 < DIM and len(FOLLOWSBOTH[(square[x+1], transpose[y])]) == 0:
    continue

I am still tempted to think that something like this but more thorough could still save time: checking not just this one spot, but everything remaining in the row and the column, to ensure that there are compatible letters. That needs a new, more complex data structure though!

maybe a 'deadends' dictionary, so we can check (square[x+1], transpose[y]) in deadends would speed this up? off to try it I am.
wait. I think this wouldn't offer anything with the normal schedule, since square[x+1] is always '', and therefore it comes down to options[transpose[y]], which is always something, if you've gotten that far. Have I missed something?
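
The 'deadends' idea from the comment above could be sketched as a set of prefix pairs that have no shared next letter, computed once from the options table. This is only an illustration in Python 3 syntax: `find_deadends` and the toy table are hypothetical, and whether the lookup saves any time is exactly the open question in this thread.

```python
def find_deadends(options):
    """Collect the (row_prefix, column_prefix) pairs with no possible extension."""
    return {pair for pair, extensions in options.items() if not extensions}


# Toy options table in the same shape as the one built in the question.
toy_options = {
    ("a", "b"): [("ax", "bx")],
    ("a", "c"): [],
    ("", ""): [("a", "a"), ("b", "b")],
}
deadends = find_deadends(toy_options)
assert ("a", "c") in deadends
```

Membership in a set is a constant-time test on average, so `pair in deadends` replaces the `len(...) == 0` check without building or measuring the list of extensions.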
