6
\$\begingroup\$

I've tried this problem on Codeforces. Given a set of string patterns, all of which are the same length, and which consist of literal lowercase ASCII characters and ? wildcards, find a pattern that satisfies them all.

I want to optimize and simplify it further and make it a more elegant solution, if possible.

import sys

def main():
    n = int(sys.stdin.readline())
    t = sys.stdin.readlines()
    l = len(t[0])
    result = ''
    if n==1:
        result = t[0].replace('?', 'x')
    else:
        for y in range(0,l-1):
            let = t[0][y]
            for x in range(1,n):
                if let == '?' and t[x][y]!= '?':
                    let = t[x][y]
                if t[x][y] != let and t[x][y] != '?':
                    result += '?'
                    break
                elif x == n-1:
                    if let == '?':
                        result += 'x'
                    else:
                        result += let
    print result

main()
\$\endgroup\$

2 Answers 2

6
+50
\$\begingroup\$

(This answer is in Python 3. Feel free to make the necessary adjustments to run it on 2.x)

Introduction

The review posted by janos underscores the need for good names, following coding conventions. The post suggests incremental improvements, which is a step in the right direction.

To make more radical improvements to your code, you need to recognize the deeper structural problems, which arise because you're not using the full flexibility of Python and because you aren't using the right data structure to simplify your algorithm.

Handling the input

The only responsibility of main()should be to collect and sanitize the input and subsequently output the result:

def main():
    pattern_count = int(sys.stdin.readline())
    patterns = itertools.islice(sys.stdin, pattern_count)
    result = intersect_patterns(p.strip() for p in patterns)
    print(result)

The calculation should be kept separate, in the intersect_patterns function.

Join instead of +=

It would be more elegant to separate the concatenation of the resulting string from the calculation of its contents. You can achieve this by using Python's yield keyword to create a generator whose elements can be joined like so:

def intersect_patterns(lines):
    return ''.join(_intersect_patterns(lines))

Iterating the right way

You are making your algorithm a lot more complex by iterating over the input in the traditional line-by-line fashion when you are in fact interested in examining one column at a time.

The solution is to think of the lines as rows and the characters as columns in a matrix. To iterate over the columns instead of the rows, transpose it using the built-in zip function with the * operator, as shown in this answer on StackOverflow.

def _intersect_patterns(lines, wildcard='?', fill='x'):
    for column in zip(*lines):
        literals = {char for char in column if char != wildcard}
        if not literals:
            yield fill
        elif len(literals) == 1:
            yield literals.pop()
        else:
            yield wildcard

The right data structure for the job

Wow, where did all the code go? It turns out that there is a data structure which can do most of the work for you: the set (which we create using a set comprehension, or {...}), because for each column, we only need to examine the unique literals, disregarding any wildcards, to determine what to put in the intersecting pattern we are calculating.

There are only three possible cases

The column we are examining contains...

  1. No literals → only wildcards, so we need to insert a literal (for instance x) into the output.
  2. Exactly one unique literal, so we need to insert that literal into the output.
  3. More than one unique literal, so we need to insert a wildcard into the output.

We simply yield the correct character on every iteration. Our caller can then take care of assembling the result string and printing it.

Conclusion

  • Before racing to implement a solution, think about the algorithms and data structures that might help you. For example, iterate over the data in a way that makes sense and use what the standard library has to offer.

  • Analyze the possible scenarios by writing them down before coding them, which will help you discover the simplest solution.

  • Separate your concerns.

  • If you need to write if statements for special cases, you might be doing it wrong.

\$\endgroup\$
2
  • 2
    \$\begingroup\$ It's elegant and beautiful, great post! Just one drawback compared to the original ugly one is the way you build literals is wasteful, because you iterate over all characters in the column when it would be enough until you find two different. That's of course easy to fix, and it will still be elegant. \$\endgroup\$
    – janos
    Commented Apr 21, 2014 at 6:59
  • 1
    \$\begingroup\$ Very good @Adam I wanted to award you a bounty for this answer. \$\endgroup\$
    – Caridorc
    Commented Sep 22, 2017 at 18:02
7
\$\begingroup\$

Use more meaningful variable names, for example:

  • lines instead of t
  • length instead of l
  • Since y is a character position, I would rename to pos
  • let to letter, because it's not much longer and a bit more readable (since let is keyword in some other programming languages, I find it confusing to read)

Since lines[0] is kind of special, and you make many references to it, it would be good to give it a name:

first = lines[0]

Your input lines end with a trailing newline character, but in the main logic of pattern matching this is just noise, not an important detail. I recommend this:

first = lines[0].strip()

This will get rid of the trailing newline character, and in the main logic you don't have to worry about checking letters until first[-1], you can consider the letters in first, which is simpler.


If you make changes to the script, you have to repeat all the tests to see if it's still working. That's very troublesome. To make that easier, add some unit tests, and refactor the main method to make it more testable, for example:

import unittest
from test.test_support import run_unittest

def main():
    n = int(sys.stdin.readline())
    lines = sys.stdin.readlines()[0:n]
    print find_common_pattern(lines)

def find_common_pattern(lines):
    # the rest of the code from the old version of `main`

class FindPatterns(unittest.TestCase):
    def test_given_examples(self):
        self.assertEqual('xab', find_common_pattern(['?ab\n', '??b\n']))
        self.assertEqual('?', find_common_pattern(['a\n', 'b\n']))
        self.assertEqual('xaxb', find_common_pattern(['?a?b\n']))

run_unittest(FindPatterns)

After this, it will be easy to follow my suggestions below and verify that the script is still working correctly, passing all test cases. (Feel free to add more!)


Instead of range(0, length-1), you can write simply range(length-1).


if let == '?' and t[x][y]!= '?':
    let = t[x][y]
if t[x][y] != let and t[x][y] != '?':
    result += '?'

If the first if was true, the second will not be true, and it's a waste to evaluate it again. You could make the second if an elif instead.


The last elif in your for loop is a bit dirty: it's for checking if you're on the last line. The condition will be evaluated for every line where the current position is ?, which is a bit wasteful. It would be better to refactor that loop to perform that check only once at the end, for example:

for pos in range(length):
    letter = first[pos]
    newletter = letter
    for x in range(1, len(lines)):
        if letter == '?' and lines[x][pos] != '?':
            letter = lines[x][pos]
        elif lines[x][pos] != letter and lines[x][pos] != '?':
            newletter = '?'
            break
    if letter == '?':
        result += 'x'
    else:
        result += newletter

After this refactoring, you don't need the value of x inside the loop anymore, so you could rewrite more intuitively as:

for line in lines[1:]:
    if letter == '?' and line[pos] != '?':
        letter = line[pos]
    elif line[pos] != letter and line[pos] != '?':
        newletter = '?'
        break

Since the characters ? and x appear several times in the code, I would treat them as constants and name them for example:

  • ANYCHAR = '?'
  • SAMPLECHAR = 'x'

Instead of running main, the common way is like this:

if __name__ == '__main__':
    main()

This way, your script will still run as before when you do python script.py, and at the same time it will be possible to import it from another python script as a library without automatically running it. It's a good practice to get used to, even if you don't intend to make this a library.


Putting it all together, and some other minor improvements, this would be more Pythonic, and slightly more readable and efficient:

ANYCHAR = '?'
SAMPLECHAR = 'x'

def find_common_pattern(lines):
    first = lines[0].strip()
    lines = lines[1:]
    if not lines:
        return first.replace(ANYCHAR, SAMPLECHAR)
    result = ''
    for pos in range(len(first)):
        letter = first[pos]
        newletter = letter
        for line in lines:
            if line[pos] != ANYCHAR:
                if letter == ANYCHAR:
                    letter = line[pos]
                elif line[pos] != letter:
                    newletter = ANYCHAR
                    break
        if letter == ANYCHAR:
            result += SAMPLECHAR
        else:
            result += newletter
    return result
\$\endgroup\$
2
  • \$\begingroup\$ Thanks very much for such an informative answer. Just one question; I'm a bit confused on how "if not lines:" works. Could you expand please? \$\endgroup\$
    – dgun
    Commented Apr 19, 2014 at 18:34
  • \$\begingroup\$ Sure! An empty list is false in Python, that's why it works. This technique is actually recommended by PEP8 \$\endgroup\$
    – janos
    Commented Apr 19, 2014 at 18:36

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.