csvshuf: a tool to shuffle CSV columns written in Python

Question

Edit: After reading Gareth's answer I have pushed an updated version of the code to github.

I needed to shuffle cells from specific columns in several CSV files. One of the requirements was to be able to do derangement (shuffling that leaves no element in its original place), I read about Sattolo's algorithm.

So I started writing csvshuf this afternoon. Code was originally based in csvcut. I tried to make it good but I am sure it can be improved as I have little experience programming in Python. I would be grateful if I get some code review.

Code:

import csv
import sys
import random
import getopt


# From http://programmers.stackexchange.com/q/218255/149749
def shuffle_kfy(items):
    i = len(items) - 1
    while i > 0:
        j = random.randrange(i + 1)  # 0 <= j <= i
        items[j], items[i] = items[i], items[j]
        i = i - 1
    return items

def shuffle_sattolo(items):
    i = len(items)
    while i > 1:
        i = i - 1
        j = random.randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return items

def shuffle(items, mode):
    if mode == 'kfy':
        return shuffle_kfy(items)
    if mode == 'sattolo':
        return shuffle_sattolo(items)

    random.shuffle(items)
    return items


opts, args = getopt.getopt(sys.argv[1:], "c:C:d:o:q:tks", [])
if args:
    i = open(args[0], 'U')
else:
    i = sys.stdin

delimiter = ','
output_delimiter = ','
cols = None
no_cols = None
quotechar = None
search_mode = ''

if opts:
    opts = dict(opts)
    if '-c' in opts:
        cols = map(int, opts['-c'].split(','))
    elif '-C' in opts:
        no_cols = map(int, opts['-C'].split(','))
    if '-k' in opts:
        search_mode = 'kfy'
    elif '-s' in opts:
        search_mode = 'sattolo'
    if '-t' in opts:
        delimiter = "\t"
    elif '-d' in opts:
        delimiter = opts['-d']
    if '-o' in opts:
        output_delimiter = opts['-o']
    if '-q' in opts:
        quotechar = opts['-q']

if cols and 0 in cols or no_cols and 0 in no_cols:
    print("Invalid column 0. Columns are 1-based")
    exit(1)

reader = csv.reader(i, delimiter=delimiter, quotechar=quotechar)
headers = next(reader)

table = []
for c in range(len(headers)):
    table.append([])

for row in reader:
    for c in range(len(headers)):
        table[c].append(row[c])

if not cols and not no_cols:
    cols = range(len(headers))
elif no_cols:
    cols = list(set(range(len(headers))) - set(no_cols))

for c in cols:
    if c > len(headers):
        print('Invalid column {}. Last column is {}').format(c, len(headers))
        exit(1)
    table[c - 1] = shuffle(table[c - 1], search_mode)

table = zip(*table)

writer = csv.writer(sys.stdout, delimiter=output_delimiter)
writer.writerow(headers)
for row in table:
    writer.writerow(row)

Usage:

csvshuf -c1 foobar.csv
(shuffles the first column of each row of foobar.csv using Python's shuffle())

svshuf -c2 -k foobar.csv
(shuffles the second column of each row using Knuth-Fischer-Yeats algorithm.)

svshuf -c3 -s foobar.csv
(shuffles the third column of each row using Sattolo's algorithm.)

csvshuf foobar.csv
(shuffles all the columns of foobar.csv)

csvshuf -C1 foobar.csv
(shuffles all the columns but the first of foobar.csv)

head -10 foobar.csv | csvshuf -c 1,3
(shuffles the first and third columns of the first ten lines of foobar.csv)

csvshuf -c1,3 -d "|" foobar.csv
(shuffles the first and third columns of the pipe-delimited foobar.csv)

csvshuf -c 1,3 -t foobar.csv
(shuffles the first and third columns of the tab-delimited foobar.csv if present, the -d option will be ignored.)

csvshuf -c 1,2,3 -d "|" -o , foobar.csv
(shuffles the first three columns of the pipe-delimited foobar.csv; output will be comma-delimited.)

csvshuf -c 1,2,3 -o "|" foobar.csv
(shuffles the first three columns of the comma-delimited foobar.csv; output will be pipe-delimited.)

csvshuf -c 1,2 -d "," -q "|" foobar.csv
(shuffles the first two columns of the comma-delimited, pipe-quoted foobar.csv.)

Gareth Rees · Accepted Answer · 2016-06-01 13:00:14Z

There are no docstrings. It's hard to review code when we don't know what it is supposed to do.
The various shuffle functions both modify their items argument and return it. It would be clearer if these functions either modified their argument without returning it (like random.shuffle) or returned a new list without modifying the original (like random.sample). Doing both is redundant and confusing.

The shuffle functions can be simplified using reversed and range:

def shuffle_sattolo(items):
    """Shuffle items in place using Sattolo's algorithm."""
    _randrange = random.randrange
    for i in reversed(range(1, len(items))):
        j = _randrange(i)  # 0 <= j < i
        items[j], items[i] = items[i], items[j]

Note that I've cached random.randrange in a local variable. This is to avoid having to look it up again each time around the loop.

The shuffle_kfy function is redundant because random.shuffle already implements the Fisher–Yates shuffle.
Using a string argument to select a mode of operation is error-prone. If you mistyped the string, for example shuffle(items, 'satolo'), then you would not get an error, it would just silently do the wrong thing.

The shuffle function is redundant. Instead of assigning a search mode while parsing the arguments, assign a shuffle function:

shuffle = random.shuffle

# ...

if '-k' in opts:
    shuffle = shuffle_kfy
elif '-s' in opts:
    shuffle = shuffle_sattolo

Having code at the top level of a module makes it hard to test — for example, you can't test the shuffle function by itself, because you can't import the module without running all the top-level code. It is a good idea to put top-level code into a main function and guard it with if __name__ == '__main__':.
The code processes the command-line options by transforming them into a dictionary and then looking up each valid argument. But this means that there are no error messages for invalid arguments. If you look at the getopt examples in the manual, you'll see that they do:
```
for o, a in opts:
    if o == '-v':
        # process option
    else:
        # report error
```
The name i is normally used for index variables, so it's misleading to use it for an input file.
Error messages should go to standard error, not standard output:
```
sys.stderr.write("Invalid column 0. Columns are 1-based.\n")
```
This is particularly important here, because this program writes its results to standard output, and so probably the user is redirecting standard output to a file. If you write error messages to standard output, they will go to the file and the user will not see them.

It's a better idea to use the argparse module for parsing command-line arguments. With argparse the code is more explicit, it automatically emits error messages for invalid arguments, it has various types of built-in argument validation and conversion, and it has built-in support for the --help option. Something like this:

def column_list(string):
    """Validate and convert comma-separated list of column numbers."""
    try:
        columns = list(map(int, string.split(',')))
    except ValueError as e:
        raise argparse.ArgumentTypeError(*e.args)
    for column in columns:
        if column < 1:
            raise argparse.ArgumentTypeError(
                "Invalid column {!r}: column numbers start at 1."
                .format(column))
    return columns

def main():
    parser = argparse.ArgumentParser(
        description="Shuffle columns in a CSV file")
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default=sys.stdin, help='Input CSV file')
    parser.add_argument('-s', '--sattolo',
                        action='store_const', const=shuffle_sattolo,
                        dest='shuffle', default=random.shuffle,
                        help="Use Sattolo's shuffle.")
    col_group = parser.add_mutually_exclusive_group()
    col_group.add_argument('-c', '--columns', type=column_list,
                           help="Comma-separated list of columns to include.")
    col_group.add_argument('-C', '--no-columns', type=column_list,
                           help="Comma-separated list of columns to exclude.")
    delim_group = parser.add_mutually_exclusive_group()
    delim_group.add_argument('-d', '--delimiter', type=str, default=',',
                             help="Input column delimiter.")
    delim_group.add_argument('-t', '--tabbed', dest='delimiter',
                             action='store_const', const='\t',
                             help="Delimit input with tabs.")
    parser.add_argument('-q', '--quotechar', type=str, default='"',
                        help="Quote character.")
    parser.add_argument('-o', '--output-delimiter', type=str, default=',',
                        help="Output column delimiter.")
    args = parser.parse_args()

    reader = csv.reader(args.infile, delimiter=args.delimiter,
                        quotechar=args.quotechar)

    # ... and so on ...

Many thanks! I've done most of the changes and uploaded the code in github.com/pereorga/csvshuf/blob/master/csvshuf — Pere, Jun 1 at 20:00

asked	3 months ago
viewed	129 times
active	3 months ago

current community

your communities

more stack exchange communities

csvshuf: a tool to shuffle CSV columns written in Python

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python random csv shuffle or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

csvshuf: a tool to shuffle CSV columns written in Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python random csv shuffle or ask your own question.

Related

Hot Network Questions