I have a text analytics process that I run for work. It mines text from different Twitter accounts and finds patterns in the tweets. The actual machine learning pipeline is much more complicated than what's shown below, but this will have to suffice as my minimal reproducible example. In reality, there are many more places where the script asks for user input, so I don't think things like user input should live in the same script as the functions.
Right now, I run this script on an ad-hoc basis, and a lot of things are hard-coded. I want to package the code I've written to make it more robust and stable.
How should the functions before __main__ and the procedures in __main__ be split up to best turn this code into a package? Should the functions live in a separate module altogether and be imported into another script when needed (something like the runner sketch just below)? Right now, I just run these scripts from the command line and am somewhat lost as to where to go in terms of packaging them.
Text data cleaning script:
import pandas as pd
from nltk.corpus import stopwords
import re
import string
import os
def strip_punct(string, regex):
    return regex.sub(' ', string)
def clean_string(string, regex, stopwords):
    words_to_keep = []
    stripped_string = strip_punct(string, regex)
    string_words = stripped_string.split()
    for word in string_words:
        if word not in stopwords:
            words_to_keep.append(word)
    cleaned_string = " ".join(words_to_keep)
    return cleaned_string
def clean_and_map_column(column, df, regex, stopwords):
    # clean each unique value once, then map the results back onto the column
    text_set = set(df[column])
    text_dict = {row: clean_string(row, regex, stopwords) for row in text_set}
    df[column] = df[column].map(text_dict)
    return df[column]
if __name__ == '__main__':
    os.chdir("/path/to/my/folder")
    # regular expression matching punctuation to be removed from strings
    regexp = re.compile('[%s]' % re.escape(string.punctuation))
    # set of English stopwords from NLTK
    stops = set(stopwords.words('english'))
    # my data frame, containing only columns of text
    df = pd.read_csv("file.csv")
    # clean each column
    for col in df.columns.values:
        df[col] = clean_and_map_column(col, df, regexp, stops)
    df.to_csv("cleaned_file.csv")
KMeans clustering script:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import os
if __name__ == '__main__':
    os.chdir("/path/to/my/folder")
    df = pd.read_csv("cleaned_file.csv")
    # column holding the cleaned tweet text
    corpus_column = 'twitter_data'
    corpus = df[corpus_column]
    # bag-of-words representation of the corpus
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # cluster the tweets and attach each row's cluster assignment
    model = KMeans(n_clusters=5)
    model.fit(X)
    df['cluster_id'] = model.labels_
    df.to_csv("kmeans_file.csv")