I have a text analytics process that I run for work. It mines text from different Twitter accounts and finds patterns in the tweets. The actual machine learning pipeline is much more complicated than what's shown below, but this will have to suffice as my minimal reproducible example. In reality, there are many more places where the script asks for user input, so I don't think things like user input should live in the same script as the functions.
Right now, I run this script on an ad-hoc basis, and a lot of things are hard-coded. I want to package the code I've written to make it more robust and stable.
How should the functions before __main__ and the procedures in __main__ be split up to best turn this code into a package? Should the functions live in a separate module altogether and be imported into another script when needed (something like the runner sketch just below)? Right now, I just run these scripts from the command line and am somewhat lost as to where to go in terms of packaging them.
Text data cleaning script:
import pandas as pd
from nltk.corpus import stopwords
import re
import string
import os
def strip_punct(string, regex):
    return regex.sub(' ', string)
def clean_string(string, regex, stopwords):
    words_to_keep = []
    stripped_string = strip_punct(string, regex)
    string_words = stripped_string.split()
    for word in string_words:
        if word not in stopwords:
            words_to_keep.append(word)
    cleaned_string = " ".join(words_to_keep)
    return cleaned_string
def clean_and_map_column(column, df, regex, stopwords):
    # clean each unique value once, then map the results back onto the column
    text_set = set(df[column])
    text_dict = {row: clean_string(row, regex, stopwords) for row in text_set}
    df[column] = df[column].map(text_dict)
    return df[column]
if __name__ == '__main__':
    os.chdir("/path/to/my/folder")
    # regular expression matching punctuation to be removed from strings
    regexp = re.compile('[%s]' % re.escape(string.punctuation))
    # set of English stopwords from NLTK
    stops = set(stopwords.words('english'))
    # my data frame, containing only columns of text
    df = pd.read_csv("file.csv")
    # clean each column
    for col in df.columns.values:
        df[col] = clean_and_map_column(col, df, regexp, stops)
    df.to_csv("cleaned_file.csv")
KMeans clustering script:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import os
if __name__ == '__main__':
    os.chdir("/path/to/my/folder")
    df = pd.read_csv("cleaned_file.csv")
    # column holding the cleaned tweet text
    corpus_column = 'twitter_data'
    corpus = df[corpus_column]
    # bag-of-words representation of the corpus
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # cluster the tweets and attach each row's cluster assignment
    model = KMeans(n_clusters=5)
    model.fit(X)
    df['cluster_id'] = model.labels_
    df.to_csv("kmeans_file.csv")