
NaiveBayesClassifier

Made by Chris Kormaris

Programming Language: Python

Unzip the compressed file "LingspamDataset.zip" in the same directory where the Python files are. 700 train documents and 260 test documents will reside inside the uncompressed folder.

Feature Selection with Information Gain

Let's begin by denoting the variable C, which takes the values C=1 for spam documents and C=0 for ham documents. First, run the Python file "FeatureSelectionUsingIG.py" to generate the feature tokens that we'll use. Run:

python FeatureSelectionUsingIG.py

Feature selection for the most useful feature tokens, using Information Gain (IG), has been implemented. The feature tokens are boolean: a token takes the value 0 if it does not appear in a text and 1 if it does. These boolean values are assigned while generating the feature vector of each text. At the start of the program, all the train files of the corpus are parsed and we count in how many spam and ham documents each word appears. The results are saved in dictionary data structures named "feature_spam_frequency", "feature_ham_frequency" and "feature_frequency" respectively, with feature tokens as keys and frequencies as values. These dictionaries are used to calculate the probabilities needed by the Information Gain algorithm. We calculate the entropy H(C) and print it. We then add each candidate feature token to the "IG" dictionary, while calculating its IG score. Finally, we select the top m feature tokens, in terms of the highest Information Gain (IG) scores. The Information Gain score of each feature token is calculated using the formula:

IG(C, X) = H(C) − Σ_{i ∈ {0,1}} P(X = i) · H(C | X = i)

where i=0 and i=1 are the boolean values that a feature token may take, indicating whether it appears in a text. Concretely, the feature X that reduces the entropy the most (i.e. has the highest IG score) is the most desired candidate feature, because it can discriminate the category of a document more efficiently. The number of feature tokens that the feature selection algorithm returns is set to 1000. The number of feature tokens to use depends on the classification task we want to execute. For our Naive-Bayes spam-ham classifier, 1000 feature tokens is a good choice.
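The IG computation described above can be sketched as follows. This is a minimal illustration with hypothetical counts; the function names and numbers are ours, not the repository's actual code:

```python
import math

def entropy(p):
    """Binary entropy H(p) in bits; 0 by convention when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(n_spam, n_ham, spam_freq, ham_freq):
    """IG(C, X) = H(C) - sum_i P(X=i) * H(C | X=i) for a boolean token X.

    spam_freq / ham_freq: number of spam / ham train documents containing the token,
    as counted in the "feature_spam_frequency" / "feature_ham_frequency" dictionaries.
    """
    n = n_spam + n_ham
    h_c = entropy(n_spam / n)                  # prior entropy H(C)
    with_token = spam_freq + ham_freq
    p_x1 = with_token / n                      # P(X = 1)
    # P(C = 1 | X = 1) and P(C = 1 | X = 0)
    p_c1_x1 = spam_freq / with_token if with_token else 0.0
    p_c1_x0 = (n_spam - spam_freq) / (n - with_token) if n > with_token else 0.0
    return h_c - p_x1 * entropy(p_c1_x1) - (1 - p_x1) * entropy(p_c1_x0)

# hypothetical: 700 train docs (240 spam, 460 ham); a token in 60 spam / 5 ham docs
ig = information_gain(n_spam=240, n_ham=460, spam_freq=60, ham_freq=5)

# select the top m tokens by IG score (hypothetical scores)
IG = {"free": 0.12, "money": 0.08, "hello": 0.01}
m = 2
top_features = sorted(IG, key=IG.get, reverse=True)[:m]
```

A token that appears almost exclusively in one class yields a large reduction in entropy and therefore a high IG score, which is why it gets selected.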

In addition, inside of block comments, there is an alternative way for calculating the Information Gain score, by using the following formula:

score(X) = | P(X = 1 | C = 1) − P(X = 1 | C = 0) |

which is the absolute difference between the conditional probabilities of the token appearing in each of the 2 classes (spam or ham). The tokens for which this absolute difference is bigger are selected as the feature tokens of the feature vectors, based on which we will classify our corpus files. Using the feature tokens of this formula, slightly worse accuracy has been observed. At the end of the program "FeatureSelectionUsingIG.py", two dictionaries containing the feature tokens that will be used by the Naive-Bayes classifier are created and saved in the files "spam_feature_dictionary.txt" and "ham_feature_dictionary.txt".
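The alternative score can be sketched in a few lines. The token names and probabilities below are hypothetical, chosen only to illustrate why the score favors discriminative tokens:

```python
def alt_score(p_x1_spam, p_x1_ham):
    """Alternative selection score: |P(X=1 | C=1) - P(X=1 | C=0)|."""
    return abs(p_x1_spam - p_x1_ham)

# hypothetical per-class conditional probabilities for two tokens
scores = {
    "viagra": alt_score(0.40, 0.01),  # appears mostly in spam: highly discriminative
    "the":    alt_score(0.95, 0.97),  # appears everywhere: near-zero score
}
best = max(scores, key=scores.get)
```

Common words score near zero because they appear with similar frequency in both classes, so they carry little information about the category.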

Perform Naive-Bayes Classification

After the feature selection step, run "NaiveBayesClassifier.py" to start the classification process. Run:

python NaiveBayesClassifier.py

First, the classifier counts the frequency of each feature token in the spam documents, and then in the ham documents. We use boolean features. The total frequency of each spam and ham feature token is saved in two dictionaries: "token_frequencies_in_spam_class" and "token_frequencies_in_ham_class". We also count how many words the documents of the spam class contain and how many words the ham documents contain. We use the acquired frequencies to estimate the Laplace probabilities that will determine the correct category of the test data. The classifier calculates the conditional probability of each test document's feature vector for each of the 2 categories (spam or ham) and assigns the document to the class for which the conditional probability is higher, i.e. for the feature vector Xi of the test document i we calculate the probabilities p(Xi|C=1) and p(Xi|C=0) (reminder: C=1 for spam, C=0 for ham). The class with the higher probability will be the predicted class of the test document. For the calculation of the probabilities we also use Laplace smoothing, adding 1 to the numerator and |V| to the denominator (where |V| is the number of distinct words in the corpus, i.e. the vocabulary size of the corpus). This is the formula used for calculating the probability of the feature token i, belonging to a spam document:

P(X_i = 1 | C = 1) = (count(X_i, spam) + 1) / (N_spam + |V|)

where count(X_i, spam) is the frequency of token i in the spam documents and N_spam is the total number of words in the spam documents.

To calculate the probability of the entire feature vector belonging to the spam class we multiply the probability of each separate feature token belonging to the spam class. The exact formula is:

P(C = 1 | X) = P(C = 1) · Π_i P(X_i | C = 1) / P(X)

We do the same for the ham class. We can omit the term P(featureVector), since it is the same for both classes. The class with the bigger of the 2 probabilities is the category to which the given test document and its feature vector most likely belong. Also, for numerical stability it is a good idea to work in log space, summing the logarithms of the token probabilities rather than multiplying many small probabilities, which could underflow.
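The classification step above can be sketched as follows. The counts, vocabulary size and function names here are hypothetical placeholders, not the repository's code; the structure mirrors the Laplace-smoothed, log-space computation just described:

```python
import math

def log_class_score(doc_tokens, token_counts, total_words, vocab_size, log_prior):
    """log P(C) + sum_i log P(X_i | C), with Laplace (add-one) smoothing.

    doc_tokens: feature tokens present in the test document (boolean features set to 1)
    token_counts: frequency of each feature token in the class's train documents
    total_words: total number of words in the class's train documents
    """
    score = log_prior
    for token in doc_tokens:
        # add 1 to the numerator and |V| to the denominator (Laplace smoothing)
        p = (token_counts.get(token, 0) + 1) / (total_words + vocab_size)
        score += math.log(p)
    return score

# hypothetical frequencies from the training step
spam_counts = {"free": 30, "money": 20}
ham_counts = {"free": 2, "money": 1}
V = 1000  # vocabulary size |V|

doc = ["free", "money"]
spam_score = log_class_score(doc, spam_counts, total_words=500,
                             vocab_size=V, log_prior=math.log(240 / 700))
ham_score = log_class_score(doc, ham_counts, total_words=900,
                            vocab_size=V, log_prior=math.log(460 / 700))
prediction = "spam" if spam_score > ham_score else "ham"
```

Because the log is monotonic, comparing the summed log scores picks the same class as comparing the original products of probabilities, without the risk of underflow.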

The classifier delivered an accuracy of 96.92%, both when using 1000 feature tokens and when using 100 feature tokens.

Notes

  • Experiments show that the Naive-Bayes classifier achieves higher accuracy on a small corpus than on a big corpus.
  • Console execution results are included in the "console_outputs" folder.
