Python implementation of an N-gram language model with Laplace smoothing and sentence generation.
Updated Feb 9, 2018 - Python
Right now the tokenize() function splits whenever a '.' character is found. Most of the time this is a correct way to split a file into sentences, but sometimes an abbreviation like Dr., Mr., Mrs., etc. appears in the middle of a sentence, and the split happens right there. I want to enhance the regex so that it does not split sentences on abbreviations.
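One possible approach (a minimal sketch, not the project's actual `tokenize()` implementation) is to guard the split pattern with negative lookbehind assertions for a hand-picked list of abbreviations. Python's `re` module only supports fixed-width lookbehind, so each abbreviation needs its own assertion:

```python
import re

# Sketch of an abbreviation-aware sentence splitter. The abbreviation
# list here is illustrative and would need to be extended for real text.
# Each (?<!...) is a separate fixed-width negative lookbehind, since
# Python's re module does not allow variable-width lookbehind.
_SENT_BOUNDARY = re.compile(
    r"(?<!\bDr)(?<!\bMr)(?<!\bMrs)(?<!\bMs)(?<!\bProf)(?<!\betc)\.\s+"
)

def tokenize(text):
    """Split text into sentences, keeping the terminal period."""
    parts = _SENT_BOUNDARY.split(text)
    # re.split consumes the matched '. ', so restore the period on
    # every sentence that lost it.
    return [p if p.endswith(".") else p + "." for p in parts if p]

# Example:
# tokenize("Dr. Smith went home. He slept.")
# → ["Dr. Smith went home.", "He slept."]
```

A lookbehind list like this is simple but brittle (it misses unlisted abbreviations and sentence-final abbreviations); for anything beyond a toy model, a trained sentence tokenizer such as NLTK's `punkt` is usually a better fit.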