Python implementation of an N-gram language model with Laplace smoothing and sentence generation.
Updated Feb 9, 2018 - Python
Right now the tokenize() function splits whenever a '.' character is found. Most of the time this is a correct way to split a file into sentences, but sometimes an abbreviation like Dr., Mr., Mrs., etc. appears in the middle of a sentence, and the split happens right there. I want to enhance the regex so that it does not split sentences on abbreviations.
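One possible approach (a minimal sketch, not the project's actual `tokenize()` implementation) is to guard the split pattern with negative lookbehind assertions for a hand-picked list of abbreviations. Python's `re` module only supports fixed-width lookbehind, so each abbreviation needs its own assertion:

```python
import re

# Sketch of an abbreviation-aware sentence splitter. The abbreviation
# list here is illustrative and would need to be extended for real text.
# Each (?<!...) is a separate fixed-width negative lookbehind, since
# Python's re module does not allow variable-width lookbehind.
_SENT_BOUNDARY = re.compile(
    r"(?<!\bDr)(?<!\bMr)(?<!\bMrs)(?<!\bMs)(?<!\bProf)(?<!\betc)\.\s+"
)

def tokenize(text):
    """Split text into sentences, keeping the terminal period."""
    parts = _SENT_BOUNDARY.split(text)
    # re.split consumes the matched '. ', so restore the period on
    # every sentence that lost it.
    return [p if p.endswith(".") else p + "." for p in parts if p]

# Example:
# tokenize("Dr. Smith went home. He slept.")
# → ["Dr. Smith went home.", "He slept."]
```

A lookbehind list like this is simple but brittle (it misses unlisted abbreviations and sentence-final abbreviations); for anything beyond a toy model, a trained sentence tokenizer such as NLTK's `punkt` is usually a better fit.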