# word-segmentation
Here are 85 public repositories matching this topic...
SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Topics: spellcheck, fuzzy-search, fuzzy-matching, edit-distance, levenshtein, levenshtein-distance, spelling, spell-check, chinese-text-segmentation, word-segmentation, approximate-string-matching, spelling-correction, damerau-levenshtein, text-segmentation, chinese-word-segmentation, symspell
Updated Mar 17, 2020 (C#)
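The speedup claimed above comes from the Symmetric Delete trick: instead of generating all inserts, replaces, and transposes of a query at lookup time, only delete-variants are precomputed for both the dictionary and the query. This is a minimal, illustrative sketch of that idea in pure Python, not the library's actual implementation (the tiny dictionary is made up):

```python
# Symmetric Delete sketch: index every dictionary word under all of its
# delete-only variants, then match the query's delete variants against
# that index. Candidates should still be verified with a true edit-distance
# check in real use, since delete/delete matches can exceed the distance bound.

def deletes(word, max_dist=1):
    """All strings reachable from `word` by deleting up to `max_dist` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary):
    """Map each delete-variant to the dictionary words that produce it."""
    index = {}
    for word in dictionary:
        for d in deletes(word):
            index.setdefault(d, set()).add(word)
    return index

def lookup(query, index, max_dist=1):
    """Collect dictionary words whose delete-variants overlap the query's."""
    candidates = set()
    for d in deletes(query, max_dist):
        candidates |= index.get(d, set())
    return candidates

index = build_index(["hello", "help", "world"])
print(lookup("helo", index))  # {'hello', 'help'} (set order may vary)
```

Because only deletions are generated, the candidate set per lookup is small and independent of alphabet size, which is where the speedup over naive edit-distance search comes from.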
Baidu's open-source lexical analysis tool for Chinese, including word segmentation, part-of-speech tagging & named entity recognition.
Topics: named-entity-recognition, lexical-analysis, chinese-nlp, word-segmentation, part-of-speech-tagger, chinese-word-segmentation
Updated May 11, 2020 (C++)
Unsupervised text tokenizer focused on computational efficiency
Updated Feb 13, 2020 (C++)
bact commented Dec 10, 2019:
[WORK IN PROGRESS]
Schedule
- Dev version release date: 1 May 2020
- Beta version release date: 15 June 2020
- Tentative release date: 20 June 2020
See 2.2 Milestone.
- #333 Tokenization: Add a graph size limit in newmm's `_onecut()` to avoid long waits on ambiguous text (also backported to [2.1.1](https://github.com/PyThaiNLP
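The graph-size limit mentioned in that milestone item addresses a general problem in dictionary-based tokenizers: highly ambiguous text can make the lattice of candidate splits explode. Below is an illustrative sketch of the idea, not PyThaiNLP's `newmm` code — the toy dictionary and the "fewer words is better" scoring are assumptions for the example:

```python
# Beam-limited dictionary segmentation: keep only the `beam` best partial
# segmentations at each text position, so pathologically ambiguous input
# cannot blow up the candidate graph.

DICT = {"a", "aa", "aaa"}  # toy dictionary; every split of "aaa..." is ambiguous
MAX_WORD = 3               # longest dictionary word

def segment_beam(text, beam=4):
    # paths[i] holds up to `beam` candidate segmentations of text[:i].
    paths = {0: [[]]}
    for i in range(1, len(text) + 1):
        cands = []
        for j in range(max(0, i - MAX_WORD), i):
            word = text[j:i]
            if word in DICT:
                cands += [p + [word] for p in paths.get(j, [])]
        cands.sort(key=len)      # prefer segmentations with fewer words
        paths[i] = cands[:beam]  # the graph-size cap
    return paths.get(len(text), [])

print(segment_beam("aaaaaa")[0])  # ['aaa', 'aaa']
```

Without the `[:beam]` cap, the number of stored segmentations for a string of n ambiguous characters grows exponentially; with it, work per position is bounded.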
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spelling correction, using word statistics from two big corpora (English Wikipedia and Twitter: 330 million English tweets).
Topics: nlp, tokenizer, text-processing, semeval, nlp-library, word-segmentation, spelling-correction, tokenization, text-segmentation, spell-corrector, word-normalization
Updated Oct 22, 2019 (Python)
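Hashtag splitting of the kind Ekphrasis performs is classically done with a unigram language model and dynamic programming. This toy sketch shows the technique; it is not Ekphrasis's code, and the tiny frequency table is made up for illustration:

```python
# Frequency-based word segmentation: choose the split whose words maximize
# the sum of unigram log-probabilities, via dynamic programming over prefixes.
import math

FREQ = {"make": 300, "america": 120, "great": 250, "again": 200, "a": 500, "me": 400}
TOTAL = sum(FREQ.values())

def word_logprob(word):
    # Unseen words get a penalty that grows with their length.
    if word in FREQ:
        return math.log(FREQ[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

def segment(text):
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            score = best[j][0] + word_logprob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [text[j:i]])
    return best[len(text)][1]

print(segment("makeamericagreatagain"))  # ['make', 'america', 'great', 'again']
```

In a real system the frequency table would come from a large corpus (as Ekphrasis does with Wikipedia and Twitter statistics), and smoothing for unseen words would be chosen more carefully.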
Python port of SymSpell
Topics: python, spellcheck, fuzzy-search, fuzzy-matching, edit-distance, levenshtein, levenshtein-distance, spelling, spell-check, chinese-text-segmentation, word-segmentation, approximate-string-matching, spelling-correction, damerau-levenshtein, text-segmentation, chinese-word-segmentation, symspell
Updated Apr 28, 2020 (Python)
BERT for Multitask Learning
Topics: nlp, text-classification, transformer, named-entity-recognition, pretrained-models, part-of-speech, ner, word-segmentation, bert, cws, encoder-decoder, multi-task-learning, multitask-learning
Updated Jan 28, 2020 (Python)
A Vietnamese natural language processing toolkit (NAACL 2018)
Topics: java, nlp, natural-language-processing, parsing, vietnamese, python3, named-entity-recognition, ner, word-segmentation, pos-tagging, dependency-parsing, pos-tagger, vietnamese-nlp, sentence-segmentation, vietnamese-tokenizer, vncorenlp, word-segmenter, rdrsegmenter, vnmarmot
Updated Apr 6, 2020 (Java)
DoumanAsh commented Jan 4, 2018:
For Juman++ to be widely usable, we want a documented, stable C API and the option of a dynamically linked library.
That library should probably use `-fvisibility=hidden` with explicit visibility attributes on exported symbols on Unix, and `__declspec(dllimport/dllexport)` on Windows.
The minimal API should be:
- Loading a model using a config file
- Analyzing a sentence
- Accessing
A Japanese tokenizer based on recurrent neural networks
Updated Apr 17, 2020 (Python)
MONPA (罔拍) is a multi-task model providing Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition.
Topics: nlp, named-entity-recognition, pos, ner, word-segmentation, albert, bert, pos-tagging, chinese-word-segmentation
Updated Apr 15, 2020 (Python)
Source code for the paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018
Updated Feb 1, 2018 (Python)
Source code for an ACL 2016 paper on Chinese word segmentation
Updated Jan 8, 2019 (Python)
Kiwi (an intelligent Korean morphological analyzer)
Topics: nlp, morphology, korean, word-segmentation, morphological-analysis, korean-text-processing, korean-tokenizer, korean-nlp
Updated Apr 1, 2020 (C++)
This repository is for building a Windows 64-bit MeCab binary and improving the MeCab Python binding.
Updated Feb 14, 2020 (C++)
A toolkit for Vietnamese word segmentation
Updated Apr 26, 2017 (Java)
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Updated Jun 26, 2019 (Java)
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning
Updated May 26, 2017 (Python)
Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging
Updated Mar 7, 2019 (Python)
A syllable segmentation tool for the Myanmar language (Burmese) by Ye.
Updated Jan 19, 2020 (HTML)
Vietnamese word tokenizer
Updated Aug 16, 2019 (Python)
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
Updated May 8, 2020 (JavaScript)
A PyTorch implementation of the BI-LSTM-CRF model.
Topics: nlp, crf, pytorch, ner, word-segmentation, pos-tagging, sequence-labeling, bi-lstm-crf, bilstm, crf-model, lstm-crf, bilstm-crf, sequence-tagging
Updated Apr 26, 2020 (Python)
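In a BiLSTM-CRF tagger, the CRF layer's job at inference time is Viterbi decoding: finding the highest-scoring tag sequence given per-token emission scores and tag-transition scores. This compact, library-free sketch illustrates that decoding step only; the emission and transition scores below are hand-made for the example, not from a trained model, and the BMES tag scheme is one commonly used for character-level word segmentation:

```python
# Viterbi decoding over BMES tags with hard transition constraints
# (e.g. B cannot follow B; a sequence cannot end mid-word).

TAGS = ["B", "M", "E", "S"]
NEG = float("-inf")
TRANS = {  # TRANS[prev][cur]; -inf marks an impossible transition
    "B": {"B": NEG, "M": 0.0, "E": 0.0, "S": NEG},
    "M": {"B": NEG, "M": 0.0, "E": 0.0, "S": NEG},
    "E": {"B": 0.0, "M": NEG, "E": NEG, "S": 0.0},
    "S": {"B": 0.0, "M": NEG, "E": NEG, "S": 0.0},
}

def viterbi(emissions):
    """emissions: list of {tag: score} dicts, one per character."""
    score = {t: emissions[0][t] for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + TRANS[p][cur])
            new_score[cur] = score[prev] + TRANS[prev][cur] + em[cur]
            pointers[cur] = prev
        score, back = new_score, back + [pointers]
    last = max(("E", "S"), key=lambda t: score[t])  # valid end tags only
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores for a 4-character sentence; chars 0 and 2 look word-initial.
ems = [{"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
       {"B": 0.0, "M": 0.5, "E": 2.0, "S": 0.0},
       {"B": 2.0, "M": 0.0, "E": 0.0, "S": 0.5},
       {"B": 0.0, "M": 0.0, "E": 2.0, "S": 0.5}]
print(viterbi(ems))  # ['B', 'E', 'B', 'E'] -> two two-character words
```

In a real model the emission scores come from the BiLSTM and the transition scores are learned CRF parameters, but the decoding recurrence is the same.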
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Topics: levenshtein-distance, java-8, word-segmentation, spelling-correction, damerau-levenshtein, spellchecker, symspell, qwerty-based-char-distance, weighted-damerau-levenshtein
Updated May 11, 2020 (Java)
Fast Word Segmentation with Triangular Matrix
Topics: spellcheck, spell-check, spelling-checker, spell-checker, word-segmentation, spelling-correction, spelling-corrector, spellchecker, text-segmentation, spell-corrector, symspell
Updated May 6, 2018 (C#)
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
Topics: nlp, parser, tokenizer, named-entity-recognition, dependency-parser, ner, word-segmentation, pos-tagger, vietnamese-nlp, postagger, vncorenlp, python-vncorenlp
Updated Aug 10, 2018 (Python)
Chinese word segmentation based on deep learning and an LSTM neural network
Updated Nov 22, 2016 (Python)
It would be worthwhile to provide a tutorial on how to train a simple cross-language classification model using SentencePiece. Suppose we have a training set and have chosen a model (say, a simple Word2Vec plus softmax, or an LSTM, etc.): how do we use the trained SentencePiece model (vocabulary/codes) to feed that model for training and inference?
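As a rough answer to the question above, the pipeline is: encode text to subword IDs with the trained SentencePiece model, look up an embedding per ID, pool, and feed a classifier. This sketch stubs out the SentencePiece step so it is self-contained; in real code you would train with `spm.SentencePieceTrainer.train(...)` and replace `encode` below with `SentencePieceProcessor.encode(text, out_type=int)`. The tiny vocabulary, embeddings, and weights here are made up for illustration:

```python
# Subword-ID pipeline: encode -> embed -> mean-pool -> linear layer -> softmax.
import math
import random

random.seed(0)
VOCAB = {"\u2581he": 0, "llo": 1, "\u2581wor": 2, "ld": 3}  # "▁" marks word starts
EMB_DIM, N_CLASSES = 4, 2
EMB = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in VOCAB]
W = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(N_CLASSES)]

def encode(text):
    """Stand-in for SentencePieceProcessor.encode(text, out_type=int)."""
    pieces = {"hello world": ["\u2581he", "llo", "\u2581wor", "ld"]}
    return [VOCAB[p] for p in pieces[text]]

def classify(text):
    ids = encode(text)
    # Mean-pool the subword embeddings into one sentence vector...
    pooled = [sum(EMB[i][d] for i in ids) / len(ids) for d in range(EMB_DIM)]
    # ...then apply a linear layer and a numerically stable softmax.
    logits = [sum(w[d] * pooled[d] for d in range(EMB_DIM)) for w in W]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    return [e / sum(exps) for e in exps]

probs = classify("hello world")
```

The same encode-then-embed pattern works whether the downstream model is a bag-of-subwords softmax, an averaged-embedding classifier as here, or an LSTM that consumes the ID sequence directly; only the pooling step changes.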