# word-segmentation
Here are 85 public repositories matching this topic...
SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Topics: spellcheck, fuzzy-search, fuzzy-matching, edit-distance, levenshtein, levenshtein-distance, spelling, spell-check, chinese-text-segmentation, word-segmentation, approximate-string-matching, spelling-correction, damerau-levenshtein, text-segmentation, chinese-word-segmentation, symspell
Updated Mar 17, 2020 (C#)
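The speedup claimed above comes from the Symmetric Delete trick: instead of generating all inserts, replaces, and transposes of a query at lookup time, only delete-variants are precomputed for both the dictionary and the query. This is a minimal, illustrative sketch of that idea in pure Python, not the library's actual implementation (the tiny dictionary is made up):

```python
# Symmetric Delete sketch: index every dictionary word under all of its
# delete-only variants, then match the query's delete variants against
# that index. Candidates should still be verified with a true edit-distance
# check in real use, since delete/delete matches can exceed the distance bound.

def deletes(word, max_dist=1):
    """All strings reachable from `word` by deleting up to `max_dist` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary):
    """Map each delete-variant to the dictionary words that produce it."""
    index = {}
    for word in dictionary:
        for d in deletes(word):
            index.setdefault(d, set()).add(word)
    return index

def lookup(query, index, max_dist=1):
    """Collect dictionary words whose delete-variants overlap the query's."""
    candidates = set()
    for d in deletes(query, max_dist):
        candidates |= index.get(d, set())
    return candidates

index = build_index(["hello", "help", "world"])
print(lookup("helo", index))  # {'hello', 'help'} (set order may vary)
```

Because only deletions are generated, the candidate set per lookup is small and independent of alphabet size, which is where the speedup over naive edit-distance search comes from.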
Baidu's open-source lexical analysis tool for Chinese, including word segmentation, part-of-speech tagging & named entity recognition.
Topics: named-entity-recognition, lexical-analysis, chinese-nlp, word-segmentation, part-of-speech-tagger, chinese-word-segmentation
Updated May 11, 2020 (C++)
Unsupervised text tokenizer focused on computational efficiency
Updated Feb 13, 2020 (C++)
bact commented Dec 10, 2019:
[WORK IN PROGRESS]
Schedule
- Dev version release date: 1 May 2020
- Beta version release date: 15 June 2020
- Tentative release date: 20 June 2020
See 2.2 Milestone.
- #333 Tokenization: Add a graph size limit in newmm's `_onecut()` to avoid long waits on ambiguous text (also backported to [2.1.1](https://github.com/PyThaiNLP
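The graph-size limit mentioned in that milestone item addresses a general problem in dictionary-based tokenizers: highly ambiguous text can make the lattice of candidate splits explode. Below is an illustrative sketch of the idea, not PyThaiNLP's `newmm` code — the toy dictionary and the "fewer words is better" scoring are assumptions for the example:

```python
# Beam-limited dictionary segmentation: keep only the `beam` best partial
# segmentations at each text position, so pathologically ambiguous input
# cannot blow up the candidate graph.

DICT = {"a", "aa", "aaa"}  # toy dictionary; every split of "aaa..." is ambiguous
MAX_WORD = 3               # longest dictionary word

def segment_beam(text, beam=4):
    # paths[i] holds up to `beam` candidate segmentations of text[:i].
    paths = {0: [[]]}
    for i in range(1, len(text) + 1):
        cands = []
        for j in range(max(0, i - MAX_WORD), i):
            word = text[j:i]
            if word in DICT:
                cands += [p + [word] for p in paths.get(j, [])]
        cands.sort(key=len)      # prefer segmentations with fewer words
        paths[i] = cands[:beam]  # the graph-size cap
    return paths.get(len(text), [])

print(segment_beam("aaaaaa")[0])  # ['aaa', 'aaa']
```

Without the `[:beam]` cap, the number of stored segmentations for a string of n ambiguous characters grows exponentially; with it, work per position is bounded.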
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spelling correction, using word statistics from two big corpora (English Wikipedia and Twitter: 330 million English tweets).
Topics: nlp, tokenizer, text-processing, semeval, nlp-library, word-segmentation, spelling-correction, tokenization, text-segmentation, spell-corrector, word-normalization
Updated Oct 22, 2019 (Python)
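Hashtag splitting of the kind Ekphrasis performs is classically done with a unigram language model and dynamic programming. This toy sketch shows the technique; it is not Ekphrasis's code, and the tiny frequency table is made up for illustration:

```python
# Frequency-based word segmentation: choose the split whose words maximize
# the sum of unigram log-probabilities, via dynamic programming over prefixes.
import math

FREQ = {"make": 300, "america": 120, "great": 250, "again": 200, "a": 500, "me": 400}
TOTAL = sum(FREQ.values())

def word_logprob(word):
    # Unseen words get a penalty that grows with their length.
    if word in FREQ:
        return math.log(FREQ[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

def segment(text):
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            score = best[j][0] + word_logprob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [text[j:i]])
    return best[len(text)][1]

print(segment("makeamericagreatagain"))  # ['make', 'america', 'great', 'again']
```

In a real system the frequency table would come from a large corpus (as Ekphrasis does with Wikipedia and Twitter statistics), and smoothing for unseen words would be chosen more carefully.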
Python port of SymSpell
Topics: python, spellcheck, fuzzy-search, fuzzy-matching, edit-distance, levenshtein, levenshtein-distance, spelling, spell-check, chinese-text-segmentation, word-segmentation, approximate-string-matching, spelling-correction, damerau-levenshtein, text-segmentation, chinese-word-segmentation, symspell
Updated Apr 28, 2020 (Python)
BERT for Multitask Learning
Topics: nlp, text-classification, transformer, named-entity-recognition, pretrained-models, part-of-speech, ner, word-segmentation, bert, cws, encoder-decoder, multi-task-learning, multitask-learning
Updated Jan 28, 2020 (Python)
A Vietnamese natural language processing toolkit (NAACL 2018)
Topics: java, nlp, natural-language-processing, parsing, vietnamese, python3, named-entity-recognition, ner, word-segmentation, pos-tagging, dependency-parsing, pos-tagger, vietnamese-nlp, sentence-segmentation, vietnamese-tokenizer, vncorenlp, word-segmenter, rdrsegmenter, vnmarmot
Updated Apr 6, 2020 (Java)
DoumanAsh commented Jan 4, 2018:
For Juman++ to be widely usable, we want a documented, stable C API and the option of a dynamically linked library.
That library should probably use `-fvisibility=hidden` with explicit visibility attributes on exported symbols on Unix, and `__declspec(dllimport/dllexport)` on Windows.
The minimal API should be:
- Loading a model using a config file
- Analyzing a sentence
- Accessing
A Japanese tokenizer based on recurrent neural networks
Updated Apr 17, 2020 (Python)
MONPA (罔拍) is a multi-task model providing Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition.
Topics: nlp, named-entity-recognition, pos, ner, word-segmentation, albert, bert, pos-tagging, chinese-word-segmentation
Updated Apr 15, 2020 (Python)
Source code for the paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018
Updated Feb 1, 2018 (Python)
Source code for an ACL 2016 paper on Chinese word segmentation
Updated Jan 8, 2019 (Python)
Kiwi (an intelligent Korean morphological analyzer)
Topics: nlp, morphology, korean, word-segmentation, morphological-analysis, korean-text-processing, korean-tokenizer, korean-nlp
Updated Apr 1, 2020 (C++)
This repository is for building a Windows 64-bit MeCab binary and improving the MeCab Python binding.
Updated Feb 14, 2020 (C++)
A toolkit for Vietnamese word segmentation
Updated Apr 26, 2017 (Java)
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Updated Jun 26, 2019 (Java)
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning
Updated May 26, 2017 (Python)
Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging
Updated Mar 7, 2019 (Python)
A syllable segmentation tool for the Myanmar language (Burmese) by Ye.
Updated Jan 19, 2020 (HTML)
Vietnamese word tokenizer
Updated Aug 16, 2019 (Python)
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
Updated May 8, 2020 (JavaScript)
A PyTorch implementation of the BI-LSTM-CRF model.
Topics: nlp, crf, pytorch, ner, word-segmentation, pos-tagging, sequence-labeling, bi-lstm-crf, bilstm, crf-model, lstm-crf, bilstm-crf, sequence-tagging
Updated Apr 26, 2020 (Python)
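In a BiLSTM-CRF tagger, the CRF layer's job at inference time is Viterbi decoding: finding the highest-scoring tag sequence given per-token emission scores and tag-transition scores. This compact, library-free sketch illustrates that decoding step only; the emission and transition scores below are hand-made for the example, not from a trained model, and the BMES tag scheme is one commonly used for character-level word segmentation:

```python
# Viterbi decoding over BMES tags with hard transition constraints
# (e.g. B cannot follow B; a sequence cannot end mid-word).

TAGS = ["B", "M", "E", "S"]
NEG = float("-inf")
TRANS = {  # TRANS[prev][cur]; -inf marks an impossible transition
    "B": {"B": NEG, "M": 0.0, "E": 0.0, "S": NEG},
    "M": {"B": NEG, "M": 0.0, "E": 0.0, "S": NEG},
    "E": {"B": 0.0, "M": NEG, "E": NEG, "S": 0.0},
    "S": {"B": 0.0, "M": NEG, "E": NEG, "S": 0.0},
}

def viterbi(emissions):
    """emissions: list of {tag: score} dicts, one per character."""
    score = {t: emissions[0][t] for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + TRANS[p][cur])
            new_score[cur] = score[prev] + TRANS[prev][cur] + em[cur]
            pointers[cur] = prev
        score, back = new_score, back + [pointers]
    last = max(("E", "S"), key=lambda t: score[t])  # valid end tags only
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores for a 4-character sentence; chars 0 and 2 look word-initial.
ems = [{"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
       {"B": 0.0, "M": 0.5, "E": 2.0, "S": 0.0},
       {"B": 2.0, "M": 0.0, "E": 0.0, "S": 0.5},
       {"B": 0.0, "M": 0.0, "E": 2.0, "S": 0.5}]
print(viterbi(ems))  # ['B', 'E', 'B', 'E'] -> two two-character words
```

In a real model the emission scores come from the BiLSTM and the transition scores are learned CRF parameters, but the decoding recurrence is the same.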
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Topics: levenshtein-distance, java-8, word-segmentation, spelling-correction, damerau-levenshtein, spellchecker, symspell, qwerty-based-char-distance, weighted-damerau-levenshtein
Updated May 11, 2020 (Java)
Fast Word Segmentation with Triangular Matrix
Topics: spellcheck, spell-check, spelling-checker, spell-checker, word-segmentation, spelling-correction, spelling-corrector, spellchecker, text-segmentation, spell-corrector, symspell
Updated May 6, 2018 (C#)
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
Topics: nlp, parser, tokenizer, named-entity-recognition, dependency-parser, ner, word-segmentation, pos-tagger, vietnamese-nlp, postagger, vncorenlp, python-vncorenlp
Updated Aug 10, 2018 (Python)
Chinese word segmentation based on deep learning and an LSTM neural network
Updated Nov 22, 2016 (Python)
It would be worthwhile to provide a tutorial on how to train a simple cross-language classification model using SentencePiece. Suppose we have a training set and have chosen a model (say, a simple Word2Vec plus softmax, or an LSTM, etc.): how do we use the trained SentencePiece model (vocabulary/codes) to feed that model for training and inference?
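As a rough answer to the question above, the pipeline is: encode text to subword IDs with the trained SentencePiece model, look up an embedding per ID, pool, and feed a classifier. This sketch stubs out the SentencePiece step so it is self-contained; in real code you would train with `spm.SentencePieceTrainer.train(...)` and replace `encode` below with `SentencePieceProcessor.encode(text, out_type=int)`. The tiny vocabulary, embeddings, and weights here are made up for illustration:

```python
# Subword-ID pipeline: encode -> embed -> mean-pool -> linear layer -> softmax.
import math
import random

random.seed(0)
VOCAB = {"\u2581he": 0, "llo": 1, "\u2581wor": 2, "ld": 3}  # "▁" marks word starts
EMB_DIM, N_CLASSES = 4, 2
EMB = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in VOCAB]
W = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(N_CLASSES)]

def encode(text):
    """Stand-in for SentencePieceProcessor.encode(text, out_type=int)."""
    pieces = {"hello world": ["\u2581he", "llo", "\u2581wor", "ld"]}
    return [VOCAB[p] for p in pieces[text]]

def classify(text):
    ids = encode(text)
    # Mean-pool the subword embeddings into one sentence vector...
    pooled = [sum(EMB[i][d] for i in ids) / len(ids) for d in range(EMB_DIM)]
    # ...then apply a linear layer and a numerically stable softmax.
    logits = [sum(w[d] * pooled[d] for d in range(EMB_DIM)) for w in W]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    return [e / sum(exps) for e in exps]

probs = classify("hello world")
```

The same encode-then-embed pattern works whether the downstream model is a bag-of-subwords softmax, an averaged-embedding classifier as here, or an LSTM that consumes the ID sequence directly; only the pooling step changes.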