
natural-language-understanding

Here are 585 public repositories matching this topic...

transformers
ikergarcia1996 commented Dec 10, 2021

🚀 Feature request

Fast Tokenizer for DeBERTa-V3 and mDeBERTa-V3

Motivation

DeBERTa V3 is an improved version of DeBERTa. With V3, the authors also released a multilingual model, "mDeBERTa-base", that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (They require a Fa

gluon-nlp
preeyank5 commented Dec 3, 2020

Description

When calling tokenizers.create with the model and vocab file for a custom corpus, the code throws an error and cannot generate the BERT vocab file.

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_p

ArticutAPI

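API of Articut Chinese word segmentation (with semantic part-of-speech tagging): word segmentation (斷詞, also called 分詞) is the foundation of Chinese information processing. Articut uses no machine learning and requires no data model; relying only on the grammar rules of modern vernacular Chinese, it achieves an F1-measure above 94% and a recall above 96% on SIGHAN 2005.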

  • Updated Jan 11, 2022
  • Python
