bert
Here are 2,196 public repositories matching this topic...
chooses 15% of tokens
The paper says:
"Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy."
This implies that exactly 15% of the tokens are chosen. However, in https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, every single token independently has a 15% chance of going through the follow-up (masking) procedure, so the selected fraction only averages 15%.
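For illustration, here is a minimal sketch (not the repository's actual code) contrasting the two readings:

```python
import random

tokens = ["my", "dog", "is", "hairy"]

# Reading 1 (paper's wording): choose exactly 15% of the tokens at random.
k = max(1, round(0.15 * len(tokens)))
chosen_exact = random.sample(range(len(tokens)), k)

# Reading 2 (dataset.py): each token independently has a 15% chance of being
# selected, so the actual fraction varies from sentence to sentence.
chosen_independent = [i for i in range(len(tokens)) if random.random() < 0.15]
```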
PositionalEmbedding
Problem
Currently FARMReader asks users to raise max_seq_length every time some samples are longer than the value set for it. However, this is confusing if max_seq_length is already set to the maximum value allowed by the model, because raising it further will cause hard-to-read CUDA errors.
See #2177.
Solution
We should find a way to query the model for the maximum value it allows.
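One possible direction is to read the limit from the model's config. The sketch below assumes a Hugging-Face-style config; the checkpoint name and the attribute lookup are only illustrative, not FARMReader's actual API:

```python
from transformers import AutoConfig

# Read the model's positional-embedding limit and warn instead of asking
# the user to raise max_seq_len beyond what the model supports.
config = AutoConfig.from_pretrained("deepset/roberta-base-squad2")  # illustrative checkpoint
model_limit = getattr(config, "max_position_embeddings", None)

max_seq_len = 512
if model_limit is not None and max_seq_len >= model_limit:
    print(
        f"max_seq_len={max_seq_len} is already at the model's limit ({model_limit}); "
        "raising it further would only trigger CUDA errors."
    )
```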
Suggestion: document tokenizer categories and add examples
We welcome feedback on PaddleNLP usage issues, and thank you very much for contributing to PaddleNLP!
When filing your issue, please also provide the following information:
- Version and environment information
1) PaddleNLP and PaddlePaddle versions: please provide your PaddleNLP and PaddlePaddle version numbers, e.g. PaddleNLP 2.0.4, PaddlePaddle 2.1.1
2) System environment: please describe the OS type (e.g. Linux/Windows/MacOS) and Python version - Reproduction information: if reporting an error, please give the environment and steps to reproduce it
Paddle version 2.0.8, PaddleNLP version 2.1.0
Suggestion: could the PaddleNLP documentation list which category each model's tokenizer is based on, e.g. the BERT tokenizer is WordPiece-based and the XLNet tokenizer is SentencePiece-based, together with corresponding input/output examples?
Regarding some specific suggestions
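The kind of example the suggestion asks for might look like the following sketch (the exact subword split shown in the comment depends on the vocabulary and is only illustrative):

```python
from paddlenlp.transformers import BertTokenizer

# BERT uses a WordPiece tokenizer: out-of-vocabulary words are split into
# subword pieces marked with "##".
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unaffable"))
# -> subword pieces such as ['una', '##ffa', '##ble']
```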
Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.
Tokenizers concerned
Not yet claimed:
- LED
- RemBert
- RetriBert
Claimed:
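The real test files in transformers build on the shared tokenizer test mixin in the tests/ directory; as a rough standalone idea of what such a test checks, a minimal smoke test (illustrative checkpoint and assertion only) could look like this:

```python
from transformers import LEDTokenizer

def test_led_tokenizer_roundtrip():
    # Rough smoke test: encoding then decoding should recover the input text.
    tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
    text = "Adding tokenizer tests is a good first contribution."
    ids = tok(text)["input_ids"]
    assert tok.decode(ids, skip_special_tokens=True).strip() == text
```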