bert

Related to #5142, AlbertTokenizer (which uses SentencePiece) doesn't decode special tokens (like [CLS], [MASK]) properly. This issue was discovered when adding the Nystromformer model (#14659), which uses this tokenizer.

To reproduce (Transformers v4.15 or below):

!pip install -q transformers sentencepiece

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from

The paper states:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy.

This implies that exactly 15% of the tokens are always chosen.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, however,
each individual token has a 15% chance of going through the follow-up procedure.
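To make the difference between the two readings concrete, here is a minimal sketch in plain Python. The function names `select_per_token` and `select_exact_fraction` are hypothetical and are not taken from the paper or from the linked repository; the sketch only covers how positions are selected, not the follow-up masking or replacement step.

```python
import random


def select_per_token(tokens, rate=0.15, rng=random):
    """Reading 1: every token is selected independently with probability `rate`.

    The number of selected positions varies from sentence to sentence.
    """
    return [i for i in range(len(tokens)) if rng.random() < rate]


def select_exact_fraction(tokens, rate=0.15, rng=random):
    """Reading 2: a fixed share of positions is selected (at least one).

    The number of selected positions depends only on the sentence length.
    """
    k = max(1, round(len(tokens) * rate))
    return sorted(rng.sample(range(len(tokens)), k))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    tokens = ["tok"] * 100
    print("per-token:", len(select_per_token(tokens, rng=rng)), "positions")
    print("exact fraction:", len(select_exact_fraction(tokens, rng=rng)), "positions")
```

With the per-token reading, the selected share is 15% only in expectation, and a short sentence may end up with no selected token at all; with the exact-fraction reading, the count is fixed for a given length.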

_handle_duplicate_documents and _drop_duplicate_documents in the Elasticsearch document store always report self.index as the index with the conflict, which is incorrect when a different index was passed.

Edit: Upon further investigation, this is actually a lot worse. Using multiple indices with the ElasticSearch DocumentStore is completely broken because this is used in `_handle_duplicate_do

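Thank you for reporting issues with PaddleNLP; we greatly appreciate your contribution to PaddleNLP!
When filing your issue, please also provide the following information:

Version and environment information
1) PaddleNLP and PaddlePaddle versions: please provide your PaddleNLP and PaddlePaddle version numbers, e.g. PaddleNLP 2.0.4, PaddlePaddle 2.1.1
2) System environment: please describe your system type, e.g. Linux/Windows/MacOS, and your Python version
Reproduction information: if you are reporting an error, please give the reproduction environment and steps
paddle version 2.0.8, paddlenlp version 2.1.0
Suggestion: could the PaddleNLP documentation list which type of tokenizer each model's tokenizer is based on (e.g. the BERT tokenizer is WordPiece-based, the XLNet tokenizer is SentencePiece-based), together with corresponding input and output examples?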
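As an illustration of the kind of input/output example being requested, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting in plain Python. It does not use PaddleNLP; the function name `wordpiece_tokenize` and the tiny vocabulary `toy_vocab` are assumptions made up for this example, and real tokenizers also perform normalization and whitespace/punctuation splitting that is omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that do not start the word carry the "##" continuation prefix.
    If any part of the word cannot be matched, the whole word becomes unk_token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "hair", "##y"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("hairy", toy_vocab))      # ['hair', '##y']
print(wordpiece_tokenize("dog", toy_vocab))        # ['[UNK]']
```

A SentencePiece-based tokenizer differs in that it works on the raw text, including whitespace, rather than on pre-split words, which is why a per-model listing of tokenizer type plus sample output would be helpful.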

Here are 2,074 public repositories matching this topic...

huggingface / transformers

graykode / nlp-tutorial

hanxiao / bert-as-service

brightmart / nlp_chinese_corpus

ymcui / Chinese-BERT-wwm

huggingface / tokenizers

PaddlePaddle / ERNIE

codertimo / BERT-pytorch

deepset-ai / haystack

macanv / BERT-BiLSTM-CRF-NER

jessevig / bertviz

brightmart / albert_zh

bentrevett / pytorch-sentiment-analysis

shibing624 / pycorrector

PaddlePaddle / PaddleNLP

IntelLabs / nlp-architect

JohnSnowLabs / spark-nlp

CLUEbenchmark / CLUE

CyberZHG / keras-bert

BrikerMan / Kashgari

asyml / texar

km1994 / nlp_paper_study

bytedance / lightseq

brightmart / roberta_zh

Separius / awesome-sentence-embedding

MaartenGr / BERTopic

dbiir / UER-py

namisan / mt-dnn

Jiakui / awesome-bert

utterworks / fast-bert
