Google Research Datasets
- Mountain View, CA
- http://research.google.com
Repositories
- common-crawl-domain-names: Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl"). (See the segmentation sketch after this list.)
- natural-questions: Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
- Textual-Entailment-New-Protocols: This data release accompanies and documents the paper "Collecting Entailment Data for Pretraining: New Protocols and Negative Results" by Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, and Emily Pitler (https://arxiv.org/abs/2004.11997).
- QED: A Framework and Dataset for Explanations in Question Answering.
- turkish-treebanks: A human-annotated morphosyntactic treebank for Turkish.
- Crisscrossed-Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO.
- seq2act: This repository contains the open-source versions of the datasets used for different parts of training and testing of models that ground natural language to UI actions, as described in the paper "Mapping Natural Language Instructions to Mobile UI Action Sequences" by Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge, which is acc…
- great: The dataset for the variable-misuse task, used in the ICLR 2020 paper "Global Relational Models of Source Code" (https://openreview.net/forum?id=B1lnbRNtwr).
- eth_py150_open: A redistributable subset of the ETH Py150 corpus (https://www.sri.inf.ethz.ch/py150), introduced in the ICML 2020 paper "Learning and Evaluating Contextual Embedding of Source Code" (https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf).
- MultiReQA: A challenging new benchmark: "MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models". Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus. MultiReQA is a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from pu…
- xsum_hallucination_annotations: Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (https://www.aclweb.org/anthology/2020.acl-main.173.pdf).
- NewSHead: The NewSHead dataset is a multi-document headline dataset used in NHNet to train a headline summarization model.
- tydiqa: TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the training and evaluation of automatic question answering systems. This repository provides evaluation code and a baseline system for the dataset.
- dakshina: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of t…
- ToTTo: ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.
- lareqa: LAReQA is a challenging benchmark for evaluating language-agnostic answer retrieval from a multilingual candidate pool. This repository contains a dataset we release as part of the LAReQA evaluation.
- dstc8-schema-guided-dialogue: The Schema-Guided Dialogue Dataset.
- Taskmaster: Please see the README file as well as our 2019 EMNLP paper.
- word_sense_disambigation_corpora: SemCor and MASC documents annotated with NOAD word senses.
- paws: This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification. (See the loader sketch after this list.)
- Image-Caption-Quality-Dataset: A dataset of crowdsourced ratings for machine-generated image captions.
- wiki-split: One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits. (See the parsing sketch after this list.)
- wiki-atomic-edits: A dataset of atomic Wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.
- noun-verb: This dataset contains naturally occurring English sentences that feature non-trivial noun-verb ambiguity.
- gap-coreference: GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practical applications. (See the reader sketch after this list.)
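
The sketches below illustrate a few of the entries above that describe a concrete task or data layout. Each is a hedged sketch, not official tooling from these repositories. First, the word-boundary task behind common-crawl-domain-names can be illustrated with a tiny dynamic-programming segmenter; the vocabulary here is a toy stand-in for a real wordlist, and the corpus itself was segmented by human annotators, not by code like this.

```python
# Minimal word-segmentation sketch for the common-crawl-domain-names task.
# VOCAB is a hypothetical stand-in for a real wordlist; the released corpus
# was annotated by humans, so this illustrates the task, not the method.

VOCAB = {"common", "crawl", "open", "source", "data"}

def segment(name):
    """Split a concatenated domain name into vocabulary words, if possible."""
    # best[i] holds one segmentation of name[:i], or None if none exists.
    best = [None] * (len(name) + 1)
    best[0] = []
    for i in range(1, len(name) + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in VOCAB:
                best[i] = best[j] + [name[j:i]]
                break
    return best[len(name)]

print(segment("commoncrawl"))     # ['common', 'crawl']
print(segment("opensourcedata"))  # ['open', 'source', 'data']
```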
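For paws, a loader sketch. The column names (id, sentence1, sentence2, label) follow the TSV layout described for the PAWS release, but treat them as assumptions to verify against that repository's README.

```python
# Sketch: iterate over paraphrase pairs in a PAWS-style TSV file.
# ASSUMED columns: id, sentence1, sentence2, label (1 = paraphrase,
# 0 = not a paraphrase); verify against the paws README.
import csv

def load_pairs(path):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row["sentence1"], row["sentence2"], int(row["label"])

if __name__ == "__main__":
    # Hypothetical file name; the release is split into train/dev/test files.
    for s1, s2, label in load_pairs("train.tsv"):
        print(label, s1, "|", s2)
```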
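For wiki-split, a parsing sketch. It assumes tab-separated lines with the two split sentences joined by a "<::::>" delimiter; both the delimiter and the column order are assumptions to check against that repository.

```python
# Sketch: parse a WikiSplit-style line into (complex, [simple, simple]).
# ASSUMPTION: each line is "complex<TAB>simple_1 <::::> simple_2";
# confirm the exact format in the wiki-split repository.

def parse_line(line):
    complex_sentence, split_side = line.rstrip("\n").split("\t")
    simple_sentences = [s.strip() for s in split_side.split("<::::>")]
    return complex_sentence, simple_sentences

# Illustrative line (invented for this sketch, not taken from the corpus):
sample = ("Edison invented the phonograph and it amazed the public .\t"
          "Edison invented the phonograph . <::::> It amazed the public .")
source, targets = parse_line(sample)
print(source)
print(targets)  # two sentences that together preserve the original meaning
```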
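For gap-coreference, a reader sketch using field names from the published GAP TSV format (ID, Text, Pronoun, A, A-coref, B, B-coref, ...); these names, and the file name below, should still be verified against the repository before use.

```python
# Sketch: read GAP-style coreference examples and recover the gold
# antecedent for each ambiguous pronoun. Field names follow the published
# GAP TSV format; verify them against the gap-coreference README.
import csv

def gold_antecedent(row):
    """Return the candidate name the pronoun refers to, or None for neither."""
    if row["A-coref"] == "TRUE":
        return row["A"]
    if row["B-coref"] == "TRUE":
        return row["B"]
    return None

def load_examples(path):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row["Text"], row["Pronoun"], gold_antecedent(row)

if __name__ == "__main__":
    # Hypothetical file name; check the repository for the released splits.
    for text, pronoun, antecedent in load_examples("gap-development.tsv"):
        print(pronoun, "->", antecedent)
```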