@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. Natural Questions (NQ) contains real user questions issued to Google search, and answers found in Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 754 139

  2. Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

    Shell 388 20

  3. ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, p…

    359 32

  4. dakshina Public

    The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia tex…

    155 16

  5. tydiqa Public

    TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the trai…

    Python 224 34

  6. GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practica…

    Python 211 83

Repositories

  • cvss Public

    CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

    108 CC-BY-4.0 7 0 0 Updated Aug 26, 2022
  • dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 412 CC-BY-SA-4.0 98 3 0 Updated Aug 23, 2022
  • hiertext Public

    The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

    Jupyter Notebook 115 CC-BY-SA-4.0 8 0 0 Updated Aug 17, 2022
  • Objectron Public

    Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box. The 3D bounding box descri…

    Jupyter Notebook 1,979 246 21 0 Updated Jul 20, 2022
  • clang8 Public

    cLang-8 is a dataset for grammatical error correction.

    Python 51 3 6 0 Updated Jul 19, 2022
  • maverics Public

    MAVERICS (Manually-vAlidated Vq^2a Examples fRom Image-Caption datasetS) is a suite of test-only benchmarks for visual question answering (VQA).

    3 0 0 0 Updated Jul 7, 2022
  • wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    709 29 3 1 Updated Jun 9, 2022
  • informal Public

    InFormal is a formality style transfer dataset for four Indic languages. The dataset is made up of pairs of sentences and corresponding human-annotated labels identifying the more formal sentence as well as the pair’s semantic similarity. This dataset can be used as an evaluation set for style transfer tasks in Indic languages. InFormal contains s…

    0 Apache-2.0 0 0 0 Updated May 24, 2022
  • RxR Public

    Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi, and Telugu, and 126k navigation-following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

    Python 81 CC-BY-4.0 10 1 0 Updated Apr 8, 2022
  • TF-IDF-IIF-top100-wordlists Public

    These are lists for a variety of languages containing words that are distinctive to each language.

    22 3 1 0 Updated Apr 6, 2022
