Datasets

XQuAD

XQuAD is a benchmark dataset for evaluating cross-lingual question answering performance.

CommonGen

Building machines with commonsense to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
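The constraint in the task description (every given concept must be used in the generated sentence) can be illustrated with a small coverage check. This is only a sketch: the function name `covers_concepts`, the example concept set, and the prefix-matching rule (a crude stand-in for proper lemmatization) are assumptions made for illustration, not part of the CommonGen benchmark or its official metrics.

```python
import re


def covers_concepts(sentence: str, concepts: list[str]) -> bool:
    """Return True if every concept matches the start of some word in the sentence.

    Prefix matching is an assumed, simplified substitute for lemmatization:
    "catch" matches "catches", but an irregular form such as "ran" would
    not match "run".
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    return all(
        any(word.startswith(concept.lower()) for word in words)
        for concept in concepts
    )


# Illustrative concept set and candidate sentences (not taken from the dataset files).
concepts = ["dog", "frisbee", "catch", "throw"]
print(covers_concepts("A dog catches a frisbee that a boy throws.", concepts))
print(covers_concepts("A dog sleeps on the couch.", concepts))
```

A check like this only verifies that the concepts appear; whether the sentence is coherent and plausible is the harder part of the task and is not captured here.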

BLiMP

BLiMP (the Benchmark of Linguistic Minimal Pairs) is a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. Evaluations on it find that state-of-the-art models reliably identify morphological contrasts related to agreement, but struggle with some subtle semantic and syntactic phenomena.
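A minimal-pair benchmark is commonly scored by checking whether a model rates the acceptable sentence of each pair higher than the unacceptable one. The sketch below assumes that setup; `minimal_pair_accuracy`, `toy_score`, and the two sentence pairs are hypothetical names and data introduced for illustration, and in practice `score` would be replaced by a function returning a language model's log-probability for the sentence.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives a strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in scorer: penalizes one known agreement error."""
    return -float(sentence.count("cats annoys"))


# Illustrative pairs: (acceptable, unacceptable).
pairs = [
    ("The cats annoy Tim.", "The cats annoys Tim."),
    ("These dogs bark.", "These dog bark."),
]
print(minimal_pair_accuracy(pairs, toy_score))
```

The toy scorer only recognizes the first contrast and ties on the second, so a tie counts as a miss; requiring a strictly higher score is a design choice made for this sketch.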

TAL-SCQ5K

TAL-SCQ5K is a collection of high-quality mathematical competition datasets created by TAL Education Group.

X-CSR

To create the X-CSR datasets, the authors automatically translated the CSQA and CODAH datasets, originally available only in English, into 15 other languages.

DOCCI

The DOCCI dataset consists of comprehensive descriptions of 15k images taken specifically to evaluate text-to-image (T2I) and image-to-text (I2T) models. The descriptions cover many key details in the images.

AI2_Reasoning_Challenge
AI2 Reasoning Challenge

The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2.

MNIST

MNIST is a dataset of handwritten digit images, widely used to train and evaluate image classification models.

NIH_Chest_X_ray
NIH Chest X-Ray

NIH Chest X-Ray is a large dataset containing chest X-ray images of patients collected by the National Institutes of Health (NIH) of the United States.

PLOD_An_Abbreviation_Detection_Dataset
PLOD: An Abbreviation Detection Dataset

This repository hosts the subset of the PLOD dataset used for coursework (CW) in the 2023-2024 NLP module at the University of Surrey.

moodeng-dataset-pro-1.42
Moodeng Dataset Pro-v1.42

data-training
Data Training Ver01

META LLAMA 3 COMMUNITY LICENSE AGREEMENT. Meta Llama 3 Version Release Date: April 18, 2024. "Meta Llama 3" means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta.

super-dataset-in-the-world
Super Dataset In The World

The Super Dataset in the World is a groundbreaking, all-encompassing data repository designed to empower researchers, developers, and industry professionals with an unparalleled resource for machine learning, data analytics, and AI innovation. Meticulously curated from diverse, high-quality sources across multiple domains, this dataset sets a new benchmark in data comprehensiveness, accuracy, and scalability.

data-check-001

META LLAMA 3 COMMUNITY LICENSE AGREEMENT
