Datasets

Social Bias Frames

This dataset aims to cover a wide variety of social biases implied in text, both subtle and overt, and to make those biases representative of the real-world discrimination that people experience (RWJF, 2017).

PathVQA

PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.

FaithDial

The goal of information-seeking dialogue is to respond to user queries with natural language utterances that are grounded on knowledge sources.

VQA-RAD

VQA-RAD consists of 3,515 question–answer pairs on 315 radiology images.

DocRED

DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset is human-annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence.

TextVQA

TextVQA is a dataset to benchmark visual reasoning based on text in images. TextVQA requires models to read and reason about text in images to answer questions about them.

XQuAD

XQuAD is a benchmark dataset for evaluating cross-lingual question answering performance.

CommonGen

Building machines with commonsense to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

Vibe-Eval

Vibe-Eval consists of 269 ultra-high-quality image-text prompts and their ground-truth responses. The quality of the prompts and responses has been checked multiple times by our team.

BLiMP

BLiMP (the Benchmark of Linguistic Minimal Pairs) is a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. Its authors find that state-of-the-art models reliably identify morphological contrasts related to agreement, but struggle with some subtle semantic and syntactic phenomena.

TAL-SCQ5K

TAL-SCQ5K is a set of high-quality mathematical competition datasets created by TAL Education Group.

X-CSR

To create the X-CSR datasets, the authors automatically translated the CSQA and CODAH datasets, originally available only in English, into 15 other languages.

KLUE

The Korean Language Understanding Evaluation (KLUE) benchmark is a collection of datasets for evaluating the natural language understanding capability of Korean language models.

DOCCI

The DOCCI dataset consists of comprehensive descriptions of 15k images taken specifically for evaluating text-to-image (T2I) and image-to-text (I2T) models. The descriptions cover many key details in the images.

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.

MMLU

MMLU is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.

PLOD: An Abbreviation Detection Dataset

This repository contains the subset of the PLOD dataset used for coursework (CW) in the 2023-2024 NLP module at the University of Surrey.

AI2 Reasoning Challenge

The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2.

WikiText

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

MathVista

MathVista is a diverse benchmark for mathematical reasoning in visual contexts. It includes 6,141 examples drawn from 31 datasets.
