Datasets
The main aim of this dataset is to cover a wide variety of social biases that are implied in text, both subtle and overt, and to make the biases representative of the real-world discrimination that people experience (RWJF, 2017).
PathVQA consists of 32,799 open-ended questions about 4,998 pathology images; each question is manually checked to ensure correctness.
The goal of information-seeking dialogue is to respond to user queries with natural language utterances that are grounded on knowledge sources.
VQA-RAD consists of 3,515 question–answer pairs on 315 radiology images.
DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset is human-annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence.
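As a rough illustration of that annotation structure, the sketch below walks one document from the public DocRED JSON release; the field names ("sents", "vertexSet", "labels" with "h"/"t"/"r"/"evidence") follow that release and may differ in other packagings of the dataset.

```python
import json

# Sketch of reading the public DocRED JSON release (e.g. train_annotated.json).
with open("train_annotated.json", encoding="utf-8") as f:
    docs = json.load(f)

doc = docs[0]
sents = doc["sents"]          # list of sentences, each a list of tokens
entities = doc["vertexSet"]   # list of entities, each a list of coreferent mentions

for label in doc["labels"]:   # document-level relation facts
    head = entities[label["h"]][0]["name"]   # first mention of the head entity
    tail = entities[label["t"]][0]["name"]   # first mention of the tail entity
    relation = label["r"]                    # Wikidata property id, e.g. "P131"
    evidence = label["evidence"]             # indices of supporting sentences
    print(f"{head} --{relation}--> {tail}  (evidence sentences: {evidence})")
```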
TextVQA is a benchmark for visual reasoning based on text in images: models must read and reason about the text that appears in an image in order to answer questions about it.
This dataset is a resource for researchers who want to evaluate cross-lingual question-answering performance.
Building machines with the commonsense needed to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, with an associated benchmark dataset, that explicitly tests machines for generative commonsense reasoning: given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
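A minimal illustration of the task format, using the concept set the CommonGen paper itself uses as a running example; the record below mirrors the input/output shape rather than any specific dataset entry.

```python
# Illustrative CommonGen-style record: a concept set plus one acceptable reference
# sentence (the dog/frisbee example from the CommonGen paper's illustration).
example = {
    "concepts": ["dog", "frisbee", "catch", "throw"],
    "target": "A dog leaps to catch a thrown frisbee.",
}

# A generation is judged on whether it weaves every concept into a coherent everyday
# scene, typically scored against reference sentences with metrics such as BLEU,
# CIDEr, and SPICE.
print(", ".join(example["concepts"]), "->", example["target"])
```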
Vibe-Eval comprises 269 ultra-high-quality image-text prompts and their ground-truth responses. The quality of the prompts and responses has been extensively checked multiple times by our team.
The Benchmark of Linguistic Minimal Pairs (BLiMP) is a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. It finds that state-of-the-art models reliably identify morphological contrasts related to agreement but struggle with some subtle semantic and syntactic phenomena.
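The standard way to use such a challenge set is to check whether an LM assigns a higher probability to the acceptable member of each minimal pair. The sketch below scores one pair with GPT-2; both the model and the agreement pair are illustrative choices, not part of BLiMP itself.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean negative log-likelihood per predicted token;
        # rescale to an (approximate) total sentence log-probability
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

good = "The cats on the sofa were sleeping."
bad = "The cats on the sofa was sleeping."
# The model "passes" the item if the acceptable sentence scores higher.
print(sentence_logprob(good) > sentence_logprob(bad))
```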
TAL-SCQ5K comprises high-quality mathematical competition datasets created by TAL Education Group.
To create these datasets, the authors automatically translated CSQA and CODAH, originally available only in English, into 15 other languages.
The Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets for evaluating the natural language understanding capabilities of Korean language models.
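A sketch of pulling one KLUE task through the Hugging Face `datasets` library; the hub id "klue", the "ynat" topic-classification config, and the "title"/"label" field names are assumptions about the current packaging.

```python
from datasets import load_dataset

# Load the YNAT topic-classification task of KLUE (assumed hub id and config name).
ynat = load_dataset("klue", "ynat")

example = ynat["train"][0]
label_name = ynat["train"].features["label"].int2str(example["label"])
print(example["title"], "->", label_name)
```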
The DOCCI dataset consists of comprehensive descriptions of 15k images taken specifically with the objective of evaluating text-to-image (T2I) and image-to-text (I2T) models. The descriptions cover many key details of the images.
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
MMLU is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.
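As a sketch of what few-shot evaluation looks like in practice, the snippet below assembles a k-shot multiple-choice prompt in the style commonly used for MMLU (solved questions followed by the test question and a trailing "Answer:"); the records and exact wording are illustrative assumptions, not actual MMLU items.

```python
LETTERS = "ABCD"

def format_question(q: dict, include_answer: bool) -> str:
    # Render a question with lettered choices; optionally append the gold answer.
    lines = [q["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(q["choices"])]
    lines.append("Answer:" + (f" {LETTERS[q['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(dev_examples: list[dict], test_example: dict) -> str:
    # k solved demonstrations, then the unanswered test question.
    shots = [format_question(q, include_answer=True) for q in dev_examples]
    return "\n\n".join(shots + [format_question(test_example, include_answer=False)])

demo = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}
test = {"question": "What is 3 + 3?", "choices": ["5", "6", "7", "8"], "answer": 1}
print(build_prompt([demo], test))
```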
This is the repository for the PLOD dataset subset used for coursework in the NLP module (2023-2024) at the University of Surrey.
The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2.
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
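A sketch of loading the corpus with the Hugging Face `datasets` library; the hub id "wikitext" and the "wikitext-103-raw-v1" config name are assumptions about the current packaging (smaller WikiText-2 configs also exist).

```python
from datasets import load_dataset

# Load the (much smaller) validation split of WikiText-103.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")

# Rough whitespace token count to get a feel for the corpus size.
n_tokens = sum(len(line.split()) for line in wikitext["text"])
print(f"{len(wikitext)} lines, ~{n_tokens} whitespace tokens")
```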
MathVista is a diverse benchmark for mathematical reasoning in visual contexts, comprising 6,141 examples drawn from 31 datasets.