DataComp

Welcome to DataComp, the machine learning benchmark where the models are fixed and the challenge is to find the best possible data! Select a setting to learn more about how to participate.

CLIP

Contrastive Language-Image Pre-training

Select the best subset of image/text pairs from a large pool to train a CLIP model. Evaluate your training set by testing the model on a set of downstream vision tasks.

LM

Language Modeling

Select the best subset of text data from a large pool to train a language model. Evaluate your training set by testing the model on a set of downstream language tasks.

Reasoning

Chain of Thought and Reasoning

Generate the best question-answer pairs to teach base models to reason. Evaluate these models by testing them across a set of downstream reasoning domains.

VLMs

Vision-Language Models

Filter and mix the best subset of multimodal data from a large pool to train a vision-language model. Evaluate your training set by testing the model on a set of downstream multimodal tasks.