Getting Started

Overview

DataComp is a competition about designing multimodal datasets. As a participant, your task is to create a pre-training dataset of image-text pairs that produces a CLIP model with high accuracy on downstream tasks. Unlike traditional benchmarks, in DataComp the model architecture and hyperparameters are fixed, and your task is to innovate on the dataset design. As part of the benchmark, we provide CommonPool, a large collection of 12.8B image-text pairs crawled from the public internet. Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (bring your own data, BYOD). DataComp is designed to accommodate various levels of computational resources: each track is broken down into four scales, spanning several orders of magnitude of compute.

Our codebase is available at github.com/mlfoundations/datacomp.

Install dependencies

To start, clone the repository and install the dependencies.
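
For example, assuming the repository URL listed above:

git clone https://github.com/mlfoundations/datacomp.git
cd datacomp

Then create the conda environment: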

conda env create -f environment.yml

To activate the environment:

conda activate datacomp

Downloading CommonPool

To download CommonPool, run the following command, replacing $scale with the competition scale (i.e. small, medium, large or xlarge) and $data_dir with the output directory where you want the data to be stored.

python download_upstream.py --scale $scale --data_dir $data_dir
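
For example, a hypothetical invocation that downloads the small pool to a local directory (the path is illustrative):

python download_upstream.py --scale small --data_dir ./commonpool_small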

There are four scales in our competition:

  • small: 12.8M pool size, 12.8M examples seen
  • medium: 128M pool size, 128M examples seen
  • large: 1.28B pool size, 1.28B examples seen
  • xlarge: 12.8B pool size, 12.8B examples seen

The data is stored in shards: tar files containing the images and captions, to be consumed by webdataset. Once the download finishes, the data will be available at $data_dir/shards. We offer options for selecting subsets of the downloaded pool here.
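
As a minimal sketch of how the shards can be read with the webdataset library (the shard pattern and the jpg/txt keys below are assumptions; check your downloaded shards for the exact layout):

import webdataset as wds

# Stream (image, caption) pairs from a few example shards.
# Adjust the path and brace pattern to match the files in $data_dir/shards.
dataset = (
    wds.WebDataset("/path/to/data_dir/shards/{00000000..00000009}.tar")
    .decode("pil")               # decode image bytes into PIL images
    .to_tuple("jpg;png", "txt")  # assumed keys: image under jpg/png, caption under txt
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break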


Training

To train, run the following command:

torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $output_dir --exp_name $exp_name
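
For example, a hypothetical run at the small scale on 8 GPUs (the experiment name is arbitrary):

torchrun --nproc_per_node 8 train.py --scale small --data_dir $data_dir --output_dir $output_dir --exp_name small_baseline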

Evaluating

To evaluate, run the following command:

python evaluate.py --train_output_dir $train_output_dir/$exp_name
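
For example, to evaluate the hypothetical experiment above (where $train_output_dir is presumably the --output_dir passed to train.py):

python evaluate.py --train_output_dir $train_output_dir/small_baseline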

Submitting

See our submission instructions. Good luck!