DataComp is a competition about designing multimodal datasets.
As a participant, your task is to create a pre-training dataset of image-text pairs that yields a CLIP model with high accuracy on downstream tasks.
Unlike traditional benchmarks, in DataComp the model architecture and hyperparameters are fixed, and your task is to innovate on the dataset design.
As part of the benchmark, we provide CommonPool, a large collection of 12.8B image-text pairs crawled from the public internet.
Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (Bring your own data, BYOD).
DataComp is designed to accommodate various levels of computational resources: each track is broken down into four scales, spanning several orders of magnitude of compute.
Our codebase is available at github.com/mlfoundations/datacomp.
To start, clone the repository and install the dependencies.
conda env create -f environment.yml
To activate the environment:
conda activate datacomp
To download CommonPool, run the following command, replacing $scale with the competition scale (i.e. small, medium, large, or xlarge) and $data_dir with the output directory where you want the data to be stored.
python download_upstream.py --scale $scale --data_dir $data_dir
There are four scales in our competition:
small: 12.8M pool size, 12.8M examples seen
medium: 128M pool size, 128M examples seen
large: 1.28B pool size, 1.28B examples seen
xlarge: 12.8B pool size, 12.8B examples seen
The data is stored in shards, which are tar files with the images and captions to be consumed by webdataset.
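To illustrate the shard layout, here is a minimal sketch using only the standard library: a shard is a tar file where each sample contributes files that share a basename (the sample "key"). The keys, captions, and byte payloads below are made up for illustration, and real CommonPool shards also carry per-sample metadata.

```python
import io
import tarfile

# Build a toy shard in memory: each sample is a pair of files sharing
# a key, e.g. 000000001.jpg + 000000001.txt (illustrative names only).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key, caption in [("000000001", "a photo of a dog"),
                         ("000000002", "a red bicycle")]:
        for ext, payload in [("jpg", b"<jpeg bytes>"),
                             ("txt", caption.encode("utf-8"))]:
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Read it back the way a webdataset-style loader would: group the
# files by key, collecting each extension into one sample dict.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        key, ext = member.name.rsplit(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(samples["000000001"]["txt"].decode("utf-8"))  # a photo of a dog
```

In practice you would not parse the tars by hand; the webdataset library streams shards like these directly into a training pipeline.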
Once the download finishes, the data will be available in $data_dir.
We offer options for selecting subsets of the downloaded pool; see the repository documentation for details.
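For the filtering track, a subset is typically expressed as the set of sample uids you keep. The sketch below shows score-based selection; the uids and scores are hypothetical, and the exact on-disk subset format expected by the repository's tooling is documented in the repo.

```python
import numpy as np

# Hypothetical per-sample quality scores keyed by uid
# (e.g. from a CLIP-similarity pass over the pool).
uids = np.array(["ab12", "cd34", "ef56", "0789"])  # illustrative uids
scores = np.array([0.31, 0.04, 0.27, 0.15])

# Keep the top 50% of samples by score.
keep_frac = 0.5
k = int(len(uids) * keep_frac)
order = np.argsort(scores)[::-1]    # highest score first
subset = np.sort(uids[order[:k]])   # uids we retain, sorted

np.save("subset.npy", subset)       # hand this to the subsetting tools
print(subset)                       # ['ab12' 'ef56']
```

Ranking by a precomputed score and keeping a fixed fraction is only one strategy; the competition's premise is that better selection rules are exactly what participants should invent.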
To train, run the following command:
torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $output_dir --exp_name $exp_name
To evaluate, run the following command:
python evaluate.py --train_output_dir $train_output_dir/$exp_name
See our submission instructions. Good luck!