DataComp


Welcome to DataComp, the machine learning benchmark where the models are fixed and the challenge is to find the best possible data!

Prior competitions in machine learning have focused on finding the best model, with a fixed set of training and test data. However, many recent advances (e.g., CLIP, DALL-E, Stable Diffusion, and Flamingo) are due in part to large multimodal datasets. DataComp centers the role that data plays by fixing the training code and encouraging researchers to innovate by proposing new training sets.

We provide an experimental testbed centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources, and then evaluate them by running our standardized CLIP training code followed by an evaluation on 38 downstream datasets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8 million to 12.8 billion image-text pairs. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. More information can be found in the accompanying blog post.

Get Started   Paper   Code

How to Participate

A. Choose which scale is best for you based on your resources: small, medium, large, or xlarge. Each scale corresponds to a differently sized data pool and model. You may submit to multiple scales. As a reference, the training cost at each scale is roughly comparable to: fine-tuning on ImageNet-1k (small), training on ImageNet-1k from scratch (medium), training on ImageNet-21k from scratch (large), and training the OpenAI CLIP model (xlarge).

B. Select data to create a candidate dataset. To do this, choose one of two tracks: Filtering, where only image-text pairs from the CommonPool we provide are allowed; or BYOD (Bring Your Own Data), where any data source (including our pool) is permitted. You may submit to multiple tracks. A minimal filtering sketch is shown after these steps.

C. Train a CLIP model on your candidate dataset. CLIP size, architecture, and hyperparameters are fixed for each scale.

D. Evaluate the trained model on a suite of 38 diverse downstream tasks to measure the effectiveness of your candidate training dataset.

E. Submit to our leaderboard to compare to baseline methods and other teams! Please ensure that the dataset, trained model, and evaluation results are public.
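
To make step B concrete, below is a minimal sketch of a score-based filter over the CommonPool metadata. It is not the official baseline tooling: the metadata location and the column names ("uid", "clip_b32_similarity_score") are assumptions about the released metadata schema, and the threshold is purely illustrative; consult the repository documentation for the exact fields.

```python
# A minimal sketch of a filtering baseline (not the official tooling): keep pairs
# whose precomputed CLIP similarity exceeds a threshold. The metadata location and
# the column names ("uid", "clip_b32_similarity_score") are assumptions about the
# released metadata schema; check the DataComp repository for the exact fields.
import glob

import pandas as pd

METADATA_GLOB = "commonpool_metadata/*.parquet"  # hypothetical local path
SCORE_COLUMN = "clip_b32_similarity_score"       # assumed metadata column
THRESHOLD = 0.3                                  # illustrative cutoff

kept_uids = []
for path in sorted(glob.glob(METADATA_GLOB)):
    df = pd.read_parquet(path, columns=["uid", SCORE_COLUMN])
    kept_uids.extend(df.loc[df[SCORE_COLUMN] > THRESHOLD, "uid"].tolist())

print(f"Selected {len(kept_uids)} image-text pairs from the candidate pool.")
```

The resulting uid list is what gets handed to the standardized training code in step C; see the filtering track rules below for the submission format.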

Tracks

Rules for the filtering track

1) Participants can enter submissions at one or more scales: small, medium, large, or xlarge. Each scale specifies the raw number of image-text pairs in the corresponding CommonPool that should be filtered.

2) After choosing a scale, participants generate a list of uids, where each uid refers to a CommonPool sample. The list of uids is used to recover image-text pairs from the pool, which are then used for downstream CLIP training. An example uid file is available for download; a sketch of producing such a file is shown after these rules.

3) Duplicate uids are allowed.

4) Participants are not allowed to modify the training procedure. This includes hyperparameters, model architecture, optimizer, compute budget, number of training steps, and any other training details.

5) Participants are strongly encouraged to submit and open-source both the list of uids and the code used to generate this list; however, this is not required.

6) To avoid overfitting, we do not permit running code on, or otherwise creating algorithmic dependence on, the test images of the evaluation tasks. However, use of other images associated with these tasks (e.g., supervised training sets) is permitted.

7) Participants can use templates or class labels from the downstream tasks to bootstrap their filtering algorithms.
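
Rule 2 above asks for a file of uids. The sketch below shows one plausible way to assemble such a file, assuming each uid is a 128-bit hex string and that the expected on-disk format is a NumPy array of uint64 pairs; treat the dtype and filename as assumptions and consult the example uid file and repository documentation for the authoritative format.

```python
# Sketch of writing a uid file for the filtering track. Assumes each uid is a
# 128-bit hex string and that the on-disk format is a NumPy array of
# (uint64, uint64) pairs; the example uid file in the repository is authoritative.
import numpy as np

uids = [
    "0123456789abcdef0123456789abcdef",  # placeholder uids, not real samples
    "fedcba9876543210fedcba9876543210",
    "0123456789abcdef0123456789abcdef",  # duplicates are allowed (rule 3)
]

packed = np.array(
    [(int(u[:16], 16), int(u[16:], 16)) for u in uids],
    dtype=np.dtype("u8,u8"),
)
np.save("my_subset.npy", packed)
```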

Amendments for the BYOD track

1) Participants are allowed to augment CommonPool data with existing datasets, so long as these data sources do not contain test images from the evaluation tasks. Participants can use data from any CommonPool; however, they are not required to do so.

2) Assembling one's own dataset is allowed; however, test images from the evaluation tasks must not be included in, or otherwise used to construct, said dataset. We encourage releasing the image urls or the images themselves in addition to the text for each image. We also encourage rigorous documentation of face-blurring and other data safety checks. We reserve the right to run our own safety code on participant-provided data and disqualify entries that do not meet adequate safety standards. A sketch of one simple leakage check is shown below.
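
As one example of the kind of check implied above, the sketch below flags candidate images whose bytes exactly match an evaluation test image. The directory layout is hypothetical, and exact byte matching misses near-duplicates, so this is a starting point rather than a complete decontamination or safety procedure.

```python
# Minimal leakage check for a BYOD dataset: flag candidate images whose bytes
# exactly match an evaluation test image. Directory paths are hypothetical, and
# exact hashing misses near-duplicates (resized or re-encoded copies).
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

test_hashes = {sha256_of(p) for p in Path("eval_test_images").rglob("*.jpg")}

leaked = [
    p for p in Path("byod_candidate_images").rglob("*.jpg")
    if sha256_of(p) in test_hashes
]
print(f"Found {len(leaked)} exact duplicates of evaluation test images to remove.")
```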

FAQ

Can I include a piece of data more than once in training?
Yes! For the filtering track, you can do this by simply including a uid multiple times in your uid file.

Can we use the same filtering algorithm to enter multiple tracks/scales?
Yes! We encourage participation in both tracks and multiple scales.

Is any data forbidden from use in the Bring Your Own Data track?
The only data explicitly forbidden are the test images from the evaluation tasks and data that cannot be released publicly. However, we additionally require that external data meet our own safety standards, and entries may be excluded if they do not.