Data

Preparation, Curation, Deduplication, and Optimization of Natural and Synthetic Scientific Text and Data.

Group leaders Ian T. Foster and Robert Underwood

Ian Foster and Robert Underwood lead the data team, which focuses on data preparation, data preparation ablation studies, generation of synthetic training datasets, and training validation experiments. Outcomes will include publicly available data preparation workflows and trillions of tokens of training data.