
Group leaders: Ian T. Foster and Robert Underwood
Preparation, Curation, Deduplication, and Optimization of Natural and Synthetic Scientific Text and Data.
Ian Foster and Robert Underwood lead the data team, which focuses on data preparation, ablation studies of data preparation, generation of synthetic training datasets, and training validation experiments. Planned outcomes include publicly available data preparation workflows and trillions of tokens of training data.