
Group leaders: Franck Cappello, Bo Li, and Sandeep Madireddy
Multifaceted evaluation and benchmarking of scientific skill and safety.
The team is composed of Franck Cappello, Argonne (lead); Sandeep Madireddy, Argonne (co-lead); Bo Li, U. Chicago (co-lead); Robert Underwood, Argonne; Neil Getty, Argonne; Angel Yanguas-Gil, Argonne; Nesar Ramachandra, Argonne; Murat Keceli, Argonne; Marieme Ngom, Argonne; Chenhui Zhang, MIT; Josh Nguyen, U. Penn; Tanwi Mallick, Argonne; Zilinghan Li, Argonne; Minyang Tian, Argonne; Yufeng Du, Argonne; Eliu Huerta, Argonne; and Nick Chia, Argonne.
The capabilities of large language models such as ChatGPT, Claude, Gemini, and Llama have progressed dramatically in the past 2-3 years, raising the question of whether they can serve as research assistants in a scientific context. The AuroraGPT Evaluation team aims to assess the performance and safety of scientific foundation models such as AuroraGPT. While many benchmarks exist to assess the general language skills of these models, there is no established method for evaluating their scientific skills. The AuroraGPT Evaluation team is developing a multifaceted evaluation methodology specifically designed to assess AI models' scientific knowledge, task proficiency, and safety. This effort includes establishing multiple rigorous, complementary manual and automatic evaluation methods; developing fair and reproducible scoring methods that account for uncertainty quantification; and building reliable infrastructure for large-scale automated experiments.
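To make the scoring-with-uncertainty idea concrete, the following is a minimal sketch, not the team's actual method: it reports benchmark accuracy with a bootstrap confidence interval so that run-to-run variation can be quantified and results reproduced from a fixed seed. The function name `bootstrap_accuracy` and the example data are illustrative assumptions.

```python
# Hypothetical sketch: reproducible accuracy scoring with a bootstrap
# confidence interval as a simple form of uncertainty quantification.
import random


def bootstrap_accuracy(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point accuracy, lower CI bound, upper CI bound)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(correct)
    point = sum(correct) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


if __name__ == "__main__":
    # 1 = model answered the benchmark item correctly, 0 = incorrect
    per_item = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
    acc, lo, hi = bootstrap_accuracy(per_item)
    print(f"accuracy = {acc:.2f}  (95% CI: {lo:.2f}-{hi:.2f})")
```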
The AuroraGPT Evaluation team's work spans six key areas:
1) Scientific benchmarks: the AI4S science MCQ benchmark, Astronomy Benchmarking, and the SciCode Benchmark. These benchmarks are designed to test the scientific skills, reasoning, and problem-solving capabilities of LLMs across various domains (a minimal grading sketch follows this list).
2) Infrastructure: the STaR framework for scalable benchmark evaluations on HPC systems.
3) Lab-style experiments: evaluating AI models as research assistants on open and recently solved research problems, for tasks including literature search, experiment planning, hypothesis generation, hypothesis-testing design through derivation of closed forms, simulations, experiments and observations, and analysis of results for hypothesis validation.
4) "In the wild" studies: gaining deeper insight into the real-world use of AI models in scientific research and identifying and analyzing AI models' strengths and weaknesses at large scale on real research problems.
5) Comprehensive AI safety frameworks: creating holistic approaches to AI safety through collaboration across national labs, academia, and industry, integrating red-teaming, blue-teaming, safety alignment, input-output content moderation, and probabilistic guarantees.
6) Advancing scientific AI safety evaluation: improving AI safety evaluations in specialized domains such as CBRNE and cybersecurity, as well as on general issues such as bias and toxicity, by developing advanced tools and methodologies, including multimodal and continually learning evaluation frameworks and statistical approaches to uncertainty quantification for hallucination detection.
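The grading sketch referenced in item 1 above illustrates one common pattern for automated multiple-choice evaluation; the item schema, the `query_model()` stub, and the letter-extraction rule are illustrative assumptions, not the actual format of the AI4S MCQ benchmark.

```python
# Hypothetical sketch of an automated multiple-choice grading harness.
import re

ITEMS = [
    {
        "question": "Which quantity is conserved in an elastic collision?",
        "options": {"A": "Kinetic energy", "B": "Entropy",
                    "C": "Temperature", "D": "Electric charge only"},
        "answer": "A",
    },
]


def query_model(prompt: str) -> str:
    """Stand-in for an LLM inference call (local model or API)."""
    return "A"  # placeholder response


def extract_choice(response: str):
    """Pull the first standalone option letter out of a free-form reply."""
    match = re.search(r"\b([A-D])\b", response.upper())
    return match.group(1) if match else None


def grade(items) -> float:
    """Query the model on each item and return the fraction answered correctly."""
    correct = 0
    for item in items:
        options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        prompt = (f"{item['question']}\n{options}\n"
                  "Answer with a single letter.")
        if extract_choice(query_model(prompt)) == item["answer"]:
            correct += 1
    return correct / len(items)


print(f"MCQ accuracy: {grade(ITEMS):.2f}")
```

In a full harness, per-item correctness produced by a loop like this would feed the uncertainty-quantified scoring step sketched earlier, and the whole pipeline would be driven at scale by the evaluation infrastructure.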
Future plans include expanding evaluation capabilities to support retrieval-augmented generation and multimodal assessments, as well as developing novel uncertainty quantification methods tailored for evolving inference landscapes.