OpenAI just dropped what amounts to a final exam for AI models trying to do real science. LifeSciBench, published on June 17, is a benchmarking tool built to measure how well AI systems handle actual life sciences research, not the sanitized textbook version, but the messy, multi-step, figure-laden work that PhD scientists do every day.
The benchmark includes 750 tasks spanning seven distinct research workflows, from evidence handling and analysis to experimental design, scientific reasoning, and communication.
What makes LifeSciBench different
The 750 tasks were authored and reviewed by 173 PhD-level scientists with backgrounds in biotechnology and pharmaceuticals. An additional 453 expert reviewers helped validate them. Each task averaged six automated review cycles, and expert consensus required at least 90% agreement before a task made it into the final set.
The tasks come loaded with 1,062 attached artifacts, including figures, PDFs, and datasets. That matters because real research doesn’t happen in clean text boxes. It happens in spreadsheets with missing columns, in blurry gel images, in 40-page supplementary files that nobody wants to read. LifeSciBench forces AI models to deal with all of it.








