Sourced from the ml-jku leaderboard

The Tox21 Leaderboard

Ten years after deep learning's “ImageNet moment” in drug discovery, methods are still ranked on the original challenge: from a molecule's chemical structure alone, predict whether it trips twelve biological alarms linked to toxicity — the sort of screening that otherwise demands slow, expensive lab and animal testing. Models are scored by mean AUC across those twelve endpoints. This is an independent, plain-language view of the reproducible leaderboard maintained by ml-jku on Hugging Face (see the 2025 paper). The striking part: the methods on top today are largely the same ones from 2015–2017.

Endpoints scored: 12 assays

Metric: mean AUC (higher is better)

Standings

#	Method	Type	Year	Mean AUC

Tox21, in plain terms

What is Tox21?

A US federal research program — a collaboration of the EPA, FDA, and NIH (through NCATS and the NIEHS National Toxicology Program), running since 2008. Using robotic, cell-based lab tests, it screens roughly 10,000 chemicals for early signs of harm far faster and cheaper than traditional animal testing, to flag which ones deserve a closer look.

What are the models predicting?

Given nothing but a molecule's chemical structure, each model guesses whether the compound will set off specific biological responses tied to toxicity — no lab, no animals, just the structure. In 2014 a slice of the Tox21 data was released as a public competition to test how well machines could do exactly this.

The twelve assays

Each scored column is one lab test. Seven probe nuclear receptors (hormone-signaling switches, such as the estrogen and androgen receptors) and five probe stress-response pathways (cellular alarms for DNA damage, oxidative stress, and mitochondrial harm). A “hit” points to a route by which a chemical might cause harm.
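
As a rough illustration of how the twelve columns roll up into the single leaderboard number, the sketch below groups the assays and averages their per-assay AUCs. The endpoint names follow the labels commonly used for the public challenge data and may be spelled differently in the leaderboard's own files. `mean_auc` is a hypothetical helper, the unweighted average is an assumption, and the AUC values are invented for illustration, not real results.

```python
# Assay names as commonly used for the public Tox21 challenge data
# (assumption: the leaderboard's own files may spell them differently).
NUCLEAR_RECEPTOR = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase",
    "NR-ER", "NR-ER-LBD", "NR-PPAR-gamma",
]
STRESS_RESPONSE = ["SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"]
ENDPOINTS = NUCLEAR_RECEPTOR + STRESS_RESPONSE  # 7 + 5 = 12 assays


def mean_auc(per_assay_auc):
    """Unweighted mean of per-assay AUCs over all twelve endpoints.

    Hypothetical helper: refuses to average if any endpoint is missing,
    so a partial result is never mistaken for a leaderboard score.
    """
    missing = [name for name in ENDPOINTS if name not in per_assay_auc]
    if missing:
        raise ValueError(f"missing AUC for: {', '.join(missing)}")
    return sum(per_assay_auc[name] for name in ENDPOINTS) / len(ENDPOINTS)


# Made-up per-assay scores, for illustration only.
example = {name: 0.80 for name in NUCLEAR_RECEPTOR}
example.update({name: 0.85 for name in STRESS_RESPONSE})
print(round(mean_auc(example), 4))
```

Because every assay counts equally in this average, a model that does well on only a few endpoints cannot reach a high mean on those alone.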

How to read the score

Performance is AUC (area under the ROC curve), which runs from 0 to 1. Picture it this way: shown one toxic and one safe molecule, an AUC of 0.85 means the model ranks the toxic one higher about 85% of the time. 0.5 is a coin flip and 1.0 is perfect; a score below 0.5 means the ranking is worse than chance. The Type tag notes how a model was built: trained from scratch, pre-trained, or run zero-shot.
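
The "one toxic, one safe molecule" picture can be checked directly by counting pairs. This is a minimal sketch of that pairwise reading of AUC, not the leaderboard's evaluation code. `pairwise_auc` is a hypothetical name, counting ties as half a win is the usual convention, and the labels and scores are toy values.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the fraction of (toxic, safe) pairs ranked correctly.

    labels: 1 for toxic, 0 for safe. scores: higher means "more likely toxic".
    A tie between a toxic and a safe score counts as half a correct pair.
    """
    toxic = [s for y, s in zip(labels, scores) if y == 1]
    safe = [s for y, s in zip(labels, scores) if y == 0]
    if not toxic or not safe:
        raise ValueError("AUC needs at least one toxic and one safe example")
    wins = sum(
        1.0 if t > s else 0.5 if t == s else 0.0
        for t, s in product(toxic, safe)
    )
    return wins / (len(toxic) * len(safe))


# Toy data: 2 toxic and 3 safe molecules give 6 (toxic, safe) pairs.
# The toxic molecule scored 0.4 is outranked by the safe one scored 0.5,
# so 5 of the 6 pairs are ordered correctly.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(pairwise_auc(labels, scores))
```

Counting every pair is quadratic in the number of molecules, so real evaluations typically use a rank-based computation that gives the same number. The pairwise form is shown only because it matches the explanation above.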

Why it matters

Conventional toxicity testing leans heavily on animal studies — slow, costly, and ethically fraught. Cutting that reliance is an explicit Tox21 goal: move first-pass screening to cell-based tests, and further still to computational models like these, which judge a molecule from its structure alone — no animals, and no new wet-lab assay at all. Used to decide which handful of chemicals genuinely warrant deeper study, sharper predictions mean fewer animal tests overall — the core promise behind toxicology's shift toward New Approach Methodologies and the long-standing goal of replacing, reducing, and refining animal use.

Learn more

Tox21 program (official): tox21.gov
Tox21 data browser & challenge: tripod.nih.gov
The challenge dataset: huggingface.co
The ml-jku leaderboard: huggingface.co
The 2025 reproducibility paper: arXiv
DeepTox & the original data: bioinf.jku.at