
All primary-source URLs

Surveyed: 2026-04-20; method: WebSearch plus per-benchmark keyword queries.


Biology / Wet Lab

LAB-Bench

  • paper: https://arxiv.org/html/2407.10362v1
  • arXiv abs: https://arxiv.org/abs/2407.10362
  • repo: https://github.com/Future-House/LAB-Bench
  • HF dataset: https://huggingface.co/datasets/futurehouse/lab-bench
  • FutureHouse blog: https://www.futurehouse.org/research-announcements/lab-bench-measuring-capabilities-of-language-models-for-biology-research

BixBench

  • paper: https://arxiv.org/abs/2503.00096
  • HTML: https://arxiv.org/html/2503.00096v1
  • repo: https://github.com/Future-House/BixBench
  • blog: https://www.futurehouse.org/research-announcements/bixbench

CompBioBench

  • bioRxiv: https://www.biorxiv.org/content/10.64898/2026.04.06.716850v1

BioML-bench

  • bioRxiv (v2): https://www.biorxiv.org/content/10.1101/2025.09.01.673319v2
  • repo: https://github.com/science-machine/biomlbench

Biomni

  • bioRxiv: https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1
  • website: https://biomni.stanford.edu/
  • PDF: https://biomni.stanford.edu/paper.pdf
  • repo: https://github.com/snap-stanford/Biomni
  • MarkTechPost coverage: https://www.marktechpost.com/2025/05/30/stanford-researchers-introduced-biomni-a-biomedical-ai-agent-for-automation-across-diverse-tasks-and-data-types/

BioProBench

  • paper: https://arxiv.org/abs/2505.07889
  • HTML: https://arxiv.org/html/2505.07889v2
  • repo: https://github.com/YuyangSunshine/bioprotocolbench
  • HF: https://huggingface.co/datasets/GreatCaptainNemo/BioProBench

ExpVid

  • ICLR 2026 paper: https://openreview.net/pdf/050bd6a4d5906f9bf809dfbc3677f111268bd7d5.pdf

PaperQA2

  • arXiv: https://arxiv.org/pdf/2409.13740
  • FutureHouse blog (RAG-QA Arena): https://www.futurehouse.org/research-announcements/paperqa2-achieves-sota-performance-on-rag-qa-arena-science-benchmark
  • FutureHouse blog (WikiCrow): https://www.futurehouse.org/research-announcements/wikicrow
  • repo: https://github.com/Future-House/paper-qa
  • LinkedIn announcement: https://www.linkedin.com/posts/andrewdwhite_today-futurehouse-has-released-paperqa2-activity-7239691606931488774-N9Lj

Aviary

  • arXiv HTML: https://arxiv.org/html/2412.21154v1

ProteinGym

  • website: https://proteingym.org/
  • repo: https://github.com/OATML-Markslab/ProteinGym
  • PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC10723403/
  • NeurIPS paper: https://papers.nips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf

Science Q&A / Reasoning

GPQA

  • paper: https://openreview.net/pdf?id=Ti67584b98
  • arXiv: https://arxiv.org/pdf/2311.12022
  • abs: https://arxiv.org/abs/2311.12022
  • Epoch leaderboard: https://epoch.ai/benchmarks/gpqa-diamond
  • AA leaderboard: https://artificialanalysis.ai/evaluations/gpqa-diamond
  • vals.ai: https://www.vals.ai/benchmarks/gpqa
  • IntuitionLabs: https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark

SuperGPQA

  • paper: https://arxiv.org/abs/2502.14739
  • HF papers: https://huggingface.co/papers/2502.14739
  • ByteDance Seed blog: https://seed.bytedance.com/en/blog/doubao-seed-team-launched-supergpqa-an-open-source-benchmark-test-set-covering-285-disciplines
  • repo: https://github.com/SuperGPQA/SuperGPQA

HLE

  • paper: https://arxiv.org/abs/2501.14249
  • website: https://agi.safe.ai/
  • Scale leaderboard: https://labs.scale.com/leaderboard/humanitys_last_exam
  • repo: https://github.com/centerforaisafety/hle
  • Epoch: https://epoch.ai/benchmarks/hle
  • Nature: https://www.nature.com/articles/s41586-025-09962-4
  • Wikipedia: https://en.wikipedia.org/wiki/Humanity's_Last_Exam
  • AA: https://artificialanalysis.ai/evaluations/humanitys-last-exam

CURIE

  • paper: https://arxiv.org/abs/2503.13517
  • HTML: https://arxiv.org/html/2503.13517v1
  • OpenReview: https://openreview.net/forum?id=jw2fC6REUB
  • repo: https://github.com/google/curie
  • Google Research blog: https://research.google/blog/evaluating-progress-of-llms-on-scientific-problem-solving/
  • AI track article: https://theaitrack.com/llms-scientific-benchmarks-curie-spiqa-feabench/

FrontierMath

  • website: https://epoch.ai/frontiermath
  • paper: https://arxiv.org/abs/2411.04872
  • HTML: https://arxiv.org/html/2411.04872v1
  • Tier 4: https://epoch.ai/benchmarks/frontiermath-tier-4
  • Open Problems: https://epoch.ai/frontiermath/open-problems
  • Tiers 1-4: https://epoch.ai/frontiermath/tiers-1-4/the-benchmark/
  • VB: https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go
  • OurWorldInData: https://ourworldindata.org/grapher/ai-frontiermath-over-time

Agent / Coding

SciCode

  • NeurIPS PDF: https://proceedings.neurips.cc/paper_files/paper/2024/file/36850592258c8c41cecdaa3dea5ff7de-Paper-Datasets_and_Benchmarks_Track.pdf
  • website: https://scicode-bench.github.io/
  • repo: https://github.com/scicode-bench/SciCode
  • AA leaderboard: https://artificialanalysis.ai/evaluations/scicode
  • HF: https://huggingface.co/papers/2407.13168
  • llm-stats: https://llm-stats.com/benchmarks/scicode
  • Steel: https://leaderboard.steel.dev/registry/benchmarks/scicode

ScienceAgentBench

  • paper: https://arxiv.org/abs/2410.05080
  • HTML: https://arxiv.org/html/2410.05080v1
  • website: https://osu-nlp-group.github.io/ScienceAgentBench/
  • repo: https://github.com/OSU-NLP-Group/ScienceAgentBench
  • OpenReview: https://openreview.net/forum?id=6z4YKr0GK6

DataSciBench

  • website: https://datascibench.github.io/
  • paper: https://arxiv.org/abs/2502.13897
  • HTML: https://arxiv.org/html/2502.13897v1
  • repo: https://github.com/THUDM/DataSciBench
  • OpenReview: https://openreview.net/forum?id=BltaWJZMeR

MLE-Bench

  • OpenAI: https://openai.com/index/mle-bench/
  • paper: https://arxiv.org/abs/2410.07095
  • PDF: https://arxiv.org/pdf/2410.07095
  • repo: https://github.com/openai/mle-bench
  • OpenReview: https://openreview.net/forum?id=6s5uXNWGIh
  • OECD: https://oecd.ai/en/catalogue/tools/mle-bench
  • Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mle_bench/

PaperBench

  • OpenAI: https://openai.com/index/paperbench/
  • paper: https://arxiv.org/abs/2504.01848
  • PDF: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf
  • OpenReview: https://openreview.net/forum?id=xF5PuTLPbn&noteId=1VgzDxh2V3
  • ICML 2025 poster: https://icml.cc/virtual/2025/poster/43586
  • Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/paperbench/
  • WandB report: https://wandb.ai/byyoung3/ml-news/reports/OpenAI-s-new-research-agent-benchmark-Paperbench---VmlldzoxMjEyODQ5NA

RE-Bench

  • paper: https://arxiv.org/abs/2411.15114
  • HTML: https://arxiv.org/html/2411.15114v1
  • METR blog: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
  • METR report PDF: https://metr.org/AI_R_D_Evaluation_Report.pdf
  • OpenReview: https://openreview.net/forum?id=3rB0bVU6z6&noteId=lOCHc0u2a6
  • ICML poster: https://icml.cc/virtual/2025/poster/46519
  • Proceedings: https://proceedings.mlr.press/v267/wijk25a.html

LMR-Bench

  • paper: https://aclanthology.org/2025.emnlp-main.314.pdf

Medicine

MedQA

  • vals.ai: https://www.vals.ai/benchmarks/medqa
  • LLM-MedQA paper: https://arxiv.org/abs/2501.05464
  • life-sciences survey: https://intuitionlabs.ai/articles/large-language-model-benchmarks-life-sciences-overview
  • pricepertoken leaderboard: https://pricepertoken.com/leaderboards/benchmark/medqa
  • systematic review: https://www.jmir.org/2025/1/e84120

HealthBench

  • OpenAI: https://openai.com/index/healthbench/
  • paper: https://arxiv.org/abs/2505.08775
  • PDF: https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
  • HF dataset: https://huggingface.co/datasets/openai/healthbench
  • PMC analysis: https://pmc.ncbi.nlm.nih.gov/articles/PMC12547120/
  • MobiHealth: https://www.mobihealthnews.com/news/openai-unveils-healthbench-evaluate-llms-safety-healthcare
  • HLTH: https://hlth.com/insights/news/openai-launches-healthbench-to-evaluate-healthcare-ai-safety-2025-05-16

MedAgentBench

  • paper: https://arxiv.org/abs/2501.14654
  • HTML v2: https://arxiv.org/html/2501.14654v2
  • website: https://stanfordmlgroup.github.io/projects/medagentbench/
  • repo: https://github.com/stanfordmlgroup/MedAgentBench
  • NEJM AI: https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

AgentClinic

  • paper: https://arxiv.org/abs/2405.07960
  • website: https://agentclinic.github.io/
  • OpenReview: https://openreview.net/forum?id=ak7r4He1qH

Chemistry / Materials

ChemBench

  • website: https://lamalab-org.github.io/chembench/
  • repo: https://github.com/lamalab-org/chembench
  • How-To: https://lamalab-org.github.io/chembench/getting_started/

MaCBench

  • notes: https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/macbench-multimodal-chemistry-benchmark/

MatBench Discovery

  • website: https://matbench-discovery.materialsproject.org/
  • Nature MI: https://www.nature.com/articles/s42256-025-01055-1
  • arXiv: https://arxiv.org/html/2308.14920v3
  • ICLR 2023 page: https://iclr.cc/virtual/2023/14184

LLM4Mat-Bench

  • HTML: https://arxiv.org/html/2411.00177v3
  • OpenReview: https://openreview.net/pdf?id=TSAeQSv9RI

MatBench (classical ML)

  • website: https://matbench.materialsproject.org/

Discovery / Simulation

ScienceWorld

  • paper: https://arxiv.org/abs/2203.07540
  • website: https://sciworld.apps.allenai.org/
  • repo: https://github.com/allenai/ScienceWorld
  • AI2 blog: https://allenai.org/blog/evaluating-scientific-discovery-agents

DiscoveryWorld

  • paper: https://arxiv.org/abs/2406.06769
  • website: https://allenai.github.io/discoveryworld/

LLM-SRBench

  • OpenReview: https://openreview.net/forum?id=SyQPiZJVWY

SciGym

  • paper HTML: https://arxiv.org/html/2507.02083

General leaderboards / ranking sites

  • Vellum AI: https://www.vellum.ai/llm-leaderboard
  • llm-stats: https://llm-stats.com/benchmarks
  • pricepertoken: https://pricepertoken.com/leaderboards/benchmark
  • LM Council: https://lmcouncil.ai/benchmarks
  • Scale Labs: https://labs.scale.com/leaderboard
  • Klu: https://klu.ai/llm-leaderboard
  • Artificial Analysis: https://artificialanalysis.ai/leaderboards/models
  • BenchLM: https://benchlm.ai/knowledge
  • Onyx: https://onyx.app/llm-leaderboard

Anthropic / OpenAI official releases

  • Anthropic healthcare & life sciences: https://www.anthropic.com/news/healthcare-life-sciences
  • Anthropic "accelerating scientific research" announcement: https://www.anthropic.com/news/accelerating-scientific-research
  • Claude Sonnet 4.6: https://www.anthropic.com/news/claude-sonnet-4-6
  • Anthropic research 总页: https://www.anthropic.com/research
  • FutureHouse @ Anthropic customer story: https://claude.com/customers/futurehouse