All primary-source URLs
Survey date: 2026-04-20; method: WebSearch, one keyword query per benchmark.
Biology / wet lab
LAB-Bench
- paper: https://arxiv.org/html/2407.10362v1
- arXiv abs: https://arxiv.org/abs/2407.10362
- repo: https://github.com/Future-House/LAB-Bench
- HF dataset: https://huggingface.co/datasets/futurehouse/lab-bench
- FutureHouse blog: https://www.futurehouse.org/research-announcements/lab-bench-measuring-capabilities-of-language-models-for-biology-research
BixBench
- paper: https://arxiv.org/abs/2503.00096
- HTML: https://arxiv.org/html/2503.00096v1
- repo: https://github.com/Future-House/BixBench
- blog: https://www.futurehouse.org/research-announcements/bixbench
CompBioBench
- bioRxiv: https://www.biorxiv.org/content/10.64898/2026.04.06.716850v1
BioML-bench
- bioRxiv (v2): https://www.biorxiv.org/content/10.1101/2025.09.01.673319v2
- repo: https://github.com/science-machine/biomlbench
Biomni
- bioRxiv: https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1
- website: https://biomni.stanford.edu/
- PDF: https://biomni.stanford.edu/paper.pdf
- repo: https://github.com/snap-stanford/Biomni
- MarkTechPost explainer: https://www.marktechpost.com/2025/05/30/stanford-researchers-introduced-biomni-a-biomedical-ai-agent-for-automation-across-diverse-tasks-and-data-types/
BioProBench
- paper: https://arxiv.org/abs/2505.07889
- HTML: https://arxiv.org/html/2505.07889v2
- repo: https://github.com/YuyangSunshine/bioprotocolbench
- HF: https://huggingface.co/datasets/GreatCaptainNemo/BioProBench
ExpVid
- ICLR 2026 paper: https://openreview.net/pdf/050bd6a4d5906f9bf809dfbc3677f111268bd7d5.pdf
PaperQA2
- arXiv: https://arxiv.org/pdf/2409.13740
- FutureHouse blog (RAG-QA Arena): https://www.futurehouse.org/research-announcements/paperqa2-achieves-sota-performance-on-rag-qa-arena-science-benchmark
- FutureHouse blog (WikiCrow): https://www.futurehouse.org/research-announcements/wikicrow
- repo: https://github.com/Future-House/paper-qa
- LinkedIn announcement: https://www.linkedin.com/posts/andrewdwhite_today-futurehouse-has-released-paperqa2-activity-7239691606931488774-N9Lj
Aviary
- arXiv HTML: https://arxiv.org/html/2412.21154v1
ProteinGym
- website: https://proteingym.org/
- repo: https://github.com/OATML-Markslab/ProteinGym
- PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC10723403/
- NeurIPS paper: https://papers.nips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf
Scientific Q&A / reasoning
GPQA
- paper: https://openreview.net/pdf?id=Ti67584b98
- arXiv: https://arxiv.org/pdf/2311.12022
- abs: https://arxiv.org/abs/2311.12022
- Epoch leaderboard: https://epoch.ai/benchmarks/gpqa-diamond
- AA leaderboard: https://artificialanalysis.ai/evaluations/gpqa-diamond
- vals.ai: https://www.vals.ai/benchmarks/gpqa
- IntuitionLabs: https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark
SuperGPQA
- paper: https://arxiv.org/abs/2502.14739
- HF papers: https://huggingface.co/papers/2502.14739
- ByteDance Seed blog: https://seed.bytedance.com/en/blog/doubao-seed-team-launched-supergpqa-an-open-source-benchmark-test-set-covering-285-disciplines
- repo: https://github.com/SuperGPQA/SuperGPQA
HLE
- paper: https://arxiv.org/abs/2501.14249
- website: https://agi.safe.ai/
- Scale leaderboard: https://labs.scale.com/leaderboard/humanitys_last_exam
- repo: https://github.com/centerforaisafety/hle
- Epoch: https://epoch.ai/benchmarks/hle
- Nature: https://www.nature.com/articles/s41586-025-09962-4
- Wikipedia: https://en.wikipedia.org/wiki/Humanity's_Last_Exam
- AA: https://artificialanalysis.ai/evaluations/humanitys-last-exam
CURIE
- paper: https://arxiv.org/abs/2503.13517
- HTML: https://arxiv.org/html/2503.13517v1
- OpenReview: https://openreview.net/forum?id=jw2fC6REUB
- repo: https://github.com/google/curie
- Google Research blog: https://research.google/blog/evaluating-progress-of-llms-on-scientific-problem-solving/
- The AI Track article: https://theaitrack.com/llms-scientific-benchmarks-curie-spiqa-feabench/
FrontierMath
- website: https://epoch.ai/frontiermath
- paper: https://arxiv.org/abs/2411.04872
- HTML: https://arxiv.org/html/2411.04872v1
- Tier 4: https://epoch.ai/benchmarks/frontiermath-tier-4
- Open Problems: https://epoch.ai/frontiermath/open-problems
- Tiers 1-4: https://epoch.ai/frontiermath/tiers-1-4/the-benchmark/
- VentureBeat: https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go
- OurWorldInData: https://ourworldindata.org/grapher/ai-frontiermath-over-time
Agent / Coding
SciCode
- NeurIPS PDF: https://proceedings.neurips.cc/paper_files/paper/2024/file/36850592258c8c41cecdaa3dea5ff7de-Paper-Datasets_and_Benchmarks_Track.pdf
- website: https://scicode-bench.github.io/
- repo: https://github.com/scicode-bench/SciCode
- AA leaderboard: https://artificialanalysis.ai/evaluations/scicode
- HF: https://huggingface.co/papers/2407.13168
- llm-stats: https://llm-stats.com/benchmarks/scicode
- Steel: https://leaderboard.steel.dev/registry/benchmarks/scicode
ScienceAgentBench
- paper: https://arxiv.org/abs/2410.05080
- HTML: https://arxiv.org/html/2410.05080v1
- website: https://osu-nlp-group.github.io/ScienceAgentBench/
- repo: https://github.com/OSU-NLP-Group/ScienceAgentBench
- OpenReview: https://openreview.net/forum?id=6z4YKr0GK6
DataSciBench
- website: https://datascibench.github.io/
- paper: https://arxiv.org/abs/2502.13897
- HTML: https://arxiv.org/html/2502.13897v1
- repo: https://github.com/THUDM/DataSciBench
- OpenReview: https://openreview.net/forum?id=BltaWJZMeR
MLE-Bench
- OpenAI: https://openai.com/index/mle-bench/
- paper: https://arxiv.org/abs/2410.07095
- PDF: https://arxiv.org/pdf/2410.07095
- repo: https://github.com/openai/mle-bench
- OpenReview: https://openreview.net/forum?id=6s5uXNWGIh
- OECD: https://oecd.ai/en/catalogue/tools/mle-bench
- Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mle_bench/
PaperBench
- OpenAI: https://openai.com/index/paperbench/
- paper: https://arxiv.org/abs/2504.01848
- PDF: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf
- OpenReview: https://openreview.net/forum?id=xF5PuTLPbn&noteId=1VgzDxh2V3
- ICML 2025 poster: https://icml.cc/virtual/2025/poster/43586
- Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/paperbench/
- WandB report: https://wandb.ai/byyoung3/ml-news/reports/OpenAI-s-new-research-agent-benchmark-Paperbench---VmlldzoxMjEyODQ5NA
RE-Bench
- paper: https://arxiv.org/abs/2411.15114
- HTML: https://arxiv.org/html/2411.15114v1
- METR blog: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
- METR report PDF: https://metr.org/AI_R_D_Evaluation_Report.pdf
- OpenReview: https://openreview.net/forum?id=3rB0bVU6z6&noteId=lOCHc0u2a6
- ICML poster: https://icml.cc/virtual/2025/poster/46519
- Proceedings: https://proceedings.mlr.press/v267/wijk25a.html
LMR-Bench
- paper: https://aclanthology.org/2025.emnlp-main.314.pdf
Medicine
MedQA
- vals.ai: https://www.vals.ai/benchmarks/medqa
- LLM-MedQA paper: https://arxiv.org/abs/2501.05464
- life sciences overview: https://intuitionlabs.ai/articles/large-language-model-benchmarks-life-sciences-overview
- pricepertoken leaderboard: https://pricepertoken.com/leaderboards/benchmark/medqa
- systematic review: https://www.jmir.org/2025/1/e84120
HealthBench
- OpenAI: https://openai.com/index/healthbench/
- paper: https://arxiv.org/abs/2505.08775
- PDF: https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
- HF dataset: https://huggingface.co/datasets/openai/healthbench
- PMC analysis: https://pmc.ncbi.nlm.nih.gov/articles/PMC12547120/
- MobiHealth: https://www.mobihealthnews.com/news/openai-unveils-healthbench-evaluate-llms-safety-healthcare
- HLTH: https://hlth.com/insights/news/openai-launches-healthbench-to-evaluate-healthcare-ai-safety-2025-05-16
MedAgentBench
- paper: https://arxiv.org/abs/2501.14654
- HTML v2: https://arxiv.org/html/2501.14654v2
- website: https://stanfordmlgroup.github.io/projects/medagentbench/
- repo: https://github.com/stanfordmlgroup/MedAgentBench
- NEJM AI: https://ai.nejm.org/doi/full/10.1056/AIdbp2500144
AgentClinic
- paper: https://arxiv.org/abs/2405.07960
- website: https://agentclinic.github.io/
- OpenReview: https://openreview.net/forum?id=ak7r4He1qH
Chemistry / materials
ChemBench
- website: https://lamalab-org.github.io/chembench/
- repo: https://github.com/lamalab-org/chembench
- How-To: https://lamalab-org.github.io/chembench/getting_started/
MaCBench
- notes: https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/macbench-multimodal-chemistry-benchmark/
MatBench Discovery
- website: https://matbench-discovery.materialsproject.org/
- Nature MI: https://www.nature.com/articles/s42256-025-01055-1
- arXiv: https://arxiv.org/html/2308.14920v3
- ICLR 2023 page: https://iclr.cc/virtual/2023/14184
LLM4Mat-Bench
- HTML: https://arxiv.org/html/2411.00177v3
- OpenReview: https://openreview.net/pdf?id=TSAeQSv9RI
MatBench (traditional ML)
- website: https://matbench.materialsproject.org/
Discovery / simulation
ScienceWorld
- paper: https://arxiv.org/abs/2203.07540
- website: https://sciworld.apps.allenai.org/
- repo: https://github.com/allenai/ScienceWorld
- AI2 blog: https://allenai.org/blog/evaluating-scientific-discovery-agents
DiscoveryWorld
- paper: https://arxiv.org/abs/2406.06769
- website: https://allenai.github.io/discoveryworld/
LLM-SRBench
- OpenReview: https://openreview.net/forum?id=SyQPiZJVWY
SciGym
- paper HTML: https://arxiv.org/html/2507.02083
General leaderboards / ranking sites
- Vellum AI: https://www.vellum.ai/llm-leaderboard
- llm-stats: https://llm-stats.com/benchmarks
- pricepertoken: https://pricepertoken.com/leaderboards/benchmark
- LM Council: https://lmcouncil.ai/benchmarks
- Scale Labs: https://labs.scale.com/leaderboard
- Klu: https://klu.ai/llm-leaderboard
- Artificial Analysis: https://artificialanalysis.ai/leaderboards/models
- BenchLM: https://benchlm.ai/knowledge
- Onyx: https://onyx.app/llm-leaderboard
Anthropic / OpenAI official releases
- Anthropic healthcare & life sciences: https://www.anthropic.com/news/healthcare-life-sciences
- Anthropic accelerating scientific research announcement: https://www.anthropic.com/news/accelerating-scientific-research
- Claude Sonnet 4.6: https://www.anthropic.com/news/claude-sonnet-4-6
- Anthropic research index page: https://www.anthropic.com/research
- FutureHouse @ Anthropic customer story: https://claude.com/customers/futurehouse