
All primary-source URLs

Surveyed: 2026-04-20; method: WebSearch plus per-benchmark keyword queries.


Biology / Wet Lab

LAB-Bench

  • paper: https://arxiv.org/html/2407.10362v1
  • arXiv abs: https://arxiv.org/abs/2407.10362
  • repo: https://github.com/Future-House/LAB-Bench
  • HF dataset: https://huggingface.co/datasets/futurehouse/lab-bench
  • FutureHouse blog: https://www.futurehouse.org/research-announcements/lab-bench-measuring-capabilities-of-language-models-for-biology-research

BixBench

  • paper: https://arxiv.org/abs/2503.00096
  • HTML: https://arxiv.org/html/2503.00096v1
  • repo: https://github.com/Future-House/BixBench
  • blog: https://www.futurehouse.org/research-announcements/bixbench

CompBioBench

  • bioRxiv: https://www.biorxiv.org/content/10.64898/2026.04.06.716850v1

BioML-bench

  • bioRxiv (v2): https://www.biorxiv.org/content/10.1101/2025.09.01.673319v2
  • repo: https://github.com/science-machine/biomlbench

Biomni

  • bioRxiv: https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1
  • website: https://biomni.stanford.edu/
  • PDF: https://biomni.stanford.edu/paper.pdf
  • repo: https://github.com/snap-stanford/Biomni
  • MarkTechPost coverage: https://www.marktechpost.com/2025/05/30/stanford-researchers-introduced-biomni-a-biomedical-ai-agent-for-automation-across-diverse-tasks-and-data-types/

BioProBench

  • paper: https://arxiv.org/abs/2505.07889
  • HTML: https://arxiv.org/html/2505.07889v2
  • repo: https://github.com/YuyangSunshine/bioprotocolbench
  • HF: https://huggingface.co/datasets/GreatCaptainNemo/BioProBench

ExpVid

  • ICLR 2026 paper: https://openreview.net/pdf/050bd6a4d5906f9bf809dfbc3677f111268bd7d5.pdf

PaperQA2

  • arXiv: https://arxiv.org/pdf/2409.13740
  • FutureHouse blog (RAG-QA Arena): https://www.futurehouse.org/research-announcements/paperqa2-achieves-sota-performance-on-rag-qa-arena-science-benchmark
  • FutureHouse blog (WikiCrow): https://www.futurehouse.org/research-announcements/wikicrow
  • repo: https://github.com/Future-House/paper-qa
  • LinkedIn announcement: https://www.linkedin.com/posts/andrewdwhite_today-futurehouse-has-released-paperqa2-activity-7239691606931488774-N9Lj

Aviary

  • arXiv HTML: https://arxiv.org/html/2412.21154v1

ProteinGym

  • website: https://proteingym.org/
  • repo: https://github.com/OATML-Markslab/ProteinGym
  • PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC10723403/
  • NeurIPS paper: https://papers.nips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf

Science Q&A / Reasoning

GPQA

  • paper: https://openreview.net/pdf?id=Ti67584b98
  • arXiv: https://arxiv.org/pdf/2311.12022
  • abs: https://arxiv.org/abs/2311.12022
  • Epoch leaderboard: https://epoch.ai/benchmarks/gpqa-diamond
  • AA leaderboard: https://artificialanalysis.ai/evaluations/gpqa-diamond
  • vals.ai: https://www.vals.ai/benchmarks/gpqa
  • IntuitionLabs: https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark

SuperGPQA

  • paper: https://arxiv.org/abs/2502.14739
  • HF papers: https://huggingface.co/papers/2502.14739
  • ByteDance Seed blog: https://seed.bytedance.com/en/blog/doubao-seed-team-launched-supergpqa-an-open-source-benchmark-test-set-covering-285-disciplines
  • repo: https://github.com/SuperGPQA/SuperGPQA

HLE

  • paper: https://arxiv.org/abs/2501.14249
  • website: https://agi.safe.ai/
  • Scale leaderboard: https://labs.scale.com/leaderboard/humanitys_last_exam
  • repo: https://github.com/centerforaisafety/hle
  • Epoch: https://epoch.ai/benchmarks/hle
  • Nature: https://www.nature.com/articles/s41586-025-09962-4
  • Wikipedia: https://en.wikipedia.org/wiki/Humanity's_Last_Exam
  • AA: https://artificialanalysis.ai/evaluations/humanitys-last-exam

CURIE

  • paper: https://arxiv.org/abs/2503.13517
  • HTML: https://arxiv.org/html/2503.13517v1
  • OpenReview: https://openreview.net/forum?id=jw2fC6REUB
  • repo: https://github.com/google/curie
  • Google Research blog: https://research.google/blog/evaluating-progress-of-llms-on-scientific-problem-solving/
  • AI track article: https://theaitrack.com/llms-scientific-benchmarks-curie-spiqa-feabench/

FrontierMath

  • website: https://epoch.ai/frontiermath
  • paper: https://arxiv.org/abs/2411.04872
  • HTML: https://arxiv.org/html/2411.04872v1
  • Tier 4: https://epoch.ai/benchmarks/frontiermath-tier-4
  • Open Problems: https://epoch.ai/frontiermath/open-problems
  • Tiers 1-4: https://epoch.ai/frontiermath/tiers-1-4/the-benchmark/
  • VB: https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go
  • OurWorldInData: https://ourworldindata.org/grapher/ai-frontiermath-over-time

Agent / Coding

SciCode

  • NeurIPS PDF: https://proceedings.neurips.cc/paper_files/paper/2024/file/36850592258c8c41cecdaa3dea5ff7de-Paper-Datasets_and_Benchmarks_Track.pdf
  • website: https://scicode-bench.github.io/
  • repo: https://github.com/scicode-bench/SciCode
  • AA leaderboard: https://artificialanalysis.ai/evaluations/scicode
  • HF: https://huggingface.co/papers/2407.13168
  • llm-stats: https://llm-stats.com/benchmarks/scicode
  • Steel: https://leaderboard.steel.dev/registry/benchmarks/scicode

ScienceAgentBench

  • paper: https://arxiv.org/abs/2410.05080
  • HTML: https://arxiv.org/html/2410.05080v1
  • website: https://osu-nlp-group.github.io/ScienceAgentBench/
  • repo: https://github.com/OSU-NLP-Group/ScienceAgentBench
  • OpenReview: https://openreview.net/forum?id=6z4YKr0GK6

DataSciBench

  • website: https://datascibench.github.io/
  • paper: https://arxiv.org/abs/2502.13897
  • HTML: https://arxiv.org/html/2502.13897v1
  • repo: https://github.com/THUDM/DataSciBench
  • OpenReview: https://openreview.net/forum?id=BltaWJZMeR

MLE-Bench

  • OpenAI: https://openai.com/index/mle-bench/
  • paper: https://arxiv.org/abs/2410.07095
  • PDF: https://arxiv.org/pdf/2410.07095
  • repo: https://github.com/openai/mle-bench
  • OpenReview: https://openreview.net/forum?id=6s5uXNWGIh
  • OECD: https://oecd.ai/en/catalogue/tools/mle-bench
  • Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mle_bench/

PaperBench

  • OpenAI: https://openai.com/index/paperbench/
  • paper: https://arxiv.org/abs/2504.01848
  • PDF: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf
  • OpenReview: https://openreview.net/forum?id=xF5PuTLPbn&noteId=1VgzDxh2V3
  • ICML 2025 poster: https://icml.cc/virtual/2025/poster/43586
  • Inspect Evals: https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/paperbench/
  • WandB report: https://wandb.ai/byyoung3/ml-news/reports/OpenAI-s-new-research-agent-benchmark-Paperbench---VmlldzoxMjEyODQ5NA

RE-Bench

  • paper: https://arxiv.org/abs/2411.15114
  • HTML: https://arxiv.org/html/2411.15114v1
  • METR blog: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
  • METR report PDF: https://metr.org/AI_R_D_Evaluation_Report.pdf
  • OpenReview: https://openreview.net/forum?id=3rB0bVU6z6&noteId=lOCHc0u2a6
  • ICML poster: https://icml.cc/virtual/2025/poster/46519
  • Proceedings: https://proceedings.mlr.press/v267/wijk25a.html

LMR-Bench

  • paper: https://aclanthology.org/2025.emnlp-main.314.pdf

Medicine

MedQA

  • vals.ai: https://www.vals.ai/benchmarks/medqa
  • LLM-MedQA paper: https://arxiv.org/abs/2501.05464
  • life-sciences survey: https://intuitionlabs.ai/articles/large-language-model-benchmarks-life-sciences-overview
  • pricepertoken leaderboard: https://pricepertoken.com/leaderboards/benchmark/medqa
  • systematic review: https://www.jmir.org/2025/1/e84120

HealthBench

  • OpenAI: https://openai.com/index/healthbench/
  • paper: https://arxiv.org/abs/2505.08775
  • PDF: https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
  • HF dataset: https://huggingface.co/datasets/openai/healthbench
  • PMC analysis: https://pmc.ncbi.nlm.nih.gov/articles/PMC12547120/
  • MobiHealth: https://www.mobihealthnews.com/news/openai-unveils-healthbench-evaluate-llms-safety-healthcare
  • HLTH: https://hlth.com/insights/news/openai-launches-healthbench-to-evaluate-healthcare-ai-safety-2025-05-16

MedAgentBench

  • paper: https://arxiv.org/abs/2501.14654
  • HTML v2: https://arxiv.org/html/2501.14654v2
  • website: https://stanfordmlgroup.github.io/projects/medagentbench/
  • repo: https://github.com/stanfordmlgroup/MedAgentBench
  • NEJM AI: https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

AgentClinic

  • paper: https://arxiv.org/abs/2405.07960
  • website: https://agentclinic.github.io/
  • OpenReview: https://openreview.net/forum?id=ak7r4He1qH

Chemistry / Materials

ChemBench

  • website: https://lamalab-org.github.io/chembench/
  • repo: https://github.com/lamalab-org/chembench
  • How-To: https://lamalab-org.github.io/chembench/getting_started/

MaCBench

  • notes: https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/macbench-multimodal-chemistry-benchmark/

MatBench Discovery

  • website: https://matbench-discovery.materialsproject.org/
  • Nature MI: https://www.nature.com/articles/s42256-025-01055-1
  • arXiv: https://arxiv.org/html/2308.14920v3
  • ICLR 2023 page: https://iclr.cc/virtual/2023/14184

LLM4Mat-Bench

  • HTML: https://arxiv.org/html/2411.00177v3
  • OpenReview: https://openreview.net/pdf?id=TSAeQSv9RI

MatBench (classical ML)

  • website: https://matbench.materialsproject.org/

Discovery / Simulation

ScienceWorld

  • paper: https://arxiv.org/abs/2203.07540
  • website: https://sciworld.apps.allenai.org/
  • repo: https://github.com/allenai/ScienceWorld
  • AI2 blog: https://allenai.org/blog/evaluating-scientific-discovery-agents

DiscoveryWorld

  • paper: https://arxiv.org/abs/2406.06769
  • website: https://allenai.github.io/discoveryworld/

LLM-SRBench

  • OpenReview: https://openreview.net/forum?id=SyQPiZJVWY

SciGym

  • paper HTML: https://arxiv.org/html/2507.02083

General leaderboards / ranking sites

  • Vellum AI: https://www.vellum.ai/llm-leaderboard
  • llm-stats: https://llm-stats.com/benchmarks
  • pricepertoken: https://pricepertoken.com/leaderboards/benchmark
  • LM Council: https://lmcouncil.ai/benchmarks
  • Scale Labs: https://labs.scale.com/leaderboard
  • Klu: https://klu.ai/llm-leaderboard
  • Artificial Analysis: https://artificialanalysis.ai/leaderboards/models
  • BenchLM: https://benchlm.ai/knowledge
  • Onyx: https://onyx.app/llm-leaderboard

Anthropic / OpenAI official releases

  • Anthropic healthcare & life sciences: https://www.anthropic.com/news/healthcare-life-sciences
  • Anthropic "accelerating scientific research" announcement: https://www.anthropic.com/news/accelerating-scientific-research
  • Claude Sonnet 4.6: https://www.anthropic.com/news/claude-sonnet-4-6
  • Anthropic research 总页: https://www.anthropic.com/research
  • FutureHouse @ Anthropic customer story: https://claude.com/customers/futurehouse