AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Abstract
The AgentDS benchmark evaluates AI agents and human-AI collaboration on domain-specific data science tasks, revealing the continued necessity of human expertise despite advances in large language models and AI agents.
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling a systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning: AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are at https://huggingface.co/datasets/lainmn/AgentDS.
Community
Our AgentDS technical report is now live on arXiv.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents (2026)
- DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis (2026)
- Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark (2026)
- Understanding Human-AI Collaboration in Cybersecurity Competitions (2026)
- AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (2026)
- HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery (2026)
- DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science (2026)