Agents

kaizuberbuehler's Collections

updated Sep 26, 2025

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51
Training Software Engineering Agents and Verifiers with SWE-Gym

Paper • 2412.21139 • Published Dec 30, 2024 • 27
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Paper • 2412.19723 • Published Dec 27, 2024 • 87
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation

Paper • 2408.00764 • Published Aug 1, 2024 • 1
More Agents Is All You Need

Paper • 2402.05120 • Published Feb 3, 2024 • 57
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Paper • 2402.07456 • Published Feb 12, 2024 • 46
Generative Agents: Interactive Simulacra of Human Behavior

Paper • 2304.03442 • Published Apr 7, 2023 • 15
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Paper • 2310.04406 • Published Oct 6, 2023 • 10
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Paper • 2312.13010 • Published Dec 20, 2023 • 6
GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 249
LLM Agent Operating System

Paper • 2403.16971 • Published Mar 25, 2024 • 73
Octopus v2: On-device language model for super agent

Paper • 2404.01744 • Published Apr 2, 2024 • 59
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

Paper • 2404.12753 • Published Apr 19, 2024 • 43
Scaling Instructable Agents Across Many Simulated Worlds

Paper • 2404.10179 • Published Mar 13, 2024 • 28
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Paper • 2404.07972 • Published Apr 11, 2024 • 52
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

Paper • 2404.05902 • Published Apr 8, 2024 • 22
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Paper • 2404.05719 • Published Apr 8, 2024 • 83
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

Paper • 2404.03648 • Published Apr 4, 2024 • 29
Voyager: An Open-Ended Embodied Agent with Large Language Models

Paper • 2305.16291 • Published May 25, 2023 • 13
LASER: LLM Agent with State-Space Exploration for Web Navigation

Paper • 2309.08172 • Published Sep 15, 2023 • 14
The Rise and Potential of Large Language Model Based Agents: A Survey

Paper • 2309.07864 • Published Sep 14, 2023 • 8
Reflexion: Language Agents with Verbal Reinforcement Learning

Paper • 2303.11366 • Published Mar 20, 2023 • 9
LEGENT: Open Platform for Embodied Agents

Paper • 2404.18243 • Published Apr 28, 2024 • 22
Diffusion for World Modeling: Visual Details Matter in Atari

Paper • 2405.12399 • Published May 20, 2024 • 30
OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13, 2024 • 47
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

Paper • 2305.17390 • Published May 27, 2023 • 4
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Paper • 2407.18961 • Published Jul 18, 2024 • 40
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Paper • 2407.18901 • Published Jul 26, 2024 • 36
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Paper • 2407.21787 • Published Jul 31, 2024 • 14
OmniParser for Pure Vision Based GUI Agent

Paper • 2408.00203 • Published Aug 1, 2024 • 24
WebArena: A Realistic Web Environment for Building Autonomous Agents

Paper • 2307.13854 • Published Jul 25, 2023 • 27
Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning

Paper • 2407.20798 • Published Jul 30, 2024 • 24
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Paper • 2408.07060 • Published Aug 13, 2024 • 41
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Paper • 2408.06292 • Published Aug 12, 2024 • 128
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Paper • 2408.14354 • Published Aug 26, 2024 • 41
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Paper • 2405.07960 • Published May 13, 2024 • 1
On the limits of agency in agent-based models

Paper • 2409.10568 • Published Sep 14, 2024 • 14
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Paper • 2409.07703 • Published Sep 12, 2024 • 66
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Paper • 2409.16299 • Published Sep 9, 2024 • 11
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

Paper • 2411.10323 • Published Nov 15, 2024 • 34
Generative World Explorer

Paper • 2411.11844 • Published Nov 18, 2024 • 77
GUI Agents: A Survey

Paper • 2412.13501 • Published Dec 18, 2024 • 30
Large Action Models: From Inception to Implementation

Paper • 2412.10047 • Published Dec 13, 2024 • 36
A3: Android Agent Arena for Mobile GUI Agents

Paper • 2501.01149 • Published Jan 2, 2025 • 22
ResearchTown: Simulator of Human Research Community

Paper • 2412.17767 • Published Dec 23, 2024 • 14
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Paper • 2412.17589 • Published Dec 23, 2024 • 14
Agent-SafetyBench: Evaluating the Safety of LLM Agents

Paper • 2412.14470 • Published Dec 19, 2024 • 12
GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 98
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Paper • 2412.09605 • Published Dec 12, 2024 • 31
The BrowserGym Ecosystem for Web Agent Research

Paper • 2412.05467 • Published Dec 6, 2024 • 24
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Paper • 2412.04454 • Published Dec 5, 2024 • 71
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Paper • 2412.04455 • Published Dec 5, 2024 • 37
MALT: Improving Reasoning with Multi-Agent LLM Training

Paper • 2412.01928 • Published Dec 2, 2024 • 46
Mars-PO: Multi-Agent Reasoning System Preference Optimization

Paper • 2411.19039 • Published Nov 28, 2024 • 1
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Paper • 2410.22304 • Published Oct 29, 2024 • 18
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation

Paper • 2411.17636 • Published Nov 26, 2024 • 2
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Paper • 2410.20007 • Published Oct 25, 2024 • 1
Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay

Paper • 2410.12236 • Published Oct 16, 2024 • 1
Large Language Model-Brained GUI Agents: A Survey

Paper • 2411.18279 • Published Nov 27, 2024 • 30
ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 90
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Paper • 2411.06559 • Published Nov 10, 2024 • 16
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

Paper • 2411.04999 • Published Nov 7, 2024 • 18
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper • 2411.03562 • Published Nov 5, 2024 • 70
Agent Laboratory: Using LLM Agents as Research Assistants

Paper • 2501.04227 • Published Jan 8, 2025 • 96
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Paper • 2501.04575 • Published Jan 8, 2025 • 25
SDPO: Segment-Level Direct Preference Optimization for Social Agents

Paper • 2501.01821 • Published Jan 3, 2025 • 20
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Paper • 2310.11667 • Published Oct 18, 2023 • 4
WebWalker: Benchmarking LLMs in Web Traversal

Paper • 2501.07572 • Published Jan 13, 2025 • 24
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Paper • 2501.05707 • Published Jan 10, 2025 • 20
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Paper • 2501.05040 • Published Jan 9, 2025 • 15
FAST: Efficient Action Tokenization for Vision-Language-Action Models

Paper • 2501.09747 • Published Jan 16, 2025 • 29
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Paper • 2406.11896 • Published Jun 14, 2024 • 20
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

Paper • 2411.03817 • Published Nov 6, 2024 • 1
PaSa: An LLM Agent for Comprehensive Academic Paper Search

Paper • 2501.10120 • Published Jan 17, 2025 • 55
UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Paper • 2501.12326 • Published Jan 21, 2025 • 65
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Paper • 2501.11425 • Published Jan 20, 2025 • 109
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Paper • 2501.11733 • Published Jan 20, 2025 • 28
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Paper • 2501.10893 • Published Jan 18, 2025 • 26
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

Paper • 2501.12909 • Published Jan 22, 2025 • 74
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Paper • 2501.11067 • Published Jan 19, 2025 • 13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Paper • 2501.13200 • Published Jan 22, 2025 • 70
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Paper • 2502.02584 • Published Feb 4, 2025 • 16
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Paper • 2502.00674 • Published Feb 2, 2025 • 13
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Paper • 2502.01506 • Published Feb 3, 2025 • 39
Large Language Model Guided Self-Debugging Code Generation

Paper • 2502.02928 • Published Feb 5, 2025 • 13
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Paper • 2502.04306 • Published Feb 6, 2025 • 20
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Paper • 2502.06060 • Published Feb 9, 2025 • 37
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

Paper • 2502.05664 • Published Feb 8, 2025 • 24
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Paper • 2502.06589 • Published Feb 10, 2025 • 21
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Paper • 2502.08047 • Published Feb 12, 2025 • 28
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Paper • 2502.09560 • Published Feb 13, 2025 • 35
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

Paper • 2502.08235 • Published Feb 12, 2025 • 59
MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Paper • 2502.14499 • Published Feb 20, 2025 • 195
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Paper • 2502.14282 • Published Feb 20, 2025 • 29
Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18, 2025 • 58
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

Paper • 2502.11357 • Published Feb 17, 2025 • 11
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

Paper • 2502.19400 • Published Feb 26, 2025 • 47
Towards an AI co-scientist

Paper • 2502.18864 • Published Feb 26, 2025 • 53
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

Paper • 2502.16111 • Published Feb 22, 2025 • 9
TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning

Paper • 2502.15425 • Published Feb 21, 2025 • 9
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

Paper • 2502.17110 • Published Feb 24, 2025 • 14
WebGames: Challenging General-Purpose Web-Browsing AI Agents

Paper • 2502.18356 • Published Feb 25, 2025 • 14
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Paper • 2502.18906 • Published Feb 26, 2025 • 12
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Paper • 2502.16069 • Published Feb 22, 2025 • 20
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Paper • 2502.19328 • Published Feb 26, 2025 • 23
ATLaS: Agent Tuning via Learning Critical Steps

Paper • 2503.02197 • Published Mar 4, 2025 • 9
Gemini Robotics: Bringing AI into the Physical World

Paper • 2503.20020 • Published Mar 25, 2025 • 31
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Paper • 2503.02268 • Published Mar 4, 2025 • 11
Unified Video Action Model

Paper • 2503.00200 • Published Feb 28, 2025 • 14
MPO: Boosting LLM Agents with Meta Plan Optimization

Paper • 2503.02682 • Published Mar 4, 2025 • 29
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Paper • 2503.01935 • Published Mar 3, 2025 • 30
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

Paper • 2503.10480 • Published Mar 13, 2025 • 57
Automated Movie Generation via Multi-Agent CoT Planning

Paper • 2503.07314 • Published Mar 10, 2025 • 44
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Paper • 2503.07608 • Published Mar 10, 2025 • 23
SafeArena: Evaluating the Safety of Autonomous Web Agents

Paper • 2503.04957 • Published Mar 6, 2025 • 21
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Paper • 2503.08525 • Published Mar 11, 2025 • 17
Agent models: Internalizing Chain-of-Action Generation into Reasoning models

Paper • 2503.06580 • Published Mar 9, 2025 • 20
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Paper • 2503.07459 • Published Mar 10, 2025 • 16
LocAgent: Graph-Guided LLM Agents for Code Localization

Paper • 2503.09089 • Published Mar 12, 2025 • 13
AI-native Memory 2.0: Second Me

Paper • 2503.08102 • Published Mar 11, 2025 • 13
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Paper • 2503.21620 • Published Mar 27, 2025 • 62
Agents Play Thousands of 3D Video Games

Paper • 2503.13356 • Published Mar 17, 2025 • 9
SWE-smith: Scaling Data for Software Engineering Agents

Paper • 2504.21798 • Published Apr 30, 2025 • 15
Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 97
Why Do Multi-Agent LLM Systems Fail?

Paper • 2503.13657 • Published Mar 17, 2025 • 49
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

Paper • 2503.12349 • Published Mar 16, 2025 • 44
API Agents vs. GUI Agents: Divergence and Convergence

Paper • 2503.11069 • Published Mar 14, 2025 • 36
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

Paper • 2503.10970 • Published Mar 14, 2025 • 18
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Paper • 2503.13444 • Published Mar 17, 2025 • 20
STEVE: A Step Verification Pipeline for Computer-use Agent Training

Paper • 2503.12532 • Published Mar 16, 2025 • 17
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Paper • 2503.15478 • Published Mar 19, 2025 • 14
Measuring AI Ability to Complete Long Tasks

Paper • 2503.14499 • Published Mar 18, 2025 • 16
Free-form language-based robotic reasoning and grasping

Paper • 2503.13082 • Published Mar 17, 2025 • 11
Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Paper • 2503.21460 • Published Mar 27, 2025 • 84
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving

Paper • 2503.16905 • Published Mar 21, 2025 • 54
Open Deep Search: Democratizing Search with Open-source Reasoning Agents

Paper • 2503.20201 • Published Mar 26, 2025 • 48
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization

Paper • 2503.16874 • Published Mar 21, 2025 • 45
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Paper • 2503.16408 • Published Mar 20, 2025 • 42
AgentRxiv: Towards Collaborative Autonomous Research

Paper • 2503.18102 • Published Mar 23, 2025 • 25
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Paper • 2503.21696 • Published Mar 27, 2025 • 23
Defeating Prompt Injections by Design

Paper • 2503.18813 • Published Mar 24, 2025 • 25
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Paper • 2503.13964 • Published Mar 18, 2025 • 20
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
PaperBench: Evaluating AI's Ability to Replicate AI Research

Paper • 2504.01848 • Published Apr 2, 2025 • 37
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Paper • 2503.23145 • Published Mar 29, 2025 • 35
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Paper • 2503.24388 • Published Mar 31, 2025 • 29
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Paper • 2504.00906 • Published Apr 1, 2025 • 27
Towards Trustworthy GUI Agents: A Survey

Paper • 2503.23434 • Published Mar 30, 2025 • 21
Interpreting Emergent Planning in Model-Free Reinforcement Learning

Paper • 2504.01871 • Published Apr 2, 2025 • 12
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models

Paper • 2503.22673 • Published Mar 28, 2025 • 12
Scaling Laws in Scientific Discovery with AI and Robot Scientists

Paper • 2503.22444 • Published Mar 28, 2025 • 12
VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Paper • 2504.00406 • Published Apr 1, 2025 • 8
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow

Paper • 2503.18968 • Published Mar 21, 2025 • 8
A Unified Agentic Framework for Evaluating Conditional Image Generation

Paper • 2504.07046 • Published Apr 9, 2025 • 30
Agentic Knowledgeable Self-awareness

Paper • 2504.03553 • Published Apr 4, 2025 • 27
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

Paper • 2504.07830 • Published Apr 10, 2025 • 18
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published Apr 4, 2025 • 18
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Paper • 2504.03601 • Published Apr 4, 2025 • 18
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Paper • 2503.22738 • Published Mar 26, 2025 • 17
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Paper • 2504.06148 • Published Apr 8, 2025 • 13
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Paper • 2504.07079 • Published Apr 9, 2025 • 12
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Paper • 2504.11536 • Published Apr 15, 2025 • 63
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

Paper • 2504.08672 • Published Apr 11, 2025 • 55
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Paper • 2504.08388 • Published Apr 11, 2025 • 43
TextArena

Paper • 2504.11442 • Published Apr 15, 2025 • 30
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Paper • 2504.08942 • Published Apr 11, 2025 • 29
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Paper • 2504.09702 • Published Apr 13, 2025 • 18
SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Paper • 2504.10157 • Published Apr 14, 2025 • 17
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Paper • 2504.10127 • Published Apr 14, 2025 • 17
ReZero: Enhancing LLM search ability by trying one-more-time

Paper • 2504.11001 • Published Apr 15, 2025 • 16
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Paper • 2504.08066 • Published Apr 10, 2025 • 22
Exploring Expert Failures Improves LLM Agent Tuning

Paper • 2504.13145 • Published Apr 17, 2025 • 12
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Paper • 2504.12563 • Published Apr 17, 2025 • 4
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Paper • 2504.17192 • Published Apr 24, 2025 • 124
FlowReasoner: Reinforcing Query-Level Meta-Agents

Paper • 2504.15257 • Published Apr 21, 2025 • 47
ToolRL: Reward is All Tool Learning Needs

Paper • 2504.13958 • Published Apr 16, 2025 • 49
OTC: Optimal Tool Calls via Reinforcement Learning

Paper • 2504.14870 • Published Apr 21, 2025 • 35
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Paper • 2504.13203 • Published Apr 15, 2025 • 35
BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Paper • 2504.14538 • Published Apr 20, 2025 • 30
UFO2: The Desktop AgentOS

Paper • 2504.14603 • Published Apr 20, 2025 • 29
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Paper • 2504.16078 • Published Apr 22, 2025 • 21
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

Paper • 2504.15785 • Published Apr 22, 2025 • 22
