kaizuberbuehler's Collections: Agents
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World
Tasks
Paper
• 2412.14161
• Published
• 51
Training Software Engineering Agents and Verifiers with SWE-Gym
Paper
• 2412.21139
• Published
• 25
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
Task Synthesis
Paper
• 2412.19723
• Published
• 87
AgentGen: Enhancing Planning Abilities for Large Language Model based
Agent via Environment and Task Generation
Paper
• 2408.00764
• Published
• 1
More Agents Is All You Need
Paper
• 2402.05120
• Published
• 57
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
• 2402.07456
• Published
• 46
Generative Agents: Interactive Simulacra of Human Behavior
Paper
• 2304.03442
• Published
• 15
Language Agent Tree Search Unifies Reasoning Acting and Planning in
Language Models
Paper
• 2310.04406
• Published
• 10
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and
Optimisation
Paper
• 2312.13010
• Published
• 6
GAIA: a benchmark for General AI Assistants
Paper
• 2311.12983
• Published
• 246
LLM Agent Operating System
Paper
• 2403.16971
• Published
• 73
Octopus v2: On-device language model for super agent
Paper
• 2404.01744
• Published
• 58
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler
Generation
Paper
• 2404.12753
• Published
• 43
Scaling Instructable Agents Across Many Simulated Worlds
Paper
• 2404.10179
• Published
• 28
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real
Computer Environments
Paper
• 2404.07972
• Published
• 52
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Paper
• 2404.05902
• Published
• 22
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
• 2404.05719
• Published
• 83
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web
Navigating Agent
Paper
• 2404.03648
• Published
• 29
Voyager: An Open-Ended Embodied Agent with Large Language Models
Paper
• 2305.16291
• Published
• 13
LASER: LLM Agent with State-Space Exploration for Web Navigation
Paper
• 2309.08172
• Published
• 14
The Rise and Potential of Large Language Model Based Agents: A Survey
Paper
• 2309.07864
• Published
• 8
Reflexion: Language Agents with Verbal Reinforcement Learning
Paper
• 2303.11366
• Published
• 5
LEGENT: Open Platform for Embodied Agents
Paper
• 2404.18243
• Published
• 22
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
• 2405.12399
• Published
• 30
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published
• 43
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex
Interactive Tasks
Paper
• 2305.17390
• Published
• 4
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper
• 2407.18961
• Published
• 40
AppWorld: A Controllable World of Apps and People for Benchmarking
Interactive Coding Agents
Paper
• 2407.18901
• Published
• 35
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper
• 2407.21787
• Published
• 13
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper
• 2307.13854
• Published
• 27
Diffusion Augmented Agents: A Framework for Efficient Exploration and
Transfer Learning
Paper
• 2407.20798
• Published
• 24
Diversity Empowers Intelligence: Integrating Expertise of Software
Engineering Agents
Paper
• 2408.07060
• Published
• 41
The AI Scientist: Towards Fully Automated Open-Ended Scientific
Discovery
Paper
• 2408.06292
• Published
• 128
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Paper
• 2408.14354
• Published
• 41
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated
clinical environments
Paper
• 2405.07960
• Published
• 1
On the limits of agency in agent-based models
Paper
• 2409.10568
• Published
• 14
DSBench: How Far Are Data Science Agents to Becoming Data Science
Experts?
Paper
• 2409.07703
• Published
• 66
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks
at Scale
Paper
• 2409.16299
• Published
• 11
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
• 2411.10323
• Published
• 34
Generative World Explorer
Paper
• 2411.11844
• Published
• 77
Paper
• 2412.13501
• Published
• 29
Large Action Models: From Inception to Implementation
Paper
• 2412.10047
• Published
• 36
A3: Android Agent Arena for Mobile GUI Agents
Paper
• 2501.01149
• Published
• 22
ResearchTown: Simulator of Human Research Community
Paper
• 2412.17767
• Published
• 14
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital
World
Paper
• 2412.17589
• Published
• 14
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Paper
• 2412.14470
• Published
• 13
GenEx: Generating an Explorable World
Paper
• 2412.09624
• Published
• 98
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web
Tutorials
Paper
• 2412.09605
• Published
• 30
The BrowserGym Ecosystem for Web Agent Research
Paper
• 2412.05467
• Published
• 24
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Paper
• 2412.04454
• Published
• 71
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and
Proactive Robotic Failure Detection
Paper
• 2412.04455
• Published
• 38
MALT: Improving Reasoning with Multi-Agent LLM Training
Paper
• 2412.01928
• Published
• 45
Mars-PO: Multi-Agent Reasoning System Preference Optimization
Paper
• 2411.19039
• Published
• 1
Flow-DPO: Improving LLM Mathematical Reasoning through Online
Multi-Agent Learning
Paper
• 2410.22304
• Published
• 18
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics
Manipulation
Paper
• 2411.17636
• Published
• 2
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large
Language Models
Paper
• 2410.20007
• Published
• 1
Enhancing LLM Agents for Code Generation with Possibility and Pass-rate
Prioritized Experience Replay
Paper
• 2410.12236
• Published
• 1
Large Language Model-Brained GUI Agents: A Survey
Paper
• 2411.18279
• Published
• 30
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 90
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning
for Web Agents
Paper
• 2411.06559
• Published
• 16
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile
Manipulation
Paper
• 2411.04999
• Published
• 18
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle
Grandmaster Level
Paper
• 2411.03562
• Published
• 69
Agent Laboratory: Using LLM Agents as Research Assistants
Paper
• 2501.04227
• Published
• 95
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
• 2501.04575
• Published
• 25
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Paper
• 2501.01821
• Published
• 20
SOTOPIA: Interactive Evaluation for Social Intelligence in Language
Agents
Paper
• 2310.11667
• Published
• 4
WebWalker: Benchmarking LLMs in Web Traversal
Paper
• 2501.07572
• Published
• 23
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
• 2501.05707
• Published
• 20
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub
Issue Resolution
Paper
• 2501.05040
• Published
• 15
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published
• 29
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous
Reinforcement Learning
Paper
• 2406.11896
• Published
• 20
From Novice to Expert: LLM Agent Policy Optimization via Step-wise
Reinforcement Learning
Paper
• 2411.03817
• Published
• 1
PaSa: An LLM Agent for Comprehensive Academic Paper Search
Paper
• 2501.10120
• Published
• 54
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Paper
• 2501.12326
• Published
• 64
Agent-R: Training Language Model Agents to Reflect via Iterative
Self-Training
Paper
• 2501.11425
• Published
• 109
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in
Realistic Environments
Paper
• 2501.10893
• Published
• 26
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in
Virtual 3D Spaces
Paper
• 2501.12909
• Published
• 74
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI
Systems
Paper
• 2501.11067
• Published
• 13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Paper
• 2501.13200
• Published
• 69
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Paper
• 2502.02584
• Published
• 16
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models
Beneficial?
Paper
• 2502.00674
• Published
• 13
TwinMarket: A Scalable Behavioral and Social Simulation for Financial
Markets
Paper
• 2502.01506
• Published
• 39
Large Language Model Guided Self-Debugging Code Generation
Paper
• 2502.02928
• Published
• 13
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference
Optimization
Paper
• 2502.04306
• Published
• 20
Training Language Models for Social Deduction with Multi-Agent
Reinforcement Learning
Paper
• 2502.06060
• Published
• 38
CODESIM: Multi-Agent Code Generation and Problem Solving through
Simulation-Driven Planning and Debugging
Paper
• 2502.05664
• Published
• 24
Hephaestus: Improving Fundamental Agent Capabilities of Large Language
Models through Continual Pre-Training
Paper
• 2502.06589
• Published
• 21
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
• 2502.08047
• Published
• 28
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in
Agentic Tasks
Paper
• 2502.08235
• Published
• 59
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper
• 2502.14499
• Published
• 194
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
• 2502.14282
• Published
• 29
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for
Multimodal Web Agents
Paper
• 2502.11357
• Published
• 11
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
Towards an AI co-scientist
Paper
• 2502.18864
• Published
• 52
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning
Trajectories for Complex Problem Solving
Paper
• 2502.16111
• Published
• 9
TAG: A Decentralized Framework for Multi-Agent Hierarchical
Reinforcement Learning
Paper
• 2502.15425
• Published
• 9
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided
Multi-Agent Collaboration
Paper
• 2502.17110
• Published
• 14
WebGames: Challenging General-Purpose Web-Browsing AI Agents
Paper
• 2502.18356
• Published
• 14
VEM: Environment-Free Exploration for Training GUI Agent with Value
Environment Model
Paper
• 2502.18906
• Published
• 12
Curie: Toward Rigorous and Automated Scientific Experimentation with AI
Agents
Paper
• 2502.16069
• Published
• 20
Agentic Reward Modeling: Integrating Human Preferences with Verifiable
Correctness Signals for Reliable Reward Systems
Paper
• 2502.19328
• Published
• 23
ATLaS: Agent Tuning via Learning Critical Steps
Paper
• 2503.02197
• Published
• 9
Gemini Robotics: Bringing AI into the Physical World
Paper
• 2503.20020
• Published
• 31
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Paper
• 2503.02268
• Published
• 11
Unified Video Action Model
Paper
• 2503.00200
• Published
• 14
MPO: Boosting LLM Agents with Meta Plan Optimization
Paper
• 2503.02682
• Published
• 29
MultiAgentBench: Evaluating the Collaboration and Competition of LLM
agents
Paper
• 2503.01935
• Published
• 30
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
• 2503.10480
• Published
• 56
Automated Movie Generation via Multi-Agent CoT Planning
Paper
• 2503.07314
• Published
• 44
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via
Reinforcement Learning and Reasoning
Paper
• 2503.07608
• Published
• 23
SafeArena: Evaluating the Safety of Autonomous Web Agents
Paper
• 2503.04957
• Published
• 21
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based
VLM Agent Training
Paper
• 2503.08525
• Published
• 17
Agent models: Internalizing Chain-of-Action Generation into Reasoning
models
Paper
• 2503.06580
• Published
• 20
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for
Complex Medical Reasoning
Paper
• 2503.07459
• Published
• 16
LocAgent: Graph-Guided LLM Agents for Code Localization
Paper
• 2503.09089
• Published
• 13
AI-native Memory 2.0: Second Me
Paper
• 2503.08102
• Published
• 13
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published
• 62
Agents Play Thousands of 3D Video Games
Paper
• 2503.13356
• Published
• 9
SWE-smith: Scaling Data for Software Engineering Agents
Paper
• 2504.21798
• Published
• 15
Survey on Evaluation of LLM-based Agents
Paper
• 2503.16416
• Published
• 96
Why Do Multi-Agent LLM Systems Fail?
Paper
• 2503.13657
• Published
• 48
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?
Paper
• 2503.12349
• Published
• 44
API Agents vs. GUI Agents: Divergence and Convergence
Paper
• 2503.11069
• Published
• 36
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of
Tools
Paper
• 2503.10970
• Published
• 18
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
• 2503.13444
• Published
• 19
STEVE: A Step Verification Pipeline for Computer-use Agent Training
Paper
• 2503.12532
• Published
• 17
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning
Tasks
Paper
• 2503.15478
• Published
• 14
Measuring AI Ability to Complete Long Tasks
Paper
• 2503.14499
• Published
• 16
Free-form language-based robotic reasoning and grasping
Paper
• 2503.13082
• Published
• 11
Large Language Model Agent: A Survey on Methodology, Applications and
Challenges
Paper
• 2503.21460
• Published
• 83
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
• 2503.16905
• Published
• 54
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
• 2503.20201
• Published
• 48
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for
Automated Prompt Optimization
Paper
• 2503.16874
• Published
• 45
RoboFactory: Exploring Embodied Agent Collaboration with Compositional
Constraints
Paper
• 2503.16408
• Published
• 42
AgentRxiv: Towards Collaborative Autonomous Research
Paper
• 2503.18102
• Published
• 25
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
Defeating Prompt Injections by Design
Paper
• 2503.18813
• Published
• 24
MDocAgent: A Multi-Modal Multi-Agent Framework for Document
Understanding
Paper
• 2503.13964
• Published
• 20
Advances and Challenges in Foundation Agents: From Brain-Inspired
Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper
• 2504.01990
• Published
• 303
PaperBench: Evaluating AI's Ability to Replicate AI Research
Paper
• 2504.01848
• Published
• 37
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive
Program Synthesis
Paper
• 2503.23145
• Published
• 35
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist
Policy
Paper
• 2503.24388
• Published
• 29
Agent S2: A Compositional Generalist-Specialist Framework for Computer
Use Agents
Paper
• 2504.00906
• Published
• 27
Towards Trustworthy GUI Agents: A Survey
Paper
• 2503.23434
• Published
• 21
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Paper
• 2504.01871
• Published
• 12
ActionStudio: A Lightweight Framework for Data and Training of Large
Action Models
Paper
• 2503.22673
• Published
• 12
Scaling Laws in Scientific Discovery with AI and Robot Scientists
Paper
• 2503.22444
• Published
• 12
VerifiAgent: a Unified Verification Agent in Language Model Reasoning
Paper
• 2504.00406
• Published
• 8
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via
Reasoning Agentic Workflow
Paper
• 2503.18968
• Published
• 8
A Unified Agentic Framework for Evaluating Conditional Image Generation
Paper
• 2504.07046
• Published
• 30
Agentic Knowledgeable Self-awareness
Paper
• 2504.03553
• Published
• 27
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in
Multi-Agent Simulations
Paper
• 2504.07830
• Published
• 18
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge
Refinement
Paper
• 2504.03561
• Published
• 18
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated
Agent-Human Interplay
Paper
• 2504.03601
• Published
• 18
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Paper
• 2503.22738
• Published
• 17
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published
• 13
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing
Skills
Paper
• 2504.07079
• Published
• 12
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Paper
• 2504.11536
• Published
• 63
Genius: A Generalizable and Purely Unsupervised Self-Training Framework
For Advanced Reasoning
Paper
• 2504.08672
• Published
• 55
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
• 2504.08388
• Published
• 42
Paper
• 2504.11442
• Published
• 30
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent
Trajectories
Paper
• 2504.08942
• Published
• 28
MLRC-Bench: Can Language Agents Solve Machine Learning Research
Challenges?
Paper
• 2504.09702
• Published
• 18
SocioVerse: A World Model for Social Simulation Powered by LLM Agents
and A Pool of 10 Million Real-World Users
Paper
• 2504.10157
• Published
• 17
Breaking the Data Barrier -- Building GUI Agents Through Task
Generalization
Paper
• 2504.10127
• Published
• 17
ReZero: Enhancing LLM search ability by trying one-more-time
Paper
• 2504.11001
• Published
• 16
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via
Agentic Tree Search
Paper
• 2504.08066
• Published
• 16
Exploring Expert Failures Improves LLM Agent Tuning
Paper
• 2504.13145
• Published
• 12
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic
Data Generation
Paper
• 2504.12563
• Published
• 4
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
• 2504.17192
• Published
• 123
FlowReasoner: Reinforcing Query-Level Meta-Agents
Paper
• 2504.15257
• Published
• 47
ToolRL: Reward is All Tool Learning Needs
Paper
• 2504.13958
• Published
• 49
OTC: Optimal Tool Calls via Reinforcement Learning
Paper
• 2504.14870
• Published
• 35
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Paper
• 2504.13203
• Published
• 35
BookWorld: From Novels to Interactive Agent Societies for Creative Story
Generation
Paper
• 2504.14538
• Published
• 30
UFO2: The Desktop AgentOS
Paper
• 2504.14603
• Published
• 29
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making
Abilities
Paper
• 2504.16078
• Published
• 21
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World
Model-based LLM Agents
Paper
• 2504.15785
• Published
• 22