arxiv:2606.14885

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Published on Jun 12

Submitted by Tom Lu on Jun 17

SKY Lab



Abstract

The DR-DCI framework combines retrieval with direct corpus interaction by dynamically pulling relevant documents into a local workspace, enabling scalable and efficient agentic search across large corpora.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On BrowseComp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.


Community

eigentom

Paper submitter · about 9 hours ago · edited about 8 hours ago

🔥 Direct Corpus Interaction made agentic search simple: let agents search the corpus directly with shell tools.

🤔 But what happens when the corpus grows from 100K to 10M+ documents?

Introducing DR-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion.

🚀 The Idea:
Instead of forcing agents to run terminal commands over the full corpus, DR-DCI uses retrieval as an agent-callable action to dynamically expand a local workspace.

The agent first pulls potentially relevant documents into its workspace, then performs local DCI operations — search, filter, compare, inspect, and verify — within that evolving workspace.
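To make the two-step pattern concrete, here is a minimal sketch in Python, not the authors' implementation. The toy corpus, the term-overlap `retrieve` function (a stand-in for BM25 or ColBERT), and the `Workspace` class with its `expand` and `grep` methods are all hypothetical names introduced for illustration.

```python
import re
import tempfile
from pathlib import Path

# Toy corpus standing in for a large document collection (hypothetical data).
CORPUS = {
    "doc1": "Ada Lovelace wrote notes on the Analytical Engine in 1843.",
    "doc2": "Charles Babbage designed the Analytical Engine.",
    "doc3": "The Eiffel Tower was completed in 1889 in Paris.",
    "doc4": "Grace Hopper worked on the Harvard Mark I computer.",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, k=2):
    """Rank documents by term overlap (an assumed stand-in for a real retriever)."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(text))), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]

class Workspace:
    """A local directory that grows as the agent calls retrieval."""

    def __init__(self, root):
        self.root = Path(root)

    def expand(self, query, k=2):
        """Retrieval as an action: write the top-k hits into the workspace."""
        added = []
        for doc_id in retrieve(query, k):
            path = self.root / f"{doc_id}.txt"
            if not path.exists():
                path.write_text(CORPUS[doc_id], encoding="utf-8")
                added.append(doc_id)
        return added

    def grep(self, pattern):
        """Local DCI-style operation: regex search over workspace files only."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sorted(
            p.stem for p in self.root.glob("*.txt")
            if rx.search(p.read_text(encoding="utf-8"))
        )

# Demo: two retrieval calls expand the workspace, then a local search runs in it.
with tempfile.TemporaryDirectory() as tmp:
    ws = Workspace(tmp)
    ws.expand("Analytical Engine")
    ws.expand("Harvard computer", k=1)
    matches = ws.grep(r"18\d\d")

print(matches)  # ['doc1']
```

Note that `doc3` also contains a year starting with 18, but it is never matched because it was never pulled into the workspace: the local search only sees what retrieval has brought in.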

💡 Why it works:

DR-DCI combines the best of both worlds:

🔍 Retriever-level recall for scalable exploration
🛠️ DCI-style local operations for precise evidence verification
📁 Workspace-based reasoning for stable multi-document search
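The second and third points can be illustrated with a small sketch of cross-document verification, where each constraint on a candidate answer may be supported by a different file. This is only an illustration under assumptions: the `WORKSPACE` dictionary, `supporting_docs`, and `verify` are hypothetical names, and the real system exposes shell-executable operations rather than these Python helpers.

```python
import re

# Hypothetical workspace contents: file name -> text already pulled in by retrieval.
WORKSPACE = {
    "a.txt": "Ada Lovelace was born in 1815 in London.",
    "b.txt": "Ada Lovelace published notes on the Analytical Engine.",
    "c.txt": "Charles Babbage designed the Analytical Engine.",
    "d.txt": "Charles Babbage was born in 1791 in London.",
}

def supporting_docs(candidate, pattern):
    """Return workspace files that mention the candidate and match the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(
        name for name, text in WORKSPACE.items()
        if candidate.lower() in text.lower() and rx.search(text)
    )

def verify(candidates, constraints):
    """Keep candidates for which every constraint has at least one supporting file."""
    verified = {}
    for cand in candidates:
        evidence = {c: supporting_docs(cand, c) for c in constraints}
        if all(evidence.values()):
            verified[cand] = evidence
    return verified

result = verify(
    ["Ada Lovelace", "Charles Babbage"],
    [r"born in 1815", r"Analytical Engine"],
)
print(result)
```

Here only "Ada Lovelace" survives, with one constraint evidenced by `a.txt` and the other by `b.txt`. A ranked result list alone would not make this kind of per-constraint bookkeeping easy.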

📊 Results:

On BrowseComp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points, while reducing tool usage, wall time, and estimated cost.

With workspace-preserving context reset, accuracy further improves to 73.3%.
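The post does not describe how the reset works, so the following is only one plausible reading, sketched under assumptions: the agent's message history is cleared once it exceeds a budget, while the set of documents already pulled into the workspace is kept. `AgentState`, `context_size`, and `maybe_reset` are hypothetical names, and the word count is a crude proxy for tokens.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    messages: list = field(default_factory=list)  # conversational context (reset)
    workspace: set = field(default_factory=set)   # document ids pulled so far (kept)

    def context_size(self):
        # Crude proxy for token count: total whitespace-separated words.
        return sum(len(m.split()) for m in self.messages)

    def maybe_reset(self, budget):
        """Clear the context if it exceeds the budget; leave the workspace untouched."""
        if self.context_size() <= budget:
            return False
        listing = ", ".join(sorted(self.workspace))
        self.messages = [
            f"Question: {self.question}",
            f"Workspace files: {listing}",
        ]
        return True

state = AgentState("Who wrote the notes?")
state.workspace.update({"doc1", "doc2"})
state.messages.append("tool output " * 50)  # 100 words of accumulated history

did_reset = state.maybe_reset(budget=50)
print(did_reset, state.messages, sorted(state.workspace))
```

After the reset the agent restarts from a short context, but the evidence gathered so far remains available for further local operations.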

As the corpus scales, DR-DCI remains effective from 100K to 10M documents, while raw DCI becomes unstable and BM25 performs substantially worse.

DR-DCI also scales to a 20M-scale Wiki-18 file-per-document QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines.

💡 Key Insight:
For large-scale agentic search, retrieval should not replace interaction.
Instead, it should steer interaction.

DR-DCI shows that retrieval-guided workspace expansion can preserve the flexibility of Direct Corpus Interaction while making it scalable to tens of millions of documents.


