---
language:
- en
license: mit
tags:
- Explorer SubAgent
- Repository Exploration
library_name: transformers
base_model:
- Qwen/Qwen3-4B-Instruct-2507
---

## 1. Model Introduction

**FastContext-1.0** is a lightweight **repository-exploration subagent** for LLM coding agents. Instead of letting a single model both explore the repository and solve the task, FastContext separates the two roles: a main coding agent invokes it on demand, it issues **parallel read-only tool calls** (READ, GLOB, GREP), and it returns **compact file paths and line ranges** as focused context.

Repository exploration is a major bottleneck in modern coding agents: locating relevant code consumes a large share of the token budget and pollutes the solver's context with irrelevant snippets. In our analysis of GPT-5.4 trajectories, reading and searching account for **56.2% of all tool-use turns** and **46.5% of the main agent's total tokens**. FastContext moves this work into a dedicated subagent so the main agent receives clean, grounded evidence rather than the long trail of exploratory reads and searches.

The model family spans **4B to 30B parameters**. Each model is bootstrapped from strong reference-model trajectories via supervised fine-tuning (SFT) and refined with task-grounded reinforcement learning (RL) for broad first-turn search, multi-turn evidence gathering, and precise citation generation.

- **Backbones:** Qwen3-4B-Instruct (4B explorer) and Qwen3-Coder-30B-A3B (30B explorer)
- **Variants:** `FC-4B-SFT`, `FC-4B-RL` (deployment targets), `FC-30B-SFT` (scaling reference)
- **Context length:** up to 262K tokens
- **Paper:** *FastContext: Training Efficient Repository Explorer for Coding Agents*
- **Code & data:** https://github.com/microsoft/fastcontext

### How it works

```
Coding Agent ──query──▶ FastContext ──read/search──▶ Repository
     ▲                       │
     └──── file-line ────────┘
           citations
```

Internally, FastContext runs an exploration loop:

1. **Query understanding:** translate the issue into search intents.
2. **Parallel tool calling:** issue multiple `READ` / `GLOB` / `GREP` calls in a single turn to cover complementary hypotheses.
3. **Observation-driven refinement:** use tool outputs to guide the next search turn.
4. **Final citations:** return a compact `<final_answer>` block of file paths and line ranges.
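
The four steps above can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the released implementation: `explore`, `scripted_policy`, and `fake_tools` are hypothetical names, the policy is a scripted stand-in for the model, and the `path:start-end` citation format inside `<final_answer>` is assumed for the example.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical representation: a tool call is (tool_name, argument).
ToolCall = tuple[str, str]


def explore(policy: Callable[[list[str]], list[ToolCall] | str],
            tools: dict[str, Callable[[str], str]],
            query: str,
            max_turns: int = 8) -> str:
    """Alternate between parallel tool calls and observations until the
    policy returns a final answer string or the turn budget runs out."""
    history = [f"QUERY: {query}"]
    for _ in range(max_turns):
        step = policy(history)
        if isinstance(step, str):  # final citations: stop exploring
            return step
        # Read-only calls do not interfere, so one turn can run them in parallel.
        with ThreadPoolExecutor(max_workers=max(1, len(step))) as pool:
            outputs = list(pool.map(lambda call: tools[call[0]](call[1]), step))
        # Observations are appended so the next turn can refine the search.
        for (name, arg), out in zip(step, outputs):
            history.append(f"{name}({arg}) -> {out}")
    return "<final_answer></final_answer>"  # budget exhausted, nothing cited


def scripted_policy(history: list[str]) -> list[ToolCall] | str:
    """Stand-in for the model: one broad parallel turn, then citations."""
    if len(history) == 1:
        return [("GLOB", "**/*.py"), ("GREP", "def parse")]
    return "<final_answer>\nsrc/parser.py:10-42\n</final_answer>"


# Canned tool outputs so the sketch runs without a real repository.
fake_tools = {
    "GLOB": lambda pattern: "src/parser.py",
    "GREP": lambda pattern: "src/parser.py:10:def parse(text):",
}

answer = explore(scripted_policy, fake_tools, "Where is parsing implemented?")
print(answer)
```

In a real integration the policy would be a call to the served model and the tools would operate on the checked-out repository; the control flow stays the same.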

## 2. Evaluation Results

### End-to-end performance (Mini-SWE-Agent)

In the results below, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates by **up to 5.0 points** while reducing main-agent token consumption by **up to 50.7%**, with only marginal overhead. Scores, tokens, and turns are measured on the main-agent trajectory; deltas are relative to `w/o Explore` for the same main agent.

| Main Agent | Subagent | SWE-bench Multilingual | SWE-bench Pro | SWE-QA |
|---|---|---|---|---|
| **GPT-5.4** | w/o Explore | 71.7 / 457k | 46.0 / 818k | 81.3 / 418k |
| | FC-30B-SFT | **75.0** (+3.3) / 356k (-22.1%) | **49.0** (+3.0) / 688k (-15.9%) | **82.0** (+0.7) / 206k (-50.7%) |
| | FC-4B-SFT | 73.3 (+1.6) / 364k (-20.4%) | 47.0 (+1.0) / 689k (-15.8%) | 81.9 (+0.6) / 213k (-49.0%) |
| | FC-4B-RL | 74.7 (+3.0) / 338k (-26.0%) | 48.5 (+2.5) / 701k (-14.3%) | **82.0** (+0.7) / 210k (-49.8%) |
| **GLM-5.1** | w/o Explore | 72.3 / 2514k | 17.5 / 2692k | 72.7 / 401k |
| | FC-30B-SFT | **73.7** (+1.4) / 1797k (-28.5%) | 20.0 (+2.5) / 2370k (-12.0%) | 73.3 (+0.6) / 292k (-27.2%) |
| | FC-4B-SFT | 73.3 (+1.0) / 1919k (-23.7%) | 18.0 (+0.5) / 2279k (-15.3%) | 73.4 (+0.7) / 306k (-23.7%) |
| | FC-4B-RL | **73.7** (+1.4) / 1971k (-21.6%) | **22.5** (+5.0) / 2210k (-17.9%) | **73.5** (+0.8) / 302k (-24.7%) |
| **Kimi-K2.6** | w/o Explore | 76.3 / 1553k | 31.0 / 2383k | 71.6 / 510k |
| | FC-30B-SFT | 76.7 (+0.4) / 1360k (-12.4%) | 33.0 (+2.0) / 2150k (-9.8%) | **72.8** (+1.2) / 373k (-26.9%) |
| | FC-4B-SFT | 75.3 (-1.0) / 1306k (-15.9%) | 32.5 (+1.5) / 2159k (-9.4%) | 72.6 (+1.0) / 402k (-21.2%) |
| | FC-4B-RL | **78.3** (+2.0) / 1384k (-10.9%) | **33.5** (+2.5) / 2158k (-9.4%) | 72.6 (+1.0) / 378k (-25.9%) |

*Each cell shows score / main-agent tokens, with the change relative to `w/o Explore` in parentheses. Best score per main-agent block in bold.*
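
As a worked example of how the parenthesised deltas can be read, the sketch below recomputes one cell from the table. The helper name `deltas` is hypothetical, and because the table reports tokens rounded to the nearest thousand, a recomputed percentage may occasionally differ from the published one in the last digit.

```python
def deltas(base_score, base_tokens_k, score, tokens_k):
    """Return (absolute score change in points, relative token change in percent)."""
    score_delta = round(score - base_score, 1)
    token_delta_pct = round(100.0 * (tokens_k - base_tokens_k) / base_tokens_k, 1)
    return score_delta, token_delta_pct


# GPT-5.4 on SWE-bench Multilingual: w/o Explore (71.7 / 457k)
# versus FC-30B-SFT (75.0 / 356k), values taken from the table.
print(deltas(71.7, 457, 75.0, 356))  # (3.3, -22.1)
```

The score delta is an absolute difference in points, while the token delta is relative to the baseline token count.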

**Highlights:**

- In the table above, FastContext improves end-to-end accuracy in every configuration except one (Kimi-K2.6 with FC-4B-SFT on SWE-bench Multilingual, -1.0); the largest gain appears on SWE-bench Pro (GLM-5.1 with FC-4B-RL, +5.0).
- The largest token saving in the table is **50.7%** (GPT-5.4 with FC-30B-SFT on SWE-QA).
- The compact **4B-RL** explorer can outperform the larger **30B-SFT** explorer: on GLM-5.1 SWE-bench Pro it reaches 22.5 vs. 20.0 while using fewer main-agent tokens (2210k vs. 2370k).

## 3. Quick Start

Launch the model with an OpenAI-compatible server (e.g. SGLang). The example below serves the 4B explorer:

```bash
python3 -m sglang.launch_server \
  --model-path FastContext-1.0-4B-SFT \
  --tool-call-parser qwen \
  --context-length 262144 \
  --trust-remote-code \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp-size 1 \
  --mem-fraction-static 0.8
```
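
Once the server is up, a client can send a chat-completions request that declares the three tools. The sketch below only builds the request with the Python standard library; it assumes the standard OpenAI-compatible `/v1/chat/completions` route on the port used above, and the tool parameter names (`path`, `pattern`) are illustrative placeholders rather than the schemas shipped with FastContext.

```python
import json
import urllib.request

# Port taken from the launch command; the route is the usual OpenAI-compatible one.
BASE_URL = "http://localhost:30000/v1/chat/completions"
MODEL = "FastContext-1.0-4B-SFT"  # same string as --model-path in the launch command


def tool(name, description, params):
    """Build one OpenAI-style function tool entry (parameter names are assumed)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in params},
                "required": list(params),
            },
        },
    }


def build_request(query):
    """Return an unsent POST request carrying the query and tool declarations."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": query}],
        "tools": [
            tool("READ", "Return line-numbered file contents", ["path"]),
            tool("GLOB", "Path discovery by glob pattern", ["pattern"]),
            tool("GREP", "Regex search over repository text", ["pattern"]),
        ],
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Where is the retry logic implemented?")
# With the server running, the request could be sent with:
#   response = json.load(urllib.request.urlopen(req))
```

The request is constructed but not sent, so the snippet runs without a live server.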

FastContext exposes only three read-only tools to the model:

| Tool | Purpose |
|---|---|
| `READ` | Return line-numbered file contents |
| `GLOB` | Path discovery by glob pattern |
| `GREP` | Regex search over repository text (ripgrep-style) |

At each turn the explorer either issues one or more (parallel) tool calls or stops with a final `<final_answer>` evidence list.

Wire FastContext into a coding agent (e.g. Mini-SWE-Agent) as an exploration subagent that the main agent can invoke on demand.
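
To make the tool contract concrete, here is a minimal sketch of what read-only implementations of the three tools could look like on the agent side. The function names, signatures, and output formats (`line<TAB>text` for reads, `path:line:text` for matches) are assumptions for illustration; the actual tool definitions live in the FastContext repository.

```python
import re
import tempfile
from pathlib import Path


def read_file(root, rel_path, start=1, end=None):
    """READ: return line-numbered contents (1-based, inclusive range)."""
    lines = (Path(root) / rel_path).read_text(encoding="utf-8").splitlines()
    end = len(lines) if end is None else min(end, len(lines))
    return "\n".join(f"{n}\t{lines[n - 1]}" for n in range(start, end + 1))


def glob_paths(root, pattern):
    """GLOB: return sorted repository-relative file paths matching a glob."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix()
                  for p in base.glob(pattern) if p.is_file())


def grep_text(root, regex, pattern="**/*"):
    """GREP: return 'path:line:text' for every line matching the regex."""
    rx = re.compile(regex)
    hits = []
    for rel in glob_paths(root, pattern):
        try:
            text = (Path(root) / rel).read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8 text
        for n, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{rel}:{n}:{line}")
    return hits


# Tiny throwaway repository so the sketch is runnable end to end.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "pkg").mkdir()
    (Path(repo) / "pkg" / "util.py").write_text(
        "import os\n\ndef retry(fn):\n    return fn()\n", encoding="utf-8")
    print(glob_paths(repo, "**/*.py"))
    print(grep_text(repo, r"def \w+"))
    print(read_file(repo, "pkg/util.py", 3, 4))
```

None of the three functions writes to disk, which is what allows several calls to run in parallel within a single turn.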

## 4. Training Recipe

FastContext is trained in two stages:

- **Supervised fine-tuning (SFT):** The model is trained on exploration traces split into three sources that match the subagent's runtime behavior: `parallel_toolcalls` (broad first-turn search), `multiturn_traj` (multi-turn evidence gathering), and `linerange` (precise citation generation).
- **Reinforcement learning (RL):** The model is rolled out as the actual subagent and optimized with **GRPO** using a deterministic reward that combines file- and line-level F1, a bonus for bounded parallel exploration, and format penalties.
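
A reward of the shape described above could look like the following sketch. Only the ingredients (file-level F1, line-level F1, a parallelism bonus, a format penalty) come from the description; the weights, the parallelism bound, the exact bonus condition, and the citation representation (`{path: [(start, end), ...]}`) are assumptions chosen for illustration and should not be read as the values used in training.

```python
def f1(pred, gold):
    """Set-based F1; returns 0.0 when there is no overlap."""
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def expand(citations):
    """Turn {path: [(start, end), ...]} into a set of (path, line) pairs."""
    return {(path, n)
            for path, ranges in citations.items()
            for start, end in ranges
            for n in range(start, end + 1)}


def reward(pred, gold, calls_per_turn, well_formed,
           w_file=0.5, w_line=0.5, bonus=0.1, max_parallel=8, penalty=1.0):
    """Deterministic reward: weighted F1 terms, bonus, format penalty.

    All default values are hypothetical.
    """
    if not well_formed:  # e.g. missing or malformed <final_answer> block
        return -penalty
    score = (w_file * f1(set(pred), set(gold))
             + w_line * f1(expand(pred), expand(gold)))
    # Bonus only when some turn was parallel and no turn exceeded the bound.
    if calls_per_turn and 1 < max(calls_per_turn) <= max_parallel:
        score += bonus
    return score


gold = {"a.py": [(10, 19)]}
pred = {"a.py": [(10, 14)], "b.py": [(1, 5)]}
print(reward(pred, gold, calls_per_turn=[3, 2], well_formed=True))
```

In this example the file-level F1 is 2/3 (one of two predicted files is correct), the line-level F1 is 0.5 (5 of 10 predicted lines overlap 5 of 10 gold lines), and the bonus applies, so the reward is roughly 0.68.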

## License

This project is licensed under the MIT License.

## Citation

```bibtex
@misc{zhang2026fastcontexttrainingefficientrepository,
  title={FastContext: Training Efficient Repository Explorer for Coding Agents},
  author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Tori Gong and Sheng Chen and Rao Fu and Anisha Agarwal and Spandan Garg and Gabriel Ryan and Colin Merkel and Yufan Huang and Shengyu Fu},
  year={2026},
  eprint={2606.14066},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2606.14066},
}
```