VibecoderMcSwaggins commited on
Commit
ef20d17
Β·
1 Parent(s): a4327d1

docs: Add P2 bug report for 7B model producing garbage output

Browse files

This commit introduces a new documentation file detailing a P2 bug where the Qwen2.5-7B-Instruct model generates incoherent streaming output, displaying random tokens instead of meaningful responses. The report includes symptoms, reproduction steps, root cause analysis, impact assessment, potential solutions, and a recommended action plan.

Key findings indicate that the 7B model lacks the reasoning capacity for complex multi-agent prompts, necessitating a review of model selection and architecture for the Free Tier.

Files added:
- P2_7B_MODEL_GARBAGE_OUTPUT.md

P2_7B_MODEL_GARBAGE_OUTPUT.md ADDED
@@ -0,0 +1,224 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # P2 Bug: 7B Model Produces Garbage Streaming Output
2
+
3
+ **Date**: 2025-12-02
4
+ **Status**: OPEN - Investigating
5
+ **Severity**: P2 (Major - Degrades User Experience)
6
+ **Component**: Free Tier / HuggingFace + Multi-Agent Orchestration
7
+
8
+ ---
9
+
10
+ ## Symptoms
11
+
12
+ When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** instead of coherent agent reasoning:
13
+
14
+ ```
15
+ πŸ“‘ **STREAMING**: yarg
16
+ πŸ“‘ **STREAMING**: PostalCodes
17
+ πŸ“‘ **STREAMING**: PostalCodes
18
+ πŸ“‘ **STREAMING**: FunctionFlags
19
+ πŸ“‘ **STREAMING**: search_pubmed
20
+ πŸ“‘ **STREAMING**: search_clinical_trials
21
+ πŸ“‘ **STREAMING**: system
22
+ πŸ“‘ **STREAMING**: Transferred to searcher, adopt the persona immediately.
23
+ ```
24
+
25
+ The model outputs random tokens like "yarg", "PostalCodes", "FunctionFlags" instead of actual research reasoning.
26
+
27
+ ---
28
+
29
+ ## Reproduction Steps
30
+
31
+ 1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
32
+ 2. Leave API key empty (Free Tier)
33
+ 3. Click any example query or type a question
34
+ 4. Click submit
35
+ 5. Observe streaming output - garbage tokens appear
36
+
37
+ **Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
38
+ **Actual**: Random tokens like "yarg", "PostalCodes"
39
+
40
+ ---
41
+
42
+ ## Root Cause Analysis
43
+
44
+ ### Primary Cause: 7B Model Too Small for Multi-Agent Prompts
45
+
46
+ The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:
47
+
48
+ 1. **Adopt agent personas** with specialized instructions
49
+ 2. **Follow structured workflows** (Search β†’ Judge β†’ Hypothesis β†’ Report)
50
+ 3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
51
+ 4. **Generate JSON-formatted progress ledgers** for workflow control
52
+ 5. **Understand manager instructions** and delegate appropriately
53
+
54
+ A 7B parameter model simply does not have the reasoning depth to handle this. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).
55
+
56
+ ### Technical Flow (Where Garbage Appears)
57
+
58
+ ```
59
+ User Query
60
+ ↓
61
+ AdvancedOrchestrator.run() [advanced.py:247]
62
+ ↓
63
+ workflow.run_stream(task) [builds Magentic workflow]
64
+ ↓
65
+ MagenticAgentDeltaEvent emitted with event.text
66
+ ↓
67
+ Yields AgentEvent(type="streaming", message=event.text) [advanced.py:314-319]
68
+ ↓
69
+ Gradio displays: "πŸ“‘ **STREAMING**: {garbage}"
70
+ ```
71
+
72
+ The garbage tokens are **raw model output**. The 7B model is:
73
+ - Not following the system prompt
74
+ - Outputting partial/incomplete token sequences
75
+ - Possibly attempting tool calls but formatting incorrectly
76
+ - Hallucinating random words
77
+
78
+ ### Evidence from Microsoft Reference Framework
79
+
80
+ The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:
81
+
82
+ ```python
83
+ async for update in agent.run_stream(messages=self._chat_history):
84
+ updates.append(update)
85
+ await self._emit_agent_delta_event(ctx, update)
86
+ ```
87
+
88
+ The framework passes through whatever the underlying chat client produces. If the model produces garbage, the framework streams it directly.
89
+
90
+ ### Why Click Example vs Submit Shows Different Initial State
91
+
92
+ Both code paths go through the same `research_agent()` function in `app.py`. The difference:
93
+
94
+ - **Example click**: Immediately submits query, so you see garbage quickly
95
+ - **Submit button click**: Shows "Starting research (Advanced mode)" banner first, then garbage
96
+
97
+ Both ultimately produce the same garbage output from the 7B model.
98
+
99
+ ---
100
+
101
+ ## Impact Assessment
102
+
103
+ | Aspect | Impact |
104
+ |--------|--------|
105
+ | Free Tier Users | Cannot get usable research results |
106
+ | Demo Quality | Appears broken/unprofessional |
107
+ | Trust | Users may think the entire system is broken |
108
+ | Differentiation | Undermines "free tier works!" messaging |
109
+
110
+ ---
111
+
112
+ ## Potential Solutions
113
+
114
+ ### Option 1: Switch to Better Small Model (Recommended - Quick Fix)
115
+
116
+ Find a small model that better handles complex instructions. Candidates:
117
+
118
+ | Model | Size | Tool Calling | Instruction Following |
119
+ |-------|------|--------------|----------------------|
120
+ | `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
121
+ | `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
122
+ | `google/gemma-2-9b-it` | 9B | Yes | Good |
123
+ | `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |
124
+
125
+ **Risk**: 14B model might still be routed to third-party providers. Need to test each.
126
+
127
+ ### Option 2: Simplify Free Tier Architecture
128
+
129
+ Create a **simpler single-agent mode** for Free Tier:
130
+ - Remove multi-agent coordination (Manager, multiple ChatAgents)
131
+ - Use a single direct query β†’ search β†’ synthesize flow
132
+ - Reduce prompt complexity significantly
133
+
134
+ **Pros**: More reliable with smaller models
135
+ **Cons**: Loses sophisticated multi-agent research capability
136
+
137
+ ### Option 3: Output Filtering/Validation
138
+
139
+ Add validation layer to detect and filter garbage output:
140
+
141
+ ```python
142
+ def is_valid_streaming_token(text: str) -> bool:
143
+ """Check if streaming token appears valid."""
144
+ # Garbage patterns we've seen
145
+ garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
146
+ if any(g in text for g in garbage_patterns):
147
+ return False
148
+ # Check for minimum coherence (has spaces, reasonable length)
149
+ return len(text) > 0 and text.strip()
150
+ ```
151
+
152
+ **Pros**: Band-aid fix, quick to implement
153
+ **Cons**: Doesn't fix root cause, will miss new garbage patterns
154
+
155
+ ### Option 4: Graceful Degradation
156
+
157
+ Detect when model output is incoherent and fall back to:
158
+ - Returning an error message
159
+ - Suggesting user provide an API key
160
+ - Using a cached/templated response
161
+
162
+ ### Option 5: Prompt Engineering for 7B Models
163
+
164
+ Significantly simplify the agent prompts for 7B compatibility:
165
+ - Shorter system prompts
166
+ - More explicit step-by-step instructions
167
+ - Remove abstract concepts
168
+ - Use few-shot examples
169
+
170
+ ---
171
+
172
+ ## Recommended Action Plan
173
+
174
+ ### Phase 1: Quick Fix (P2)
175
+ 1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
176
+ 2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
177
+ 3. Evaluate output quality on sample queries
178
+
179
+ ### Phase 2: Architecture Review (P3)
180
+ 1. Consider simplified single-agent mode for Free Tier
181
+ 2. Design graceful degradation when model output is invalid
182
+ 3. Add output validation layer
183
+
184
+ ### Phase 3: Long-term (P4)
185
+ 1. Consider hybrid approach: simple mode for free tier, advanced for paid
186
+ 2. Explore fine-tuning a small model specifically for research agent tasks
187
+
188
+ ---
189
+
190
+ ## Files Involved
191
+
192
+ | File | Relevance |
193
+ |------|-----------|
194
+ | `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
195
+ | `src/clients/huggingface.py` | HuggingFace chat client adapter |
196
+ | `src/agents/magentic_agents.py` | Agent definitions and prompts |
197
+ | `src/app.py` | Gradio UI, event display |
198
+ | `src/utils/config.py` | Model configuration |
199
+
200
+ ---
201
+
202
+ ## Relation to Previous Bugs
203
+
204
+ - **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - Was about `<generator object>` appearing due to async generator mishandling
205
+ - **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - Was about 72B models being routed to failing third-party providers
206
+
207
+ This P2 bug is **downstream** of the P1 fix - we fixed the 500 errors by switching to 7B, but now the 7B model doesn't produce quality output.
208
+
209
+ ---
210
+
211
+ ## Questions to Investigate
212
+
213
+ 1. What models in the 7-20B range stay on HuggingFace native infrastructure?
214
+ 2. Can we detect third-party routing before making the full request?
215
+ 3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting)
216
+ 4. Are there HuggingFace serverless models specifically optimized for tool calling?
217
+
218
+ ---
219
+
220
+ ## References
221
+
222
+ - `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
223
+ - `CLAUDE.md` - Critical HuggingFace Free Tier section
224
+ - Microsoft Agent Framework `_magentic.py` - Reference implementation
P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md ADDED
@@ -0,0 +1,160 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # P3 Tech Debt: Remove Anthropic Partial Wiring
2
+
3
+ **Date**: 2025-12-03
4
+ **Status**: OPEN
5
+ **Severity**: P3 (Tech Debt / Simplification)
6
+ **Component**: Architecture / Provider Integration
7
+
8
+ ---
9
+
10
+ ## Summary
11
+
12
+ Remove all Anthropic-related code, configuration, and references from the codebase. Anthropic is partially wired but **not fully threaded through the architecture**, creating confusion and half-implemented code paths.
13
+
14
+ ---
15
+
16
+ ## Rationale
17
+
18
+ ### 1. Anthropic Does NOT Provide Embeddings
19
+
20
+ Our architecture requires embeddings for:
21
+ - RAG (LlamaIndex/ChromaDB)
22
+ - Evidence deduplication
23
+ - Semantic search
24
+
25
+ Anthropic only provides chat completion, not embeddings. This means even with a working Anthropic chat client, users would need a **second provider** for embeddings, breaking the unified experience.
26
+
27
+ ### 2. Partial Implementation Creates Confusion
28
+
29
+ Current state:
30
+ - `settings.anthropic_api_key` exists βœ…
31
+ - `settings.has_anthropic_key` property exists βœ…
32
+ - `settings.anthropic_model` configured βœ…
33
+ - `AnthropicChatClient` for agent_framework **DOES NOT EXIST** ❌
34
+ - Code raises `NotImplementedError` when Anthropic detected ❌
35
+
36
+ This half-state causes:
37
+ - User confusion ("Why doesn't my Anthropic key work?")
38
+ - Developer confusion ("Is Anthropic supported or not?")
39
+ - Dead code paths that need maintenance
40
+
41
+ ### 3. Unified Architecture Principle
42
+
43
+ **Principle**: Only support providers that work **end-to-end** through the entire stack:
44
+
45
+ ```
46
+ Provider Requirements:
47
+ β”œβ”€β”€ Chat Completion (for agents) βœ… Required
48
+ β”œβ”€β”€ Function/Tool Calling βœ… Required
49
+ β”œβ”€β”€ Embeddings (for RAG) βœ… Required
50
+ └── Streaming βœ… Required
51
+ ```
52
+
53
+ | Provider | Chat | Tools | Embeddings | Streaming | Status |
54
+ |----------|------|-------|------------|-----------|--------|
55
+ | OpenAI | βœ… | βœ… | βœ… | βœ… | **KEEP** |
56
+ | HuggingFace | βœ… | βœ… | βœ… (local) | βœ… | **KEEP** |
57
+ | Gemini | βœ… | βœ… | βœ… | βœ… | Future (Phase 4) |
58
+ | Anthropic | βœ… | βœ… | ❌ | βœ… | **REMOVE** |
59
+
60
+ ---
61
+
62
+ ## Files to Clean Up
63
+
64
+ ### Configuration
65
+ - [ ] `src/utils/config.py` - Remove `anthropic_api_key`, `anthropic_model`, `has_anthropic_key`
66
+
67
+ ### Client Factory
68
+ - [ ] `src/clients/factory.py` - Remove Anthropic detection and `NotImplementedError`
69
+
70
+ ### Legacy Code (pydantic-ai based)
71
+ - [ ] `src/utils/llm_factory.py` - Remove `AnthropicModel`, `AnthropicProvider` imports and handling
72
+ - [ ] `src/agent_factory/judges.py` - Remove Anthropic model selection
73
+
74
+ ### App/UI
75
+ - [ ] `src/app.py` - Remove `has_anthropic_key` checks and "Anthropic from env" backend info
76
+
77
+ ### Documentation
78
+ - [ ] `CLAUDE.md` - Update LLM provider list
79
+ - [ ] `AGENTS.md` - Update LLM provider list
80
+ - [ ] `GEMINI.md` - Update LLM provider list
81
+
82
+ ### Tests
83
+ - [ ] `tests/unit/clients/test_chat_client_factory.py` - Remove Anthropic test cases
84
+ - [ ] `tests/unit/utils/test_config.py` - Remove Anthropic config tests
85
+
86
+ ---
87
+
88
+ ## Code Snippets to Remove
89
+
90
+ ### `src/utils/config.py`
91
+ ```python
92
+ # REMOVE these lines:
93
+ anthropic_api_key: str | None = Field(default=None, description="Anthropic API key")
94
+ anthropic_model: str = Field(
95
+ default="claude-sonnet-4-5-20250929", description="Anthropic model"
96
+ )
97
+
98
+ @property
99
+ def has_anthropic_key(self) -> bool:
100
+ """Check if Anthropic API key is available."""
101
+ return bool(self.anthropic_api_key)
102
+ ```
103
+
104
+ ### `src/clients/factory.py`
105
+ ```python
106
+ # REMOVE these lines:
107
+ if api_key.startswith("sk-ant-"):
108
+ normalized = "anthropic"
109
+
110
+ if normalized == "anthropic":
111
+ raise NotImplementedError(
112
+ "Anthropic client not yet implemented. "
113
+ "Use OpenAI key (sk-...) or leave empty for free HuggingFace tier."
114
+ )
115
+ ```
116
+
117
+ ### `src/app.py`
118
+ ```python
119
+ # REMOVE these lines:
120
+ elif settings.has_anthropic_key:
121
+ backend_info = "Paid API (Anthropic from env)"
122
+
123
+ has_anthropic = settings.has_anthropic_key
124
+ has_paid_key = has_openai or has_anthropic or bool(user_api_key)
125
+ # Change to:
126
+ has_paid_key = has_openai or bool(user_api_key)
127
+ ```
128
+
129
+ ---
130
+
131
+ ## Migration Notes
132
+
133
+ ### For Users with Anthropic Keys
134
+
135
+ If users have `ANTHROPIC_API_KEY` set in their environment:
136
+ 1. It will be **silently ignored** (not an error)
137
+ 2. System falls through to HuggingFace free tier
138
+ 3. Users should use `OPENAI_API_KEY` instead for paid tier
139
+
140
+ ### Future Consideration
141
+
142
+ If Anthropic adds embeddings API in the future, we can re-add support. But until then, partial support creates more confusion than value.
143
+
144
+ ---
145
+
146
+ ## Definition of Done
147
+
148
+ - [ ] All Anthropic references removed from `src/`
149
+ - [ ] All Anthropic tests removed or updated
150
+ - [ ] Documentation updated to reflect supported providers: OpenAI, HuggingFace, (future: Gemini)
151
+ - [ ] `make check` passes (lint, typecheck, tests)
152
+ - [ ] PR reviewed and merged
153
+
154
+ ---
155
+
156
+ ## Related Documents
157
+
158
+ - `P2_7B_MODEL_GARBAGE_OUTPUT.md` - Current free tier model quality issues
159
+ - `HF_FREE_TIER_ANALYSIS.md` - HuggingFace provider routing analysis
160
+ - `CLAUDE.md` - Agent context with provider documentation