fix(P1): Switch to Qwen2.5-7B to avoid HuggingFace third-party provider failures (#118)
* docs: Add P1 bug doc for Simple Mode removal breaking Free Tier UX
SPEC-16 Unified Architecture removed Simple Mode, forcing all users
to Advanced Mode. When no API key is provided, Advanced Mode falls back
to HuggingFace Free Tier which triggers upstream agent-framework repr
bug (#2562).
Options documented:
A) Wait for upstream fix (PR #2566)
B) Restore Simple Mode for free tier
C) Current workaround in _extract_text()
* docs: Update P1 bug doc and SPEC-16 with rollback warning
CRITICAL: Simple Mode was deleted BEFORE verifying Advanced+HF worked.
Problem:
- Upstream agent-framework has repr bug (#2562)
- Advanced Mode + HuggingFace = garbage output
- Simple Mode (the working fallback) was deleted prematurely
Bug doc updates:
- Added "What Went Wrong" timeline
- Added Gradio UI confusion analysis (examples vs chat button)
- Recommendation: Restore Simple Mode as fallback
SPEC-16 updates:
- Status changed to "PARTIALLY IMPLEMENTED - ROLLBACK REQUIRED"
- Added critical warning about premature deletion
- Links to P1 bug doc for action items
* docs: CRITICAL - Simple Mode is NOT being deleted
This commit makes it CRYSTAL CLEAR across all documentation:
SIMPLE MODE IS NOT BEING DELETED - NON-NEGOTIABLE
What went wrong:
- SPEC-16 was supposed to INTEGRATE Simple Mode, not DELETE it
- simple.py was deleted BEFORE verifying Advanced+HF worked
- Upstream agent-framework has repr bug (#2562)
- Free tier users now have no working fallback
Required actions:
1. RESTORE simple.py from git history or MCP reference
2. KEEP Simple Mode as free-tier fallback indefinitely
3. Use Advanced Mode ONLY for paid API key users
4. Wait for upstream #2566 to merge before reconsidering
Updated files:
- SPEC_16: Status changed to "ON HOLD", added warning
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS: Changed to "Patch simple.py"
- ACTIVE_BUGS: Marked Simple Mode issues as OPEN
* docs: DO NOT use MCP reference repo - it's buggy
Updated all docs and GitHub issues to clarify:
1. DO NOT restore from MCP reference repo - has known bugs
2. Git revert in THIS repo MAY be possible - review for bugs first
3. Clean implementation preferred if old code is too buggy
4. Goal is WORKING Simple Mode, not blindly restored buggy code
Files updated:
- ACTIVE_BUGS.md
- SPEC_16_UNIFIED_CHAT_CLIENT_ARCHITECTURE.md
- P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md
GitHub issues updated:
- #105: Added warning about reference repo
- #113: Added warning about reference repo
* docs: Clarify UNIFIED architecture with Simple Mode INTEGRATED
- NOT two parallel universes/orchestrators
- ONE codebase handles all tiers (free + paid)
- Simple Mode behavior INTEGRATED, not separate
- Blocked by upstream bug #2562, waiting for PR #2566
* docs: Add architecture documentation for unified system
- Current state: Advanced Mode only, simple.py deleted
- Goal: ONE unified architecture (not parallel universes)
- Simple Mode INTEGRATED via HuggingFaceChatClient
- Blocked by upstream #2562, waiting for PR #2566
- Includes path forward for all scenarios
* docs: Update all bug docs for unified architecture consistency
- ACTIVE_BUGS.md: Consolidated free tier issue as single P0 blocker
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md: Simplified - bug fixed by unification
- All docs now consistently say: ONE unified architecture, NOT parallel universes
- Simple Mode behavior INTEGRATED via HuggingFaceChatClient
- simple.py is DELETED, not being restored
* docs: FINAL - Clear terminology, framework integration documented
Architecture:
- No API Key (Free) → HuggingFace backend
- API Key (Paid) → OpenAI backend
- ONE codebase, different backends, no "modes"
Framework Stack:
- Microsoft Agent Framework = orchestration (routes agents)
- Pydantic AI = structured outputs (validates data)
- Both work TOGETHER, not mutually exclusive
Blocked by upstream #2562, waiting for PR #2566.
All docs and GitHub issues now use consistent terminology.
* docs: Fix root-level docs for unified architecture
- CLAUDE.md: Remove simple.py reference, update orchestrator description
- AGENTS.md: Same fix
- GEMINI.md: Same fix
- README.md: "Two Modes" → "Unified Architecture" + Free/Paid Tier
All root docs now consistent with unified architecture:
- ONE orchestrator (advanced.py) for all users
- Auto-selects backend: OpenAI (if key) or HuggingFace (free)
- No more "Simple Mode" vs "Advanced Mode" terminology
* fix: Switch default HF model to Qwen2.5-7B to avoid Novita 500 errors
- HF_FREE_TIER_ANALYSIS.md +68 -0
- src/utils/config.py +4 -3
HF_FREE_TIER_ANALYSIS.md (new file):

# Hugging Face Free Tier Reliability Analysis (December 2025)

## Executive Summary

**Root Cause:** The recurring 500/401 errors on the Free Tier (Advanced Mode without API keys) are caused by implicit routing of large models (70B+) to unstable third-party "Inference Providers" (Novita, Hyperbolic) instead of running natively on Hugging Face's infrastructure.

**Solution:** Switch the default Free Tier model from flagship-class models (72B) to high-performance mid-sized models (7B-32B) that are hosted natively by Hugging Face's Serverless Inference API.

---

## 1. The "Inference Providers" Trap

Hugging Face offers two distinct execution paths for its Inference API:

1. **Serverless Inference API (Native):**
   * **Host:** Hugging Face's own infrastructure.
   * **Reliability:** High (direct control).
   * **Constraints:** Limited to models that fit on standard inference hardware (typically <10-30GB VRAM usage).
   * **Typical Models:** `bert-base`, `gpt2`, `Mistral-7B`, `Qwen2.5-7B`.

2. **Inference Providers (Third-Party Marketplace):**
   * **Host:** Partners like Novita, Hyperbolic, Together AI, SambaNova.
   * **Reliability:** Variable. "Staging mode" authentication issues, rate limits, and service outages (500 errors) are common on the free routing layer.
   * **Purpose:** To serve massive models (Llama-3.1-405B, Qwen2.5-72B) that are too expensive for HF to host for free.

**The Problem:**
When we request `Qwen/Qwen2.5-72B-Instruct` (or `Llama-3.1-70B`) without an API key, HF transparently routes the request to a partner (Novita/Hyperbolic).
* **Novita Status:** Currently returning 500 Internal Server Errors.
* **Hyperbolic Status:** Previously returned 401 Unauthorized (staging-mode auth bug).

We are effectively relying on a "best effort" chain of third-party providers for our core application stability.

## 2. The "Golden Path" for Free Tier

To ensure stability, the Free Tier must target models that reside on the **Native** path.

**Criteria for Native Stability:**
* **Size:** <30B parameters (ideal: 7B-12B).
* **Popularity:** "Warm" models (high traffic keeps them loaded in memory).
* **Architecture:** Standard transformers (easy for HF to serve).

**Candidate Models (Dec 2025):**

| Model | Size | Provider Risk | Native Capability |
|-------|------|---------------|-------------------|
| **Qwen/Qwen2.5-7B-Instruct** | 7B | **Low** | **Excellent** (Math: 75.5, Code: 84.8) |
| **mistralai/Mistral-Nemo-Instruct-2407** | 12B | Low | Very Good |
| **Qwen/Qwen2.5-72B-Instruct** | 72B | **High** (Novita) | Excellent (but unreliable) |
| **meta-llama/Llama-3.1-70B-Instruct** | 70B | **High** (Hyperbolic) | Excellent (but unreliable) |

## 3. Recommendation

**Immediate Fix:**
Change the default `HUGGINGFACE_MODEL` in `src/utils/config.py` from `Qwen/Qwen2.5-72B-Instruct` to **`Qwen/Qwen2.5-7B-Instruct`**.

**Why Qwen2.5-7B?**
* **Performance:** Outperforms Llama-3.1-8B and matches GPT-3.5 levels on many benchmarks.
* **Reliability:** Small enough to be hosted natively.
* **Context:** 128k context window (well suited to RAG).

## 4. Future Architecture (Unified Client)

For the Unified Chat Client architecture:
1. **Tier 0 (Free):** Hardcoded to native models (Qwen 7B, Mistral Nemo).
2. **Tier 1 (BYO Key):** Allow the user to select any model (70B+), assuming their key grants access to premium providers or the PRO tier.

---
*Analysis performed by Gemini CLI Agent, Dec 2, 2025*
@@ -36,10 +36,11 @@ class Settings(BaseSettings):
|
|
| 36 |
default="claude-sonnet-4-5-20250929", description="Anthropic model"
|
| 37 |
)
|
| 38 |
# HuggingFace (free tier)
|
| 39 |
-
# NOTE:
|
| 40 |
-
# Qwen2.5-
|
|
|
|
| 41 |
huggingface_model: str | None = Field(
|
| 42 |
-
default="Qwen/Qwen2.5-
|
| 43 |
)
|
| 44 |
hf_token: str | None = Field(
|
| 45 |
default=None, alias="HF_TOKEN", description="HuggingFace API token"
|
|
|
|
| 36 |
default="claude-sonnet-4-5-20250929", description="Anthropic model"
|
| 37 |
)
|
| 38 |
# HuggingFace (free tier)
|
| 39 |
+
# NOTE: Large models (70B+) are routed to third-party providers (Novita, Hyperbolic) which are
|
| 40 |
+
# unreliable (500/401 errors). We use Qwen2.5-7B-Instruct as it is small enough to run on
|
| 41 |
+
# Hugging Face's native serverless infrastructure.
|
| 42 |
huggingface_model: str | None = Field(
|
| 43 |
+
default="Qwen/Qwen2.5-7B-Instruct", description="HuggingFace model name"
|
| 44 |
)
|
| 45 |
hf_token: str | None = Field(
|
| 46 |
default=None, alias="HF_TOKEN", description="HuggingFace API token"
|