VibecoderMcSwaggins committed on
Commit
ea28d9c
·
unverified ·
1 Parent(s): 78ec52a

fix(P1): Switch to Qwen2.5-7B to avoid HuggingFace third-party provider failures (#118)


* docs: Add P1 bug doc for Simple Mode removal breaking Free Tier UX

SPEC-16 Unified Architecture removed Simple Mode, forcing all users
into Advanced Mode. When no API key is provided, Advanced Mode falls back
to the HuggingFace Free Tier, which triggers the upstream agent-framework
repr bug (#2562).

Options documented:
A) Wait for upstream fix (PR #2566)
B) Restore Simple Mode for free tier
C) Current workaround in _extract_text()

* docs: Update P1 bug doc and SPEC-16 with rollback warning

CRITICAL: Simple Mode was deleted BEFORE verifying Advanced+HF worked.

Problem:
- Upstream agent-framework has repr bug (#2562)
- Advanced Mode + HuggingFace = garbage output
- Simple Mode (the working fallback) was deleted prematurely

Bug doc updates:
- Added "What Went Wrong" timeline
- Added Gradio UI confusion analysis (examples vs chat button)
- Recommendation: Restore Simple Mode as fallback

SPEC-16 updates:
- Status changed to "PARTIALLY IMPLEMENTED - ROLLBACK REQUIRED"
- Added critical warning about premature deletion
- Links to P1 bug doc for action items

* docs: CRITICAL - Simple Mode is NOT being deleted

This commit makes it CRYSTAL CLEAR across all documentation:

⛔ SIMPLE MODE IS NOT BEING DELETED - NON-NEGOTIABLE

What went wrong:
- SPEC-16 was supposed to INTEGRATE Simple Mode, not DELETE it
- simple.py was deleted BEFORE verifying Advanced+HF worked
- Upstream agent-framework has repr bug (#2562)
- Free tier users now have no working fallback

Required actions:
1. RESTORE simple.py from git history or MCP reference
2. KEEP Simple Mode as free-tier fallback indefinitely
3. Use Advanced Mode ONLY for paid API key users
4. Wait for upstream #2566 to merge before reconsidering

Updated files:
- SPEC_16: Status changed to "ON HOLD", added warning
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS: Changed to "Patch simple.py"
- ACTIVE_BUGS: Marked Simple Mode issues as OPEN

* docs: DO NOT use MCP reference repo - it's buggy

Updated all docs and GitHub issues to clarify:

1. DO NOT restore from MCP reference repo - has known bugs
2. Git revert in THIS repo MAY be possible - review for bugs first
3. Clean implementation preferred if old code is too buggy
4. Goal is WORKING Simple Mode, not blindly restored buggy code

Files updated:
- ACTIVE_BUGS.md
- SPEC_16_UNIFIED_CHAT_CLIENT_ARCHITECTURE.md
- P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md

GitHub issues updated:
- #105: Added warning about reference repo
- #113: Added warning about reference repo

* docs: Clarify UNIFIED architecture with Simple Mode INTEGRATED

- NOT two parallel universes/orchestrators
- ONE codebase handles all tiers (free + paid)
- Simple Mode behavior INTEGRATED, not separate
- Blocked by upstream bug #2562, waiting for PR #2566

* docs: Add architecture documentation for unified system

- Current state: Advanced Mode only, simple.py deleted
- Goal: ONE unified architecture (not parallel universes)
- Simple Mode INTEGRATED via HuggingFaceChatClient
- Blocked by upstream #2562, waiting for PR #2566
- Includes path forward for all scenarios

* docs: Update all bug docs for unified architecture consistency

- ACTIVE_BUGS.md: Consolidated free tier issue as single P0 blocker
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md: Simplified - bug fixed by unification
- All docs now consistently say: ONE unified architecture, NOT parallel universes
- Simple Mode behavior INTEGRATED via HuggingFaceChatClient
- simple.py is DELETED, not being restored

* docs: FINAL - Clear terminology, framework integration documented

Architecture:
- No API Key (Free) β†’ HuggingFace backend
- API Key (Paid) β†’ OpenAI backend
- ONE codebase, different backends, no "modes"

Framework Stack:
- Microsoft Agent Framework = orchestration (routes agents)
- Pydantic AI = structured outputs (validates data)
- Both work TOGETHER, not mutually exclusive

Blocked by upstream #2562, waiting for PR #2566.

All docs and GitHub issues now use consistent terminology.

* docs: Fix root-level docs for unified architecture

- CLAUDE.md: Remove simple.py reference, update orchestrator description
- AGENTS.md: Same fix
- GEMINI.md: Same fix
- README.md: "Two Modes" β†’ "Unified Architecture" + Free/Paid Tier

All root docs now consistent with unified architecture:
- ONE orchestrator (advanced.py) for all users
- Auto-selects backend: OpenAI (if key) or HuggingFace (free)
- No more "Simple Mode" vs "Advanced Mode" terminology

* fix: Switch default HF model to Qwen2.5-7B to avoid Novita 500 errors

Files changed (2)
  1. HF_FREE_TIER_ANALYSIS.md +68 -0
  2. src/utils/config.py +4 -3
HF_FREE_TIER_ANALYSIS.md ADDED
@@ -0,0 +1,68 @@
+ # Hugging Face Free Tier Reliability Analysis (December 2025)
+
+ ## Executive Summary
+
+ **Root Cause:** The recurring 500/401 errors on the Free Tier (Advanced Mode without API keys) are caused by implicit routing of large models (70B+) to unstable third-party "Inference Providers" (Novita, Hyperbolic) rather than serving them natively on Hugging Face's own infrastructure.
+
+ **Solution:** Switch the default Free Tier model from a flagship-class model (72B) to a high-performance mid-sized model (7B-32B) that is hosted natively by Hugging Face's Serverless Inference API.
+
+ ---
+
+ ## 1. The "Inference Providers" Trap
+
+ Hugging Face offers two distinct execution paths for its Inference API:
+
+ 1. **Serverless Inference API (Native):**
+    * **Host:** Hugging Face's own infrastructure.
+    * **Reliability:** High (direct control).
+    * **Constraints:** Limited to models that fit on standard inference hardware (typically 10-30 GB of VRAM).
+    * **Typical Models:** `bert-base`, `gpt2`, `Mistral-7B`, `Qwen2.5-7B`.
+
+ 2. **Inference Providers (Third-Party Marketplace):**
+    * **Host:** Partners like Novita, Hyperbolic, Together AI, SambaNova.
+    * **Reliability:** Variable. "Staging mode" authentication issues, rate limits, and service outages (500 errors) are common on the free routing layer.
+    * **Purpose:** Serving massive models (Llama-3.1-405B, Qwen2.5-72B) that are too expensive for HF to host for free.
+
+ **The Problem:**
+ When we request `Qwen/Qwen2.5-72B-Instruct` (or `Llama-3.1-70B`) without an API key, HF transparently routes the request to a partner (Novita/Hyperbolic).
+ * **Novita Status:** Currently returning 500 Internal Server Errors.
+ * **Hyperbolic Status:** Previously returned 401 Unauthorized (staging-mode auth bug).
+
+ We are effectively relying on a "best effort" chain of third-party providers for our core application's stability.
+
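+ As a minimal sketch of the two paths (assuming `huggingface_hub` >= 0.28, where `InferenceClient` accepts a `provider` argument; verify against the installed version), the routing can be pinned explicitly instead of left to HF's implicit fallback:
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ messages = [{"role": "user", "content": "Say hello."}]
+
+ # Native path: force Hugging Face's own serverless infrastructure.
+ native = InferenceClient(provider="hf-inference")
+ print(native.chat_completion(messages, model="Qwen/Qwen2.5-7B-Instruct"))
+
+ # Marketplace path: what an anonymous 72B request falls back to implicitly.
+ # Pinning the provider at least makes the third-party dependency visible.
+ partner = InferenceClient(provider="novita")
+ print(partner.chat_completion(messages, model="Qwen/Qwen2.5-72B-Instruct"))
+ ```
+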
+ ## 2. The "Golden Path" for Free Tier
+
+ To ensure stability, the Free Tier must target models that reside on the **Native** path.
+
+ **Criteria for Native Stability:**
+ * **Size:** < 30B parameters (ideal: 7B-12B).
+ * **Popularity:** "Warm" models (high traffic keeps them loaded in memory).
+ * **Architecture:** Standard transformers (easy for HF to serve).
+
+ **Candidate Models (Dec 2025):**
+
+ | Model | Size | Provider Risk | Native Capability |
+ |-------|------|---------------|-------------------|
+ | **Qwen/Qwen2.5-7B-Instruct** | 7B | **Low** | **Excellent** (MATH: 75.5, HumanEval: 84.8) |
+ | **mistralai/Mistral-Nemo-Instruct-2407** | 12B | Low | Very Good |
+ | **Qwen/Qwen2.5-72B-Instruct** | 72B | **High** (Novita) | Excellent (but unreliable) |
+ | **meta-llama/Llama-3.1-70B-Instruct** | 70B | **High** (Hyperbolic) | Excellent (but unreliable) |
+
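+ A quick way to sanity-check this table is to fire one tiny completion at each candidate and record which ones fail upstream (hypothetical probe script, not part of the codebase; the error class location may differ across `huggingface_hub` versions):
+
+ ```python
+ from huggingface_hub import InferenceClient
+ from huggingface_hub.errors import HfHubHTTPError
+
+ CANDIDATES = [
+     "Qwen/Qwen2.5-7B-Instruct",
+     "mistralai/Mistral-Nemo-Instruct-2407",
+     "Qwen/Qwen2.5-72B-Instruct",
+     "meta-llama/Llama-3.1-70B-Instruct",
+ ]
+
+ client = InferenceClient()  # anonymous free-tier call, implicit routing
+ for model in CANDIDATES:
+     try:
+         client.chat_completion(
+             [{"role": "user", "content": "ping"}], model=model, max_tokens=5
+         )
+         print(f"OK    {model}")
+     except HfHubHTTPError as err:  # surfaces the 500/401s described above
+         print(f"FAIL  {model}: {err}")
+ ```
+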
+ ## 3. Recommendation
+
+ **Immediate Fix:**
+ Change the default `HUGGINGFACE_MODEL` in `src/utils/config.py` from `Qwen/Qwen2.5-72B-Instruct` to **`Qwen/Qwen2.5-7B-Instruct`**.
+
+ **Why Qwen2.5-7B?**
+ * **Performance:** Outperforms Llama-3.1-8B and matches GPT-3.5-level performance on many benchmarks.
+ * **Reliability:** Small enough to be hosted natively.
+ * **Context:** 128k context window (well suited to RAG).
+
+ ## 4. Future Architecture (Unified Client)
+
+ For the Unified Chat Client architecture:
+ 1. **Tier 0 (Free):** Hardcoded to native models (Qwen 7B, Mistral Nemo).
+ 2. **Tier 1 (BYO Key):** Allow the user to select any model (70B+), assuming their key grants access to premium providers or the HF PRO tier.
+
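+ A sketch of what that split could look like in the unified client (illustrative only; `select_model` and the constant are hypothetical, not existing code):
+
+ ```python
+ NATIVE_FALLBACK = "Qwen/Qwen2.5-7B-Instruct"  # Tier 0: always HF-native
+
+ def select_model(api_key: str | None, requested: str | None = None) -> str:
+     """Tier 0 (no key): pin to a native model. Tier 1 (BYO key): honor the request."""
+     if api_key is None:
+         return NATIVE_FALLBACK  # never route free users to third-party providers
+     return requested or NATIVE_FALLBACK
+ ```
+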
+ ---
+ *Analysis performed by Gemini CLI Agent, Dec 2, 2025*
src/utils/config.py CHANGED
@@ -36,10 +36,11 @@ class Settings(BaseSettings):
         default="claude-sonnet-4-5-20250929", description="Anthropic model"
     )
     # HuggingFace (free tier)
-    # NOTE: Llama-3.1-70B is routed to Hyperbolic (partner) which has unreliable "staging mode"
-    # Qwen2.5-72B works reliably via HuggingFace's native infrastructure
+    # NOTE: Large models (70B+) are routed to third-party providers (Novita, Hyperbolic) which are
+    # unreliable (500/401 errors). We use Qwen2.5-7B-Instruct as it is small enough to run on
+    # Hugging Face's native serverless infrastructure.
     huggingface_model: str | None = Field(
-        default="Qwen/Qwen2.5-72B-Instruct", description="HuggingFace model name"
+        default="Qwen/Qwen2.5-7B-Instruct", description="HuggingFace model name"
     )
     hf_token: str | None = Field(
         default=None, alias="HF_TOKEN", description="HuggingFace API token"