VibecoderMcSwaggins committed on
Commit
ea28d9c
·
unverified ·
1 Parent(s): 78ec52a

fix(P1): Switch to Qwen2.5-7B to avoid HuggingFace third-party provider failures (#118)


* docs: Add P1 bug doc for Simple Mode removal breaking Free Tier UX

SPEC-16 Unified Architecture removed Simple Mode, forcing all users
into Advanced Mode. When no API key is provided, Advanced Mode falls back
to the HuggingFace Free Tier, which triggers the upstream agent-framework
repr bug (#2562).

Options documented:
A) Wait for upstream fix (PR #2566)
B) Restore Simple Mode for free tier
C) Current workaround in _extract_text()

* docs: Update P1 bug doc and SPEC-16 with rollback warning

CRITICAL: Simple Mode was deleted BEFORE verifying Advanced+HF worked.

Problem:
- Upstream agent-framework has repr bug (#2562)
- Advanced Mode + HuggingFace = garbage output
- Simple Mode (the working fallback) was deleted prematurely

Bug doc updates:
- Added "What Went Wrong" timeline
- Added Gradio UI confusion analysis (examples vs chat button)
- Recommendation: Restore Simple Mode as fallback

SPEC-16 updates:
- Status changed to "PARTIALLY IMPLEMENTED - ROLLBACK REQUIRED"
- Added critical warning about premature deletion
- Links to P1 bug doc for action items

* docs: CRITICAL - Simple Mode is NOT being deleted

This commit makes it CRYSTAL CLEAR across all documentation:

⛔ SIMPLE MODE IS NOT BEING DELETED - NON-NEGOTIABLE

What went wrong:
- SPEC-16 was supposed to INTEGRATE Simple Mode, not DELETE it
- simple.py was deleted BEFORE verifying Advanced+HF worked
- Upstream agent-framework has repr bug (#2562)
- Free tier users now have no working fallback

Required actions:
1. RESTORE simple.py from git history or MCP reference
2. KEEP Simple Mode as free-tier fallback indefinitely
3. Use Advanced Mode ONLY for paid API key users
4. Wait for upstream #2566 to merge before reconsidering

Updated files:
- SPEC_16: Status changed to "ON HOLD", added warning
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS: Changed to "Patch simple.py"
- ACTIVE_BUGS: Marked Simple Mode issues as OPEN

* docs: DO NOT use MCP reference repo - it's buggy

Updated all docs and GitHub issues to clarify:

1. DO NOT restore from MCP reference repo - has known bugs
2. Git revert in THIS repo MAY be possible - review for bugs first
3. Clean implementation preferred if old code is too buggy
4. Goal is WORKING Simple Mode, not blindly restored buggy code

Files updated:
- ACTIVE_BUGS.md
- SPEC_16_UNIFIED_CHAT_CLIENT_ARCHITECTURE.md
- P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md

GitHub issues updated:
- #105: Added warning about reference repo
- #113: Added warning about reference repo

* docs: Clarify UNIFIED architecture with Simple Mode INTEGRATED

- NOT two parallel universes/orchestrators
- ONE codebase handles all tiers (free + paid)
- Simple Mode behavior INTEGRATED, not separate
- Blocked by upstream bug #2562, waiting for PR #2566

* docs: Add architecture documentation for unified system

- Current state: Advanced Mode only, simple.py deleted
- Goal: ONE unified architecture (not parallel universes)
- Simple Mode INTEGRATED via HuggingFaceChatClient
- Blocked by upstream #2562, waiting for PR #2566
- Includes path forward for all scenarios

* docs: Update all bug docs for unified architecture consistency

- ACTIVE_BUGS.md: Consolidated free tier issue as single P0 blocker
- P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md: Simplified - bug fixed by unification
- All docs now consistently say: ONE unified architecture, NOT parallel universes
- Simple Mode behavior INTEGRATED via HuggingFaceChatClient
- simple.py is DELETED, not being restored

* docs: FINAL - Clear terminology, framework integration documented

Architecture:
- No API Key (Free) β†’ HuggingFace backend
- API Key (Paid) β†’ OpenAI backend
- ONE codebase, different backends, no "modes"

Framework Stack:
- Microsoft Agent Framework = orchestration (routes agents)
- Pydantic AI = structured outputs (validates data)
- Both work TOGETHER, not mutually exclusive

Blocked by upstream #2562, waiting for PR #2566.

All docs and GitHub issues now use consistent terminology.

* docs: Fix root-level docs for unified architecture

- CLAUDE.md: Remove simple.py reference, update orchestrator description
- AGENTS.md: Same fix
- GEMINI.md: Same fix
- README.md: "Two Modes" β†’ "Unified Architecture" + Free/Paid Tier

All root docs now consistent with unified architecture:
- ONE orchestrator (advanced.py) for all users
- Auto-selects backend: OpenAI (if key) or HuggingFace (free)
- No more "Simple Mode" vs "Advanced Mode" terminology

* fix: Switch default HF model to Qwen2.5-7B to avoid Novita 500 errors

Files changed (2)
  1. HF_FREE_TIER_ANALYSIS.md +68 -0
  2. src/utils/config.py +4 -3
HF_FREE_TIER_ANALYSIS.md ADDED
@@ -0,0 +1,68 @@
+ # Hugging Face Free Tier Reliability Analysis (December 2025)
+
+ ## Executive Summary
+
+ **Root Cause:** The recurring 500/401 errors on the Free Tier (Advanced Mode without API keys) are caused by implicit routing of large models (70B+) to unstable third-party "Inference Providers" (Novita, Hyperbolic) rather than serving them natively on Hugging Face's own infrastructure.
+
+ **Solution:** Switch the default Free Tier model from a flagship-class model (72B) to a high-performance mid-sized model (7B-32B) that is hosted natively by Hugging Face's Serverless Inference API.
+
+ ---
+
+ ## 1. The "Inference Providers" Trap
+
+ Hugging Face offers two distinct execution paths for its Inference API:
+
+ 1. **Serverless Inference API (Native):**
+    * **Host:** Hugging Face's own infrastructure.
+    * **Reliability:** High (direct control).
+    * **Constraints:** Limited to models that fit on standard inference hardware (typically 10-30 GB of VRAM).
+    * **Typical Models:** `bert-base`, `gpt2`, `Mistral-7B`, `Qwen2.5-7B`.
+
+ 2. **Inference Providers (Third-Party Marketplace):**
+    * **Host:** Partners like Novita, Hyperbolic, Together AI, SambaNova.
+    * **Reliability:** Variable. "Staging mode" authentication issues, rate limits, and service outages (500 errors) are common on the free routing layer.
+    * **Purpose:** Serving massive models (Llama-3.1-405B, Qwen2.5-72B) that are too expensive for HF to host for free.
+
+ **The Problem:**
+ When we request `Qwen/Qwen2.5-72B-Instruct` (or `Llama-3.1-70B`) without an API key, HF transparently routes the request to a partner (Novita/Hyperbolic).
+ * **Novita Status:** Currently returning 500 Internal Server Errors.
+ * **Hyperbolic Status:** Previously returned 401 Unauthorized (staging-mode auth bug).
+
+ We are effectively relying on a "best effort" chain of third-party providers for our core application's stability.
+
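+ As a minimal sketch of the two paths (assuming `huggingface_hub` >= 0.28, where `InferenceClient` accepts a `provider` argument; verify against the installed version), the routing can be pinned explicitly instead of left to HF's implicit fallback:
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ messages = [{"role": "user", "content": "Say hello."}]
+
+ # Native path: force Hugging Face's own serverless infrastructure.
+ native = InferenceClient(provider="hf-inference")
+ print(native.chat_completion(messages, model="Qwen/Qwen2.5-7B-Instruct"))
+
+ # Marketplace path: what an anonymous 72B request falls back to implicitly.
+ # Pinning the provider at least makes the third-party dependency visible.
+ partner = InferenceClient(provider="novita")
+ print(partner.chat_completion(messages, model="Qwen/Qwen2.5-72B-Instruct"))
+ ```
+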
+ ## 2. The "Golden Path" for Free Tier
+
+ To ensure stability, the Free Tier must target models that reside on the **Native** path.
+
+ **Criteria for Native Stability:**
+ * **Size:** < 30B parameters (ideal: 7B-12B).
+ * **Popularity:** "Warm" models (high traffic keeps them loaded in memory).
+ * **Architecture:** Standard transformers (easy for HF to serve).
+
+ **Candidate Models (Dec 2025):**
+
+ | Model | Size | Provider Risk | Native Capability |
+ |-------|------|---------------|-------------------|
+ | **Qwen/Qwen2.5-7B-Instruct** | 7B | **Low** | **Excellent** (MATH: 75.5, HumanEval: 84.8) |
+ | **mistralai/Mistral-Nemo-Instruct-2407** | 12B | Low | Very Good |
+ | **Qwen/Qwen2.5-72B-Instruct** | 72B | **High** (Novita) | Excellent (but unreliable) |
+ | **meta-llama/Llama-3.1-70B-Instruct** | 70B | **High** (Hyperbolic) | Excellent (but unreliable) |
+
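+ A quick way to sanity-check this table is to fire one tiny completion at each candidate and record which ones fail upstream (hypothetical probe script, not part of the codebase; the error class location may differ across `huggingface_hub` versions):
+
+ ```python
+ from huggingface_hub import InferenceClient
+ from huggingface_hub.errors import HfHubHTTPError
+
+ CANDIDATES = [
+     "Qwen/Qwen2.5-7B-Instruct",
+     "mistralai/Mistral-Nemo-Instruct-2407",
+     "Qwen/Qwen2.5-72B-Instruct",
+     "meta-llama/Llama-3.1-70B-Instruct",
+ ]
+
+ client = InferenceClient()  # anonymous free-tier call, implicit routing
+ for model in CANDIDATES:
+     try:
+         client.chat_completion(
+             [{"role": "user", "content": "ping"}], model=model, max_tokens=5
+         )
+         print(f"OK    {model}")
+     except HfHubHTTPError as err:  # surfaces the 500/401s described above
+         print(f"FAIL  {model}: {err}")
+ ```
+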
+ ## 3. Recommendation
+
+ **Immediate Fix:**
+ Change the default `HUGGINGFACE_MODEL` in `src/utils/config.py` from `Qwen/Qwen2.5-72B-Instruct` to **`Qwen/Qwen2.5-7B-Instruct`**.
+
+ **Why Qwen2.5-7B?**
+ * **Performance:** Outperforms Llama-3.1-8B and matches GPT-3.5-level performance on many benchmarks.
+ * **Reliability:** Small enough to be hosted natively.
+ * **Context:** 128k context window (well suited to RAG).
+
+ ## 4. Future Architecture (Unified Client)
+
+ For the Unified Chat Client architecture:
+ 1. **Tier 0 (Free):** Hardcoded to native models (Qwen 7B, Mistral Nemo).
+ 2. **Tier 1 (BYO Key):** Allow the user to select any model (70B+), assuming their key grants access to premium providers or the HF PRO tier.
+
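+ A sketch of what that split could look like in the unified client (illustrative only; `select_model` and the constant are hypothetical, not existing code):
+
+ ```python
+ NATIVE_FALLBACK = "Qwen/Qwen2.5-7B-Instruct"  # Tier 0: always HF-native
+
+ def select_model(api_key: str | None, requested: str | None = None) -> str:
+     """Tier 0 (no key): pin to a native model. Tier 1 (BYO key): honor the request."""
+     if api_key is None:
+         return NATIVE_FALLBACK  # never route free users to third-party providers
+     return requested or NATIVE_FALLBACK
+ ```
+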
+ ---
+ *Analysis performed by Gemini CLI Agent, Dec 2, 2025*
src/utils/config.py CHANGED
@@ -36,10 +36,11 @@ class Settings(BaseSettings):
         default="claude-sonnet-4-5-20250929", description="Anthropic model"
     )
     # HuggingFace (free tier)
-    # NOTE: Llama-3.1-70B is routed to Hyperbolic (partner) which has unreliable "staging mode"
-    # Qwen2.5-72B works reliably via HuggingFace's native infrastructure
+    # NOTE: Large models (70B+) are routed to third-party providers (Novita, Hyperbolic) which are
+    # unreliable (500/401 errors). We use Qwen2.5-7B-Instruct as it is small enough to run on
+    # Hugging Face's native serverless infrastructure.
     huggingface_model: str | None = Field(
-        default="Qwen/Qwen2.5-72B-Instruct", description="HuggingFace model name"
+        default="Qwen/Qwen2.5-7B-Instruct", description="HuggingFace model name"
     )
     hf_token: str | None = Field(
         default=None, alias="HF_TOKEN", description="HuggingFace API token"