
Hugging Face Free Tier Reliability Analysis (December 2025)

Executive Summary

Root Cause: The recurring 500/401 errors on the Free Tier (Advanced Mode without API keys) are caused by implicit routing of large models (70B+) to unstable third-party "Inference Providers" (Novita, Hyperbolic) instead of running natively on Hugging Face's infrastructure.

Solution: Switch the default Free Tier model from a flagship-class model (72B) to a high-performance mid-sized model (7B-32B) that is hosted natively by Hugging Face's Serverless Inference API.


1. The "Inference Providers" Trap

Hugging Face offers two distinct execution paths for its Inference API:

  1. Serverless Inference API (Native):

    • Host: Hugging Face's own infrastructure.
    • Reliability: High (Direct control).
    • Constraints: Limited to models that fit on standard inference hardware (typically no more than 10 GB to 30 GB of VRAM).
    • Typical Models: bert-base, gpt2, Mistral-7B, Qwen2.5-7B.
  2. Inference Providers (Third-Party Marketplace):

    • Host: Partners like Novita, Hyperbolic, Together AI, SambaNova.
    • Reliability: Variable. "Staging mode" authentication issues, rate limits, and service outages (500 errors) are common on the free routing layer.
    • Purpose: To serve massive models (Llama-3.1-405B, Qwen2.5-72B) that are too expensive for HF to host for free.

The Problem: When we request Qwen/Qwen2.5-72B-Instruct (or Llama-3.1-70B) without an API key, HF transparently routes this request to a partner (Novita/Hyperbolic).

  • Novita Status: Currently returning 500 Internal Server Errors.
  • Hyperbolic Status: Previously returned 401 Unauthorized (Staging Mode auth bug).

We are effectively relying on a "best effort" chain of third-party providers for our core application stability.

2. The "Golden Path" for Free Tier

To ensure stability, the Free Tier must target models that reside on the Native path.

Criteria for Native Stability:

  • Size: < 30B parameters (ideal: 7B - 12B).
  • Popularity: "Warm" models (high traffic keeps them loaded in memory).
  • Architecture: Standard transformers (easy for HF to serve).
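
The size criterion above can be checked mechanically. The following is a minimal sketch, not part of the existing codebase: the helper names (`estimate_params_b`, `passes_size_criterion`) and the regex are hypothetical, and it assumes the parameter count is encoded in the model name (as in `Qwen2.5-7B-Instruct`). Popularity and architecture are not checked here, and whether a model is actually served natively still has to be verified against Hugging Face itself.

```python
import re
from typing import Optional

# Assumed ceiling, taken from the size criterion above (< 30B parameters).
MAX_FREE_TIER_PARAMS_B = 30.0

# Matches a size tag such as "7B" or "70B" while skipping version numbers like "2.5".
_SIZE_TAG = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z])")


def estimate_params_b(model_id: str) -> Optional[float]:
    """Return the parameter count in billions encoded in the model name, or None."""
    name = model_id.split("/")[-1]  # drop the "org/" prefix
    match = _SIZE_TAG.search(name)
    return float(match.group(1)) if match else None


def passes_size_criterion(model_id: str, known_size_b: Optional[float] = None) -> bool:
    """Check the size criterion only; an explicit size overrides the name-based guess."""
    size = known_size_b if known_size_b is not None else estimate_params_b(model_id)
    # Unknown size is treated as a failure so that nothing slips through by accident.
    return size is not None and size < MAX_FREE_TIER_PARAMS_B
```

Models whose names carry no size tag, such as `mistralai/Mistral-Nemo-Instruct-2407`, need the size passed explicitly, for example `passes_size_criterion("mistralai/Mistral-Nemo-Instruct-2407", known_size_b=12)`.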

Candidate Models (Dec 2025):

| Model | Size | Provider Risk | Native Capability |
|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B | Low | Excellent (Math: 75.5, Code: 84.8) |
| mistralai/Mistral-Nemo-Instruct-2407 | 12B | Low | Very Good |
| Qwen/Qwen2.5-72B-Instruct | 72B | High (Novita) | Excellent (but unreliable) |
| meta-llama/Llama-3.1-70B-Instruct | 70B | High (Hyperbolic) | Excellent (but unreliable) |

3. Recommendation

Immediate Fix: Change the default HUGGINGFACE_MODEL in src/utils/config.py from Qwen/Qwen2.5-72B-Instruct to Qwen/Qwen2.5-7B-Instruct.
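
As an illustration of the change, here is a minimal sketch of how the default could be expressed. The real layout of `src/utils/config.py` is not shown in this document, so the function name `resolve_huggingface_model`, the constant `DEFAULT_FREE_TIER_MODEL`, and the idea of reading `HUGGINGFACE_MODEL` from the environment as an override are all assumptions.

```python
import os
from typing import Mapping, Optional

# New default proposed in this analysis (previously Qwen/Qwen2.5-72B-Instruct).
DEFAULT_FREE_TIER_MODEL = "Qwen/Qwen2.5-7B-Instruct"


def resolve_huggingface_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured model, falling back to the Free Tier default.

    Assumption: HUGGINGFACE_MODEL may be set as an environment variable;
    an unset or blank value falls back to the default.
    """
    source = os.environ if env is None else env
    configured = source.get("HUGGINGFACE_MODEL", "").strip()
    return configured or DEFAULT_FREE_TIER_MODEL
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.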

Why Qwen2.5-7B?

  • Performance: Outperforms Llama-3.1-8B and matches GPT-3.5 levels in many benchmarks.
  • Reliability: Small enough to be hosted natively.
  • Context: Up to a 128k-token context window (well suited to RAG).

4. Future Architecture (Unified Client)

For the Unified Chat Client architecture:

  1. Tier 0 (Free): Hardcoded to Native Models (Qwen 7B, Mistral Nemo).
  2. Tier 1 (BYO Key): Allow user to select any model (70B+), assuming they provide a key that grants access to premium providers or PRO tier.
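
The two tiers can be sketched as a single selection function. This is an illustrative sketch only: `select_model` and `FREE_TIER_MODELS` are hypothetical names, and the Unified Chat Client's actual interface is not described in this document.

```python
from typing import Optional

# Tier 0 allowlist, taken from the native candidates named above.
FREE_TIER_MODELS = (
    "Qwen/Qwen2.5-7B-Instruct",
    "mistralai/Mistral-Nemo-Instruct-2407",
)


def select_model(api_key: Optional[str], requested_model: Optional[str] = None) -> str:
    """Pick a model according to the tier implied by the presence of an API key."""
    has_key = bool(api_key and api_key.strip())
    if has_key and requested_model:
        # Tier 1 (BYO Key): honor any model the user asks for.
        return requested_model
    if requested_model in FREE_TIER_MODELS:
        # Tier 0 (Free): a request is honored only if it is on the allowlist.
        return requested_model
    # No usable request: fall back to the first allowlisted model.
    return FREE_TIER_MODELS[0]
```

In this sketch a Free Tier request for a 70B+ model is silently downgraded to the default. Raising an error or surfacing a notice to the user would be an equally valid choice.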

Analysis performed by Gemini CLI Agent, Dec 2, 2025