Codette Multi-Adapter Inference + Chat System: Implementation Plan
Overview
Build three things inside codette-training-lab:
- HF Upload Scripts + Model Cards – publish each trained adapter to HuggingFace
- Multi-Adapter Inference Engine – loads Llama 3.1 8B and dynamically switches between 8 LoRA adapters
- Gradio Real-Time Chat App – interactive UI to test any adapter with streaming responses, deployable to HF Spaces
Architecture
```
codette-training-lab/
├── inference/                      # NEW
│   ├── __init__.py
│   ├── model_loader.py             # Core: loads base model + all adapters via PEFT
│   ├── multi_adapter_engine.py     # Orchestrates multi-perspective generation
│   └── chat_app.py                 # Gradio UI with streaming chat
├── scripts/
│   ├── upload_adapters.py          # NEW: push adapters to HF Hub
│   └── model_card_template.md      # NEW: model card for each adapter
└── app.py                          # NEW: HF Spaces entry point (launches chat_app)
```
Part 1: HF Upload Scripts + Model Cards (2 files)
scripts/upload_adapters.py
- Scans the `adapters/` directory for trained adapter folders
- For each adapter: creates an HF repo `Raiff1982/codette-{adapter_name}`, uploads the safetensors weights + `adapter_config.json` + tokenizer
- Generates a model card from the template with correct metadata (`base_model`, `datasets`, `pipeline_tag`, etc.)
- Supports `--adapter newton` to upload one adapter, or `--all` to upload all 8
scripts/model_card_template.md
- Standard HF model card with YAML frontmatter
- Fields: `base_model`, `datasets`, `tags`, `pipeline_tag`, `license`
- Sections: description, intended use, training details, how to use
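One way the template's frontmatter and skeleton might look (the `{...}` values are template placeholders filled in by the upload script, and the license tag is an assumption):

```markdown
---
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
pipeline_tag: text-generation
tags:
  - lora
  - codette
datasets:
  - {dataset_name}
---

# Codette {adapter_name}

## Description

## Intended Use

## Training Details

## How to Use
```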
Part 2: Multi-Adapter Inference Engine (2 files)
`inference/model_loader.py` – `CodetteModelLoader`
- Loads `meta-llama/Llama-3.1-8B-Instruct` in 4-bit QLoRA (same config as training)
- Uses PEFT's `PeftModel.from_pretrained()` to load the first adapter
- Uses `model.load_adapter("path", adapter_name="name")` for each additional adapter
- Exposes `set_active_adapter(name)` to switch between loaded adapters at runtime
- Manages the tokenizer (Llama 3.1 chat template via `apply_chat_template`)
- GPU memory footprint: ~5 GB base + ~20 MB per adapter ≈ 5.2 GB total (fits A10G/T4/consumer GPUs)
`inference/multi_adapter_engine.py` – `CodetteEngine`
- Takes a `CodetteModelLoader` instance
- Single-perspective mode: user picks one adapter, generates with it
- Multi-perspective mode: runs the query through N selected adapters, collects responses, synthesizes
- Synthesis: combines multiple adapter responses into one unified answer (using the multi_perspective adapter or a template)
- Streaming support via `TextIteratorStreamer` for real-time token output
- Generation params: temperature, top_p, max_tokens, repetition_penalty – all configurable per adapter from `adapter_registry.yaml`
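The template-based synthesis path (the second option named above) can be a pure prompt-building step; a minimal sketch with an illustrative template and helper name:

```python
# Template-based synthesis step of CodetteEngine (hypothetical helper names).
# Each per-adapter response is folded into one prompt for a final pass.
SYNTHESIS_TEMPLATE = (
    "You are synthesizing multiple expert perspectives into one answer.\n\n"
    "Question: {question}\n\n"
    "{perspectives}\n"
    "Unified answer:"
)


def build_synthesis_prompt(question: str, responses: dict[str, str]) -> str:
    """Combine per-adapter responses into a single synthesis prompt."""
    perspectives = "\n".join(
        f"[{adapter}]\n{text}\n" for adapter, text in responses.items()
    )
    return SYNTHESIS_TEMPLATE.format(question=question, perspectives=perspectives)
```

The engine would feed this prompt back through the multi_perspective adapter (or the base model) for the final synthesized response.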
Part 3: Gradio Chat Interface (2 files)
`inference/chat_app.py` – `create_chat_app()`
- Chat Tab: streaming chatbot with adapter selector dropdown
  - Dropdown: "Newton", "DaVinci", "Empathy", "Philosophy", "Quantum", "RC-XI", "Multi-Perspective", "Systems", "All (synthesized)"
  - Slider controls: temperature, max tokens, top_p
  - Streaming output token-by-token
  - Chat history with system/user/assistant roles
- Compare Tab: side-by-side adapter comparison
  - Select 2–4 adapters, send the same prompt, see responses side by side
  - Quality scores from `ReasoningMetrics` displayed per response
- Status Tab: model info, loaded adapters, GPU memory, adapter configs
- Theme: `gr.themes.Soft()` matching the existing Codette aesthetic
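A sketch of the Chat Tab wiring. The streaming helper is framework-agnostic (Gradio renders a growing reply whenever the handler is a generator), so `gradio` is imported lazily inside `create_chat_app`; the engine's `generate_stream` method name is an assumption:

```python
# Streaming chat wiring for chat_app.py (engine API names are assumptions).
from collections.abc import Iterator


def stream_reply(engine, message: str, history: list, adapter: str) -> Iterator[str]:
    """Yield the growing reply so the Gradio chatbot renders tokens live."""
    partial = ""
    for token in engine.generate_stream(message, history, adapter=adapter):
        partial += token
        yield partial


def create_chat_app(engine):
    import gradio as gr  # lazy import keeps the streaming helper testable

    adapters = ["Newton", "DaVinci", "Empathy", "Philosophy", "Quantum",
                "RC-XI", "Multi-Perspective", "Systems", "All (synthesized)"]
    return gr.ChatInterface(
        fn=lambda message, history, adapter: stream_reply(
            engine, message, history, adapter
        ),
        additional_inputs=[gr.Dropdown(adapters, value="Newton", label="Adapter")],
        theme=gr.themes.Soft(),
        title="Codette Chat",
    )
```

Yielding the accumulated string (rather than individual tokens) matches how Gradio chat handlers expect streaming output.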
`app.py` (project root) – HF Spaces entry point
- Minimal: imports and launches `create_chat_app()`
- Loads adapters from HF Hub (for Spaces) or the local `adapters/` directory
- Configurable via env vars: `CODETTE_ADAPTER_SOURCE=hub|local`, `HF_TOKEN`, `ADAPTER_NAMES`
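The env-var handling in `app.py` could be as small as one resolver function; a sketch using the variable names above (the helper name and `local` default are assumptions):

```python
# Sketch of app.py's config resolution (env var names from the plan;
# the helper name and the "local" default are assumptions).
import os


def resolve_adapter_config() -> dict:
    """Read deployment config from the environment, defaulting to local files."""
    source = os.environ.get("CODETTE_ADAPTER_SOURCE", "local")
    if source not in ("hub", "local"):
        raise ValueError(
            f"CODETTE_ADAPTER_SOURCE must be 'hub' or 'local', got {source!r}"
        )
    names_env = os.environ.get("ADAPTER_NAMES", "")
    return {
        "source": source,
        "token": os.environ.get("HF_TOKEN"),
        # Empty ADAPTER_NAMES means "load everything the loader can find"
        "adapter_names": [n.strip() for n in names_env.split(",") if n.strip()],
    }
```

On a Space, `CODETTE_ADAPTER_SOURCE=hub` plus an `HF_TOKEN` secret is enough; locally, no env vars are needed at all.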
Key Design Decisions
- PEFT multi-adapter – PEFT natively supports loading multiple LoRA adapters on one base model and switching with `set_adapter()`. No need to load 8 separate models.
- Streaming – `TextIteratorStreamer` from transformers, threaded generation, yielded to the Gradio chatbot for real-time display.
- Chat template – Llama 3.1 uses the `<|begin_of_text|><|start_header_id|>system<|end_header_id|>...` format. We use `tokenizer.apply_chat_template()`, which handles this automatically.
- System prompts from registry – each adapter's system prompt comes from `adapter_registry.yaml` and is injected as the system message in chat.
- HF Spaces compatible – `app.py` + `requirements.txt` are structured so deploying to an HF Space with a GPU runtime works out of the box.
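The registry-injection decision reduces to plain message assembly before `apply_chat_template`; a minimal sketch where an in-memory dict stands in for `adapter_registry.yaml` and the helper name is illustrative:

```python
# Injecting a registry system prompt before apply_chat_template.
# REGISTRY stands in for the parsed adapter_registry.yaml.
REGISTRY = {
    "newton": {"system_prompt": "You reason like a classical physicist."},
}


def build_messages(
    adapter: str, history: list[tuple[str, str]], user_msg: str
) -> list[dict]:
    """Prepend the adapter's system prompt, then replay history + the new turn."""
    messages = [{"role": "system", "content": REGISTRY[adapter]["system_prompt"]}]
    for user, assistant in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": user_msg})
    return messages
```

The result is what gets passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`, so the Llama 3.1 special-token format never has to be built by hand.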
File Count: 7 new files
| File | Purpose | ~Lines |
|---|---|---|
| `inference/__init__.py` | Package exports | 10 |
| `inference/model_loader.py` | Load base + adapters | 200 |
| `inference/multi_adapter_engine.py` | Generation orchestration | 250 |
| `inference/chat_app.py` | Gradio UI | 350 |
| `app.py` | HF Spaces entry point | 50 |
| `scripts/upload_adapters.py` | Push to HF Hub | 180 |
| `scripts/model_card_template.md` | Model card template | 80 |

Total: ~1,120 lines of new code
Execution Order
1. Upload scripts + model cards (so adapters are on HF when the chat app loads)
2. Model loader (core inference)
3. Multi-adapter engine (orchestration)
4. Chat app + entry point (UI)
5. Test locally, then deploy to an HF Space