| # Codette Multi-Adapter Inference + Chat System β Implementation Plan |
|
|
| ## Overview |
|
|
| Build three things inside `codette-training-lab`: |
|
|
| 1. **HF Upload Scripts + Model Cards** β publish each trained adapter to HuggingFace |
| 2. **Multi-Adapter Inference Engine** β loads Llama 3.1 8B + dynamically switches between 8 LoRA adapters |
| 3. **Gradio Real-Time Chat App** β interactive UI to test any adapter with streaming responses, deployable to HF Spaces |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ``` |
| codette-training-lab/ |
| βββ inference/ β NEW |
| β βββ __init__.py |
| β βββ model_loader.py β Core: loads base model + all adapters via PEFT |
| β βββ multi_adapter_engine.py β Orchestrates multi-perspective generation |
| β βββ chat_app.py β Gradio UI with streaming chat |
| βββ scripts/ |
| β βββ upload_adapters.py β NEW: push adapters to HF Hub |
| β βββ model_card_template.md β NEW: model card for each adapter |
| βββ app.py β NEW: HF Spaces entry point (launches chat_app) |
| ``` |
|
|
| --- |
|
|
| ## Part 1: HF Upload Scripts + Model Cards (2 files) |
|
|
| ### `scripts/upload_adapters.py` |
| - Scans `adapters/` directory for trained adapter folders |
| - For each adapter: creates an HF repo `Raiff1982/codette-{adapter_name}`, uploads safetensors + adapter_config.json + tokenizer |
| - Generates a model card from template with correct metadata (base_model, datasets, pipeline_tag, etc.) |
| - Supports `--adapter newton` to upload one or `--all` to upload all 8 |
| |
| ### `scripts/model_card_template.md` |
| - Standard HF model card with YAML frontmatter |
| - Fields: base_model, datasets, tags, pipeline_tag, license |
| - Sections: description, intended use, training details, how to use |
| |
| --- |
| |
| ## Part 2: Multi-Adapter Inference Engine (2 files) |
| |
| ### `inference/model_loader.py` β `CodetteModelLoader` |
| - Loads `meta-llama/Llama-3.1-8B-Instruct` in 4-bit QLoRA (same config as training) |
| - Uses PEFT's `PeftModel.from_pretrained()` to load the first adapter |
| - Uses `model.load_adapter("path", adapter_name="name")` for each additional adapter |
| - Exposes `set_active_adapter(name)` to switch between loaded adapters at runtime |
| - Manages tokenizer (Llama 3.1 chat template with `apply_chat_template`) |
| - GPU memory footprint: ~5GB base + ~20MB per adapter = ~5.2GB total (fits A10G/T4/consumer GPUs) |
|
|
| ### `inference/multi_adapter_engine.py` β `CodetteEngine` |
| - Takes a `CodetteModelLoader` instance |
| - **Single-perspective mode**: user picks one adapter, generates with it |
| - **Multi-perspective mode**: runs the query through N selected adapters, collects responses, synthesizes |
| - **Synthesis**: combines multiple adapter responses into one unified answer (using the multi_perspective adapter or a template) |
| - Streaming support via `TextIteratorStreamer` for real-time token output |
| - Generation params: temperature, top_p, max_tokens, repetition_penalty β all configurable per adapter from `adapter_registry.yaml` |
|
|
| --- |
|
|
| ## Part 3: Gradio Chat Interface (2 files) |
|
|
| ### `inference/chat_app.py` β `create_chat_app()` |
| - **Chat Tab**: streaming chatbot with adapter selector dropdown |
| - Dropdown: "Newton", "DaVinci", "Empathy", "Philosophy", "Quantum", "RC-XI", "Multi-Perspective", "Systems", "All (synthesized)" |
| - Slider controls: temperature, max tokens, top_p |
| - Streaming output token-by-token |
| - Chat history with system/user/assistant roles |
| - **Compare Tab**: side-by-side adapter comparison |
| - Select 2-4 adapters, send same prompt, see responses side by side |
| - Quality scores from ReasoningMetrics displayed per response |
| - **Status Tab**: model info, loaded adapters, GPU memory, adapter configs |
| - Theme: `gr.themes.Soft()` matching existing Codette aesthetic |
|
|
| ### `app.py` (project root) β HF Spaces entry point |
| - Minimal: imports and launches `create_chat_app()` |
| - Loads adapters from HF Hub (for Spaces) or local `adapters/` directory |
| - Configurable via env vars: `CODETTE_ADAPTER_SOURCE=hub|local`, `HF_TOKEN`, `ADAPTER_NAMES` |
|
|
| --- |
|
|
| ## Key Design Decisions |
|
|
| 1. **PEFT multi-adapter** β PEFT natively supports loading multiple LoRA adapters on one base model and switching with `set_adapter()`. No need to load 8 separate models. |
|
|
| 2. **Streaming** β `TextIteratorStreamer` from transformers, threaded generation, yielded to Gradio chatbot for real-time display. |
|
|
| 3. **Chat template** β Llama 3.1 uses `<|begin_of_text|><|start_header_id|>system<|end_header_id|>...` format. We use `tokenizer.apply_chat_template()` which handles this automatically. |
|
|
| 4. **System prompts from registry** β Each adapter's system prompt comes from `adapter_registry.yaml`, injected as the system message in chat. |
|
|
| 5. **HF Spaces compatible** β The app.py + requirements.txt are structured so deploying to a HF Space with GPU runtime works out of the box. |
|
|
| --- |
|
|
| ## File Count: 7 new files |
|
|
| | File | Purpose | ~Lines | |
| |------|---------|--------| |
| | `inference/__init__.py` | Package exports | 10 | |
| | `inference/model_loader.py` | Load base + adapters | 200 | |
| | `inference/multi_adapter_engine.py` | Generation orchestration | 250 | |
| | `inference/chat_app.py` | Gradio UI | 350 | |
| | `app.py` | HF Spaces entry point | 50 | |
| | `scripts/upload_adapters.py` | Push to HF Hub | 180 | |
| | `scripts/model_card_template.md` | Model card template | 80 | |
|
|
| **Total: ~1,120 lines of new code** |
|
|
| --- |
|
|
| ## Execution Order |
|
|
| 1. Upload scripts + model cards (so adapters are on HF when chat loads) |
| 2. Model loader (core inference) |
| 3. Multi-adapter engine (orchestration) |
| 4. Chat app + entry point (UI) |
| 5. Test locally, then deploy to HF Space |
|
|