# Codette Multi-Adapter Inference + Chat System — Implementation Plan

## Overview

Build three things inside `codette-training-lab`:

1. **HF Upload Scripts + Model Cards** — publish each trained adapter to HuggingFace
2. **Multi-Adapter Inference Engine** — loads Llama 3.1 8B + dynamically switches between 8 LoRA adapters
3. **Gradio Real-Time Chat App** — interactive UI to test any adapter with streaming responses, deployable to HF Spaces

---

## Architecture

```
codette-training-lab/
├── inference/                    ← NEW
│   ├── __init__.py
│   ├── model_loader.py          ← Core: loads base model + all adapters via PEFT
│   ├── multi_adapter_engine.py  ← Orchestrates multi-perspective generation
│   └── chat_app.py              ← Gradio UI with streaming chat
├── scripts/
│   ├── upload_adapters.py       ← NEW: push adapters to HF Hub
│   └── model_card_template.md   ← NEW: model card for each adapter
└── app.py                       ← NEW: HF Spaces entry point (launches chat_app)
```

---

## Part 1: HF Upload Scripts + Model Cards (2 files)

### `scripts/upload_adapters.py`
- Scans `adapters/` directory for trained adapter folders
- For each adapter: creates an HF repo `Raiff1982/codette-{adapter_name}`, uploads safetensors + adapter_config.json + tokenizer
- Generates a model card from template with correct metadata (base_model, datasets, pipeline_tag, etc.)
- Supports `--adapter newton` to upload one or `--all` to upload all 8

### `scripts/model_card_template.md`
- Standard HF model card with YAML frontmatter
- Fields: base_model, datasets, tags, pipeline_tag, license
- Sections: description, intended use, training details, how to use
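
The YAML frontmatter might look like this (values are illustrative; the dataset id is a placeholder the upload script would fill in per adapter):

```yaml
---
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
pipeline_tag: text-generation
license: llama3.1
tags:
  - lora
  - codette
datasets:
  - example/codette-dataset   # placeholder, filled per adapter by the upload script
---
```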

---

## Part 2: Multi-Adapter Inference Engine (2 files)

### `inference/model_loader.py` — `CodetteModelLoader`
- Loads `meta-llama/Llama-3.1-8B-Instruct` in 4-bit QLoRA (same config as training)
- Uses PEFT's `PeftModel.from_pretrained()` to load the first adapter
- Uses `model.load_adapter("path", adapter_name="name")` for each additional adapter
- Exposes `set_active_adapter(name)` to switch between loaded adapters at runtime
- Manages tokenizer (Llama 3.1 chat template with `apply_chat_template`)
- GPU memory footprint: ~5GB base + ~20MB per adapter = ~5.2GB total (fits A10G/T4/consumer GPUs)
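
A minimal sketch of the loader, assuming local adapter folders and the PEFT calls named above (the class surface is illustrative):

```python
# Sketch of inference/model_loader.py.
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

class CodetteModelLoader:
    def __init__(self, adapter_paths: dict[str, str]):
        # adapter_paths maps adapter name -> local dir or HF repo id
        self.adapter_paths = dict(adapter_paths)
        self.model = None
        self.tokenizer = None
        self.active = None

    def load(self) -> None:
        """Load the 4-bit base model, then attach every adapter."""
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import PeftModel

        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        base = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL, quantization_config=bnb, device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

        names = list(self.adapter_paths)
        first = names[0]
        # First adapter comes in via PeftModel.from_pretrained ...
        self.model = PeftModel.from_pretrained(
            base, self.adapter_paths[first], adapter_name=first
        )
        # ... the rest via load_adapter on the same wrapped model
        for name in names[1:]:
            self.model.load_adapter(self.adapter_paths[name], adapter_name=name)
        self.active = first

    def set_active_adapter(self, name: str) -> None:
        """Switch which LoRA adapter modifies the forward pass."""
        if name not in self.adapter_paths:
            raise KeyError(f"unknown adapter: {name}")
        self.model.set_adapter(name)
        self.active = name
```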

### `inference/multi_adapter_engine.py` — `CodetteEngine`
- Takes a `CodetteModelLoader` instance
- **Single-perspective mode**: user picks one adapter, generates with it
- **Multi-perspective mode**: runs the query through N selected adapters, collects responses, synthesizes
- **Synthesis**: combines multiple adapter responses into one unified answer (using the multi_perspective adapter or a template)
- Streaming support via `TextIteratorStreamer` for real-time token output
- Generation params: temperature, top_p, max_tokens, repetition_penalty — all configurable per adapter from `adapter_registry.yaml`
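
The streaming path could look like the following sketch; `loader` is assumed to expose the `CodetteModelLoader` surface described above:

```python
# Streaming generation sketch for inference/multi_adapter_engine.py.
from threading import Thread

def stream_reply(loader, messages, adapter_name, max_new_tokens=512, temperature=0.7):
    """Yield response text incrementally for one adapter."""
    from transformers import TextIteratorStreamer

    loader.set_active_adapter(adapter_name)
    # apply_chat_template renders the Llama 3.1 chat format for us
    inputs = loader.tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(loader.model.device)

    streamer = TextIteratorStreamer(
        loader.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    kwargs = dict(
        input_ids=inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True,
    )
    # generate() blocks, so run it in a thread and consume the streamer here
    Thread(target=loader.model.generate, kwargs=kwargs, daemon=True).start()
    for token_text in streamer:
        yield token_text
```

Multi-perspective mode would call this once per selected adapter and hand the collected texts to the synthesis step.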

---

## Part 3: Gradio Chat Interface (2 files)

### `inference/chat_app.py` — `create_chat_app()`
- **Chat Tab**: streaming chatbot with adapter selector dropdown
  - Dropdown: "Newton", "DaVinci", "Empathy", "Philosophy", "Quantum", "RC-XI", "Multi-Perspective", "Systems", "All (synthesized)"
  - Slider controls: temperature, max tokens, top_p
  - Streaming output token-by-token
  - Chat history with system/user/assistant roles
- **Compare Tab**: side-by-side adapter comparison
  - Select 2-4 adapters, send same prompt, see responses side by side
  - Quality scores from ReasoningMetrics displayed per response
- **Status Tab**: model info, loaded adapters, GPU memory, adapter configs
- Theme: `gr.themes.Soft()` matching existing Codette aesthetic
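
The Chat tab skeleton might look like this; `stream_fn` stands in for the engine's streaming generator and the wiring is a sketch, not the final UI:

```python
# Minimal sketch of the Chat tab in inference/chat_app.py.
ADAPTER_CHOICES = [
    "Newton", "DaVinci", "Empathy", "Philosophy", "Quantum",
    "RC-XI", "Multi-Perspective", "Systems", "All (synthesized)",
]

def create_chat_app(stream_fn):
    import gradio as gr

    with gr.Blocks(theme=gr.themes.Soft(), title="Codette Chat") as demo:
        adapter = gr.Dropdown(ADAPTER_CHOICES, value="Newton", label="Adapter")
        temperature = gr.Slider(0.0, 2.0, value=0.7, label="Temperature")
        max_tokens = gr.Slider(64, 4096, value=512, step=64, label="Max tokens")
        chatbot = gr.Chatbot(type="messages")
        msg = gr.Textbox(label="Message")

        def respond(message, history, adapter, temperature, max_tokens):
            history = history + [{"role": "user", "content": message}]
            partial = ""
            # Yielding from the handler streams tokens into the chatbot
            for chunk in stream_fn(history, adapter, temperature, int(max_tokens)):
                partial += chunk
                yield "", history + [{"role": "assistant", "content": partial}]

        msg.submit(respond, [msg, chatbot, adapter, temperature, max_tokens],
                   [msg, chatbot])
    return demo
```

The Compare and Status tabs would be additional `gr.Tab` blocks inside the same `gr.Blocks` context.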

### `app.py` (project root) — HF Spaces entry point
- Minimal: imports and launches `create_chat_app()`
- Loads adapters from HF Hub (for Spaces) or local `adapters/` directory
- Configurable via env vars: `CODETTE_ADAPTER_SOURCE=hub|local`, `HF_TOKEN`, `ADAPTER_NAMES`
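
The hub-or-local resolution could be as simple as the sketch below; the `snapshot_download` path for hub mode and the default adapter list are assumptions:

```python
# Sketch of adapter-source resolution for the root app.py.
import os

def resolve_adapter_paths() -> dict[str, str]:
    """Return {adapter_name: local_path}, pulling from the Hub if configured."""
    names = os.environ.get("ADAPTER_NAMES", "newton,davinci").split(",")
    source = os.environ.get("CODETTE_ADAPTER_SOURCE", "local")
    paths = {}
    for name in (n.strip() for n in names if n.strip()):
        if source == "hub":
            from huggingface_hub import snapshot_download  # lazy heavy import
            paths[name] = snapshot_download(
                f"Raiff1982/codette-{name}", token=os.environ.get("HF_TOKEN")
            )
        else:
            paths[name] = os.path.join("adapters", name)
    return paths
```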

---

## Key Design Decisions

1. **PEFT multi-adapter** — PEFT natively supports loading multiple LoRA adapters on one base model and switching with `set_adapter()`. No need to load 8 separate models.

2. **Streaming** — `TextIteratorStreamer` from transformers, threaded generation, yielded to the Gradio chatbot for real-time display.

3. **Chat template** — Llama 3.1 uses the `<|begin_of_text|><|start_header_id|>system<|end_header_id|>...` format. We use `tokenizer.apply_chat_template()`, which handles this automatically.

4. **System prompts from registry** — Each adapter's system prompt comes from `adapter_registry.yaml` and is injected as the system message in chat.

5. **HF Spaces compatible** — `app.py` + `requirements.txt` are structured so deploying to an HF Space with GPU runtime works out of the box.

---

## File Count: 7 new files

| File | Purpose | ~Lines |
|------|---------|--------|
| `inference/__init__.py` | Package exports | 10 |
| `inference/model_loader.py` | Load base + adapters | 200 |
| `inference/multi_adapter_engine.py` | Generation orchestration | 250 |
| `inference/chat_app.py` | Gradio UI | 350 |
| `app.py` | HF Spaces entry point | 50 |
| `scripts/upload_adapters.py` | Push to HF Hub | 180 |
| `scripts/model_card_template.md` | Model card template | 80 |

**Total: ~1,120 lines of new code**

---

## Execution Order

1. Upload scripts + model cards (so adapters are on HF when chat loads)
2. Model loader (core inference)
3. Multi-adapter engine (orchestration)
4. Chat app + entry point (UI)
5. Test locally, then deploy to HF Space