ehartford committed
Commit 569fad6 · verified · 1 Parent(s): 5154f51

Update README.md

Files changed (1)
  1. README.md +9 -229
README.md CHANGED
@@ -14,19 +14,17 @@ pipeline_tag: image-text-to-text
14
 
15
  # Prisma-VL-8B: Introspective Vision-Language Model
16
 
17
- **An 8-billion parameter vision-language model architected from the ground up with 16-bit temporal uncertainty feedback for self-aware, calibrated predictions.**
18
 
19
  ## What is This?
20
 
21
- Prisma-VL-8B is a **reference implementation** of an introspective transformer architecture. The model doesn't just predict - it *knows* when it's uncertain and uses that self-awareness to calibrate subsequent predictions.
22
-
23
- This is not a base model with modifications. **This IS the architecture.** The 16-bit temporal uncertainty feedback mechanism is fundamental to how this model thinks.
24
 
25
  ## Core Architecture
26
 
27
  ### The Introspective Mechanism
28
 
29
- Every transformer processes tokens sequentially. Prisma-VL-8B adds one crucial element: **memory of its own uncertainty**.
30
 
31
  ```
32
  Standard Transformer:
@@ -38,11 +36,11 @@ Introspective Transformer:
38
 
39
  ### How It Works
40
 
41
- **The 65,536-Level Uncertainty System:**
42
 
43
  At each prediction step:
44
  1. **Measure**: Compute entropy of output distribution (how uncertain am I?)
45
- 2. **Quantize**: Convert to 16-bit code (0-65535, representing confidence levels)
46
  3. **Inject**: Next token receives this as learned embedding signal
47
  4. **Learn**: Through training, model learns what each uncertainty level means
48
 
@@ -51,25 +49,6 @@ At each prediction step:
51
  - When it's extrapolating (rising uncertainty)
52
  - When it needs to be conservative (high uncertainty)
53
 
54
- ### Architecture Components
55
-
56
- ```python
57
- # Core introspective components (built into PrismaVLModel)
58
-
59
- self.uncertainty_embeddings = nn.Embedding(65536, hidden_dim)
60
- # 65,536 learned vectors: "uncertainty vocabulary"
61
- # Each represents: "I was X% uncertain on the last token"
62
-
63
- self.prev_uncertainty_code = None # [batch, seq] with values [0-65535]
64
- # Temporal memory: tracks uncertainty history across generation
65
- ```
66
-
67
- **Parameter Cost:** 65,536 × 4096 = 268,435,456 parameters (3.35% of model)
68
-
69
- **Memory Cost:** 2 bytes per token (uncertainty code)
70
-
71
- **Compute Cost:** One embedding lookup per token (negligible)
72
-
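The cost figures above follow from simple arithmetic:

```python
# Parameter cost: one learned 4096-d vector per 16-bit uncertainty level
params = 65_536 * 4_096
assert params == 268_435_456   # the 268M figure quoted above

# Memory cost: one 16-bit code per token, i.e. 2 bytes
bytes_per_token = 16 // 8
assert bytes_per_token == 2
```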
73
  ## Why This Matters
74
 
75
  ### Traditional Language Models
@@ -92,20 +71,7 @@ Generate "The capital of France is Mad..."
92
  [code:23] → [code:15] → [code:142] → STOP # Detects uncertainty spike
93
  ```
94
 
95
- The model **feels** when predictions are going wrong and can self-correct or abstain.
96
-
97
- ## What Gets Learned
98
-
99
- Through standard training (no special loss needed), the 65,536 uncertainty embeddings learn semantic meaning:
100
-
101
- | Code Range | Semantic Meaning | Learned Behavior |
102
- |------------|------------------|------------------|
103
- | 0-16383 | "I was very confident" | Maintain trajectory, continue assertively |
104
- | 16384-32767 | "Moderate confidence" | Proceed with caution, verify facts |
105
- | 32768-49151 | "Some uncertainty" | Hedge statements, qualify claims |
106
- | 49152-65535 | "Very uncertain" | Conservative generation, flag uncertainty |
107
-
108
- This creates a **calibration vocabulary** - the model learns to speak about its own knowledge state with fine-grained resolution.
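A minimal helper (hypothetical, not an API the model ships) that maps a code onto the bands in the table above:

```python
def uncertainty_band(code: int) -> str:
    """Map a 16-bit uncertainty code to its coarse semantic band."""
    if not 0 <= code <= 65_535:
        raise ValueError("code must fit in 16 bits")
    # Each band spans 16,384 of the 65,536 levels
    return ["very confident", "moderate confidence",
            "some uncertainty", "very uncertain"][code // 16_384]

print(uncertainty_band(12_000))   # very confident
print(uncertainty_band(50_000))   # very uncertain
```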
109
 
110
  ## Usage
111
 
@@ -166,183 +132,9 @@ print(f"Average confidence: {1 - mean_uncertainty:.2%}")
166
  print(f"Highest uncertainty code: {max_uncertainty}")
167
  ```
168
 
169
- ### Resetting State
170
-
171
- ```python
172
- # Between independent generations, reset uncertainty history
173
- model.model.reset_uncertainty()
174
-
175
- # Fresh start - no previous context
176
- outputs = model.generate(**inputs)
177
- ```
178
-
179
- ## Model Specifications
180
-
181
- ### Vision Encoder
182
- - **Architecture**: 27-layer Vision Transformer
183
- - **Hidden Dimension**: 1152
184
- - **Patch Size**: 16×16
185
- - **Temporal Patches**: 2 (for video)
186
- - **Parameters**: ~1.15B
187
-
188
- ### Language Model
189
- - **Architecture**: 36-layer Transformer
190
- - **Hidden Dimension**: 4096
191
- - **Attention Heads**: 32 (8 KV heads, GQA)
192
- - **Intermediate Size**: 12,288
193
- - **Context Length**: 262,144 tokens
194
- - **Parameters**: ~6.85B
195
-
196
- ### Introspective System
197
- - **Uncertainty Levels**: 65,536 (16-bit)
198
- - **Uncertainty Embeddings**: 65,536 × 4096
199
- - **Parameters**: 268,435,456 (268M)
200
- - **Overhead**: 3.35% of total model
201
-
202
- ### Total Model
203
- **Parameters**: ~8.27B (8.0B base + 268M introspective)
204
- - **Precision**: BFloat16 recommended
205
- - **Hardware**: 24GB VRAM recommended
206
-
207
- ## Design Philosophy
208
-
209
- ### Why 16-bit Quantization?
210
-
211
- - **Fine-Grained Resolution**: 65,536 levels capture nuanced confidence gradations
212
- - **Rich Representation**: Model can learn subtle uncertainty distinctions
213
- - **Precise Calibration**: Higher resolution enables better self-awareness
214
- - **Still Efficient**: Only 2 bytes per token, single embedding table lookup
215
-
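As a back-of-the-envelope comparison, moving from a 1-byte to a 2-byte code multiplies the resolution by 256:

```python
# With normalized entropy mapped onto the code range, each level covers:
step_16bit = 1 / 65_536   # fraction of the entropy range per 16-bit level
step_8bit  = 1 / 256      # what a 1-byte code would give instead

# 16-bit codes are 256x finer-grained than 8-bit codes
assert step_8bit / step_16bit == 256
```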
216
- ### Why Temporal Feedback?
217
-
218
- - **Causal Awareness**: Model sees its own prediction history
219
- - **Self-Correction**: Can detect and recover from errors
220
- - **Calibration**: Learns confidence from experience
221
- - **No External Labels**: Uses its own predictions as training signal
222
-
223
- ### Why Built-In?
224
-
225
- - **Native Integration**: Works seamlessly with vision and text processing
226
- - **Always Active**: No modes to enable/disable
227
- - **End-to-End Training**: Learns uncertainty simultaneously with task
228
- - **Production Ready**: No inference overhead, no special handling
229
-
230
- ## When to Use This Architecture
231
-
232
- ### ✅ Good Fit
233
- - Applications requiring calibrated confidence estimates
234
- - Domains where hallucination prevention is critical
235
- - Long-form generation (benefits from temporal awareness)
236
- - Interactive systems (can express uncertainty to users)
237
- - Research on model introspection and self-awareness
238
-
239
- ### ⚠️ Considerations
240
- - Requires fine-tuning for uncertainty calibration
241
- Adds 268M parameters (3.35% overhead, minimal but non-zero)
242
- - Uncertainty codes need interpretation in your domain
243
 
244
- ## Performance Characteristics
245
-
246
- ### Computational Overhead
247
-
248
- | Phase | Additional Cost |
249
- |-------|----------------|
250
- | Forward Pass | +1 embedding lookup per token (~0.1% compute) |
251
- | Uncertainty Computation | Entropy calculation (in `torch.no_grad()`, negligible) |
252
- | Memory | +2 bytes per token in cache |
253
- | Training | Standard backprop through uncertainty embeddings |
254
-
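The 2-bytes-per-token cache cost stays small even at the model's full 262,144-token context (simple arithmetic, assuming one code per cached token):

```python
context_len = 262_144          # model's maximum context length
bytes_per_code = 2             # one 16-bit uncertainty code per token

cache_bytes = context_len * bytes_per_code
assert cache_bytes == 524_288  # 512 KiB for a full-length sequence
print(cache_bytes // 1024, "KiB")
```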
255
- ### Expected Benefits (After Fine-tuning)
256
-
257
- - **Calibration**: Better alignment between confidence and accuracy
258
- - **Hallucination Reduction**: Early detection of uncertain territory
259
- - **Adaptive Behavior**: Conservative when uncertain, assertive when confident
260
- - **Interpretability**: Uncertainty codes reveal model state
261
-
262
- ## Training Recommendations
263
-
264
- ### Initial Setup
265
- 1. Load model with randomly initialized uncertainty embeddings
266
- 2. Use your standard vision-language training recipe
267
- 3. No changes to loss functions or training loops required
268
- 4. Uncertainty mechanism learns automatically
269
-
270
- ### Convergence
271
- - Uncertainty embeddings converge at similar rate to language model
272
- - Monitor validation loss as usual
273
- - Well-calibrated uncertainty emerges with sufficient training data
274
-
275
- ### Fine-tuning
276
- - Start from pre-trained weights (if available)
277
- - Use domain-specific data for best calibration
278
- - Larger batch sizes help uncertainty statistics stabilize
279
-
280
- ### Evaluation
281
- ```python
282
- # Assess calibration: compare uncertainty to actual accuracy
283
- # High uncertainty should correlate with lower accuracy
284
- ```
285
-
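One concrete version of that check, sketched with synthetic data (the helper and the bin count are illustrative assumptions; in real use the codes would come from the model's tracked uncertainty history and correctness from your eval labels):

```python
from collections import defaultdict

def calibration_by_bin(codes, correct, num_bins=4):
    """Group predictions by uncertainty code and report accuracy per bin.

    For a well-calibrated model, accuracy should fall as the bin index rises.
    """
    bins = defaultdict(lambda: [0, 0])   # bin index -> [hits, total]
    for code, ok in zip(codes, correct):
        b = code * num_bins // 65_536    # which slice of the 16-bit code range
        bins[b][0] += int(ok)
        bins[b][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(bins.items())}

# Synthetic example: low codes mostly correct, high codes mostly wrong
codes   = [1_000, 2_000, 20_000, 30_000, 50_000, 60_000]
correct = [True,  True,  True,   False,  False,  False]
acc = calibration_by_bin(codes, correct)   # accuracy per uncertainty bin
```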
286
- ## Technical Implementation
287
-
288
- ### Files
289
- - `modeling.py`: Core architecture with introspective mechanism
290
- - `configuration.py`: Model configuration
291
- - `processing.py`: Vision/text processor
292
- - `test.py`: Inference example
293
-
294
- ### Key Methods
295
-
296
- ```python
- # In PrismaVLModel
- def __init__(self):
-     super().__init__()
-     self.uncertainty_embeddings = nn.Embedding(65536, hidden_dim)
-     self.prev_uncertainty_code = None   # [batch, seq], values in [0, 65535]
-
- def reset_uncertainty(self):
-     """Clear uncertainty history between generations"""
-     self.prev_uncertainty_code = None
-
- # In forward pass: inject the previous step's uncertainty as an embedding
- uncertainty_embeds = self.uncertainty_embeddings(prev_uncertainty_code)
- inputs_embeds = inputs_embeds + uncertainty_embeds   # shifted right one step
-
- # After logits: measure entropy, normalize by its maximum, quantize to 16 bits
- entropy = -(probs * log_probs).sum(-1)
- entropy_norm = entropy / math.log(vocab_size)   # normalize to [0, 1]
- uncertainty_code = (entropy_norm * 65535).long()
- ```
314
-
315
- ### Dependencies
316
- ```
317
- torch >= 2.0.0
318
- transformers >= 4.57.0
319
- accelerate >= 0.20.0
320
- Pillow
321
- ```
322
-
323
- ## Hardware Requirements
324
-
325
- | Configuration | VRAM | Precision | Batch Size |
326
- |--------------|------|-----------|------------|
327
- | Minimum | 16GB | 8-bit | 1 |
328
- | Recommended | 24GB | BFloat16 | 2-4 |
329
- | Optimal | 40GB+ | BFloat16 | 8+ |
330
-
331
- ## Research Context
332
-
333
- This architecture demonstrates that **transformer self-awareness is learnable** through standard training. No RLHF, no auxiliary losses, no external signals - just 65,536 embeddings that learn to represent "how uncertain was I?"
334
-
335
- The key insight: **uncertainty is a learnable signal, not a post-hoc calculation**. With 16-bit quantization, the model can develop a highly nuanced understanding of its own confidence states.
336
-
337
- ## Future Directions
338
-
339
- Potential extensions of this architecture:
340
-
341
- 1. **Multi-Resolution Uncertainty**: Track uncertainty at token, phrase, and document levels
342
- 2. **Cross-Modal Uncertainty**: Separate tracking for vision vs. language predictions
343
- 3. **Uncertainty-Guided Sampling**: Adjust temperature based on live uncertainty
344
- 4. **Explicit Uncertainty Tokens**: Generate "<uncertain>" tokens in output
345
- 5. **Confidence-Aware Search**: Use uncertainty for better beam search
346
 
347
  ## Citation
348
 
@@ -358,12 +150,6 @@ Potential extensions of this architecture:
358
 
359
  Apache 2.0
360
 
361
- ## Acknowledgments
362
-
363
- - Architecture inspired by temporal feedback patterns in cognitive science
364
- - 16-bit high-resolution quantization for fine-grained uncertainty representation
365
- - Vision-language backbone based on multimodal transformer designs
366
-
367
  ## Additional Resources
368
 
369
  - [Architecture Deep Dive](./INTROSPECTIVE_ARCHITECTURE.md)
@@ -372,10 +158,4 @@ Apache 2.0
372
 
373
  ---
374
 
375
- **This is not a modified model. This is the architecture.**
376
-
377
- Prisma-VL-8B exists to demonstrate that transformers can be introspective by design.
378
-
379
- **Status**: ✅ Production ready - fully functional in training and inference
380
-
381
- **Last Updated**: 2025-01-08
 
14
 
15
  # Prisma-VL-8B: Introspective Vision-Language Model
16
 
17
+ **A vision-language model architected with temporal uncertainty feedback for self-aware predictions.**
18
 
19
  ## What is This?
20
 
21
+ Prisma-VL-8B is a reference implementation of an introspective transformer architecture. The model tracks its own confidence and uses it to calibrate subsequent predictions.
 
 
22
 
23
  ## Core Architecture
24
 
25
  ### The Introspective Mechanism
26
 
27
+ Every transformer processes tokens sequentially. Prisma-VL-8B adds one crucial element: memory of its own uncertainty.
28
 
29
  ```
30
  Standard Transformer:
 
36
 
37
  ### How It Works
38
 
39
+ The Uncertainty System:
40
 
41
  At each prediction step:
42
  1. **Measure**: Compute entropy of output distribution (how uncertain am I?)
43
+ 2. **Quantize**: Convert to 16-bit code representing confidence levels
44
  3. **Inject**: Next token receives this as learned embedding signal
45
  4. **Learn**: Through training, model learns what each uncertainty level means
46
 
 
49
  - When it's extrapolating (rising uncertainty)
50
  - When it needs to be conservative (high uncertainty)
51
 
52
  ## Why This Matters
53
 
54
  ### Traditional Language Models
 
71
  [code:23] → [code:15] → [code:142] → STOP # Detects uncertainty spike
72
  ```
73
 
74
+ The model feels when predictions are going wrong and can self-correct or abstain.
 
75
 
76
  ## Usage
77
 
 
132
  print(f"Highest uncertainty code: {max_uncertainty}")
133
  ```
134
 
135
+ ## Introspection
 
136
 
137
+ From prediction emerges language; from awareness of uncertainty emerges introspection.
 
138
 
139
  ## Citation
140
 
 
150
 
151
  Apache 2.0
152
 
153
  ## Additional Resources
154
 
155
  - [Architecture Deep Dive](./INTROSPECTIVE_ARCHITECTURE.md)
 
158
 
159
  ---
160
 
161
+ Prisma-VL-8B demonstrates that transformers can be introspective by design.