Update README.md
Browse files
README.md
CHANGED
|
@@ -14,19 +14,17 @@ pipeline_tag: image-text-to-text
|
|
| 14 |
|
| 15 |
# Prisma-VL-8B: Introspective Vision-Language Model
|
| 16 |
|
| 17 |
-
**
|
| 18 |
|
| 19 |
## What is This?
|
| 20 |
|
| 21 |
-
Prisma-VL-8B is a
|
| 22 |
-
|
| 23 |
-
This is not a base model with modifications. **This IS the architecture.** The 16-bit temporal uncertainty feedback mechanism is fundamental to how this model thinks.
|
| 24 |
|
| 25 |
## Core Architecture
|
| 26 |
|
| 27 |
### The Introspective Mechanism
|
| 28 |
|
| 29 |
-
Every transformer processes tokens sequentially. Prisma-VL-8B adds one crucial element:
|
| 30 |
|
| 31 |
```
|
| 32 |
Standard Transformer:
|
|
@@ -38,11 +36,11 @@ Introspective Transformer:
|
|
| 38 |
|
| 39 |
### How It Works
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
At each prediction step:
|
| 44 |
1. **Measure**: Compute entropy of output distribution (how uncertain am I?)
|
| 45 |
-
2. **Quantize**: Convert to 16-bit code
|
| 46 |
3. **Inject**: Next token receives this as learned embedding signal
|
| 47 |
4. **Learn**: Through training, model learns what each uncertainty level means
|
| 48 |
|
|
@@ -51,25 +49,6 @@ At each prediction step:
|
|
| 51 |
- When it's extrapolating (rising uncertainty)
|
| 52 |
- When it needs to be conservative (high uncertainty)
|
| 53 |
|
| 54 |
-
### Architecture Components
|
| 55 |
-
|
| 56 |
-
```python
|
| 57 |
-
# Core introspective components (built into PrismaVLModel)
|
| 58 |
-
|
| 59 |
-
self.uncertainty_embeddings = nn.Embedding(65536, hidden_dim)
|
| 60 |
-
# 65,536 learned vectors: "uncertainty vocabulary"
|
| 61 |
-
# Each represents: "I was X% uncertain on the last token"
|
| 62 |
-
|
| 63 |
-
self.prev_uncertainty_code = None # [batch, seq] with values [0-65535]
|
| 64 |
-
# Temporal memory: tracks uncertainty history across generation
|
| 65 |
-
```
|
| 66 |
-
|
| 67 |
-
**Parameter Cost:** 65,536 × 4096 = 268,435,456 parameters (3.35% of model)
|
| 68 |
-
|
| 69 |
-
**Memory Cost:** 2 bytes per token (uncertainty code)
|
| 70 |
-
|
| 71 |
-
**Compute Cost:** One embedding lookup per token (negligible)
|
| 72 |
-
|
| 73 |
## Why This Matters
|
| 74 |
|
| 75 |
### Traditional Language Models
|
|
@@ -92,20 +71,7 @@ Generate "The capital of France is Mad..."
|
|
| 92 |
[code:23] → [code:15] → [code:142] → STOP # Detects uncertainty spike
|
| 93 |
```
|
| 94 |
|
| 95 |
-
The model
|
| 96 |
-
|
| 97 |
-
## What Gets Learned
|
| 98 |
-
|
| 99 |
-
Through standard training (no special loss needed), the 65,536 uncertainty embeddings learn semantic meaning:
|
| 100 |
-
|
| 101 |
-
| Code Range | Semantic Meaning | Learned Behavior |
|
| 102 |
-
|------------|------------------|------------------|
|
| 103 |
-
| 0-16383 | "I was very confident" | Maintain trajectory, continue assertively |
|
| 104 |
-
| 16384-32767 | "Moderate confidence" | Proceed with caution, verify facts |
|
| 105 |
-
| 32768-49151 | "Some uncertainty" | Hedge statements, qualify claims |
|
| 106 |
-
| 49152-65535 | "Very uncertain" | Conservative generation, flag uncertainty |
|
| 107 |
-
|
| 108 |
-
This creates a **calibration vocabulary** - the model learns to speak about its own knowledge state with fine-grained resolution.
|
| 109 |
|
| 110 |
## Usage
|
| 111 |
|
|
@@ -166,183 +132,9 @@ print(f"Average confidence: {1 - mean_uncertainty:.2%}")
|
|
| 166 |
print(f"Highest uncertainty code: {max_uncertainty}")
|
| 167 |
```
|
| 168 |
|
| 169 |
-
##
|
| 170 |
-
|
| 171 |
-
```python
|
| 172 |
-
# Between independent generations, reset uncertainty history
|
| 173 |
-
model.model.reset_uncertainty()
|
| 174 |
-
|
| 175 |
-
# Fresh start - no previous context
|
| 176 |
-
outputs = model.generate(**inputs)
|
| 177 |
-
```
|
| 178 |
-
|
| 179 |
-
## Model Specifications
|
| 180 |
-
|
| 181 |
-
### Vision Encoder
|
| 182 |
-
- **Architecture**: 27-layer Vision Transformer
|
| 183 |
-
- **Hidden Dimension**: 1152
|
| 184 |
-
- **Patch Size**: 16×16
|
| 185 |
-
- **Temporal Patches**: 2 (for video)
|
| 186 |
-
- **Parameters**: ~1.15B
|
| 187 |
-
|
| 188 |
-
### Language Model
|
| 189 |
-
- **Architecture**: 36-layer Transformer
|
| 190 |
-
- **Hidden Dimension**: 4096
|
| 191 |
-
- **Attention Heads**: 32 (8 KV heads, GQA)
|
| 192 |
-
- **Intermediate Size**: 12,288
|
| 193 |
-
- **Context Length**: 262,144 tokens
|
| 194 |
-
- **Parameters**: ~6.85B
|
| 195 |
-
|
| 196 |
-
### Introspective System
|
| 197 |
-
- **Uncertainty Levels**: 65,536 (16-bit)
|
| 198 |
-
- **Uncertainty Embeddings**: 65,536 × 4096
|
| 199 |
-
- **Parameters**: 268,435,456 (268M)
|
| 200 |
-
- **Overhead**: 3.35% of total model
|
| 201 |
-
|
| 202 |
-
### Total Model
|
| 203 |
-
- **Parameters**: ~8.27B (7.85B base + 268M introspective)
|
| 204 |
-
- **Precision**: BFloat16 recommended
|
| 205 |
-
- **Hardware**: 24GB VRAM recommended
|
| 206 |
-
|
| 207 |
-
## Design Philosophy
|
| 208 |
-
|
| 209 |
-
### Why 16-bit Quantization?
|
| 210 |
-
|
| 211 |
-
- **Fine-Grained Resolution**: 65,536 levels capture nuanced confidence gradations
|
| 212 |
-
- **Rich Representation**: Model can learn subtle uncertainty distinctions
|
| 213 |
-
- **Precise Calibration**: Higher resolution enables better self-awareness
|
| 214 |
-
- **Still Efficient**: Only 2 bytes per token, single embedding table lookup
|
| 215 |
-
|
| 216 |
-
### Why Temporal Feedback?
|
| 217 |
-
|
| 218 |
-
- **Causal Awareness**: Model sees its own prediction history
|
| 219 |
-
- **Self-Correction**: Can detect and recover from errors
|
| 220 |
-
- **Calibration**: Learns confidence from experience
|
| 221 |
-
- **No External Labels**: Uses its own predictions as training signal
|
| 222 |
-
|
| 223 |
-
### Why Built-In?
|
| 224 |
-
|
| 225 |
-
- **Native Integration**: Works seamlessly with vision and text processing
|
| 226 |
-
- **Always Active**: No modes to enable/disable
|
| 227 |
-
- **End-to-End Training**: Learns uncertainty simultaneously with task
|
| 228 |
-
- **Production Ready**: No inference overhead, no special handling
|
| 229 |
-
|
| 230 |
-
## When to Use This Architecture
|
| 231 |
-
|
| 232 |
-
### ✅ Good Fit
|
| 233 |
-
- Applications requiring calibrated confidence estimates
|
| 234 |
-
- Domains where hallucination prevention is critical
|
| 235 |
-
- Long-form generation (benefits from temporal awareness)
|
| 236 |
-
- Interactive systems (can express uncertainty to users)
|
| 237 |
-
- Research on model introspection and self-awareness
|
| 238 |
-
|
| 239 |
-
### ⚠️ Considerations
|
| 240 |
-
- Requires fine-tuning for uncertainty calibration
|
| 241 |
-
- Adds 1.05M parameters (minimal but non-zero)
|
| 242 |
-
- Uncertainty codes need interpretation in your domain
|
| 243 |
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
### Computational Overhead
|
| 247 |
-
|
| 248 |
-
| Phase | Additional Cost |
|
| 249 |
-
|-------|----------------|
|
| 250 |
-
| Forward Pass | +1 embedding lookup per token (~0.1% compute) |
|
| 251 |
-
| Uncertainty Computation | Entropy calculation (in `torch.no_grad()`, negligible) |
|
| 252 |
-
| Memory | +2 bytes per token in cache |
|
| 253 |
-
| Training | Standard backprop through uncertainty embeddings |
|
| 254 |
-
|
| 255 |
-
### Expected Benefits (After Fine-tuning)
|
| 256 |
-
|
| 257 |
-
- **Calibration**: Better alignment between confidence and accuracy
|
| 258 |
-
- **Hallucination Reduction**: Early detection of uncertain territory
|
| 259 |
-
- **Adaptive Behavior**: Conservative when uncertain, assertive when confident
|
| 260 |
-
- **Interpretability**: Uncertainty codes reveal model state
|
| 261 |
-
|
| 262 |
-
## Training Recommendations
|
| 263 |
-
|
| 264 |
-
### Initial Setup
|
| 265 |
-
1. Load model with randomly initialized uncertainty embeddings
|
| 266 |
-
2. Use your standard vision-language training recipe
|
| 267 |
-
3. No changes to loss functions or training loops required
|
| 268 |
-
4. Uncertainty mechanism learns automatically
|
| 269 |
-
|
| 270 |
-
### Convergence
|
| 271 |
-
- Uncertainty embeddings converge at similar rate to language model
|
| 272 |
-
- Monitor validation loss as usual
|
| 273 |
-
- Well-calibrated uncertainty emerges with sufficient training data
|
| 274 |
-
|
| 275 |
-
### Fine-tuning
|
| 276 |
-
- Start from pre-trained weights (if available)
|
| 277 |
-
- Use domain-specific data for best calibration
|
| 278 |
-
- Larger batch sizes help uncertainty statistics stabilize
|
| 279 |
-
|
| 280 |
-
### Evaluation
|
| 281 |
-
```python
|
| 282 |
-
# Assess calibration: compare uncertainty to actual accuracy
|
| 283 |
-
# High uncertainty should correlate with lower accuracy
|
| 284 |
-
```
|
| 285 |
-
|
| 286 |
-
## Technical Implementation
|
| 287 |
-
|
| 288 |
-
### Files
|
| 289 |
-
- `modeling.py`: Core architecture with introspective mechanism
|
| 290 |
-
- `configuration.py`: Model configuration
|
| 291 |
-
- `processing.py`: Vision/text processor
|
| 292 |
-
- `test.py`: Inference example
|
| 293 |
-
|
| 294 |
-
### Key Methods
|
| 295 |
-
|
| 296 |
-
```python
|
| 297 |
-
# In PrismaVLModel
|
| 298 |
-
def __init__(self):
|
| 299 |
-
self.uncertainty_embeddings = nn.Embedding(65536, hidden_dim)
|
| 300 |
-
self.prev_uncertainty_code = None
|
| 301 |
-
|
| 302 |
-
def reset_uncertainty(self):
|
| 303 |
-
"""Clear uncertainty history between generations"""
|
| 304 |
-
self.prev_uncertainty_code = None
|
| 305 |
-
|
| 306 |
-
# In forward pass
|
| 307 |
-
uncertainty_embeds = self.uncertainty_embeddings(prev_uncertainty_code)
|
| 308 |
-
inputs_embeds = inputs_embeds + uncertainty_shifted
|
| 309 |
-
|
| 310 |
-
# After logits
|
| 311 |
-
entropy = -(probs * log_probs).sum(-1)
|
| 312 |
-
uncertainty_code = (entropy_norm * 65535).long()
|
| 313 |
-
```
|
| 314 |
-
|
| 315 |
-
### Dependencies
|
| 316 |
-
```
|
| 317 |
-
torch >= 2.0.0
|
| 318 |
-
transformers >= 4.57.0
|
| 319 |
-
accelerate >= 0.20.0
|
| 320 |
-
Pillow
|
| 321 |
-
```
|
| 322 |
-
|
| 323 |
-
## Hardware Requirements
|
| 324 |
-
|
| 325 |
-
| Configuration | VRAM | Precision | Batch Size |
|
| 326 |
-
|--------------|------|-----------|------------|
|
| 327 |
-
| Minimum | 16GB | 8-bit | 1 |
|
| 328 |
-
| Recommended | 24GB | BFloat16 | 2-4 |
|
| 329 |
-
| Optimal | 40GB+ | BFloat16 | 8+ |
|
| 330 |
-
|
| 331 |
-
## Research Context
|
| 332 |
-
|
| 333 |
-
This architecture demonstrates that **transformer self-awareness is learnable** through standard training. No RLHF, no auxiliary losses, no external signals - just 65,536 embeddings that learn to represent "how uncertain was I?"
|
| 334 |
-
|
| 335 |
-
The key insight: **uncertainty is a learnable signal, not a post-hoc calculation**. With 16-bit quantization, the model can develop a highly nuanced understanding of its own confidence states.
|
| 336 |
-
|
| 337 |
-
## Future Directions
|
| 338 |
-
|
| 339 |
-
Potential extensions of this architecture:
|
| 340 |
-
|
| 341 |
-
1. **Multi-Resolution Uncertainty**: Track uncertainty at token, phrase, and document levels
|
| 342 |
-
2. **Cross-Modal Uncertainty**: Separate tracking for vision vs. language predictions
|
| 343 |
-
3. **Uncertainty-Guided Sampling**: Adjust temperature based on live uncertainty
|
| 344 |
-
4. **Explicit Uncertainty Tokens**: Generate "<uncertain>" tokens in output
|
| 345 |
-
5. **Confidence-Aware Search**: Use uncertainty for better beam search
|
| 346 |
|
| 347 |
## Citation
|
| 348 |
|
|
@@ -358,12 +150,6 @@ Potential extensions of this architecture:
|
|
| 358 |
|
| 359 |
Apache 2.0
|
| 360 |
|
| 361 |
-
## Acknowledgments
|
| 362 |
-
|
| 363 |
-
- Architecture inspired by temporal feedback patterns in cognitive science
|
| 364 |
-
- 16-bit high-resolution quantization for fine-grained uncertainty representation
|
| 365 |
-
- Vision-language backbone based on multimodal transformer designs
|
| 366 |
-
|
| 367 |
## Additional Resources
|
| 368 |
|
| 369 |
- [Architecture Deep Dive](./INTROSPECTIVE_ARCHITECTURE.md)
|
|
@@ -372,10 +158,4 @@ Apache 2.0
|
|
| 372 |
|
| 373 |
---
|
| 374 |
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
Prisma-VL-8B exists to demonstrate that transformers can be introspective by design.
|
| 378 |
-
|
| 379 |
-
**Status**: ✅ Production ready - fully functional in training and inference
|
| 380 |
-
|
| 381 |
-
**Last Updated**: 2025-01-08
|
|
|
|
| 14 |
|
| 15 |
# Prisma-VL-8B: Introspective Vision-Language Model
|
| 16 |
|
| 17 |
+
**A vision-language model architected with temporal uncertainty feedback for self-aware predictions.**
|
| 18 |
|
| 19 |
## What is This?
|
| 20 |
|
| 21 |
+
Prisma-VL-8B is a reference implementation of an introspective transformer architecture. The model uses its confidence to calibrate subsequent predictions.
|
|
|
|
|
|
|
| 22 |
|
| 23 |
## Core Architecture
|
| 24 |
|
| 25 |
### The Introspective Mechanism
|
| 26 |
|
| 27 |
+
Every transformer processes tokens sequentially. Prisma-VL-8B adds one crucial element: memory of its own uncertainty.
|
| 28 |
|
| 29 |
```
|
| 30 |
Standard Transformer:
|
|
|
|
| 36 |
|
| 37 |
### How It Works
|
| 38 |
|
| 39 |
+
The Uncertainty System:
|
| 40 |
|
| 41 |
At each prediction step:
|
| 42 |
1. **Measure**: Compute entropy of output distribution (how uncertain am I?)
|
| 43 |
+
2. **Quantize**: Convert to 16-bit code representing confidence levels
|
| 44 |
3. **Inject**: Next token receives this as learned embedding signal
|
| 45 |
4. **Learn**: Through training, model learns what each uncertainty level means
|
| 46 |
|
|
|
|
| 49 |
- When it's extrapolating (rising uncertainty)
|
| 50 |
- When it needs to be conservative (high uncertainty)
|
| 51 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 52 |
## Why This Matters
|
| 53 |
|
| 54 |
### Traditional Language Models
|
|
|
|
| 71 |
[code:23] → [code:15] → [code:142] → STOP # Detects uncertainty spike
|
| 72 |
```
|
| 73 |
|
| 74 |
+
The model feels when predictions are going wrong and can self-correct or abstain.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
## Usage
|
| 77 |
|
|
|
|
| 132 |
print(f"Highest uncertainty code: {max_uncertainty}")
|
| 133 |
```
|
| 134 |
|
| 135 |
+
## Introspection
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
|
| 137 |
+
From prediction, emerges language. From awareness of uncertainty, emerges introspection.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 138 |
|
| 139 |
## Citation
|
| 140 |
|
|
|
|
| 150 |
|
| 151 |
Apache 2.0
|
| 152 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
## Additional Resources
|
| 154 |
|
| 155 |
- [Architecture Deep Dive](./INTROSPECTIVE_ARCHITECTURE.md)
|
|
|
|
| 158 |
|
| 159 |
---
|
| 160 |
|
| 161 |
+
Prisma-VL-8B demonstrates introspective transformers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|