+## PRISMA V2: Joint Uncertainty Prediction Mechanism — Implementation Specification
+**Architecture Overview**:
+PRISMA V2 replaces Python-side uncertainty state with a **learned, explicit uncertainty latent** predicted jointly with tokens. At each step, the model predicts both the next token *and* an uncertainty code that conditions the following step. This preserves temporal introspection while remaining fully compatible with stateless inference engines.
+---
+## **Core Design Principle**
+> **Uncertainty must be data, not memory.**
+All information required for the next decoding step is carried explicitly through tensors (tokens, uncertainty codes, or cache), never through mutable module state.
+---
+## **Differences from Prisma V1 (Detailed)**
+Prisma V2 is not a minor refactor of Prisma V1. It represents a **fundamental shift in how uncertainty is represented, propagated, and learned**.
+This section documents those differences precisely.
+---
+### **1. Source of Uncertainty**
+**Prisma V1**
+* Uncertainty is **measured post-hoc** from the model’s output distribution
+* Computed via entropy of logits
+* Acts as an external diagnostic signal
+```text
+uncertainty_t = H(P(y_t))
+```
+**Prisma V2**
+* Uncertainty is **predicted by the model itself**
+* Learned as an auxiliary latent variable
+* Acts as an internal representation
+```text
+(token_{t+1}, uncertainty_{t+1}) = f(token_t, uncertainty_t)
+```
+**Implication**:
+V1 answers *“how uncertain was I?”*
+V2 answers *“how uncertain will I be?”*
+---
+### **2. State Representation**
+**Prisma V1**
+* Uses mutable Python-side state:
+```python
+self.prev_uncertainty_code
+```
+* State exists **outside** the model’s forward graph
+* Relies on strict step-by-step execution order
+**Prisma V2**
+* No mutable state
+* Uncertainty is passed explicitly as a tensor:
+```python
+uncertainty_codes: Tensor[B, S]
+```
+* Fully contained within the model’s inputs and outputs
+**Implication**:
+V1 requires engine cooperation.
+V2 requires only tensors.
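The difference between mutable state and explicit data can be illustrated with a small, framework-free sketch. Everything here is hypothetical: `toy_step` is an arbitrary deterministic stand-in for a model step, and the class and function names are invented for illustration. The point is only that a shared mutable code is overwritten when requests are interleaved, while per-request codes passed as data are not.

```python
from typing import List, Sequence, Tuple

NEUTRAL = 128  # neutral start, i.e. n_uncertainty_levels // 2 for 256 levels


def toy_step(token: int, code: int) -> Tuple[int, int]:
    """Hypothetical stand-in for one model step; a pure function of its inputs."""
    return (token * 31 + code) % 1000, (code + token) % 256


class StatefulDecoder:
    """V1-style: the previous uncertainty code lives on the object."""

    def __init__(self) -> None:
        self.prev_uncertainty_code = NEUTRAL

    def step(self, token: int) -> int:
        token, self.prev_uncertainty_code = toy_step(token, self.prev_uncertainty_code)
        return token


def decode_alone(start: int, steps: int) -> List[int]:
    """Reference result: one request decoded by itself."""
    token, code, out = start, NEUTRAL, []
    for _ in range(steps):
        token, code = toy_step(token, code)
        out.append(token)
    return out


def interleaved_stateful(starts: Sequence[int], steps: int) -> List[List[int]]:
    """Several requests share one decoder and overwrite each other's code."""
    decoder, tokens = StatefulDecoder(), list(starts)
    outs: List[List[int]] = [[] for _ in starts]
    for _ in range(steps):
        for i, tok in enumerate(tokens):
            tokens[i] = decoder.step(tok)
            outs[i].append(tokens[i])
    return outs


def interleaved_stateless(starts: Sequence[int], steps: int) -> List[List[int]]:
    """V2-style: each request carries its own (token, code) pair as data."""
    states = [(s, NEUTRAL) for s in starts]
    outs: List[List[int]] = [[] for _ in starts]
    for _ in range(steps):
        for i, (tok, code) in enumerate(states):
            states[i] = toy_step(tok, code)
            outs[i].append(states[i][0])
    return outs
```

With two interleaved requests, `interleaved_stateless` reproduces what each request yields when decoded alone, whereas `interleaved_stateful` does not, because the second request reads the code left behind by the first.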
+---
+### **3. Runtime Compatibility**
+| Runtime                  | Prisma V1 | Prisma V2 |
+| ------------------------ | --------- | --------- |
+| HuggingFace Transformers | ✅         | ✅         |
+| vLLM                     | ❌         | ✅         |
+| llama.cpp                | ❌         | ✅         |
+| MLX                      | ❌         | ✅         |
+| Tensor Parallel          | ⚠️        | ✅         |
+**Reason**:
+* V1 violates the stateless decoding assumptions of modern runtimes
+* V2 conforms to them by construction
+---
+### **4. Temporal Feedback Mechanism**
+**Prisma V1**
+* Feedback loop implemented via external buffer
+* Requires padding, truncation, and shifting logic
+* Not visible to KV cache or sampler
+**Prisma V2**
+* Feedback loop is **architectural**
+* Uncertainty is predicted one step ahead and injected naturally
+* Temporal alignment is implicit in training and decoding
+**Implication**:
+V2’s feedback loop is **native**, not simulated.
+---
+### **5. Learning Dynamics**
+**Prisma V1**
+* Uncertainty signal is fixed (entropy)
+* Model can only learn *how to react* to uncertainty
+* Cannot redefine what uncertainty means
+**Prisma V2**
+* Uncertainty is supervised initially by entropy, then free to diverge
+* Model can learn:
+  * epistemic uncertainty
+  * ambiguity
+  * distribution shift
+  * task-specific hesitation signals
+**Implication**:
+V1 teaches *response to uncertainty*.
+V2 teaches *representation of uncertainty*.
+---
+### **6. Training Complexity**
+**Prisma V1**
+* No additional loss
+* Entropy computed on every forward pass
+* Sensitive to tensor parallel sharding
+**Prisma V2**
+* Adds a lightweight auxiliary loss
+* Entropy used only as a teacher signal during training
+* No entropy computation at inference
+**Implication**:
+V2 trades a small training cost for large inference robustness.
+---
+### **7. Inference Behavior**
+**Prisma V1**
+* Uncertainty exists only implicitly
+* Difficult to inspect or intervene at runtime
+* Breaks under batched or reordered decoding
+**Prisma V2**
+* Uncertainty is explicit and inspectable
+* Sampler can condition on it
+* Works under any batching or scheduling strategy
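One way a sampler might condition on the explicit code is to map it to a sampling temperature. The sketch below is an assumption, not something this specification prescribes: the linear mapping, the `t_min` / `t_max` bounds, and the function names are all hypothetical, and whether higher uncertainty should raise or lower temperature is a policy choice left to the integrator.

```python
import math
import random
from typing import Sequence


def temperature_for_code(code: int, n_levels: int = 256,
                         t_min: float = 0.7, t_max: float = 1.3) -> float:
    """Linearly map an uncertainty code in [0, n_levels - 1] to a temperature.

    Assumed policy: low codes sample more sharply, high codes more broadly.
    """
    if not 0 <= code < n_levels:
        raise ValueError(f"code {code} outside [0, {n_levels - 1}]")
    return t_min + (t_max - t_min) * code / (n_levels - 1)


def sample_token(logits: Sequence[float], code: int, rng: random.Random) -> int:
    """Sample a token index from temperature-scaled logits."""
    temperature = temperature_for_code(code)
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Because the code is an ordinary per-request integer, the same function works regardless of how requests are batched or scheduled.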
+---
+### **8. Conceptual Framing**
+**Prisma V1**
+* Introspection via *measurement*
+* Confidence is something the model observes after the fact
+**Prisma V2**
+* Introspection via *prediction*
+* Confidence is something the model reasons about and plans with
+> Prisma V1 makes the model *aware of its uncertainty.*
+> Prisma V2 makes uncertainty part of the model’s internal world model.
+---
+### **Summary Table**
+| Dimension              | Prisma V1          | Prisma V2          |
+| ---------------------- | ------------------ | ------------------ |
+| Uncertainty source     | Entropy (measured) | Learned latent     |
+| State handling         | Mutable buffer     | Explicit tensor    |
+| Runtime support        | Limited            | Universal          |
+| KV cache compatibility | ❌                  | ✅                  |
+| Tensor parallel        | Fragile            | Safe               |
+| Introspection depth    | Reactive           | Predictive         |
+| Deployment readiness   | Research-only      | Production-capable |
+---
+### **Why Prisma V2 Exists**
+Prisma V1 demonstrated that **temporal uncertainty feedback produces introspective behavior**.
+Prisma V2 makes that insight **architectural, portable, and deployable**.
+It is not a workaround.
+It is the correct abstraction boundary.
+> *Uncertainty must be data, not memory.*
+---
+## **Core Components to Add**
+```python
+# In your CausalLM class
+self.n_uncertainty_levels = 256  # V2: compact, sufficient
+self.uncertainty_embeddings = nn.Embedding(
+    self.n_uncertainty_levels,
+    hidden_dim
+)
+# NEW: Uncertainty prediction head
+self.uncertainty_head = nn.Linear(
+    hidden_dim,
+    self.n_uncertainty_levels,
+    bias=False
+)
+```
+---
+## **Initialization Details**
+### Uncertainty Embeddings
+* Initialized from `N(0, σ²)` where `σ = config.initializer_range`
+### Uncertainty Head (Important)
+```python
+self.uncertainty_head.weight.data.zero_()
+```
+**Rationale**:
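+* With zero weights and no bias, the head initially emits all-zero logits, i.e. a uniform distribution over the `n_uncertainty_levels` codes rather than any particular code
+* At the first step the auxiliary loss sends no gradient into the hidden states, so it cannot destabilize the LM loss with noisy signals
+* The input-side uncertainty embeddings are initialized separately (see above), so early behavior stays close to the base model rather than bit-identical to it
+* The uncertainty pathway is learned gradually as the head weights move away from zero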
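The effect of the zero initialization can be checked with a short derivation. The notation is introduced here for illustration: `W` is the uncertainty head weight, `h` a hidden state, `K = n_uncertainty_levels`, `k` the target code, and `e_k` the one-hot vector for `k`. It assumes the auxiliary loss is a plain softmax cross-entropy.

```latex
\begin{aligned}
z &= W h, \qquad p = \operatorname{softmax}(z), \qquad \mathcal{L}_u = -\log p_k \\
W = 0 \;&\Rightarrow\; z = 0, \quad p_j = \tfrac{1}{K}, \quad \mathcal{L}_u = \log K = \ln 256 \approx 5.545 \\
\frac{\partial \mathcal{L}_u}{\partial h} &= W^{\top}(p - e_k) = 0, \qquad
\frac{\partial \mathcal{L}_u}{\partial W} = (p - e_k)\, h^{\top} \neq 0 \ \text{(in general)}
\end{aligned}
```

So at initialization the head itself receives a learning signal, while the backbone receives none from the auxiliary loss; the two only start to interact once `W` has moved away from zero.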
+---
+## **Forward Pass Modifications (Input Side)**
+**Location**: *Immediately after token embedding lookup*
+```python
+def forward(self, input_ids, uncertainty_codes=None, ...):
+    inputs_embeds = self.embed_tokens(input_ids)
+    if uncertainty_codes is not None:
+        # uncertainty_codes: [B, S]
+        u = self.uncertainty_embeddings(uncertainty_codes)
+        inputs_embeds = inputs_embeds + u
+    hidden_states = self.model(
+        inputs_embeds=inputs_embeds,
+        ...
+    ).last_hidden_state
+```
+* `uncertainty_codes[t]` conditions token position `t`
+* No padding, truncation, or shifting logic required
+* Temporal alignment is handled by the training and decoding loop
+---
+## **Forward Pass Modifications (Output Side)**
+**Location**: *After transformer hidden states*
+```python
+logits = self.lm_head(hidden_states)
+uncertainty_logits = self.uncertainty_head(hidden_states)
+```
+Returns:
+```python
+return {
+    "logits": logits,                      # [B, S, vocab]
+    "uncertainty_logits": uncertainty_logits  # [B, S, n_uncertainty_levels]
+}
+```
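Putting the input-side and output-side changes together, a toy module might look like the following. This is a minimal sketch under stated assumptions: `TinyPrismaV2` is a hypothetical name, the sizes are arbitrary, and a single linear layer with `tanh` stands in for the transformer stack, so it shows the data flow only.

```python
import torch
import torch.nn as nn


class TinyPrismaV2(nn.Module):
    """Toy stand-in: one linear layer replaces the transformer stack."""

    def __init__(self, vocab_size: int = 100, hidden_dim: int = 32,
                 n_uncertainty_levels: int = 256) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_dim)
        self.uncertainty_embeddings = nn.Embedding(n_uncertainty_levels, hidden_dim)
        self.backbone = nn.Linear(hidden_dim, hidden_dim)  # placeholder for the transformer
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.uncertainty_head = nn.Linear(hidden_dim, n_uncertainty_levels, bias=False)
        nn.init.zeros_(self.uncertainty_head.weight)  # zero-init, as specified above

    def forward(self, input_ids, uncertainty_codes=None):
        inputs_embeds = self.embed_tokens(input_ids)              # [B, S, H]
        if uncertainty_codes is not None:
            # Additive conditioning: code at position t conditions token at position t
            inputs_embeds = inputs_embeds + self.uncertainty_embeddings(uncertainty_codes)
        hidden_states = torch.tanh(self.backbone(inputs_embeds))  # [B, S, H]
        return {
            "logits": self.lm_head(hidden_states),                       # [B, S, vocab]
            "uncertainty_logits": self.uncertainty_head(hidden_states),  # [B, S, levels]
        }
```

Calling it with `input_ids` of shape `[2, 5]` and a matching tensor of codes yields logits of shape `[2, 5, 100]` and uncertainty logits of shape `[2, 5, 256]`; right after construction the uncertainty logits are all zero because of the zero-initialized head.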
+---
+## **Temporal Semantics**
+| Position | Input                 | Predicts                 |
+| -------- | --------------------- | ------------------------ |
+| t        | tokenₜ + uncertaintyₜ | tokenₜ₊₁, uncertaintyₜ₊₁ |
+This preserves the original PRISMA temporal feedback loop without mutable state.
+---
+## **Training Objective**
+### Language Modeling Loss
+Standard next-token prediction:
+```python
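+import torch.nn.functional as F
+# Flatten [B, S-1, vocab] to [B*(S-1), vocab]; position t is scored against token t+1
+loss_lm = F.cross_entropy(
+    logits[:, :-1].reshape(-1, logits.size(-1)),
+    labels[:, 1:].reshape(-1),
+)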
+```
+---
+### Uncertainty Prediction Loss
+Uncertainty is predicted **one step ahead**:
+```python
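+import torch.nn.functional as F
+# Flatten [B, S-1, levels] to [B*(S-1), levels]; position t is scored against the code at t+1
+loss_uncertainty = F.cross_entropy(
+    uncertainty_logits[:, :-1].reshape(-1, uncertainty_logits.size(-1)),
+    uncertainty_labels[:, 1:].reshape(-1),
+)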
+```
+---
+### Combined Loss
+```python
+loss = loss_lm + λ * loss_uncertainty
+```
+* Suggested starting point: `λ ≈ 0.1`; treat it as a hyperparameter to tune
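The two shifted losses and their weighting can be wrapped in one function, which also gives a handy sanity check. This is a sketch, not the reference training code: `prisma_v2_loss` and `lam` are hypothetical names, and padding or label masking is not handled.

```python
import torch
import torch.nn.functional as F


def prisma_v2_loss(logits, uncertainty_logits, labels, uncertainty_labels, lam=0.1):
    """Return (total, lm, uncertainty) losses with one-step-ahead alignment.

    logits:             [B, S, vocab]
    uncertainty_logits: [B, S, levels]
    labels:             [B, S] token ids
    uncertainty_labels: [B, S] integer uncertainty codes
    """
    vocab = logits.size(-1)
    levels = uncertainty_logits.size(-1)
    # Position t predicts the token at position t + 1
    loss_lm = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),
        labels[:, 1:].reshape(-1),
    )
    # Position t predicts the uncertainty code at position t + 1
    loss_uncertainty = F.cross_entropy(
        uncertainty_logits[:, :-1].reshape(-1, levels),
        uncertainty_labels[:, 1:].reshape(-1),
    )
    return loss_lm + lam * loss_uncertainty, loss_lm, loss_uncertainty
```

With all-zero uncertainty logits, as produced by a zero-initialized head, the uncertainty term equals `ln(levels)`, about 5.545 for 256 levels. Seeing that value at step 0 is a quick way to confirm the head and the label shift are wired up as intended.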
+---
+## **Uncertainty Supervision (Teacher Signal)**
+During training only, entropy is used as a **bootstrap target**, not as the definition of uncertainty.
+```python
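+import math
+import torch
+
+with torch.no_grad():
+    probs = torch.softmax(logits.float(), dim=-1)
+    # torch.special.entr(p) = -p * ln(p), defined as 0 at p = 0 (no NaNs)
+    entropy = torch.special.entr(probs).sum(dim=-1)  # [B, S], in nats
+    entropy_norm = entropy / math.log(vocab_size)    # scaled to [0, 1]
+    # quantize() is left unspecified here: it maps [0, 1] to integer codes
+    uncertainty_labels = quantize(entropy_norm)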
+```
+**Important**:
+* Entropy is a *teacher*, not a constraint
+* The model may learn uncertainty signals that diverge from entropy
+* This is desirable if they correlate better with error or ambiguity
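As a worked example of the teacher signal, the following framework-free sketch turns a single probability distribution into an uncertainty code. The uniform binning used here is an assumption; the specification does not say how `quantize` is defined, and `quantize_entropy` is a hypothetical name.

```python
import math
from typing import Sequence


def quantize_entropy(probs: Sequence[float], n_levels: int = 256) -> int:
    """Map a probability distribution to an integer uncertainty code.

    Assumed scheme: normalized entropy in [0, 1] is scaled to
    [0, n_levels - 1] and rounded to the nearest integer.
    """
    vocab_size = len(probs)
    if vocab_size < 2:
        raise ValueError("need at least two outcomes to normalize entropy")
    # Skip zero-probability entries: the limit of p * ln(p) as p -> 0 is 0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    entropy_norm = entropy / math.log(vocab_size)
    code = round(entropy_norm * (n_levels - 1))
    return min(max(code, 0), n_levels - 1)
```

Under this scheme a one-hot distribution maps to code 0 and a uniform distribution maps to the top code, `n_levels - 1`; everything else falls in between.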
+---
+## **Single-Pass Training (Preferred)**
+A second forward pass is **not required**.
+```python
+outputs = model(
+    input_ids,
+    uncertainty_codes=uncertainty_input
+)
+with torch.no_grad():
+    # forward() returns a dict, so index by key
+    uncertainty_labels = compute_uncertainty_labels(outputs["logits"])
+loss = compute_loss(
+    outputs["logits"],
+    outputs["uncertainty_logits"],
+    labels,
+    uncertainty_labels
+)
+```
+---
+## **Inference Loop (All Runtimes)**
+```text
+(tokenₜ, uncertaintyₜ) → model → (tokenₜ₊₁, uncertaintyₜ₊₁)
+```
+### Neutral Start
+```python
+uncertainty_code = n_uncertainty_levels // 2
+```
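The loop can be sketched without any framework by treating the model as a step function. Several things here are assumptions made for illustration: `step_fn` is a hypothetical callable returning `(logits, uncertainty_logits)` for the current position, both the token and the uncertainty code are chosen greedily by argmax (the specification does not say how the code is selected), and KV-cache handling is omitted.

```python
from typing import Callable, List, Sequence, Tuple

StepFn = Callable[[int, int], Tuple[Sequence[float], Sequence[float]]]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(step_fn: StepFn, first_token: int, max_new_tokens: int,
                  n_uncertainty_levels: int = 256) -> List[Tuple[int, int]]:
    """Decode while carrying (token, uncertainty_code) explicitly between steps."""
    token = first_token
    uncertainty_code = n_uncertainty_levels // 2  # neutral start
    trace: List[Tuple[int, int]] = []
    for _ in range(max_new_tokens):
        logits, uncertainty_logits = step_fn(token, uncertainty_code)
        token = argmax(logits)
        uncertainty_code = argmax(uncertainty_logits)  # feeds the next step
        trace.append((token, uncertainty_code))
    return trace


def toy_step(token: int, code: int) -> Tuple[List[float], List[float]]:
    """Hypothetical model stub: always prefers token + 1 and code + 1."""
    logits = [0.0] * 5
    logits[(token + 1) % 5] = 1.0
    uncertainty_logits = [0.0] * 256
    uncertainty_logits[(code + 1) % 256] = 1.0
    return logits, uncertainty_logits
```

All state that crosses a step boundary is in the two local variables `token` and `uncertainty_code`, which is what lets a runtime keep one code per request next to its other per-request data.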
+---
+## **Runtime Integration**
+| Runtime          | Integration                                          |
+| ---------------- | ---------------------------------------------------- |
+| **Transformers** | Custom `generate()` tracks `uncertainty_code` tensor |
+| **vLLM**         | Sampler tracks one uncertainty code per request      |
+| **llama.cpp**    | Store uncertainty code in `llama_context`            |
+| **MLX**          | Works directly (pure tensor graph)                   |
+No runtime relies on Python-side mutable state.
+---
+## **Performance Characteristics**
+| Component               | Parameters         | FLOPs      | Memory     | Latency          |
+| ----------------------- | ------------------ | ---------- | ---------- | ---------------- |
+| Uncertainty Head        | `hidden_dim × 256` | Negligible | Negligible | ~0               |
+| Uncertainty Embedding   | `256 × hidden_dim` | 0          | Tiny       | ~0               |
+| Entropy (training only) | 0                  | `O(B×S×V)` | Up to `O(B×S×V)` transient (softmax probabilities) | Not in inference |
+**Inference overhead**: effectively zero
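The parameter figures in the table can be made concrete with a little arithmetic. The `hidden_dim = 4096` and the 8-billion total used below are assumed values for illustration, not numbers taken from the model configuration.

```python
def added_parameters(hidden_dim: int, n_levels: int = 256) -> int:
    """Parameters added by the uncertainty embedding plus the bias-free head."""
    embedding = n_levels * hidden_dim  # nn.Embedding(n_levels, hidden_dim)
    head = hidden_dim * n_levels       # nn.Linear(hidden_dim, n_levels, bias=False)
    return embedding + head


# Assumed values for illustration only
hidden_dim = 4096
total_params = 8_000_000_000

extra = added_parameters(hidden_dim)
print(f"extra parameters: {extra:,}")
print(f"share of total:   {extra / total_params:.4%}")
```

Under those assumptions the two components add 2,097,152 parameters, roughly 0.03% of the total.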
+---
+## **Theoretical Intuition**
+PRISMA V2 transforms autoregressive generation from:
+```text
+P(y_t | x, y_<t)
+```
+to:
+```text
+P(y_t, c_t | x, y_<t, c_<t)
+```
+where `c_t` is a learned uncertainty latent.
+This allows the model to:
+* Reduce commitment after uncertain predictions
+* Maintain momentum after confident predictions
+* Learn task-specific uncertainty signals
+* Develop introspection without relying on engine-level state
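Written out over a whole sequence, the joint model factorizes as follows. The second line is an observation about the architecture described earlier rather than an additional claim: because the token head and the uncertainty head are two separate linear maps on the same hidden state, each normalized by its own softmax, the per-step joint splits into two terms that are conditionally independent given that hidden state. The symbols `h` and `f_θ` are introduced here for illustration.

```latex
\begin{aligned}
P(y_{1:T}, c_{1:T} \mid x) &= \prod_{t=1}^{T} P(y_t, c_t \mid x, y_{<t}, c_{<t}) \\
P(y_t, c_t \mid x, y_{<t}, c_{<t}) &= P(y_t \mid h_{t-1})\, P(c_t \mid h_{t-1}),
\qquad h_{t-1} = f_\theta(x, y_{<t}, c_{<t})
\end{aligned}
```

The coupling between tokens and uncertainty therefore happens across steps, through the codes fed back as input, not within a single step.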
+---
+## **Why PRISMA V2 Works Everywhere**
+| Constraint         | V1 | V2 |
+| ------------------ | -- | -- |
+| Stateless decoding | ❌  | ✅  |
+| vLLM batching      | ❌  | ✅  |
+| llama.cpp KV cache | ❌  | ✅  |
+| Tensor parallel    | ⚠️ | ✅  |
+| MLX tracing        | ❌  | ✅  |
+---
+## **What to Watch For**
+* **Ablation**: remove uncertainty input, measure perplexity / behavior
+* **Calibration**: does predicted uncertainty correlate with error?
+* **Behavioral shifts**: hedging, correction, abstention
+* **Divergence from entropy**: expected and healthy
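For the calibration check, one simple diagnostic is to bin tokens by predicted uncertainty code and compare error rates across bins. The sketch below is only one possible approach; the function name, the equal-width binning, and the definition of an error (predicted token differs from the reference token) are assumptions.

```python
from typing import List, Optional, Sequence


def error_rate_by_uncertainty_bin(codes: Sequence[int], errors: Sequence[bool],
                                  n_levels: int = 256,
                                  n_bins: int = 4) -> List[Optional[float]]:
    """Per-bin error rate; None for bins that received no tokens.

    codes:  predicted uncertainty code per token, in [0, n_levels - 1]
    errors: True where the predicted token was wrong
    """
    if len(codes) != len(errors):
        raise ValueError("codes and errors must have the same length")
    counts = [0] * n_bins
    wrong = [0] * n_bins
    for code, is_error in zip(codes, errors):
        bin_index = min(code * n_bins // n_levels, n_bins - 1)
        counts[bin_index] += 1
        wrong[bin_index] += int(is_error)
    return [w / c if c else None for w, c in zip(wrong, counts)]
```

If predicted uncertainty is informative, the error rate should tend to rise from the lowest bin to the highest; a flat profile suggests the code carries little information about correctness.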
+---
+## **Summary**
+Prisma V2 preserves the introspective insight of Prisma V1 while replacing fragile mutable state with an explicit, learned uncertainty representation. This makes introspection **portable, scalable, and deployable** across all modern inference engines.
+> *The model no longer measures uncertainty — it learns what uncertainty means.*