# Runtime And Model Support

## Execution Model

Pipeline:

1. Generate a visible reasoning trace.
2. Normalize the trace and split it into sentences.
3. Map sentence spans to token spans.
4. Run a forward and a backward pass.
5. Build the sentence influence matrix from gradient × attention.
6. Summarize the top edges and importance scores.
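
Steps 3, 5, and 6 can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a token-level score matrix (for example gradient × attention, already summed over layers and heads) is available, and it assumes mean pooling over token spans. The names `sentence_influence` and `top_edges` are hypothetical.

```python
from typing import List, Tuple


def sentence_influence(
    token_scores: List[List[float]],
    spans: List[Tuple[int, int]],
) -> List[List[float]]:
    """Pool a token-level score matrix into a sentence-level matrix.

    token_scores[t][s] is the assumed score of source token s on target
    token t. spans[i] is the half-open token range [start, end) of
    sentence i. Mean pooling is an assumption of this sketch.
    """
    n = len(spans)
    matrix = [[0.0] * n for _ in range(n)]
    for i, (t0, t1) in enumerate(spans):
        for j, (s0, s1) in enumerate(spans):
            if j > i:
                continue  # causal: a later sentence cannot influence an earlier one
            count = (t1 - t0) * (s1 - s0)
            if count == 0:
                continue
            total = sum(
                token_scores[t][s] for t in range(t0, t1) for s in range(s0, s1)
            )
            matrix[i][j] = total / count
    return matrix


def top_edges(matrix: List[List[float]], k: int) -> List[Tuple[int, int, float]]:
    """Return the k strongest (source, target, score) edges between distinct sentences."""
    edges = [
        (matrix[i][j], j, i) for i in range(len(matrix)) for j in range(i)
    ]
    edges.sort(reverse=True)
    return [(src, dst, score) for score, src, dst in edges[:k]]


# Two sentences of two tokens each.
scores = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 1.0, 0.0],
    [4.0, 0.0, 1.0, 1.0],
]
m = sentence_influence(scores, [(0, 2), (2, 4)])
edges = top_edges(m, 1)
```

Here `m[1][0]` is the influence of the first sentence on the second.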

## Device And Dtype Policy

Default policy:

- CUDA:
  - `bfloat16` if supported
  - else `float16`
- MPS:
  - `float16`
- CPU:
  - `float32`

Override with:

- `DTYPE_PREFERENCE`
- request `dtype_preference`
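
The default policy above can be expressed as a small lookup. This is a sketch under stated assumptions: `resolve_dtype` is a hypothetical helper, dtypes are returned as plain strings, and a single `dtype_preference` argument stands in for both override sources because their precedence is not specified here. With PyTorch, `bf16_supported` could come from `torch.cuda.is_bf16_supported()`.

```python
from typing import Optional

ALLOWED_DTYPES = {"bfloat16", "float16", "float32"}


def resolve_dtype(
    device: str,
    bf16_supported: bool = False,
    dtype_preference: Optional[str] = None,
) -> str:
    """Pick a dtype name for a device, honoring an explicit preference."""
    if dtype_preference is not None:
        if dtype_preference not in ALLOWED_DTYPES:
            raise ValueError(f"unsupported dtype preference: {dtype_preference}")
        return dtype_preference
    if device == "cuda":
        # bfloat16 when the GPU supports it, otherwise float16.
        return "bfloat16" if bf16_supported else "float16"
    if device == "mps":
        return "float16"
    if device == "cpu":
        return "float32"
    raise ValueError(f"unknown device: {device}")
```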

## Model Requirements

The model must support all of the following:

- causal LM generation
- `output_attentions=True`
- eager attention (fused kernels such as SDPA or FlashAttention do not return attention weights)
- supported decoder layer layout
- supported attention module attribute

Supported layer paths:

- `model.layers`
- `model.model.layers`
- `transformer.h`
- `gpt_neox.layers`

Supported attention attributes:

- `self_attn`
- `attn`
- `attention`
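
A layout check along these lines could probe the paths and attributes listed above. This is a hedged sketch: it assumes each layer path is a dotted attribute path relative to the loaded model object, and `find_layers_and_attn` is a hypothetical name. `SimpleNamespace` stands in for a real model so the example runs without any ML library.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple

LAYER_PATHS = ("model.layers", "model.model.layers", "transformer.h", "gpt_neox.layers")
ATTN_ATTRS = ("self_attn", "attn", "attention")


def resolve_path(root: Any, dotted: str) -> Optional[Any]:
    """Follow a dotted attribute path; return None if any part is missing."""
    obj = root
    for part in dotted.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


def find_layers_and_attn(model: Any) -> Tuple[str, str]:
    """Return the first (layer path, attention attribute) pair the model exposes."""
    for path in LAYER_PATHS:
        layers = resolve_path(model, path)
        if layers is None or len(layers) == 0:
            continue
        for attr in ATTN_ATTRS:
            if hasattr(layers[0], attr):
                return path, attr
    raise ValueError("unsupported decoder layer layout or attention attribute")


# Stand-in objects that mimic two of the supported layouts.
gpt2_like = SimpleNamespace(
    transformer=SimpleNamespace(h=[SimpleNamespace(attn=object())])
)
llama_like = SimpleNamespace(
    model=SimpleNamespace(layers=[SimpleNamespace(self_attn=object())])
)
```

Failing early with a clear error is cheaper than discovering an unsupported layout during the backward pass.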

## Why Trace Limits Exist

The attribution path runs a full backward pass over the attention tensors. Cost grows with:

- sequence length
- layer count
- head count
- sentence count

Public defaults stay small to protect uptime.
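
A rough lower bound on memory shows why sequence length dominates. The sketch below counts only the attention weight tensors kept for the backward pass, assuming one `(batch, heads, seq, seq)` tensor per layer. It ignores activations, gradients, and any per-sentence repetition, so real usage is higher. `attention_bytes` is a hypothetical helper, and the 32-layer, 32-head shape is an illustrative figure, not a specific model.

```python
def attention_bytes(
    seq_len: int,
    num_layers: int,
    num_heads: int,
    bytes_per_element: int = 2,  # float16 or bfloat16
    batch_size: int = 1,
) -> int:
    """Bytes needed to hold every layer's attention weights at once."""
    return batch_size * num_layers * num_heads * seq_len * seq_len * bytes_per_element


small = attention_bytes(seq_len=256, num_layers=32, num_heads=32)
large = attention_bytes(seq_len=1024, num_layers=32, num_heads=32)
print(small / 2**20, "MiB at 256 tokens")
print(large / 2**30, "GiB at 1024 tokens")
```

Quadrupling the sequence length multiplies this term by sixteen, while layer and head count scale it linearly.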

## Good First Runtime Settings

For a public demo:

- `max_new_tokens=128`
- `max_trace_tokens=256`
- `max_sentences=16`
- `validate_top_k=0`

For deeper analysis on a larger GPU:

- raise `max_trace_tokens` in small steps
- check latency and memory before each further increase
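
One way to keep these limits together is a small frozen settings object. This is only a sketch: `RuntimeLimits` is a hypothetical name, the field names mirror the settings listed above, and the value 384 is an arbitrary example of a modest increase.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeLimits:
    """Limits for one deployment; defaults match the public demo settings."""

    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0


PUBLIC_DEMO = RuntimeLimits()

# For deeper analysis, change one limit at a time and measure before going further.
deeper = replace(PUBLIC_DEMO, max_trace_tokens=384)
```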