# Runtime And Model Support

## Execution Model

Pipeline:

1. Generate visible reasoning trace.
2. Normalize and split trace into sentences.
3. Map sentence spans to token spans.
4. Run forward + backward pass.
5. Build sentence influence matrix from gradient x attention.
6. Summarize top edges and importance scores.

## Device And Dtype Policy

Default policy:

- CUDA:
  - `bfloat16` if supported
  - else `float16`
- MPS:
  - `float16`
- CPU:
  - `float32`

Override with:

- `DTYPE_PREFERENCE`
- request `dtype_preference`

## Model Requirements

Model must support all of:

- causal LM generation
- `output_attentions=True`
- eager attention
- supported decoder layer layout
- supported attention module attribute

Supported layer paths:

- `model.layers`
- `model.model.layers`
- `transformer.h`
- `gpt_neox.layers`

Supported attention attrs:

- `self_attn`
- `attn`
- `attention`

## Why Trace Limits Exist

Attribution path uses full backward pass over attention tensors. Cost grows with:

- sequence length
- layer count
- head count
- sentence count

Public defaults stay small to protect uptime.

## Good First Runtime Settings

For public demo:

- `max_new_tokens=128`
- `max_trace_tokens=256`
- `max_sentences=16`
- `validate_top_k=0`

For deeper analysis on bigger GPU:

- raise trace tokens slowly
- watch latency and memory first
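
The default dtype policy and the layer/attention lookup from the sections above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names (`pick_dtype`, `find_decoder_layers`, `find_attention_module`) are hypothetical, the layer paths are assumed to be attribute paths on the loaded model object, and the single `override` argument stands in for `DTYPE_PREFERENCE` or the request `dtype_preference`, since how those two are combined is not stated here.

```python
from typing import Any, Optional

# Values copied from the "Model Requirements" lists.
SUPPORTED_LAYER_PATHS = (
    "model.layers",
    "model.model.layers",
    "transformer.h",
    "gpt_neox.layers",
)
SUPPORTED_ATTENTION_ATTRS = ("self_attn", "attn", "attention")


def pick_dtype(device: str, bf16_supported: bool = False,
               override: Optional[str] = None) -> str:
    """Return a dtype name for the device, following the default policy."""
    if override:
        # Assumption: an explicit preference always wins over the default.
        return override
    if device == "cuda":
        # With PyTorch, the flag could come from torch.cuda.is_bf16_supported().
        return "bfloat16" if bf16_supported else "float16"
    if device == "mps":
        return "float16"
    return "float32"  # CPU default


def find_decoder_layers(model: Any) -> Any:
    """Return the first supported layer container found on the model."""
    for path in SUPPORTED_LAYER_PATHS:
        obj = model
        for name in path.split("."):
            obj = getattr(obj, name, None)
            if obj is None:
                break  # this path does not exist; try the next one
        else:
            return obj
    raise ValueError("unsupported decoder layer layout")


def find_attention_module(layer: Any) -> Any:
    """Return the attention submodule under the first supported attribute."""
    for attr in SUPPORTED_ATTENTION_ATTRS:
        module = getattr(layer, attr, None)
        if module is not None:
            return module
    raise ValueError("unsupported attention module attribute")
```

Failing fast with `ValueError` when no path or attribute matches mirrors the "must support all of" requirement: a model with an unknown layout is rejected before any forward or backward pass is attempted.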