cot-anc / docs /runtime.md
BART-ender's picture
Deploy Thought Anchors
fda8fb3 verified
# Runtime And Model Support
## Execution Model
Pipeline:
1. Generate visible reasoning trace.
2. Normalize and split trace into sentences.
3. Map sentence spans to token spans.
4. Run forward + backward pass.
5. Build sentence influence matrix from gradient x attention.
6. Summarize top edges and importance scores.
## Device And Dtype Policy
Default policy:
- CUDA:
- `bfloat16` if supported
- else `float16`
- MPS:
- `float16`
- CPU:
- `float32`
Override with:
- `DTYPE_PREFERENCE`
- request `dtype_preference`
## Model Requirements
Model must support all of:
- causal LM generation
- `output_attentions=True`
- eager attention
- supported decoder layer layout
- supported attention module attribute
Supported layer paths:
- `model.layers`
- `model.model.layers`
- `transformer.h`
- `gpt_neox.layers`
Supported attention attrs:
- `self_attn`
- `attn`
- `attention`
## Why Trace Limits Exist
Attribution path uses full backward pass over attention tensors. Cost grows with:
- sequence length
- layer count
- head count
- sentence count
Public defaults stay small to protect uptime.
## Good First Runtime Settings
For public demo:
- `max_new_tokens=128`
- `max_trace_tokens=256`
- `max_sentences=16`
- `validate_top_k=0`
For deeper analysis on bigger GPU:
- raise trace tokens slowly
- watch latency and memory first