Runtime And Model Support
Execution Model
Pipeline:
- Generate visible reasoning trace.
- Normalize and split trace into sentences.
- Map sentence spans to token spans.
- Run forward + backward pass.
- Build sentence influence matrix from gradient x attention.
- Summarize top edges and importance scores.
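The middle steps of the pipeline can be sketched in plain Python. This is an illustrative sketch, not the project's code: all function names are hypothetical, the whitespace tokenizer stands in for real tokenizer offsets (for example a fast tokenizer's offset mapping), and reducing gradient x attention with an absolute value and a block sum is an assumption.

```python
import re

def split_sentences(text):
    """Return (start, end) character spans, one per sentence."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = m.group()
        body = raw.strip()
        if body:
            start = m.start() + len(raw) - len(raw.lstrip())
            spans.append((start, start + len(body)))
    return spans

def whitespace_offsets(text):
    """Stand-in tokenizer: one token per whitespace-separated chunk."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def sentence_token_spans(sentence_spans, token_offsets):
    """Map each sentence char span to a half-open token index range.

    Assumes every sentence overlaps at least one token.
    """
    result = []
    for s_start, s_end in sentence_spans:
        idx = [i for i, (t_start, t_end) in enumerate(token_offsets)
               if t_start < s_end and t_end > s_start]
        result.append((idx[0], idx[-1] + 1))
    return result

def grad_times_attention(grad, attn):
    """Elementwise |gradient * attention| for one token-by-token matrix."""
    return [[abs(g * a) for g, a in zip(g_row, a_row)]
            for g_row, a_row in zip(grad, attn)]

def sentence_matrix(token_scores, spans):
    """Sum token-level scores over each (target sentence, source sentence) block."""
    return [[sum(token_scores[i][j]
                 for i in range(a0, a1) for j in range(b0, b1))
             for (b0, b1) in spans]
            for (a0, a1) in spans]

text = "First step. Then check it. Done."
spans = sentence_token_spans(split_sentences(text), whitespace_offsets(text))
# spans == [(0, 2), (2, 5), (5, 6)]
```

Entry [a][b] of the result reads as "how much sentence b influenced sentence a". With causal attention, entries above the diagonal stay at zero.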
Device And Dtype Policy
Default policy:
- CUDA: bfloat16 if supported, else float16
- MPS: float16
- CPU: float32
Override with:
- DTYPE_PREFERENCE
- request dtype_preference
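A minimal sketch of the policy above as a pure function. Assumptions, not confirmed by this README: DTYPE_PREFERENCE is read as an environment variable, the request value wins over it, and overrides are limited to the three dtype names shown. The function name and parameters are hypothetical.

```python
import os

VALID_DTYPES = {"bfloat16", "float16", "float32"}
NON_CUDA_DEFAULTS = {"mps": "float16", "cpu": "float32"}

def resolve_dtype(device, bf16_supported=False, request_pref=None, env=None):
    """Return the dtype name for a device, honoring overrides first."""
    env = os.environ if env is None else env
    # Assumed precedence: request dtype_preference, then DTYPE_PREFERENCE.
    override = request_pref or env.get("DTYPE_PREFERENCE")
    if override:
        if override not in VALID_DTYPES:
            raise ValueError(f"unsupported dtype preference: {override!r}")
        return override
    if device == "cuda":
        return "bfloat16" if bf16_supported else "float16"
    if device not in NON_CUDA_DEFAULTS:
        raise ValueError(f"unknown device: {device!r}")
    return NON_CUDA_DEFAULTS[device]
```

In a PyTorch app, bf16_supported would typically come from torch.cuda.is_bf16_supported(), and the returned name can be mapped to a dtype with getattr(torch, name).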
Model Requirements
Model must support all of:
- causal LM generation
- output_attentions=True
- eager attention
- supported decoder layer layout
- supported attention module attribute
Supported layer paths:
- model.layers
- model.model.layers
- transformer.h
- gpt_neox.layers
Supported attention attrs:
- self_attn
- attn
- attention
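One way to check the layout requirements is to probe the listed paths and attribute names in order. A sketch only: the helper names and the probing order are assumptions, and SimpleNamespace objects stand in for real models.

```python
from types import SimpleNamespace

LAYER_PATHS = ("model.layers", "model.model.layers", "transformer.h", "gpt_neox.layers")
ATTN_ATTRS = ("self_attn", "attn", "attention")

def resolve_path(obj, dotted):
    """Follow a dotted attribute path; return None if any part is missing."""
    for part in dotted.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj

def find_layers(model):
    """Return (path, layers) for the first supported decoder layer path."""
    for path in LAYER_PATHS:
        layers = resolve_path(model, path)
        if layers is not None:
            return path, layers
    raise ValueError("unsupported decoder layer layout")

def find_attention(layer):
    """Return (name, module) for the first supported attention attribute."""
    for name in ATTN_ATTRS:
        module = getattr(layer, name, None)
        if module is not None:
            return name, module
    raise ValueError("unsupported attention module attribute")

# Stand-in with a GPT-2 style layout: transformer.h[i].attn
fake = SimpleNamespace(transformer=SimpleNamespace(h=[SimpleNamespace(attn="attn-module")]))
path, layers = find_layers(fake)          # path == "transformer.h"
name, _ = find_attention(layers[0])       # name == "attn"
```

Failing fast with a clear error at load time is cheaper than discovering an unsupported layout during the backward pass.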
Why Trace Limits Exist
Attribution path uses full backward pass over attention tensors. Cost grows with:
- sequence length
- layer count
- head count
- sentence count
Public defaults stay small to protect uptime.
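A rough worked estimate shows why sequence length dominates. With eager attention and output_attentions=True, each layer returns a (batch, heads, seq, seq) tensor, so attention memory grows with the square of sequence length. The function name is hypothetical, and the estimate covers attention weights only, not activations or the rest of the backward pass.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}

def attention_bytes(seq_len, num_layers, num_heads, dtype="float16", batch=1):
    """Bytes needed to hold all per-layer attention weight tensors."""
    elements = batch * num_layers * num_heads * seq_len * seq_len
    return elements * BYTES_PER_ELEMENT[dtype]

MIB = 2 ** 20
small = attention_bytes(256, num_layers=32, num_heads=32) / MIB   # 128.0 MiB
large = attention_bytes(512, num_layers=32, num_heads=32) / MIB   # 512.0 MiB
```

Doubling the trace length quadruples this figure. Gradients retained for the same tensors have the same shape, so they add roughly the same amount again.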
Good First Runtime Settings
For public demo:
- max_new_tokens=128
- max_trace_tokens=256
- max_sentences=16
- validate_top_k=0
For deeper analysis on bigger GPU:
- raise trace tokens slowly
- watch latency and memory first
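The public demo values can be kept in one immutable settings object and raised in small steps. A sketch under assumptions: the class name, the step size, and the ceiling are hypothetical and not part of this project. Only the four default values come from the list above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TraceLimits:
    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0

def raise_trace_tokens(limits, step=128, ceiling=1024):
    """Return a copy with max_trace_tokens raised by one step, capped at ceiling."""
    new_value = min(limits.max_trace_tokens + step, ceiling)
    return replace(limits, max_trace_tokens=new_value)

demo = TraceLimits()
deeper = raise_trace_tokens(demo)   # max_trace_tokens goes 256 -> 384
```

Raising one step at a time leaves room to check latency and memory before the next increase.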