| # Runtime And Model Support | |
| ## Execution Model | |
| Pipeline: | |
| 1. Generate visible reasoning trace. | |
| 2. Normalize and split trace into sentences. | |
| 3. Map sentence spans to token spans. | |
| 4. Run a forward and backward pass. | |
| 5. Build the sentence influence matrix from gradient × attention. | |
| 6. Summarize top edges and importance scores. | |
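Steps 2, 3, and 5 can be illustrated with a small pure-Python sketch. It is not the project's implementation: the regex sentence splitter, the whitespace "tokenizer", and the choice to aggregate token-level scores by summing are all assumptions made for illustration, and the function names are hypothetical. In a real run the token offsets would come from the model's tokenizer and the token-level scores from the backward pass.

```python
import re

def sentence_spans(text):
    """Return (start, end) character spans for each sentence (naive regex split)."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = m.group()
        stripped = chunk.strip()
        if not stripped:
            continue
        start = m.start() + (len(chunk) - len(chunk.lstrip()))
        spans.append((start, start + len(stripped)))
    return spans

def token_offsets(text):
    """Toy stand-in for a tokenizer's offset mapping: one token per whitespace-separated word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def sentence_token_spans(sent_spans, offsets):
    """Map each sentence's character span to a half-open token index span.

    Assumes every sentence overlaps at least one token.
    """
    result = []
    for s_start, s_end in sent_spans:
        idx = [i for i, (t_start, t_end) in enumerate(offsets)
               if t_start < s_end and t_end > s_start]
        result.append((idx[0], idx[-1] + 1))
    return result

def sentence_influence(token_scores, tok_spans):
    """Aggregate a token-by-token score matrix into a sentence-by-sentence matrix by summing blocks."""
    return [
        [sum(token_scores[r][c] for r in range(r0, r1) for c in range(c0, c1))
         for (c0, c1) in tok_spans]
        for (r0, r1) in tok_spans
    ]

text = "First step. Then check!"
spans = sentence_token_spans(sentence_spans(text), token_offsets(text))
# Placeholder token-level scores; real values would be gradient × attention.
scores = [[1.0] * 4 for _ in range(4)]
matrix = sentence_influence(scores, spans)
```

With the toy input, each sentence covers two tokens, so every entry of `matrix` is the sum of a 2×2 block of placeholder scores.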
| ## Device And Dtype Policy | |
| Default policy: | |
| - CUDA: `bfloat16` if supported, else `float16` | |
| - MPS: `float16` | |
| - CPU: `float32` | |
| Override with: | |
| - `DTYPE_PREFERENCE` | |
| - `dtype_preference` in the request | |
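The default policy above can be written as a small decision function. This is a minimal sketch, not the project's code: `pick_dtype` and its parameters are hypothetical names, dtypes are returned as strings to keep the example dependency-free, and the single `preference` argument stands in for either override because the precedence between `DTYPE_PREFERENCE` and the request field is not stated here. In a PyTorch setup, the `bf16_supported` flag could be obtained from `torch.cuda.is_bf16_supported()`.

```python
def pick_dtype(device, bf16_supported=False, preference=None):
    """Choose a dtype name for the given device type.

    device: "cuda", "mps", or "cpu".
    bf16_supported: whether the CUDA device supports bfloat16.
    preference: optional explicit override (assumed to win over the default).
    """
    if preference:
        return preference
    if device == "cuda":
        return "bfloat16" if bf16_supported else "float16"
    if device == "mps":
        return "float16"
    # CPU and anything unrecognized fall back to full precision.
    return "float32"
```

For example, `pick_dtype("cuda", bf16_supported=False)` yields `"float16"`, while `pick_dtype("cpu", preference="bfloat16")` honors the override.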
| ## Model Requirements | |
| The model must support all of: | |
| - causal LM generation | |
| - `output_attentions=True` | |
| - eager attention | |
| - supported decoder layer layout | |
| - supported attention module attribute | |
| Supported layer paths: | |
| - `model.layers` | |
| - `model.model.layers` | |
| - `transformer.h` | |
| - `gpt_neox.layers` | |
| Supported attention module attributes: | |
| - `self_attn` | |
| - `attn` | |
| - `attention` | |
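One way to check a model against these lists is to probe each path and attribute in turn. The sketch below is an assumption about how such a check could look, not the project's actual resolver: it treats each layer path as a dotted attribute path starting from the loaded model object, tries the candidates in the order listed, and uses hypothetical helper names. A `SimpleNamespace` stands in for a real model so the example runs without any ML libraries.

```python
from types import SimpleNamespace

LAYER_PATHS = ("model.layers", "model.model.layers", "transformer.h", "gpt_neox.layers")
ATTN_ATTRS = ("self_attn", "attn", "attention")

def resolve_path(root, dotted):
    """Follow a dotted attribute path from root; return None if any hop is missing."""
    obj = root
    for name in dotted.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj

def find_layers(model):
    """Return (path, layers) for the first supported layer path present on the model."""
    for path in LAYER_PATHS:
        layers = resolve_path(model, path)
        if layers is not None:
            return path, layers
    raise ValueError("unsupported decoder layer layout")

def find_attention(layer):
    """Return (attr, module) for the first supported attention attribute on a layer."""
    for attr in ATTN_ATTRS:
        module = getattr(layer, attr, None)
        if module is not None:
            return attr, module
    raise ValueError("unsupported attention module attribute")

# Fake GPT-2-style layout: layers live under transformer.h, attention under attn.
fake_model = SimpleNamespace(
    transformer=SimpleNamespace(h=[SimpleNamespace(attn="attention-module")])
)
layer_path, layers = find_layers(fake_model)
attn_attr, attn_module = find_attention(layers[0])
```

Failing fast with a clear error when no path or attribute matches makes an unsupported architecture visible before any generation work starts.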
| ## Why Trace Limits Exist | |
| The attribution path runs a full backward pass over the attention tensors. Cost grows with: | |
| - sequence length | |
| - layer count | |
| - head count | |
| - sentence count | |
| Public defaults stay small to protect uptime. | |
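A rough back-of-the-envelope estimate shows why sequence length dominates. The sketch assumes one attention map of shape heads × sequence × sequence is retained per layer, and that keeping gradients roughly doubles the footprint; both are simplifying assumptions, the function name is hypothetical, and the example layer and head counts are illustrative rather than taken from any particular model. It ignores activations, weights, and per-sentence work, so treat it as a lower bound.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}

def attention_bytes(seq_len, num_layers, num_heads, dtype="float16", with_grads=True):
    """Estimate bytes needed to hold all attention maps (batch size 1).

    Each layer keeps num_heads maps of size seq_len x seq_len; with_grads
    doubles the total to account for a same-sized gradient tensor.
    """
    total = num_layers * num_heads * seq_len * seq_len * BYTES_PER_ELEMENT[dtype]
    return total * 2 if with_grads else total

# Illustrative numbers: 32 layers, 32 heads, float16.
small = attention_bytes(256, 32, 32)
large = attention_bytes(512, 32, 32)
```

Because the cost is quadratic in sequence length, doubling the sequence from 256 to 512 tokens quadruples the estimate, while layer and head counts scale it linearly.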
| ## Good First Runtime Settings | |
| For public demo: | |
| - `max_new_tokens=128` | |
| - `max_trace_tokens=256` | |
| - `max_sentences=16` | |
| - `validate_top_k=0` | |
| For deeper analysis on a larger GPU: | |
| - raise `max_trace_tokens` in small steps | |
| - check latency and memory after each increase | |
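The settings above can be grouped into one immutable object so that the public defaults live in a single place. This is a sketch under assumptions: the `RuntimeSettings` class and the `raise_trace_tokens` helper are hypothetical, and the step size and ceiling are placeholder values to be tuned against measured latency and memory.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RuntimeSettings:
    """Public demo defaults taken from the list above."""
    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0

def raise_trace_tokens(settings, step=128, ceiling=1024):
    """Return a copy with max_trace_tokens raised by one step, capped at ceiling.

    step and ceiling are illustrative values, not recommendations.
    """
    new_limit = min(settings.max_trace_tokens + step, ceiling)
    return replace(settings, max_trace_tokens=new_limit)

demo = RuntimeSettings()
deeper = raise_trace_tokens(demo)
```

Raising the limit one step at a time, and measuring between steps, keeps each change easy to roll back if memory or latency degrades.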