# Runtime And Model Support
## Execution Model
Pipeline:
1. Generate visible reasoning trace.
2. Normalize and split trace into sentences.
3. Map sentence spans to token spans.
4. Run forward + backward pass.
5. Build sentence influence matrix from gradient × attention.
6. Summarize top edges and importance scores.
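
Steps 3 and 5 can be sketched in plain Python. This is one possible reading of the pipeline, not the project's actual code: the function names are hypothetical, token offsets are assumed to be character ranges like those a tokenizer's offset mapping provides, and summing token scores per sentence pair is an assumed aggregation rule.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end)


def char_spans_to_token_spans(sentence_spans: List[Span],
                              token_offsets: List[Span]) -> List[Span]:
    """Map each sentence's character span to a half-open token span."""
    token_spans = []
    for s_start, s_end in sentence_spans:
        # A token belongs to a sentence if its character range overlaps it.
        idx = [i for i, (t_start, t_end) in enumerate(token_offsets)
               if t_start < s_end and t_end > s_start]
        token_spans.append((idx[0], idx[-1] + 1) if idx else (0, 0))
    return token_spans


def aggregate_to_sentences(token_scores: List[List[float]],
                           token_spans: List[Span]) -> List[List[float]]:
    """Sum a token-by-token score matrix into a sentence-by-sentence matrix.

    token_scores[i][j] is assumed to hold the gradient x attention score
    of token j on token i (an assumption, not confirmed by the source).
    """
    n = len(token_spans)
    out = [[0.0] * n for _ in range(n)]
    for a, (a0, a1) in enumerate(token_spans):
        for b, (b0, b1) in enumerate(token_spans):
            out[a][b] = sum(token_scores[i][j]
                            for i in range(a0, a1)
                            for j in range(b0, b1))
    return out


# Toy trace "A b. C d." with two sentences and six tokens.
sentences = [(0, 4), (5, 9)]
offsets = [(0, 1), (2, 3), (3, 4), (5, 6), (7, 8), (8, 9)]
spans = char_spans_to_token_spans(sentences, offsets)
matrix = aggregate_to_sentences([[1.0] * 6 for _ in range(6)], spans)
```

A mean instead of a sum would avoid favoring long sentences; which one fits depends on how the importance scores are meant to be read.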
## Device And Dtype Policy
Default policy:
- CUDA:
  - `bfloat16` if supported
  - else `float16`
- MPS:
  - `float16`
- CPU:
  - `float32`
Override with:
- `DTYPE_PREFERENCE`
- `dtype_preference` field in the request
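
The policy above can be written as a small lookup. This is a sketch under several assumptions: `pick_dtype` is a hypothetical name, `DTYPE_PREFERENCE` is treated as an environment variable, and the request value is assumed to win over it. Dtypes are returned as strings so the sketch runs without PyTorch; in real code the `bf16_supported` flag could come from `torch.cuda.is_bf16_supported()`.

```python
import os
from typing import Optional

# Defaults for non-CUDA devices, taken from the policy list.
NON_CUDA_DEFAULTS = {"mps": "float16", "cpu": "float32"}


def pick_dtype(device: str,
               bf16_supported: bool = False,
               request_pref: Optional[str] = None) -> str:
    """Return the dtype name for a device, honoring overrides first."""
    # Assumed precedence: request value, then environment, then default.
    override = request_pref or os.environ.get("DTYPE_PREFERENCE")
    if override:
        return override
    if device == "cuda":
        return "bfloat16" if bf16_supported else "float16"
    # Raises KeyError for devices the policy does not list.
    return NON_CUDA_DEFAULTS[device]
```

For example, `pick_dtype("cuda", bf16_supported=True)` yields `"bfloat16"` when no override is set, and `pick_dtype("cpu", request_pref="float16")` yields the requested `"float16"`.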
## Model Requirements
Model must support all of:
- causal LM generation
- `output_attentions=True`
- eager attention
- supported decoder layer layout
- supported attention module attribute
Supported layer paths:
- `model.layers`
- `model.model.layers`
- `transformer.h`
- `gpt_neox.layers`
Supported attention attrs:
- `self_attn`
- `attn`
- `attention`
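
A compatibility check over the paths and attributes above might look like the following. It is a minimal sketch: the helper names are hypothetical, the paths are assumed to be resolved from the loaded model object, and the lookup order simply follows the lists as written.

```python
from types import SimpleNamespace

LAYER_PATHS = ("model.layers", "model.model.layers",
               "transformer.h", "gpt_neox.layers")
ATTN_ATTRS = ("self_attn", "attn", "attention")


def resolve_path(root, dotted):
    """Follow a dotted attribute path; return None if any part is missing."""
    obj = root
    for part in dotted.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


def find_layers(model):
    """Return (path, layers) for the first supported layer path."""
    for path in LAYER_PATHS:
        layers = resolve_path(model, path)
        if layers is not None:
            return path, layers
    raise ValueError("unsupported decoder layer layout")


def find_attention(layer):
    """Return the name of the first supported attention attribute."""
    for name in ATTN_ATTRS:
        if hasattr(layer, name):
            return name
    raise ValueError("unsupported attention module attribute")


# Stand-in object shaped like a GPT-2 style model, for illustration only.
fake_layer = SimpleNamespace(attn=object())
fake_model = SimpleNamespace(transformer=SimpleNamespace(h=[fake_layer]))
layer_path, layers = find_layers(fake_model)
attn_name = find_attention(layers[0])
```

Failing fast with a clear error at load time is cheaper than discovering an unsupported layout halfway through a backward pass.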
## Why Trace Limits Exist
The attribution path runs a full backward pass over the attention tensors. Cost grows with:
- sequence length
- layer count
- head count
- sentence count
Public defaults stay small to protect uptime.
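
A back-of-the-envelope estimate shows why sequence length dominates. The sketch assumes one attention map of shape `heads × seq_len × seq_len` per layer is kept for the backward pass. The function name and the 24-layer, 16-head model are hypothetical. Sentence count is left out because the text does not say how it enters the cost.

```python
def attention_bytes(layers: int, heads: int, seq_len: int,
                    bytes_per_elem: int, batch: int = 1) -> int:
    """Bytes needed to hold every layer's attention map at once."""
    return batch * layers * heads * seq_len * seq_len * bytes_per_elem


# Hypothetical 24-layer, 16-head model in float16 (2 bytes per element).
at_256 = attention_bytes(24, 16, 256, 2)   # 50_331_648 bytes = 48 MiB
at_512 = attention_bytes(24, 16, 512, 2)   # 4x the 256-token figure
```

Gradients for those tensors take roughly the same space again, so doubling the trace length roughly quadruples attention memory while layer and head count scale it only linearly.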
## Good First Runtime Settings
For public demo:
- `max_new_tokens=128`
- `max_trace_tokens=256`
- `max_sentences=16`
- `validate_top_k=0`
For deeper analysis on a bigger GPU:
- raise trace tokens slowly
- watch latency and memory first
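
One way to keep these settings together and raise the trace limit in small steps is sketched below. `RuntimeSettings` and `raise_trace_tokens` are hypothetical names, and the step size and ceiling are illustrative values, not recommendations from the project.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeSettings:
    # Defaults mirror the public demo values listed above.
    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0


def raise_trace_tokens(settings: RuntimeSettings,
                       step: int = 128,
                       ceiling: int = 1024) -> RuntimeSettings:
    """Return a copy with max_trace_tokens raised by one step, capped."""
    new_limit = min(settings.max_trace_tokens + step, ceiling)
    return replace(settings, max_trace_tokens=new_limit)


demo = RuntimeSettings()
deeper = raise_trace_tokens(demo)  # one step up; measure before the next
```

Because the dataclass is frozen, each step produces a new settings object, which makes it easy to log the exact configuration alongside the latency and memory observed for it.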