# Runtime And Model Support

## Execution Model

Pipeline:

1. Generate visible reasoning trace.
2. Normalize and split trace into sentences.
3. Map sentence spans to token spans.
4. Run forward + backward pass.
5. Build sentence influence matrix from gradient x attention.
6. Summarize top edges and importance scores.
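
Steps 2, 3, and the aggregation part of step 5 can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names are hypothetical, whitespace tokens stand in for real tokenizer offsets, and summing token scores per sentence block is an assumed aggregation rule.

```python
import re

def split_sentences(text):
    """Return (start, end) character spans, one per sentence."""
    pattern = r"\S.*?(?:[.!?](?=\s|$)|$)"
    return [(m.start(), m.end()) for m in re.finditer(pattern, text, flags=re.S)]

def whitespace_offsets(text):
    """Stand-in for tokenizer offsets: one (start, end) span per whitespace token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def sentence_token_spans(sentence_spans, token_offsets):
    """Map each sentence's character span to a half-open token index range."""
    spans = []
    for s_start, s_end in sentence_spans:
        idx = [i for i, (t_start, _) in enumerate(token_offsets)
               if s_start <= t_start < s_end]
        if idx:  # skip sentences that own no tokens
            spans.append((idx[0], idx[-1] + 1))
    return spans

def sentence_matrix(token_scores, spans):
    """Sum token-level scores over each (target sentence, source sentence) block.

    token_scores[i][j] is assumed to already hold the gradient x attention
    score of source token j on target token i.
    """
    return [
        [sum(token_scores[i][j] for i in range(a0, a1) for j in range(b0, b1))
         for (b0, b1) in spans]
        for (a0, a1) in spans
    ]

text = "First step. Then second! Done"
spans = sentence_token_spans(split_sentences(text), whitespace_offsets(text))
# Toy causal scores: every token is influenced equally by itself and earlier tokens.
n_tokens = len(whitespace_offsets(text))
toy_scores = [[1 if j <= i else 0 for j in range(n_tokens)] for i in range(n_tokens)]
matrix = sentence_matrix(toy_scores, spans)
```

With a real tokenizer, the offsets would come from its character-offset mapping rather than from whitespace splitting.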

## Device And Dtype Policy

Default policy:

- CUDA:
  - `bfloat16` if supported
  - else `float16`
- MPS:
  - `float16`
- CPU:
  - `float32`

Override with:

- `DTYPE_PREFERENCE`
- request `dtype_preference`
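
The policy above can be read as a small lookup with an override step. The sketch below assumes a few things the text does not state: the function name is hypothetical, dtypes are passed as plain strings, and a request-level `dtype_preference` takes precedence over the `DTYPE_PREFERENCE` environment variable.

```python
import os

VALID_DTYPES = {"bfloat16", "float16", "float32"}

def resolve_dtype(device, bf16_supported=False, request_pref=None, env=None):
    """Pick a dtype name for a device, honoring an optional override.

    bf16_supported would typically come from torch.cuda.is_bf16_supported()
    on a CUDA machine; it is passed in here to keep the sketch dependency-free.
    """
    env = os.environ if env is None else env
    # Assumed precedence: request value first, then the environment variable.
    preference = request_pref or env.get("DTYPE_PREFERENCE")
    if preference:
        if preference not in VALID_DTYPES:
            raise ValueError(f"unsupported dtype preference: {preference}")
        return preference
    if device == "cuda":
        return "bfloat16" if bf16_supported else "float16"
    if device == "mps":
        return "float16"
    if device == "cpu":
        return "float32"
    raise ValueError(f"unknown device: {device}")
```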

## Model Requirements

The model must support all of the following:

- causal LM generation
- `output_attentions=True`
- eager attention
- supported decoder layer layout
- supported attention module attribute

Supported layer paths:

- `model.layers`
- `model.model.layers`
- `transformer.h`
- `gpt_neox.layers`

Supported attention attrs:

- `self_attn`
- `attn`
- `attention`
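
One way to check a model against these lists is to walk each dotted path from the loaded model object and take the first that resolves. This is a sketch with hypothetical helper names; it assumes the paths are attribute paths relative to the top-level model object, and it uses `SimpleNamespace` as a stand-in for a real model.

```python
from types import SimpleNamespace

LAYER_PATHS = ("model.layers", "model.model.layers", "transformer.h", "gpt_neox.layers")
ATTN_ATTRS = ("self_attn", "attn", "attention")

def find_layers(model):
    """Return (path, layers) for the first supported layer path that resolves."""
    for path in LAYER_PATHS:
        obj = model
        for name in path.split("."):
            obj = getattr(obj, name, None)
            if obj is None:
                break
        else:  # every attribute along the path existed
            return path, obj
    raise ValueError("unsupported decoder layer layout")

def find_attention(layer):
    """Return (attr, module) for the first supported attention attribute."""
    for attr in ATTN_ATTRS:
        module = getattr(layer, attr, None)
        if module is not None:
            return attr, module
    raise ValueError("unsupported attention module attribute")

# Stand-in shaped like a GPT-2 style model: layers under transformer.h, attention under attn.
fake_layer = SimpleNamespace(attn="attention-module")
fake_model = SimpleNamespace(transformer=SimpleNamespace(h=[fake_layer]))
layer_path, layers = find_layers(fake_model)
attn_attr, attn_module = find_attention(layers[0])
```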

## Why Trace Limits Exist

The attribution path runs a full backward pass over the attention tensors. Cost grows with:

- sequence length
- layer count
- head count
- sentence count
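
A rough estimate shows why sequence length dominates. Each layer produces attention weights of shape (batch, heads, seq, seq), so the retained tensors scale with the square of sequence length. The helper below is a back-of-the-envelope sketch with a hypothetical name and an example model shape chosen for illustration; it ignores gradients, activations, and any framework overhead.

```python
def attention_bytes(seq_len, n_layers, n_heads, bytes_per_elem=2, batch=1):
    """Bytes needed to hold every layer's attention weights at once."""
    return batch * n_layers * n_heads * seq_len * seq_len * bytes_per_elem

# Example shape (an assumption, not a project default): 24 layers, 16 heads, 16-bit dtype.
small = attention_bytes(seq_len=256, n_layers=24, n_heads=16)
large = attention_bytes(seq_len=512, n_layers=24, n_heads=16)
print(small / 2**20, "MiB at 256 tokens")
print(large / 2**20, "MiB at 512 tokens")
```

Doubling the sequence length quadruples this figure, while layer and head count contribute linearly.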

Public defaults stay small to protect uptime.

## Good First Runtime Settings

For a public demo:

- `max_new_tokens=128`
- `max_trace_tokens=256`
- `max_sentences=16`
- `validate_top_k=0`
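
These values can be kept together as one immutable settings object so a deployment changes them in a single place. The `RuntimeSettings` class below is a hypothetical container, not part of the project; only the four field names and their defaults come from the list above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RuntimeSettings:
    """Hypothetical container for the public demo defaults listed above."""
    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0

PUBLIC_DEMO = RuntimeSettings()

# Derive a variant by changing one field; the step size here is arbitrary.
larger_trace = replace(PUBLIC_DEMO, max_trace_tokens=384)
```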

For deeper analysis on a bigger GPU:

- raise `max_trace_tokens` slowly
- watch latency and memory first