Runtime And Model Support

Execution Model

Pipeline:

1. Generate visible reasoning trace.
2. Normalize and split trace into sentences.
3. Map sentence spans to token spans.
4. Run forward + backward pass.
5. Build sentence influence matrix from gradient × attention.
6. Summarize top edges and importance scores.
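
The sentence splitting and span mapping steps can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names, the regex-based splitter, and the overlap rule are all assumptions. The token offsets are assumed to be per-token character ranges, such as those a Hugging Face fast tokenizer returns with return_offsets_mapping=True.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open range [start, end)


def normalize(text: str) -> str:
    """Collapse all whitespace runs to single spaces and trim the ends."""
    return " ".join(text.split())


def split_sentences(text: str) -> List[Span]:
    """Return character spans of sentences in already-normalized text.

    Hypothetical splitter: a sentence ends at '.', '!' or '?' followed by
    whitespace. Real traces may need a more careful rule.
    """
    spans: List[Span] = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        # Keep the punctuation, drop the trailing whitespace.
        sent_end = match.start() + len(match.group().rstrip())
        if sent_end > start:
            spans.append((start, sent_end))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def sentence_token_spans(
    sent_spans: List[Span], token_offsets: List[Span]
) -> List[Optional[Span]]:
    """Map each sentence's character span to a half-open token index span.

    A token belongs to a sentence if their character ranges overlap.
    Zero-width tokens (e.g. special tokens with offset (0, 0)) never match.
    A token straddling two sentences is assigned to both.
    """
    result: List[Optional[Span]] = []
    for s_start, s_end in sent_spans:
        hits = [
            i
            for i, (t_start, t_end) in enumerate(token_offsets)
            if t_start < s_end and t_end > s_start
        ]
        result.append((hits[0], hits[-1] + 1) if hits else None)
    return result
```

For example, with whitespace tokens standing in for real tokenizer offsets, the trace "Add 2 and 3. That gives 5. Done" splits into three sentences covering tokens 0-3, 4-6, and 7.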

Device And Dtype Policy

Default policy:

CUDA:
- bfloat16 if supported
- else float16
MPS:
- float16
CPU:
- float32

Override with:

DTYPE_PREFERENCE
request dtype_preference
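
The policy above can be sketched as a pure function. Several details here are assumptions, not confirmed by this document: that DTYPE_PREFERENCE is read as an environment variable, that a per-request dtype_preference takes priority over it, and that the accepted values are the three dtype names listed. The function names are hypothetical. With PyTorch, the inputs would typically come from torch.cuda.is_available(), torch.cuda.is_bf16_supported(), and torch.backends.mps.is_available().

```python
import os
from typing import Mapping, Optional

VALID_DTYPES = ("bfloat16", "float16", "float32")


def default_dtype(device: str, bf16_supported: bool = False) -> str:
    """Default dtype name for a device type, following the policy table."""
    if device == "cuda":
        return "bfloat16" if bf16_supported else "float16"
    if device == "mps":
        return "float16"
    return "float32"  # CPU and anything unrecognized


def resolve_dtype(
    device: str,
    bf16_supported: bool = False,
    request_preference: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Apply overrides, then fall back to the default policy.

    Assumed precedence: request value, then DTYPE_PREFERENCE, then default.
    """
    env = os.environ if env is None else env
    for candidate in (request_preference, env.get("DTYPE_PREFERENCE")):
        if candidate:
            name = candidate.strip().lower()
            if name not in VALID_DTYPES:
                raise ValueError(f"unsupported dtype preference: {candidate!r}")
            return name
    return default_dtype(device, bf16_supported)
```

The returned name can be turned into a PyTorch dtype with getattr(torch, name).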

Model Requirements

Model must support all of:

causal LM generation
output_attentions=True
eager attention
supported decoder layer layout
supported attention module attribute

Supported layer paths:

model.layers
model.model.layers
transformer.h
gpt_neox.layers

Supported attention attrs:

self_attn
attn
attention
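
A check against these two lists might look like the sketch below. It assumes the layer paths are attribute paths relative to the loaded model object and that they are tried in the order listed. The helper names are hypothetical.

```python
from typing import Any, Optional, Tuple

LAYER_PATHS = ("model.layers", "model.model.layers", "transformer.h", "gpt_neox.layers")
ATTENTION_ATTRS = ("self_attn", "attn", "attention")


def resolve_path(root: Any, dotted: str) -> Optional[Any]:
    """Follow a dotted attribute path; return None if any hop is missing."""
    obj = root
    for part in dotted.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


def find_decoder_layers(model: Any) -> Tuple[str, Any]:
    """Return (path, layers) for the first supported layer path that exists."""
    for path in LAYER_PATHS:
        layers = resolve_path(model, path)
        if layers is not None:
            return path, layers
    raise ValueError("unsupported decoder layer layout")


def find_attention_attr(layer: Any) -> str:
    """Return the name of the first supported attention attribute on a layer."""
    for name in ATTENTION_ATTRS:
        if getattr(layer, name, None) is not None:
            return name
    raise ValueError("unsupported attention module attribute")
```

Failing fast with a clear error at load time is usually easier to debug than a missing attribute in the middle of the backward pass.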

Why Trace Limits Exist

Attribution path uses full backward pass over attention tensors. Cost grows with:

sequence length
layer count
head count
sentence count

Public defaults stay small to protect uptime.
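
As a rough illustration of why sequence length dominates: with output_attentions=True, each layer returns attention weights of shape (batch, heads, seq, seq), so the forward tensors alone grow quadratically with sequence length. The estimator below is a back-of-envelope sketch with a hypothetical function name. It ignores gradients retained for the backward pass, other activations, and any per-sentence work, so real usage will be higher.

```python
def attention_tensor_bytes(
    seq_len: int,
    num_layers: int,
    num_heads: int,
    bytes_per_element: int = 4,  # 4 for float32, 2 for float16/bfloat16
    batch_size: int = 1,
) -> int:
    """Lower-bound size in bytes of all per-layer attention weight tensors."""
    per_layer_elements = batch_size * num_heads * seq_len * seq_len
    return num_layers * per_layer_elements * bytes_per_element


# Doubling the trace length quadruples the attention tensor footprint.
small = attention_tensor_bytes(seq_len=256, num_layers=12, num_heads=12)
large = attention_tensor_bytes(seq_len=512, num_layers=12, num_heads=12)
print(small, large, large // small)
```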

Good First Runtime Settings

For public demo:

max_new_tokens=128
max_trace_tokens=256
max_sentences=16
validate_top_k=0
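
One way to carry these values around is a small settings object with the public-demo numbers as defaults. The class and the clamping helper below are hypothetical. Only the four field names and their values come from this document, and capping requests at the configured limits is an assumption about how they are enforced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSettings:
    """Runtime limits; defaults mirror the public demo settings."""

    max_new_tokens: int = 128
    max_trace_tokens: int = 256
    max_sentences: int = 16
    validate_top_k: int = 0


def clamp_to_limits(requested: RuntimeSettings, limits: RuntimeSettings) -> RuntimeSettings:
    """Cap every requested value at the corresponding configured limit."""
    return RuntimeSettings(
        max_new_tokens=min(requested.max_new_tokens, limits.max_new_tokens),
        max_trace_tokens=min(requested.max_trace_tokens, limits.max_trace_tokens),
        max_sentences=min(requested.max_sentences, limits.max_sentences),
        validate_top_k=min(requested.validate_top_k, limits.validate_top_k),
    )
```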

For deeper analysis on bigger GPU:

raise trace tokens slowly
watch latency and memory first