cot-anc / docs /runtime.md
BART-ender's picture
Deploy Thought Anchors
fda8fb3 verified

Runtime And Model Support

Execution Model

Pipeline:

  1. Generate visible reasoning trace.
  2. Normalize and split trace into sentences.
  3. Map sentence spans to token spans.
  4. Run forward + backward pass.
  5. Build sentence influence matrix from gradient x attention.
  6. Summarize top edges and importance scores.

Device And Dtype Policy

Default policy:

  • CUDA:
    • bfloat16 if supported
    • else float16
  • MPS:
    • float16
  • CPU:
    • float32

Override with:

  • DTYPE_PREFERENCE
  • request dtype_preference

Model Requirements

Model must support all of:

  • causal LM generation
  • output_attentions=True
  • eager attention
  • supported decoder layer layout
  • supported attention module attribute

Supported layer paths:

  • model.layers
  • model.model.layers
  • transformer.h
  • gpt_neox.layers

Supported attention attrs:

  • self_attn
  • attn
  • attention

Why Trace Limits Exist

Attribution path uses full backward pass over attention tensors. Cost grows with:

  • sequence length
  • layer count
  • head count
  • sentence count

Public defaults stay small to protect uptime.

Good First Runtime Settings

For public demo:

  • max_new_tokens=128
  • max_trace_tokens=256
  • max_sentences=16
  • validate_top_k=0

For deeper analysis on bigger GPU:

  • raise trace tokens slowly
  • watch latency and memory first