Spaces:

BART-ender
/

cot-anc

Sleeping

cot-anc / docs /runtime.md

Deploy Thought Anchors

fda8fb3 verified 5 days ago

1.35 kB

	# Runtime And Model Support

	## Execution Model

	Pipeline:

	1. Generate visible reasoning trace.
	2. Normalize and split trace into sentences.
	3. Map sentence spans to token spans.
	4. Run forward + backward pass.
	5. Build sentence influence matrix from gradient x attention.
	6. Summarize top edges and importance scores.

	## Device And Dtype Policy

	Default policy:

	- CUDA:
	- `bfloat16` if supported
	- else `float16`
	- MPS:
	- `float16`
	- CPU:
	- `float32`

	Override with:

	- `DTYPE_PREFERENCE`
	- request `dtype_preference`

	## Model Requirements

	Model must support all of:

	- causal LM generation
	- `output_attentions=True`
	- eager attention
	- supported decoder layer layout
	- supported attention module attribute

	Supported layer paths:

	- `model.layers`
	- `model.model.layers`
	- `transformer.h`
	- `gpt_neox.layers`

	Supported attention attrs:

	- `self_attn`
	- `attn`
	- `attention`

	## Why Trace Limits Exist

	Attribution path uses full backward pass over attention tensors. Cost grows with:

	- sequence length
	- layer count
	- head count
	- sentence count

	Public defaults stay small to protect uptime.

	## Good First Runtime Settings

	For public demo:

	- `max_new_tokens=128`
	- `max_trace_tokens=256`
	- `max_sentences=16`
	- `validate_top_k=0`

	For deeper analysis on bigger GPU:

	- raise trace tokens slowly
	- watch latency and memory first