CreditScope

Agentic Credit Scoring with MoE Observability and Circuit Tracing

CreditScope is an AI-powered credit analysis platform built on Qwen3.5-35B-A3B-FP8, a 35-billion-parameter mixture-of-experts language model. It provides a full-stack application for credit scoring, real-time MoE expert routing visualization, chain-of-thought reasoning control, and mechanistic interpretability via circuit tracing with sparse autoencoders.

Table of Contents

  1. Architecture Overview
  2. Machine Setup from Scratch
  3. Configuration
  4. Running the Application
  5. Services and Ports
  6. Algorithm Deep Dive
  7. API Reference
  8. Project Structure
  9. Development
  10. Training SAEs and Transcoders from Scratch

Architecture Overview

┌────────────────────────────────────────────────────────────────────────┐
│                        nginx (:80/:443/:20003)                         │
│                reverse proxy + WebSocket upgrade + SSL                 │
└───────┬──────────────────────┬─────────────────────────────────────────┘
        │                      │
        ▼                      ▼
┌───────────────┐      ┌────────────────────────────────────────────────┐
│ React Frontend│      │             FastAPI Backend (:8080)            │
│  Vite + TS    │      │                                                │
│  Tailwind CSS │      │  ┌──────────┐ ┌────────────┐ ┌──────────────┐  │
│  Port 3000    │◄────►│  │  Agent   │ │  Credit    │ │  Circuit     │  │
│               │      │  │  (ReAct) │ │  Tools     │ │  Tracer API  │  │
└───────────────┘      │  └────┬─────┘ └────────────┘ └──────┬───────┘  │
                       │       │                             │          │
                       └───────┼─────────────────────────────┼──────────┘
                               │                             │
                               ▼                             ▼
                       ┌────────────────────────────────────────────────┐
                       │         SGLang Inference Server (:8000)        │
                       │                                                │
                       │  Qwen3.5-35B-A3B-FP8 (40 layers, 256 experts)  │
                       │                                                │
                       │  ┌──────────────┐   ┌────────────────────────┐ │
                       │  │  MoE Hooks   │   │ Residual Capture Hooks │ │
                       │  │  (routing    │   │ (activation tensors    │ │
                       │  │   telemetry) │   │  via filesystem IPC)   │ │
                       │  └──────────────┘   └────────────────────────┘ │
                       └────────────────────────────────────────────────┘

Key design decision: A single instance of the 35B model serves both chat inference and circuit tracing. The circuit tracer captures activations from the running SGLang server via forward hooks and filesystem-based IPC, avoiding the need to load a second copy of the model (which would require ~70GB additional VRAM).


Machine Setup from Scratch

These instructions provision a fresh Ubuntu server with GPU support for running CreditScope natively.

1. Hardware Requirements

Component   Minimum                   Recommended
GPU         NVIDIA with 48GB+ VRAM    NVIDIA RTX PRO 6000 (96GB) or A100 80GB
CPU         8 cores                   16+ cores
RAM         32GB                      64GB+
Storage     100GB free                200GB+ (model weights ~35GB)
OS          Ubuntu 22.04 LTS          Ubuntu 24.04 LTS

For Blackwell-generation GPUs (RTX PRO 6000, sm_120), specific SGLang flags are required; see SGLang Flags for Blackwell GPUs under Configuration.

2. Install System Dependencies

sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y \
    python3 python3-pip python3-venv python3-dev \
    git curl wget nginx openssl lsof \
    build-essential

Install Node.js 18+:

curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
node --version   # should be 18.x+
npm --version    # should be 9.x+

3. Install NVIDIA Drivers and CUDA

Skip this section if drivers are already installed (nvidia-smi works).

# Install NVIDIA driver (latest recommended)
sudo apt-get install -y nvidia-driver-565

# Install CUDA toolkit 12.x
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6

# Verify
nvidia-smi
nvcc --version

4. Clone and Configure

cd /home/ubuntu
git clone https://github.com/sarelWeinberger/creditscope.git
cd creditscope
cp .env.example .env

Edit .env and set:

# Required
AUTH_USERS=admin@creditscope.local        # comma-separated allowed emails
AUTH_PASSWORD=your-secure-password        # shared login password
AUTH_SECRET_KEY=$(openssl rand -hex 32)   # session signing key

# Set your server's public IP for CORS
CORS_ORIGINS=http://localhost:3000,http://YOUR_PUBLIC_IP

# Optional: HuggingFace token if model is gated
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxx

5. Create Python Virtual Environment

python3 -m venv .venv
source .venv/bin/activate

# Install the project with all dependencies
pip install -e ".[dev,backend,inference,circuit]"

For Blackwell GPUs, you may need a specific CuDNN version:

pip install nvidia-cudnn-cu12==9.16.0.29

6. Install Frontend Dependencies

cd frontend
npm install
cd ..

7. Set Up nginx Reverse Proxy

chmod +x scripts/setup_nginx_http.sh
./scripts/setup_nginx_http.sh

This configures nginx to:

  • Route / to the Vite frontend on port 3000
  • Route /api/ to the FastAPI backend on port 8080
  • Upgrade /api/chat/ws connections to WebSocket
  • Listen on ports 80, 443 (self-signed SSL), and 20003

8. Download Model Weights

The model downloads automatically on first SGLang start. To pre-download:

source .venv/bin/activate
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-35B-A3B-FP8')"

This downloads ~35GB to `~/.cache/huggingface/`.

9. Start Everything

chmod +x start_services.sh
./start_services.sh

This starts:

  1. SGLang on port 8000 (waits ~60s for model load)
  2. FastAPI backend on port 8080

The frontend must be started separately:

cd frontend && npm run dev &

Verify:

curl -s http://127.0.0.1:8000/v1/models | head -5   # SGLang
curl -s http://127.0.0.1:8080/health                 # Backend
curl -s http://127.0.0.1:3000/                        # Frontend
curl -s http://YOUR_PUBLIC_IP:20003/                  # Public (nginx)

Configuration

All configuration is in .env. Key variables:

Variable                  Default                          Description
MODEL_PATH                Qwen/Qwen3.5-35B-A3B-FP8         HuggingFace model ID
CONTEXT_LENGTH            4096                             Max context window
TP_SIZE                   1                                Tensor parallelism (number of GPUs)
MEM_FRACTION_STATIC       0.98                             Fraction of GPU VRAM for weights + KV cache
SGLANG_PORT               8000                             Inference server port
BACKEND_PORT              8080                             FastAPI backend port
FRONTEND_PORT             3000                             Vite dev server port
DATABASE_URL              sqlite:///./data/creditscope.db  Database path
SEED_DB                   true                             Seed sample customers on startup
DEFAULT_THINKING_BUDGET   standard                         CoT budget preset
AUTH_USERS                (none)                           Comma-separated allowed login emails
AUTH_PASSWORD             (none)                           Shared login password
AUTH_SECRET_KEY           (none)                           HMAC key for session cookies

SGLang Flags for Blackwell GPUs

Blackwell-generation GPUs (sm_120: RTX PRO 6000, RTX 5090, etc.) require specific flags:

SGLANG_EXTRA_ARGS="
  --attention-backend triton         # Only triton/trtllm_mha supported on Blackwell
  --fp8-gemm-backend triton          # flashinfer FP8 unsupported on sm_120
  --disable-cuda-graph               # Stability on newer architectures
  --max-mamba-cache-size 16          # Limit DeltaNet state cache
  --skip-server-warmup               # Faster startup
  --chunked-prefill-size 512         # Reduce memory peaks
  --max-running-requests 2           # Limit concurrency
  --max-total-tokens 65536           # KV cache token budget
"

Also set these environment variables before launching SGLang:

SGLANG_ENABLE_JIT_DEEPGEMM=0        # Avoid DeepGemm recipe errors on sm_120
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # Better memory allocation

Thinking Budget Presets

Preset      Tokens   Use Case
none        0        Direct responses only
minimal     128      Simple lookups
short       512      Quick calculations
standard    2,048    Normal analysis
extended    8,192    Complex reasoning
deep        32,768   Thorough investigation
unlimited   -1       No limit

Running the Application

Quick Start (All Services)

./start_services.sh
cd frontend && npm run dev &

Individual Services

# SGLang inference server
SGLANG_ENABLE_JIT_DEEPGEMM=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  PYTHONPATH=/home/ubuntu/creditscope \
  .venv/bin/python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B-FP8 \
    --port 8000 --tp-size 1 --mem-fraction-static 0.50 \
    --context-length 2048 --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder --enable-metrics \
    --attention-backend triton --fp8-gemm-backend triton \
    --disable-cuda-graph --skip-server-warmup \
    --forward-hooks '[{"name":"residual_capture","target_modules":["model.layers.*"],"hook_factory":"circuit_tracer.collectors.sglang_hooks:residual_capture_factory","config":{}}]'

# FastAPI backend
.venv/bin/uvicorn backend.main:app --host 127.0.0.1 --port 8080 --reload

# Frontend
cd frontend && npm run dev

Docker Deployment

cp .env.example .env
docker compose up -d --build

Requires Docker, docker-compose-v2, and NVIDIA Container Toolkit for GPU support.

Logs

tail -f /tmp/sglang.log     # SGLang inference
tail -f /tmp/backend.log     # FastAPI backend

Services and Ports

Service      Port           Protocol     Description
SGLang       8000           HTTP         OpenAI-compatible inference API
Backend      8080           HTTP/WS      FastAPI + agent + circuit tracer
Frontend     3000           HTTP         Vite React dev server
nginx        80/443/20003   HTTP/HTTPS   Public reverse proxy
Prometheus   9090           HTTP         Metrics (Docker only)
Grafana      3001           HTTP         Dashboards (Docker only)

Algorithm Deep Dive

1. Model Architecture - Qwen3.5-35B-A3B-FP8

CreditScope runs on a hybrid attention + MoE architecture:

Input tokens
     │
     ▼
┌─────────────┐
│  Embedding  │  d_model = 2048
└─────┬───────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  40 Decoder Layers (repeating pattern of 4):    │
│                                                 │
│  Layer N+0: DeltaNet (linear) attention + MoE   │
│  Layer N+1: DeltaNet (linear) attention + MoE   │
│  Layer N+2: DeltaNet (linear) attention + MoE   │
│  Layer N+3: Standard (full) attention   + MoE   │
│                                                 │
│  Each MoE layer: 256 experts, top-8 routing     │
│  Per-expert intermediate size: 512              │
│                                                 │
│  Full attention at layers: 3,7,11,...,35,39     │
│  DeltaNet attention at all other layers         │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────┐
│  LM Head    │  → next-token logits
└─────────────┘

Vision Encoder (27 ViT layers):
  Hidden size: 1152 → projects to d_model=2048
  Intermediate size: 4304
  16 attention heads, patch size 16

DeltaNet layers use linear attention (O(n) vs O(n^2)), which makes the model efficient for long sequences. Standard attention layers every 4th position provide full-context mixing.

Mixture of Experts (MoE): Every layer has 256 experts but only routes each token to the top 8. This gives the model 35B total parameters but only ~3B active per token, enabling fast inference.
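
As a minimal sketch of the routing math (function and tensor names are illustrative, not the actual SGLang internals):

import torch

def route_tokens(router_logits: torch.Tensor, top_k: int = 8):
    """Select top-k experts per token for a 256-expert MoE layer.

    router_logits: [num_tokens, 256] raw gate outputs.
    Returns (expert_ids, gate_weights), each [num_tokens, top_k].
    """
    probs = torch.softmax(router_logits, dim=-1)           # routing distribution
    gate_weights, expert_ids = probs.topk(top_k, dim=-1)   # top-8 selection
    # Renormalize the selected gates to sum to 1 per token (whether Qwen
    # renormalizes after top-k is an assumption in this sketch)
    gate_weights = gate_weights / gate_weights.sum(dim=-1, keepdim=True)
    return expert_ids, gate_weights

logits = torch.randn(4, 256)     # 4 tokens, 256 experts
ids, weights = route_tokens(logits)
print(ids.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 8])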

2. ReAct Agent Loop

The agent uses a Reason-Act-Observe loop to answer credit analysis queries:

User Query: "Evaluate John Smith for a $50,000 business loan"
     │
     ▼
┌─────────────────────────────────────┐
│  1. REASON (thinking tokens)        │
│     "I need to check credit score,  │
│      DTI ratio, and collateral..."  │
│                                     │
│  2. ACT (tool call)                 │
│     calculate_credit_score(id=42)   │
│                                     │
│  3. OBSERVE (tool result)           │
│     Score: 720, Grade: B            │
│                                     │
│  4. REASON again                    │
│     "Score is good, need DTI next"  │
│                                     │
│  5. ACT                             │
│     calculate_dti(id=42, amount=50k)│
│                                     │
│  ... (up to 8 steps) ...            │
│                                     │
│  FINAL: Synthesize response         │
└─────────────────────────────────────┘

Available Credit Tools:

Tool                      Description
calculate_credit_score    Weighted score from payment history (35%), utilization (30%), age (15%), mix (10%), inquiries (10%)
calculate_dti             Front-end and back-end debt-to-income ratios with risk classification
analyze_payment_history   Delinquency patterns, on-time rate, severity scoring
evaluate_collateral       Loan-to-value ratio, haircut-adjusted value, coverage ratio
structure_loan            Amortization schedule, monthly payment calculation
apply_risk_adjustments    Regulatory and behavioral risk adjustments to base score
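
For illustration, a hypothetical simplification of the weighting behind calculate_credit_score (the real tool lives in backend/tools/credit_score.py and also assigns a letter grade; the component names below are made up):

def weighted_credit_score(components: dict[str, float]) -> float:
    """Combine five sub-scores (each 0-100) with the documented weights."""
    weights = {
        "payment_history": 0.35,
        "credit_utilization": 0.30,
        "credit_age": 0.15,
        "credit_mix": 0.10,
        "recent_inquiries": 0.10,
    }
    return sum(weights[name] * components[name] for name in weights)

print(weighted_credit_score({
    "payment_history": 92, "credit_utilization": 70,
    "credit_age": 60, "credit_mix": 80, "recent_inquiries": 90,
}))  # 79.2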

3. MoE Expert Routing Observability

During every inference request, forward hooks on the MoE gate modules capture:

For each of the 40 MoE layers:
  ┌─────────────────────────────────────────────┐
  │ router_logits: [num_tokens × 256 experts]   │
  │      │                                      │
  │      ▼ softmax + top-8 selection            │
  │ selected_experts: [num_tokens × 8]          │
  │ gating_weights:   [num_tokens × 8]          │
  │                                             │
  │ Metrics computed:                           │
  │   - Expert load distribution                │
  │   - Shannon entropy of routing distribution │
  │   - Per-expert activation frequency         │
  └─────────────────────────────────────────────┘

Shannon entropy measures routing diversity:

  • Low entropy β†’ tokens concentrate on few experts (specialized)
  • High entropy β†’ tokens spread evenly across experts (generic)

The frontend displays this as a real-time heatmap of expert activations across layers.
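
A minimal sketch of the entropy metric over a captured selected_experts tensor (the log base and the use of the empirical expert load are assumptions of this sketch):

import torch

def routing_entropy(selected_experts: torch.Tensor, num_experts: int = 256) -> float:
    """Shannon entropy of the empirical expert-load distribution for one layer.

    selected_experts: [num_tokens, 8] expert indices from the gate hook.
    """
    counts = torch.bincount(selected_experts.flatten(), minlength=num_experts)
    p = counts.float() / counts.sum()
    p = p[p > 0]                            # drop unused experts to avoid log(0)
    return float(-(p * p.log()).sum())      # nats; uniform routing gives log(256) ~ 5.55

experts = torch.randint(0, 256, (128, 8))   # uniformly random routing -> high entropy
print(routing_entropy(experts))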

4. Circuit Tracing with Sparse Autoencoders

Circuit tracing discovers which internal features of the model drive a specific prediction. The pipeline has five stages:

Stage 1: Activation Capture

When a trace is requested, the backend creates a sentinel file (/tmp/circuit_trace_capture). Forward hooks registered in the SGLang process detect this and save the residual stream output from each decoder layer:

SGLang Process                      Backend Process
─────────────────                   ────────────────
                                    1. Create sentinel file
model.layers[0](x) → hook saves
  /tmp/circuit_trace_activations/post_0.pt
model.layers[1](x) → hook saves
  /tmp/circuit_trace_activations/post_1.pt
...
model.layers[39](x) → hook saves
  /tmp/circuit_trace_activations/post_39.pt
                                    2. Remove sentinel
                                    3. Load all post_*.pt files

This filesystem-based IPC has near-zero overhead during normal chat: the hook makes one stat() call per layer and returns immediately if the sentinel doesn't exist.
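
A standalone approximation of the hook (the real factory is circuit_tracer.collectors.sglang_hooks:residual_capture_factory; this sketch only mirrors the sentinel-check behavior, with paths taken from the diagram above):

import os
import torch

SENTINEL = "/tmp/circuit_trace_capture"
OUT_DIR = "/tmp/circuit_trace_activations"

def residual_capture_hook(layer_idx: int):
    """Forward hook that saves the layer's residual output only while the
    sentinel file exists; otherwise it costs a single os.path.exists()."""
    def hook(module, inputs, output):
        if not os.path.exists(SENTINEL):   # one stat() per layer, then bail
            return
        hidden = output[0] if isinstance(output, tuple) else output
        os.makedirs(OUT_DIR, exist_ok=True)
        torch.save(hidden.detach().cpu(), f"{OUT_DIR}/post_{layer_idx}.pt")
    return hook

# Hypothetical registration on a loaded model:
# for i, layer in enumerate(model.model.layers):
#     layer.register_forward_hook(residual_capture_hook(i))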

Stage 2: Sparse Autoencoder Feature Extraction

Each SAE decomposes a 2048-dimensional residual stream vector into 16,384 sparse features:

Residual stream x ∈ R^2048
     │
     ▼
x_centered = x - bias
     │
     ▼
pre_act = W_enc · x_centered + b_enc      (2048 → 16384)
     │
     ▼
z = JumpReLU(pre_act)                     Sparse: ~50 active out of 16384
     │                                    z_i = pre_act_i  if pre_act_i > θ_i
     ▼                                    z_i = 0          otherwise
x_hat = W_dec · z + bias                  (16384 → 2048)

JumpReLU activation (introduced in DeepMind's JumpReLU SAE work) learns a per-feature threshold θ_i. Features only activate when their pre-activation exceeds this learned threshold, giving cleaner sparsity than standard ReLU.

Training objective:

L = ||x - x_hat||² + λ · ||z||₁
     ─────────────   ──────────
     reconstruction   sparsity
         loss         penalty (λ = 3×10⁻⁴)
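
A minimal PyTorch sketch of this forward pass and loss (parameter shapes and the threshold parameterization are assumptions; the project's implementation is in circuit_tracer/saes/sparse_autoencoder.py, and real training needs a straight-through estimator for gradients through the hard threshold):

import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int = 2048, n_features: int = 16384):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.full((n_features,), -4.0))

    def forward(self, x: torch.Tensor):
        pre_act = (x - self.bias) @ self.W_enc.T + self.b_enc   # 2048 -> 16384
        theta = self.log_theta.exp()                            # per-feature threshold
        z = torch.where(pre_act > theta, pre_act, torch.zeros_like(pre_act))
        x_hat = z @ self.W_dec.T + self.bias                    # 16384 -> 2048
        return x_hat, z

sae = JumpReLUSAE()
x = torch.randn(4, 2048)
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 3e-4 * z.abs().sum(dim=-1).mean()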

SAE Registry manages 68 SAEs across the full model:

  • 40 language SAEs (one per decoder layer, 2048 β†’ 16384 features)
  • 27 vision SAEs (one per ViT layer, 1152 β†’ 9216 features)
  • 1 projection SAE (visionβ†’language bridge, 1152 β†’ 2048)

SAEs are created on demand when a layer is first traced, which avoids allocating all 68 at startup.
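
The caching idea, reusing the JumpReLUSAE sketch above (the real registry in circuit_tracer/saes/registry.py also loads checkpoints and handles the projection SAE):

class SAERegistry:
    """Build an SAE only when a layer is first traced, then cache it."""
    def __init__(self):
        self._saes: dict[str, JumpReLUSAE] = {}

    def get(self, kind: str, layer: int) -> JumpReLUSAE:
        key = f"{kind}_{layer}"
        if key not in self._saes:
            d_model, n_features = (1152, 9216) if kind == "vision" else (2048, 16384)
            self._saes[key] = JumpReLUSAE(d_model, n_features)  # or load a checkpoint
        return self._saes[key]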

Stage 3: Attribution Graph Construction

The graph represents causal flow from input tokens through features to the output prediction:

Nodes:
  - Input: one per token position
  - Feature: (layer, position, feature_idx) with activation value
  - Output: the target token prediction

Edges:
  - Feature → Output: activation value (how much this feature contributes)
  - Feature → Feature: virtual weight × source activation

Virtual weights between features in adjacent layers are computed as:

W_virtual = W_dec_src^T · W_enc_tgt^T

where:
  W_dec_src: decoder weights of source layer's SAE  (d_model × n_features)
  W_enc_tgt: encoder weights of target layer's SAE  (n_features × d_model)
  W_virtual: (n_features_src × n_features_tgt)

Attribution(src→tgt) = activation_src × W_virtual[src_feat, tgt_feat]

This captures the linear pathway: how much a source feature's decoder direction projects onto the target feature's encoder direction.
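
In code, with feature dimensions shrunk so the example runs quickly (a sketch; the real computation batches this across layer pairs):

import torch

d_model, f_src, f_tgt = 2048, 64, 64
W_dec_src = torch.randn(d_model, f_src)   # source SAE decoder
W_enc_tgt = torch.randn(f_tgt, d_model)   # target SAE encoder

W_virtual = W_dec_src.T @ W_enc_tgt.T     # [f_src, f_tgt]

activation_src = 1.7                      # source feature's activation
src_feat, tgt_feat = 3, 41
attribution = activation_src * W_virtual[src_feat, tgt_feat].item()
print(W_virtual.shape, attribution)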

Stage 4: Graph Pruning

Raw graphs can have thousands of nodes. Pruning keeps only high-impact nodes:

1. Score each node by backward-propagated importance:
   - Output nodes get importance = 1.0
   - For each edge (src → tgt):
     importance[src] += |edge.weight| × importance[tgt]

2. Rank feature nodes by importance score

3. Keep top 10% (configurable), always keeping input/output nodes

4. Drop edges between pruned nodes
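
A pure-Python sketch of the backward importance pass from step 1 above (node names and the edge-ordering assumption are illustrative):

def backward_importance(edges, output_ids):
    """edges: (src, tgt, weight) triples, assumed sorted from input to output
    so that a back-to-front pass finalizes every target before its sources."""
    importance = {node: 1.0 for node in output_ids}
    for src, tgt, w in reversed(edges):
        importance[src] = importance.get(src, 0.0) + abs(w) * importance.get(tgt, 0.0)
    return importance

edges = [("f_early", "f_late", 0.5), ("f_late", "output", 0.8)]
print(backward_importance(edges, output_ids=["output"]))
# {'output': 1.0, 'f_late': 0.8, 'f_early': 0.4}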

Stage 5: Feature Steering (Causal Validation)

Once a circuit is identified, steering validates whether those features actually control the output:

Baseline:
  model("Analyze loan risk") β†’ "The applicant shows moderate risk..."

Intervention (clamp feature 6392 at layer 39 to 0):
  model("Analyze loan risk") β†’ "The applicant appears to be..."
                                 β–² different output confirms
                                   feature 6392 was causal

Steering works by:

  1. Running the model normally to get a baseline output
  2. Registering a forward hook at the target layer that:
    • Encodes the residual stream through the SAE
    • Modifies the specified feature activation(s)
    • Decodes back to residual stream space
  3. Generating again with the hook active
  4. Comparing the outputs
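
A sketch of that hook, reusing the JumpReLUSAE from earlier (the project's version is in circuit_tracer/interventions/steering.py; module paths in the comments are assumptions):

import torch

def make_steering_hook(sae, feature_idx: int, clamp_value: float = 0.0):
    """Route the residual stream through the SAE, clamp one feature, decode back."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        _, z = sae(hidden)
        z = z.clone()
        z[..., feature_idx] = clamp_value        # the intervention
        steered = z @ sae.W_dec.T + sae.bias     # decode to residual space
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# handle = model.model.layers[39].register_forward_hook(
#     make_steering_hook(sae, feature_idx=6392, clamp_value=0.0))
# ... generate, then: handle.remove()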

5. Chain-of-Thought Budget Control

The thinking budget system controls how many tokens the model spends on internal reasoning before responding:

User query arrives
     │
     ▼
┌───────────────────────────┐
│ Budget Resolution:        │
│   "standard" → 2048 tkns  │
│   "deep"     → 32768 tkns │
│   "none"     → 0 (skip)   │
└───────────────────────────┘
     │
     ▼
SGLang API call with:
  max_completion_tokens = budget + response_limit
  thinking { type: "enabled", budget_tokens: 2048 }
     │
     ▼
Model generates:
  <think>I need to evaluate... [up to budget tokens]</think>
  The credit analysis shows... [response tokens]

The frontend displays thinking content in a collapsible panel with token count and duration.
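
A sketch of the budget resolution and request assembly (the thinking field shape follows the diagram above; how the backend caps the unlimited preset is not shown here):

BUDGETS = {"none": 0, "minimal": 128, "short": 512, "standard": 2048,
           "extended": 8192, "deep": 32768, "unlimited": -1}

def build_chat_request(prompt: str, preset: str = "standard",
                       response_limit: int = 1024) -> dict:
    budget = BUDGETS[preset]
    request = {
        "model": "Qwen/Qwen3.5-35B-A3B-FP8",
        "messages": [{"role": "user", "content": prompt}],
    }
    if budget == 0:
        return request                    # "none": skip thinking entirely
    if budget > 0:
        request["max_completion_tokens"] = budget + response_limit
    request["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return request

print(build_chat_request("Evaluate customer 42 for a $50,000 loan"))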


API Reference

Authentication

POST /api/auth/login     { email, password } → session cookie
GET  /api/auth/me        β†’ current user info

Chat

POST /api/chat           Process chat message with agent
WS   /api/chat/ws        WebSocket streaming (thinking + response deltas)

Customers

GET  /api/customers                  List customers (paginated)
GET  /api/customers/{id}             Customer details
GET  /api/customers/{id}/credit-report   Full credit report

Circuit Tracer

POST /api/circuit/trace              Trace a prompt → attribution graph
GET  /api/circuit/architecture       Model architecture map (lang + vision)
GET  /api/circuit/saes               List SAE checkpoints
GET  /api/circuit/transcoders        List transcoder checkpoints
GET  /api/circuit/registry/status    Registry summary (counts, config)
POST /api/circuit/steer              Run feature steering intervention

Observability

GET  /api/observability/moe/current       Current MoE expert activations
GET  /api/observability/moe/history       Historical expert routing data
GET  /api/observability/thinking/sessions  Thinking session data

Configuration

GET  /api/thinking/budgets           Available thinking budget presets
POST /api/thinking/config            Update thinking configuration

System

GET  /health             Health check
GET  /metrics            Prometheus metrics (text format)
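
A minimal client session against these endpoints (request bodies beyond the login shape are illustrative; check the Pydantic models in backend/schemas/ for the exact fields):

import requests

BASE = "http://127.0.0.1:8080"
session = requests.Session()

# Log in once; the session cookie authenticates subsequent calls
session.post(f"{BASE}/api/auth/login",
             json={"email": "admin@creditscope.local", "password": "your-secure-password"})

print(session.get(f"{BASE}/api/auth/me").json())
print(session.get(f"{BASE}/api/customers").status_code)

# Trace a prompt (the "prompt" field name is an assumption)
r = session.post(f"{BASE}/api/circuit/trace",
                 json={"prompt": "Analyze loan risk for customer 42"})
print(r.status_code)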

Project Structure

creditscope/
├── backend/                      # FastAPI backend
│   ├── main.py                   # App entry point, middleware, router mounting
│   ├── auth.py                   # Session cookie authentication (HMAC)
│   ├── agent/
│   │   ├── orchestrator.py       # ReAct agent loop (up to 8 tool steps)
│   │   ├── tool_registry.py      # Tool dispatch and execution
│   │   ├── prompts.py            # System prompt and tool definitions
│   │   └── image_handler.py      # Document OCR processing
│   ├── db/
│   │   ├── models.py             # SQLAlchemy ORM (Customer, Loan, Document)
│   │   ├── queries.py            # Database query functions
│   │   └── seed.py               # Sample data seeder (55 customers)
│   ├── routers/
│   │   ├── chat.py               # Chat endpoint + WebSocket streaming
│   │   ├── customers.py          # Customer CRUD endpoints
│   │   ├── auth.py               # Login/logout endpoints
│   │   ├── history.py            # Conversation history
│   │   ├── observability.py      # MoE and thinking metrics
│   │   └── thinking.py           # CoT configuration
│   ├── schemas/                  # Pydantic request/response models
│   └── tools/
│       ├── credit_score.py       # Weighted credit score calculation
│       ├── debt_to_income.py     # DTI ratio computation
│       ├── payment_history.py    # Payment pattern analysis
│       ├── collateral_eval.py    # LTV and coverage assessment
│       ├── loan_structure.py     # Amortization and term calculation
│       └── risk_adjustment.py    # Regulatory risk adjustments
│
├── inference/                    # SGLang inference configuration
│   ├── config.py                 # Model config, expert counts, sampling params
│   ├── server.py                 # SGLang subprocess launcher
│   ├── moe_hooks.py              # Expert routing capture (ring buffer)
│   ├── cot_controller.py         # Thinking budget resolution
│   └── observability.py          # Prometheus metric recording
│
├── circuit_tracer/               # Mechanistic interpretability module
│   ├── api.py                    # FastAPI router (/circuit/* endpoints)
│   ├── config/
│   │   └── model_config.py       # Architecture specs (layers, dims, experts)
│   ├── collectors/
│   │   ├── sglang_hooks.py       # Forward hooks for activation capture (IPC)
│   │   ├── sglang_model.py       # SGLang interface (tokenize, forward, cache)
│   │   ├── model_loader.py       # Direct model loading (HookedModel)
│   │   ├── activation_collector.py # Batch activation collection for training
│   │   └── architecture_map.py   # Layer-by-layer architecture parsing
│   ├── saes/
│   │   ├── sparse_autoencoder.py # SAE: encode → JumpReLU → decode
│   │   ├── vision_sae.py         # Vision encoder SAE + projection SAE
│   │   ├── trainer.py            # SAE training loop (MSE + L1)
│   │   └── registry.py           # On-demand SAE management (68 total)
│   ├── transcoders/
│   │   └── registry.py           # Transcoder registry (language + vision)
│   ├── attribution/
│   │   ├── replacement_model.py  # Linearized model + graph construction
│   │   ├── graph.py              # Node/Edge/Graph data structures
│   │   └── pruning.py            # Backward importance pruning
│   ├── interventions/
│   │   └── steering.py           # Feature clamping and ablation
│   ├── visualization/
│   │   └── export.py             # JSON, summary, and Graphviz DOT export
│   ├── metrics.py                # Prometheus gauges for registries
│   └── data/
│       ├── checkpoints/          # Trained SAE/transcoder weights
│       ├── activations/          # Collected activation datasets
│       └── graphs/               # Exported attribution graph JSONs
│
├── frontend/                     # React TypeScript UI
│   ├── package.json              # Dependencies (React 18, Vite, Tailwind)
│   ├── vite.config.ts            # Dev server config (port 3000, API proxy)
│   └── src/
│       ├── components/
│       │   ├── ChatInterface.tsx           # Main chat UI
│       │   ├── CircuitTracerDashboard.tsx  # Trace + architecture UI
│       │   ├── MoEExpertPanel.tsx          # Expert routing heatmap
│       │   ├── ThinkingPanel.tsx           # CoT display
│       │   ├── ObservabilityDash.tsx       # System metrics
│       │   ├── LoginScreen.tsx             # Authentication
│       │   └── ...
│       ├── hooks/
│       │   └── useChat.ts        # WebSocket chat management
│       └── types/
│           └── index.ts          # TypeScript interfaces
│
├── deploy/
│   └── nginx/
│       └── creditscope.conf      # nginx reverse proxy config
│
├── scripts/
│   ├── collect_activations_bf16.py   # Collect activations from BF16 model
│   ├── check_activation_health.py    # Verify activations (no inf/nan, std check)
│   ├── retrain_from_saved_activations.py  # Train SAEs/TCs on activations
│   ├── push_to_hf.py                # Push trained checkpoints to HF
│   ├── setup.sh                     # Full dev environment setup
│   ├── run_dev.sh                   # Service launcher (--no-inference, --profile)
│   ├── start-dev.sh                 # PID-managed launcher (--status, --stop)
│   ├── setup_nginx_http.sh          # nginx + SSL setup
│   └── watchdog.sh                  # Health monitoring for cron
│
├── grafana/                      # Grafana dashboard provisioning
├── start_services.sh             # Quick-start script
├── pyproject.toml                # Python project metadata + dependencies
├── .env                          # Environment configuration
└── .env.example                  # Configuration template

Development

Push Trained Models to Hugging Face

Use a Hugging Face model repo for trained SAE and transcoder checkpoints. Do not publish trained weights to the dataset backup repo unless you intentionally want them stored as raw data artifacts.

The trained model files are written under:

circuit_tracer/data/checkpoints/

The collected activation datasets are written under:

circuit_tracer/data/activations/

To publish trained checkpoints to Hugging Face:

source .venv/bin/activate
export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"

Create a private model repo:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
  repo_id="sarel/creditscope-trained-models",
  repo_type="model",
  private=True,
  exist_ok=True,
)
print("created_or_exists=sarel/creditscope-trained-models")
PY

Upload all trained checkpoints:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.upload_folder(
  folder_path="circuit_tracer/data/checkpoints",
  path_in_repo="checkpoints",
  repo_id="sarel/creditscope-trained-models",
  repo_type="model",
  commit_message="Upload trained SAE and transcoder checkpoints",
)
print("uploaded checkpoints to sarel/creditscope-trained-models")
PY

If you also want to publish activation tensors for reproducibility, keep those in a dataset repo instead:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
  repo_id="sarel/creditscope-activation-data",
  repo_type="dataset",
  private=True,
  exist_ok=True,
)
api.upload_folder(
  folder_path="circuit_tracer/data/activations",
  path_in_repo="activations",
  repo_id="sarel/creditscope-activation-data",
  repo_type="dataset",
  commit_message="Upload collected activation tensors",
)
print("uploaded activations to sarel/creditscope-activation-data")
PY

Recommended split:

  • Model repo: circuit_tracer/data/checkpoints/
  • Dataset repo: circuit_tracer/data/activations/, data/creditscope.db, exported graphs
source .venv/bin/activate

# Run tests
pytest

# Lint
ruff check .

# Type check
mypy backend inference circuit_tracer

# Format
ruff format .

Running Without GPU

./scripts/run_dev.sh --no-inference

The backend and frontend will start without the inference server. Chat will be unavailable, but you can develop on the UI, credit tools, and database.

Production Hardening

Add a watchdog cron job for auto-restart:

* * * * * cd /home/ubuntu/creditscope && ./scripts/watchdog.sh

Environment variables for watchdog:

  • WATCHDOG_BACKEND_URL β€” defaults to http://127.0.0.1:8080/health
  • WATCHDOG_INFERENCE_URL β€” defaults to http://127.0.0.1:8000/model_info
  • WATCHDOG_RESTART_COOLDOWN_SECONDS β€” defaults to 120

Training SAEs and Transcoders from Scratch

This section describes how to collect fresh activations and train SAEs/TCs from zero: no pre-existing checkpoints or activations are needed.

Overview

1. Collect activations     ──→  2. Train SAEs + TCs  ──→  3. Push to HF  ──→  4. Run app
   (BF16 model + dataset)       (from saved .npy)         (checkpoints)       (load checkpoints)

Step 1: Collect Activations

The collection script loads the BF16 model, runs forward passes on financial text, and captures the residual stream (pre and post) at each target layer.

Data source: sarel/creditscope-fino1-activations (a HuggingFace dataset whose text column contains financial reasoning examples).

# Collect 1M tokens at layers 0, 10, 30, 39
HF_TOKEN=<your_token> python scripts/collect_activations_bf16.py \
    --model Qwen/Qwen3.5-35B-A3B \
    --dataset sarel/creditscope-fino1-activations \
    --layers 0,10,30,39 \
    --target-tokens 1000000 \
    --batch-size 4 \
    --max-seq-len 512 \
    --output-dir circuit_tracer/data/activations

# To skip HF upload:
#   --no-upload

What it does:

  1. Installs a pure-PyTorch causal_conv1d patch (needed for DeltaNet layers)
  2. Loads the BF16 model (~70GB VRAM)
  3. Runs a sanity check - verifies activation std is in range [1e-6, 100] (catches FP8 dequant failures)
  4. Iterates through dataset texts in batches, capturing pre (input to layer) and post (output of layer) activations
  5. Saves chunks as .npy files (float32), ~50K tokens each
  6. Computes and saves per-layer normalization statistics
  7. Uploads everything to sarel/creditscope-activations-v2

Expected output structure:

circuit_tracer/data/activations/
├── layer_0_residual_pre/
│   ├── chunk_0000.npy    # [~50000, 2048] float32
│   ├── chunk_0001.npy
│   └── ...
├── layer_0_residual_post/
│   └── ...
├── layer_10_residual_pre/
│   └── ...
├── ...
├── normalization_stats.json
└── capture_config.json

Expected activation statistics (BF16 model):

Layer   Residual Pre std   Residual Post std
0       ~0.03              ~0.03
10      ~0.3               ~0.3
30      ~0.3               ~0.3
39      ~0.7–0.9           ~0.7–0.9
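
A quick way to spot-check a collected chunk against this table (file paths follow the layout above; the keys of the stats file are not documented here, so the last line just lists them):

import json
import numpy as np

chunk = np.load("circuit_tracer/data/activations/layer_0_residual_post/chunk_0000.npy")
print(chunk.shape, chunk.dtype)   # (~50000, 2048) float32
print("std:", chunk.std())        # expect ~0.03 for layer 0

with open("circuit_tracer/data/activations/normalization_stats.json") as f:
    print(list(json.load(f)))     # inspect the available per-layer keys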

Why BF16 and not FP8? The FP8 model (Qwen3.5-35B-A3B-FP8) does not load correctly via transformers: the weight_scale_inv dequantization tensors are ignored, producing corrupted weights. FP8 is just weight compression, and the BF16 model produces nearly identical activations. If SGLang support becomes available (fixes for sgl_kernel on SM120), you can collect from FP8 via SGLang instead.

Step 2: Train SAEs and Transcoders

# Train all SAEs and TCs from the collected activations
python scripts/retrain_from_saved_activations.py

This script:

  1. Loads activation chunks from circuit_tracer/data/activations/
  2. Trains one JumpReLU SAE per layer (d_model=2048 β†’ 16384 features)
  3. Trains one MoE Transcoder per layer (maps pre → post activations)
  4. Applies normalization when activation std > 1.0 (targets std=0.01)
  5. Saves checkpoints as sae_l{N}.pt and tc_l{N}.pt
  6. Logs training metrics to W&B (if configured)

Training parameters:

Parameter       SAE                    Transcoder
Architecture    JumpReLU autoencoder   MoE transcoder
Input dim       2048                   2048 (pre) → 2048 (post)
Feature dim     16384 (8x expansion)   16384
Learning rate   3e-4                   1e-4
Batch size      4096 tokens            2048 tokens
Steps           50,000                 100,000
Loss            MSE + L1 sparsity      MSE reconstruction
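
One SAE optimization step under these parameters, reusing the JumpReLUSAE sketch from the circuit-tracing section (the real loop in circuit_tracer/saes/trainer.py handles chunk streaming, schedules, and W&B logging):

import torch

sae = JumpReLUSAE(d_model=2048, n_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

batch = torch.randn(4096, 2048) * 0.03   # stand-in for a loaded activation chunk
x_hat, z = sae(batch)
loss = ((batch - x_hat) ** 2).mean() + 3e-4 * z.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()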

Step 3: Push to HuggingFace

# Upload trained checkpoints
python scripts/push_to_hf.py

This uploads sae_l*.pt, tc_l*.pt, normalization_stats.json, and architecture_map.json to sarel/creditscope-trained-models.

Step 4: Verify

After training, verify the SAEs produce meaningful features:

# Quick sanity check
python -c "
import torch
from circuit_tracer.saes.sparse_autoencoder import SparseAutoencoder

data = torch.load('circuit_tracer/data/checkpoints/sae_l0.pt', map_location='cpu', weights_only=False)
sae = SparseAutoencoder(**data['config'])
sae.load_state_dict(data['state_dict'])

# Feed random input and check reconstruction
x = torch.randn(10, 2048) * 0.03  # match layer 0 std
out = sae(x)
print(f'Reconstruction MSE: {((x - out.x_hat)**2).mean():.6f}')
print(f'Active features (L0): {(out.z > 0).float().sum(dim=1).mean():.0f}')
print(f'Expected: L0 ~ 50-200, MSE << input variance')
"

HuggingFace Repos

Repo                                  Type      Contents
sarel/creditscope-trained-models      model     SAE/TC checkpoints, architecture map, deployment guide
sarel/creditscope-activations-v2      dataset   Activation captures from BF16 model
sarel/creditscope-fino1-activations   dataset   Source financial text dataset (input for collection)

License

MIT License; see LICENSE for details.
