CreditScope

Agentic Credit Scoring with MoE Observability and Circuit Tracing

CreditScope is an AI-powered credit analysis platform built on Qwen3.5-35B-A3B-FP8, a 35-billion-parameter mixture-of-experts language model. It provides a full-stack application for credit scoring, real-time MoE expert routing visualization, chain-of-thought reasoning control, and mechanistic interpretability via circuit tracing with sparse autoencoders.

Table of Contents

  1. Architecture Overview
  2. Machine Setup from Scratch
  3. Configuration
  4. Running the Application
  5. Services and Ports
  6. Algorithm Deep Dive
  7. API Reference
  8. Project Structure
  9. Development
  10. Training SAEs and Transcoders from Scratch

Architecture Overview

┌────────────────────────────────────────────────────────────────────────┐
│                        nginx (:80/:443/:20003)                         │
│                reverse proxy + WebSocket upgrade + SSL                 │
└───────┬──────────────────────┬─────────────────────────────────────────┘
        │                      │
        ▼                      ▼
┌───────────────┐      ┌────────────────────────────────────────────────┐
│ React Frontend│      │             FastAPI Backend (:8080)            │
│  Vite + TS    │      │                                                │
│  Tailwind CSS │      │  ┌──────────┐ ┌────────────┐ ┌──────────────┐  │
│  Port 3000    │◄────►│  │  Agent   │ │  Credit    │ │  Circuit     │  │
│               │      │  │  (ReAct) │ │  Tools     │ │  Tracer API  │  │
└───────────────┘      │  └────┬─────┘ └────────────┘ └──────┬───────┘  │
                       │       │                             │          │
                       └───────┼─────────────────────────────┼──────────┘
                               │                             │
                               ▼                             ▼
                       ┌────────────────────────────────────────────────┐
                       │         SGLang Inference Server (:8000)        │
                       │                                                │
                       │  Qwen3.5-35B-A3B-FP8 (40 layers, 256 experts)  │
                       │                                                │
                       │  ┌──────────────┐   ┌────────────────────────┐ │
                       │  │  MoE Hooks   │   │ Residual Capture Hooks │ │
                       │  │  (routing    │   │ (activation tensors    │ │
                       │  │   telemetry) │   │  via filesystem IPC)   │ │
                       │  └──────────────┘   └────────────────────────┘ │
                       └────────────────────────────────────────────────┘

Key design decision: A single instance of the 35B model serves both chat inference and circuit tracing. The circuit tracer captures activations from the running SGLang server via forward hooks and filesystem-based IPC, avoiding the need to load a second copy of the model (which would require ~70GB additional VRAM).


Machine Setup from Scratch

These instructions provision a fresh Ubuntu server with GPU support for running CreditScope natively.

1. Hardware Requirements

Component   Minimum                   Recommended
GPU         NVIDIA with 48GB+ VRAM    NVIDIA RTX PRO 6000 (96GB) or A100 80GB
CPU         8 cores                   16+ cores
RAM         32GB                      64GB+
Storage     100GB free                200GB+ (model weights ~35GB)
OS          Ubuntu 22.04 LTS          Ubuntu 24.04 LTS

For Blackwell-generation GPUs (RTX PRO 6000, sm_120), specific SGLang flags are required; see SGLang Flags for Blackwell GPUs under Configuration.

2. Install System Dependencies

sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y \
    python3 python3-pip python3-venv python3-dev \
    git curl wget nginx openssl lsof \
    build-essential

Install Node.js 18+:

curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
node --version   # should be 18.x+
npm --version    # should be 9.x+

3. Install NVIDIA Drivers and CUDA

Skip this section if drivers are already installed (nvidia-smi works).

# Install NVIDIA driver (latest recommended)
sudo apt-get install -y nvidia-driver-565

# Install CUDA toolkit 12.x
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6

# Verify
nvidia-smi
nvcc --version

4. Clone and Configure

cd /home/ubuntu
git clone https://github.com/sarelWeinberger/creditscope.git
cd creditscope
cp .env.example .env

Edit .env and set:

# Required
AUTH_USERS=admin@creditscope.local        # comma-separated allowed emails
AUTH_PASSWORD=your-secure-password        # shared login password
AUTH_SECRET_KEY=$(openssl rand -hex 32)   # session signing key

# Set your server's public IP for CORS
CORS_ORIGINS=http://localhost:3000,http://YOUR_PUBLIC_IP

# Optional: HuggingFace token if model is gated
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxx

5. Create Python Virtual Environment

python3 -m venv .venv
source .venv/bin/activate

# Install the project with all dependencies
pip install -e ".[dev,backend,inference,circuit]"

For Blackwell GPUs, you may need a specific CuDNN version:

pip install nvidia-cudnn-cu12==9.16.0.29

6. Install Frontend Dependencies

cd frontend
npm install
cd ..

7. Set Up nginx Reverse Proxy

chmod +x scripts/setup_nginx_http.sh
./scripts/setup_nginx_http.sh

This configures nginx to:

  • Route / to the Vite frontend on port 3000
  • Route /api/ to the FastAPI backend on port 8080
  • Upgrade /api/chat/ws connections to WebSocket
  • Listen on ports 80, 443 (self-signed SSL), and 20003

8. Download Model Weights

The model downloads automatically on first SGLang start. To pre-download:

source .venv/bin/activate
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-35B-A3B-FP8')"

This downloads ~35GB to `~/.cache/huggingface/`.

9. Start Everything

chmod +x start_services.sh
./start_services.sh

This starts:

  1. SGLang on port 8000 (waits ~60s for model load)
  2. FastAPI backend on port 8080

The frontend must be started separately:

cd frontend && npm run dev &

Verify:

curl -s http://127.0.0.1:8000/v1/models | head -5   # SGLang
curl -s http://127.0.0.1:8080/health                 # Backend
curl -s http://127.0.0.1:3000/                        # Frontend
curl -s http://YOUR_PUBLIC_IP:20003/                  # Public (nginx)

Configuration

All configuration is in .env. Key variables:

Variable                  Default                          Description
MODEL_PATH                Qwen/Qwen3.5-35B-A3B-FP8         HuggingFace model ID
CONTEXT_LENGTH            4096                             Max context window
TP_SIZE                   1                                Tensor parallelism (number of GPUs)
MEM_FRACTION_STATIC       0.98                             Fraction of GPU VRAM for weights + KV cache
SGLANG_PORT               8000                             Inference server port
BACKEND_PORT              8080                             FastAPI backend port
FRONTEND_PORT             3000                             Vite dev server port
DATABASE_URL              sqlite:///./data/creditscope.db  Database path
SEED_DB                   true                             Seed sample customers on startup
DEFAULT_THINKING_BUDGET   standard                         CoT budget preset
AUTH_USERS                (none)                           Comma-separated allowed login emails
AUTH_PASSWORD             (none)                           Shared login password
AUTH_SECRET_KEY           (none)                           HMAC key for session cookies

SGLang Flags for Blackwell GPUs

Blackwell-generation GPUs (sm_120: RTX PRO 6000, RTX 5090, etc.) require specific flags:

SGLANG_EXTRA_ARGS="
  --attention-backend triton         # Only triton/trtllm_mha supported on Blackwell
  --fp8-gemm-backend triton          # flashinfer FP8 unsupported on sm_120
  --disable-cuda-graph               # Stability on newer architectures
  --max-mamba-cache-size 16          # Limit DeltaNet state cache
  --skip-server-warmup               # Faster startup
  --chunked-prefill-size 512         # Reduce memory peaks
  --max-running-requests 2           # Limit concurrency
  --max-total-tokens 65536           # KV cache token budget
"

Also set these environment variables before launching SGLang:

SGLANG_ENABLE_JIT_DEEPGEMM=0        # Avoid DeepGemm recipe errors on sm_120
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # Better memory allocation

Thinking Budget Presets

Preset      Tokens   Use Case
none        0        Direct responses only
minimal     128      Simple lookups
short       512      Quick calculations
standard    2,048    Normal analysis
extended    8,192    Complex reasoning
deep        32,768   Thorough investigation
unlimited   -1       No limit

Running the Application

Quick Start (All Services)

./start_services.sh
cd frontend && npm run dev &

Individual Services

# SGLang inference server
SGLANG_ENABLE_JIT_DEEPGEMM=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  PYTHONPATH=/home/ubuntu/creditscope \
  .venv/bin/python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B-FP8 \
    --port 8000 --tp-size 1 --mem-fraction-static 0.50 \
    --context-length 2048 --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder --enable-metrics \
    --attention-backend triton --fp8-gemm-backend triton \
    --disable-cuda-graph --skip-server-warmup \
    --forward-hooks '[{"name":"residual_capture","target_modules":["model.layers.*"],"hook_factory":"circuit_tracer.collectors.sglang_hooks:residual_capture_factory","config":{}}]'

# FastAPI backend
.venv/bin/uvicorn backend.main:app --host 127.0.0.1 --port 8080 --reload

# Frontend
cd frontend && npm run dev

Docker Deployment

cp .env.example .env
docker compose up -d --build

Requires Docker, docker-compose-v2, and NVIDIA Container Toolkit for GPU support.

Logs

tail -f /tmp/sglang.log     # SGLang inference
tail -f /tmp/backend.log     # FastAPI backend

Services and Ports

Service      Port           Protocol     Description
SGLang       8000           HTTP         OpenAI-compatible inference API
Backend      8080           HTTP/WS      FastAPI + agent + circuit tracer
Frontend     3000           HTTP         Vite React dev server
nginx        80/443/20003   HTTP/HTTPS   Public reverse proxy
Prometheus   9090           HTTP         Metrics (Docker only)
Grafana      3001           HTTP         Dashboards (Docker only)

Algorithm Deep Dive

1. Model Architecture - Qwen3.5-35B-A3B-FP8

CreditScope runs on a hybrid attention + MoE architecture:

Input tokens
     │
     ▼
┌─────────────┐
│  Embedding  │  d_model = 2048
└─────┬───────┘
      │
      ▼
┌─────────────────────────────────────────────────┐
│  40 Decoder Layers (repeating pattern of 4):    │
│                                                 │
│  Layer N+0: DeltaNet (linear) attention + MoE   │
│  Layer N+1: DeltaNet (linear) attention + MoE   │
│  Layer N+2: DeltaNet (linear) attention + MoE   │
│  Layer N+3: Standard (full) attention   + MoE   │
│                                                 │
│  Each MoE layer: 256 experts, top-8 routing     │
│  Per-expert intermediate size: 512              │
│                                                 │
│  Full attention at layers: 3,7,11,...,35,39     │
│  DeltaNet attention at all other layers         │
└─────────────────────────────────────────────────┘
      │
      ▼
┌─────────────┐
│  LM Head    │  → next-token logits
└─────────────┘

Vision Encoder (27 ViT layers):
  Hidden size: 1152 → projects to d_model=2048
  Intermediate size: 4304
  16 attention heads, patch size 16

DeltaNet layers use linear attention (O(n) vs O(n^2)), which makes the model efficient for long sequences. Standard attention layers every 4th position provide full-context mixing.

Mixture of Experts (MoE): Every layer has 256 experts but only routes each token to the top 8. This gives the model 35B total parameters but only ~3B active per token, enabling fast inference.
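
As a minimal sketch of the routing math (function and tensor names are illustrative, not the actual SGLang internals):

import torch

def route_tokens(router_logits: torch.Tensor, top_k: int = 8):
    """Select top-k experts per token for a 256-expert MoE layer.

    router_logits: [num_tokens, 256] raw gate outputs.
    Returns (expert_ids, gate_weights), each [num_tokens, top_k].
    """
    probs = torch.softmax(router_logits, dim=-1)           # routing distribution
    gate_weights, expert_ids = probs.topk(top_k, dim=-1)   # top-8 selection
    # Renormalize the selected gates to sum to 1 per token (whether Qwen
    # renormalizes after top-k is an assumption in this sketch)
    gate_weights = gate_weights / gate_weights.sum(dim=-1, keepdim=True)
    return expert_ids, gate_weights

logits = torch.randn(4, 256)     # 4 tokens, 256 experts
ids, weights = route_tokens(logits)
print(ids.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 8])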

2. ReAct Agent Loop

The agent uses a Reason-Act-Observe loop to answer credit analysis queries:

User Query: "Evaluate John Smith for a $50,000 business loan"
     │
     ▼
┌─────────────────────────────────────┐
│  1. REASON (thinking tokens)        │
│     "I need to check credit score,  │
│      DTI ratio, and collateral..."  │
│                                     │
│  2. ACT (tool call)                 │
│     calculate_credit_score(id=42)   │
│                                     │
│  3. OBSERVE (tool result)           │
│     Score: 720, Grade: B            │
│                                     │
│  4. REASON again                    │
│     "Score is good, need DTI next"  │
│                                     │
│  5. ACT                             │
│     calculate_dti(id=42, amount=50k)│
│                                     │
│  ... (up to 8 steps) ...            │
│                                     │
│  FINAL: Synthesize response         │
└─────────────────────────────────────┘

Available Credit Tools:

Tool                      Description
calculate_credit_score    Weighted score from payment history (35%), utilization (30%), age (15%), mix (10%), inquiries (10%)
calculate_dti             Front-end and back-end debt-to-income ratios with risk classification
analyze_payment_history   Delinquency patterns, on-time rate, severity scoring
evaluate_collateral       Loan-to-value ratio, haircut-adjusted value, coverage ratio
structure_loan            Amortization schedule, monthly payment calculation
apply_risk_adjustments    Regulatory and behavioral risk adjustments to base score
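
For illustration, a hypothetical simplification of the weighting behind calculate_credit_score (the real tool lives in backend/tools/credit_score.py and also assigns a letter grade; the component names below are made up):

def weighted_credit_score(components: dict[str, float]) -> float:
    """Combine five sub-scores (each 0-100) with the documented weights."""
    weights = {
        "payment_history": 0.35,
        "credit_utilization": 0.30,
        "credit_age": 0.15,
        "credit_mix": 0.10,
        "recent_inquiries": 0.10,
    }
    return sum(weights[name] * components[name] for name in weights)

print(weighted_credit_score({
    "payment_history": 92, "credit_utilization": 70,
    "credit_age": 60, "credit_mix": 80, "recent_inquiries": 90,
}))  # 79.2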

3. MoE Expert Routing Observability

During every inference request, forward hooks on the MoE gate modules capture:

For each of the 40 MoE layers:
  ┌─────────────────────────────────────────────┐
  │ router_logits: [num_tokens × 256 experts]   │
  │      │                                      │
  │      ▼ softmax + top-8 selection            │
  │ selected_experts: [num_tokens × 8]          │
  │ gating_weights:   [num_tokens × 8]          │
  │                                             │
  │ Metrics computed:                           │
  │   - Expert load distribution                │
  │   - Shannon entropy of routing distribution │
  │   - Per-expert activation frequency         │
  └─────────────────────────────────────────────┘

Shannon entropy measures routing diversity:

  • Low entropy β†’ tokens concentrate on few experts (specialized)
  • High entropy β†’ tokens spread evenly across experts (generic)

The frontend displays this as a real-time heatmap of expert activations across layers.
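
A minimal sketch of the entropy metric over a captured selected_experts tensor (the log base and the use of the empirical expert load are assumptions of this sketch):

import torch

def routing_entropy(selected_experts: torch.Tensor, num_experts: int = 256) -> float:
    """Shannon entropy of the empirical expert-load distribution for one layer.

    selected_experts: [num_tokens, 8] expert indices from the gate hook.
    """
    counts = torch.bincount(selected_experts.flatten(), minlength=num_experts)
    p = counts.float() / counts.sum()
    p = p[p > 0]                            # drop unused experts to avoid log(0)
    return float(-(p * p.log()).sum())      # nats; uniform routing gives log(256) ~ 5.55

experts = torch.randint(0, 256, (128, 8))   # uniformly random routing -> high entropy
print(routing_entropy(experts))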

4. Circuit Tracing with Sparse Autoencoders

Circuit tracing discovers which internal features of the model drive a specific prediction. The pipeline has five stages:

Stage 1: Activation Capture

When a trace is requested, the backend creates a sentinel file (/tmp/circuit_trace_capture). Forward hooks registered in the SGLang process detect this and save the residual stream output from each decoder layer:

SGLang Process                      Backend Process
─────────────────                   ────────────────
                                    1. Create sentinel file
model.layers[0](x) → hook saves
  /tmp/circuit_trace_activations/post_0.pt
model.layers[1](x) → hook saves
  /tmp/circuit_trace_activations/post_1.pt
...
model.layers[39](x) → hook saves
  /tmp/circuit_trace_activations/post_39.pt
                                    2. Remove sentinel
                                    3. Load all post_*.pt files

This filesystem-based IPC has near-zero overhead during normal chat: the hook makes one stat() call per layer and returns immediately if the sentinel doesn't exist.
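
A standalone approximation of the hook (the real factory is circuit_tracer.collectors.sglang_hooks:residual_capture_factory; this sketch only mirrors the sentinel-check behavior, with paths taken from the diagram above):

import os
import torch

SENTINEL = "/tmp/circuit_trace_capture"
OUT_DIR = "/tmp/circuit_trace_activations"

def residual_capture_hook(layer_idx: int):
    """Forward hook that saves the layer's residual output only while the
    sentinel file exists; otherwise it costs a single os.path.exists()."""
    def hook(module, inputs, output):
        if not os.path.exists(SENTINEL):   # one stat() per layer, then bail
            return
        hidden = output[0] if isinstance(output, tuple) else output
        os.makedirs(OUT_DIR, exist_ok=True)
        torch.save(hidden.detach().cpu(), f"{OUT_DIR}/post_{layer_idx}.pt")
    return hook

# Hypothetical registration on a loaded model:
# for i, layer in enumerate(model.model.layers):
#     layer.register_forward_hook(residual_capture_hook(i))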

Stage 2: Sparse Autoencoder Feature Extraction

Each SAE decomposes a 2048-dimensional residual stream vector into 16,384 sparse features:

Residual stream x ∈ R^2048
     │
     ▼
x_centered = x - bias
     │
     ▼
pre_act = W_enc · x_centered + b_enc      (2048 → 16384)
     │
     ▼
z = JumpReLU(pre_act)                     Sparse: ~50 active out of 16384
     │                                    z_i = pre_act_i  if pre_act_i > θ_i
     ▼                                    z_i = 0          otherwise
x_hat = W_dec · z + bias                  (16384 → 2048)

JumpReLU activation (introduced in DeepMind's JumpReLU SAE work) learns a per-feature threshold θ_i. Features only activate when their pre-activation exceeds this learned threshold, giving cleaner sparsity than standard ReLU.

Training objective:

L = ||x - x_hat||² + λ · ||z||₁
     ─────────────   ──────────
     reconstruction   sparsity
         loss         penalty (λ = 3×10⁻⁴)
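
A minimal PyTorch sketch of this forward pass and loss (parameter shapes and the threshold parameterization are assumptions; the project's implementation is in circuit_tracer/saes/sparse_autoencoder.py, and real training needs a straight-through estimator for gradients through the hard threshold):

import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int = 2048, n_features: int = 16384):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.full((n_features,), -4.0))

    def forward(self, x: torch.Tensor):
        pre_act = (x - self.bias) @ self.W_enc.T + self.b_enc   # 2048 -> 16384
        theta = self.log_theta.exp()                            # per-feature threshold
        z = torch.where(pre_act > theta, pre_act, torch.zeros_like(pre_act))
        x_hat = z @ self.W_dec.T + self.bias                    # 16384 -> 2048
        return x_hat, z

sae = JumpReLUSAE()
x = torch.randn(4, 2048)
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 3e-4 * z.abs().sum(dim=-1).mean()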

SAE Registry manages 68 SAEs across the full model:

  • 40 language SAEs (one per decoder layer, 2048 β†’ 16384 features)
  • 27 vision SAEs (one per ViT layer, 1152 β†’ 9216 features)
  • 1 projection SAE (visionβ†’language bridge, 1152 β†’ 2048)

SAEs are created on demand when a layer is first traced, which avoids allocating all 68 at startup.
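
The caching idea, reusing the JumpReLUSAE sketch above (the real registry in circuit_tracer/saes/registry.py also loads checkpoints and handles the projection SAE):

class SAERegistry:
    """Build an SAE only when a layer is first traced, then cache it."""
    def __init__(self):
        self._saes: dict[str, JumpReLUSAE] = {}

    def get(self, kind: str, layer: int) -> JumpReLUSAE:
        key = f"{kind}_{layer}"
        if key not in self._saes:
            d_model, n_features = (1152, 9216) if kind == "vision" else (2048, 16384)
            self._saes[key] = JumpReLUSAE(d_model, n_features)  # or load a checkpoint
        return self._saes[key]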

Stage 3: Attribution Graph Construction

The graph represents causal flow from input tokens through features to the output prediction:

Nodes:
  - Input: one per token position
  - Feature: (layer, position, feature_idx) with activation value
  - Output: the target token prediction

Edges:
  - Feature → Output: activation value (how much this feature contributes)
  - Feature → Feature: virtual weight × source activation

Virtual weights between features in adjacent layers are computed as:

W_virtual = W_dec_src^T · W_enc_tgt^T

where:
  W_dec_src: decoder weights of source layer's SAE  (d_model × n_features)
  W_enc_tgt: encoder weights of target layer's SAE  (n_features × d_model)
  W_virtual: (n_features_src × n_features_tgt)

Attribution(src→tgt) = activation_src × W_virtual[src_feat, tgt_feat]

This captures the linear pathway: how much a source feature's decoder direction projects onto the target feature's encoder direction.
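
In code, with feature dimensions shrunk so the example runs quickly (a sketch; the real computation batches this across layer pairs):

import torch

d_model, f_src, f_tgt = 2048, 64, 64
W_dec_src = torch.randn(d_model, f_src)   # source SAE decoder
W_enc_tgt = torch.randn(f_tgt, d_model)   # target SAE encoder

W_virtual = W_dec_src.T @ W_enc_tgt.T     # [f_src, f_tgt]

activation_src = 1.7                      # source feature's activation
src_feat, tgt_feat = 3, 41
attribution = activation_src * W_virtual[src_feat, tgt_feat].item()
print(W_virtual.shape, attribution)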

Stage 4: Graph Pruning

Raw graphs can have thousands of nodes. Pruning keeps only high-impact nodes:

1. Score each node by backward-propagated importance:
   - Output nodes get importance = 1.0
   - For each edge (src → tgt):
     importance[src] += |edge.weight| × importance[tgt]

2. Rank feature nodes by importance score

3. Keep top 10% (configurable), always keeping input/output nodes

4. Drop edges between pruned nodes
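
A pure-Python sketch of the backward importance pass from step 1 above (node names and the edge-ordering assumption are illustrative):

def backward_importance(edges, output_ids):
    """edges: (src, tgt, weight) triples, assumed sorted from input to output
    so that a back-to-front pass finalizes every target before its sources."""
    importance = {node: 1.0 for node in output_ids}
    for src, tgt, w in reversed(edges):
        importance[src] = importance.get(src, 0.0) + abs(w) * importance.get(tgt, 0.0)
    return importance

edges = [("f_early", "f_late", 0.5), ("f_late", "output", 0.8)]
print(backward_importance(edges, output_ids=["output"]))
# {'output': 1.0, 'f_late': 0.8, 'f_early': 0.4}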

Stage 5: Feature Steering (Causal Validation)

Once a circuit is identified, steering validates whether those features actually control the output:

Baseline:
  model("Analyze loan risk") β†’ "The applicant shows moderate risk..."

Intervention (clamp feature 6392 at layer 39 to 0):
  model("Analyze loan risk") β†’ "The applicant appears to be..."
                                 β–² different output confirms
                                   feature 6392 was causal

Steering works by:

  1. Running the model normally to get a baseline output
  2. Registering a forward hook at the target layer that:
    • Encodes the residual stream through the SAE
    • Modifies the specified feature activation(s)
    • Decodes back to residual stream space
  3. Generating again with the hook active
  4. Comparing the outputs
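
A sketch of that hook, reusing the JumpReLUSAE from earlier (the project's version is in circuit_tracer/interventions/steering.py; module paths in the comments are assumptions):

import torch

def make_steering_hook(sae, feature_idx: int, clamp_value: float = 0.0):
    """Route the residual stream through the SAE, clamp one feature, decode back."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        _, z = sae(hidden)
        z = z.clone()
        z[..., feature_idx] = clamp_value        # the intervention
        steered = z @ sae.W_dec.T + sae.bias     # decode to residual space
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# handle = model.model.layers[39].register_forward_hook(
#     make_steering_hook(sae, feature_idx=6392, clamp_value=0.0))
# ... generate, then: handle.remove()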

5. Chain-of-Thought Budget Control

The thinking budget system controls how many tokens the model spends on internal reasoning before responding:

User query arrives
     │
     ▼
┌───────────────────────────┐
│ Budget Resolution:        │
│   "standard" → 2048 tkns  │
│   "deep"     → 32768 tkns │
│   "none"     → 0 (skip)   │
└───────────────────────────┘
     │
     ▼
SGLang API call with:
  max_completion_tokens = budget + response_limit
  thinking { type: "enabled", budget_tokens: 2048 }
     │
     ▼
Model generates:
  <think>I need to evaluate... [up to budget tokens]</think>
  The credit analysis shows... [response tokens]

The frontend displays thinking content in a collapsible panel with token count and duration.
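
A sketch of the budget resolution and request assembly (the thinking field shape follows the diagram above; how the backend caps the unlimited preset is not shown here):

BUDGETS = {"none": 0, "minimal": 128, "short": 512, "standard": 2048,
           "extended": 8192, "deep": 32768, "unlimited": -1}

def build_chat_request(prompt: str, preset: str = "standard",
                       response_limit: int = 1024) -> dict:
    budget = BUDGETS[preset]
    request = {
        "model": "Qwen/Qwen3.5-35B-A3B-FP8",
        "messages": [{"role": "user", "content": prompt}],
    }
    if budget == 0:
        return request                    # "none": skip thinking entirely
    if budget > 0:
        request["max_completion_tokens"] = budget + response_limit
    request["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return request

print(build_chat_request("Evaluate customer 42 for a $50,000 loan"))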


API Reference

Authentication

POST /api/auth/login     { email, password } → session cookie
GET  /api/auth/me        β†’ current user info

Chat

POST /api/chat           Process chat message with agent
WS   /api/chat/ws        WebSocket streaming (thinking + response deltas)

Customers

GET  /api/customers                  List customers (paginated)
GET  /api/customers/{id}             Customer details
GET  /api/customers/{id}/credit-report   Full credit report

Circuit Tracer

POST /api/circuit/trace              Trace a prompt → attribution graph
GET  /api/circuit/architecture       Model architecture map (lang + vision)
GET  /api/circuit/saes               List SAE checkpoints
GET  /api/circuit/transcoders        List transcoder checkpoints
GET  /api/circuit/registry/status    Registry summary (counts, config)
POST /api/circuit/steer              Run feature steering intervention

Observability

GET  /api/observability/moe/current       Current MoE expert activations
GET  /api/observability/moe/history       Historical expert routing data
GET  /api/observability/thinking/sessions  Thinking session data

Configuration

GET  /api/thinking/budgets           Available thinking budget presets
POST /api/thinking/config            Update thinking configuration

System

GET  /health             Health check
GET  /metrics            Prometheus metrics (text format)
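
A minimal client session against these endpoints (request bodies beyond the login shape are illustrative; check the Pydantic models in backend/schemas/ for the exact fields):

import requests

BASE = "http://127.0.0.1:8080"
session = requests.Session()

# Log in once; the session cookie authenticates subsequent calls
session.post(f"{BASE}/api/auth/login",
             json={"email": "admin@creditscope.local", "password": "your-secure-password"})

print(session.get(f"{BASE}/api/auth/me").json())
print(session.get(f"{BASE}/api/customers").status_code)

# Trace a prompt (the "prompt" field name is an assumption)
r = session.post(f"{BASE}/api/circuit/trace",
                 json={"prompt": "Analyze loan risk for customer 42"})
print(r.status_code)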

Project Structure

creditscope/
├── backend/                      # FastAPI backend
│   ├── main.py                   # App entry point, middleware, router mounting
│   ├── auth.py                   # Session cookie authentication (HMAC)
│   ├── agent/
│   │   ├── orchestrator.py       # ReAct agent loop (up to 8 tool steps)
│   │   ├── tool_registry.py      # Tool dispatch and execution
│   │   ├── prompts.py            # System prompt and tool definitions
│   │   └── image_handler.py      # Document OCR processing
│   ├── db/
│   │   ├── models.py             # SQLAlchemy ORM (Customer, Loan, Document)
│   │   ├── queries.py            # Database query functions
│   │   └── seed.py               # Sample data seeder (55 customers)
│   ├── routers/
│   │   ├── chat.py               # Chat endpoint + WebSocket streaming
│   │   ├── customers.py          # Customer CRUD endpoints
│   │   ├── auth.py               # Login/logout endpoints
│   │   ├── history.py            # Conversation history
│   │   ├── observability.py      # MoE and thinking metrics
│   │   └── thinking.py           # CoT configuration
│   ├── schemas/                  # Pydantic request/response models
│   └── tools/
│       ├── credit_score.py       # Weighted credit score calculation
│       ├── debt_to_income.py     # DTI ratio computation
│       ├── payment_history.py    # Payment pattern analysis
│       ├── collateral_eval.py    # LTV and coverage assessment
│       ├── loan_structure.py     # Amortization and term calculation
│       └── risk_adjustment.py    # Regulatory risk adjustments
│
├── inference/                    # SGLang inference configuration
│   ├── config.py                 # Model config, expert counts, sampling params
│   ├── server.py                 # SGLang subprocess launcher
│   ├── moe_hooks.py              # Expert routing capture (ring buffer)
│   ├── cot_controller.py         # Thinking budget resolution
│   └── observability.py          # Prometheus metric recording
│
├── circuit_tracer/               # Mechanistic interpretability module
│   ├── api.py                    # FastAPI router (/circuit/* endpoints)
│   ├── config/
│   │   └── model_config.py       # Architecture specs (layers, dims, experts)
│   ├── collectors/
│   │   ├── sglang_hooks.py       # Forward hooks for activation capture (IPC)
│   │   ├── sglang_model.py       # SGLang interface (tokenize, forward, cache)
│   │   ├── model_loader.py       # Direct model loading (HookedModel)
│   │   ├── activation_collector.py # Batch activation collection for training
│   │   └── architecture_map.py   # Layer-by-layer architecture parsing
│   ├── saes/
│   │   ├── sparse_autoencoder.py # SAE: encode → JumpReLU → decode
│   │   ├── vision_sae.py         # Vision encoder SAE + projection SAE
│   │   ├── trainer.py            # SAE training loop (MSE + L1)
│   │   └── registry.py           # On-demand SAE management (68 total)
│   ├── transcoders/
│   │   └── registry.py           # Transcoder registry (language + vision)
│   ├── attribution/
│   │   ├── replacement_model.py  # Linearized model + graph construction
│   │   ├── graph.py              # Node/Edge/Graph data structures
│   │   └── pruning.py            # Backward importance pruning
│   ├── interventions/
│   │   └── steering.py           # Feature clamping and ablation
│   ├── visualization/
│   │   └── export.py             # JSON, summary, and Graphviz DOT export
│   ├── metrics.py                # Prometheus gauges for registries
│   └── data/
│       ├── checkpoints/          # Trained SAE/transcoder weights
│       ├── activations/          # Collected activation datasets
│       └── graphs/               # Exported attribution graph JSONs
│
├── frontend/                     # React TypeScript UI
│   ├── package.json              # Dependencies (React 18, Vite, Tailwind)
│   ├── vite.config.ts            # Dev server config (port 3000, API proxy)
│   └── src/
│       ├── components/
│       │   ├── ChatInterface.tsx           # Main chat UI
│       │   ├── CircuitTracerDashboard.tsx  # Trace + architecture UI
│       │   ├── MoEExpertPanel.tsx          # Expert routing heatmap
│       │   ├── ThinkingPanel.tsx           # CoT display
│       │   ├── ObservabilityDash.tsx       # System metrics
│       │   ├── LoginScreen.tsx             # Authentication
│       │   └── ...
│       ├── hooks/
│       │   └── useChat.ts        # WebSocket chat management
│       └── types/
│           └── index.ts          # TypeScript interfaces
│
├── deploy/
│   └── nginx/
│       └── creditscope.conf      # nginx reverse proxy config
│
├── scripts/
│   ├── collect_activations_bf16.py   # Collect activations from BF16 model
│   ├── check_activation_health.py    # Verify activations (no inf/nan, std check)
│   ├── retrain_from_saved_activations.py  # Train SAEs/TCs on activations
│   ├── push_to_hf.py                # Push trained checkpoints to HF
│   ├── setup.sh                     # Full dev environment setup
│   ├── run_dev.sh                   # Service launcher (--no-inference, --profile)
│   ├── start-dev.sh                 # PID-managed launcher (--status, --stop)
│   ├── setup_nginx_http.sh          # nginx + SSL setup
│   └── watchdog.sh                  # Health monitoring for cron
│
├── grafana/                      # Grafana dashboard provisioning
├── start_services.sh             # Quick-start script
├── pyproject.toml                # Python project metadata + dependencies
├── .env                          # Environment configuration
└── .env.example                  # Configuration template

Development

Push Trained Models to Hugging Face

Use a Hugging Face model repo for trained SAE and transcoder checkpoints. Do not publish trained weights to the dataset backup repo unless you intentionally want them stored as raw data artifacts.

The trained model files are written under:

circuit_tracer/data/checkpoints/

The collected activation datasets are written under:

circuit_tracer/data/activations/

To publish trained checkpoints to Hugging Face:

source .venv/bin/activate
export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"

Create a private model repo:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
  repo_id="sarel/creditscope-trained-models",
  repo_type="model",
  private=True,
  exist_ok=True,
)
print("created_or_exists=sarel/creditscope-trained-models")
PY

Upload all trained checkpoints:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.upload_folder(
  folder_path="circuit_tracer/data/checkpoints",
  path_in_repo="checkpoints",
  repo_id="sarel/creditscope-trained-models",
  repo_type="model",
  commit_message="Upload trained SAE and transcoder checkpoints",
)
print("uploaded checkpoints to sarel/creditscope-trained-models")
PY

If you also want to publish activation tensors for reproducibility, keep those in a dataset repo instead:

python - <<'PY'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
  repo_id="sarel/creditscope-activation-data",
  repo_type="dataset",
  private=True,
  exist_ok=True,
)
api.upload_folder(
  folder_path="circuit_tracer/data/activations",
  path_in_repo="activations",
  repo_id="sarel/creditscope-activation-data",
  repo_type="dataset",
  commit_message="Upload collected activation tensors",
)
print("uploaded activations to sarel/creditscope-activation-data")
PY

Recommended split:

  • Model repo: circuit_tracer/data/checkpoints/
  • Dataset repo: circuit_tracer/data/activations/, data/creditscope.db, exported graphs
source .venv/bin/activate

# Run tests
pytest

# Lint
ruff check .

# Type check
mypy backend inference circuit_tracer

# Format
ruff format .

Running Without GPU

./scripts/run_dev.sh --no-inference

The backend and frontend will start without the inference server. Chat will be unavailable, but you can develop on the UI, credit tools, and database.

Production Hardening

Add a watchdog cron job for auto-restart:

* * * * * cd /home/ubuntu/creditscope && ./scripts/watchdog.sh

Environment variables for watchdog:

  • WATCHDOG_BACKEND_URL β€” defaults to http://127.0.0.1:8080/health
  • WATCHDOG_INFERENCE_URL β€” defaults to http://127.0.0.1:8000/model_info
  • WATCHDOG_RESTART_COOLDOWN_SECONDS β€” defaults to 120

Training SAEs and Transcoders from Scratch

This section describes how to collect fresh activations and train SAEs/TCs from zero: no pre-existing checkpoints or activations are needed.

Overview

1. Collect activations     ──→  2. Train SAEs + TCs  ──→  3. Push to HF  ──→  4. Run app
   (BF16 model + dataset)       (from saved .npy)         (checkpoints)       (load checkpoints)

Step 1: Collect Activations

The collection script loads the BF16 model, runs forward passes on financial text, and captures the residual stream (pre and post) at each target layer.

Data source: sarel/creditscope-fino1-activations (a HuggingFace dataset whose text column contains financial reasoning examples).

# Collect 1M tokens at layers 0, 10, 30, 39
HF_TOKEN=<your_token> python scripts/collect_activations_bf16.py \
    --model Qwen/Qwen3.5-35B-A3B \
    --dataset sarel/creditscope-fino1-activations \
    --layers 0,10,30,39 \
    --target-tokens 1000000 \
    --batch-size 4 \
    --max-seq-len 512 \
    --output-dir circuit_tracer/data/activations

# To skip HF upload:
#   --no-upload

What it does:

  1. Installs a pure-PyTorch causal_conv1d patch (needed for DeltaNet layers)
  2. Loads the BF16 model (~70GB VRAM)
  3. Runs a sanity check - verifies activation std is in range [1e-6, 100] (catches FP8 dequant failures)
  4. Iterates through dataset texts in batches, capturing pre (input to layer) and post (output of layer) activations
  5. Saves chunks as .npy files (float32), ~50K tokens each
  6. Computes and saves per-layer normalization statistics
  7. Uploads everything to sarel/creditscope-activations-v2

Expected output structure:

circuit_tracer/data/activations/
├── layer_0_residual_pre/
│   ├── chunk_0000.npy    # [~50000, 2048] float32
│   ├── chunk_0001.npy
│   └── ...
├── layer_0_residual_post/
│   └── ...
├── layer_10_residual_pre/
│   └── ...
├── ...
├── normalization_stats.json
└── capture_config.json

Expected activation statistics (BF16 model):

Layer   Residual Pre std   Residual Post std
0       ~0.03              ~0.03
10      ~0.3               ~0.3
30      ~0.3               ~0.3
39      ~0.7–0.9           ~0.7–0.9
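
A quick way to spot-check a collected chunk against this table (file paths follow the layout above; the keys of the stats file are not documented here, so the last line just lists them):

import json
import numpy as np

chunk = np.load("circuit_tracer/data/activations/layer_0_residual_post/chunk_0000.npy")
print(chunk.shape, chunk.dtype)   # (~50000, 2048) float32
print("std:", chunk.std())        # expect ~0.03 for layer 0

with open("circuit_tracer/data/activations/normalization_stats.json") as f:
    print(list(json.load(f)))     # inspect the available per-layer keys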

Why BF16 and not FP8? The FP8 model (Qwen3.5-35B-A3B-FP8) does not load correctly via transformers: the weight_scale_inv dequantization tensors are ignored, producing corrupted weights. FP8 is just weight compression, and the BF16 model produces nearly identical activations. If SGLang support becomes available (fixes for sgl_kernel on SM120), you can collect from FP8 via SGLang instead.

Step 2: Train SAEs and Transcoders

# Train all SAEs and TCs from the collected activations
python scripts/retrain_from_saved_activations.py

This script:

  1. Loads activation chunks from circuit_tracer/data/activations/
  2. Trains one JumpReLU SAE per layer (d_model=2048 β†’ 16384 features)
  3. Trains one MoE Transcoder per layer (maps pre → post activations)
  4. Applies normalization when activation std > 1.0 (targets std=0.01)
  5. Saves checkpoints as sae_l{N}.pt and tc_l{N}.pt
  6. Logs training metrics to W&B (if configured)

Training parameters:

Parameter       SAE                    Transcoder
Architecture    JumpReLU autoencoder   MoE transcoder
Input dim       2048                   2048 (pre) → 2048 (post)
Feature dim     16384 (8x expansion)   16384
Learning rate   3e-4                   1e-4
Batch size      4096 tokens            2048 tokens
Steps           50,000                 100,000
Loss            MSE + L1 sparsity      MSE reconstruction
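
One SAE optimization step under these parameters, reusing the JumpReLUSAE sketch from the circuit-tracing section (the real loop in circuit_tracer/saes/trainer.py handles chunk streaming, schedules, and W&B logging):

import torch

sae = JumpReLUSAE(d_model=2048, n_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

batch = torch.randn(4096, 2048) * 0.03   # stand-in for a loaded activation chunk
x_hat, z = sae(batch)
loss = ((batch - x_hat) ** 2).mean() + 3e-4 * z.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()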

Step 3: Push to HuggingFace

# Upload trained checkpoints
python scripts/push_to_hf.py

This uploads sae_l*.pt, tc_l*.pt, normalization_stats.json, and architecture_map.json to sarel/creditscope-trained-models.

Step 4: Verify

After training, verify the SAEs produce meaningful features:

# Quick sanity check
python -c "
import torch
from circuit_tracer.saes.sparse_autoencoder import SparseAutoencoder

data = torch.load('circuit_tracer/data/checkpoints/sae_l0.pt', map_location='cpu', weights_only=False)
sae = SparseAutoencoder(**data['config'])
sae.load_state_dict(data['state_dict'])

# Feed random input and check reconstruction
x = torch.randn(10, 2048) * 0.03  # match layer 0 std
out = sae(x)
print(f'Reconstruction MSE: {((x - out.x_hat)**2).mean():.6f}')
print(f'Active features (L0): {(out.z > 0).float().sum(dim=1).mean():.0f}')
print(f'Expected: L0 ~ 50-200, MSE << input variance')
"

HuggingFace Repos

Repo                                  Type      Contents
sarel/creditscope-trained-models      model     SAE/TC checkpoints, architecture map, deployment guide
sarel/creditscope-activations-v2      dataset   Activation captures from BF16 model
sarel/creditscope-fino1-activations   dataset   Source financial text dataset (input for collection)

License

MIT License; see LICENSE for details.
