CreditScope
Agentic Credit Scoring with MoE Observability and Circuit Tracing
CreditScope is an AI-powered credit analysis platform built on Qwen3.5-35B-A3B-FP8, a 35-billion-parameter mixture-of-experts language model. It provides a full-stack application for credit scoring, real-time MoE expert routing visualization, chain-of-thought reasoning control, and mechanistic interpretability via circuit tracing with sparse autoencoders.
Table of Contents
- Architecture Overview
- Machine Setup from Scratch
- Configuration
- Running the Application
- Services and Ports
- Algorithm Deep Dive
- API Reference
- Project Structure
- Development
- Training SAEs and Transcoders from Scratch
Architecture Overview
┌───────────────────────────────────────────────────────────────────────────┐
│                         nginx (:80/:443/:20003)                           │
│                 reverse proxy + WebSocket upgrade + SSL                   │
└────────┬─────────────────────┬────────────────────────────────────────────┘
         │                     │
         ▼                     ▼
┌───────────────┐    ┌──────────────────────────────────────────────────┐
│ React Frontend│    │ FastAPI Backend (:8080)                          │
│ Vite + TS     │    │                                                  │
│ Tailwind CSS  │    │  ┌──────────┐  ┌────────────┐  ┌─────────────┐  │
│ Port 3000     │───▶│  │  Agent   │  │   Credit   │  │   Circuit   │  │
│               │    │  │ (ReAct)  │  │   Tools    │  │  Tracer API │  │
└───────────────┘    │  └────┬─────┘  └────────────┘  └──────┬──────┘  │
                     └───────┼────────────────────────────────┼────────┘
                             │                                │
                             ▼                                ▼
                     ┌──────────────────────────────────────────────────┐
                     │        SGLang Inference Server (:8000)           │
                     │                                                  │
                     │  Qwen3.5-35B-A3B-FP8 (40 layers, 256 experts)    │
                     │                                                  │
                     │  ┌──────────────┐  ┌────────────────────────┐    │
                     │  │  MoE Hooks   │  │ Residual Capture Hooks │    │
                     │  │  (routing    │  │  (activation tensors   │    │
                     │  │  telemetry)  │  │  via filesystem IPC)   │    │
                     │  └──────────────┘  └────────────────────────┘    │
                     └──────────────────────────────────────────────────┘
Key design decision: A single instance of the 35B model serves both chat inference and circuit tracing. The circuit tracer captures activations from the running SGLang server via forward hooks and filesystem-based IPC, avoiding the need to load a second copy of the model (which would require ~70GB additional VRAM).
Machine Setup from Scratch
These instructions provision a fresh Ubuntu server with GPU support for running CreditScope natively.
1. Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 48GB+ VRAM | NVIDIA RTX PRO 6000 (96GB) or A100 80GB |
| CPU | 8 cores | 16+ cores |
| RAM | 32GB | 64GB+ |
| Storage | 100GB free | 200GB+ (model weights ~35GB) |
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
For Blackwell-generation GPUs (RTX PRO 6000, sm_120), specific SGLang flags are required; see the SGLang flags section.
2. Install System Dependencies
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y \
python3 python3-pip python3-venv python3-dev \
git curl wget nginx openssl lsof \
build-essential
Install Node.js 18+:
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
node --version # should be 18.x+
npm --version # should be 9.x+
3. Install NVIDIA Drivers and CUDA
Skip this section if drivers are already installed (nvidia-smi works).
# Install NVIDIA driver (latest recommended)
sudo apt-get install -y nvidia-driver-565
# Install CUDA toolkit 12.x
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6
# Verify
nvidia-smi
nvcc --version
4. Clone and Configure
cd /home/ubuntu
git clone https://github.com/sarelWeinberger/creditscope.git
cd creditscope
cp .env.example .env
Edit .env and set:
# Required
AUTH_USERS=admin@creditscope.local # comma-separated allowed emails
AUTH_PASSWORD=your-secure-password # shared login password
AUTH_SECRET_KEY=$(openssl rand -hex 32) # session signing key
# Set your server's public IP for CORS
CORS_ORIGINS=http://localhost:3000,http://YOUR_PUBLIC_IP
# Optional: HuggingFace token if model is gated
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxx
5. Create Python Virtual Environment
python3 -m venv .venv
source .venv/bin/activate
# Install the project with all dependencies
pip install -e ".[dev,backend,inference,circuit]"
For Blackwell GPUs, you may need a specific CuDNN version:
pip install nvidia-cudnn-cu12==9.16.0.29
6. Install Frontend Dependencies
cd frontend
npm install
cd ..
7. Set Up nginx Reverse Proxy
chmod +x scripts/setup_nginx_http.sh
./scripts/setup_nginx_http.sh
This configures nginx to:
- Route `/` to the Vite frontend on port 3000
- Route `/api/` to the FastAPI backend on port 8080
- Upgrade `/api/chat/ws` connections to WebSocket
- Listen on ports 80, 443 (self-signed SSL), and 20003
8. Download Model Weights
The model downloads automatically on first SGLang start. To pre-download:
source .venv/bin/activate
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-35B-A3B-FP8')"
This downloads ~35GB to `~/.cache/huggingface/`.
9. Start Everything
chmod +x start_services.sh
./start_services.sh
This starts:
- SGLang on port 8000 (waits ~60s for model load)
- FastAPI backend on port 8080
- Frontend must be started separately:
cd frontend && npm run dev &
Verify:
curl -s http://127.0.0.1:8000/v1/models | head -5 # SGLang
curl -s http://127.0.0.1:8080/health # Backend
curl -s http://127.0.0.1:3000/ # Frontend
curl -s http://YOUR_PUBLIC_IP:20003/ # Public (nginx)
Configuration
All configuration is in .env. Key variables:
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `Qwen/Qwen3.5-35B-A3B-FP8` | HuggingFace model ID |
| `CONTEXT_LENGTH` | `4096` | Max context window |
| `TP_SIZE` | `1` | Tensor parallelism (number of GPUs) |
| `MEM_FRACTION_STATIC` | `0.98` | Fraction of GPU VRAM for KV cache |
| `SGLANG_PORT` | `8000` | Inference server port |
| `BACKEND_PORT` | `8080` | FastAPI backend port |
| `FRONTEND_PORT` | `3000` | Vite dev server port |
| `DATABASE_URL` | `sqlite:///./data/creditscope.db` | Database path |
| `SEED_DB` | `true` | Seed sample customers on startup |
| `DEFAULT_THINKING_BUDGET` | `standard` | CoT budget preset |
| `AUTH_USERS` | (none) | Comma-separated allowed login emails |
| `AUTH_PASSWORD` | (none) | Shared login password |
| `AUTH_SECRET_KEY` | (none) | HMAC key for session cookies |
SGLang Flags for Blackwell GPUs
Blackwell-generation GPUs (sm_120: RTX PRO 6000, RTX 5090, etc.) require specific flags:
SGLANG_EXTRA_ARGS="
--attention-backend triton # Only triton/trtllm_mha supported on Blackwell
--fp8-gemm-backend triton # flashinfer FP8 unsupported on sm_120
--disable-cuda-graph # Stability on newer architectures
--max-mamba-cache-size 16 # Limit DeltaNet state cache
--skip-server-warmup # Faster startup
--chunked-prefill-size 512 # Reduce memory peaks
--max-running-requests 2 # Limit concurrency
--max-total-tokens 65536 # KV cache token budget
"
Also set these environment variables before launching SGLang:
SGLANG_ENABLE_JIT_DEEPGEMM=0 # Avoid DeepGemm recipe errors on sm_120
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # Better memory allocation
Thinking Budget Presets
| Preset | Tokens | Use Case |
|---|---|---|
| `none` | 0 | Direct responses only |
| `minimal` | 128 | Simple lookups |
| `short` | 512 | Quick calculations |
| `standard` | 2,048 | Normal analysis |
| `extended` | 8,192 | Complex reasoning |
| `deep` | 32,768 | Thorough investigation |
| `unlimited` | -1 | No limit |
Running the Application
Quick Start (All Services)
./start_services.sh
cd frontend && npm run dev &
Individual Services
# SGLang inference server
SGLANG_ENABLE_JIT_DEEPGEMM=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
PYTHONPATH=/home/ubuntu/creditscope \
.venv/bin/python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-35B-A3B-FP8 \
--port 8000 --tp-size 1 --mem-fraction-static 0.50 \
--context-length 2048 --reasoning-parser qwen3 \
--tool-call-parser qwen3_coder --enable-metrics \
--attention-backend triton --fp8-gemm-backend triton \
--disable-cuda-graph --skip-server-warmup \
--forward-hooks '[{"name":"residual_capture","target_modules":["model.layers.*"],"hook_factory":"circuit_tracer.collectors.sglang_hooks:residual_capture_factory","config":{}}]'
# FastAPI backend
.venv/bin/uvicorn backend.main:app --host 127.0.0.1 --port 8080 --reload
# Frontend
cd frontend && npm run dev
Docker Deployment
cp .env.example .env
docker compose up -d --build
Requires Docker, docker-compose-v2, and NVIDIA Container Toolkit for GPU support.
Logs
tail -f /tmp/sglang.log # SGLang inference
tail -f /tmp/backend.log # FastAPI backend
Services and Ports
| Service | Port | Protocol | Description |
|---|---|---|---|
| SGLang | 8000 | HTTP | OpenAI-compatible inference API |
| Backend | 8080 | HTTP/WS | FastAPI + agent + circuit tracer |
| Frontend | 3000 | HTTP | Vite React dev server |
| nginx | 80/443/20003 | HTTP/HTTPS | Public reverse proxy |
| Prometheus | 9090 | HTTP | Metrics (Docker only) |
| Grafana | 3001 | HTTP | Dashboards (Docker only) |
Algorithm Deep Dive
1. Model Architecture β Qwen3.5-35B-A3B-FP8
CreditScope runs on a hybrid attention + MoE architecture:
Input tokens
     │
     ▼
┌─────────────┐
│  Embedding  │  d_model = 2048
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────────────────┐
│  40 Decoder Layers (repeating pattern of 4):     │
│                                                  │
│   Layer N+0: DeltaNet (linear) attention + MoE   │
│   Layer N+1: DeltaNet (linear) attention + MoE   │
│   Layer N+2: DeltaNet (linear) attention + MoE   │
│   Layer N+3: Standard (full) attention + MoE     │
│                                                  │
│   Each MoE layer: 256 experts, top-8 routing     │
│   Per-expert intermediate size: 512              │
│                                                  │
│   Full attention at layers: 3,7,11,...,35,39     │
│   DeltaNet attention at all other layers         │
└──────────────────────────────────────────────────┘
       │
       ▼
┌─────────────┐
│   LM Head   │  → next-token logits
└─────────────┘
Vision Encoder (27 ViT layers):
Hidden size: 1152 → projects to d_model=2048
Intermediate size: 4304
16 attention heads, patch size 16
DeltaNet layers use linear attention (O(n) vs O(n²)), which makes the model efficient for long sequences. Standard attention layers every 4th position provide full-context mixing.
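The stated layer pattern can be checked directly: with a 4-layer repeating block and full attention in the last slot, the full-attention indices come out to exactly the list given above.

```python
# Derive the attention layout from the repeating 4-layer pattern described above.
full_attn_layers = [i for i in range(40) if i % 4 == 3]   # last slot of each block
deltanet_layers = [i for i in range(40) if i % 4 != 3]

print(full_attn_layers)      # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
print(len(deltanet_layers))  # 30
```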
Mixture of Experts (MoE): Every layer has 256 experts but only routes each token to the top 8. This gives the model 35B total parameters but only ~3B active per token, enabling fast inference.
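The top-8-of-256 routing step can be sketched as follows. This is an illustrative NumPy sketch, not the model's actual routing code; the function and variable names are ours.

```python
import numpy as np

def route_tokens(router_logits: np.ndarray, top_k: int = 8):
    """router_logits: [num_tokens, num_experts] -> (expert ids, gate weights)."""
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts
    expert_ids = np.argsort(probs, axis=-1)[:, -top_k:]   # top-8 expert indices
    gates = np.take_along_axis(probs, expert_ids, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)            # renormalize the top-8
    return expert_ids, gates

ids, gates = route_tokens(np.random.randn(4, 256))
print(ids.shape, gates.shape)  # (4, 8) (4, 8)
```

Only the 8 selected experts run for each token, which is why the active parameter count stays near 3B despite 35B total.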
2. ReAct Agent Loop
The agent uses a Reason-Act-Observe loop to answer credit analysis queries:
User Query: "Evaluate John Smith for a $50,000 business loan"
     │
     ▼
┌──────────────────────────────────────┐
│ 1. REASON (thinking tokens)          │
│    "I need to check credit score,    │
│     DTI ratio, and collateral..."    │
│                                      │
│ 2. ACT (tool call)                   │
│    calculate_credit_score(id=42)     │
│                                      │
│ 3. OBSERVE (tool result)             │
│    Score: 720, Grade: B              │
│                                      │
│ 4. REASON again                      │
│    "Score is good, need DTI next"    │
│                                      │
│ 5. ACT                               │
│    calculate_dti(id=42, amount=50k)  │
│                                      │
│    ... (up to 8 steps) ...           │
│                                      │
│ FINAL: Synthesize response           │
└──────────────────────────────────────┘
Available Credit Tools:
| Tool | Description |
|---|---|
| `calculate_credit_score` | Weighted score from payment history (35%), utilization (30%), age (15%), mix (10%), inquiries (10%) |
| `calculate_dti` | Front-end and back-end debt-to-income ratios with risk classification |
| `analyze_payment_history` | Delinquency patterns, on-time rate, severity scoring |
| `evaluate_collateral` | Loan-to-value ratio, haircut-adjusted value, coverage ratio |
| `structure_loan` | Amortization schedule, monthly payment calculation |
| `apply_risk_adjustments` | Regulatory and behavioral risk adjustments to base score |
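The weighted blend behind `calculate_credit_score` can be sketched with the factor weights from the table. The per-component 0–100 scale and the mapping onto a 300–850 range are assumptions for illustration, not the tool's actual implementation.

```python
# Factor weights as listed in the tool table above.
WEIGHTS = {
    "payment_history": 0.35,
    "utilization": 0.30,
    "credit_age": 0.15,
    "credit_mix": 0.10,
    "inquiries": 0.10,
}

def blend_score(components: dict) -> float:
    """components: each factor scored 0-100; returns a 300-850 style score.
    The 300-850 rescaling is a hypothetical choice for this sketch."""
    raw = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)  # 0-100 blend
    return 300 + raw / 100 * 550

print(round(blend_score({k: 80 for k in WEIGHTS})))  # 740
```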
3. MoE Expert Routing Observability
During every inference request, forward hooks on the MoE gate modules capture:
For each of the 40 MoE layers:
┌─────────────────────────────────────────────┐
│ router_logits: [num_tokens × 256 experts]   │
│        │                                    │
│        ▼  softmax + top-8 selection         │
│ selected_experts: [num_tokens × 8]          │
│ gating_weights:   [num_tokens × 8]          │
│                                             │
│ Metrics computed:                           │
│  - Expert load distribution                 │
│  - Shannon entropy of routing distribution  │
│  - Per-expert activation frequency          │
└─────────────────────────────────────────────┘
Shannon entropy measures routing diversity:
- Low entropy → tokens concentrate on few experts (specialized)
- High entropy → tokens spread evenly across experts (generic)
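Routing entropy is plain Shannon entropy over the per-expert load distribution; a minimal sketch (our own helper, not the backend's metric code):

```python
import math

def routing_entropy(expert_counts) -> float:
    """Shannon entropy (bits) of the expert load distribution for one layer."""
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Concentrated routing (4 of 256 experts) vs uniform routing (all 256)
specialized = [1000] * 4 + [0] * 252
uniform = [100] * 256
print(routing_entropy(specialized))  # 2.0 bits
print(routing_entropy(uniform))      # 8.0 bits
```

The maximum for 256 experts is log2(256) = 8 bits, so the heatmap's entropy values are naturally bounded in [0, 8].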
The frontend displays this as a real-time heatmap of expert activations across layers.
4. Circuit Tracing with Sparse Autoencoders
Circuit tracing discovers which internal features of the model drive a specific prediction. The pipeline has five stages:
Stage 1: Activation Capture
When a trace is requested, the backend creates a sentinel file (/tmp/circuit_trace_capture). Forward hooks registered in the SGLang process detect this and save the residual stream output from each decoder layer:
SGLang Process                             Backend Process
──────────────                             ───────────────
                                           1. Create sentinel file
model.layers[0](x) → hook saves
    /tmp/circuit_trace_activations/post_0.pt
model.layers[1](x) → hook saves
    /tmp/circuit_trace_activations/post_1.pt
...
model.layers[39](x) → hook saves
    /tmp/circuit_trace_activations/post_39.pt
                                           2. Remove sentinel
                                           3. Load all post_*.pt files
This filesystem-based IPC has near-zero overhead during normal chat: the hook performs one stat() call per layer and returns immediately if the sentinel doesn't exist.
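A minimal sketch of such a sentinel-gated hook; the real implementation lives in `circuit_tracer.collectors.sglang_hooks`, and the function name here is ours:

```python
import os
import torch

SENTINEL = "/tmp/circuit_trace_capture"
OUT_DIR = "/tmp/circuit_trace_activations"

def make_residual_hook(layer_idx: int):
    def hook(module, inputs, output):
        if not os.path.exists(SENTINEL):   # one stat() per layer, then bail out
            return
        hidden = output[0] if isinstance(output, tuple) else output
        os.makedirs(OUT_DIR, exist_ok=True)
        torch.save(hidden.detach().cpu(), f"{OUT_DIR}/post_{layer_idx}.pt")
    return hook

# Registered once per decoder layer, e.g.:
#   layer.register_forward_hook(make_residual_hook(i))
```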
Stage 2: Sparse Autoencoder Feature Extraction
Each SAE decomposes a 2048-dimensional residual stream vector into ~16,384 sparse features:
Residual stream x ∈ R^2048
        │
        ▼
x_centered = x - bias
        │
        ▼
pre_act = W_enc · x_centered + b_enc      (2048 → 16384)
        │
        ▼
z = JumpReLU(pre_act)                     Sparse: ~50 active out of 16384
        │    z_i = pre_act_i  if pre_act_i > θ_i
        ▼    z_i = 0          otherwise
x_hat = W_dec · z + bias                  (16384 → 2048)
JumpReLU activation (from Anthropic's scaling monosemanticity work) learns a per-feature threshold θ_i. Features only activate when their pre-activation exceeds this learned threshold, giving cleaner sparsity than standard ReLU.
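The stages above fit in a few lines of PyTorch. This is a shape-faithful sketch; parameter names and initializations are illustrative, not the checkpoint's actual `state_dict` layout:

```python
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int = 2048, n_features: int = 16384):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))
        self.W_dec = torch.nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.bias = torch.nn.Parameter(torch.zeros(d_model))
        self.theta = torch.nn.Parameter(torch.full((n_features,), 0.05))  # per-feature threshold

    def forward(self, x: torch.Tensor):
        pre = (x - self.bias) @ self.W_enc + self.b_enc
        # JumpReLU: pass the pre-activation through only above the threshold
        z = torch.where(pre > self.theta, pre, torch.zeros_like(pre))
        x_hat = z @ self.W_dec + self.bias
        return x_hat, z
```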
Training objective:
L = ||x - x_hat||² + λ · ||z||₁
    ───────────────   ─────────
    reconstruction    sparsity
    loss              penalty (λ = 3×10⁻⁴)
SAE Registry manages 68 SAEs across the full model:
- 40 language SAEs (one per decoder layer, 2048 → 16384 features)
- 27 vision SAEs (one per ViT layer, 1152 → 9216 features)
- 1 projection SAE (vision→language bridge, 1152 → 2048)
SAEs are created on-demand when a layer is first traced, avoiding allocating all 68 at startup.
Stage 3: Attribution Graph Construction
The graph represents causal flow from input tokens through features to the output prediction:
Nodes:
- Input: one per token position
- Feature: (layer, position, feature_idx) with activation value
- Output: the target token prediction
Edges:
- Feature → Output: activation value (how much this feature contributes)
- Feature → Feature: virtual weight × source activation
Virtual weights between features in adjacent layers are computed as:
W_virtual = W_dec_src^T · W_enc_tgt^T

where:
  W_dec_src: decoder weights of source layer's SAE (d_model × n_features)
  W_enc_tgt: encoder weights of target layer's SAE (n_features × d_model)
  W_virtual: (n_features_src × n_features_tgt)

Attribution(src→tgt) = activation_src × W_virtual[src_feat, tgt_feat]
This captures the linear pathway: how much a source feature's decoder direction projects onto the target feature's encoder direction.
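With toy dimensions the computation is a single matrix product (real shapes are d_model=2048 and n_features=16384 per SAE; this sketch only mirrors the formula above):

```python
import numpy as np

d_model, n_src, n_tgt = 16, 32, 32
W_dec_src = np.random.randn(d_model, n_src)   # source SAE decoder (d_model x n_features)
W_enc_tgt = np.random.randn(n_tgt, d_model)   # target SAE encoder (n_features x d_model)

# W_virtual = W_dec_src^T . W_enc_tgt^T  ->  [n_src, n_tgt]
W_virtual = W_dec_src.T @ W_enc_tgt.T

def attribution(src_feat: int, tgt_feat: int, activation_src: float) -> float:
    return activation_src * W_virtual[src_feat, tgt_feat]

print(W_virtual.shape)  # (32, 32)
```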
Stage 4: Graph Pruning
Raw graphs can have thousands of nodes. Pruning keeps only high-impact nodes:
1. Score each node by backward-propagated importance:
   - Output nodes get importance = 1.0
   - For each edge (src → tgt):
     importance[src] += |edge.weight| × importance[tgt]
2. Rank feature nodes by importance score
3. Keep top 10% (configurable), always keeping input/output nodes
4. Drop edges between pruned nodes
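The four steps can be sketched on a toy graph; this is our own minimal version, not the code in `circuit_tracer/attribution/pruning.py`:

```python
def prune(nodes, edges, keep_frac=0.10):
    """nodes: {id: kind}; edges: [(src, tgt, weight)], topologically ordered
    (src before tgt). Returns (kept node ids, kept edges)."""
    importance = {n: (1.0 if kind == "output" else 0.0)
                  for n, kind in nodes.items()}
    # Walk edges from the output back toward inputs, accumulating |w| * importance
    for src, tgt, w in reversed(edges):
        importance[src] += abs(w) * importance[tgt]
    features = sorted((n for n, k in nodes.items() if k == "feature"),
                      key=lambda n: importance[n], reverse=True)
    keep = set(features[:max(1, int(len(features) * keep_frac))])
    keep |= {n for n, k in nodes.items() if k in ("input", "output")}
    kept_edges = [(s, t, w) for s, t, w in edges if s in keep and t in keep]
    return keep, kept_edges
```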
Stage 5: Feature Steering (Causal Validation)
Once a circuit is identified, steering validates whether those features actually control the output:
Baseline:
  model("Analyze loan risk") → "The applicant shows moderate risk..."

Intervention (clamp feature 6392 at layer 39 to 0):
  model("Analyze loan risk") → "The applicant appears to be..."
                                ▲ different output confirms
                                  feature 6392 was causal
Steering works by:
- Running the model normally to get a baseline output
- Registering a forward hook at the target layer that:
  - Encodes the residual stream through the SAE
  - Modifies the specified feature activation(s)
  - Decodes back to residual stream space
- Generating again with the hook active
- Comparing the outputs
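The encode/clamp/decode hook can be sketched like this; `encode`/`decode` stand in for an SAE's two halves, and all names here are illustrative rather than the module in `circuit_tracer/interventions/steering.py`:

```python
import torch

def make_steering_hook(encode, decode, feature_idx, clamp_value=0.0):
    """encode: residual -> sparse features z; decode: z -> residual."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        z = encode(hidden).clone()           # project into SAE feature space
        z[..., feature_idx] = clamp_value    # clamp the chosen feature(s)
        steered = decode(z)                  # back to residual-stream space
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered
    return hook

# handle = layer.register_forward_hook(make_steering_hook(sae_encode, sae_decode, 6392))
# ... generate ... handle.remove()
```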
5. Chain-of-Thought Budget Control
The thinking budget system controls how many tokens the model spends on internal reasoning before responding:
User query arrives
        │
        ▼
┌───────────────────────────┐
│ Budget Resolution:        │
│  "standard" →  2048 tkns  │
│  "deep"     → 32768 tkns  │
│  "none"     →     0 (skip)│
└───────────────────────────┘
        │
        ▼
SGLang API call with:
  max_completion_tokens = budget + response_limit
  thinking { type: "enabled", budget_tokens: 2048 }
        │
        ▼
Model generates:
  <think>I need to evaluate... [up to budget tokens]</think>
  The credit analysis shows... [response tokens]
The frontend displays thinking content in a collapsible panel with token count and duration.
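The resolution step above can be sketched as a small helper. The preset table mirrors the Thinking Budget Presets section, but the handling of `"none"` and `"unlimited"` here is an assumption about how the real `cot_controller` behaves, not a copy of it:

```python
BUDGETS = {"none": 0, "minimal": 128, "short": 512, "standard": 2048,
           "extended": 8192, "deep": 32768, "unlimited": -1}

def thinking_params(preset: str, response_limit: int = 1024) -> dict:
    budget = BUDGETS[preset]
    if budget == 0:                        # "none": skip thinking entirely
        return {"max_completion_tokens": response_limit}
    params = {"thinking": {"type": "enabled", "budget_tokens": budget}}
    if budget > 0:                         # bounded presets add budget to the cap
        params["max_completion_tokens"] = budget + response_limit
    return params

print(thinking_params("standard"))
```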
API Reference
Authentication
POST /api/auth/login { email, password } → session cookie
GET  /api/auth/me → current user info
Chat
POST /api/chat Process chat message with agent
WS /api/chat/ws WebSocket streaming (thinking + response deltas)
Customers
GET /api/customers List customers (paginated)
GET /api/customers/{id} Customer details
GET /api/customers/{id}/credit-report Full credit report
Circuit Tracer
POST /api/circuit/trace Trace a prompt → attribution graph
GET /api/circuit/architecture Model architecture map (lang + vision)
GET /api/circuit/saes List SAE checkpoints
GET /api/circuit/transcoders List transcoder checkpoints
GET /api/circuit/registry/status Registry summary (counts, config)
POST /api/circuit/steer Run feature steering intervention
Observability
GET /api/observability/moe/current Current MoE expert activations
GET /api/observability/moe/history Historical expert routing data
GET /api/observability/thinking/sessions Thinking session data
Configuration
GET /api/thinking/budgets Available thinking budget presets
POST /api/thinking/config Update thinking configuration
System
GET /health Health check
GET /metrics Prometheus metrics (text format)
Project Structure
creditscope/
├── backend/                      # FastAPI backend
│   ├── main.py                   # App entry point, middleware, router mounting
│   ├── auth.py                   # Session cookie authentication (HMAC)
│   ├── agent/
│   │   ├── orchestrator.py       # ReAct agent loop (up to 8 tool steps)
│   │   ├── tool_registry.py      # Tool dispatch and execution
│   │   ├── prompts.py            # System prompt and tool definitions
│   │   └── image_handler.py      # Document OCR processing
│   ├── db/
│   │   ├── models.py             # SQLAlchemy ORM (Customer, Loan, Document)
│   │   ├── queries.py            # Database query functions
│   │   └── seed.py               # Sample data seeder (55 customers)
│   ├── routers/
│   │   ├── chat.py               # Chat endpoint + WebSocket streaming
│   │   ├── customers.py          # Customer CRUD endpoints
│   │   ├── auth.py               # Login/logout endpoints
│   │   ├── history.py            # Conversation history
│   │   ├── observability.py      # MoE and thinking metrics
│   │   └── thinking.py           # CoT configuration
│   ├── schemas/                  # Pydantic request/response models
│   └── tools/
│       ├── credit_score.py       # Weighted credit score calculation
│       ├── debt_to_income.py     # DTI ratio computation
│       ├── payment_history.py    # Payment pattern analysis
│       ├── collateral_eval.py    # LTV and coverage assessment
│       ├── loan_structure.py     # Amortization and term calculation
│       └── risk_adjustment.py    # Regulatory risk adjustments
│
├── inference/                    # SGLang inference configuration
│   ├── config.py                 # Model config, expert counts, sampling params
│   ├── server.py                 # SGLang subprocess launcher
│   ├── moe_hooks.py              # Expert routing capture (ring buffer)
│   ├── cot_controller.py         # Thinking budget resolution
│   └── observability.py          # Prometheus metric recording
│
├── circuit_tracer/               # Mechanistic interpretability module
│   ├── api.py                    # FastAPI router (/circuit/* endpoints)
│   ├── config/
│   │   └── model_config.py       # Architecture specs (layers, dims, experts)
│   ├── collectors/
│   │   ├── sglang_hooks.py       # Forward hooks for activation capture (IPC)
│   │   ├── sglang_model.py       # SGLang interface (tokenize, forward, cache)
│   │   ├── model_loader.py       # Direct model loading (HookedModel)
│   │   ├── activation_collector.py  # Batch activation collection for training
│   │   └── architecture_map.py   # Layer-by-layer architecture parsing
│   ├── saes/
│   │   ├── sparse_autoencoder.py # SAE: encode → JumpReLU → decode
│   │   ├── vision_sae.py         # Vision encoder SAE + projection SAE
│   │   ├── trainer.py            # SAE training loop (MSE + L1)
│   │   └── registry.py           # On-demand SAE management (68 total)
│   ├── transcoders/
│   │   └── registry.py           # Transcoder registry (language + vision)
│   ├── attribution/
│   │   ├── replacement_model.py  # Linearized model + graph construction
│   │   ├── graph.py              # Node/Edge/Graph data structures
│   │   └── pruning.py            # Backward importance pruning
│   ├── interventions/
│   │   └── steering.py           # Feature clamping and ablation
│   ├── visualization/
│   │   └── export.py             # JSON, summary, and Graphviz DOT export
│   ├── metrics.py                # Prometheus gauges for registries
│   └── data/
│       ├── checkpoints/          # Trained SAE/transcoder weights
│       ├── activations/          # Collected activation datasets
│       └── graphs/               # Exported attribution graph JSONs
│
├── frontend/                     # React TypeScript UI
│   ├── package.json              # Dependencies (React 18, Vite, Tailwind)
│   ├── vite.config.ts            # Dev server config (port 3000, API proxy)
│   └── src/
│       ├── components/
│       │   ├── ChatInterface.tsx          # Main chat UI
│       │   ├── CircuitTracerDashboard.tsx # Trace + architecture UI
│       │   ├── MoEExpertPanel.tsx         # Expert routing heatmap
│       │   ├── ThinkingPanel.tsx          # CoT display
│       │   ├── ObservabilityDash.tsx      # System metrics
│       │   ├── LoginScreen.tsx            # Authentication
│       │   └── ...
│       ├── hooks/
│       │   └── useChat.ts                 # WebSocket chat management
│       └── types/
│           └── index.ts                   # TypeScript interfaces
│
├── deploy/
│   └── nginx/
│       └── creditscope.conf      # nginx reverse proxy config
│
├── scripts/
│   ├── collect_activations_bf16.py       # Collect activations from BF16 model
│   ├── check_activation_health.py        # Verify activations (no inf/nan, std check)
│   ├── retrain_from_saved_activations.py # Train SAEs/TCs on activations
│   ├── push_to_hf.py                     # Push trained checkpoints to HF
│   ├── setup.sh                          # Full dev environment setup
│   ├── run_dev.sh                        # Service launcher (--no-inference, --profile)
│   ├── start-dev.sh                      # PID-managed launcher (--status, --stop)
│   ├── setup_nginx_http.sh               # nginx + SSL setup
│   └── watchdog.sh                       # Health monitoring for cron
│
├── grafana/                      # Grafana dashboard provisioning
├── start_services.sh             # Quick-start script
├── pyproject.toml                # Python project metadata + dependencies
├── .env                          # Environment configuration
└── .env.example                  # Configuration template
Development
Push Trained Models to Hugging Face
Use a Hugging Face model repo for trained SAE and transcoder checkpoints. Do not publish trained weights to the dataset backup repo unless you intentionally want them stored as raw data artifacts.
The trained model files are written under:
circuit_tracer/data/checkpoints/
The collected activation datasets are written under:
circuit_tracer/data/activations/
To publish trained checkpoints to Hugging Face:
source .venv/bin/activate
export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
Create a private model repo:
python - <<'PY'
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
repo_id="sarel/creditscope-trained-models",
repo_type="model",
private=True,
exist_ok=True,
)
print("created_or_exists=sarel/creditscope-trained-models")
PY
Upload all trained checkpoints:
python - <<'PY'
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ["HF_TOKEN"])
api.upload_folder(
folder_path="circuit_tracer/data/checkpoints",
path_in_repo="checkpoints",
repo_id="sarel/creditscope-trained-models",
repo_type="model",
commit_message="Upload trained SAE and transcoder checkpoints",
)
print("uploaded checkpoints to sarel/creditscope-trained-models")
PY
If you also want to publish activation tensors for reproducibility, keep those in a dataset repo instead:
python - <<'PY'
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo(
repo_id="sarel/creditscope-activation-data",
repo_type="dataset",
private=True,
exist_ok=True,
)
api.upload_folder(
folder_path="circuit_tracer/data/activations",
path_in_repo="activations",
repo_id="sarel/creditscope-activation-data",
repo_type="dataset",
commit_message="Upload collected activation tensors",
)
print("uploaded activations to sarel/creditscope-activation-data")
PY
Recommended split:
- Model repo: `circuit_tracer/data/checkpoints/`
- Dataset repo: `circuit_tracer/data/activations/`, `data/creditscope.db`, exported graphs
source .venv/bin/activate
# Run tests
pytest
# Lint
ruff check .
# Type check
mypy backend inference circuit_tracer
# Format
ruff format .
Running Without GPU
./scripts/run_dev.sh --no-inference
The backend and frontend will start without the inference server. Chat will be unavailable, but you can develop on the UI, credit tools, and database.
Production Hardening
Add a watchdog cron job for auto-restart:
* * * * * cd /home/ubuntu/creditscope && ./scripts/watchdog.sh
Environment variables for watchdog:
- `WATCHDOG_BACKEND_URL`: defaults to `http://127.0.0.1:8080/health`
- `WATCHDOG_INFERENCE_URL`: defaults to `http://127.0.0.1:8000/model_info`
- `WATCHDOG_RESTART_COOLDOWN_SECONDS`: defaults to `120`
Training SAEs and Transcoders from Scratch
This section describes how to collect fresh activations and train SAEs/TCs from zero β no pre-existing checkpoints or activations needed.
Overview
1. Collect activations ──▶ 2. Train SAEs + TCs ──▶ 3. Push to HF ──▶ 4. Run app
   (BF16 model + dataset)    (from saved .npy)       (checkpoints)     (load checkpoints)
Step 1: Collect Activations
The collection script loads the BF16 model, runs forward passes on financial text, and captures the residual stream (pre and post) at each target layer.
Data source: sarel/creditscope-fino1-activations (HuggingFace dataset with a `text` column of financial reasoning text).
# Collect 1M tokens at layers 0, 10, 30, 39
HF_TOKEN=<your_token> python scripts/collect_activations_bf16.py \
--model Qwen/Qwen3.5-35B-A3B \
--dataset sarel/creditscope-fino1-activations \
--layers 0,10,30,39 \
--target-tokens 1000000 \
--batch-size 4 \
--max-seq-len 512 \
--output-dir circuit_tracer/data/activations
# To skip HF upload:
# --no-upload
What it does:
- Installs a pure-PyTorch `causal_conv1d` patch (needed for DeltaNet layers)
- Loads the BF16 model (~70GB VRAM)
- Runs a sanity check: verifies activation std is in range [1e-6, 100] (catches FP8 dequant failures)
- Iterates through dataset texts in batches, capturing `pre` (input to layer) and `post` (output of layer) activations
- Saves chunks as `.npy` files (float32), ~50K tokens each
- Computes and saves per-layer normalization statistics
- Uploads everything to `sarel/creditscope-activations-v2`
Expected output structure:
circuit_tracer/data/activations/
├── layer_0_residual_pre/
│   ├── chunk_0000.npy   # [~50000, 2048] float32
│   ├── chunk_0001.npy
│   └── ...
├── layer_0_residual_post/
│   └── ...
├── layer_10_residual_pre/
│   └── ...
├── ...
├── normalization_stats.json
└── capture_config.json
βββ capture_config.json
Expected activation statistics (BF16 model):
| Layer | Residual Pre std | Residual Post std |
|---|---|---|
| 0 | ~0.03 | ~0.03 |
| 10 | ~0.3 | ~0.3 |
| 30 | ~0.3 | ~0.3 |
| 39 | ~0.7–0.9 | ~0.7–0.9 |
Why BF16 and not FP8? The FP8 model (`Qwen3.5-35B-A3B-FP8`) does not load correctly via `transformers`: the `weight_scale_inv` dequantization tensors are ignored, producing corrupted weights. FP8 is just weight compression; the BF16 model produces nearly identical activations. If SGLang becomes available (fixes for sgl_kernel SM120), you can collect from FP8 via SGLang instead.
Step 2: Train SAEs and Transcoders
# Train all SAEs and TCs from the collected activations
python scripts/retrain_from_saved_activations.py
This script:
- Loads activation chunks from `circuit_tracer/data/activations/`
- Trains one JumpReLU SAE per layer (d_model=2048 → 16384 features)
- Trains one MoE transcoder per layer (maps pre → post activations)
- Applies normalization when activation std > 1.0 (targets std=0.01)
- Saves checkpoints as `sae_l{N}.pt` and `tc_l{N}.pt`
- Logs training metrics to W&B (if configured)
Training parameters:
| Parameter | SAE | Transcoder |
|---|---|---|
| Architecture | JumpReLU autoencoder | MoE transcoder |
| Input dim | 2048 | 2048 (pre) → 2048 (post) |
| Feature dim | 16384 (8x expansion) | 16384 |
| Learning rate | 3e-4 | 1e-4 |
| Batch size | 4096 tokens | 2048 tokens |
| Steps | 50,000 | 100,000 |
| Loss | MSE + L1 sparsity | MSE reconstruction |
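One training step of the SAE column above can be sketched in a few lines. Dimensions are shrunk and a plain ReLU stands in for JumpReLU; this is an illustrative toy step, not the `trainer.py` loop:

```python
import torch

d_model, n_feat, lam = 32, 128, 3e-4           # lambda matches the L1 coefficient
W_enc = torch.randn(d_model, n_feat, requires_grad=True)
W_dec = torch.randn(n_feat, d_model, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec], lr=3e-4)  # SAE learning rate from the table

x = torch.randn(256, d_model)                  # one (shrunk) batch of token activations
z = torch.relu(x @ W_enc)                      # ReLU stand-in for JumpReLU
x_hat = z @ W_dec

# MSE reconstruction + L1 sparsity, as in the loss row above
loss = ((x - x_hat) ** 2).mean() + lam * z.abs().sum(dim=-1).mean()
loss.backward()
opt.step()
```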
Step 3: Push to HuggingFace
# Upload trained checkpoints
python scripts/push_to_hf.py
This uploads sae_l*.pt, tc_l*.pt, normalization_stats.json, and architecture_map.json to sarel/creditscope-trained-models.
Step 4: Verify
After training, verify the SAEs produce meaningful features:
# Quick sanity check
python -c "
import torch
from circuit_tracer.saes.sparse_autoencoder import SparseAutoencoder
data = torch.load('circuit_tracer/data/checkpoints/sae_l0.pt', map_location='cpu', weights_only=False)
sae = SparseAutoencoder(**data['config'])
sae.load_state_dict(data['state_dict'])
# Feed random input and check reconstruction
x = torch.randn(10, 2048) * 0.03 # match layer 0 std
out = sae(x)
print(f'Reconstruction MSE: {((x - out.x_hat)**2).mean():.6f}')
print(f'Active features (L0): {(out.z > 0).float().sum(dim=1).mean():.0f}')
print(f'Expected: L0 ~ 50-200, MSE << input variance')
"
HuggingFace Repos
| Repo | Type | Contents |
|---|---|---|
| `sarel/creditscope-trained-models` | model | SAE/TC checkpoints, architecture map, deployment guide |
| `sarel/creditscope-activations-v2` | dataset | Activation captures from BF16 model |
| `sarel/creditscope-fino1-activations` | dataset | Source financial text dataset (input for collection) |
License
MIT License; see LICENSE for details.