Design System Extractor v2 – Project Context
Architecture Overview
```
Stage 0: Configuration    Stage 1: Discovery & Extraction    Stage 2: AI Analysis              Stage 3: Export
+--------------------+    +----------------------------+     +----------------------------+    +----------------+
| HF Token Setup     |--->| URL Discovery (sitemap/    |---->| Layer 1: Rule Engine       |--->| Figma Tokens   |
| Benchmark Select   |    | crawl) + Token Extraction  |     | Layer 2: Benchmarks        |    | JSON Export    |
+--------------------+    | (Desktop + Mobile CSS)     |     | Layer 3: LLM Agents (x3)   |    +----------------+
                          +----------------------------+     | Layer 4: HEAD Synthesizer  |
                                                             +----------------------------+
```
Stage 1: Discovery & Extraction (Rule-Based, Free)
- Discover Pages: Fetches sitemap.xml or crawls site to find pages
- Extract Tokens: Playwright visits each page at 2 viewports (Desktop 1440px, Mobile 375px), extracts computed CSS for colors, typography, spacing, radius, shadows
- User Review: Interactive tables with Accept/Reject checkboxes + visual previews
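The two-viewport extraction step can be sketched as follows. This is a minimal illustration assuming Playwright's sync API; the function name and the set of style properties collected are illustrative, not the project's actual code.

```python
# Viewport sizes used for Desktop and Mobile passes (from Stage 1 above).
VIEWPORTS = {
    "desktop": {"width": 1440, "height": 900},
    "mobile": {"width": 375, "height": 812},
}

def extract_tokens(url: str, timeout_ms: int = 30000) -> dict:
    """Visit `url` at each viewport and collect deduplicated computed styles."""
    from playwright.sync_api import sync_playwright  # requires `playwright install chromium`

    tokens = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for name, size in VIEWPORTS.items():
            page = browser.new_page(viewport=size)
            page.goto(url, timeout=timeout_ms)
            # Pull computed CSS for every element; dedupe in the browser.
            tokens[name] = page.evaluate(
                """() => {
                    const seen = new Set();
                    for (const el of document.querySelectorAll('*')) {
                        const s = getComputedStyle(el);
                        seen.add(JSON.stringify({color: s.color,
                                                 fontSize: s.fontSize,
                                                 borderRadius: s.borderRadius,
                                                 boxShadow: s.boxShadow}));
                    }
                    return [...seen].map(s => JSON.parse(s));
                }"""
            )
            page.close()
        browser.close()
    return tokens
```

Running one pass per viewport (rather than resizing a single page) keeps each pass's computed styles independent, which matters when media queries change typography between breakpoints.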
Stage 2: AI-Powered Analysis (4 Layers)
| Layer | Type | What It Does | Cost |
|-------|------|--------------|------|
| Layer 1 | Rule Engine | Type scale detection, AA contrast checking, spacing grid analysis, color statistics | FREE |
| Layer 2 | Benchmark Research | Compare against Material Design 3, Apple HIG, Tailwind, etc. | ~$0.001 |
| Layer 3 | LLM Agents (x3) | AURORA (Brand ID) + ATLAS (Benchmark) + SENTINEL (Best Practices) | ~$0.002 |
| Layer 4 | HEAD Synthesizer | NEXUS combines all outputs into final recommendations | ~$0.001 |
Stage 3: Export
- Apply/reject individual color, typography, spacing recommendations
- Export Figma Tokens Studio-compatible JSON
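The export follows Tokens Studio's `{ "value": …, "type": … }` token shape. The group and token names below are illustrative, not the tool's actual output:

```json
{
  "global": {
    "color": {
      "brand-primary": { "value": "#0055ff", "type": "color" }
    },
    "font-size": {
      "base": { "value": "16px", "type": "fontSizes" }
    },
    "spacing": {
      "sm": { "value": "8px", "type": "spacing" }
    }
  }
}
```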
Agent Roster
| Agent | Codename | Model | Temp | Input | Output | Specialty |
|-------|----------|-------|------|-------|--------|-----------|
| Brand Identifier | AURORA | Qwen/Qwen2.5-72B-Instruct | 0.4 | Color tokens + semantic CSS analysis | Brand primary/secondary/accent, palette strategy, cohesion score, semantic names | Creative/visual reasoning, color harmony assessment |
| Benchmark Advisor | ATLAS | meta-llama/Llama-3.3-70B-Instruct | 0.25 | User's type scale, spacing, font sizes + benchmark comparison data | Recommended benchmark, alignment changes, pros/cons | 128K context for large benchmark data, comparative reasoning |
| Best Practices Validator | SENTINEL | Qwen/Qwen2.5-72B-Instruct | 0.2 | Rule Engine results (typography, accessibility, spacing, color stats) | Overall score (0-100), check results, prioritized fix list | Methodical rule-following, precise judgment |
| HEAD Synthesizer | NEXUS | meta-llama/Llama-3.3-70B-Instruct | 0.3 | All 3 agent outputs + Rule Engine facts | Executive summary, scores, top 3 actions, color/type/spacing recs | 128K context for combined inputs, synthesis capability |
Why These Models
- Qwen 72B (AURORA, SENTINEL): Strong creative reasoning for brand analysis; methodical structured output for best practices. Available on HF serverless without gated access.
- Llama 3.3 70B (ATLAS, NEXUS): 128K context window handles large combined inputs from multiple agents. Excellent comparative and synthesis reasoning.
- Fallback: Qwen/Qwen2.5-7B-Instruct (free tier, available when primary models fail)
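The primary-then-fallback pattern can be sketched as below, assuming `huggingface_hub`'s `InferenceClient`; the helper names here are illustrative.

```python
FALLBACK_MODEL = "Qwen/Qwen2.5-7B-Instruct"

def model_candidates(primary: str) -> list:
    """Order of models to try: the configured primary first, then the free-tier fallback."""
    return [primary] if primary == FALLBACK_MODEL else [primary, FALLBACK_MODEL]

def chat_with_fallback(token: str, model: str, messages: list, temperature: float = 0.3) -> str:
    from huggingface_hub import InferenceClient  # assumed dependency

    client = InferenceClient(token=token)
    last_error = None
    for candidate in model_candidates(model):
        try:
            out = client.chat_completion(
                messages, model=candidate, max_tokens=2048, temperature=temperature
            )
            return out.choices[0].message.content
        except Exception as err:  # primary unavailable -> try the fallback
            last_error = err
    raise RuntimeError(f"All models failed: {last_error}")
```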
Temperature Rationale
- 0.4 (AURORA): Allows creative interpretation of color stories and palette harmony
- 0.25 (ATLAS): Analytical comparison needs consistency but some flexibility for trade-off reasoning
- 0.2 (SENTINEL): Strict rule evaluation – consistency is critical for compliance scoring
- 0.3 (NEXUS): Balanced – needs to synthesize creatively but stay grounded in agent data
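A per-agent temperature map mirroring the rationale above might look like this; the lookup falls back to the `HF_TEMPERATURE` global default (the env var name comes from the Configuration section below, the helper name is illustrative).

```python
import os

AGENT_TEMPERATURE = {
    "AURORA": 0.4,    # creative palette interpretation
    "ATLAS": 0.25,    # consistent comparison, some flexibility
    "SENTINEL": 0.2,  # strict, repeatable compliance scoring
    "NEXUS": 0.3,     # grounded synthesis
}

def temperature_for(agent: str) -> float:
    """Agent-specific temperature, else the HF_TEMPERATURE global default (0.3)."""
    default = float(os.getenv("HF_TEMPERATURE", "0.3"))
    return AGENT_TEMPERATURE.get(agent, default)
```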
Evaluation & Scoring
Self-Evaluation (All Agents)
Each agent includes a self_evaluation block in its JSON output:
```json
{
  "confidence": 8,
  "reasoning": "Clear usage patterns with 20+ colors",
  "data_quality": "good",
  "flags": []
}
```
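A consumer of agent output can sanity-check this block before trusting it. In this sketch the field names come from the example above, but the accepted `data_quality` values and the 1-10 confidence range are assumptions:

```python
def validate_self_evaluation(block: dict) -> list:
    """Return a list of problems; an empty list means the block looks well-formed."""
    problems = []
    conf = block.get("confidence")
    if not isinstance(conf, int) or not 1 <= conf <= 10:
        problems.append("confidence must be an int in 1-10")
    if block.get("data_quality") not in {"good", "fair", "poor"}:  # assumed value set
        problems.append("unknown data_quality value")
    if not isinstance(block.get("flags"), list):
        problems.append("flags must be a list")
    return problems
```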
AURORA Scoring Rubric (Cohesion 1-10)
- 9-10: Clear harmony rule, distinct brand colors, consistent palette
- 7-8: Mostly harmonious, clear brand identity
- 5-6: Some relationships visible but not systematic
- 3-4: Random palette, no clear strategy
- 1-2: Conflicting colors, no brand identity
SENTINEL Scoring Rubric (Overall 0-100)
Weighted checks:
- AA Compliance: 25 points
- Type Scale Consistency: 15 points
- Base Size Accessible: 15 points
- Spacing Grid: 15 points
- Type Scale Standard Ratio: 10 points
- Color Count: 10 points
- No Near-Duplicates: 10 points
NEXUS Scoring Rubric (Overall 0-100)
- 90-100: Production-ready, minor polishing only
- 75-89: Solid foundation, 2-3 targeted improvements
- 60-74: Functional but needs focused attention
- 40-59: Significant gaps requiring systematic improvement
- 20-39: Major rework needed
- 0-19: Fundamental redesign recommended
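Mapping a NEXUS score to its rubric band is a straightforward threshold lookup (the labels below abbreviate the band descriptions above):

```python
BANDS = [
    (90, "Production-ready"),
    (75, "Solid foundation"),
    (60, "Functional"),
    (40, "Significant gaps"),
    (20, "Major rework"),
    (0, "Fundamental redesign"),
]

def score_band(score: int) -> str:
    """Return the rubric band label for a 0-100 score."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return BANDS[-1][1]
```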
Evaluation Summary (Logged After Analysis)
```
===================================================
   AGENT EVALUATION SUMMARY
===================================================
AURORA   (Brand ID):  confidence=8/10, data=good
ATLAS    (Benchmark): confidence=7/10, data=good
SENTINEL (Practices): confidence=9/10, data=good, score=72/100
NEXUS    (Synthesis): confidence=8/10, data=good, overall=65/100
===================================================
```
User Journey
- Enter HF Token → Required for LLM inference (free tier works)
- Enter Website URL → The site to extract design tokens from
- Discover Pages → Auto-finds pages via sitemap or crawling
- Select Pages → Check/uncheck pages to include (max 10)
- Extract Tokens → Scans selected pages at Desktop + Mobile viewports
- Review Stage 1 → Interactive tables: Colors, Typography, Spacing, Radius, Shadows, Semantic Colors. Each tab has a data table + visual preview accordion. Accept/reject individual tokens.
- Proceed to Stage 2 → Select benchmarks to compare against
- Run AI Analysis → 4-layer pipeline executes (Rule Engine -> Benchmarks -> LLM Agents -> Synthesis)
- Review Analysis → Dashboard with scores, recommendations, benchmark comparison, color recs
- Apply Upgrades → Accept/reject individual recommendations
- Export JSON → Download Figma Tokens Studio-compatible JSON
File Structure
| File | Responsibility |
|------|----------------|
| app.py | Main Gradio UI – all stages, CSS, event bindings, formatting functions |
| agents/llm_agents.py | 4 LLM agent classes (AURORA, ATLAS, SENTINEL, NEXUS) + dataclasses |
| agents/semantic_analyzer.py | Semantic color categorization (brand, text, background, etc.) |
| config/settings.py | Model routing, env var loading, agent-to-model mapping |
| core/hf_inference.py | HF Inference API client, model registry, temperature mapping |
| core/preview_generator.py | HTML preview generators for Stage 1 visual previews |
| core/rule_engine.py | Layer 1: Type scale, AA contrast, spacing grid, color stats |
| core/benchmarks.py | Benchmark definitions (Material Design 3, Apple HIG, etc.) |
| core/extractor.py | Playwright-based CSS token extraction |
| core/discovery.py | Page discovery via sitemap.xml / crawling |
Configuration
Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| HF_TOKEN | (required) | HuggingFace API token |
| BRAND_IDENTIFIER_MODEL | Qwen/Qwen2.5-72B-Instruct | Model for AURORA |
| BENCHMARK_ADVISOR_MODEL | meta-llama/Llama-3.3-70B-Instruct | Model for ATLAS |
| BEST_PRACTICES_MODEL | Qwen/Qwen2.5-72B-Instruct | Model for SENTINEL |
| HEAD_SYNTHESIZER_MODEL | meta-llama/Llama-3.3-70B-Instruct | Model for NEXUS |
| FALLBACK_MODEL | Qwen/Qwen2.5-7B-Instruct | Fallback when primary fails |
| HF_MAX_NEW_TOKENS | 2048 | Max tokens per LLM response |
| HF_TEMPERATURE | 0.3 | Global default temperature |
| MAX_PAGES | 20 | Max pages to discover |
| BROWSER_TIMEOUT | 30000 | Playwright timeout (ms) |
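A `config/settings.py`-style loader for these variables might look like the sketch below (only a subset of the table is shown; the `setting` helper is illustrative):

```python
import os

def setting(name: str, default: str) -> str:
    """Read an env var, falling back to the documented default."""
    return os.getenv(name, default)

BRAND_IDENTIFIER_MODEL = setting("BRAND_IDENTIFIER_MODEL", "Qwen/Qwen2.5-72B-Instruct")
FALLBACK_MODEL = setting("FALLBACK_MODEL", "Qwen/Qwen2.5-7B-Instruct")
HF_MAX_NEW_TOKENS = int(setting("HF_MAX_NEW_TOKENS", "2048"))
HF_TEMPERATURE = float(setting("HF_TEMPERATURE", "0.3"))
MAX_PAGES = int(setting("MAX_PAGES", "20"))
BROWSER_TIMEOUT = int(setting("BROWSER_TIMEOUT", "30000"))
```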
Model Override Examples
```bash
export BRAND_IDENTIFIER_MODEL="meta-llama/Llama-3.3-70B-Instruct"
export BEST_PRACTICES_MODEL="meta-llama/Llama-3.3-70B-Instruct"
export BRAND_IDENTIFIER_MODEL="Qwen/Qwen2.5-7B-Instruct"
export BENCHMARK_ADVISOR_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
```