Spaces:
Running
Running
File size: 11,547 Bytes
34b05c6 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 2cdada4 4bfbb64 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 |
# languagebench - System Architecture
\[AI-generated, not 100% up-to-date\]
This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models<br/>Static Curated List<br/>~34 models"] --> D[load_models]
A4["blocklist<br/>Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>Validated Models"]
E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Model Validation & Cost Filtering
H --> |"Validate Models<br/>Check API Availability<br/>No User Data Training"| H1["Valid Models Only<br/>Cost β€ $15/1M tokens"]
H1 --> H2["Robust Model List<br/>Default: Top 40 models"]
%% Language Data
J["languages.py<br/>BCP-47 + Population<br/>Glottolog Families"] --> K["Languages Sorted by Speakers<br/>Default: Up to 1000 languages"]
%% Task Registry with Unified Prompting
L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot<br/>Reasoning Template"]
M --> M1["translation_from/to<br/>BLEU + ChrF"]
M --> M2["classification<br/>Accuracy"]
M --> M3["mmlu<br/>Accuracy"]
M --> M4["arc<br/>Accuracy"]
M --> M5["truthfulqa<br/>Accuracy"]
M --> M6["mgsm<br/>Accuracy"]
%% On-the-fly Translation with Origin Tagging
subgraph OTF [On-the-fly Dataset Translation]
direction LR
DS_raw["Raw English Dataset<br/>"] --> Google_Translate["Google Translate API"]
Google_Translate --> DS_translated["Translated Dataset<br/>e.g., MGSM/ARC<br/>Origin: 'machine'"]
DS_native["Native Dataset<br/>e.g., AfriMMLU/Global-MMLU<br/>Origin: 'human'"]
end
%% Evaluation Pipeline
H2 --> |"models ID<br/>Default: 40 models"| N["main.py / main_gcs.py<br/>evaluate"]
K --> |"languages bcp_47<br/>Default: 1000 languages"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks<br/>Filter by valid task languages"| O["Valid Combinations<br/>Model Γ Language Γ Task"]
O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing<br/>Batch Size: 2000"]
%% Task Execution with Origin Tracking
P --> Q1[translate_and_evaluate<br/>Origin: 'human']
P --> Q2[classify_and_evaluate<br/>Origin: 'human']
P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'<br/>no on-the-fly; uses auto-translated if available]
P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'<br/>no on-the-fly; relies on available datasets]
P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
%% API Calls with Error Handling
Q1 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R["OpenRouter<br/>Model Inference"]
Q2 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
Q3 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
Q4 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
Q5 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
Q6 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
%% Results Processing with Origin Aggregation
R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin<br/>Bootstrap Confidence Intervals"]
S --> |Save| T["results.json<br/>results-detailed.json"]
%% Backend & Frontend with Origin-Specific Metrics
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics<br/>Confidence Intervals"]
U --> |make_country_table| W["Country Aggregation"]
U --> |make_language_tier_history| V2["Language Tier History<br/>Top 1, 2-20, 20-200"]
U --> |make_license_history| V3["License History<br/>Open-source vs Commercial"]
U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine<br/>language_tier_history<br/>license_history"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js<br/>Country Visualization"]
Y --> Z2["ModelTable.js<br/>Model Rankings"]
Y --> Z3["LanguageTable.js<br/>Language Coverage"]
Y --> Z4["DatasetTable.js<br/>Task Performance"]
Y --> Z5["LanguageTierHistoryPlot.js<br/>Tier-based Trends"]
Y --> Z6["LicenseHistoryPlot.js<br/>License-based Trends"]
%% Data Sources with Origin Information
subgraph DS ["Data Sources"]
DS1["FLORES+<br/>Translation Sentences<br/>Origin: 'human'"]
DS2["MMLU Variants<br/>AfriMMLU/Global-MMLU/MMMLU<br/>HF Auto-translated MMLU<br/>Origin: 'human' or 'machine'"]
DS3["Uhuru ARC Easy<br/>Auto-translated ARC<br/>Origin: 'human' or 'machine'"]
DS4["Uhura TruthfulQA<br/>Auto-translated TruthfulQA<br/>Origin: 'human' or 'machine'"]
DS5["MGSM Variants<br/>MGSM/AfriMGSM/GSM8K-X<br/>Auto-translated GSM<br/>Origin: 'human' or 'machine'"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
%% No on-the-fly DS_translated for MMLU anymore; only HF auto-translated used
DS_translated --> Q4
DS_translated --> Q5
DS_native --> Q3
DS_native --> Q4
DS_native --> Q5
%% Styling - Neutral colors that work in both dark and light modes
classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
classDef translation fill:#d4edda,stroke:#155724,color:#155724
class A1,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4,Z5,Z6 frontend
class Google_Translate,DS_translated,DS_native translation
```
## Architecture Components
### π΅ Model Discovery (Light Gray)
- **Static Curated Models**: Handpicked important models (~34 models) for comprehensive evaluation
- **Dynamic Popular Models**: Web scraping capability available but currently disabled
- **Quality Control**: Blocklist for problematic or incompatible models
- **Model Validation**: API availability checks, cost filtering (β€$15/1M tokens), and exclusion of providers that train on user data
- **Default Selection**: Top 40 models by default (configurable via N_MODELS)
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
### π£ Evaluation Pipeline (Medium Gray)
- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Unified English Zero-Shot Prompting**: All tasks use English instructions with target language content
- **Reasoning Template**: Tasks use structured reasoning format with `<reasoning>...</reasoning><final_answer>...</final_answer>` tags
- **Origin Tagging**: Distinguishes between human-translated ('human') and machine-translated ('machine') data
- **Combinatorial Approach**: Systematic evaluation across Model Γ Language Γ Task combinations
- **Sample-based**: 10 evaluations per combination for statistical reliability (configurable via N_SENTENCES)
- **Batch Processing**: 2000 tasks per batch with rate limiting and error resilience
- **Language Filtering**: Pre-computed valid languages per task to filter invalid combinations
- **Default Scale**: 40 models Γ 1000 languages Γ 7 tasks Γ 10 samples (configurable via environment variables)
- **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage
### π API Integration (Light Gray)
- **OpenRouter**: Primary model inference API for all language model tasks
- **Rate Limiting**: Async rate limiters (20 req/s OpenRouter, 10 req/s Google Translate, 5 req/s HuggingFace)
- **Reasoning Configuration**: Low-effort reasoning mode enabled for efficiency
- **Error Handling**: Graceful handling of timeouts, rate limits, filtered content, and model unavailability
- **HuggingFace**: Model metadata and open-source model information via HfApi
- **Google Translate**: Specialized translation API for on-the-fly dataset translation (when needed)
### π’ Data Storage (Cyan)
- **results.json**: Aggregated evaluation scores with origin-specific metrics
- **results-detailed.json**: Detailed results with individual sample scores for bootstrap CI calculation
- **models.json**: Dynamic model list with metadata and validation status
- **languages.json**: Language information with population data, Glottolog families, and script information
- **Immutable Log**: Results are cached and merged to avoid re-computation
### π‘ Frontend Visualization (Light Red)
- **WorldMap**: Interactive country-level visualization with language selection
- **ModelTable**: Ranked model performance leaderboard with origin-specific columns and confidence intervals
- **LanguageTable**: Language coverage and speaker statistics with confidence intervals
- **DatasetTable**: Task-specific performance breakdowns with human/machine distinction
- **LanguageTierHistoryPlot**: Historical trends for language tiers (Top 1, Top 2-20, Top 20-200)
- **LicenseHistoryPlot**: Historical trends comparing open-source vs commercial models
- **Confidence Intervals**: Bootstrap-based 95% confidence intervals for all metrics
### π΅ Translation & Origin Tracking (Light Green)
- **Dataset-Based Translation**: Uses HuggingFace auto-translated datasets (MMLU, ARC, TruthfulQA, MGSM) when available
- **On-the-fly Translation**: Google Translate API available but primarily used for translation tasks
- **Origin Tagging**: Automatic classification of data sources (human vs. machine translated)
- **Separate Metrics**: Frontend displays distinct scores for human and machine-translated data
- **Dataset Variants**: Supports multiple dataset variants (e.g., AfriMMLU, Global-MMLU, MMMLU for MMLU)
## Data Flow Summary
1. **Model Discovery**: Load curated models (~34) β validate API availability and cost (β€$15/1M tokens) β exclude providers training on user data β enrich with metadata from OpenRouter and HuggingFace
2. **Evaluation Setup**: Generate all valid Model Γ Language Γ Task combinations (default: 40 models Γ 1000 languages) with pre-computed language filtering and origin tracking
3. **Task Execution**: Run evaluations using unified English prompting with reasoning templates, batch processing (2000 per batch), and rate limiting
4. **Result Processing**: Aggregate scores by model+language+task+origin, compute bootstrap confidence intervals, and save to JSON files (results.json and results-detailed.json)
5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics, confidence intervals, language tier history, and license history via REST API
6. **Frontend Display**: React app visualizes data through interactive components (WorldMap, ModelTable, LanguageTable, DatasetTable, LanguageTierHistoryPlot, LicenseHistoryPlot) with transparency indicators and confidence intervals
This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency and statistical rigor. |