# Architecture Documentation

## System Architecture

### High-Level Overview

```
┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                         │
│           (Web Apps, Mobile Apps, Other Services)           │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │ HTTP/REST API
                        │
┌───────────────────────▼─────────────────────────────────────┐
│                      API Gateway Layer                      │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  FastAPI Application                                  │  │
│  │  - Request Validation                                 │  │
│  │  - Authentication (optional)                          │  │
│  │  - Rate Limiting                                      │  │
│  └───────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │
┌───────────────────────▼─────────────────────────────────────┐
│                      Application Layer                      │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Monitoring Middleware                                │  │
│  │  - Prediction Logging                                 │  │
│  │  - Data Drift Detection                               │  │
│  │  - Performance Tracking                               │  │
│  └───────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Inference Engine                                     │  │
│  │  - Async Processing                                   │  │
│  │  - Batch Handling                                     │  │
│  │  - Error Handling                                     │  │
│  └───────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │ Model Calls
                        │
┌───────────────────────▼─────────────────────────────────────┐
│                         Model Layer                         │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Transformer Models                                   │  │
│  │  - Russian BERT                                       │  │
│  │  - RoBERTa                                            │  │
│  │  - DistilBERT                                         │  │
│  │  - Ensemble Models                                    │  │
│  └───────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │
┌───────────────────────▼─────────────────────────────────────┐
│                         Data Layer                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Tokenization                                         │  │
│  │  - HuggingFace Tokenizers                             │  │
│  │  - Subword Tokenization                               │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## Model Architecture Details

### Transformer Model Flow

```
Input Text: "Путин объявил о новых мерах поддержки экономики"
            (English: "Putin announced new measures to support the economy")
    │
    ├─► Text Preprocessing
    │   └─► Normalize: "путин объявил о новых мерах поддержки экономики"
    │
    ├─► Tokenization (HuggingFace)
    │   ├─► Tokens: ["[CLS]", "путин", "объявил", "о", "новых", "мерах", ...]
    │   └─► Token IDs: [101, 1234, 5678, ...]
    │
    ├─► Embedding Layer
    │   └─► [batch, seq_len, 768]
    │
    ├─► BERT Encoder (12 layers)
    │   ├─► Multi-Head Self-Attention (12 heads)
    │   ├─► Feed-Forward Network
    │   ├─► Layer Normalization
    │   ├─► Residual Connections
    │   └─► Output: [batch, seq_len, 768]
    │
    ├─► Pooling
    │   ├─► [CLS] token or Attention Pooling
    │   └─► [batch, 768]
    │
    ├─► Classification Head
    │   ├─► Dropout(0.3)
    │   ├─► Linear(768 → 768) + ReLU
    │   ├─► Dropout(0.3)
    │   ├─► Linear(768 → num_labels)
    │   └─► Output: [batch, num_labels]
    │
    ├─► Sigmoid Activation
    │   └─► Probabilities: [batch, num_labels]
    │
    └─► Threshold Filtering (0.5)
        └─► Final Tags: ["политика", "экономика"]  (politics, economy)
```
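
The last two steps of the flow (sigmoid activation, then threshold filtering) can be sketched in plain Python. This is a minimal illustration rather than the project's code: the helper names `sigmoid` and `select_tags`, the three-label set, and the logits are all hypothetical, and treating a probability equal to the threshold as a match is an assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1), avoiding overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_tags(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold.

    Each label is scored independently (multi-label), so any number of
    tags, including none, can be returned.
    """
    probs = [sigmoid(z) for z in logits]
    return [(label, p) for label, p in zip(labels, probs) if p >= threshold]


# Hypothetical label set and classification-head outputs for one input.
labels = ["политика", "экономика", "спорт"]
logits = [2.2, 0.8, -1.5]

tags = select_tags(logits, labels, threshold=0.5)
print([label for label, _ in tags])
```

Because each label gets its own sigmoid, the probabilities do not need to sum to one, which is what allows several tags to pass the threshold at once.
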
### Ensemble Architecture

```
Input: Title + Snippet
    │
    ├─► Model 1 (Russian BERT)
    │   └─► Predictions: [0.9, 0.7, 0.3, ...]
    │
    ├─► Model 2 (RoBERTa)
    │   └─► Predictions: [0.85, 0.75, 0.4, ...]
    │
    ├─► Model 3 (DistilBERT)
    │   └─► Predictions: [0.88, 0.72, 0.35, ...]
    │
    └─► Ensemble Combination
        ├─► Weighted Average (weights: [0.4, 0.3, 0.3])
        └─► Final Predictions: [0.879, 0.721, 0.345, ...]
```
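
The weighted average can be written out directly. The sketch below uses the per-model predictions and weights from the diagram; the function name `weighted_average` and the normalization by the weight sum are assumptions, not something the source specifies.

```python
def weighted_average(predictions, weights):
    """Combine per-model probability vectors into a single vector.

    predictions: one list of per-label probabilities per model.
    weights: one non-negative weight per model (normalized by their sum).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_labels = len(predictions[0])
    if any(len(p) != n_labels for p in predictions):
        raise ValueError("all models must score the same labels")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_labels)
    ]


model_predictions = [
    [0.90, 0.70, 0.30],  # Model 1 (Russian BERT)
    [0.85, 0.75, 0.40],  # Model 2 (RoBERTa)
    [0.88, 0.72, 0.35],  # Model 3 (DistilBERT)
]
combined = weighted_average(model_predictions, [0.4, 0.3, 0.3])
print([round(p, 3) for p in combined])
```

For the first label this works out to 0.4 × 0.9 + 0.3 × 0.85 + 0.3 × 0.88 = 0.879.
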
## Data Flow

### Training Data Flow

```
Raw TSV Files
    │
    ├─► Load Data (pandas)
    │   └─► Filter nulls
    │
    ├─► Text Preprocessing
    │   ├─► Normalize text
    │   ├─► Lowercase
    │   └─► Remove special chars
    │
    ├─► Tag Processing
    │   ├─► Split tags
    │   ├─► Filter by frequency
    │   └─► Create label mapping
    │
    ├─► Data Splitting
    │   ├─► Train (dates < 2018-10-01)
    │   ├─► Validation (2018-10-01 <= dates < 2018-12-01)
    │   └─► Test (dates >= 2018-12-01)
    │
    ├─► Dataset Creation
    │   ├─► Tokenization
    │   ├─► Padding/Truncation
    │   └─► Multi-hot encoding
    │
    └─► DataLoader
        └─► Batches for training
```
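
Two of these steps are easy to get subtly wrong: the date boundaries of the split, and the mapping from filtered tags to multi-hot vectors. The following standard-library sketch shows one way they could fit together. The comma tag separator, the `min_count` cutoff, the English tag names, and all function names are assumptions made for illustration; the validation window is taken as half-open so that it does not overlap the test set.

```python
from collections import Counter
from datetime import date

VAL_START = date(2018, 10, 1)   # first validation day
TEST_START = date(2018, 12, 1)  # first test day


def assign_split(d: date) -> str:
    """Chronological split: train < VAL_START <= val < TEST_START <= test."""
    if d < VAL_START:
        return "train"
    if d < TEST_START:
        return "val"
    return "test"


def build_label_mapping(tag_lists, min_count=2):
    """Keep tags seen at least min_count times and give each a stable index."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    kept = sorted(tag for tag, n in counts.items() if n >= min_count)
    return {tag: i for i, tag in enumerate(kept)}


def multi_hot(tags, label_to_id):
    """Encode a tag list as a 0/1 vector; tags filtered out are ignored."""
    vec = [0] * len(label_to_id)
    for tag in tags:
        idx = label_to_id.get(tag)
        if idx is not None:
            vec[idx] = 1
    return vec


# Hypothetical rows: (publication date, raw tag string).
rows = [
    (date(2018, 9, 30), "politics,economy"),
    (date(2018, 10, 1), "economy"),
    (date(2018, 12, 1), "politics,sports"),
]
tag_lists = [raw.split(",") for _, raw in rows]
label_to_id = build_label_mapping(tag_lists, min_count=2)

for (d, _), tags in zip(rows, tag_lists):
    print(assign_split(d), multi_hot(tags, label_to_id))
```

Splitting by date rather than at random keeps later articles out of the training set, so validation and test scores reflect performance on news published after the training period.
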
### Inference Data Flow

```
API Request
    │
    ├─► Request Validation (Pydantic)
    │   └─► Validate title, snippet, threshold
    │
    ├─► Text Preprocessing
    │   └─► Normalize and clean
    │
    ├─► Tokenization
    │   └─► Convert to token IDs
    │
    ├─► Model Inference
    │   └─► Forward pass through BERT
    │
    ├─► Post-processing
    │   ├─► Sigmoid activation
    │   ├─► Threshold filtering
    │   └─► Top-K selection
    │
    ├─► Monitoring
    │   ├─► Log prediction
    │   ├─► Record for drift detection
    │   └─► Track performance
    │
    └─► Response
        └─► JSON with predictions
```
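
The validation and Top-K steps at either end of this flow can be sketched without any third-party dependency. Here a standard-library dataclass stands in for the Pydantic model, so this is not the Pydantic API; the class name `PredictRequest`, the `top_k` field, its default, and the response keys `tag` and `score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictRequest:
    """Stand-in for the request schema (the real service uses Pydantic)."""
    title: str
    snippet: str = ""
    threshold: float = 0.5
    top_k: int = 5  # assumed parameter for the Top-K step

    def __post_init__(self):
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


def postprocess(probs, labels, request):
    """Filter by threshold, then keep the top_k highest-scoring tags."""
    scored = [(label, p) for label, p in zip(labels, probs) if p >= request.threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [{"tag": label, "score": round(p, 4)} for label, p in scored[: request.top_k]]


request = PredictRequest(title="Example headline", threshold=0.5, top_k=2)
labels = ["politics", "economy", "sports", "culture"]
probs = [0.9, 0.7, 0.3, 0.6]  # assumed to be post-sigmoid probabilities
result = postprocess(probs, labels, request)
print(result)
```

Applying the threshold before Top-K means a request can return fewer than `top_k` tags when few labels are confident enough.
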
## Component Interactions

### Training Pipeline

```
Config (Hydra)
    │
    ├─► Data Loading
    │   └─► Dataset Creation
    │
    ├─► Model Initialization
    │   └─► Load Pre-trained BERT
    │
    ├─► Training Loop
    │   ├─► Forward Pass
    │   ├─► Loss Calculation
    │   ├─► Backward Pass
    │   └─► Optimizer Step
    │
    ├─► Validation
    │   └─► Metrics Calculation
    │
    ├─► Experiment Tracking
    │   ├─► WandB Logging
    │   ├─► MLflow Tracking
    │   └─► DVC Versioning
    │
    └─► Model Checkpointing
        └─► Save Best Model
```
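
To make the ordering of the loop concrete, here is a deliberately tiny stand-in: a one-feature logistic regression trained with hand-written gradients in plain Python. It is not the project's trainer and uses neither BERT nor PyTorch Lightning; the data, learning rate, and function names are invented for illustration. It only shows how forward pass, loss, backward pass, optimizer step, validation, and best-model checkpointing follow one another.

```python
import math


def forward(w, b, x):
    """Forward pass: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one example, clamped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_loss(w, b, data):
    return sum(bce(forward(w, b, x), y) for x, y in data) / len(data)


def train(train_data, val_data, epochs=50, lr=0.5):
    w, b = 0.0, 0.0
    best = {"epoch": -1, "val_loss": float("inf"), "w": w, "b": b}
    for epoch in range(epochs):
        grad_w = grad_b = train_loss = 0.0
        for x, y in train_data:
            p = forward(w, b, x)        # forward pass
            train_loss += bce(p, y)     # loss calculation
            grad_w += (p - y) * x       # backward pass: dLoss/dw
            grad_b += p - y             # backward pass: dLoss/db
        n = len(train_data)
        w -= lr * grad_w / n            # optimizer step (plain gradient descent)
        b -= lr * grad_b / n
        val_loss = mean_loss(w, b, val_data)  # validation metric
        if val_loss < best["val_loss"]:       # checkpoint only when validation improves
            best = {"epoch": epoch, "val_loss": val_loss, "w": w, "b": b}
    return best


train_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
val_data = [(-1.5, 0), (1.5, 1)]
best = train(train_data, val_data)
print(best["epoch"], round(best["val_loss"], 4))
```

In the real pipeline the `best` dictionary would correspond to a saved model checkpoint, and the per-epoch losses would be sent to the experiment trackers.
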
### API Request Flow

```
HTTP Request
    │
    ├─► CORS Middleware
    │
    ├─► Monitoring Middleware
    │   └─► Start timer
    │
    ├─► Request Validation
    │   └─► Pydantic validation
    │
    ├─► Inference
    │   ├─► Text preprocessing
    │   ├─► Tokenization
    │   ├─► Model forward pass
    │   └─► Post-processing
    │
    ├─► Monitoring
    │   ├─► Log prediction
    │   ├─► Check drift
    │   └─► Update metrics
    │
    └─► HTTP Response
        └─► JSON with predictions
```
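
The "start timer ... update metrics" bracket around a request can be illustrated with a plain-Python wrapper. This is a framework-free sketch and does not use FastAPI's middleware API; `LatencyMonitor`, `predict_handler`, and the dictionary-shaped request are hypothetical.

```python
import functools
import time


class LatencyMonitor:
    """Collects per-request latencies in memory (illustrative only)."""

    def __init__(self):
        self.latencies_ms = []

    def wrap(self, handler):
        """Return a handler that records how long each call takes."""
        @functools.wraps(handler)
        def timed(request):
            start = time.perf_counter()  # start timer
            try:
                return handler(request)  # validation and inference would run here
            finally:
                # Runs on success and on error, so failures are measured too.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                self.latencies_ms.append(elapsed_ms)
        return timed

    def summary(self):
        count = len(self.latencies_ms)
        if count == 0:
            return {"count": 0, "mean_ms": 0.0, "max_ms": 0.0}
        return {
            "count": count,
            "mean_ms": sum(self.latencies_ms) / count,
            "max_ms": max(self.latencies_ms),
        }


def predict_handler(request):
    """Hypothetical handler standing in for the real inference call."""
    return {"title": request["title"], "predictions": []}


monitor = LatencyMonitor()
handler = monitor.wrap(predict_handler)
response = handler({"title": "example"})
print(response, monitor.summary()["count"])
```

Recording the latency in a `finally` block keeps failed requests in the metrics, which matters when slow requests are also the ones that time out.
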
## Technology Stack

### Core ML

- **PyTorch**: Deep learning framework
- **PyTorch Lightning**: Training framework
- **Transformers**: HuggingFace transformers library
- **Russian BERT**: DeepPavlov/rubert-base-cased

### API & Web

- **FastAPI**: Modern Python web framework
- **Uvicorn**: ASGI server
- **Pydantic**: Data validation

### MLOps

- **WandB**: Experiment tracking
- **MLflow**: Model registry
- **DVC**: Data versioning
- **Optuna**: Hyperparameter tuning
- **Hydra**: Configuration management

### Infrastructure

- **Docker**: Containerization
- **GitHub Actions**: CI/CD
- **Nginx**: Reverse proxy (optional)

### Monitoring

- **Custom Monitoring**: Performance, drift, logging
- **Prometheus** (optional): Metrics collection
- **Grafana** (optional): Visualization

## Scalability Considerations

### Horizontal Scaling

- Stateless API design
- Load balancer support
- Multiple worker processes
- Container orchestration (Kubernetes)

### Performance Optimization

- Async inference
- Batch processing
- Model quantization (future)
- GPU acceleration
- Caching (future)

### High Availability

- Health checks
- Graceful degradation
- Circuit breakers (future)
- Retry mechanisms
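
The source does not say how retries are implemented. One common approach is retry with exponential backoff, sketched below under that assumption; the function name `retry`, its parameters, and the `flaky_model_call` example are hypothetical. The `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep, retry_on=(Exception,)):
    """Call func(); on failure wait base_delay * 2**n and try again.

    Re-raises the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky_model_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = retry(flaky_model_call, attempts=3, base_delay=0.1, sleep=delays.append)
print(result, delays)
```

Retries suit transient failures such as a brief network error; restricting `retry_on` to those exception types avoids repeating requests that would fail the same way every time, such as validation errors.
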