Solareva Taisia

Architecture Documentation

System Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────────┐
│                    Client Layer                              │
│  (Web Apps, Mobile Apps, Other Services)                    │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │ HTTP/REST API
                        │
┌───────────────────────▼─────────────────────────────────────┐
│                  API Gateway Layer                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  FastAPI Application                                  │  │
│  │  - Request Validation                                 │  │
│  │  - Authentication (optional)                          │  │
│  │  - Rate Limiting                                      │  │
│  └──────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │
┌───────────────────────▼─────────────────────────────────────┐
│              Application Layer                                │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Monitoring Middleware                                │  │
│  │  - Prediction Logging                                 │  │
│  │  - Data Drift Detection                               │  │
│  │  - Performance Tracking                               │  │
│  └──────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Inference Engine                                     │  │
│  │  - Async Processing                                   │  │
│  │  - Batch Handling                                     │  │
│  │  - Error Handling                                     │  │
│  └──────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │ Model Calls
                        │
┌───────────────────────▼─────────────────────────────────────┐
│              Model Layer                                      │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Transformer Models                                   │  │
│  │  - Russian BERT                                       │  │
│  │  - RoBERTa                                            │  │
│  │  - DistilBERT                                         │  │
│  │  - Ensemble Models                                    │  │
│  └──────────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        │
┌───────────────────────▼─────────────────────────────────────┐
│              Data Layer                                       │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Tokenization                                         │  │
│  │  - HuggingFace Tokenizers                             │  │
│  │  - Subword Tokenization                               │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Model Architecture Details

Transformer Model Flow

Input Text: "Путин объявил о новых мерах поддержки экономики" (English: "Putin announced new measures to support the economy")
    │
    ├─► Text Preprocessing
    │   └─► Normalize: "путин объявил о новых мерах поддержки экономики"
    │
    ├─► Tokenization (HuggingFace)
    │   ├─► Tokens: ["[CLS]", "путин", "объявил", "о", "новых", "мерах", ...]
    │   └─► Token IDs: [101, 1234, 5678, ...]
    │
    ├─► Embedding Layer
    │   └─► [batch, seq_len, 768]
    │
    ├─► BERT Encoder (12 layers)
    │   ├─► Multi-Head Self-Attention (12 heads)
    │   ├─► Feed-Forward Network
    │   ├─► Layer Normalization
    │   ├─► Residual Connections
    │   └─► Output: [batch, seq_len, 768]
    │
    ├─► Pooling
    │   ├─► [CLS] token or Attention Pooling
    │   └─► [batch, 768]
    │
    ├─► Classification Head
    │   ├─► Dropout(0.3)
    │   ├─► Linear(768 → 768) + ReLU
    │   ├─► Dropout(0.3)
    │   ├─► Linear(768 → num_labels)
    │   └─► Output: [batch, num_labels]
    │
    ├─► Sigmoid Activation
    │   └─► Probabilities: [batch, num_labels]
    │
    └─► Threshold Filtering (0.5)
        └─► Final Tags: ["политика", "экономика"] (politics, economy)
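
The last two steps of the flow (sigmoid, then threshold filtering) can be sketched in plain Python. This is only an illustration under stated assumptions: the `ID2LABEL` list, the example logits, and the choice of `>=` at the threshold boundary are hypothetical and not taken from the project code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_tags(logits, id2label, threshold=0.5):
    """Turn one row of classification-head logits into tag names.

    Each label gets its own sigmoid, so several tags can pass the
    threshold at once (multi-label, not softmax).
    """
    probs = [sigmoid(v) for v in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]


# Hypothetical label order and logits, for illustration only
# (politics, economy, sports).
ID2LABEL = ["политика", "экономика", "спорт"]
example_logits = [2.2, 0.8, -1.5]

print(decode_tags(example_logits, ID2LABEL))
```

Because every label is scored independently, an input can receive zero, one, or several tags depending on the threshold.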

Ensemble Architecture

Input: Title + Snippet
    │
    ├─► Model 1 (Russian BERT)
    │   └─► Predictions: [0.9, 0.7, 0.3, ...]
    │
    ├─► Model 2 (RoBERTa)
    │   └─► Predictions: [0.85, 0.75, 0.4, ...]
    │
    ├─► Model 3 (DistilBERT)
    │   └─► Predictions: [0.88, 0.72, 0.35, ...]
    │
    └─► Ensemble Combination
        ├─► Weighted Average (weights: [0.4, 0.3, 0.3])
        └─► Final Predictions: [0.88, 0.72, 0.35, ...]
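
The weighted-average combination can be checked with a few lines of Python. The sketch below reuses the per-model probabilities and weights from the diagram; the function name `weighted_average` is an assumption, and the real ensemble code may combine models differently.

```python
def weighted_average(predictions, weights):
    """Combine per-model probability vectors with a weighted mean.

    predictions: one probability list per model, all of equal length.
    weights: one weight per model; normalised here so they need not sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_labels = len(predictions[0])
    return [
        sum(w * p[j] for w, p in zip(weights, predictions)) / total
        for j in range(n_labels)
    ]


model_predictions = [
    [0.90, 0.70, 0.30],  # Model 1 (Russian BERT)
    [0.85, 0.75, 0.40],  # Model 2 (RoBERTa)
    [0.88, 0.72, 0.35],  # Model 3 (DistilBERT)
]
combined = weighted_average(model_predictions, [0.4, 0.3, 0.3])
print([round(p, 3) for p in combined])  # prints [0.879, 0.721, 0.345]
```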

Data Flow

Training Data Flow

Raw TSV Files
    │
    ├─► Load Data (pandas)
    │   └─► Filter nulls
    │
    ├─► Text Preprocessing
    │   ├─► Normalize text
    │   ├─► Lowercase
    │   └─► Remove special chars
    │
    ├─► Tag Processing
    │   ├─► Split tags
    │   ├─► Filter by frequency
    │   └─► Create label mapping
    │
    ├─► Data Splitting
    │   ├─► Train (dates < 2018-10-01)
    │   ├─► Validation (2018-10-01 <= dates < 2018-12-01)
    │   └─► Test (dates >= 2018-12-01)
    │
    ├─► Dataset Creation
    │   ├─► Tokenization
    │   ├─► Padding/Truncation
    │   └─► Multi-hot encoding
    │
    └─► DataLoader
        └─► Batches for training
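
Three of the steps above (frequency-based tag filtering, the date-based split, and multi-hot encoding) are easy to get subtly wrong at the boundaries, so here is a minimal stdlib sketch. The comma tag separator, the `min_count` value, the sample rows, and the treatment of the validation upper bound as exclusive are all assumptions, not details confirmed by the project.

```python
from collections import Counter
from datetime import date

# Split boundaries as given in the training data flow.
TRAIN_END = date(2018, 10, 1)
VAL_END = date(2018, 12, 1)


def assign_split(published: date) -> str:
    """Time-based split: older articles train, newer ones validate and test."""
    if published < TRAIN_END:
        return "train"
    if published < VAL_END:  # assumed exclusive so it cannot overlap with test
        return "val"
    return "test"


def build_label_mapping(tag_lists, min_count=2):
    """Keep tags seen at least min_count times and give each a stable index."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    kept = sorted(tag for tag, n in counts.items() if n >= min_count)
    return {tag: i for i, tag in enumerate(kept)}


def multi_hot(tags, label2id):
    """Encode a tag list as a 0/1 vector; tags outside the mapping are dropped."""
    vec = [0.0] * len(label2id)
    for tag in tags:
        idx = label2id.get(tag)
        if idx is not None:
            vec[idx] = 1.0
    return vec


# Hypothetical rows: (publication date, comma-separated tags).
rows = [
    ("2018-09-30", "politics,economy"),
    ("2018-10-01", "economy"),
    ("2018-12-01", "politics,sports"),
]
tag_lists = [tags.split(",") for _, tags in rows]
label2id = build_label_mapping(tag_lists, min_count=2)
for (day, _), tags in zip(rows, tag_lists):
    print(assign_split(date.fromisoformat(day)), multi_hot(tags, label2id))
```

Splitting by date rather than at random keeps articles from the evaluation period out of training, which matters for news data where topics shift over time.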

Inference Data Flow

API Request
    │
    ├─► Request Validation (Pydantic)
    │   └─► Validate title, snippet, threshold
    │
    ├─► Text Preprocessing
    │   └─► Normalize and clean
    │
    ├─► Tokenization
    │   └─► Convert to token IDs
    │
    ├─► Model Inference
    │   └─► Forward pass through BERT
    │
    ├─► Post-processing
    │   ├─► Sigmoid activation
    │   ├─► Threshold filtering
    │   └─► Top-K selection
    │
    ├─► Monitoring
    │   ├─► Log prediction
    │   ├─► Record for drift detection
    │   └─► Track performance
    │
    └─► Response
        └─► JSON with predictions
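
To make the validation and post-processing steps concrete, the sketch below uses a stdlib dataclass as a stand-in for the Pydantic request model, so it runs without extra dependencies. The field names `title`, `snippet`, and `threshold` come from the flow above; `top_k`, the default values, the response shape, and the order of filtering (threshold first, then Top-K) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredictRequest:
    """Stdlib stand-in for the Pydantic request schema (hypothetical fields)."""
    title: str
    snippet: str = ""
    threshold: float = 0.5
    top_k: Optional[int] = None

    def __post_init__(self):
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        if self.top_k is not None and self.top_k < 1:
            raise ValueError("top_k must be >= 1")


def select_predictions(probs, labels, request):
    """Threshold filtering followed by Top-K selection, highest score first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    kept = [(label, p) for label, p in ranked if p >= request.threshold]
    if request.top_k is not None:
        kept = kept[: request.top_k]
    return [{"label": label, "score": round(p, 4)} for label, p in kept]


req = PredictRequest(title="Central bank keeps rate unchanged", threshold=0.4, top_k=2)
labels = ["sports", "economy", "politics", "society"]
print(select_predictions([0.35, 0.91, 0.62, 0.48], labels, req))
```

In the real service a Pydantic model would express the same constraints declaratively and produce structured error responses; the dataclass only mirrors the checks.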

Component Interactions

Training Pipeline

Config (Hydra)
    │
    ├─► Data Loading
    │   └─► Dataset Creation
    │
    ├─► Model Initialization
    │   └─► Load Pre-trained BERT
    │
    ├─► Training Loop
    │   ├─► Forward Pass
    │   ├─► Loss Calculation
    │   ├─► Backward Pass
    │   └─► Optimizer Step
    │
    ├─► Validation
    │   └─► Metrics Calculation
    │
    ├─► Experiment Tracking
    │   ├─► WandB Logging
    │   ├─► MLflow Tracking
    │   └─► DVC Versioning
    │
    └─► Model Checkpointing
        └─► Save Best Model
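
The four steps of the training loop, plus best-model checkpointing, can be illustrated on a toy problem without any ML framework. This is a deliberately simplified stand-in for the PyTorch Lightning loop: it trains a bias-free linear multi-label model with plain SGD, assumes binary cross-entropy as the loss (the usual pairing with sigmoid outputs, though the document does not name the loss), and checkpoints on training loss rather than on a validation metric.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_step(weights, features, targets, lr=0.5):
    """One SGD step on one example; returns the loss measured before the update."""
    n_labels = len(weights)
    # Forward pass: one logit per label, each squashed by its own sigmoid.
    probs = [sigmoid(sum(w * x for w, x in zip(row, features))) for row in weights]
    # Loss calculation: binary cross-entropy averaged over labels.
    clipped = [min(max(p, 1e-12), 1.0 - 1e-12) for p in probs]
    loss = -sum(
        t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / n_labels
    # Backward pass: for sigmoid + BCE, d(loss)/d(logit_k) = (p_k - t_k) / n_labels.
    # Optimizer step: plain SGD update applied in place.
    for k, row in enumerate(weights):
        grad_logit = (probs[k] - targets[k]) / n_labels
        for j, x in enumerate(features):
            row[j] -= lr * grad_logit * x
    return loss


# Toy data: label 0 follows feature 0, label 1 follows feature 1.
data = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, 1.0])]
weights = [[0.0, 0.0], [0.0, 0.0]]
best_loss, best_weights, history = float("inf"), None, []

for epoch in range(50):
    epoch_loss = sum(train_step(weights, x, y) for x, y in data) / len(data)
    history.append(epoch_loss)
    if epoch_loss < best_loss:  # "Save Best Model": keep a copy of the weights
        best_loss = epoch_loss
        best_weights = [row[:] for row in weights]

print(round(history[0], 3), round(history[-1], 3))
```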

API Request Flow

HTTP Request
    │
    ├─► CORS Middleware
    │
    ├─► Monitoring Middleware
    │   └─► Start timer
    │
    ├─► Request Validation
    │   └─► Pydantic validation
    │
    ├─► Inference
    │   ├─► Text preprocessing
    │   ├─► Tokenization
    │   ├─► Model forward pass
    │   └─► Post-processing
    │
    ├─► Monitoring
    │   ├─► Log prediction
    │   ├─► Check drift
    │   └─► Update metrics
    │
    └─► HTTP Response
        └─► JSON with predictions
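
The monitoring steps that wrap inference (start timer, log prediction, update metrics) can be sketched as a decorator around the handler. This is a framework-free illustration, not the project's FastAPI middleware: the `PredictionMonitor` class, its window size, and the stand-in `predict` handler are hypothetical, and the drift check is left out because the document does not describe how drift is measured.

```python
import time
from collections import deque
from functools import wraps
from statistics import mean


class PredictionMonitor:
    """Keeps a bounded window of latencies and prediction records in memory."""

    def __init__(self, window=1000):
        self.latencies_ms = deque(maxlen=window)
        self.log = deque(maxlen=window)

    def track(self, handler):
        """Decorator that times a handler and records its input and output."""
        @wraps(handler)
        def wrapped(payload):
            start = time.perf_counter()               # start timer
            response = handler(payload)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.latencies_ms.append(elapsed_ms)      # update metrics
            self.log.append(                          # log prediction
                {"input": payload, "output": response, "latency_ms": elapsed_ms}
            )
            return response
        return wrapped

    def summary(self):
        if not self.latencies_ms:
            return {"count": 0}
        return {
            "count": len(self.latencies_ms),
            "mean_ms": mean(self.latencies_ms),
            "max_ms": max(self.latencies_ms),
        }


monitor = PredictionMonitor()


@monitor.track
def predict(payload):
    # Stand-in for preprocessing, tokenization, forward pass, and post-processing.
    return {"tags": ["economy"] if "bank" in payload["title"].lower() else []}


print(predict({"title": "Central bank keeps rate unchanged"}))
print(monitor.summary()["count"])
```

Bounding both deques keeps memory use constant under sustained traffic, at the cost of only reporting statistics for the most recent window.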

Technology Stack

Core ML

PyTorch: Deep learning framework
PyTorch Lightning: Training framework
Transformers: HuggingFace transformers library
Russian BERT: DeepPavlov/rubert-base-cased

API & Web

FastAPI: Modern Python web framework
Uvicorn: ASGI server
Pydantic: Data validation

MLOps

WandB: Experiment tracking
MLflow: Model registry
DVC: Data versioning
Optuna: Hyperparameter tuning
Hydra: Configuration management

Infrastructure

Docker: Containerization
GitHub Actions: CI/CD
Nginx: Reverse proxy (optional)

Monitoring

Custom Monitoring: Performance, drift, logging
Prometheus (optional): Metrics collection
Grafana (optional): Visualization

Scalability Considerations

Horizontal Scaling

Stateless API design
Load balancer support
Multiple worker processes
Container orchestration (Kubernetes)

Performance Optimization

Async inference
Batch processing
Model quantization (future)
GPU acceleration
Caching (future)

High Availability

Health checks
Graceful degradation
Circuit breakers (future)
Retry mechanisms
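
As one illustration of the retry item above, the sketch below retries a failing call with exponential backoff. The helper name `call_with_retry`, its defaults, and the flaky stand-in function are assumptions; the document does not say which calls are retried or with what policy.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    The delay doubles after each failure (base_delay, 2 * base_delay, ...).
    The final failure is re-raised so callers can degrade gracefully.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Hypothetical backend call that fails twice before succeeding.
calls = {"n": 0}


def flaky_model_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("model backend unavailable")
    return {"tags": ["economy"]}


delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_model_call, attempts=3, base_delay=0.1, sleep=delays.append)
print(result, delays)  # prints {'tags': ['economy']} [0.1, 0.2]
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.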