Spaces:

solarevat
/

multilabel-news-classifier

Sleeping

App Files Files Community

multilabel-news-classifier / docs /ARCHITECTURE.md

Solareva Taisia

chore(release): initial public snapshot

198ccb0 2 months ago

preview code

raw

history blame contribute delete

12.1 kB

	# Architecture Documentation

	## System Architecture

	### High-Level Overview

	```
	┌─────────────────────────────────────────────────────────────┐
	│ Client Layer │
	│ (Web Apps, Mobile Apps, Other Services) │
	└───────────────────────┬─────────────────────────────────────┘
	│
	│ HTTP/REST API
	│
	┌───────────────────────▼─────────────────────────────────────┐
	│ API Gateway Layer │
	│ ┌──────────────────────────────────────────────────────┐ │
	│ │ FastAPI Application │ │
	│ │ - Request Validation │ │
	│ │ - Authentication (optional) │ │
	│ │ - Rate Limiting │ │
	│ └──────────────────────────────────────────────────────┘ │
	└───────────────────────┬─────────────────────────────────────┘
	│
	│
	┌───────────────────────▼─────────────────────────────────────┐
	│ Application Layer │
	│ ┌──────────────────────────────────────────────────────┐ │
	│ │ Monitoring Middleware │ │
	│ │ - Prediction Logging │ │
	│ │ - Data Drift Detection │ │
	│ │ - Performance Tracking │ │
	│ └──────────────────────────────────────────────────────┘ │
	│ ┌──────────────────────────────────────────────────────┐ │
	│ │ Inference Engine │ │
	│ │ - Async Processing │ │
	│ │ - Batch Handling │ │
	│ │ - Error Handling │ │
	│ └──────────────────────────────────────────────────────┘ │
	└───────────────────────┬─────────────────────────────────────┘
	│
	│ Model Calls
	│
	┌───────────────────────▼─────────────────────────────────────┐
	│ Model Layer │
	│ ┌──────────────────────────────────────────────────────┐ │
	│ │ Transformer Models │ │
	│ │ - Russian BERT │ │
	│ │ - RoBERTa │ │
	│ │ - DistilBERT │ │
	│ │ - Ensemble Models │ │
	│ └──────────────────────────────────────────────────────┘ │
	└───────────────────────┬─────────────────────────────────────┘
	│
	│
	┌───────────────────────▼─────────────────────────────────────┐
	│ Data Layer │
	│ ┌──────────────────────────────────────────────────────┐ │
	│ │ Tokenization │ │
	│ │ - HuggingFace Tokenizers │ │
	│ │ - Subword Tokenization │ │
	│ └──────────────────────────────────────────────────────┘ │
	└─────────────────────────────────────────────────────────────┘
	```

	## Model Architecture Details

	### Transformer Model Flow

	```
	Input Text: "Путин объявил о новых мерах поддержки экономики"
	│
	├─► Text Preprocessing
	│ └─► Normalize: "путин объявил о новых мерах поддержки экономики"
	│
	├─► Tokenization (HuggingFace)
	│ └─► Tokens: ["[CLS]", "путин", "объявил", "о", "новых", "мерах", ...]
	│ └─► Token IDs: [101, 1234, 5678, ...]
	│
	├─► Embedding Layer
	│ └─► [batch, seq_len, 768]
	│
	├─► BERT Encoder (12 layers)
	│ ├─► Multi-Head Self-Attention (12 heads)
	│ ├─► Feed-Forward Network
	│ ├─► Layer Normalization
	│ └─► Residual Connections
	│ └─► Output: [batch, seq_len, 768]
	│
	├─► Pooling
	│ └─► [CLS] token or Attention Pooling
	│ └─► [batch, 768]
	│
	├─► Classification Head
	│ ├─► Dropout(0.3)
	│ ├─► Linear(768 → 768) + ReLU
	│ ├─► Dropout(0.3)
	│ └─► Linear(768 → num_labels)
	│ └─► Output: [batch, num_labels]
	│
	├─► Sigmoid Activation
	│ └─► Probabilities: [batch, num_labels]
	│
	└─► Threshold Filtering (0.5)
	└─► Final Tags: ["политика", "экономика"]
	```

	### Ensemble Architecture

	```
	Input: Title + Snippet
	│
	├─► Model 1 (Russian BERT)
	│ └─► Predictions: [0.9, 0.7, 0.3, ...]
	│
	├─► Model 2 (RoBERTa)
	│ └─► Predictions: [0.85, 0.75, 0.4, ...]
	│
	├─► Model 3 (DistilBERT)
	│ └─► Predictions: [0.88, 0.72, 0.35, ...]
	│
	└─► Ensemble Combination
	├─► Weighted Average (weights: [0.4, 0.3, 0.3])
	└─► Final Predictions: [0.88, 0.73, 0.35, ...]
	```

	## Data Flow

	### Training Data Flow

	```
	Raw TSV Files
	│
	├─► Load Data (pandas)
	│ └─► Filter nulls
	│
	├─► Text Preprocessing
	│ ├─► Normalize text
	│ ├─► Lowercase
	│ └─► Remove special chars
	│
	├─► Tag Processing
	│ ├─► Split tags
	│ ├─► Filter by frequency
	│ └─► Create label mapping
	│
	├─► Data Splitting
	│ ├─► Train (dates < 2018-10-01)
	│ ├─► Validation (2018-10-01 to 2018-12-01)
	│ └─► Test (dates >= 2018-12-01)
	│
	├─► Dataset Creation
	│ ├─► Tokenization
	│ ├─► Padding/Truncation
	│ └─► Multi-hot encoding
	│
	└─► DataLoader
	└─► Batches for training
	```

	### Inference Data Flow

	```
	API Request
	│
	├─► Request Validation (Pydantic)
	│ └─► Validate title, snippet, threshold
	│
	├─► Text Preprocessing
	│ └─► Normalize and clean
	│
	├─► Tokenization
	│ └─► Convert to token IDs
	│
	├─► Model Inference
	│ └─► Forward pass through BERT
	│
	├─► Post-processing
	│ ├─► Sigmoid activation
	│ ├─► Threshold filtering
	│ └─► Top-K selection
	│
	├─► Monitoring
	│ ├─► Log prediction
	│ ├─► Record for drift detection
	│ └─► Track performance
	│
	└─► Response
	└─► JSON with predictions
	```

	## Component Interactions

	### Training Pipeline

	```
	Config (Hydra)
	│
	├─► Data Loading
	│ └─► Dataset Creation
	│
	├─► Model Initialization
	│ └─► Load Pre-trained BERT
	│
	├─► Training Loop
	│ ├─► Forward Pass
	│ ├─► Loss Calculation
	│ ├─► Backward Pass
	│ └─► Optimizer Step
	│
	├─► Validation
	│ └─► Metrics Calculation
	│
	├─► Experiment Tracking
	│ ├─► WandB Logging
	│ ├─► MLflow Tracking
	│ └─► DVC Versioning
	│
	└─► Model Checkpointing
	└─► Save Best Model
	```

	### API Request Flow

	```
	HTTP Request
	│
	├─► CORS Middleware
	│
	├─► Monitoring Middleware
	│ └─► Start timer
	│
	├─► Request Validation
	│ └─► Pydantic validation
	│
	├─► Inference
	│ ├─► Text preprocessing
	│ ├─► Tokenization
	│ ├─► Model forward pass
	│ └─► Post-processing
	│
	├─► Monitoring
	│ ├─► Log prediction
	│ ├─► Check drift
	│ └─► Update metrics
	│
	└─► HTTP Response
	└─► JSON with predictions
	```

	## Technology Stack

	### Core ML
	- PyTorch: Deep learning framework
	- PyTorch Lightning: Training framework
	- Transformers: HuggingFace transformers library
	- Russian BERT: DeepPavlov/rubert-base-cased

	### API & Web
	- FastAPI: Modern Python web framework
	- Uvicorn: ASGI server
	- Pydantic: Data validation

	### MLOps
	- WandB: Experiment tracking
	- MLflow: Model registry
	- DVC: Data versioning
	- Optuna: Hyperparameter tuning
	- Hydra: Configuration management

	### Infrastructure
	- Docker: Containerization
	- GitHub Actions: CI/CD
	- Nginx: Reverse proxy (optional)

	### Monitoring
	- Custom Monitoring: Performance, drift, logging
	- Prometheus (optional): Metrics collection
	- Grafana (optional): Visualization

	## Scalability Considerations

	### Horizontal Scaling
	- Stateless API design
	- Load balancer support
	- Multiple worker processes
	- Container orchestration (Kubernetes)

	### Performance Optimization
	- Async inference
	- Batch processing
	- Model quantization (future)
	- GPU acceleration
	- Caching (future)

	### High Availability
	- Health checks
	- Graceful degradation
	- Circuit breakers (future)
	- Retry mechanisms