# Data Schemas & Service Specifications
## Overview
This document details all data structures, schemas, API contracts, and data volume specifications for the Metro Train Scheduling Service.
---
## Table of Contents
1. [Core Data Models](#core-data-models)
2. [API Schemas](#api-schemas)
3. [Database Schemas](#database-schemas)
4. [Data Volume & Storage](#data-volume--storage)
5. [Service Resource Usage](#service-resource-usage)
---
## Core Data Models
All models use **Pydantic v2** for validation and serialization.
### 1. DaySchedule
**Purpose**: Complete daily schedule with all trainset assignments
```python
class DaySchedule(BaseModel):
    schedule_id: str                    # "KMRL-2025-10-25"
    date: str                           # "2025-10-25"
    route: Route                        # Route details
    trainsets: List[Trainset]           # All train assignments
    fleet_summary: FleetSummary         # Fleet statistics
    optimization_metrics: OptimizationMetrics
    alerts: List[Alert]                 # Warnings/issues
    generated_at: datetime
    generated_by: str = "ML-Optimizer"
```
**Size**: ~45 KB per schedule (30 trains, full day)
**Example**:
```json
{
  "schedule_id": "KMRL-2025-10-25",
  "date": "2025-10-25",
  "route": {...},
  "trainsets": [...],
  "fleet_summary": {
    "total_trainsets": 30,
    "in_service": 24,
    "standby": 4,
    "maintenance": 2
  },
  "optimization_metrics": {
    "total_service_blocks": 125,
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12
  },
  "generated_at": "2025-10-25T04:30:00+05:30"
}
```
---
### 2. Trainset
**Purpose**: Individual train assignment and status
```python
class Trainset(BaseModel):
    trainset_id: str                    # "TS-001"
    status: TrainHealthStatus           # REVENUE_SERVICE, STANDBY, etc.
    depot_bay: str                      # "BAY-01"
    cumulative_km: int                  # 145250
    readiness_score: float              # 0.0-1.0
    service_blocks: List[ServiceBlock]  # Trip assignments
    fitness_certificates: FitnessCertificates
    job_cards: JobCards
    branding: Branding
```
**Size**: ~1.5 KB per trainset
**Status Enum**:
```python
class TrainHealthStatus(str, Enum):
    REVENUE_SERVICE = "REVENUE_SERVICE"  # Active service
    STANDBY = "STANDBY"                  # Ready, not assigned
    MAINTENANCE = "MAINTENANCE"          # Under repair
    SCHEDULED_MAINTENANCE = "SCHEDULED_MAINTENANCE"
    UNAVAILABLE = "UNAVAILABLE"          # Out of service
```
**Distribution** (typical 30-train fleet):
- REVENUE_SERVICE: 22-24 trains (73-80%)
- STANDBY: 3-5 trains (10-17%)
- MAINTENANCE: 1-3 trains (3-10%)
- UNAVAILABLE: 0-2 trains (0-7%)
---
### 3. ServiceBlock
**Purpose**: Single trip assignment for a train
```python
class ServiceBlock(BaseModel):
    block_id: str                        # "BLK-001-01"
    start_time: str                      # "05:00"
    end_time: str                        # "05:45"
    start_station: str                   # "Aluva"
    end_station: str                     # "Pettah"
    direction: str                       # "UP" or "DOWN"
    distance_km: float                   # 25.612
    estimated_passengers: Optional[int]  # 450
    priority: str = "NORMAL"             # NORMAL, HIGH, PEAK
```
**Size**: ~250 bytes per service block
**Daily Trips per Train**:
- Peak service train: 6-8 trips
- Standard service: 4-6 trips
- Average: ~5.2 trips per active train
**Total Service Blocks** (30-train fleet):
- 24 active trains × 5.2 trips = ~125 service blocks/day
---
### 4. Route
**Purpose**: Metro line configuration
```python
class Route(BaseModel):
    route_id: str                 # "KMRL-LINE-01"
    name: str                     # "Aluva-Pettah Line"
    stations: List[Station]       # 25 stations
    total_distance_km: float      # 25.612 km
    avg_speed_kmh: int            # 32-38 km/h
    turnaround_time_minutes: int  # 8-12 minutes
```
**KMRL Route Details**:
- **Stations**: 25 (Aluva to Pettah)
- **Distance**: 25.612 km
- **Average Speed**: 35 km/h
- **One-way Time**: ~44 minutes
- **Round Trip**: ~100 minutes (including turnarounds)
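The timing figures above follow from the route distance and average speed. The sketch below reproduces them; the function names are hypothetical, and it assumes the ~100-minute round trip counts a single 12-minute turnaround at the far terminal, which the source does not state explicitly.

```python
def one_way_minutes(distance_km: float, avg_speed_kmh: float) -> float:
    """Travel time for one end-to-end trip, in minutes."""
    return distance_km / avg_speed_kmh * 60


def round_trip_minutes(distance_km: float, avg_speed_kmh: float,
                       turnaround_minutes: float, turnarounds: int = 1) -> float:
    """Two one-way trips plus the given number of terminal turnarounds.

    Assumption: one turnaround per round trip unless stated otherwise.
    """
    travel = 2 * one_way_minutes(distance_km, avg_speed_kmh)
    return travel + turnarounds * turnaround_minutes


trip = one_way_minutes(25.612, 35)          # about 43.9 minutes
cycle = round_trip_minutes(25.612, 35, 12)  # about 99.8 minutes
```

Counting a turnaround at both terminals (`turnarounds=2`) gives roughly 104-112 minutes for the 8-12 minute range.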
---
### 5. Station
**Purpose**: Individual station on route
```python
class Station(BaseModel):
    station_id: str                # "STN-001"
    name: str                      # "Aluva"
    code: str                      # "ALV"
    distance_from_start_km: float  # 0.0
    platform_count: int            # 2
    facilities: List[str]          # ["PARKING", "ELEVATOR"]
```
**Size**: ~200 bytes per station
**Total Stations**: 25 (fixed)
---
### 6. FitnessCertificates
**Purpose**: Regulatory compliance tracking
```python
# Defined in dependency order so each annotation resolves when the class is created
class CertificateStatus(str, Enum):
    VALID = "VALID"                  # > 30 days remaining
    EXPIRING_SOON = "EXPIRING_SOON"  # 7-30 days remaining
    EXPIRED = "EXPIRED"              # Past expiry date


class FitnessCertificate(BaseModel):
    valid_until: str           # "2025-12-31"
    status: CertificateStatus  # VALID, EXPIRING_SOON, EXPIRED


class FitnessCertificates(BaseModel):
    rolling_stock: FitnessCertificate  # Train body/chassis
    signalling: FitnessCertificate     # Signal systems
    telecom: FitnessCertificate        # Communication systems
```
**Validation Rules**:
- Trains with EXPIRED certificates: status = UNAVAILABLE
- Trains with EXPIRING_SOON: flagged in alerts, can operate
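As an illustration of how a certificate status could be derived from `valid_until`, here is a minimal sketch. The function name `classify_certificate` is hypothetical. The documented ranges leave 0-6 days remaining unspecified, so the sketch assumes that window also counts as `EXPIRING_SOON`.

```python
from datetime import date
from enum import Enum


class CertificateStatus(str, Enum):
    VALID = "VALID"
    EXPIRING_SOON = "EXPIRING_SOON"
    EXPIRED = "EXPIRED"


def classify_certificate(valid_until: str, today: date) -> CertificateStatus:
    """Map an ISO expiry date to a status using the documented day ranges."""
    days_left = (date.fromisoformat(valid_until) - today).days
    if days_left < 0:
        return CertificateStatus.EXPIRED        # past the expiry date
    if days_left <= 30:
        # Assumption: 0-6 days remaining is treated like the 7-30 day window
        return CertificateStatus.EXPIRING_SOON
    return CertificateStatus.VALID              # more than 30 days remaining
```

Passing `today` explicitly keeps the function deterministic, which makes the boundary cases easy to test.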
---
### 7. JobCards & Maintenance
**Purpose**: Maintenance tracking
```python
class JobCards(BaseModel):
    open: int            # Number of open job cards
    blocking: List[str]  # Critical issues: ["BRAKE_FAULT"]


# Example maintenance reasons
UNAVAILABLE_REASONS = [
    "SCHEDULED_MAINTENANCE",
    "BRAKE_SYSTEM_REPAIR",
    "HVAC_REPLACEMENT",
    "BOGIE_OVERHAUL",
    "ELECTRICAL_FAULT",
    "ACCIDENT_DAMAGE",
    "PANTOGRAPH_REPAIR",
    "DOOR_SYSTEM_FAULT",
]
```
**Impact on Scheduling**:
- 0 open cards: readiness = 1.0
- 1-2 cards: readiness = 0.9
- 3-4 cards: readiness = 0.7
- 5+ cards: readiness = 0.5, likely maintenance status
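The mapping above can be written as a small lookup function. This is a sketch only: `readiness_from_open_cards` is a hypothetical name, and the actual service may combine job cards with other inputs when computing readiness.

```python
def readiness_from_open_cards(open_cards: int) -> float:
    """Return the readiness score for a given number of open job cards."""
    if open_cards < 0:
        raise ValueError("open_cards must be non-negative")
    if open_cards == 0:
        return 1.0
    if open_cards <= 2:
        return 0.9
    if open_cards <= 4:
        return 0.7
    return 0.5  # 5 or more cards: likely maintenance status
```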
---
### 8. Branding
**Purpose**: Advertisement tracking
```python
class Branding(BaseModel):
    advertiser: str                # "COCACOLA-2024"
    contract_hours_remaining: int  # 450 hours
    exposure_priority: str         # LOW, MEDIUM, HIGH, CRITICAL


# Available advertisers
ADVERTISERS = [
    "COCACOLA-2024",
    "FLIPKART-FESTIVE",
    "AMAZON-PRIME",
    "RELIANCE-JIO",
    "TATA-MOTORS",
    "SAMSUNG-GALAXY",
    "NONE",
]
```
**Priority Weights** (for optimization):
- CRITICAL: 4 points
- HIGH: 3 points
- MEDIUM: 2 points
- LOW: 1 point
- NONE: 0 points
**Scheduling Strategy**:
- HIGH/CRITICAL branded trains prioritized for peak hours
- Maximizes advertiser visibility during high-traffic periods
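One way to apply the priority weights is to rank trainsets by weight before assigning peak-hour blocks. The sketch below assumes trainsets are plain dictionaries with `trainset_id` and `exposure_priority` keys; `rank_for_peak` is a hypothetical helper, not the optimizer's actual routine.

```python
PRIORITY_WEIGHTS = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1, "NONE": 0}


def rank_for_peak(trainsets: list) -> list:
    """Return trainset IDs ordered by branding weight, highest first.

    sorted() is stable, so trainsets with equal weight keep their input order.
    """
    ranked = sorted(
        trainsets,
        key=lambda t: PRIORITY_WEIGHTS[t["exposure_priority"]],
        reverse=True,
    )
    return [t["trainset_id"] for t in ranked]


fleet = [
    {"trainset_id": "TS-001", "exposure_priority": "LOW"},
    {"trainset_id": "TS-002", "exposure_priority": "CRITICAL"},
    {"trainset_id": "TS-003", "exposure_priority": "HIGH"},
    {"trainset_id": "TS-004", "exposure_priority": "HIGH"},
]
peak_order = rank_for_peak(fleet)
```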
---
### 9. FleetSummary
**Purpose**: Aggregated fleet statistics
```python
class FleetSummary(BaseModel):
    total_trainsets: int         # 30
    in_service: int              # 24
    standby: int                 # 4
    maintenance: int             # 2
    unavailable: int             # 0
    availability_percent: float  # 93.33
    total_mileage_today: int     # 3200 km
    avg_trips_per_train: float   # 5.2
```
**Size**: ~300 bytes
**Key Metrics**:
- **Availability %**: (in_service + standby) / total × 100
- **Target Availability**: ≥ 90%
- **Service Ratio**: in_service / (in_service + standby)
- **Target Service Ratio**: 85-90%
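The two formulas above, worked through for the example fleet (24 in service, 4 standby, 30 total). The function names are hypothetical.

```python
def availability_percent(in_service: int, standby: int, total: int) -> float:
    """(in_service + standby) / total x 100."""
    return (in_service + standby) / total * 100


def service_ratio(in_service: int, standby: int) -> float:
    """in_service / (in_service + standby), as a fraction."""
    return in_service / (in_service + standby)


availability = availability_percent(24, 4, 30)  # about 93.33, above the 90% target
ratio = service_ratio(24, 4)                    # about 0.857, inside the 85-90% target
```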
---
### 10. OptimizationMetrics
**Purpose**: Optimization quality measures
```python
class OptimizationMetrics(BaseModel):
    total_service_blocks: int            # 125
    avg_readiness_score: float           # 0.87
    mileage_variance_coefficient: float  # 0.12
    branding_sla_compliance: float       # 0.95
    fitness_expiry_violations: int       # 0
    execution_time_ms: int               # 1250
    algorithm_used: str                  # "ensemble_ml" or "or_tools"
    confidence_score: Optional[float]    # 0.89 (if ML used)
```
**Size**: ~250 bytes
**Quality Thresholds**:
- avg_readiness_score: ≥ 0.80
- mileage_variance_coefficient: < 0.15
- branding_sla_compliance: ≥ 0.90
- fitness_expiry_violations: 0
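A schedule can be checked against these thresholds with a simple table of predicates. This is a sketch under the assumption that metrics arrive as a plain dictionary; `failed_thresholds` is a hypothetical helper.

```python
QUALITY_CHECKS = {
    "avg_readiness_score": lambda v: v >= 0.80,
    "mileage_variance_coefficient": lambda v: v < 0.15,
    "branding_sla_compliance": lambda v: v >= 0.90,
    "fitness_expiry_violations": lambda v: v == 0,
}


def failed_thresholds(metrics: dict) -> list:
    """Return the names of metrics that miss their threshold, in table order."""
    return [name for name, ok in QUALITY_CHECKS.items() if not ok(metrics[name])]


example = {
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12,
    "branding_sla_compliance": 0.95,
    "fitness_expiry_violations": 0,
}
failures = failed_thresholds(example)  # empty list: every threshold is met
```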
---
## API Schemas
### Request: ScheduleRequest
**Endpoint**: `POST /api/v1/generate`
```python
class ScheduleRequest(BaseModel):
    date: str                     # "2025-10-25"
    num_trains: int = 25          # 25-40
    num_stations: int = 25        # Fixed for KMRL
    min_service_trains: int = 22  # Minimum active
    min_standby_trains: int = 3   # Minimum backup

    # Optional overrides
    peak_hours: Optional[List[int]] = None  # [7,8,9,17,18,19]
    force_optimization: bool = False        # Skip ML, use OR-Tools
```
**Size**: ~150 bytes per request
**Validation**:
- `num_trains`: 25 ≤ n ≤ 40
- `num_stations`: Fixed at 25 (KMRL specific)
- `min_service_trains`: ≤ num_trains - 3
- `min_standby_trains`: ≥ 2
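The rules above can be expressed as a dependency-free check that returns a field-to-message map, similar in shape to the `details` object of a 400 response. This is only a sketch: the service itself presumably enforces these rules through Pydantic validators, the function name is hypothetical, and every message except the `num_trains` one is invented for illustration.

```python
def validate_schedule_request(req: dict) -> dict:
    """Return a field -> message map; an empty dict means the request is valid."""
    errors = {}
    num_trains = req.get("num_trains", 25)  # defaults mirror ScheduleRequest
    if not 25 <= num_trains <= 40:
        errors["num_trains"] = "Must be between 25 and 40"
    if req.get("num_stations", 25) != 25:
        errors["num_stations"] = "Must be 25 for KMRL"  # hypothetical message
    if req.get("min_service_trains", 22) > num_trains - 3:
        errors["min_service_trains"] = "Must be at most num_trains - 3"  # hypothetical
    if req.get("min_standby_trains", 3) < 2:
        errors["min_standby_trains"] = "Must be at least 2"  # hypothetical
    return errors
```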
**Example**:
```json
{
  "date": "2025-10-25",
  "num_trains": 30,
  "num_stations": 25,
  "min_service_trains": 24,
  "min_standby_trains": 4
}
```
---
### Response: DaySchedule
**Status**: 200 OK
**Content-Type**: application/json
**Size**: ~38-60 KB depending on fleet size (about 45 KB for 30 trains)
**Headers**:
```
X-Algorithm-Used: ensemble_ml | or_tools | greedy
X-Confidence-Score: 0.89 (if ML)
X-Execution-Time-Ms: 1250
```
---
### Error Responses
**400 Bad Request**:
```json
{
  "error": "Validation Error",
  "details": {
    "num_trains": "Must be between 25 and 40"
  }
}
```
**500 Internal Server Error**:
```json
{
  "error": "Optimization Failed",
  "message": "Unable to find feasible schedule",
  "timestamp": "2025-10-25T10:30:00Z"
}
```
---
## Database Schemas
### Schedule Storage (JSON Files)
**Location**: `data/schedules/`
**Naming**: `{schedule_id}_{timestamp}.json`
**Example**: `KMRL-2025-10-25_20251025_043000.json`
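A minimal sketch of building a file name in this pattern. The `YYYYMMDD_HHMMSS` timestamp layout is inferred from the example name rather than stated in the source, and `schedule_filename` is a hypothetical helper.

```python
from datetime import datetime


def schedule_filename(schedule_id: str, saved_at: datetime) -> str:
    """Build '{schedule_id}_{timestamp}.json' with a YYYYMMDD_HHMMSS timestamp."""
    return f"{schedule_id}_{saved_at.strftime('%Y%m%d_%H%M%S')}.json"


name = schedule_filename("KMRL-2025-10-25", datetime(2025, 10, 25, 4, 30, 0))
# name == "KMRL-2025-10-25_20251025_043000.json"
```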
**Structure**:
```json
{
  "schedule": {DaySchedule},
  "metadata": {
    "recorded_at": "2025-10-25T04:30:00",
    "quality_score": 87.5,
    "algorithm_used": "ensemble_ml",
    "confidence": 0.89
  },
  "saved_at": "2025-10-25T04:30:15"
}
```
**Size per File**: ~48 KB
---
### Model Storage (Pickle Files)
**Location**: `models/`
**Files**:
1. `models_latest.pkl` - Current ensemble (all 5 models)
2. `models_{timestamp}.pkl` - Historical snapshots
3. `training_history.json` - Training metrics log
**Model File Contents**:
```python
{
    "models": {
        "gradient_boosting": GradientBoostingRegressor(),
        "random_forest": RandomForestRegressor(),
        "xgboost": XGBRegressor(),
        "lightgbm": LGBMRegressor(),
        "catboost": CatBoostRegressor(),
    },
    "ensemble_weights": {
        "xgboost": 0.215,
        "lightgbm": 0.208,
        # ... remaining model weights
    },
    "best_model_name": "xgboost",
    "last_trained": datetime(2025, 10, 25, 4, 30),
    "config": {
        "version": "v1.0.0",
        "features": [...],
        "models_trained": [...],
    },
}
```
**Size**: ~15-25 MB (all 5 models combined)
---
### Training History (JSON)
**Location**: `models/training_history.json`
**Structure**:
```json
[
  {
    "timestamp": "2025-10-23T12:00:00",
    "metrics": {
      "gradient_boosting": {
        "train_r2": 0.8912,
        "test_r2": 0.8234,
        "test_rmse": 13.45
      },
      ...
    },
    "best_model": "xgboost",
    "ensemble_weights": {...},
    "config": {
      "models_trained": [...],
      "version": "v1.0.0"
    }
  },
  ...
]
```
**Growth**: ~1 KB per training run
**Retention**: All training runs are kept; the log is pruned once it exceeds 1000 entries
---
## Data Volume & Storage
### Production Estimates
#### Daily Operations
**Per Day** (single schedule generation):
- 1 schedule file: ~48 KB
- API request/response: ~50 KB total
- Logs: ~10 KB
**Total per day**: ~108 KB
#### Monthly Operations (30 days)
**Schedule files**:
- 30 schedules × 48 KB = 1.44 MB
**Model files**:
- 1 retraining every 48 hours = 15 retrainings/month
- 15 × 25 MB = 375 MB
**Training history**:
- 15 entries × 1 KB = 15 KB
**Total per month**: ~377 MB
#### Annual Storage (1 year)
**Schedule data**:
- 365 schedules × 48 KB = 17.5 MB
**Model snapshots**:
- 182 retrainings × 25 MB = 4.55 GB
**Training history**:
- 182 KB
**Total per year**: ~4.57 GB
**With retention policy** (keep last 100 schedules, 50 models):
- Schedules: 100 × 48 KB = 4.8 MB
- Models: 50 × 25 MB = 1.25 GB
- History: 182 KB
**Total with retention**: ~1.26 GB
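The annual figures can be reproduced with a few lines of arithmetic. The sketch below uses the per-item sizes quoted in this section and assumes decimal units (1 MB = 1000 KB), which is what the totals above imply; `storage_mb` is a hypothetical helper.

```python
SCHEDULE_KB = 48  # one stored schedule file
MODEL_MB = 25     # one model snapshot (upper estimate)
HISTORY_KB = 1    # one training-history entry


def storage_mb(schedules: int, model_snapshots: int, history_entries: int) -> float:
    """Total storage in MB for the given item counts (decimal units)."""
    return (schedules * SCHEDULE_KB / 1000
            + model_snapshots * MODEL_MB
            + history_entries * HISTORY_KB / 1000)


retrainings_per_year = 365 * 24 // 48  # 182 retrainings at a 48-hour interval
full_year = storage_mb(365, retrainings_per_year, retrainings_per_year)
with_retention = storage_mb(100, 50, retrainings_per_year)
# full_year is about 4568 MB (~4.57 GB); with_retention is about 1255 MB
```

Model snapshots dominate both totals, so the snapshot retention count is the setting that matters most for disk planning.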
---
### ML Training Data Requirements
#### Minimum Training Dataset
**Initial training**: 100 schedules
- Storage: 100 × 48 KB = 4.8 MB
- Generation time: ~15 minutes (automated)
- Training time: 5-10 minutes
**Optimal training**: 500 schedules
- Storage: 500 × 48 KB = 24 MB
- Provides better generalization
- Covers more edge cases
#### Feature Matrix Size
**Per schedule**: 10 features × 8 bytes (float64) = 80 bytes
**Training set** (100 schedules):
- Features (X): 100 × 80 bytes = 8 KB
- Target (y): 100 × 8 bytes = 800 bytes
- Total: ~9 KB (minimal)
**Full dataset** (1000 schedules):
- Features: 80 KB
- Target: 8 KB
- Total: ~88 KB
**Memory during training**:
- Dataset: ~88 KB
- Models (5 × ~5 MB): ~25 MB
- Working memory: ~50 MB
- **Total**: ~75 MB
---
### Optimization Service Resource Usage
#### OR-Tools Optimization
**Input data**:
- 30 trains × 1.5 KB = 45 KB
- 25 stations × 200 bytes = 5 KB
- Constraints: ~10 KB
- **Total input**: ~60 KB
**Memory usage**:
- Solver state: ~10 MB
- Solution space: ~20 MB
- **Peak memory**: ~30 MB
**Execution time**: 1-5 seconds (CPU-bound)
**CPU utilization**: 100% single core
---
#### ML Ensemble Prediction
**Input data**:
- Feature vector: 10 × 8 bytes = 80 bytes
- **Total input**: < 1 KB
**Memory usage**:
- Loaded models: ~25 MB (shared)
- Prediction workspace: ~1 MB
- **Peak memory**: ~26 MB
**Execution time**: 50-100 milliseconds
**CPU utilization**: 20-30% single core
---
#### Greedy Optimization
**Input data**: ~60 KB (same as OR-Tools)
**Memory usage**:
- State tracking: ~5 MB
- Priority queue: ~2 MB
- **Peak memory**: ~7 MB
**Execution time**: < 1 second
**CPU utilization**: 50-70% single core
---
## Service Resource Usage
### DataService (FastAPI)
**Base memory**: 150 MB (Python + FastAPI + dependencies)
**Per request overhead**: ~10 MB
**Concurrent requests** (typical): 1-5
**Total memory** (under load): 200-250 MB
**Disk I/O**:
- Read: Minimal (configuration only)
- Write: ~50 KB per schedule generated
**Network**:
- Inbound: ~150 bytes (request)
- Outbound: ~50 KB (response)
---
### SelfTrainService
**Base memory**: 200 MB (Python + ML libraries)
**During training**:
- Dataset loading: +20 MB
- Model training: +100 MB (peak)
- **Total during training**: ~320 MB
**During inference** (loaded models):
- Models in memory: +25 MB
- **Total during inference**: ~225 MB
**Disk I/O**:
- Read: 5 MB (load schedules)
- Write: 25 MB (save models)
**Frequency**:
- Training: Every 48 hours
- Inference: Per schedule request (if confidence ≥ 75%)
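The confidence gate can be sketched as follows. It assumes the request falls back to OR-Tools when ML confidence is below the threshold or unavailable, and that `force_optimization` bypasses ML entirely; the greedy optimizer is left out because the source does not say when it is chosen. `select_algorithm` is a hypothetical name.

```python
from typing import Optional

ML_CONFIDENCE_THRESHOLD = 0.75  # matches the 75% gate described above


def select_algorithm(confidence: Optional[float],
                     force_optimization: bool = False) -> str:
    """Pick 'ensemble_ml' only when ML is allowed and confident enough."""
    if force_optimization:
        return "or_tools"  # caller asked to skip ML
    if confidence is None or confidence < ML_CONFIDENCE_THRESHOLD:
        return "or_tools"  # assumption: OR-Tools is the fallback
    return "ensemble_ml"
```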
---
### Retraining Service (Background)
**Memory**: ~50 MB (idle), ~320 MB (during training)
**CPU**:
- Idle: < 1%
- Training: 100% (5-10 minutes every 48 hours)
**Disk I/O**:
- Check interval: Every 60 minutes
- Read: ~1 MB (check schedule count)
- Write: ~25 MB (when retraining)
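The hourly check can be reduced to one decision function. This sketch assumes retraining requires both that 48 hours have elapsed since the last run and that at least 100 schedules are stored (the `RETRAIN_INTERVAL_HOURS` and `MIN_SCHEDULES_FOR_TRAINING` values from the service configuration); `should_retrain` is a hypothetical name.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained: datetime, now: datetime, schedule_count: int,
                   interval_hours: int = 48, min_schedules: int = 100) -> bool:
    """True when the interval has elapsed and enough schedules are available."""
    interval_elapsed = now - last_trained >= timedelta(hours=interval_hours)
    return interval_elapsed and schedule_count >= min_schedules
```

A background loop would call this once per 60-minute check interval and start training only when it returns `True`.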
---
## Data Flow Summary
### Schedule Generation Request
```
Client Request (150 bytes)
    ↓
FastAPI Parser (~1 KB in memory)
    ↓
Feature Extraction (80 bytes)
    ↓
ML Prediction (25 MB models loaded) OR OR-Tools (30 MB solver)
    ↓
Schedule Generation (45 KB output)
    ↓
JSON Serialization (~50 KB response)
    ↓
Storage (48 KB file)
```
**Total data processed**: ~50 KB per request
**Response time**: 0.1-5 seconds
---
### Model Training Cycle
```
Load Schedules (100 × 48 KB = 4.8 MB)
    ↓
Extract Features (100 × 80 bytes = 8 KB)
    ↓
Train 5 Models (5-10 minutes, 100% CPU)
    ↓
Save Models (25 MB pickle file)
    ↓
Update History (1 KB append)
```
**Total data processed**: ~30 MB
**Frequency**: Every 48 hours
---
## Configuration Data
### Service Configuration
**Location**: `SelfTrainService/config.py`
**Size**: ~5 KB
**Key Parameters**:
```python
{
    "RETRAIN_INTERVAL_HOURS": 48,
    "MIN_SCHEDULES_FOR_TRAINING": 100,
    "MODEL_TYPES": ["gradient_boosting", "xgboost", ...],
    "USE_ENSEMBLE": True,
    "ML_CONFIDENCE_THRESHOLD": 0.75,
    "FEATURES": [...],  # 10 feature names
    "EPOCHS": 100,
    "LEARNING_RATE": 0.001,
}
```
---
## Data Retention Policies
### Recommended Retention
**Schedule files**:
- Keep last 365 days (17.5 MB)
- Archive older to compressed storage
**Model snapshots**:
- Keep last 50 models (~1.25 GB)
- Delete older snapshots
- Keep 1 model per month for historical reference
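The snapshot policy can be sketched as a function that decides which snapshot timestamps survive a cleanup pass. It assumes the monthly reference copy is the newest snapshot of each calendar month, which the source does not specify; `snapshots_to_keep` is a hypothetical helper.

```python
from datetime import datetime


def snapshots_to_keep(timestamps: list, keep_latest: int = 50) -> set:
    """Keep the newest `keep_latest` snapshots plus one per calendar month."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    keep = set(ordered[:keep_latest])
    seen_months = set()
    for ts in ordered:
        month = (ts.year, ts.month)
        if month not in seen_months:
            seen_months.add(month)
            keep.add(ts)  # assumption: newest snapshot represents the month
    return keep


stamps = [
    datetime(2025, 1, 5), datetime(2025, 1, 20), datetime(2025, 2, 10),
    datetime(2025, 3, 1), datetime(2025, 3, 15),
]
kept = snapshots_to_keep(stamps, keep_latest=2)
```

In this small example the two March snapshots are kept as the latest two, and the newest January and February snapshots are kept as monthly references; any snapshot not in the returned set can be deleted.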
**Training history**:
- Keep all entries (grows slowly)
- Compress after 1000 entries
**Logs**:
- Application logs: 30 days
- Error logs: 90 days
- Audit logs: 1 year
---
## Scaling Considerations
### Horizontal Scaling
**API Service** (DataService):
- Stateless - easy to scale
- Load balancer distributes requests
- Each instance: ~250 MB memory
**ML Service** (SelfTrainService):
- Share model files via NFS/S3
- Only one instance should train (avoid conflicts)
- Multiple instances can serve predictions
### Vertical Scaling
**Memory requirements**:
- Minimum: 1 GB RAM
- Recommended: 2 GB RAM
- Optimal: 4 GB RAM (allows concurrent training + serving)
**CPU requirements**:
- Minimum: 1 core
- Recommended: 2 cores (1 for API, 1 for training)
- Optimal: 4 cores (parallel model training)
**Storage requirements**:
- Minimum: 5 GB
- Recommended: 20 GB
- Optimal: 50 GB (1-year retention)
---
## Performance Benchmarks
### Schedule Generation Performance
| Fleet Size | Algorithm | Time | Memory | Output Size |
|------------|-----------|------|--------|-------------|
| 25 trains | ML | 0.08s | 225 MB | 38 KB |
| 30 trains | ML | 0.10s | 225 MB | 45 KB |
| 40 trains | ML | 0.12s | 225 MB | 60 KB |
| 25 trains | OR-Tools | 1.2s | 30 MB | 38 KB |
| 30 trains | OR-Tools | 2.8s | 30 MB | 45 KB |
| 40 trains | OR-Tools | 4.5s | 30 MB | 60 KB |
| 25 trains | Greedy | 0.3s | 7 MB | 38 KB |
| 30 trains | Greedy | 0.5s | 7 MB | 45 KB |
| 40 trains | Greedy | 0.8s | 7 MB | 60 KB |
### Training Performance
| Dataset Size | Training Time | Memory | Model Size |
|--------------|---------------|--------|------------|
| 100 schedules | 3 min | 320 MB | 20 MB |
| 500 schedules | 8 min | 350 MB | 24 MB |
| 1000 schedules | 15 min | 400 MB | 28 MB |
---
**Document Version**: 1.0.0
**Last Updated**: November 2, 2025
**Maintained By**: ML-Service Team