Data Schemas & Service Specifications
Overview
This document details all data structures, schemas, API contracts, and data volume specifications for the Metro Train Scheduling Service.
Core Data Models
All models use Pydantic v2 for validation and serialization.
1. DaySchedule
Purpose: Complete daily schedule with all trainset assignments
```python
class DaySchedule(BaseModel):
    schedule_id: str                  # "KMRL-2025-10-25"
    date: str                         # "2025-10-25"
    route: Route                      # Route details
    trainsets: List[Trainset]         # All train assignments
    fleet_summary: FleetSummary       # Fleet statistics
    optimization_metrics: OptimizationMetrics
    alerts: List[Alert]               # Warnings/issues
    generated_at: datetime
    generated_by: str = "ML-Optimizer"
```
Size: ~45 KB per schedule (30 trains, full day)
Example:
```json
{
  "schedule_id": "KMRL-2025-10-25",
  "date": "2025-10-25",
  "route": {...},
  "trainsets": [...],
  "fleet_summary": {
    "total_trainsets": 30,
    "in_service": 24,
    "standby": 4,
    "maintenance": 2
  },
  "optimization_metrics": {
    "total_service_blocks": 156,
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12
  },
  "generated_at": "2025-10-25T04:30:00+05:30"
}
```
2. Trainset
Purpose: Individual train assignment and status
```python
class Trainset(BaseModel):
    trainset_id: str                  # "TS-001"
    status: TrainHealthStatus         # REVENUE_SERVICE, STANDBY, etc.
    depot_bay: str                    # "BAY-01"
    cumulative_km: int                # 145250
    readiness_score: float            # 0.0-1.0
    service_blocks: List[ServiceBlock]  # Trip assignments
    fitness_certificates: FitnessCertificates
    job_cards: JobCards
    branding: Branding
```
Size: ~1.5 KB per trainset
Status Enum:
```python
class TrainHealthStatus(str, Enum):
    REVENUE_SERVICE = "REVENUE_SERVICE"  # Active service
    STANDBY = "STANDBY"                  # Ready, not assigned
    MAINTENANCE = "MAINTENANCE"          # Under repair
    SCHEDULED_MAINTENANCE = "SCHEDULED_MAINTENANCE"
    UNAVAILABLE = "UNAVAILABLE"          # Out of service
```
Distribution (typical 30-train fleet):
- REVENUE_SERVICE: 22-24 trains (73-80%)
- STANDBY: 3-5 trains (10-17%)
- MAINTENANCE: 1-3 trains (3-10%)
- UNAVAILABLE: 0-2 trains (0-7%)
3. ServiceBlock
Purpose: Single trip assignment for a train
```python
class ServiceBlock(BaseModel):
    block_id: str                        # "BLK-001-01"
    start_time: str                      # "05:00"
    end_time: str                        # "05:45"
    start_station: str                   # "Aluva"
    end_station: str                     # "Pettah"
    direction: str                       # "UP" or "DOWN"
    distance_km: float                   # 25.612
    estimated_passengers: Optional[int]  # 450
    priority: str = "NORMAL"             # NORMAL, HIGH, PEAK
```
Size: ~250 bytes per service block
Daily Trips per Train:
- Peak service train: 6-8 trips
- Standard service: 4-6 trips
- Average: ~5.2 trips per active train
Total Service Blocks (30-train fleet):
- 24 active trains × 5.2 trips = ~125 service blocks/day
4. Route
Purpose: Metro line configuration
```python
class Route(BaseModel):
    route_id: str                  # "KMRL-LINE-01"
    name: str                      # "Aluva-Pettah Line"
    stations: List[Station]        # 25 stations
    total_distance_km: float       # 25.612 km
    avg_speed_kmh: int             # 32-38 km/h
    turnaround_time_minutes: int   # 8-12 minutes
```
KMRL Route Details:
- Stations: 25 (Aluva to Pettah)
- Distance: 25.612 km
- Average Speed: 35 km/h
- One-way Time: ~44 minutes
- Round Trip: ~100 minutes (including turnarounds)
5. Station
Purpose: Individual station on route
```python
class Station(BaseModel):
    station_id: str                 # "STN-001"
    name: str                       # "Aluva"
    code: str                       # "ALV"
    distance_from_start_km: float   # 0.0
    platform_count: int             # 2
    facilities: List[str]           # ["PARKING", "ELEVATOR"]
```
Size: ~200 bytes per station
Total Stations: 25 (fixed)
6. FitnessCertificates
Purpose: Regulatory compliance tracking
```python
class FitnessCertificates(BaseModel):
    rolling_stock: FitnessCertificate  # Train body/chassis
    signalling: FitnessCertificate     # Signal systems
    telecom: FitnessCertificate        # Communication systems

class FitnessCertificate(BaseModel):
    valid_until: str                   # "2025-12-31"
    status: CertificateStatus          # VALID, EXPIRING_SOON, EXPIRED

class CertificateStatus(str, Enum):
    VALID = "VALID"                    # > 30 days remaining
    EXPIRING_SOON = "EXPIRING_SOON"    # 7-30 days remaining
    EXPIRED = "EXPIRED"                # Past expiry date
```
Validation Rules:
- Trains with EXPIRED certificates: status = UNAVAILABLE
- Trains with EXPIRING_SOON: flagged in alerts, can operate
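The status rules above can be sketched as a small derivation function. This is a sketch, not the service's actual code; the handling of the 0-6 day window (treated here as EXPIRING_SOON) is an assumption, since the source only specifies 7-30 days.

```python
from datetime import date
from enum import Enum

class CertificateStatus(str, Enum):
    VALID = "VALID"                  # > 30 days remaining
    EXPIRING_SOON = "EXPIRING_SOON"  # 30 days or fewer remaining (assumption: includes 0-6 days)
    EXPIRED = "EXPIRED"              # past expiry date

def certificate_status(valid_until: str, today: date) -> CertificateStatus:
    """Derive a certificate status from its expiry date string ("YYYY-MM-DD")."""
    days_left = (date.fromisoformat(valid_until) - today).days
    if days_left < 0:
        return CertificateStatus.EXPIRED
    if days_left <= 30:
        return CertificateStatus.EXPIRING_SOON
    return CertificateStatus.VALID
```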
7. JobCards & Maintenance
Purpose: Maintenance tracking
```python
class JobCards(BaseModel):
    open: int            # Number of open job cards
    blocking: List[str]  # Critical issues: ["BRAKE_FAULT"]

# Example maintenance reasons
UNAVAILABLE_REASONS = [
    "SCHEDULED_MAINTENANCE",
    "BRAKE_SYSTEM_REPAIR",
    "HVAC_REPLACEMENT",
    "BOGIE_OVERHAUL",
    "ELECTRICAL_FAULT",
    "ACCIDENT_DAMAGE",
    "PANTOGRAPH_REPAIR",
    "DOOR_SYSTEM_FAULT"
]
```
Impact on Scheduling:
- 0 open cards: readiness = 1.0
- 1-2 cards: readiness = 0.9
- 3-4 cards: readiness = 0.7
- 5+ cards: readiness = 0.5, likely maintenance status
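The tiers above translate directly into a lookup. A minimal sketch of the stated mapping; the real optimizer may blend other factors into the final readiness_score:

```python
def readiness_from_job_cards(open_cards: int) -> float:
    """Base readiness score implied by the number of open job cards."""
    if open_cards == 0:
        return 1.0
    if open_cards <= 2:
        return 0.9
    if open_cards <= 4:
        return 0.7
    return 0.5  # 5+ cards: likely MAINTENANCE status
```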
8. Branding
Purpose: Advertisement tracking
```python
class Branding(BaseModel):
    advertiser: str                # "COCACOLA-2024"
    contract_hours_remaining: int  # 450 hours
    exposure_priority: str         # LOW, MEDIUM, HIGH, CRITICAL

# Available advertisers
ADVERTISERS = [
    "COCACOLA-2024",
    "FLIPKART-FESTIVE",
    "AMAZON-PRIME",
    "RELIANCE-JIO",
    "TATA-MOTORS",
    "SAMSUNG-GALAXY",
    "NONE"
]
```
Priority Weights (for optimization):
- CRITICAL: 4 points
- HIGH: 3 points
- MEDIUM: 2 points
- LOW: 1 point
- NONE: 0 points
Scheduling Strategy:
- HIGH/CRITICAL branded trains prioritized for peak hours
- Maximizes advertiser visibility during high-traffic periods
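The weights and peak-hour strategy can be sketched as a simple ranking step (the function name `rank_for_peak_blocks` is illustrative, not from the source):

```python
PRIORITY_WEIGHTS = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1, "NONE": 0}

def rank_for_peak_blocks(trains: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Order (trainset_id, exposure_priority) pairs so that high-weight
    branding is considered first when filling peak-hour service blocks."""
    return sorted(trains, key=lambda t: PRIORITY_WEIGHTS[t[1]], reverse=True)
```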
9. FleetSummary
Purpose: Aggregated fleet statistics
```python
class FleetSummary(BaseModel):
    total_trainsets: int         # 30
    in_service: int              # 24
    standby: int                 # 4
    maintenance: int             # 2
    unavailable: int             # 0
    availability_percent: float  # 93.33
    total_mileage_today: int     # 3200 km
    avg_trips_per_train: float   # 5.2
```
Size: ~300 bytes
Key Metrics:
- Availability %: (in_service + standby) / total × 100
- Target Availability: ≥ 90%
- Service Ratio: in_service / (in_service + standby)
- Target Service Ratio: 85-90%
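The two metrics above can be computed directly from the FleetSummary counts (a sketch; rounding to two decimals matches the example values):

```python
def availability_percent(in_service: int, standby: int, total: int) -> float:
    """(in_service + standby) / total * 100, as reported in FleetSummary."""
    return round((in_service + standby) / total * 100, 2)

def service_ratio(in_service: int, standby: int) -> float:
    """Share of available trains actually assigned to revenue service."""
    return in_service / (in_service + standby)
```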
10. OptimizationMetrics
Purpose: Optimization quality measures
```python
class OptimizationMetrics(BaseModel):
    total_service_blocks: int              # 125
    avg_readiness_score: float             # 0.87
    mileage_variance_coefficient: float    # 0.12
    branding_sla_compliance: float         # 0.95
    fitness_expiry_violations: int         # 0
    execution_time_ms: int                 # 1250
    algorithm_used: str                    # "ensemble_ml" or "or_tools"
    confidence_score: Optional[float]      # 0.89 (if ML used)
```
Size: ~250 bytes
Quality Thresholds:
- avg_readiness_score: ≥ 0.80
- mileage_variance_coefficient: < 0.15
- branding_sla_compliance: ≥ 0.90
- fitness_expiry_violations: 0
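A quality-gate check over these thresholds might look like the following (a sketch; the function name is an assumption, not part of the service API):

```python
def passes_quality_gates(m: dict) -> bool:
    """Check an OptimizationMetrics dict against the documented thresholds."""
    return (m["avg_readiness_score"] >= 0.80
            and m["mileage_variance_coefficient"] < 0.15
            and m["branding_sla_compliance"] >= 0.90
            and m["fitness_expiry_violations"] == 0)
```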
API Schemas
Request: ScheduleRequest
Endpoint: POST /api/v1/generate
```python
class ScheduleRequest(BaseModel):
    date: str                     # "2025-10-25"
    num_trains: int = 25          # 25-40
    num_stations: int = 25        # Fixed for KMRL
    min_service_trains: int = 22  # Minimum active
    min_standby_trains: int = 3   # Minimum backup
    # Optional overrides
    peak_hours: Optional[List[int]] = None  # [7,8,9,17,18,19]
    force_optimization: bool = False        # Skip ML, use OR-Tools
```
Size: ~150 bytes per request
Validation:
- num_trains: 25 ≤ n ≤ 40
- num_stations: fixed at 25 (KMRL-specific)
- min_service_trains: ≤ num_trains - 3
- min_standby_trains: ≥ 2
Example:
```json
{
  "date": "2025-10-25",
  "num_trains": 30,
  "num_stations": 25,
  "min_service_trains": 24,
  "min_standby_trains": 4
}
```
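The validation rules can be expressed as a dependency-free check that mirrors the 400 error shape shown below. This is an illustrative sketch; the service itself presumably enforces these rules via Pydantic validators:

```python
def validate_schedule_request(req: dict) -> dict:
    """Return a field -> message map of violations (empty when valid)."""
    errors = {}
    n = req.get("num_trains", 25)
    if not 25 <= n <= 40:
        errors["num_trains"] = "Must be between 25 and 40"
    if req.get("num_stations", 25) != 25:
        errors["num_stations"] = "Fixed at 25 for KMRL"
    if req.get("min_service_trains", 22) > n - 3:
        errors["min_service_trains"] = "Must be <= num_trains - 3"
    if req.get("min_standby_trains", 3) < 2:
        errors["min_standby_trains"] = "Must be >= 2"
    return errors
```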
Response: DaySchedule
Status: 200 OK
Content-Type: application/json
Size: 45-55 KB (depends on fleet size)
Headers:
X-Algorithm-Used: ensemble_ml | or_tools | greedy
X-Confidence-Score: 0.89 (if ML)
X-Execution-Time-Ms: 1250
Error Responses
400 Bad Request:
```json
{
  "error": "Validation Error",
  "details": {
    "num_trains": "Must be between 25 and 40"
  }
}
```
500 Internal Server Error:
```json
{
  "error": "Optimization Failed",
  "message": "Unable to find feasible schedule",
  "timestamp": "2025-10-25T10:30:00Z"
}
```
Database Schemas
Schedule Storage (JSON Files)
Location: data/schedules/
Naming: {schedule_id}_{timestamp}.json
Example: KMRL-2025-10-25_20251025_043000.json
Structure:
```json
{
  "schedule": {DaySchedule},
  "metadata": {
    "recorded_at": "2025-10-25T04:30:00",
    "quality_score": 87.5,
    "algorithm_used": "ensemble_ml",
    "confidence": 0.89
  },
  "saved_at": "2025-10-25T04:30:15"
}
```
Size per File: ~48 KB
Model Storage (Pickle Files)
Location: models/
Files:
- models_latest.pkl: Current ensemble (all 5 models)
- models_{timestamp}.pkl: Historical snapshots
- training_history.json: Training metrics log
Model File Contents:
```python
{
    "models": {
        "gradient_boosting": GradientBoostingRegressor(),
        "random_forest": RandomForestRegressor(),
        "xgboost": XGBRegressor(),
        "lightgbm": LGBMRegressor(),
        "catboost": CatBoostRegressor()
    },
    "ensemble_weights": {
        "xgboost": 0.215,
        "lightgbm": 0.208,
        ...
    },
    "best_model_name": "xgboost",
    "last_trained": datetime(2025, 10, 25, 4, 30),
    "config": {
        "version": "v1.0.0",
        "features": [...],
        "models_trained": [...]
    }
}
```
Size: ~15-25 MB (all 5 models combined)
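The stored ensemble_weights blend per-model predictions; a minimal sketch, assuming a weighted average (the combination rule and function name are assumptions, not confirmed by the source):

```python
def ensemble_predict(predictions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-model predictions using the saved ensemble_weights."""
    total = sum(weights.values())
    return sum(predictions[name] * w for name, w in weights.items()) / total
```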
Training History (JSON)
Location: models/training_history.json
Structure:
```json
[
  {
    "timestamp": "2025-10-23T12:00:00",
    "metrics": {
      "gradient_boosting": {
        "train_r2": 0.8912,
        "test_r2": 0.8234,
        "test_rmse": 13.45
      },
      ...
    },
    "best_model": "xgboost",
    "ensemble_weights": {...},
    "config": {
      "models_trained": [...],
      "version": "v1.0.0"
    }
  },
  ...
]
```
Growth: ~1 KB per training run
Retention: All training runs (pruned after 1000 entries)
Data Volume & Storage
Production Estimates
Daily Operations
Per Day (single schedule generation):
- 1 schedule file: ~48 KB
- API request/response: ~50 KB total
- Logs: ~10 KB
Total per day: ~108 KB
Monthly Operations (30 days)
Schedule files:
- 30 schedules × 48 KB = 1.44 MB
Model files:
- Retraining every 48 hours = ~15 retrainings/month
- 15 × 25 MB = 375 MB
Training history:
- 15 entries × 1 KB = 15 KB
Total per month: ~377 MB
Annual Storage (1 year)
Schedule data:
- 365 schedules × 48 KB = 17.5 MB
Model snapshots:
- 182 retrainings × 25 MB = 4.55 GB
Training history:
- 182 KB
Total per year: ~4.57 GB
With retention policy (keep last 100 schedules, 50 models):
- Schedules: 100 × 48 KB = 4.8 MB
- Models: 50 × 25 MB = 1.25 GB
- History: 182 KB
Total with retention: ~1.26 GB
ML Training Data Requirements
Minimum Training Dataset
Initial training: 100 schedules
- Storage: 100 × 48 KB = 4.8 MB
- Generation time: ~15 minutes (automated)
- Training time: 5-10 minutes
Optimal training: 500 schedules
- Storage: 500 × 48 KB = 24 MB
- Provides better generalization
- Covers more edge cases
Feature Matrix Size
Per schedule: 10 features × 8 bytes (float64) = 80 bytes
Training set (100 schedules):
- Features (X): 100 × 80 bytes = 8 KB
- Target (y): 100 × 8 bytes = 800 bytes
- Total: ~9 KB (minimal)
Full dataset (1000 schedules):
- Features: 80 KB
- Target: 8 KB
- Total: ~88 KB
Memory during training:
- Dataset: ~88 KB
- Models (5 × ~5 MB): ~25 MB
- Working memory: ~50 MB
- Total: ~75 MB
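The sizing arithmetic above reduces to a one-line calculation (a sketch; raw float64 sizes only, ignoring any container overhead):

```python
def feature_matrix_bytes(n_schedules: int, n_features: int = 10,
                         dtype_bytes: int = 8) -> tuple[int, int]:
    """Raw sizes in bytes of the feature matrix X and target vector y."""
    return n_schedules * n_features * dtype_bytes, n_schedules * dtype_bytes
```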
Optimization Service Resource Usage
OR-Tools Optimization
Input data:
- 30 trains × 1.5 KB = 45 KB
- 25 stations × 200 bytes = 5 KB
- Constraints: ~10 KB
- Total input: ~60 KB
Memory usage:
- Solver state: ~10 MB
- Solution space: ~20 MB
- Peak memory: ~30 MB
Execution time: 1-5 seconds (CPU-bound)
CPU utilization: 100% single core
ML Ensemble Prediction
Input data:
- Feature vector: 10 × 8 bytes = 80 bytes
- Total input: < 1 KB
Memory usage:
- Loaded models: ~25 MB (shared)
- Prediction workspace: ~1 MB
- Peak memory: ~26 MB
Execution time: 50-100 milliseconds
CPU utilization: 20-30% single core
Greedy Optimization
Input data: ~60 KB (same as OR-Tools)
Memory usage:
- State tracking: ~5 MB
- Priority queue: ~2 MB
- Peak memory: ~7 MB
Execution time: < 1 second
CPU utilization: 50-70% single core
Service Resource Usage
DataService (FastAPI)
Base memory: 150 MB (Python + FastAPI + dependencies)
Per request overhead: ~10 MB
Concurrent requests (typical): 1-5
Total memory (under load): 200-250 MB
Disk I/O:
- Read: Minimal (configuration only)
- Write: ~50 KB per schedule generated
Network:
- Inbound: ~150 bytes (request)
- Outbound: ~50 KB (response)
SelfTrainService
Base memory: 200 MB (Python + ML libraries)
During training:
- Dataset loading: +20 MB
- Model training: +100 MB (peak)
- Total during training: ~320 MB
During inference (loaded models):
- Models in memory: +25 MB
- Total during inference: ~225 MB
Disk I/O:
- Read: 5 MB (load schedules)
- Write: 25 MB (save models)
Frequency:
- Training: Every 48 hours
- Inference: Per schedule request (if confidence ≥ 75%)
Retraining Service (Background)
Memory: ~50 MB (idle), ~320 MB (during training)
CPU:
- Idle: < 1%
- Training: 100% (5-10 minutes every 48 hours)
Disk I/O:
- Check interval: Every 60 minutes
- Read: ~1 MB (check schedule count)
- Write: ~25 MB (when retraining)
Data Flow Summary
Schedule Generation Request
```
Client Request (150 bytes)
    ↓
FastAPI Parser (~1 KB in memory)
    ↓
Feature Extraction (80 bytes)
    ↓
ML Prediction (25 MB models loaded) OR OR-Tools (30 MB solver)
    ↓
Schedule Generation (45 KB output)
    ↓
JSON Serialization (~50 KB response)
    ↓
Storage (48 KB file)
```
Total data processed: ~50 KB per request
Response time: 0.1-5 seconds
Model Training Cycle
```
Load Schedules (100 × 48 KB = 4.8 MB)
    ↓
Extract Features (100 × 80 bytes = 8 KB)
    ↓
Train 5 Models (5-10 minutes, 100% CPU)
    ↓
Save Models (25 MB pickle file)
    ↓
Update History (1 KB append)
```
Total data processed: ~30 MB
Frequency: Every 48 hours
Configuration Data
Service Configuration
Location: SelfTrainService/config.py
Size: ~5 KB
Key Parameters:
```json
{
  "RETRAIN_INTERVAL_HOURS": 48,
  "MIN_SCHEDULES_FOR_TRAINING": 100,
  "MODEL_TYPES": ["gradient_boosting", "xgboost", ...],
  "USE_ENSEMBLE": true,
  "ML_CONFIDENCE_THRESHOLD": 0.75,
  "FEATURES": [10 feature names],
  "EPOCHS": 100,
  "LEARNING_RATE": 0.001
}
```
Data Retention Policies
Recommended Retention
Schedule files:
- Keep last 365 days (17.5 MB)
- Archive older to compressed storage
Model snapshots:
- Keep last 50 models (~1.25 GB)
- Delete older snapshots
- Keep 1 model per month for historical reference
Training history:
- Keep all entries (grows slowly)
- Compress after 1000 entries
Logs:
- Application logs: 30 days
- Error logs: 90 days
- Audit logs: 1 year
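A keep-last-N retention pass for model snapshots can be sketched as a pure partition function (illustrative; it ignores the keep-one-per-month exception, and relies on the fact that timestamped names like models_YYYYMMDD_HHMMSS.pkl sort chronologically):

```python
def split_retention(snapshot_names: list[str], keep: int = 50) -> tuple[list[str], list[str]]:
    """Partition snapshot filenames into (kept, deleted) under a keep-last-N policy."""
    ordered = sorted(snapshot_names)  # lexical order == chronological for these names
    if keep >= len(ordered):
        return ordered, []
    return ordered[-keep:], ordered[:-keep]
```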
Scaling Considerations
Horizontal Scaling
API Service (DataService):
- Stateless - easy to scale
- Load balancer distributes requests
- Each instance: ~250 MB memory
ML Service (SelfTrainService):
- Share model files via NFS/S3
- Only one instance should train (avoid conflicts)
- Multiple instances can serve predictions
Vertical Scaling
Memory requirements:
- Minimum: 1 GB RAM
- Recommended: 2 GB RAM
- Optimal: 4 GB RAM (allows concurrent training + serving)
CPU requirements:
- Minimum: 1 core
- Recommended: 2 cores (1 for API, 1 for training)
- Optimal: 4 cores (parallel model training)
Storage requirements:
- Minimum: 5 GB
- Recommended: 20 GB
- Optimal: 50 GB (1-year retention)
Performance Benchmarks
Schedule Generation Performance
| Fleet Size | Algorithm | Time | Memory | Output Size |
|---|---|---|---|---|
| 25 trains | ML | 0.08s | 225 MB | 38 KB |
| 30 trains | ML | 0.10s | 225 MB | 45 KB |
| 40 trains | ML | 0.12s | 225 MB | 60 KB |
| 25 trains | OR-Tools | 1.2s | 30 MB | 38 KB |
| 30 trains | OR-Tools | 2.8s | 30 MB | 45 KB |
| 40 trains | OR-Tools | 4.5s | 30 MB | 60 KB |
| 25 trains | Greedy | 0.3s | 7 MB | 38 KB |
| 30 trains | Greedy | 0.5s | 7 MB | 45 KB |
| 40 trains | Greedy | 0.8s | 7 MB | 60 KB |
Training Performance
| Dataset Size | Training Time | Memory | Model Size |
|---|---|---|---|
| 100 schedules | 3 min | 320 MB | 20 MB |
| 500 schedules | 8 min | 350 MB | 24 MB |
| 1000 schedules | 15 min | 400 MB | 28 MB |
Document Version: 1.0.0
Last Updated: November 2, 2025
Maintained By: ML-Service Team