Commit 8720c05 · update docs
Parent(s): e7bbf32

Files changed:
- docs/algorithms.md +604 -0
- docs/data-schemas.md +851 -0
- docs/integrate.md +0 -0
docs/algorithms.md
ADDED
@@ -0,0 +1,604 @@
# Algorithms & Optimization Techniques

## Overview

This document describes all algorithms, optimization techniques, and machine learning models used in the Metro Train Scheduling Service.

---

## Table of Contents

1. [Machine Learning Algorithms](#machine-learning-algorithms)
2. [Optimization Algorithms](#optimization-algorithms)
3. [Hybrid Approach](#hybrid-approach)
4. [Feature Engineering](#feature-engineering)
5. [Performance Metrics](#performance-metrics)

---
## Machine Learning Algorithms

### Ensemble Learning Architecture

The system employs a **5-model ensemble** for schedule quality prediction:

#### 1. Gradient Boosting (Scikit-learn)
**Algorithm**: Sequential ensemble of weak learners (decision trees)

**Parameters**:
- `n_estimators`: 100 trees
- `learning_rate`: 0.001
- `loss`: Least squares regression
- `max_depth`: Auto (unlimited)

**Strengths**:
- Excellent baseline performance
- Handles non-linear relationships well
- Robust to outliers

**Use Case**: Primary baseline model for schedule quality prediction

---

#### 2. Random Forest (Scikit-learn)
**Algorithm**: Bagging ensemble of decision trees

**Parameters**:
- `n_estimators`: 100 trees
- `max_features`: Auto (√n_features)
- `n_jobs`: -1 (parallel processing)
- `random_state`: 42

**Strengths**:
- Low variance through averaging
- Handles missing data well
- Feature importance ranking

**Use Case**: Robust predictions with feature importance insights

---

#### 3. XGBoost (Extreme Gradient Boosting)
**Algorithm**: Optimized distributed gradient boosting

**Parameters**:
- `n_estimators`: 100
- `learning_rate`: 0.001
- `objective`: reg:squarederror
- `tree_method`: Auto
- `verbosity`: 0

**Technical Details**:
- Uses second-order gradients (Newton-Raphson)
- L1/L2 regularization to prevent overfitting
- Parallel tree construction
- Cache-aware block structure

**Strengths**:
- Typically the best single-model performance
- Fast training and prediction
- Built-in cross-validation

**Use Case**: High-performance predictions; often selected as the best model

---

#### 4. LightGBM (Microsoft)
**Algorithm**: Gradient-based One-Side Sampling (GOSS) + Exclusive Feature Bundling (EFB)

**Parameters**:
- `n_estimators`: 100
- `learning_rate`: 0.001
- `boosting_type`: gbdt
- `verbose`: -1

**Technical Details**:
- **GOSS**: Keeps instances with large gradients, randomly samples those with small gradients
- **EFB**: Bundles mutually exclusive features to reduce dimensionality
- Leaf-wise tree growth (vs. level-wise)
- Histogram-based splitting

**Strengths**:
- Fastest training time
- Low memory usage
- Handles large datasets efficiently

**Use Case**: Fast iteration during development, efficient production inference

---

#### 5. CatBoost (Yandex)
**Algorithm**: Ordered boosting with categorical feature handling

**Parameters**:
- `iterations`: 100
- `learning_rate`: 0.001
- `loss_function`: RMSE
- `verbose`: False

**Technical Details**:
- **Ordered Boosting**: Prevents target leakage in gradient calculation
- **Symmetric Trees**: Balanced tree structure
- Native categorical feature support
- Minimal hyperparameter tuning needed

**Strengths**:
- Strong out-of-the-box performance
- Robust to overfitting
- Excellent with categorical data

**Use Case**: Robust predictions with minimal tuning

---
### Ensemble Strategy

#### Weighted Voting
```python
# Weight calculation (performance-based)
weight_i = R²_score_i / Σ(R²_scores)

# Final prediction
prediction = Σ(weight_i × prediction_i)
```

**Example Weights**:
```json
{
  "xgboost": 0.215,
  "lightgbm": 0.208,
  "gradient_boosting": 0.195,
  "catboost": 0.195,
  "random_forest": 0.187
}
```

`xgboost` carries the largest weight here because it is the best performer; the weights sum to 1.0.

#### Confidence Calculation
```python
# Ensemble confidence based on model agreement
predictions = [model.predict(features) for model in models]
std_dev = np.std(predictions)

# High agreement → high confidence
confidence = max(0.5, min(1.0, 1.0 - (std_dev / 50)))
```

**Confidence Threshold**: 0.75 (75%)
- If confidence ≥ 75%: use the ML prediction
- If confidence < 75%: fall back to optimization
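The two formulas above combine into a small helper. The R² values and per-model predictions below are illustrative stand-ins, not measured results:

```python
from statistics import pstdev

def ensemble_predict(r2_scores: dict, predictions: dict) -> tuple:
    """Weighted-vote prediction plus agreement-based confidence."""
    total_r2 = sum(r2_scores.values())
    weights = {name: r2 / total_r2 for name, r2 in r2_scores.items()}
    prediction = sum(weights[name] * predictions[name] for name in weights)
    # Model agreement: a small spread between models -> high confidence
    std_dev = pstdev(predictions.values())
    confidence = max(0.5, min(1.0, 1.0 - std_dev / 50))
    return prediction, confidence

# Illustrative numbers only
r2 = {"xgboost": 0.854, "lightgbm": 0.847, "gradient_boosting": 0.823,
      "catboost": 0.840, "random_forest": 0.812}
preds = {"xgboost": 82.0, "lightgbm": 80.5, "gradient_boosting": 78.0,
         "catboost": 81.0, "random_forest": 77.5}

score, conf = ensemble_predict(r2, preds)
use_ml = conf >= 0.75  # otherwise fall back to optimization
```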

---
## Optimization Algorithms

### Constraint Programming (OR-Tools)

**Algorithm**: Google OR-Tools CP-SAT Solver

**Problem Type**: Constraint Satisfaction Problem (CSP)

#### Variables
```python
# One Boolean decision variable per (trainset, time slot) pair
for train in trainsets:
    for time_slot in operational_hours:
        is_assigned[train, time_slot] = BoolVar()
```

#### Constraints

**1. Fleet Coverage**
```
Σ(active_trains_at_time_t) ≥ min_service_trains
∀ t ∈ peak_hours
```

**2. Turnaround Time**
```
end_time[trip_i] + turnaround_time ≤ start_time[trip_i+1]
∀ consecutive trips of the same train
```

**3. Maintenance Windows**
```
if train.status == MAINTENANCE:
    is_assigned[train, t] = False
    ∀ t ∈ maintenance_window
```

**4. Fitness Certificates**
```
if certificate_expired(train):
    is_assigned[train, t] = False
    ∀ t
```

**5. Mileage Balancing**
```
min_mileage ≤ daily_km[train] ≤ max_mileage
∀ trains in AVAILABLE status
```

**6. Depot Capacity**
```
Σ(trains_in_depot_at_t) ≤ depot_capacity
∀ t ∈ non_operational_hours
```

#### Objective Functions

**Multi-objective optimization** with a weighted sum:

```python
objective = (
    0.35 × maximize(service_coverage) +
    0.25 × minimize(mileage_variance) +
    0.20 × maximize(availability_utilization) +
    0.10 × minimize(certificate_violations) +
    0.10 × maximize(branding_exposure)
)
```

**Component Details**:

1. **Service Coverage** (35% weight)
   - Maximize trains in service during peak hours
   - Ensure minimum standby capacity

2. **Mileage Variance** (25% weight)
   - Balance cumulative mileage across the fleet
   - Prevent overuse of specific trainsets
   - Formula: `1 / (1 + coefficient_of_variation)`

3. **Availability Utilization** (20% weight)
   - Maximize usage of available healthy trains
   - Minimize idle time for service-ready trainsets

4. **Certificate Violations** (10% weight)
   - Minimize assignments with expiring certificates
   - Penalize near-expiry usage (< 30 days)

5. **Branding Exposure** (10% weight)
   - Prioritize branded trains during peak hours
   - Maximize visibility of high-priority advertisers
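At production scale the model above goes to the CP-SAT solver; the miniature below brute-forces the same Boolean structure (assignment variables, maintenance and fleet-coverage constraints, a coverage objective) on a 3-train, 2-slot toy instance so the shape of the CSP is easy to follow. The train IDs and bounds are illustrative, not from the real depot:

```python
from itertools import product

trains = ["TS-001", "TS-002", "TS-003"]
slots = ["peak_am", "peak_pm"]
in_maintenance = {"TS-003"}    # hard constraint: never assignable
min_service_trains = 2         # fleet-coverage constraint per slot

best_value, best_plan = None, None
# Enumerate every 0/1 assignment of is_assigned[train, slot]
for bits in product([0, 1], repeat=len(trains) * len(slots)):
    assign = {(t, s): bits[i * len(slots) + j]
              for i, t in enumerate(trains) for j, s in enumerate(slots)}
    # Maintenance windows: forbidden assignments
    if any(assign[t, s] for t in in_maintenance for s in slots):
        continue
    # Fleet coverage: enough active trains in every slot
    if any(sum(assign[t, s] for t in trains) < min_service_trains for s in slots):
        continue
    # Toy objective: maximize total service coverage
    value = sum(assign.values())
    if best_value is None or value > best_value:
        best_value, best_plan = value, assign

# With TS-003 out, the optimum runs TS-001 and TS-002 in both slots
```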

---
### Greedy Optimization

**Algorithm**: Priority-based greedy assignment

**Location**: `greedyOptim/` folder

#### Priority Scoring
```python
priority_score = (
    0.40 × readiness_score +
    0.25 × (1 - normalized_mileage) +
    0.20 × certificate_validity_days +
    0.10 × branding_priority +
    0.05 × maintenance_gap_days
)
```

#### Assignment Process

1. **Sort trains by priority** (descending)
2. **Iterate through time slots** (5 AM → 11 PM)
3. **For each slot**:
   - Select the highest-priority available train
   - Check constraints (turnaround, capacity)
   - Assign if feasible
   - Update train state (location, mileage)
4. **Fallback**: If no train is available, flag the slot as a gap

**Complexity**: O(n × t), where n = trains and t = time slots

**Advantages**:
- Fast execution (< 1 second for 40 trains)
- Interpretable decisions
- Good for real-time adjustments

**Disadvantages**:
- May not find the global optimum
- Sensitive to the initial priority weights
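The priority formula and the sort in step 1 can be sketched as follows. The trains and their attribute values are invented, and all terms are assumed normalized to a 0-1 scale here (the document does not state the units of the certificate and maintenance-gap terms):

```python
def priority_score(t: dict) -> float:
    """Weighted priority used to order trains for greedy assignment."""
    return (0.40 * t["readiness_score"]
            + 0.25 * (1 - t["normalized_mileage"])
            + 0.20 * t["certificate_validity"]   # assumed normalized 0-1
            + 0.10 * t["branding_priority"]
            + 0.05 * t["maintenance_gap"])       # assumed normalized 0-1

trains = [
    {"id": "TS-001", "readiness_score": 0.95, "normalized_mileage": 0.40,
     "certificate_validity": 0.90, "branding_priority": 0.2, "maintenance_gap": 0.5},
    {"id": "TS-002", "readiness_score": 0.70, "normalized_mileage": 0.85,
     "certificate_validity": 0.30, "branding_priority": 0.9, "maintenance_gap": 0.8},
]

# Step 1 of the assignment process: sort by priority, descending
ranked = sorted(trains, key=priority_score, reverse=True)
```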

---
### Genetic Algorithm

**Algorithm**: Evolutionary optimization

**Location**: `greedyOptim/genetic_algorithm.py`

#### Parameters
- **Population size**: 100 schedules
- **Generations**: 50 iterations
- **Crossover rate**: 0.8
- **Mutation rate**: 0.1
- **Selection**: Tournament (k=3)

#### Chromosome Encoding
```python
# Each chromosome = a complete schedule
chromosome = [train_id_for_trip_0, train_id_for_trip_1, ..., train_id_for_trip_n]
```

#### Fitness Function
```python
fitness = (
    service_quality_score -
    constraint_violations × penalty_weight
)
```

#### Genetic Operators

**1. Crossover (Single-point)**
```python
parent1 = [T1, T2, T3, T4, T5, T6]
parent2 = [T3, T1, T4, T2, T6, T5]
# → crossover at position 3
child1 = [T1, T2, T3, T2, T6, T5]
child2 = [T3, T1, T4, T4, T5, T6]
```

**2. Mutation (Swap)**
```python
# Randomly swap two trip assignments
schedule = [T1, T2, T3, T4, T5]
# → swap positions 1 and 3
mutated = [T1, T4, T3, T2, T5]
```

**Termination**: Maximum generations reached, or convergence (no improvement for 10 generations)
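Both operators are a few lines each. This sketch reproduces the worked examples above; the crossover point and swap positions are fixed rather than random so the output matches the illustrations, and train IDs are plain strings:

```python
def single_point_crossover(p1, p2, point):
    """Single-point crossover: children exchange tails after `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def swap_mutation(chromosome, i, j):
    """Swap mutation: exchange two trip assignments."""
    mutated = chromosome[:]
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

parent1 = ["T1", "T2", "T3", "T4", "T5", "T6"]
parent2 = ["T3", "T1", "T4", "T2", "T6", "T5"]
child1, child2 = single_point_crossover(parent1, parent2, 3)
# child1 == ["T1", "T2", "T3", "T2", "T6", "T5"]

mutated = swap_mutation(["T1", "T2", "T3", "T4", "T5"], 1, 3)
# mutated == ["T1", "T4", "T3", "T2", "T5"]
```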

---
## Hybrid Approach

### Decision Flow

```
┌──────────────────────┐
│   Schedule Request   │
└──────────┬───────────┘
           │
           ▼
┌─────────────────────────────────┐
│ Extract Features from Request   │
│ (num_trains, time, day, etc.)   │
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│ Ensemble ML Prediction          │
│ - All 5 models predict          │
│ - Weighted voting               │
│ - Calculate confidence          │
└──────────┬──────────────────────┘
           │
           ▼
   Confidence ≥ 75%?
           │
    ┌──────┴──────┐
    │             │
   YES            NO
    │             │
    ▼             ▼
┌────────┐  ┌───────────┐
│ Use    │  │ Use       │
│ ML     │  │ OR-Tools  │
│ Result │  │ Optimize  │
└───┬────┘  └─────┬─────┘
    └──────┬──────┘
           │
           ▼
   ┌──────────────┐
   │   Schedule   │
   └──────────────┘
```

### When ML is Used

**Conditions**:
1. ✅ Models trained (≥ 100 schedules)
2. ✅ Confidence score ≥ 75%
3. ✅ Hybrid mode enabled

**Typical Scenarios**:
- Standard 30-train fleet
- Normal operational parameters
- No major disruptions

### When Optimization is Used

**Conditions**:
- ❌ Low ML confidence (< 75%)
- ❌ Models not trained
- ❌ Unusual parameters (edge cases)
- ❌ First-time scheduling

**Typical Scenarios**:
- Fleet size changes (25→40 trains)
- New route configurations
- Major maintenance events
- System initialization
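The decision flow reduces to a small dispatcher. `predict_with_ensemble` and `optimize_with_cp_sat` are hypothetical stand-ins for the real components, passed in as callables here so the sketch is self-contained:

```python
CONFIDENCE_THRESHOLD = 0.75

def generate_schedule(request, models_trained, predict_with_ensemble, optimize_with_cp_sat):
    """Route a request to ML or optimization per the decision flow above."""
    if models_trained:
        schedule, confidence = predict_with_ensemble(request)
        if confidence >= CONFIDENCE_THRESHOLD:
            return schedule, "ml"
    # Low confidence, untrained models, and edge cases fall back here
    return optimize_with_cp_sat(request), "optimizer"

# Toy stand-ins for the real components
result, source = generate_schedule(
    request={"num_trains": 30},
    models_trained=True,
    predict_with_ensemble=lambda r: ("ml-schedule", 0.89),
    optimize_with_cp_sat=lambda r: "opt-schedule",
)
```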

---
## Feature Engineering

### Input Features (10 dimensions)

| Feature | Type | Range | Description |
|---------|------|-------|-------------|
| `num_trains` | Integer | 25-40 | Total fleet size |
| `num_available` | Integer | 20-38 | Trains in service/standby |
| `avg_readiness_score` | Float | 0.0-1.0 | Average train health |
| `total_mileage` | Integer | 100K-500K | Fleet cumulative km |
| `mileage_variance` | Float | 0-50K | Std dev of mileage (km) |
| `maintenance_count` | Integer | 0-10 | Trains in maintenance |
| `certificate_expiry_count` | Integer | 0-5 | Expiring certificates |
| `branding_priority_sum` | Integer | 0-100 | Total branding priority |
| `time_of_day` | Integer | 0-23 | Hour of day |
| `day_of_week` | Integer | 0-6 | Day (0 = Monday) |

### Target Variable

**Schedule Quality Score** (0-100):

```python
score = (
    avg_readiness * 30 +       # Health (30 points)
    availability_pct * 25 +    # Availability (25 points)
    (1 - mileage_var) * 20 +   # Balance (20 points)
    branding_sla * 15 +        # Branding (15 points)
    (10 - violations * 2)      # Compliance (10 points)
)
```

### Feature Scaling

All features are min-max normalized to the [0, 1] range before training:

```python
feature_normalized = (value - min) / (max - min)
```
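Min-max scaling and the quality score are both one-liners in practice. The ranges come from the feature table above; the sample input values are invented, and the readiness, availability, mileage-variance, and branding terms are assumed to be fractions in [0, 1]:

```python
def min_max_scale(value, lo, hi):
    """Normalize a raw feature into [0, 1] using its known range."""
    return (value - lo) / (hi - lo)

# Ranges taken from the feature table above
scaled_num_trains = min_max_scale(30, 25, 40)
scaled_time_of_day = min_max_scale(18, 0, 23)

def quality_score(avg_readiness, availability, mileage_var, branding_sla, violations):
    """Schedule quality target on a 0-100 scale (components sum to 100)."""
    return (avg_readiness * 30
            + availability * 25
            + (1 - mileage_var) * 20
            + branding_sla * 15
            + (10 - violations * 2))

score = quality_score(0.87, 0.93, 0.12, 0.9, 1)
```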

---
## Performance Metrics

### Model Evaluation

**Primary Metric**: R² Score (Coefficient of Determination)
- Range: (-∞, 1], in practice [0, 1]; higher is better
- Typical ensemble R²: 0.85-0.92

**Secondary Metric**: RMSE (Root Mean Squared Error)
- Range: [0, ∞), lower is better
- Typical ensemble RMSE: 8-15

**Training Split**: 80% train, 20% test

### Optimization Quality

**Metrics Tracked**:

1. **Service Coverage**: % of required hours covered
   - Target: ≥ 95%

2. **Fleet Utilization**: % of available trains used
   - Target: 85-95%

3. **Mileage Balance**: Coefficient of variation
   - Target: < 0.15 (15%)

4. **Constraint Violations**: Count of hard-constraint breaks
   - Target: 0

5. **Execution Time**: Algorithm runtime
   - ML: < 0.1 seconds
   - OR-Tools: 1-5 seconds
   - Genetic: 5-15 seconds
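The mileage-balance metric (coefficient of variation) can be computed directly from per-train mileages; the kilometre figures below are illustrative:

```python
from statistics import mean, pstdev

def mileage_cv(km_per_train):
    """Coefficient of variation: std dev / mean. Target is < 0.15."""
    return pstdev(km_per_train) / mean(km_per_train)

fleet_km = [145_250, 150_100, 139_800, 148_700, 142_950]
cv = mileage_cv(fleet_km)
balanced = cv < 0.15  # meets the mileage-balance target
```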

### Ensemble Performance Example

```json
{
  "gradient_boosting": {
    "train_r2": 0.8912,
    "test_r2": 0.8234,
    "test_rmse": 13.45
  },
  "xgboost": {
    "train_r2": 0.9234,
    "test_r2": 0.8543,
    "test_rmse": 12.34
  },
  "lightgbm": {
    "train_r2": 0.9156,
    "test_r2": 0.8467,
    "test_rmse": 12.67
  },
  "catboost": {
    "train_r2": 0.9087,
    "test_r2": 0.8401,
    "test_rmse": 12.89
  },
  "random_forest": {
    "train_r2": 0.8756,
    "test_r2": 0.8123,
    "test_rmse": 13.98
  },
  "ensemble": {
    "test_r2": 0.8621,
    "test_rmse": 11.87,
    "confidence": 0.89
  }
}
```
---

## Algorithm Selection Guide

| Use Case | Recommended Algorithm | Rationale |
|----------|----------------------|-----------|
| First-time scheduling | OR-Tools CP-SAT | No training data available |
| Standard operations | Ensemble ML | Fast, accurate predictions |
| Edge cases | OR-Tools CP-SAT | Guaranteed feasibility |
| Real-time updates | Greedy + ML | Sub-second performance |
| Offline planning | Genetic Algorithm | Exploration of the solution space |
| Development/testing | LightGBM | Fastest training iteration |
| Production inference | XGBoost | Best accuracy/speed trade-off |

---

## Future Enhancements

### Planned Improvements

1. **Reinforcement Learning**
   - Q-learning for dynamic scheduling
   - Reward: schedule quality over time

2. **Deep Learning**
   - LSTM for time-series prediction
   - Attention mechanisms for trip dependencies

3. **Multi-objective Pareto**
   - Generate a Pareto-optimal solution set
   - Allow the user to select the trade-off point

4. **Transfer Learning**
   - Pre-train on similar metro systems
   - Fine-tune for KMRL specifics

5. **Online Learning**
   - Incremental model updates
   - Adapt to changing patterns without full retraining

---

## References

### Libraries
- **Scikit-learn**: https://scikit-learn.org/
- **XGBoost**: https://xgboost.readthedocs.io/
- **LightGBM**: https://lightgbm.readthedocs.io/
- **CatBoost**: https://catboost.ai/
- **OR-Tools**: https://developers.google.com/optimization

### Papers
1. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System"
2. Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree"
3. Prokhorenkova, L., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features"

---

**Document Version**: 1.0.0
**Last Updated**: November 2, 2025
**Maintained By**: ML-Service Team
docs/data-schemas.md
ADDED
@@ -0,0 +1,851 @@
# Data Schemas & Service Specifications

## Overview

This document details all data structures, schemas, API contracts, and data volume specifications for the Metro Train Scheduling Service.

---

## Table of Contents

1. [Core Data Models](#core-data-models)
2. [API Schemas](#api-schemas)
3. [Database Schemas](#database-schemas)
4. [Data Volume & Storage](#data-volume--storage)
5. [Service Resource Usage](#service-resource-usage)

---
## Core Data Models

All models use **Pydantic v2** for validation and serialization.

### 1. DaySchedule

**Purpose**: Complete daily schedule with all trainset assignments

```python
class DaySchedule(BaseModel):
    schedule_id: str                  # "KMRL-2025-10-25"
    date: str                         # "2025-10-25"
    route: Route                      # Route details
    trainsets: List[Trainset]         # All train assignments
    fleet_summary: FleetSummary       # Fleet statistics
    optimization_metrics: OptimizationMetrics
    alerts: List[Alert]               # Warnings/issues
    generated_at: datetime
    generated_by: str = "ML-Optimizer"
```

**Size**: ~45 KB per schedule (30 trains, full day)

**Example**:
```json
{
  "schedule_id": "KMRL-2025-10-25",
  "date": "2025-10-25",
  "route": {...},
  "trainsets": [...],
  "fleet_summary": {
    "total_trainsets": 30,
    "in_service": 24,
    "standby": 4,
    "maintenance": 2
  },
  "optimization_metrics": {
    "total_service_blocks": 156,
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12
  },
  "generated_at": "2025-10-25T04:30:00+05:30"
}
```
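A minimal validation/serialization round trip with Pydantic v2 looks like this. `SchedulePayload` is a cut-down, hypothetical stand-in for `DaySchedule` (the nested `Route`, `Trainset`, etc. models are omitted so the snippet is self-contained):

```python
from datetime import datetime
from pydantic import BaseModel

class SchedulePayload(BaseModel):
    """Cut-down stand-in for DaySchedule, for illustration only."""
    schedule_id: str
    date: str
    generated_at: datetime
    generated_by: str = "ML-Optimizer"

# Pydantic v2 validates on construction and serializes via model_dump_json()
payload = SchedulePayload(
    schedule_id="KMRL-2025-10-25",
    date="2025-10-25",
    generated_at="2025-10-25T04:30:00+05:30",  # ISO string coerced to datetime
)
as_json = payload.model_dump_json()
```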

---
### 2. Trainset
|
| 67 |
+
|
| 68 |
+
**Purpose**: Individual train assignment and status
|
| 69 |
+
|
| 70 |
+
```python
|
| 71 |
+
class Trainset(BaseModel):
|
| 72 |
+
trainset_id: str # "TS-001"
|
| 73 |
+
status: TrainHealthStatus # REVENUE_SERVICE, STANDBY, etc.
|
| 74 |
+
depot_bay: str # "BAY-01"
|
| 75 |
+
cumulative_km: int # 145250
|
| 76 |
+
readiness_score: float # 0.0-1.0
|
| 77 |
+
service_blocks: List[ServiceBlock] # Trip assignments
|
| 78 |
+
fitness_certificates: FitnessCertificates
|
| 79 |
+
job_cards: JobCards
|
| 80 |
+
branding: Branding
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
**Size**: ~1.5 KB per trainset
|
| 84 |
+
|
| 85 |
+
**Status Enum**:
|
| 86 |
+
```python
|
| 87 |
+
class TrainHealthStatus(str, Enum):
|
| 88 |
+
REVENUE_SERVICE = "REVENUE_SERVICE" # Active service
|
| 89 |
+
STANDBY = "STANDBY" # Ready, not assigned
|
| 90 |
+
MAINTENANCE = "MAINTENANCE" # Under repair
|
| 91 |
+
SCHEDULED_MAINTENANCE = "SCHEDULED_MAINTENANCE"
|
| 92 |
+
UNAVAILABLE = "UNAVAILABLE" # Out of service
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
**Distribution** (typical 30-train fleet):
|
| 96 |
+
- REVENUE_SERVICE: 22-24 trains (73-80%)
|
| 97 |
+
- STANDBY: 3-5 trains (10-17%)
|
| 98 |
+
- MAINTENANCE: 1-3 trains (3-10%)
|
| 99 |
+
- UNAVAILABLE: 0-2 trains (0-7%)

---

### 3. ServiceBlock

**Purpose**: Single trip assignment for a train

```python
class ServiceBlock(BaseModel):
    block_id: str                       # "BLK-001-01"
    start_time: str                     # "05:00"
    end_time: str                       # "05:45"
    start_station: str                  # "Aluva"
    end_station: str                    # "Pettah"
    direction: str                      # "UP" or "DOWN"
    distance_km: float                  # 25.612
    estimated_passengers: Optional[int] # 450
    priority: str = "NORMAL"            # NORMAL, HIGH, PEAK
```

**Size**: ~250 bytes per service block

**Daily Trips per Train**:
- Peak service train: 6-8 trips
- Standard service: 4-6 trips
- Average: ~5.2 trips per active train

**Total Service Blocks** (30-train fleet):
- 24 active trains × 5.2 trips = ~125 service blocks/day

---

### 4. Route

**Purpose**: Metro line configuration

```python
class Route(BaseModel):
    route_id: str                  # "KMRL-LINE-01"
    name: str                      # "Aluva-Pettah Line"
    stations: List[Station]        # 25 stations
    total_distance_km: float       # 25.612 km
    avg_speed_kmh: int             # 32-38 km/h
    turnaround_time_minutes: int   # 8-12 minutes
```

**KMRL Route Details**:
- **Stations**: 25 (Aluva to Pettah)
- **Distance**: 25.612 km
- **Average Speed**: 35 km/h
- **One-way Time**: ~44 minutes
- **Round Trip**: ~100 minutes (including turnarounds)
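
The timing figures above follow directly from distance, speed, and turnaround time; a minimal sketch (helper names are illustrative, not from the codebase):

```python
def one_way_minutes(distance_km: float, avg_speed_kmh: float) -> float:
    """Travel time for a single end-to-end run, in minutes."""
    return distance_km / avg_speed_kmh * 60

def round_trip_minutes(distance_km: float, avg_speed_kmh: float,
                       turnaround_minutes: int) -> float:
    """Two one-way runs plus a turnaround at each terminus."""
    return 2 * (one_way_minutes(distance_km, avg_speed_kmh) + turnaround_minutes)

# KMRL Line 1: 25.612 km at ~35 km/h gives ~44 minutes one-way;
# with 8-12 minute turnarounds the round trip lands near 100 minutes.
```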

---

### 5. Station

**Purpose**: Individual station on route

```python
class Station(BaseModel):
    station_id: str                # "STN-001"
    name: str                      # "Aluva"
    code: str                      # "ALV"
    distance_from_start_km: float  # 0.0
    platform_count: int            # 2
    facilities: List[str]          # ["PARKING", "ELEVATOR"]
```

**Size**: ~200 bytes per station

**Total Stations**: 25 (fixed)

---

### 6. FitnessCertificates

**Purpose**: Regulatory compliance tracking

```python
class FitnessCertificates(BaseModel):
    rolling_stock: FitnessCertificate  # Train body/chassis
    signalling: FitnessCertificate     # Signal systems
    telecom: FitnessCertificate        # Communication systems

class FitnessCertificate(BaseModel):
    valid_until: str         # "2025-12-31"
    status: CertificateStatus  # VALID, EXPIRING_SOON, EXPIRED

class CertificateStatus(str, Enum):
    VALID = "VALID"                  # > 30 days remaining
    EXPIRING_SOON = "EXPIRING_SOON"  # 7-30 days remaining
    EXPIRED = "EXPIRED"              # Past expiry date
```

**Validation Rules**:
- Trains with EXPIRED certificates: status = UNAVAILABLE
- Trains with EXPIRING_SOON: flagged in alerts, can operate
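
The status can be derived from the expiry date; a minimal sketch (function name is an assumption, and the 0-6 day band, which the thresholds above leave unspecified, is folded into EXPIRING_SOON here):

```python
from datetime import date

def certificate_status(valid_until: str, today: date) -> str:
    """Classify a fitness certificate by days remaining, per the bands above."""
    days_left = (date.fromisoformat(valid_until) - today).days
    if days_left < 0:
        return "EXPIRED"
    if days_left <= 30:
        # Includes the unspecified 0-6 day band: not expired, but flagged.
        return "EXPIRING_SOON"
    return "VALID"
```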

---

### 7. JobCards & Maintenance

**Purpose**: Maintenance tracking

```python
class JobCards(BaseModel):
    open: int            # Number of open job cards
    blocking: List[str]  # Critical issues: ["BRAKE_FAULT"]

# Example maintenance reasons
UNAVAILABLE_REASONS = [
    "SCHEDULED_MAINTENANCE",
    "BRAKE_SYSTEM_REPAIR",
    "HVAC_REPLACEMENT",
    "BOGIE_OVERHAUL",
    "ELECTRICAL_FAULT",
    "ACCIDENT_DAMAGE",
    "PANTOGRAPH_REPAIR",
    "DOOR_SYSTEM_FAULT"
]
```

**Impact on Scheduling**:
- 0 open cards: readiness = 1.0
- 1-2 cards: readiness = 0.9
- 3-4 cards: readiness = 0.7
- 5+ cards: readiness = 0.5, likely maintenance status
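
The readiness bands above amount to a simple lookup; a sketch (function name is illustrative):

```python
def readiness_from_job_cards(open_cards: int) -> float:
    """Map open job-card count to a readiness score, using the bands above."""
    if open_cards == 0:
        return 1.0
    if open_cards <= 2:
        return 0.9
    if open_cards <= 4:
        return 0.7
    return 0.5  # 5+ cards: train is likely headed for MAINTENANCE status
```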

---

### 8. Branding

**Purpose**: Advertisement tracking

```python
class Branding(BaseModel):
    advertiser: str               # "COCACOLA-2024"
    contract_hours_remaining: int # 450 hours
    exposure_priority: str        # LOW, MEDIUM, HIGH, CRITICAL

# Available advertisers
ADVERTISERS = [
    "COCACOLA-2024",
    "FLIPKART-FESTIVE",
    "AMAZON-PRIME",
    "RELIANCE-JIO",
    "TATA-MOTORS",
    "SAMSUNG-GALAXY",
    "NONE"
]
```

**Priority Weights** (for optimization):
- CRITICAL: 4 points
- HIGH: 3 points
- MEDIUM: 2 points
- LOW: 1 point
- NONE: 0 points

**Scheduling Strategy**:
- HIGH/CRITICAL branded trains prioritized for peak hours
- Maximizes advertiser visibility during high-traffic periods
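
The weights and the peak-hour strategy combine into a simple ordering; a sketch (the names `BRANDING_WEIGHTS` and `sort_for_peak_assignment` are illustrative, not from the codebase):

```python
BRANDING_WEIGHTS = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1, "NONE": 0}

def sort_for_peak_assignment(trains: list) -> list:
    """Order trains so higher-priority branding gets peak-hour slots first.

    Each train is a dict with at least an "exposure_priority" key.
    """
    return sorted(trains,
                  key=lambda t: BRANDING_WEIGHTS.get(t["exposure_priority"], 0),
                  reverse=True)
```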

---

### 9. FleetSummary

**Purpose**: Aggregated fleet statistics

```python
class FleetSummary(BaseModel):
    total_trainsets: int         # 30
    in_service: int              # 24
    standby: int                 # 4
    maintenance: int             # 2
    unavailable: int             # 0
    availability_percent: float  # 93.33
    total_mileage_today: int     # 3200 km
    avg_trips_per_train: float   # 5.2
```

**Size**: ~300 bytes

**Key Metrics**:
- **Availability %**: (in_service + standby) / total × 100
- **Target Availability**: ≥ 90%
- **Service Ratio**: in_service / (in_service + standby)
- **Target Service Ratio**: 85-90%
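
Both metrics are one-liners; a sketch using the formulas above (function names are illustrative):

```python
def availability_percent(in_service: int, standby: int, total: int) -> float:
    """(in_service + standby) / total * 100, rounded as in FleetSummary."""
    return round((in_service + standby) / total * 100, 2)

def service_ratio(in_service: int, standby: int) -> float:
    """Share of available trains actually in revenue service."""
    return in_service / (in_service + standby)
```

With the example fleet (24 in service, 4 standby, 30 total) this reproduces the 93.33% availability shown above.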

---

### 10. OptimizationMetrics

**Purpose**: Optimization quality measures

```python
class OptimizationMetrics(BaseModel):
    total_service_blocks: int            # 125
    avg_readiness_score: float           # 0.87
    mileage_variance_coefficient: float  # 0.12
    branding_sla_compliance: float       # 0.95
    fitness_expiry_violations: int       # 0
    execution_time_ms: int               # 1250
    algorithm_used: str                  # "ensemble_ml" or "or_tools"
    confidence_score: Optional[float]    # 0.89 (if ML used)
```

**Size**: ~250 bytes

**Quality Thresholds**:
- avg_readiness_score: ≥ 0.80
- mileage_variance_coefficient: < 0.15
- branding_sla_compliance: ≥ 0.90
- fitness_expiry_violations: 0

---

## API Schemas

### Request: ScheduleRequest

**Endpoint**: `POST /api/v1/generate`

```python
class ScheduleRequest(BaseModel):
    date: str                    # "2025-10-25"
    num_trains: int = 25         # 25-40
    num_stations: int = 25       # Fixed for KMRL
    min_service_trains: int = 22 # Minimum active
    min_standby_trains: int = 3  # Minimum backup

    # Optional overrides
    peak_hours: Optional[List[int]] = None  # [7,8,9,17,18,19]
    force_optimization: bool = False        # Skip ML, use OR-Tools
```

**Size**: ~150 bytes per request

**Validation**:
- `num_trains`: 25 ≤ n ≤ 40
- `num_stations`: Fixed at 25 (KMRL specific)
- `min_service_trains`: ≤ num_trains - 3
- `min_standby_trains`: ≥ 2

**Example**:
```json
{
  "date": "2025-10-25",
  "num_trains": 30,
  "num_stations": 25,
  "min_service_trains": 24,
  "min_standby_trains": 4
}
```
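
In the service these rules live in Pydantic validators; a plain-Python sketch of the same checks (function name is illustrative):

```python
def validate_schedule_request(num_trains: int, num_stations: int,
                              min_service_trains: int,
                              min_standby_trains: int) -> dict:
    """Return a field -> message dict of violations; empty dict means valid."""
    errors = {}
    if not 25 <= num_trains <= 40:
        errors["num_trains"] = "Must be between 25 and 40"
    if num_stations != 25:
        errors["num_stations"] = "Fixed at 25 (KMRL specific)"
    if min_service_trains > num_trains - 3:
        errors["min_service_trains"] = "Must be <= num_trains - 3"
    if min_standby_trains < 2:
        errors["min_standby_trains"] = "Must be >= 2"
    return errors
```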

---

### Response: DaySchedule

**Status**: 200 OK

**Content-Type**: application/json

**Size**: 45-55 KB (depends on fleet size)

**Headers**:
```
X-Algorithm-Used: ensemble_ml | or_tools | greedy
X-Confidence-Score: 0.89 (if ML)
X-Execution-Time-Ms: 1250
```

---

### Error Responses

**400 Bad Request**:
```json
{
  "error": "Validation Error",
  "details": {
    "num_trains": "Must be between 25 and 40"
  }
}
```

**500 Internal Server Error**:
```json
{
  "error": "Optimization Failed",
  "message": "Unable to find feasible schedule",
  "timestamp": "2025-10-25T10:30:00Z"
}
```

---

## Database Schemas

### Schedule Storage (JSON Files)

**Location**: `data/schedules/`

**Naming**: `{schedule_id}_{timestamp}.json`

**Example**: `KMRL-2025-10-25_20251025_043000.json`

**Structure**:
```json
{
  "schedule": {DaySchedule},
  "metadata": {
    "recorded_at": "2025-10-25T04:30:00",
    "quality_score": 87.5,
    "algorithm_used": "ensemble_ml",
    "confidence": 0.89
  },
  "saved_at": "2025-10-25T04:30:15"
}
```

**Size per File**: ~48 KB
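
The naming convention can be reproduced with `strftime`; a sketch (function name is an assumption):

```python
from datetime import datetime

def schedule_filename(schedule_id: str, generated_at: datetime) -> str:
    """Build the {schedule_id}_{timestamp}.json name used for stored schedules."""
    return f"{schedule_id}_{generated_at.strftime('%Y%m%d_%H%M%S')}.json"
```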

---

### Model Storage (Pickle Files)

**Location**: `models/`

**Files**:
1. `models_latest.pkl` - Current ensemble (all 5 models)
2. `models_{timestamp}.pkl` - Historical snapshots
3. `training_history.json` - Training metrics log

**Model File Contents**:
```python
{
    "models": {
        "gradient_boosting": GradientBoostingRegressor(),
        "random_forest": RandomForestRegressor(),
        "xgboost": XGBRegressor(),
        "lightgbm": LGBMRegressor(),
        "catboost": CatBoostRegressor()
    },
    "ensemble_weights": {
        "xgboost": 0.215,
        "lightgbm": 0.208,
        ...
    },
    "best_model_name": "xgboost",
    "last_trained": datetime(2025, 10, 25, 4, 30),
    "config": {
        "version": "v1.0.0",
        "features": [...],
        "models_trained": [...]
    }
}
```

**Size**: ~15-25 MB (all 5 models combined)
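
A prediction from the stored ensemble is a weighted average of the per-model outputs; a sketch under that assumption (function name is illustrative):

```python
def ensemble_predict(per_model: dict, weights: dict) -> float:
    """Weighted average of per-model predictions.

    `per_model` maps model name -> prediction; `weights` maps model name ->
    ensemble weight (as stored in the pickle above). Weights are re-normalized
    so the sketch also works if they do not sum exactly to 1.
    """
    total = sum(weights.values())
    return sum(per_model[name] * w for name, w in weights.items()) / total
```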

---

### Training History (JSON)

**Location**: `models/training_history.json`

**Structure**:
```json
[
  {
    "timestamp": "2025-10-23T12:00:00",
    "metrics": {
      "gradient_boosting": {
        "train_r2": 0.8912,
        "test_r2": 0.8234,
        "test_rmse": 13.45
      },
      ...
    },
    "best_model": "xgboost",
    "ensemble_weights": {...},
    "config": {
      "models_trained": [...],
      "version": "v1.0.0"
    }
  },
  ...
]
```

**Growth**: ~1 KB per training run

**Retention**: All training runs (pruned after 1000 entries)

---

## Data Volume & Storage

### Production Estimates

#### Daily Operations

**Per Day** (single schedule generation):
- 1 schedule file: ~48 KB
- API request/response: ~50 KB total
- Logs: ~10 KB

**Total per day**: ~108 KB

#### Monthly Operations (30 days)

**Schedule files**:
- 30 schedules × 48 KB = 1.44 MB

**Model files**:
- Retraining every 48 hours ≈ 15 retrainings/month
- 15 × 25 MB = 375 MB

**Training history**:
- 15 entries × 1 KB = 15 KB

**Total per month**: ~377 MB

#### Annual Storage (1 year)

**Schedule data**:
- 365 schedules × 48 KB = 17.5 MB

**Model snapshots**:
- 182 retrainings × 25 MB = 4.55 GB

**Training history**:
- 182 KB

**Total per year**: ~4.57 GB

**With retention policy** (keep last 100 schedules, 50 models):
- Schedules: 100 × 48 KB = 4.8 MB
- Models: 50 × 25 MB = 1.25 GB
- History: 182 KB

**Total with retention**: ~1.26 GB
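
The retention arithmetic above as a quick check (decimal units; function name is illustrative):

```python
def retained_storage_mb(schedules: int = 100, schedule_kb: int = 48,
                        models: int = 50, model_mb: int = 25,
                        history_kb: int = 182) -> float:
    """Disk footprint in MB under the retention policy above."""
    return schedules * schedule_kb / 1000 + models * model_mb + history_kb / 1000

# Defaults give ~1255 MB, i.e. the ~1.26 GB quoted above.
```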

---

### ML Training Data Requirements

#### Minimum Training Dataset

**Initial training**: 100 schedules
- Storage: 100 × 48 KB = 4.8 MB
- Generation time: ~15 minutes (automated)
- Training time: 5-10 minutes

**Optimal training**: 500 schedules
- Storage: 500 × 48 KB = 24 MB
- Provides better generalization
- Covers more edge cases

#### Feature Matrix Size

**Per schedule**: 10 features × 8 bytes (float64) = 80 bytes

**Training set** (100 schedules):
- Features (X): 100 × 80 bytes = 8 KB
- Target (y): 100 × 8 bytes = 800 bytes
- Total: ~9 KB (minimal)

**Full dataset** (1000 schedules):
- Features: 80 KB
- Target: 8 KB
- Total: ~88 KB

**Memory during training**:
- Dataset: ~88 KB
- Models (5 × ~5 MB): ~25 MB
- Working memory: ~50 MB
- **Total**: ~75 MB

---

### Optimization Service Resource Usage

#### OR-Tools Optimization

**Input data**:
- 30 trains × 1.5 KB = 45 KB
- 25 stations × 200 bytes = 5 KB
- Constraints: ~10 KB
- **Total input**: ~60 KB

**Memory usage**:
- Solver state: ~10 MB
- Solution space: ~20 MB
- **Peak memory**: ~30 MB

**Execution time**: 1-5 seconds (CPU-bound)

**CPU utilization**: 100% single core

---

#### ML Ensemble Prediction

**Input data**:
- Feature vector: 10 × 8 bytes = 80 bytes
- **Total input**: < 1 KB

**Memory usage**:
- Loaded models: ~25 MB (shared)
- Prediction workspace: ~1 MB
- **Peak memory**: ~26 MB

**Execution time**: 50-100 milliseconds

**CPU utilization**: 20-30% single core

---

#### Greedy Optimization

**Input data**: ~60 KB (same as OR-Tools)

**Memory usage**:
- State tracking: ~5 MB
- Priority queue: ~2 MB
- **Peak memory**: ~7 MB

**Execution time**: < 1 second

**CPU utilization**: 50-70% single core

---

## Service Resource Usage

### DataService (FastAPI)

**Base memory**: 150 MB (Python + FastAPI + dependencies)

**Per request overhead**: ~10 MB

**Concurrent requests** (typical): 1-5

**Total memory** (under load): 200-250 MB

**Disk I/O**:
- Read: Minimal (configuration only)
- Write: ~50 KB per schedule generated

**Network**:
- Inbound: ~150 bytes (request)
- Outbound: ~50 KB (response)

---

### SelfTrainService

**Base memory**: 200 MB (Python + ML libraries)

**During training**:
- Dataset loading: +20 MB
- Model training: +100 MB (peak)
- **Total during training**: ~320 MB

**During inference** (loaded models):
- Models in memory: +25 MB
- **Total during inference**: ~225 MB

**Disk I/O**:
- Read: 5 MB (load schedules)
- Write: 25 MB (save models)

**Frequency**:
- Training: Every 48 hours
- Inference: Per schedule request (if confidence ≥ 75%)
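
The confidence gate (threshold 0.75) and the `force_optimization` override from ScheduleRequest combine into a simple selection rule; a sketch (function name is illustrative; the greedy fallback after an infeasible OR-Tools run is not shown):

```python
from typing import Optional

def choose_algorithm(ml_confidence: Optional[float],
                     threshold: float = 0.75,
                     force_optimization: bool = False) -> str:
    """Pick the scheduling algorithm: ML when confident, else OR-Tools."""
    if force_optimization or ml_confidence is None or ml_confidence < threshold:
        return "or_tools"
    return "ensemble_ml"
```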

---

### Retraining Service (Background)

**Memory**: ~50 MB (idle), ~320 MB (during training)

**CPU**:
- Idle: < 1%
- Training: 100% (5-10 minutes every 48 hours)

**Disk I/O**:
- Check interval: Every 60 minutes
- Read: ~1 MB (check schedule count)
- Write: ~25 MB (when retraining)

---

## Data Flow Summary

### Schedule Generation Request

```
Client Request (150 bytes)
    ↓
FastAPI Parser (~1 KB in memory)
    ↓
Feature Extraction (80 bytes)
    ↓
ML Prediction (25 MB models loaded) OR OR-Tools (30 MB solver)
    ↓
Schedule Generation (45 KB output)
    ↓
JSON Serialization (~50 KB response)
    ↓
Storage (48 KB file)
```

**Total data processed**: ~50 KB per request

**Response time**: 0.1-5 seconds

---

### Model Training Cycle

```
Load Schedules (100 × 48 KB = 4.8 MB)
    ↓
Extract Features (100 × 80 bytes = 8 KB)
    ↓
Train 5 Models (5-10 minutes, 100% CPU)
    ↓
Save Models (25 MB pickle file)
    ↓
Update History (1 KB append)
```

**Total data processed**: ~30 MB

**Frequency**: Every 48 hours

---

## Configuration Data

### Service Configuration

**Location**: `SelfTrainService/config.py`

**Size**: ~5 KB

**Key Parameters**:
```python
{
    "RETRAIN_INTERVAL_HOURS": 48,
    "MIN_SCHEDULES_FOR_TRAINING": 100,
    "MODEL_TYPES": ["gradient_boosting", "xgboost", ...],
    "USE_ENSEMBLE": true,
    "ML_CONFIDENCE_THRESHOLD": 0.75,
    "FEATURES": [10 feature names],
    "EPOCHS": 100,
    "LEARNING_RATE": 0.001
}
```

---

## Data Retention Policies

### Recommended Retention

**Schedule files**:
- Keep last 365 days (17.5 MB)
- Archive older to compressed storage

**Model snapshots**:
- Keep last 50 models (~1.25 GB)
- Delete older snapshots
- Keep 1 model per month for historical reference

**Training history**:
- Keep all entries (grows slowly)
- Compress after 1000 entries

**Logs**:
- Application logs: 30 days
- Error logs: 90 days
- Audit logs: 1 year

---

## Scaling Considerations

### Horizontal Scaling

**API Service** (DataService):
- Stateless - easy to scale
- Load balancer distributes requests
- Each instance: ~250 MB memory

**ML Service** (SelfTrainService):
- Share model files via NFS/S3
- Only one instance should train (avoid conflicts)
- Multiple instances can serve predictions

### Vertical Scaling

**Memory requirements**:
- Minimum: 1 GB RAM
- Recommended: 2 GB RAM
- Optimal: 4 GB RAM (allows concurrent training + serving)

**CPU requirements**:
- Minimum: 1 core
- Recommended: 2 cores (1 for API, 1 for training)
- Optimal: 4 cores (parallel model training)

**Storage requirements**:
- Minimum: 5 GB
- Recommended: 20 GB
- Optimal: 50 GB (1-year retention)

---

## Performance Benchmarks

### Schedule Generation Performance

| Fleet Size | Algorithm | Time  | Memory | Output Size |
|------------|-----------|-------|--------|-------------|
| 25 trains  | ML        | 0.08s | 225 MB | 38 KB       |
| 30 trains  | ML        | 0.10s | 225 MB | 45 KB       |
| 40 trains  | ML        | 0.12s | 225 MB | 60 KB       |
| 25 trains  | OR-Tools  | 1.2s  | 30 MB  | 38 KB       |
| 30 trains  | OR-Tools  | 2.8s  | 30 MB  | 45 KB       |
| 40 trains  | OR-Tools  | 4.5s  | 30 MB  | 60 KB       |
| 25 trains  | Greedy    | 0.3s  | 7 MB   | 38 KB       |
| 30 trains  | Greedy    | 0.5s  | 7 MB   | 45 KB       |
| 40 trains  | Greedy    | 0.8s  | 7 MB   | 60 KB       |

### Training Performance

| Dataset Size   | Training Time | Memory | Model Size |
|----------------|---------------|--------|------------|
| 100 schedules  | 3 min         | 320 MB | 20 MB      |
| 500 schedules  | 8 min         | 350 MB | 24 MB      |
| 1000 schedules | 15 min        | 400 MB | 28 MB      |

---

**Document Version**: 1.0.0
**Last Updated**: November 2, 2025
**Maintained By**: ML-Service Team