Arpit-Bansal committed
Commit 8720c05 · 1 Parent(s): e7bbf32

update docs

Files changed (3)
  1. docs/algorithms.md +604 -0
  2. docs/data-schemas.md +851 -0
  3. docs/integrate.md +0 -0
docs/algorithms.md ADDED
@@ -0,0 +1,604 @@
1
+ # Algorithms & Optimization Techniques
2
+
3
+ ## Overview
4
+
5
+ This document describes all algorithms, optimization techniques, and machine learning models used in the Metro Train Scheduling Service.
6
+
7
+ ---
8
+
9
+ ## Table of Contents
10
+
11
+ 1. [Machine Learning Algorithms](#machine-learning-algorithms)
12
+ 2. [Optimization Algorithms](#optimization-algorithms)
13
+ 3. [Hybrid Approach](#hybrid-approach)
14
+ 4. [Feature Engineering](#feature-engineering)
15
+ 5. [Performance Metrics](#performance-metrics)
16
+
17
+ ---
18
+
19
+ ## Machine Learning Algorithms
20
+
21
+ ### Ensemble Learning Architecture
22
+
23
+ The system employs a **5-model ensemble** approach for schedule quality prediction:
24
+
25
+ #### 1. Gradient Boosting (Scikit-learn)
26
+ **Algorithm**: Sequential ensemble of weak learners (decision trees)
27
+
28
+ **Parameters**:
29
+ - `n_estimators`: 100 trees
30
+ - `learning_rate`: 0.001
31
+ - `loss function`: Least squares regression
32
+ - `max_depth`: Auto (unlimited)
33
+
34
+ **Strengths**:
35
+ - Excellent baseline performance
36
+ - Handles non-linear relationships well
37
+ - Robust to outliers
38
+
39
+ **Use Case**: Primary baseline model for schedule quality prediction
40
+
41
+ ---
42
+
43
+ #### 2. Random Forest (Scikit-learn)
44
+ **Algorithm**: Bagging ensemble of decision trees
45
+
46
+ **Parameters**:
47
+ - `n_estimators`: 100 trees
48
+ - `max_features`: Auto (√n_features)
49
+ - `n_jobs`: -1 (parallel processing)
50
+ - `random_state`: 42
51
+
52
+ **Strengths**:
53
+ - Low variance through averaging
54
+ - Handles missing data well
55
+ - Feature importance ranking
56
+
57
+ **Use Case**: Robust predictions with feature importance insights
58
+
59
+ ---
60
+
61
+ #### 3. XGBoost (Extreme Gradient Boosting)
62
+ **Algorithm**: Optimized distributed gradient boosting
63
+
64
+ **Parameters**:
65
+ - `n_estimators`: 100
66
+ - `learning_rate`: 0.001
67
+ - `objective`: reg:squarederror
68
+ - `tree_method`: Auto
69
+ - `verbosity`: 0
70
+
71
+ **Technical Details**:
72
+ - Uses second-order gradients (Newton-Raphson)
73
+ - L1/L2 regularization to prevent overfitting
74
+ - Parallel tree construction
75
+ - Cache-aware block structure
76
+
77
+ **Strengths**:
78
+ - Typically best single-model performance
79
+ - Fast training and prediction
80
+ - Built-in cross-validation
81
+
82
+ **Use Case**: High-performance predictions, often selected as best model
83
+
84
+ ---
85
+
86
+ #### 4. LightGBM (Microsoft)
87
+ **Algorithm**: Gradient-based One-Side Sampling (GOSS) + Exclusive Feature Bundling (EFB)
88
+
89
+ **Parameters**:
90
+ - `n_estimators`: 100
91
+ - `learning_rate`: 0.001
92
+ - `boosting_type`: gbdt
93
+ - `verbose`: -1
94
+
95
+ **Technical Details**:
96
+ - **GOSS**: Keeps instances with large gradients, randomly samples small gradients
97
+ - **EFB**: Bundles mutually exclusive features to reduce dimensions
98
+ - Leaf-wise tree growth (vs level-wise)
99
+ - Histogram-based splitting
100
+
101
+ **Strengths**:
102
+ - Fastest training time
103
+ - Low memory usage
104
+ - Handles large datasets efficiently
105
+
106
+ **Use Case**: Fast iteration during development, efficient production inference
107
+
108
+ ---
109
+
110
+ #### 5. CatBoost (Yandex)
111
+ **Algorithm**: Ordered boosting with categorical feature handling
112
+
113
+ **Parameters**:
114
+ - `iterations`: 100
115
+ - `learning_rate`: 0.001
116
+ - `loss_function`: RMSE
117
+ - `verbose`: False
118
+
119
+ **Technical Details**:
120
+ - **Ordered Boosting**: Prevents target leakage in gradient calculation
121
+ - **Symmetric Trees**: Balanced tree structure
122
+ - Native categorical feature support
123
+ - Minimal hyperparameter tuning needed
124
+
125
+ **Strengths**:
126
+ - Best out-of-the-box performance
127
+ - Robust to overfitting
128
+ - Excellent with categorical data
129
+
130
+ **Use Case**: Robust predictions with minimal tuning
131
+
132
+ ---
133
+
134
+ ### Ensemble Strategy
135
+
136
+ #### Weighted Voting
137
+ ```python
138
+ # Weight calculation (performance-based)
139
+ weight_i = R²_score_i / Σ(R²_scores)
140
+
141
+ # Final prediction
142
+ prediction = Σ(weight_i × prediction_i)
143
+ ```
144
+
145
+ **Example Weights**:
146
+ ```json
147
+ {
148
+ "xgboost": 0.215, // Best performer
149
+ "lightgbm": 0.208,
150
+ "gradient_boosting": 0.195,
151
+ "catboost": 0.195,
152
+ "random_forest": 0.187
153
+ }
154
+ ```
155
+
156
+ #### Confidence Calculation
157
+ ```python
158
+ # Ensemble confidence based on model agreement
159
+ predictions = [model.predict(features) for model in models]
160
+ std_dev = np.std(predictions)
161
+
162
+ # High agreement → High confidence
163
+ confidence = max(0.5, min(1.0, 1.0 - (std_dev / 50)))
164
+ ```
165
+
166
+ **Confidence Threshold**: 0.75 (75%)
167
+ - If confidence ≥ 75%: Use ML prediction
168
+ - If confidence < 75%: Fall back to optimization
169
+
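The weighted voting and confidence rules above can be combined into one small, self-contained sketch (plain Python, with stub models standing in for the trained regressors; the equal weights and the sample predictions are illustrative, not production values):

```python
import statistics

def ensemble_predict(models, weights, features, threshold=0.75):
    """Weighted-vote prediction plus agreement-based confidence.

    Returns (prediction, confidence, use_ml); use_ml=False signals the
    caller to fall back to the OR-Tools optimizer.
    """
    preds = {name: m.predict(features) for name, m in models.items()}
    prediction = sum(weights[name] * p for name, p in preds.items())
    std_dev = statistics.pstdev(preds.values())          # model disagreement
    confidence = max(0.5, min(1.0, 1.0 - std_dev / 50))  # clamped to [0.5, 1.0]
    return prediction, confidence, confidence >= threshold

class StubModel:
    """Stands in for a trained regressor in this sketch."""
    def __init__(self, value):
        self.value = value
    def predict(self, features):
        return self.value

models = {f"m{i}": StubModel(v) for i, v in enumerate([85, 86, 84, 85, 85])}
weights = {name: 0.2 for name in models}
pred, conf, use_ml = ensemble_predict(models, weights, features=[0.5] * 10)
```

Because the five stub predictions agree closely, the standard deviation is small and the confidence lands well above the 0.75 threshold, so the ML result would be used.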
170
+ ---
171
+
172
+ ## Optimization Algorithms
173
+
174
+ ### Constraint Programming (OR-Tools)
175
+
176
+ **Algorithm**: Google OR-Tools CP-SAT Solver
177
+
178
+ **Problem Type**: Constraint Satisfaction Problem (CSP)
179
+
180
+ #### Variables
181
+ ```python
182
+ # Decision variables for each trainset
183
+ for train in trainsets:
184
+ for time_slot in operational_hours:
185
+ is_assigned[train, time_slot] = BoolVar()
186
+ ```
187
+
188
+ #### Constraints
189
+
190
+ **1. Fleet Coverage**
191
+ ```
192
+ Σ(active_trains_at_time_t) ≥ min_service_trains
193
+ ∀ t ∈ peak_hours
194
+ ```
195
+
196
+ **2. Turnaround Time**
197
+ ```
198
+ end_time[trip_i] + turnaround_time ≤ start_time[trip_i+1]
199
+ ∀ consecutive trips of same train
200
+ ```
201
+
202
+ **3. Maintenance Windows**
203
+ ```
204
+ if train.status == MAINTENANCE:
205
+ is_assigned[train, t] = False
206
+ ∀ t ∈ maintenance_window
207
+ ```
208
+
209
+ **4. Fitness Certificates**
210
+ ```
211
+ if certificate_expired(train):
212
+ is_assigned[train, t] = False
213
+ ∀ t
214
+ ```
215
+
216
+ **5. Mileage Balancing**
217
+ ```
218
+ min_mileage ≤ daily_km[train] ≤ max_mileage
219
+ ∀ trains in AVAILABLE status
220
+ ```
221
+
222
+ **6. Depot Capacity**
223
+ ```
224
+ Σ(trains_in_depot_at_t) ≤ depot_capacity
225
+ ∀ t ∈ non_operational_hours
226
+ ```
227
+
228
+ #### Objective Functions
229
+
230
+ **Multi-objective optimization** with weighted sum:
231
+
232
+ ```python
233
+ objective = (
234
+     0.35 × maximize(service_coverage) +
235
+     0.25 × minimize(mileage_variance) +
236
+     0.20 × maximize(availability_utilization) +
237
+     0.10 × minimize(certificate_violations) +
238
+     0.10 × maximize(branding_exposure)
239
+ )
240
+ ```
241
+
242
+ **Component Details**:
243
+
244
+ 1. **Service Coverage** (35% weight)
245
+ - Maximize trains in service during peak hours
246
+ - Ensure minimum standby capacity
247
+
248
+ 2. **Mileage Variance** (25% weight)
249
+ - Balance cumulative mileage across fleet
250
+ - Prevent overuse of specific trainsets
251
+ - Formula: `1 / (1 + coefficient_of_variation)`
252
+
253
+ 3. **Availability Utilization** (20% weight)
254
+ - Maximize usage of available healthy trains
255
+ - Minimize idle time for service-ready trainsets
256
+
257
+ 4. **Certificate Violations** (10% weight)
258
+ - Minimize assignments with expiring certificates
259
+ - Penalize near-expiry usage (< 30 days)
260
+
261
+ 5. **Branding Exposure** (10% weight)
262
+ - Prioritize branded trains during peak hours
263
+ - Maximize visibility of high-priority advertisers
264
+
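Evaluating this weighted sum for a candidate schedule reduces to a few lines of Python. The component keys and the convention that every component is pre-normalized to [0, 1] (with minimized terms inverted, e.g. `1 - violation_rate`) are assumptions of this sketch, not the solver's internal representation:

```python
# Objective weights from the multi-objective formulation above.
WEIGHTS = {
    "service_coverage": 0.35,
    "mileage_balance": 0.25,          # e.g. 1 / (1 + coefficient_of_variation)
    "availability_utilization": 0.20,
    "certificate_compliance": 0.10,   # minimized term, inverted to [0, 1]
    "branding_exposure": 0.10,
}

def objective_value(components: dict) -> float:
    """Weighted score for comparing candidate schedules (higher is better)."""
    return sum(WEIGHTS[key] * components[key] for key in WEIGHTS)
```

Since the weights sum to 1.0 and each component lies in [0, 1], the objective itself lies in [0, 1], which makes schedules directly comparable.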
265
+ ---
266
+
267
+ ### Greedy Optimization
268
+
269
+ **Algorithm**: Priority-based greedy assignment
270
+
271
+ **Location**: `greedyOptim/` folder
272
+
273
+ #### Priority Scoring
274
+ ```python
275
+ priority_score = (
276
+     0.40 × readiness_score +
277
+     0.25 × (1 - normalized_mileage) +
278
+     0.20 × certificate_validity_days +
279
+     0.10 × branding_priority +
280
+     0.05 × maintenance_gap_days
281
+ )
282
+ ```
283
+
284
+ #### Assignment Process
285
+
286
+ 1. **Sort trains by priority** (descending)
287
+ 2. **Iterate through time slots** (5 AM → 11 PM)
288
+ 3. **For each slot**:
289
+ - Select highest-priority available train
290
+ - Check constraints (turnaround, capacity)
291
+ - Assign if feasible
292
+ - Update train state (location, mileage)
293
+ 4. **Fallback**: If no train available, flag as gap
294
+
295
+ **Complexity**: O(n × t) where n = trains, t = time slots
296
+
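The assignment steps above can be sketched in a few lines. The one-train-per-slot simplification and the slot-based turnaround are assumptions of this sketch, not the service's exact implementation:

```python
def greedy_assign(trains, time_slots, turnaround=1):
    """Priority-based greedy assignment.

    trains: list of dicts with 'id' and 'priority' (higher = assigned first).
    A train becomes available again `turnaround` slots after it finishes.
    Returns {slot: train_id or None}; None flags a coverage gap (fallback).
    """
    ranked = sorted(trains, key=lambda t: t["priority"], reverse=True)
    next_free = {t["id"]: 0 for t in trains}  # earliest slot each train can take
    schedule = {}
    for slot in time_slots:
        # Highest-priority train whose turnaround constraint is satisfied.
        pick = next((t for t in ranked if next_free[t["id"]] <= slot), None)
        if pick is None:
            schedule[slot] = None             # no feasible train: flag as gap
        else:
            schedule[slot] = pick["id"]
            next_free[pick["id"]] = slot + 1 + turnaround
    return schedule

# Two-train example: the higher-priority train is reused as soon as
# its turnaround window elapses.
trains = [{"id": "TS-002", "priority": 0.91}, {"id": "TS-001", "priority": 0.84}]
plan = greedy_assign(trains, time_slots=[0, 1, 2], turnaround=1)
```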
297
+ **Advantages**:
298
+ - Fast execution (< 1 second for 40 trains)
299
+ - Interpretable decisions
300
+ - Good for real-time adjustments
301
+
302
+ **Disadvantages**:
303
+ - May not find global optimum
304
+ - Sensitive to initial priority weights
305
+
306
+ ---
307
+
308
+ ### Genetic Algorithm
309
+
310
+ **Algorithm**: Evolutionary optimization
311
+
312
+ **Location**: `greedyOptim/genetic_algorithm.py`
313
+
314
+ #### Parameters
315
+ - **Population size**: 100 schedules
316
+ - **Generations**: 50 iterations
317
+ - **Crossover rate**: 0.8
318
+ - **Mutation rate**: 0.1
319
+ - **Selection**: Tournament (k=3)
320
+
321
+ #### Chromosome Encoding
322
+ ```python
323
+ # Each chromosome = complete schedule
324
+ chromosome = [train_id_for_trip_0, train_id_for_trip_1, ..., train_id_for_trip_n]
325
+ ```
326
+
327
+ #### Fitness Function
328
+ ```python
329
+ fitness = (
330
+ service_quality_score -
331
+     constraint_violations × penalty_weight
332
+ )
333
+ ```
334
+
335
+ #### Genetic Operators
336
+
337
+ **1. Crossover (Single-point)**
338
+ ```python
339
+ parent1 = [T1, T2, T3, T4, T5, T6]
340
+ parent2 = [T3, T1, T4, T2, T6, T5]
341
+     ↓ crossover at position 3
342
+ child1 = [T1, T2, T3, T2, T6, T5]
343
+ child2 = [T3, T1, T4, T4, T5, T6]
344
+ ```
345
+
346
+ **2. Mutation (Swap)**
347
+ ```python
348
+ # Randomly swap two trip assignments
349
+ schedule = [T1, T2, T3, T4, T5]
350
+     ↓ swap positions 1 and 3
351
+ mutated = [T1, T4, T3, T2, T5]
352
+ ```
353
+
354
+ **Termination**: Max generations or convergence (no improvement for 10 generations)
355
+
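Both operators are a few lines each; the functions below reproduce the examples above and are deterministic when the cut point and swap positions are supplied (random otherwise):

```python
import random

def single_point_crossover(parent1, parent2, point=None):
    """Split both parents at `point` and swap the tails (diagram above)."""
    if point is None:
        point = random.randrange(1, len(parent1))
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def swap_mutation(chromosome, i=None, j=None):
    """Swap two trip assignments; positions are drawn at random if not given."""
    if i is None or j is None:
        i, j = random.sample(range(len(chromosome)), 2)
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

With `point=3` the crossover reproduces the parent/child example above exactly, and `swap_mutation(..., 1, 3)` reproduces the mutation example.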
356
+ ---
357
+
358
+ ## Hybrid Approach
359
+
360
+ ### Decision Flow
361
+
362
+ ```
363
+ ┌─────────────────────┐
364
+ │  Schedule Request   │
365
+ └──────────┬──────────┘
366
+            │
367
+            ▼
368
+ ┌───────────────────────────────────┐
369
+ │  Extract Features from Request    │
370
+ │  (num_trains, time, day, etc.)    │
371
+ └──────────┬────────────────────────┘
372
+            │
373
+            ▼
374
+ ┌───────────────────────────────────┐
375
+ │  Ensemble ML Prediction           │
376
+ │  - All 5 models predict           │
377
+ │  - Weighted voting                │
378
+ │  - Calculate confidence           │
379
+ └──────────┬────────────────────────┘
380
+            │
381
+            ▼
382
+     Confidence ≥ 75%?
383
+            │
384
+     ┌──────┴──────┐
385
+     │             │
386
+    YES            NO
387
+     │             │
388
+     ▼             ▼
389
+ ┌───────┐   ┌──────────┐
390
+ │  Use  │   │   Use    │
391
+ │  ML   │   │ OR-Tools │
392
+ │Result │   │ Optimize │
393
+ └───────┘   └──────────┘
394
+     │             │
395
+     └──────┬──────┘
396
+            │
397
+            ▼
398
+     ┌─────────────┐
399
+     │  Schedule   │
400
+     └─────────────┘
401
+ ```
402
+
403
+ ### When ML is Used
404
+
405
+ **Conditions**:
406
+ 1. ✅ Models trained (≥100 schedules)
407
+ 2. ✅ Confidence score ≥ 75%
408
+ 3. ✅ Hybrid mode enabled
409
+
410
+ **Typical Scenarios**:
411
+ - Standard 30-train fleet
412
+ - Normal operational parameters
413
+ - No major disruptions
414
+
415
+ ### When Optimization is Used
416
+
417
+ **Conditions**:
418
+ - ❌ Low ML confidence (< 75%)
419
+ - ❌ Models not trained
420
+ - ❌ Unusual parameters (edge cases)
421
+ - ❌ First-time scheduling
422
+
423
+ **Typical Scenarios**:
424
+ - Fleet size changes (25→40 trains)
425
+ - New route configurations
426
+ - Major maintenance events
427
+ - System initialization
428
+
429
+ ---
430
+
431
+ ## Feature Engineering
432
+
433
+ ### Input Features (10 dimensions)
434
+
435
+ | Feature | Type | Range | Description |
436
+ |---------|------|-------|-------------|
437
+ | `num_trains` | Integer | 25-40 | Total fleet size |
438
+ | `num_available` | Integer | 20-38 | Trains in service/standby |
439
+ | `avg_readiness_score` | Float | 0.0-1.0 | Average train health |
440
+ | `total_mileage` | Integer | 100K-500K | Fleet cumulative km |
441
+ | `mileage_variance` | Float | 0-50K | Std dev of mileage |
442
+ | `maintenance_count` | Integer | 0-10 | Trains in maintenance |
443
+ | `certificate_expiry_count` | Integer | 0-5 | Expiring certificates |
444
+ | `branding_priority_sum` | Integer | 0-100 | Total branding priority |
445
+ | `time_of_day` | Integer | 0-23 | Hour of day |
446
+ | `day_of_week` | Integer | 0-6 | Day (0=Monday) |
447
+
448
+ ### Target Variable
449
+
450
+ **Schedule Quality Score** (0-100):
451
+
452
+ ```python
453
+ score = (
454
+     avg_readiness × 30 +       # Health (30 points)
455
+     availability_% × 25 +      # Availability (25 points)
456
+     (1 - mileage_var) × 20 +   # Balance (20 points)
457
+     branding_sla × 15 +        # Branding (15 points)
458
+     (10 - violations × 2)      # Compliance (10 points)
459
+ )
460
+ ```
461
+
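The same target can be written as executable Python. Clamping the result to [0, 100] and flooring the compliance term at zero are assumptions of this sketch; all inputs except `violations` are fractions in [0, 1]:

```python
def schedule_quality_score(avg_readiness, availability, mileage_var,
                           branding_sla, violations):
    """Training target on a 0-100 scale (formula above)."""
    score = (
        avg_readiness * 30              # Health (30 points)
        + availability * 25             # Availability (25 points)
        + (1 - mileage_var) * 20        # Balance (20 points)
        + branding_sla * 15             # Branding (15 points)
        + max(0, 10 - violations * 2)   # Compliance (10 points)
    )
    return max(0.0, min(100.0, score))
```

A perfect schedule (full readiness and availability, zero mileage variance and violations, full branding SLA) scores exactly 100.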
462
+ ### Feature Scaling
463
+
464
+ All features normalized to [0, 1] range before training:
465
+
466
+ ```python
467
+ feature_normalized = (value - min) / (max - min)
468
+ ```
469
+
470
+ ---
471
+
472
+ ## Performance Metrics
473
+
474
+ ### Model Evaluation
475
+
476
+ **Primary Metric**: R² Score (Coefficient of Determination)
477
+ - Range: at most 1 (higher is better); negative values indicate a poor fit
478
+ - Typical ensemble R²: 0.85-0.92
479
+
480
+ **Secondary Metric**: RMSE (Root Mean Squared Error)
481
+ - Range: [0, ∞), lower is better
482
+ - Typical ensemble RMSE: 8-15
483
+
484
+ **Training Split**: 80% train, 20% test
485
+
486
+ ### Optimization Quality
487
+
488
+ **Metrics Tracked**:
489
+
490
+ 1. **Service Coverage**: % of required hours covered
491
+ - Target: ≥ 95%
492
+
493
+ 2. **Fleet Utilization**: % of available trains used
494
+ - Target: 85-95%
495
+
496
+ 3. **Mileage Balance**: Coefficient of variation
497
+ - Target: < 0.15 (15%)
498
+
499
+ 4. **Constraint Violations**: Count of hard constraint breaks
500
+ - Target: 0
501
+
502
+ 5. **Execution Time**: Algorithm runtime
503
+ - ML: < 0.1 seconds
504
+ - OR-Tools: 1-5 seconds
505
+ - Genetic: 5-15 seconds
506
+
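The mileage-balance metric, together with the `1 / (1 + cv)` balance score used in the optimizer objective, can be computed directly (the function name is illustrative):

```python
import statistics

def mileage_balance(daily_km):
    """Coefficient of variation of per-train mileage (target < 0.15),
    plus the 1 / (1 + cv) balance score from the optimizer objective."""
    cv = statistics.pstdev(daily_km) / statistics.mean(daily_km)
    return cv, 1 / (1 + cv)
```

A perfectly balanced fleet gives cv = 0 and a balance score of 1.0; spreads of ±10% around the mean stay comfortably under the 0.15 target.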
507
+ ### Ensemble Performance Example
508
+
509
+ ```json
510
+ {
511
+ "gradient_boosting": {
512
+ "train_r2": 0.8912,
513
+ "test_r2": 0.8234,
514
+ "test_rmse": 13.45
515
+ },
516
+ "xgboost": {
517
+ "train_r2": 0.9234,
518
+ "test_r2": 0.8543,
519
+ "test_rmse": 12.34
520
+ },
521
+ "lightgbm": {
522
+ "train_r2": 0.9156,
523
+ "test_r2": 0.8467,
524
+ "test_rmse": 12.67
525
+ },
526
+ "catboost": {
527
+ "train_r2": 0.9087,
528
+ "test_r2": 0.8401,
529
+ "test_rmse": 12.89
530
+ },
531
+ "random_forest": {
532
+ "train_r2": 0.8756,
533
+ "test_r2": 0.8123,
534
+ "test_rmse": 13.98
535
+ },
536
+ "ensemble": {
537
+ "test_r2": 0.8621,
538
+ "test_rmse": 11.87,
539
+ "confidence": 0.89
540
+ }
541
+ }
542
+ ```
543
+
544
+ ---
545
+
546
+ ## Algorithm Selection Guide
547
+
548
+ | Use Case | Recommended Algorithm | Rationale |
549
+ |----------|----------------------|-----------|
550
+ | First-time scheduling | OR-Tools CP-SAT | No training data available |
551
+ | Standard operations | Ensemble ML | Fast, accurate predictions |
552
+ | Edge cases | OR-Tools CP-SAT | Guaranteed feasibility |
553
+ | Real-time updates | Greedy + ML | Sub-second performance |
554
+ | Offline planning | Genetic Algorithm | Exploration of solution space |
555
+ | Development/Testing | LightGBM | Fastest training iteration |
556
+ | Production inference | XGBoost | Best accuracy/speed trade-off |
557
+
558
+ ---
559
+
560
+ ## Future Enhancements
561
+
562
+ ### Planned Improvements
563
+
564
+ 1. **Reinforcement Learning**
565
+ - Q-learning for dynamic scheduling
566
+ - Reward: schedule quality over time
567
+
568
+ 2. **Deep Learning**
569
+ - LSTM for time-series prediction
570
+ - Attention mechanisms for trip dependencies
571
+
572
+ 3. **Multi-objective Pareto**
573
+ - Generate Pareto-optimal solution set
574
+ - Allow user to select trade-off point
575
+
576
+ 4. **Transfer Learning**
577
+ - Pre-train on similar metro systems
578
+ - Fine-tune for KMRL specifics
579
+
580
+ 5. **Online Learning**
581
+ - Incremental model updates
582
+ - Adapt to changing patterns without full retraining
583
+
584
+ ---
585
+
586
+ ## References
587
+
588
+ ### Libraries
589
+ - **Scikit-learn**: https://scikit-learn.org/
590
+ - **XGBoost**: https://xgboost.readthedocs.io/
591
+ - **LightGBM**: https://lightgbm.readthedocs.io/
592
+ - **CatBoost**: https://catboost.ai/
593
+ - **OR-Tools**: https://developers.google.com/optimization
594
+
595
+ ### Papers
596
+ 1. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System"
597
+ 2. Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree"
598
+ 3. Prokhorenkova, L., et al. (2018). "CatBoost: unbiased boosting with categorical features"
599
+
600
+ ---
601
+
602
+ **Document Version**: 1.0.0
603
+ **Last Updated**: November 2, 2025
604
+ **Maintained By**: ML-Service Team
docs/data-schemas.md ADDED
@@ -0,0 +1,851 @@
1
+ # Data Schemas & Service Specifications
2
+
3
+ ## Overview
4
+
5
+ This document details all data structures, schemas, API contracts, and data volume specifications for the Metro Train Scheduling Service.
6
+
7
+ ---
8
+
9
+ ## Table of Contents
10
+
11
+ 1. [Core Data Models](#core-data-models)
12
+ 2. [API Schemas](#api-schemas)
13
+ 3. [Database Schemas](#database-schemas)
14
+ 4. [Data Volume & Storage](#data-volume--storage)
15
+ 5. [Service Resource Usage](#service-resource-usage)
16
+
17
+ ---
18
+
19
+ ## Core Data Models
20
+
21
+ All models use **Pydantic v2** for validation and serialization.
22
+
23
+ ### 1. DaySchedule
24
+
25
+ **Purpose**: Complete daily schedule with all trainset assignments
26
+
27
+ ```python
28
+ class DaySchedule(BaseModel):
29
+ schedule_id: str # "KMRL-2025-10-25"
30
+ date: str # "2025-10-25"
31
+ route: Route # Route details
32
+ trainsets: List[Trainset] # All train assignments
33
+ fleet_summary: FleetSummary # Fleet statistics
34
+ optimization_metrics: OptimizationMetrics
35
+ alerts: List[Alert] # Warnings/issues
36
+ generated_at: datetime
37
+ generated_by: str = "ML-Optimizer"
38
+ ```
39
+
40
+ **Size**: ~45 KB per schedule (30 trains, full day)
41
+
42
+ **Example**:
43
+ ```json
44
+ {
45
+ "schedule_id": "KMRL-2025-10-25",
46
+ "date": "2025-10-25",
47
+ "route": {...},
48
+ "trainsets": [...],
49
+ "fleet_summary": {
50
+ "total_trainsets": 30,
51
+ "in_service": 24,
52
+ "standby": 4,
53
+ "maintenance": 2
54
+ },
55
+ "optimization_metrics": {
56
+ "total_service_blocks": 156,
57
+ "avg_readiness_score": 0.87,
58
+ "mileage_variance_coefficient": 0.12
59
+ },
60
+ "generated_at": "2025-10-25T04:30:00+05:30"
61
+ }
62
+ ```
63
+
64
+ ---
65
+
66
+ ### 2. Trainset
67
+
68
+ **Purpose**: Individual train assignment and status
69
+
70
+ ```python
71
+ class Trainset(BaseModel):
72
+ trainset_id: str # "TS-001"
73
+ status: TrainHealthStatus # REVENUE_SERVICE, STANDBY, etc.
74
+ depot_bay: str # "BAY-01"
75
+ cumulative_km: int # 145250
76
+ readiness_score: float # 0.0-1.0
77
+ service_blocks: List[ServiceBlock] # Trip assignments
78
+ fitness_certificates: FitnessCertificates
79
+ job_cards: JobCards
80
+ branding: Branding
81
+ ```
82
+
83
+ **Size**: ~1.5 KB per trainset
84
+
85
+ **Status Enum**:
86
+ ```python
87
+ class TrainHealthStatus(str, Enum):
88
+ REVENUE_SERVICE = "REVENUE_SERVICE" # Active service
89
+ STANDBY = "STANDBY" # Ready, not assigned
90
+ MAINTENANCE = "MAINTENANCE" # Under repair
91
+ SCHEDULED_MAINTENANCE = "SCHEDULED_MAINTENANCE"
92
+ UNAVAILABLE = "UNAVAILABLE" # Out of service
93
+ ```
94
+
95
+ **Distribution** (typical 30-train fleet):
96
+ - REVENUE_SERVICE: 22-24 trains (73-80%)
97
+ - STANDBY: 3-5 trains (10-17%)
98
+ - MAINTENANCE: 1-3 trains (3-10%)
99
+ - UNAVAILABLE: 0-2 trains (0-7%)
100
+
101
+ ---
102
+
103
+ ### 3. ServiceBlock
104
+
105
+ **Purpose**: Single trip assignment for a train
106
+
107
+ ```python
108
+ class ServiceBlock(BaseModel):
109
+ block_id: str # "BLK-001-01"
110
+ start_time: str # "05:00"
111
+ end_time: str # "05:45"
112
+ start_station: str # "Aluva"
113
+ end_station: str # "Pettah"
114
+ direction: str # "UP" or "DOWN"
115
+ distance_km: float # 25.612
116
+ estimated_passengers: Optional[int] # 450
117
+ priority: str = "NORMAL" # NORMAL, HIGH, PEAK
118
+ ```
119
+
120
+ **Size**: ~250 bytes per service block
121
+
122
+ **Daily Trips per Train**:
123
+ - Peak service train: 6-8 trips
124
+ - Standard service: 4-6 trips
125
+ - Average: ~5.2 trips per active train
126
+
127
+ **Total Service Blocks** (30-train fleet):
128
+ - 24 active trains × 5.2 trips = ~125 service blocks/day
129
+
130
+ ---
131
+
132
+ ### 4. Route
133
+
134
+ **Purpose**: Metro line configuration
135
+
136
+ ```python
137
+ class Route(BaseModel):
138
+ route_id: str # "KMRL-LINE-01"
139
+ name: str # "Aluva-Pettah Line"
140
+ stations: List[Station] # 25 stations
141
+ total_distance_km: float # 25.612 km
142
+ avg_speed_kmh: int # 32-38 km/h
143
+ turnaround_time_minutes: int # 8-12 minutes
144
+ ```
145
+
146
+ **KMRL Route Details**:
147
+ - **Stations**: 25 (Aluva to Pettah)
148
+ - **Distance**: 25.612 km
149
+ - **Average Speed**: 35 km/h
150
+ - **One-way Time**: ~44 minutes
151
+ - **Round Trip**: ~100 minutes (including turnarounds)
152
+
153
+ ---
154
+
155
+ ### 5. Station
156
+
157
+ **Purpose**: Individual station on route
158
+
159
+ ```python
160
+ class Station(BaseModel):
161
+ station_id: str # "STN-001"
162
+ name: str # "Aluva"
163
+ code: str # "ALV"
164
+ distance_from_start_km: float # 0.0
165
+ platform_count: int # 2
166
+ facilities: List[str] # ["PARKING", "ELEVATOR"]
167
+ ```
168
+
169
+ **Size**: ~200 bytes per station
170
+
171
+ **Total Stations**: 25 (fixed)
172
+
173
+ ---
174
+
175
+ ### 6. FitnessCertificates
176
+
177
+ **Purpose**: Regulatory compliance tracking
178
+
179
+ ```python
180
+ class FitnessCertificates(BaseModel):
181
+ rolling_stock: FitnessCertificate # Train body/chassis
182
+ signalling: FitnessCertificate # Signal systems
183
+ telecom: FitnessCertificate # Communication systems
184
+
185
+ class FitnessCertificate(BaseModel):
186
+ valid_until: str # "2025-12-31"
187
+ status: CertificateStatus # VALID, EXPIRING_SOON, EXPIRED
188
+
189
+ class CertificateStatus(str, Enum):
190
+ VALID = "VALID" # > 30 days remaining
191
+ EXPIRING_SOON = "EXPIRING_SOON" # 7-30 days remaining
192
+ EXPIRED = "EXPIRED" # Past expiry date
193
+ ```
194
+
195
+ **Validation Rules**:
196
+ - Trains with EXPIRED certificates: status = UNAVAILABLE
197
+ - Trains with EXPIRING_SOON: flagged in alerts, can operate
198
+
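The classification above reads directly as a function. The enum leaves the 0-6 day band unspecified; this sketch assumes it also counts as EXPIRING_SOON:

```python
from datetime import date

def certificate_status(valid_until: str, today: date) -> str:
    """Classify a certificate by days remaining, per the thresholds above."""
    days_left = (date.fromisoformat(valid_until) - today).days
    if days_left < 0:
        return "EXPIRED"
    if days_left <= 30:
        return "EXPIRING_SOON"  # includes the unspecified 0-6 day band
    return "VALID"
```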
199
+ ---
200
+
201
+ ### 7. JobCards & Maintenance
202
+
203
+ **Purpose**: Maintenance tracking
204
+
205
+ ```python
206
+ class JobCards(BaseModel):
207
+ open: int # Number of open job cards
208
+ blocking: List[str] # Critical issues: ["BRAKE_FAULT"]
209
+
210
+ # Example maintenance reasons
211
+ UNAVAILABLE_REASONS = [
212
+ "SCHEDULED_MAINTENANCE",
213
+ "BRAKE_SYSTEM_REPAIR",
214
+ "HVAC_REPLACEMENT",
215
+ "BOGIE_OVERHAUL",
216
+ "ELECTRICAL_FAULT",
217
+ "ACCIDENT_DAMAGE",
218
+ "PANTOGRAPH_REPAIR",
219
+ "DOOR_SYSTEM_FAULT"
220
+ ]
221
+ ```
222
+
223
+ **Impact on Scheduling**:
224
+ - 0 open cards: readiness = 1.0
225
+ - 1-2 cards: readiness = 0.9
226
+ - 3-4 cards: readiness = 0.7
227
+ - 5+ cards: readiness = 0.5, likely maintenance status
228
+
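The mapping above is a simple lookup (the function name is illustrative):

```python
def readiness_from_job_cards(open_cards: int) -> float:
    """Map the number of open job cards to a readiness score (table above)."""
    if open_cards <= 0:
        return 1.0
    if open_cards <= 2:
        return 0.9
    if open_cards <= 4:
        return 0.7
    return 0.5  # 5+ cards: train likely moves to MAINTENANCE status
```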
229
+ ---
230
+
231
+ ### 8. Branding
232
+
233
+ **Purpose**: Advertisement tracking
234
+
235
+ ```python
236
+ class Branding(BaseModel):
237
+ advertiser: str # "COCACOLA-2024"
238
+ contract_hours_remaining: int # 450 hours
239
+ exposure_priority: str # LOW, MEDIUM, HIGH, CRITICAL
240
+
241
+ # Available advertisers
242
+ ADVERTISERS = [
243
+ "COCACOLA-2024",
244
+ "FLIPKART-FESTIVE",
245
+ "AMAZON-PRIME",
246
+ "RELIANCE-JIO",
247
+ "TATA-MOTORS",
248
+ "SAMSUNG-GALAXY",
249
+ "NONE"
250
+ ]
251
+ ```
252
+
253
+ **Priority Weights** (for optimization):
254
+ - CRITICAL: 4 points
255
+ - HIGH: 3 points
256
+ - MEDIUM: 2 points
257
+ - LOW: 1 point
258
+ - NONE: 0 points
259
+
260
+ **Scheduling Strategy**:
261
+ - HIGH/CRITICAL branded trains prioritized for peak hours
262
+ - Maximizes advertiser visibility during high-traffic periods
263
+
264
+ ---
265
+
266
+ ### 9. FleetSummary
267
+
268
+ **Purpose**: Aggregated fleet statistics
269
+
270
+ ```python
271
+ class FleetSummary(BaseModel):
272
+ total_trainsets: int # 30
273
+ in_service: int # 24
274
+ standby: int # 4
275
+ maintenance: int # 2
276
+ unavailable: int # 0
277
+ availability_percent: float # 93.33
278
+ total_mileage_today: int # 3200 km
279
+ avg_trips_per_train: float # 5.2
280
+ ```
281
+
282
+ **Size**: ~300 bytes
283
+
284
+ **Key Metrics**:
285
+ - **Availability %**: (in_service + standby) / total × 100
286
+ - **Target Availability**: ≥ 90%
287
+ - **Service Ratio**: in_service / (in_service + standby)
288
+ - **Target Service Ratio**: 85-90%
289
+
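Both derived metrics in one sketch, using the numbers from the FleetSummary example above:

```python
def fleet_metrics(in_service, standby, maintenance, unavailable):
    """Availability % and service ratio as defined above."""
    total = in_service + standby + maintenance + unavailable
    availability_percent = round((in_service + standby) / total * 100, 2)
    service_ratio = in_service / (in_service + standby)
    return availability_percent, service_ratio

# FleetSummary example: 24 in service, 4 standby, 2 in maintenance.
availability, ratio = fleet_metrics(in_service=24, standby=4,
                                    maintenance=2, unavailable=0)
```

This reproduces the example's `availability_percent` of 93.33, and the service ratio (24/28 ≈ 0.86) lands inside the 85-90% target band.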
290
+ ---
291
+
292
+ ### 10. OptimizationMetrics
293
+
294
+ **Purpose**: Optimization quality measures
295
+
296
+ ```python
297
+ class OptimizationMetrics(BaseModel):
298
+ total_service_blocks: int # 125
299
+ avg_readiness_score: float # 0.87
300
+ mileage_variance_coefficient: float # 0.12
301
+ branding_sla_compliance: float # 0.95
302
+ fitness_expiry_violations: int # 0
303
+ execution_time_ms: int # 1250
304
+ algorithm_used: str # "ensemble_ml" or "or_tools"
305
+ confidence_score: Optional[float] # 0.89 (if ML used)
306
+ ```
307
+
308
+ **Size**: ~250 bytes
309
+
310
+ **Quality Thresholds**:
311
+ - avg_readiness_score: ≥ 0.80
312
+ - mileage_variance_coefficient: < 0.15
313
+ - branding_sla_compliance: ≥ 0.90
314
+ - fitness_expiry_violations: 0
315
+
316
+ ---
317
+
318
+ ## API Schemas
319
+
320
+ ### Request: ScheduleRequest
321
+
322
+ **Endpoint**: `POST /api/v1/generate`
323
+
324
+ ```python
325
+ class ScheduleRequest(BaseModel):
326
+ date: str # "2025-10-25"
327
+ num_trains: int = 25 # 25-40
328
+ num_stations: int = 25 # Fixed for KMRL
329
+ min_service_trains: int = 22 # Minimum active
330
+ min_standby_trains: int = 3 # Minimum backup
331
+
332
+ # Optional overrides
333
+ peak_hours: Optional[List[int]] = None # [7,8,9,17,18,19]
334
+ force_optimization: bool = False # Skip ML, use OR-Tools
335
+ ```
336
+
337
+ **Size**: ~150 bytes per request
338
+
339
+ **Validation**:
340
+ - `num_trains`: 25 ≤ n ≤ 40
341
+ - `num_stations`: Fixed at 25 (KMRL specific)
342
+ - `min_service_trains`: ≤ num_trains - 3
343
+ - `min_standby_trains`: ≥ 2
344
+
345
+ **Example**:
346
+ ```json
347
+ {
348
+ "date": "2025-10-25",
349
+ "num_trains": 30,
350
+ "num_stations": 25,
351
+ "min_service_trains": 24,
352
+ "min_standby_trains": 4
353
+ }
354
+ ```
355
+
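In the service these rules are enforced by Pydantic validators; a plain-Python restatement of the four rules (returning a field → message mapping, with an illustrative function name) looks like this:

```python
def validate_schedule_request(req: dict) -> dict:
    """Apply the validation rules listed above; empty dict means valid."""
    errors = {}
    if not (25 <= req.get("num_trains", 0) <= 40):
        errors["num_trains"] = "Must be between 25 and 40"
    if req.get("num_stations") != 25:
        errors["num_stations"] = "Fixed at 25 (KMRL specific)"
    if req.get("min_service_trains", 0) > req.get("num_trains", 0) - 3:
        errors["min_service_trains"] = "Must be <= num_trains - 3"
    if req.get("min_standby_trains", 0) < 2:
        errors["min_standby_trains"] = "Must be >= 2"
    return errors

# The example request above passes cleanly.
example = {"date": "2025-10-25", "num_trains": 30, "num_stations": 25,
           "min_service_trains": 24, "min_standby_trains": 4}
```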
356
+ ---
357
+
358
+ ### Response: DaySchedule
359
+
360
+ **Status**: 200 OK
361
+
362
+ **Content-Type**: application/json
363
+
364
+ **Size**: 45-55 KB (depends on fleet size)
365
+
366
+ **Headers**:
367
+ ```
368
+ X-Algorithm-Used: ensemble_ml | or_tools | greedy
369
+ X-Confidence-Score: 0.89 (if ML)
370
+ X-Execution-Time-Ms: 1250
371
+ ```
372
+
373
+ ---
374
+
375
+ ### Error Responses
376
+
377
+ **400 Bad Request**:
378
+ ```json
379
+ {
380
+ "error": "Validation Error",
381
+ "details": {
382
+ "num_trains": "Must be between 25 and 40"
383
+ }
384
+ }
385
+ ```
386
+
387
+ **500 Internal Server Error**:
388
+ ```json
389
+ {
390
+ "error": "Optimization Failed",
391
+ "message": "Unable to find feasible schedule",
392
+ "timestamp": "2025-10-25T10:30:00Z"
393
+ }
394
+ ```
395
+
396
+ ---
397
+
398
+ ## Database Schemas
399
+
400
+ ### Schedule Storage (JSON Files)
401
+
402
+ **Location**: `data/schedules/`
403
+
404
+ **Naming**: `{schedule_id}_{timestamp}.json`
405
+
406
+ **Example**: `KMRL-2025-10-25_20251025_043000.json`
407
+
408
+ **Structure**:
409
+ ```json
410
+ {
411
+ "schedule": {DaySchedule},
412
+ "metadata": {
413
+ "recorded_at": "2025-10-25T04:30:00",
414
+ "quality_score": 87.5,
415
+ "algorithm_used": "ensemble_ml",
416
+ "confidence": 0.89
417
+ },
418
+ "saved_at": "2025-10-25T04:30:15"
419
+ }
420
+ ```
421
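Writing a schedule with this naming convention and structure can be sketched as follows (the `schedule_id` key and function name are assumptions for illustration):

```python
import json
from datetime import datetime
from pathlib import Path


def save_schedule(schedule: dict, metadata: dict,
                  base_dir: str = "data/schedules") -> Path:
    """Persist a schedule using the {schedule_id}_{timestamp}.json convention."""
    schedule_id = schedule["schedule_id"]  # e.g. "KMRL-2025-10-25" (assumed key)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(base_dir) / f"{schedule_id}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "schedule": schedule,
        "metadata": metadata,
        "saved_at": datetime.now().isoformat(timespec="seconds"),
    }
    path.write_text(json.dumps(record, indent=2))
    return path
```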
+
422
+ **Size per File**: ~48 KB
423
+
424
+ ---
425
+
426
+ ### Model Storage (Pickle Files)
427
+
428
+ **Location**: `models/`
429
+
430
+ **Files**:
431
+ 1. `models_latest.pkl` - Current ensemble (all 5 models)
432
+ 2. `models_{timestamp}.pkl` - Historical snapshots
433
+ 3. `training_history.json` - Training metrics log
434
+
435
+ **Model File Contents**:
436
+ ```python
437
+ {
438
+ "models": {
439
+ "gradient_boosting": GradientBoostingRegressor(),
440
+ "random_forest": RandomForestRegressor(),
441
+ "xgboost": XGBRegressor(),
442
+ "lightgbm": LGBMRegressor(),
443
+ "catboost": CatBoostRegressor()
444
+ },
445
+ "ensemble_weights": {
446
+ "xgboost": 0.215,
447
+ "lightgbm": 0.208,
448
+ ...
449
+ },
450
+ "best_model_name": "xgboost",
451
+ "last_trained": datetime(2025, 10, 25, 4, 30),
452
+ "config": {
453
+ "version": "v1.0.0",
454
+ "features": [...],
455
+ "models_trained": [...]
456
+ }
457
+ }
458
+ ```
459
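Given this bundle layout, serving a prediction is a weighted average over the per-model outputs using the stored `ensemble_weights`. A minimal sketch, assuming each model exposes a scikit-learn-style `predict` method (function names are illustrative):

```python
import pickle


def load_bundle(path: str = "models/models_latest.pkl") -> dict:
    """Load the pickled ensemble bundle described above."""
    with open(path, "rb") as f:
        return pickle.load(f)


def ensemble_predict(bundle: dict, features) -> float:
    """Weighted average of per-model predictions via the stored ensemble_weights."""
    total = 0.0
    for name, model in bundle["models"].items():
        weight = bundle["ensemble_weights"].get(name, 0.0)
        total += weight * float(model.predict([features])[0])
    return total
```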
+
460
+ **Size**: ~15-25 MB (all 5 models combined)
461
+
462
+ ---
463
+
464
+ ### Training History (JSON)
465
+
466
+ **Location**: `models/training_history.json`
467
+
468
+ **Structure**:
469
+ ```json
470
+ [
471
+ {
472
+ "timestamp": "2025-10-23T12:00:00",
473
+ "metrics": {
474
+ "gradient_boosting": {
475
+ "train_r2": 0.8912,
476
+ "test_r2": 0.8234,
477
+ "test_rmse": 13.45
478
+ },
479
+ ...
480
+ },
481
+ "best_model": "xgboost",
482
+ "ensemble_weights": {...},
483
+ "config": {
484
+ "models_trained": [...],
485
+ "version": "v1.0.0"
486
+ }
487
+ },
488
+ ...
489
+ ]
490
+ ```
491
+
492
+ **Growth**: ~1 KB per training run
493
+
494
+ **Retention**: All training runs (pruned after 1000 entries)
495
+
496
+ ---
497
+
498
+ ## Data Volume & Storage
499
+
500
+ ### Production Estimates
501
+
502
+ #### Daily Operations
503
+
504
+ **Per Day** (single schedule generation):
505
+ - 1 schedule file: ~48 KB
506
+ - API request/response: ~50 KB total
507
+ - Logs: ~10 KB
508
+
509
+ **Total per day**: ~108 KB
510
+
511
+ #### Monthly Operations (30 days)
512
+
513
+ **Schedule files**:
514
+ - 30 schedules Γ— 48 KB = 1.44 MB
515
+
516
+ **Model files**:
517
+ - 1 retraining every 48 hours = 15 retrainings/month (30 days / 2)
518
+ - 15 Γ— 25 MB = 375 MB
519
+
520
+ **Training history**:
521
+ - 15 entries Γ— 1 KB = 15 KB
522
+
523
+ **Total per month**: ~377 MB
524
+
525
+ #### Annual Storage (1 year)
526
+
527
+ **Schedule data**:
528
+ - 365 schedules Γ— 48 KB = 17.5 MB
529
+
530
+ **Model snapshots**:
531
+ - 182 retrainings Γ— 25 MB = 4.55 GB
532
+
533
+ **Training history**:
534
+ - 182 KB
535
+
536
+ **Total per year**: ~4.57 GB
537
+
538
+ **With retention policy** (keep last 100 schedules, 50 models):
539
+ - Schedules: 100 Γ— 48 KB = 4.8 MB
540
+ - Models: 50 Γ— 25 MB = 1.25 GB
541
+ - History: 182 KB
542
+
543
+ **Total with retention**: ~1.26 GB
544
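The retention total follows directly from the per-artifact sizes; as a quick sanity check (decimal units, matching the estimates above):

```python
KB, MB, GB = 1, 1000, 1000 ** 2  # work in kilobytes, decimal units

schedules = 100 * 48 * KB   # last 100 schedules at ~48 KB each
models = 50 * 25 * MB       # last 50 model snapshots at ~25 MB each
history = 182 * KB          # one ~1 KB entry per retraining

total_kb = schedules + models + history
print(f"~{total_kb / GB:.2f} GB with retention policy")
```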
+
545
+ ---
546
+
547
+ ### ML Training Data Requirements
548
+
549
+ #### Minimum Training Dataset
550
+
551
+ **Initial training**: 100 schedules
552
+ - Storage: 100 Γ— 48 KB = 4.8 MB
553
+ - Generation time: ~15 minutes (automated)
554
+ - Training time: 5-10 minutes
555
+
556
+ **Optimal training**: 500 schedules
557
+ - Storage: 500 Γ— 48 KB = 24 MB
558
+ - Provides better generalization
559
+ - Covers more edge cases
560
+
561
+ #### Feature Matrix Size
562
+
563
+ **Per schedule**: 10 features Γ— 8 bytes (float64) = 80 bytes
564
+
565
+ **Training set** (100 schedules):
566
+ - Features (X): 100 Γ— 80 bytes = 8 KB
567
+ - Target (y): 100 Γ— 8 bytes = 800 bytes
568
+ - Total: ~9 KB (minimal)
569
+
570
+ **Full dataset** (1000 schedules):
571
+ - Features: 80 KB
572
+ - Target: 8 KB
573
+ - Total: ~88 KB
574
+
575
+ **Memory during training**:
576
+ - Dataset: ~88 KB
577
+ - Models (5 Γ— ~5 MB): ~25 MB
578
+ - Working memory: ~50 MB
579
+ - **Total**: ~75 MB
580
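The feature-matrix figures are easy to verify with NumPy, since each schedule contributes one float64 row of 10 features (80 bytes):

```python
import numpy as np

# Full 1000-schedule dataset: one float64 feature vector per schedule.
X = np.zeros((1000, 10), dtype=np.float64)
y = np.zeros(1000, dtype=np.float64)

print(X.nbytes)  # 80,000 bytes = ~80 KB of features
print(y.nbytes)  # 8,000 bytes  = ~8 KB of targets
```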
+
581
+ ---
582
+
583
+ ### Optimization Service Resource Usage
584
+
585
+ #### OR-Tools Optimization
586
+
587
+ **Input data**:
588
+ - 30 trains Γ— 1.5 KB = 45 KB
589
+ - 25 stations Γ— 200 bytes = 5 KB
590
+ - Constraints: ~10 KB
591
+ - **Total input**: ~60 KB
592
+
593
+ **Memory usage**:
594
+ - Solver state: ~10 MB
595
+ - Solution space: ~20 MB
596
+ - **Peak memory**: ~30 MB
597
+
598
+ **Execution time**: 1-5 seconds (CPU-bound)
599
+
600
+ **CPU utilization**: 100% single core
601
+
602
+ ---
603
+
604
+ #### ML Ensemble Prediction
605
+
606
+ **Input data**:
607
+ - Feature vector: 10 Γ— 8 bytes = 80 bytes
608
+ - **Total input**: < 1 KB
609
+
610
+ **Memory usage**:
611
+ - Loaded models: ~25 MB (shared)
612
+ - Prediction workspace: ~1 MB
613
+ - **Peak memory**: ~26 MB
614
+
615
+ **Execution time**: 50-100 milliseconds
616
+
617
+ **CPU utilization**: 20-30% single core
618
+
619
+ ---
620
+
621
+ #### Greedy Optimization
622
+
623
+ **Input data**: ~60 KB (same as OR-Tools)
624
+
625
+ **Memory usage**:
626
+ - State tracking: ~5 MB
627
+ - Priority queue: ~2 MB
628
+ - **Peak memory**: ~7 MB
629
+
630
+ **Execution time**: < 1 second
631
+
632
+ **CPU utilization**: 50-70% single core
633
+
634
+ ---
635
+
636
+ ## Service Resource Usage
637
+
638
+ ### DataService (FastAPI)
639
+
640
+ **Base memory**: 150 MB (Python + FastAPI + dependencies)
641
+
642
+ **Per request overhead**: ~10 MB
643
+
644
+ **Concurrent requests** (typical): 1-5
645
+
646
+ **Total memory** (under load): 200-250 MB
647
+
648
+ **Disk I/O**:
649
+ - Read: Minimal (configuration only)
650
+ - Write: ~50 KB per schedule generated
651
+
652
+ **Network**:
653
+ - Inbound: ~150 bytes (request)
654
+ - Outbound: ~50 KB (response)
655
+
656
+ ---
657
+
658
+ ### SelfTrainService
659
+
660
+ **Base memory**: 200 MB (Python + ML libraries)
661
+
662
+ **During training**:
663
+ - Dataset loading: +20 MB
664
+ - Model training: +100 MB (peak)
665
+ - **Total during training**: ~320 MB
666
+
667
+ **During inference** (loaded models):
668
+ - Models in memory: +25 MB
669
+ - **Total during inference**: ~225 MB
670
+
671
+ **Disk I/O**:
672
+ - Read: 5 MB (load schedules)
673
+ - Write: 25 MB (save models)
674
+
675
+ **Frequency**:
676
+ - Training: Every 48 hours
677
+ - Inference: Per schedule request (if confidence β‰₯ 75%)
678
+
679
+ ---
680
+
681
+ ### Retraining Service (Background)
682
+
683
+ **Memory**: ~50 MB (idle), ~320 MB (during training)
684
+
685
+ **CPU**:
686
+ - Idle: < 1%
687
+ - Training: 100% (5-10 minutes every 48 hours)
688
+
689
+ **Disk I/O**:
690
+ - Check interval: Every 60 minutes
691
+ - Read: ~1 MB (check schedule count)
692
+ - Write: ~25 MB (when retraining)
693
+
694
+ ---
695
+
696
+ ## Data Flow Summary
697
+
698
+ ### Schedule Generation Request
699
+
700
+ ```
701
+ Client Request (150 bytes)
702
+ ↓
703
+ FastAPI Parser (~1 KB in memory)
704
+ ↓
705
+ Feature Extraction (80 bytes)
706
+ ↓
707
+ ML Prediction (25 MB models loaded) OR OR-Tools (30 MB solver)
708
+ ↓
709
+ Schedule Generation (45 KB output)
710
+ ↓
711
+ JSON Serialization (~50 KB response)
712
+ ↓
713
+ Storage (48 KB file)
714
+ ```
715
+
716
+ **Total data processed**: ~50 KB per request
717
+
718
+ **Response time**: 0.1-5 seconds
719
+
720
+ ---
721
+
722
+ ### Model Training Cycle
723
+
724
+ ```
725
+ Load Schedules (100 Γ— 48 KB = 4.8 MB)
726
+ ↓
727
+ Extract Features (100 Γ— 80 bytes = 8 KB)
728
+ ↓
729
+ Train 5 Models (5-10 minutes, 100% CPU)
730
+ ↓
731
+ Save Models (25 MB pickle file)
732
+ ↓
733
+ Update History (1 KB append)
734
+ ```
735
+
736
+ **Total data processed**: ~30 MB
737
+
738
+ **Frequency**: Every 48 hours
739
+
740
+ ---
741
+
742
+ ## Configuration Data
743
+
744
+ ### Service Configuration
745
+
746
+ **Location**: `SelfTrainService/config.py`
747
+
748
+ **Size**: ~5 KB
749
+
750
+ **Key Parameters**:
751
+ ```python
752
+ {
753
+ "RETRAIN_INTERVAL_HOURS": 48,
754
+ "MIN_SCHEDULES_FOR_TRAINING": 100,
755
+ "MODEL_TYPES": ["gradient_boosting", "xgboost", ...],
756
+ "USE_ENSEMBLE": True,
757
+ "ML_CONFIDENCE_THRESHOLD": 0.75,
758
+ "FEATURES": [10 feature names],
759
+ "EPOCHS": 100,
760
+ "LEARNING_RATE": 0.001
761
+ }
762
+ ```
763
+
764
+ ---
765
+
766
+ ## Data Retention Policies
767
+
768
+ ### Recommended Retention
769
+
770
+ **Schedule files**:
771
+ - Keep last 365 days (17.5 MB)
772
+ - Archive older to compressed storage
773
+
774
+ **Model snapshots**:
775
+ - Keep last 50 models (~1.25 GB)
776
+ - Delete older snapshots
777
+ - Keep 1 model per month for historical reference
778
+
779
+ **Training history**:
780
+ - Keep all entries (grows slowly)
781
+ - Compress after 1000 entries
782
+
783
+ **Logs**:
784
+ - Application logs: 30 days
785
+ - Error logs: 90 days
786
+ - Audit logs: 1 year
787
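A retention policy like the one above can be enforced with a small pruning helper that keeps only the newest N files. A sketch (function name and limits are illustrative of the recommendations, not the service's actual code):

```python
from pathlib import Path


def prune_old_files(directory: str, pattern: str, keep: int) -> list:
    """Delete all but the newest `keep` files matching `pattern` (newest by mtime)."""
    files = sorted(Path(directory).glob(pattern),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    removed = files[keep:]
    for path in removed:
        path.unlink()
    return removed


# Applying the recommended limits might look like:
# prune_old_files("data/schedules", "*.json", keep=365)
# prune_old_files("models", "models_*.pkl", keep=50)
```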
+
788
+ ---
789
+
790
+ ## Scaling Considerations
791
+
792
+ ### Horizontal Scaling
793
+
794
+ **API Service** (DataService):
795
+ - Stateless - easy to scale
796
+ - Load balancer distributes requests
797
+ - Each instance: ~250 MB memory
798
+
799
+ **ML Service** (SelfTrainService):
800
+ - Share model files via NFS/S3
801
+ - Only one instance should train (avoid conflicts)
802
+ - Multiple instances can serve predictions
803
+
804
+ ### Vertical Scaling
805
+
806
+ **Memory requirements**:
807
+ - Minimum: 1 GB RAM
808
+ - Recommended: 2 GB RAM
809
+ - Optimal: 4 GB RAM (allows concurrent training + serving)
810
+
811
+ **CPU requirements**:
812
+ - Minimum: 1 core
813
+ - Recommended: 2 cores (1 for API, 1 for training)
814
+ - Optimal: 4 cores (parallel model training)
815
+
816
+ **Storage requirements**:
817
+ - Minimum: 5 GB
818
+ - Recommended: 20 GB
819
+ - Optimal: 50 GB (1-year retention)
820
+
821
+ ---
822
+
823
+ ## Performance Benchmarks
824
+
825
+ ### Schedule Generation Performance
826
+
827
+ | Fleet Size | Algorithm | Time | Memory | Output Size |
828
+ |------------|-----------|------|--------|-------------|
829
+ | 25 trains | ML | 0.08s | 225 MB | 38 KB |
830
+ | 30 trains | ML | 0.10s | 225 MB | 45 KB |
831
+ | 40 trains | ML | 0.12s | 225 MB | 60 KB |
832
+ | 25 trains | OR-Tools | 1.2s | 30 MB | 38 KB |
833
+ | 30 trains | OR-Tools | 2.8s | 30 MB | 45 KB |
834
+ | 40 trains | OR-Tools | 4.5s | 30 MB | 60 KB |
835
+ | 25 trains | Greedy | 0.3s | 7 MB | 38 KB |
836
+ | 30 trains | Greedy | 0.5s | 7 MB | 45 KB |
837
+ | 40 trains | Greedy | 0.8s | 7 MB | 60 KB |
838
+
839
+ ### Training Performance
840
+
841
+ | Dataset Size | Training Time | Memory | Model Size |
842
+ |--------------|---------------|--------|------------|
843
+ | 100 schedules | 3 min | 320 MB | 20 MB |
844
+ | 500 schedules | 8 min | 350 MB | 24 MB |
845
+ | 1000 schedules | 15 min | 400 MB | 28 MB |
846
+
847
+ ---
848
+
849
+ **Document Version**: 1.0.0
850
+ **Last Updated**: November 2, 2025
851
+ **Maintained By**: ML-Service Team
docs/integrate.md DELETED
File without changes