# Data Schemas & Service Specifications

## Overview

This document details all data structures, schemas, API contracts, and data volume specifications for the Metro Train Scheduling Service.

---

## Table of Contents

1. [Core Data Models](#core-data-models)
2. [API Schemas](#api-schemas)
3. [Database Schemas](#database-schemas)
4. [Data Volume & Storage](#data-volume--storage)
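5. [Service Resource Usage](#service-resource-usage)
6. [Data Flow Summary](#data-flow-summary)
7. [Configuration Data](#configuration-data)
8. [Data Retention Policies](#data-retention-policies)
9. [Scaling Considerations](#scaling-considerations)
10. [Performance Benchmarks](#performance-benchmarks)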

---

## Core Data Models

All models use **Pydantic v2** for validation and serialization.

### 1. DaySchedule

**Purpose**: Complete daily schedule with all trainset assignments

```python
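# Imports shared by the model definitions in this section
from datetime import datetime
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel

class DaySchedule(BaseModel):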
    schedule_id: str                    # "KMRL-2025-10-25"
    date: str                           # "2025-10-25"
    route: Route                        # Route details
    trainsets: List[Trainset]           # All train assignments
    fleet_summary: FleetSummary         # Fleet statistics
    optimization_metrics: OptimizationMetrics
    alerts: List[Alert]                 # Warnings/issues
    generated_at: datetime
    generated_by: str = "ML-Optimizer"
```

**Size**: ~45 KB per schedule (30 trains, full day)

**Example**:
```json
{
  "schedule_id": "KMRL-2025-10-25",
  "date": "2025-10-25",
  "route": {...},
  "trainsets": [...],
  "fleet_summary": {
    "total_trainsets": 30,
    "in_service": 24,
    "standby": 4,
    "maintenance": 2
  },
  "optimization_metrics": {
    "total_service_blocks": 125,
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12
  },
  "generated_at": "2025-10-25T04:30:00+05:30"
}
```

---

### 2. Trainset

**Purpose**: Individual train assignment and status

```python
class Trainset(BaseModel):
    trainset_id: str                    # "TS-001"
    status: TrainHealthStatus           # REVENUE_SERVICE, STANDBY, etc.
    depot_bay: str                      # "BAY-01"
    cumulative_km: int                  # 145250
    readiness_score: float              # 0.0-1.0
    service_blocks: List[ServiceBlock]  # Trip assignments
    fitness_certificates: FitnessCertificates
    job_cards: JobCards
    branding: Branding
```

**Size**: ~1.5 KB per trainset

**Status Enum**:
```python
class TrainHealthStatus(str, Enum):
    REVENUE_SERVICE = "REVENUE_SERVICE"  # Active service
    STANDBY = "STANDBY"                  # Ready, not assigned
    MAINTENANCE = "MAINTENANCE"          # Under repair
    SCHEDULED_MAINTENANCE = "SCHEDULED_MAINTENANCE"
    UNAVAILABLE = "UNAVAILABLE"          # Out of service
```

**Distribution** (typical 30-train fleet):
- REVENUE_SERVICE: 22-24 trains (73-80%)
- STANDBY: 3-5 trains (10-17%)
- MAINTENANCE: 1-3 trains (3-10%)
- UNAVAILABLE: 0-2 trains (0-7%)

---

### 3. ServiceBlock

**Purpose**: Single trip assignment for a train

```python
class ServiceBlock(BaseModel):
    block_id: str                       # "BLK-001-01"
    start_time: str                     # "05:00"
    end_time: str                       # "05:45"
    start_station: str                  # "Aluva"
    end_station: str                    # "Pettah"
    direction: str                      # "UP" or "DOWN"
    distance_km: float                  # 25.612
    estimated_passengers: Optional[int] # 450
    priority: str = "NORMAL"            # NORMAL, HIGH, PEAK
```

**Size**: ~250 bytes per service block

**Daily Trips per Train**: 
- Peak service train: 6-8 trips
- Standard service: 4-6 trips
- Average: ~5.2 trips per active train

**Total Service Blocks** (30-train fleet):
- 24 active trains × 5.2 trips = ~125 service blocks/day

---

### 4. Route

**Purpose**: Metro line configuration

```python
class Route(BaseModel):
    route_id: str                       # "KMRL-LINE-01"
    name: str                           # "Aluva-Pettah Line"
    stations: List[Station]             # 25 stations
    total_distance_km: float            # 25.612 km
    avg_speed_kmh: int                  # 32-38 km/h
    turnaround_time_minutes: int        # 8-12 minutes
```

**KMRL Route Details**:
- **Stations**: 25 (Aluva to Pettah)
- **Distance**: 25.612 km
- **Average Speed**: 35 km/h
- **One-way Time**: ~44 minutes
- **Round Trip**: ~104-112 minutes (two ~44-minute runs plus two turnarounds of 8-12 minutes)
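
The timing figures above follow from the route distance and average speed. A minimal sketch of the arithmetic, assuming one turnaround at each terminus per round trip (the function names are illustrative and not part of the service):

```python
def one_way_minutes(distance_km: float, avg_speed_kmh: float) -> float:
    """Travel time for one end-to-end run, in minutes."""
    return distance_km / avg_speed_kmh * 60


def round_trip_minutes(distance_km: float, avg_speed_kmh: float,
                       turnaround_minutes: float) -> float:
    """Two one-way runs plus one turnaround at each terminus (assumed)."""
    return 2 * one_way_minutes(distance_km, avg_speed_kmh) + 2 * turnaround_minutes


# KMRL values from this section: 25.612 km at 35 km/h
one_way = one_way_minutes(25.612, 35)                # about 43.9 minutes
shortest_loop = round_trip_minutes(25.612, 35, 8)    # about 103.8 minutes
```

With the 12-minute upper bound on turnaround time, the same calculation gives roughly 112 minutes.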

---

### 5. Station

**Purpose**: Individual station on route

```python
class Station(BaseModel):
    station_id: str                     # "STN-001"
    name: str                           # "Aluva"
    code: str                           # "ALV"
    distance_from_start_km: float       # 0.0
    platform_count: int                 # 2
    facilities: List[str]               # ["PARKING", "ELEVATOR"]
```

**Size**: ~200 bytes per station

**Total Stations**: 25 (fixed)

---

### 6. FitnessCertificates

**Purpose**: Regulatory compliance tracking

```python
class CertificateStatus(str, Enum):
    VALID = "VALID"                     # > 30 days remaining
    EXPIRING_SOON = "EXPIRING_SOON"     # 7-30 days remaining
    EXPIRED = "EXPIRED"                 # Past expiry date

class FitnessCertificate(BaseModel):
    valid_until: str                    # "2025-12-31"
    status: CertificateStatus           # VALID, EXPIRING_SOON, EXPIRED

# Defined last so the types it references already exist
class FitnessCertificates(BaseModel):
    rolling_stock: FitnessCertificate   # Train body/chassis
    signalling: FitnessCertificate      # Signal systems
    telecom: FitnessCertificate         # Communication systems
```

**Validation Rules**:
- Trains with EXPIRED certificates: status = UNAVAILABLE
- Trains with EXPIRING_SOON: flagged in alerts, can operate
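
The status windows and validation rules above could be derived from `valid_until` roughly as follows. This is a sketch, not the service's implementation: `classify_certificate` and `can_operate` are hypothetical helpers, and because the documented windows do not cover 0-6 days remaining, that range is assumed here to count as `EXPIRING_SOON`.

```python
from datetime import date
from enum import Enum
from typing import List


class CertificateStatus(str, Enum):
    VALID = "VALID"
    EXPIRING_SOON = "EXPIRING_SOON"
    EXPIRED = "EXPIRED"


def classify_certificate(valid_until: str, today: date) -> CertificateStatus:
    """Map an ISO expiry date to a status relative to `today`."""
    days_remaining = (date.fromisoformat(valid_until) - today).days
    if days_remaining < 0:
        return CertificateStatus.EXPIRED
    if days_remaining <= 30:
        # Assumption: 0-6 days remaining is also treated as EXPIRING_SOON.
        return CertificateStatus.EXPIRING_SOON
    return CertificateStatus.VALID


def can_operate(statuses: List[CertificateStatus]) -> bool:
    """A train with any EXPIRED certificate is UNAVAILABLE."""
    return CertificateStatus.EXPIRED not in statuses


today = date(2025, 10, 25)
statuses = [
    classify_certificate("2025-12-31", today),  # 67 days left
    classify_certificate("2025-11-10", today),  # 16 days left
    classify_certificate("2025-10-24", today),  # expired yesterday
]
```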

---

### 7. JobCards & Maintenance

**Purpose**: Maintenance tracking

```python
class JobCards(BaseModel):
    open: int                           # Number of open job cards
    blocking: List[str]                 # Critical issues: ["BRAKE_FAULT"]

# Example maintenance reasons
UNAVAILABLE_REASONS = [
    "SCHEDULED_MAINTENANCE",
    "BRAKE_SYSTEM_REPAIR",
    "HVAC_REPLACEMENT",
    "BOGIE_OVERHAUL",
    "ELECTRICAL_FAULT",
    "ACCIDENT_DAMAGE",
    "PANTOGRAPH_REPAIR",
    "DOOR_SYSTEM_FAULT"
]
```

**Impact on Scheduling**:
- 0 open cards: readiness = 1.0
- 1-2 cards: readiness = 0.9
- 3-4 cards: readiness = 0.7
- 5+ cards: readiness = 0.5, likely maintenance status
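
The mapping above is a simple step function. A minimal sketch, where `readiness_from_open_cards` is an illustrative name and the real service may combine this value with other factors:

```python
def readiness_from_open_cards(open_cards: int) -> float:
    """Return the readiness value for a given number of open job cards."""
    if open_cards < 0:
        raise ValueError("open_cards must be non-negative")
    if open_cards == 0:
        return 1.0
    if open_cards <= 2:
        return 0.9
    if open_cards <= 4:
        return 0.7
    return 0.5  # 5 or more cards: likely maintenance status
```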

---

### 8. Branding

**Purpose**: Advertisement tracking

```python
class Branding(BaseModel):
    advertiser: str                     # "COCACOLA-2024"
    contract_hours_remaining: int       # 450 hours
    exposure_priority: str              # LOW, MEDIUM, HIGH, CRITICAL

# Available advertisers
ADVERTISERS = [
    "COCACOLA-2024",
    "FLIPKART-FESTIVE",
    "AMAZON-PRIME",
    "RELIANCE-JIO",
    "TATA-MOTORS",
    "SAMSUNG-GALAXY",
    "NONE"
]
```

**Priority Weights** (for optimization):
- CRITICAL: 4 points
- HIGH: 3 points
- MEDIUM: 2 points
- LOW: 1 point
- NONE (no advertiser): 0 points

**Scheduling Strategy**:
- HIGH/CRITICAL branded trains prioritized for peak hours
- Maximizes advertiser visibility during high-traffic periods
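
One way to apply the priority weights is to rank trainsets by weight before assigning peak-hour blocks. The sketch below is an assumption about how the ordering could work, not the optimizer's actual logic; `rank_for_peak` and the plain-dict input shape are hypothetical.

```python
from typing import Dict, List

PRIORITY_WEIGHTS: Dict[str, int] = {
    "CRITICAL": 4,
    "HIGH": 3,
    "MEDIUM": 2,
    "LOW": 1,
    "NONE": 0,
}


def rank_for_peak(trainsets: List[Dict[str, str]]) -> List[str]:
    """Order trainset IDs by branding weight, highest first.

    sorted() is stable, so trainsets with equal weight keep their input order.
    """
    ranked = sorted(
        trainsets,
        key=lambda t: PRIORITY_WEIGHTS[t["exposure_priority"]],
        reverse=True,
    )
    return [t["trainset_id"] for t in ranked]


fleet = [
    {"trainset_id": "TS-001", "exposure_priority": "LOW"},
    {"trainset_id": "TS-002", "exposure_priority": "CRITICAL"},
    {"trainset_id": "TS-003", "exposure_priority": "HIGH"},
    {"trainset_id": "TS-004", "exposure_priority": "HIGH"},
]
peak_order = rank_for_peak(fleet)
```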

---

### 9. FleetSummary

**Purpose**: Aggregated fleet statistics

```python
class FleetSummary(BaseModel):
    total_trainsets: int                # 30
    in_service: int                     # 24
    standby: int                        # 4
    maintenance: int                    # 2
    unavailable: int                    # 0
    availability_percent: float         # 93.33
    total_mileage_today: int            # 3200 km
    avg_trips_per_train: float          # 5.2
```

**Size**: ~300 bytes

**Key Metrics**:
- **Availability %**: (in_service + standby) / total × 100
- **Target Availability**: ≥ 90%
- **Service Ratio**: in_service / (in_service + standby)
- **Target Service Ratio**: 85-90%
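
The two ratios can be checked against the example fleet (24 in service, 4 standby, 30 total). The helper names below are illustrative only:

```python
def availability_percent(in_service: int, standby: int, total: int) -> float:
    """(in_service + standby) / total x 100."""
    return (in_service + standby) / total * 100


def service_ratio(in_service: int, standby: int) -> float:
    """Share of available trains that are actually in service."""
    return in_service / (in_service + standby)


availability = availability_percent(24, 4, 30)  # meets the >= 90% target
ratio = service_ratio(24, 4)                    # inside the 85-90% target band
```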

---

### 10. OptimizationMetrics

**Purpose**: Optimization quality measures

```python
class OptimizationMetrics(BaseModel):
    total_service_blocks: int           # 125
    avg_readiness_score: float          # 0.87
    mileage_variance_coefficient: float # 0.12
    branding_sla_compliance: float      # 0.95
    fitness_expiry_violations: int      # 0
    execution_time_ms: int              # 1250
    algorithm_used: str                 # "ensemble_ml", "or_tools", or "greedy"
    confidence_score: Optional[float]   # 0.89 (if ML used)
```

**Size**: ~250 bytes

**Quality Thresholds**:
- avg_readiness_score: ≥ 0.80
- mileage_variance_coefficient: < 0.15
- branding_sla_compliance: ≥ 0.90
- fitness_expiry_violations: 0
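
A schedule can be checked against these thresholds with a small helper. This is a sketch under the assumption that metrics arrive as a plain dict keyed by the field names above; `quality_violations` is a hypothetical function.

```python
from typing import Dict, List


def quality_violations(metrics: Dict[str, float]) -> List[str]:
    """Return the names of the metrics that miss their threshold."""
    violations: List[str] = []
    if metrics["avg_readiness_score"] < 0.80:
        violations.append("avg_readiness_score")
    if metrics["mileage_variance_coefficient"] >= 0.15:
        violations.append("mileage_variance_coefficient")
    if metrics["branding_sla_compliance"] < 0.90:
        violations.append("branding_sla_compliance")
    if metrics["fitness_expiry_violations"] != 0:
        violations.append("fitness_expiry_violations")
    return violations


example = {
    "avg_readiness_score": 0.87,
    "mileage_variance_coefficient": 0.12,
    "branding_sla_compliance": 0.95,
    "fitness_expiry_violations": 0,
}
issues = quality_violations(example)  # empty: every threshold is met
```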

---

## API Schemas

### Request: ScheduleRequest

**Endpoint**: `POST /api/v1/generate`

```python
class ScheduleRequest(BaseModel):
    date: str                           # "2025-10-25"
    num_trains: int = 25                # 25-40
    num_stations: int = 25              # Fixed for KMRL
    min_service_trains: int = 22        # Minimum active
    min_standby_trains: int = 3         # Minimum backup
    
    # Optional overrides
    peak_hours: Optional[List[int]] = None  # [7,8,9,17,18,19]
    force_optimization: bool = False    # If True, skip ML and use OR-Tools
```

**Size**: ~150 bytes per request

**Validation**:
- `num_trains`: 25 ≤ n ≤ 40
- `num_stations`: Fixed at 25 (KMRL specific)
- `min_service_trains`: ≤ num_trains - 3
- `min_standby_trains`: ≥ 2

**Example**:
```json
{
  "date": "2025-10-25",
  "num_trains": 30,
  "num_stations": 25,
  "min_service_trains": 24,
  "min_standby_trains": 4
}
```
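
The validation rules above can be sketched as a plain function that returns errors in the same shape as the `details` object of a 400 response. This is illustrative only: the service presumably enforces these rules through Pydantic, `validate_schedule_request` is a hypothetical name, and every message except the `num_trains` one is invented for the example.

```python
from typing import Any, Dict


def validate_schedule_request(req: Dict[str, Any]) -> Dict[str, str]:
    """Return a field -> message map; an empty dict means the request is valid."""
    errors: Dict[str, str] = {}
    num_trains = req.get("num_trains", 25)
    if not 25 <= num_trains <= 40:
        errors["num_trains"] = "Must be between 25 and 40"
    if req.get("num_stations", 25) != 25:
        errors["num_stations"] = "Must be 25"  # hypothetical message
    if req.get("min_service_trains", 22) > num_trains - 3:
        errors["min_service_trains"] = "Must be at most num_trains - 3"  # hypothetical
    if req.get("min_standby_trains", 3) < 2:
        errors["min_standby_trains"] = "Must be at least 2"  # hypothetical
    return errors


request = {
    "date": "2025-10-25",
    "num_trains": 30,
    "num_stations": 25,
    "min_service_trains": 24,
    "min_standby_trains": 4,
}
errors = validate_schedule_request(request)  # the example request passes
```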

---

### Response: DaySchedule

**Status**: 200 OK

**Content-Type**: application/json

**Size**: ~38-60 KB for 25-40 trains (~45 KB for a 30-train fleet)

**Headers**:
```
X-Algorithm-Used: ensemble_ml | or_tools | greedy
X-Confidence-Score: 0.89 (if ML)
X-Execution-Time-Ms: 1250
```

---

### Error Responses

**400 Bad Request**:
```json
{
  "error": "Validation Error",
  "details": {
    "num_trains": "Must be between 25 and 40"
  }
}
```

**500 Internal Server Error**:
```json
{
  "error": "Optimization Failed",
  "message": "Unable to find feasible schedule",
  "timestamp": "2025-10-25T10:30:00Z"
}
```

---

## Database Schemas

### Schedule Storage (JSON Files)

**Location**: `data/schedules/`

**Naming**: `{schedule_id}_{timestamp}.json`

**Example**: `KMRL-2025-10-25_20251025_043000.json`
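
The naming pattern can be reproduced with `strftime`. A minimal sketch, assuming the timestamp component is the schedule's generation time formatted as `YYYYMMDD_HHMMSS` (inferred from the example file name; `schedule_filename` is a hypothetical helper):

```python
from datetime import datetime


def schedule_filename(schedule_id: str, generated_at: datetime) -> str:
    """Build `{schedule_id}_{timestamp}.json` for a stored schedule."""
    return f"{schedule_id}_{generated_at.strftime('%Y%m%d_%H%M%S')}.json"


name = schedule_filename("KMRL-2025-10-25", datetime(2025, 10, 25, 4, 30, 0))
```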

**Structure**:
```json
{
  "schedule": {DaySchedule},
  "metadata": {
    "recorded_at": "2025-10-25T04:30:00",
    "quality_score": 87.5,
    "algorithm_used": "ensemble_ml",
    "confidence": 0.89
  },
  "saved_at": "2025-10-25T04:30:15"
}
```

**Size per File**: ~48 KB

---

### Model Storage (Pickle Files)

**Location**: `models/`

**Files**:
1. `models_latest.pkl` - Current ensemble (all 5 models)
2. `models_{timestamp}.pkl` - Historical snapshots
3. `training_history.json` - Training metrics log

**Model File Contents**:
```python
{
    "models": {
        "gradient_boosting": GradientBoostingRegressor(),
        "random_forest": RandomForestRegressor(),
        "xgboost": XGBRegressor(),
        "lightgbm": LGBMRegressor(),
        "catboost": CatBoostRegressor()
    },
    "ensemble_weights": {
        "xgboost": 0.215,
        "lightgbm": 0.208,
        ...
    },
    "best_model_name": "xgboost",
    "last_trained": datetime(2025, 10, 25, 4, 30),
    "config": {
        "version": "v1.0.0",
        "features": [...],
        "models_trained": [...]
    }
}
```

**Size**: ~15-25 MB (all 5 models combined)

---

### Training History (JSON)

**Location**: `models/training_history.json`

**Structure**:
```json
[
  {
    "timestamp": "2025-10-23T12:00:00",
    "metrics": {
      "gradient_boosting": {
        "train_r2": 0.8912,
        "test_r2": 0.8234,
        "test_rmse": 13.45
      },
      ...
    },
    "best_model": "xgboost",
    "ensemble_weights": {...},
    "config": {
      "models_trained": [...],
      "version": "v1.0.0"
    }
  },
  ...
]
```

**Growth**: ~1 KB per training run

**Retention**: All training runs (pruned after 1000 entries)

---

## Data Volume & Storage

### Production Estimates

#### Daily Operations

**Per Day** (single schedule generation):
- 1 schedule file: ~48 KB
- API request/response: ~50 KB total
- Logs: ~10 KB

**Total per day**: ~108 KB

#### Monthly Operations (30 days)

**Schedule files**: 
- 30 schedules × 48 KB = 1.44 MB

**Model files**:
- 1 retraining every 48 hours = 15 retrainings/month
- 15 × 25 MB = 375 MB

**Training history**:
- 15 entries × 1 KB = 15 KB

**Total per month**: ~377 MB

#### Annual Storage (1 year)

**Schedule data**: 
- 365 schedules × 48 KB = 17.5 MB

**Model snapshots**:
- 182 retrainings × 25 MB = 4.55 GB

**Training history**: 
- 182 KB

**Total per year**: ~4.57 GB

**With retention policy** (keep last 100 schedules, 50 models):
- Schedules: 100 × 48 KB = 4.8 MB
- Models: 50 × 25 MB = 1.25 GB
- History: 182 KB

**Total with retention**: ~1.25 GB
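
The annual estimate without a retention policy can be reproduced from the per-item sizes. The sketch below uses decimal units (1 MB = 1000 KB), as the figures in this section do; the constant and function names are illustrative.

```python
from typing import Dict

SCHEDULE_KB = 48         # one stored schedule file
MODEL_MB = 25            # one model snapshot (upper estimate)
HISTORY_KB_PER_RUN = 1   # one training-history entry


def annual_storage_mb(days: int = 365, retrain_interval_hours: int = 48) -> Dict[str, float]:
    """Estimate storage per category, in MB, with no retention policy."""
    retrainings = days * 24 // retrain_interval_hours
    return {
        "retrainings": retrainings,
        "schedules_mb": days * SCHEDULE_KB / 1000,
        "models_mb": retrainings * MODEL_MB,
        "history_mb": retrainings * HISTORY_KB_PER_RUN / 1000,
    }


estimate = annual_storage_mb()
total_gb = (
    estimate["schedules_mb"] + estimate["models_mb"] + estimate["history_mb"]
) / 1000
```

Model snapshots dominate the total, so the retention count for models matters far more than the one for schedules.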

---

### ML Training Data Requirements

#### Minimum Training Dataset

**Initial training**: 100 schedules
- Storage: 100 × 48 KB = 4.8 MB
- Generation time: ~15 minutes (automated)
- Training time: 5-10 minutes

**Optimal training**: 500 schedules
- Storage: 500 × 48 KB = 24 MB
- Provides better generalization
- Covers more edge cases

#### Feature Matrix Size

**Per schedule**: 10 features × 8 bytes (float64) = 80 bytes

**Training set** (100 schedules):
- Features (X): 100 × 80 bytes = 8 KB
- Target (y): 100 × 8 bytes = 800 bytes
- Total: ~9 KB (minimal)

**Full dataset** (1000 schedules):
- Features: 80 KB
- Target: 8 KB
- Total: ~88 KB

**Memory during training**:
- Dataset: ~88 KB
- Models (5 × ~5 MB): ~25 MB
- Working memory: ~50 MB
- **Total**: ~75 MB

---

### Optimization Service Resource Usage

#### OR-Tools Optimization

**Input data**:
- 30 trains × 1.5 KB = 45 KB
- 25 stations × 200 bytes = 5 KB
- Constraints: ~10 KB
- **Total input**: ~60 KB

**Memory usage**:
- Solver state: ~10 MB
- Solution space: ~20 MB
- **Peak memory**: ~30 MB

**Execution time**: 1-5 seconds (CPU-bound)

**CPU utilization**: 100% single core

---

#### ML Ensemble Prediction

**Input data**:
- Feature vector: 10 × 8 bytes = 80 bytes
- **Total input**: < 1 KB

**Memory usage**:
- Loaded models: ~25 MB (shared)
- Prediction workspace: ~1 MB
- **Peak memory**: ~26 MB

**Execution time**: 50-100 milliseconds

**CPU utilization**: 20-30% single core

---

#### Greedy Optimization

**Input data**: ~60 KB (same as OR-Tools)

**Memory usage**:
- State tracking: ~5 MB
- Priority queue: ~2 MB
- **Peak memory**: ~7 MB

**Execution time**: < 1 second

**CPU utilization**: 50-70% single core

---

## Service Resource Usage

### DataService (FastAPI)

**Base memory**: 150 MB (Python + FastAPI + dependencies)

**Per request overhead**: ~10 MB

**Concurrent requests** (typical): 1-5

**Total memory** (under load): 200-250 MB

**Disk I/O**:
- Read: Minimal (configuration only)
- Write: ~50 KB per schedule generated

**Network**: 
- Inbound: ~150 bytes (request)
- Outbound: ~50 KB (response)

---

### SelfTrainService

**Base memory**: 200 MB (Python + ML libraries)

**During training**:
- Dataset loading: +20 MB
- Model training: +100 MB (peak)
- **Total during training**: ~320 MB

**During inference** (loaded models):
- Models in memory: +25 MB
- **Total during inference**: ~225 MB

**Disk I/O**:
- Read: 5 MB (load schedules)
- Write: 25 MB (save models)

**Frequency**:
- Training: Every 48 hours
- Inference: Per schedule request; the ML result is used only if confidence ≥ 0.75 (`ML_CONFIDENCE_THRESHOLD`)
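
The confidence gate could look like the following sketch. It assumes that a low-confidence or forced request falls back to OR-Tools; this document does not say when the greedy algorithm is chosen, so that path is left out. `choose_algorithm` is a hypothetical name.

```python
from typing import Optional

ML_CONFIDENCE_THRESHOLD = 0.75  # value from the service configuration


def choose_algorithm(confidence: Optional[float],
                     force_optimization: bool = False) -> str:
    """Pick the algorithm label for a schedule request."""
    if force_optimization:
        return "or_tools"  # caller asked to skip ML
    if confidence is not None and confidence >= ML_CONFIDENCE_THRESHOLD:
        return "ensemble_ml"
    return "or_tools"  # assumed fallback when ML confidence is too low
```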

---

### Retraining Service (Background)

**Memory**: ~50 MB (idle), ~320 MB (during training)

**CPU**: 
- Idle: < 1%
- Training: 100% (5-10 minutes every 48 hours)

**Disk I/O**:
- Check interval: Every 60 minutes
- Read: ~1 MB (check schedule count)
- Write: ~25 MB (when retraining)

---

## Data Flow Summary

### Schedule Generation Request

```
Client Request (150 bytes)
    ↓
FastAPI Parser (~1 KB in memory)
    ↓
Feature Extraction (80 bytes)
    ↓
ML Prediction (25 MB models loaded) OR OR-Tools (30 MB solver)
    ↓
Schedule Generation (45 KB output)
    ↓
JSON Serialization (~50 KB response)
    ↓
Storage (48 KB file)
```

**Total data processed**: ~50 KB per request

**Response time**: 0.1-5 seconds

---

### Model Training Cycle

```
Load Schedules (100 × 48 KB = 4.8 MB)
    ↓
Extract Features (100 × 80 bytes = 8 KB)
    ↓
Train 5 Models (5-10 minutes, 100% CPU)
    ↓
Save Models (25 MB pickle file)
    ↓
Update History (1 KB append)
```

**Total data processed**: ~30 MB

**Frequency**: Every 48 hours

---

## Configuration Data

### Service Configuration

**Location**: `SelfTrainService/config.py`

**Size**: ~5 KB

**Key Parameters**:
```python
{
  "RETRAIN_INTERVAL_HOURS": 48,
  "MIN_SCHEDULES_FOR_TRAINING": 100,
  "MODEL_TYPES": ["gradient_boosting", "xgboost", ...],
  "USE_ENSEMBLE": True,
  "ML_CONFIDENCE_THRESHOLD": 0.75,
  "FEATURES": [...],  # 10 feature names
  "EPOCHS": 100,
  "LEARNING_RATE": 0.001
}
```

---

## Data Retention Policies

### Recommended Retention

**Schedule files**:
- Keep last 365 days (17.5 MB)
- Archive older to compressed storage

**Model snapshots**:
- Keep last 50 models (~1.25 GB)
- Delete older snapshots, except 1 model per month kept for historical reference

**Training history**:
- Keep all entries (grows slowly)
- Compress after 1000 entries

**Logs**:
- Application logs: 30 days
- Error logs: 90 days
- Audit logs: 1 year
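
The model snapshot policy above (keep the last 50, plus one per month) can be sketched as a selection over snapshot timestamps. This is an illustration rather than the service's cleanup code: `snapshots_to_delete` is a hypothetical helper, and it assumes the monthly reference copy is the newest snapshot of each month.

```python
from datetime import datetime
from typing import List, Set, Tuple


def snapshots_to_delete(timestamps: List[datetime],
                        keep_last: int = 50) -> List[datetime]:
    """Return snapshot timestamps that the retention policy would remove."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    keep: Set[datetime] = set(ordered[:keep_last])

    # Assumption: the newest snapshot of each month is the monthly reference.
    seen_months: Set[Tuple[int, int]] = set()
    for ts in ordered:
        month = (ts.year, ts.month)
        if month not in seen_months:
            seen_months.add(month)
            keep.add(ts)

    return [ts for ts in ordered if ts not in keep]


snapshots = [
    datetime(2025, 1, 1), datetime(2025, 1, 3),
    datetime(2025, 2, 1), datetime(2025, 2, 3),
    datetime(2025, 3, 1), datetime(2025, 3, 3),
]
# keep_last=2 keeps both March snapshots; February 3 and January 3 survive
# as monthly references, so only the two older copies are removed.
removed = snapshots_to_delete(snapshots, keep_last=2)
```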

---

## Scaling Considerations

### Horizontal Scaling

**API Service** (DataService):
- Stateless - easy to scale
- Load balancer distributes requests
- Each instance: ~250 MB memory

**ML Service** (SelfTrainService):
- Share model files via NFS/S3
- Only one instance should train (avoid conflicts)
- Multiple instances can serve predictions

### Vertical Scaling

**Memory requirements**:
- Minimum: 1 GB RAM
- Recommended: 2 GB RAM
- Optimal: 4 GB RAM (allows concurrent training + serving)

**CPU requirements**:
- Minimum: 1 core
- Recommended: 2 cores (1 for API, 1 for training)
- Optimal: 4 cores (parallel model training)

**Storage requirements**:
- Minimum: 5 GB
- Recommended: 20 GB
- Optimal: 50 GB (1-year retention)

---

## Performance Benchmarks

### Schedule Generation Performance

| Fleet Size | Algorithm | Time | Memory | Output Size |
|------------|-----------|------|--------|-------------|
| 25 trains  | ML        | 0.08s | 225 MB | 38 KB |
| 30 trains  | ML        | 0.10s | 225 MB | 45 KB |
| 40 trains  | ML        | 0.12s | 225 MB | 60 KB |
| 25 trains  | OR-Tools  | 1.2s  | 30 MB  | 38 KB |
| 30 trains  | OR-Tools  | 2.8s  | 30 MB  | 45 KB |
| 40 trains  | OR-Tools  | 4.5s  | 30 MB  | 60 KB |
| 25 trains  | Greedy    | 0.3s  | 7 MB   | 38 KB |
| 30 trains  | Greedy    | 0.5s  | 7 MB   | 45 KB |
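| 40 trains  | Greedy    | 0.8s  | 7 MB   | 60 KB |

**Note**: The ML memory figure is the total SelfTrainService footprint during inference (200 MB base + 25 MB loaded models). The OR-Tools and Greedy figures are peak algorithm memory only and exclude the service base.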

### Training Performance

| Dataset Size | Training Time | Memory | Model Size |
|--------------|---------------|--------|------------|
| 100 schedules | 3 min       | 320 MB | 20 MB |
| 500 schedules | 8 min       | 350 MB | 24 MB |
| 1000 schedules | 15 min     | 400 MB | 28 MB |

---

**Document Version**: 1.0.0  
**Last Updated**: November 2, 2025  
**Maintained By**: ML-Service Team