muhalwan committed on
Commit
6a0a429
·
1 Parent(s): 1335617

Revised version

Files changed (11)
  1. .gitignore +3 -0
  2. README.md +103 -24
  3. app.py +194 -578
  4. backend.py +674 -0
  5. config.py +96 -56
  6. data_loader.py +9 -29
  7. data_processor.py +150 -111
  8. data_validator.py +0 -467
  9. evaluator.py +213 -16
  10. prophet_predictor.py +223 -40
  11. ui_components.py +322 -0
.gitignore CHANGED
@@ -12,3 +12,6 @@ WORKFLOW.md
12
  data/
13
  hf_cache/
14
  MODEL_WORKFLOW.md
 
 
 
 
12
  data/
13
  hf_cache/
14
  MODEL_WORKFLOW.md
15
+ data_validator.py
16
+ utils/
17
+ .gitignore
README.md CHANGED
@@ -9,42 +9,121 @@ app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- # SKS Course Enrollment Prediction System
13
 
14
- Predicts student enrollment for elective courses using Prophet time series forecasting with semester-specific analysis.
15
 
16
- ## 🎯 Features
17
 
18
- - **Semester-Specific Predictions**: Separate predictions for Semester 1 (Ganjil/Odd) and Semester 2 (Genap/Even)
19
- - **Time Series Forecasting**: Uses Facebook Prophet for accurate enrollment predictions
20
- - **Historical Backtesting**: Validates model accuracy with MAE and RMSE metrics
21
- - **Automated Recommendations**: Suggests which courses to open based on predicted demand
22
- - **Private Data**: Loads enrollment data from private Hugging Face dataset
23
 
24
- ## 🚀 How It Works
 
 
 
25
 
26
- 1. **Select Target Year and Semester**: Choose the academic period to predict
27
- 2. **Generate Predictions**: AI analyzes historical enrollment patterns
28
- 3. **View Recommendations**: See which courses should be opened and recommended quotas
29
- 4. **Review Metrics**: Check model performance (MAE, RMSE)
30
 
31
- ## 📈 Prediction Strategy
 
 
 
32
 
33
- The system uses multiple forecasting strategies:
34
- - **Prophet Logistic Growth**: For courses with sufficient historical data
35
- - **Trend-Based Fallback**: For courses with unrealistic Prophet predictions
36
- - **Mean Fallback**: For courses with limited history
37
- - **Cold Start**: For new courses without historical data
38
 
39
- ## 🛠️ Technical Stack
40
 
41
- - **Framework**: Gradio for UI
42
- - **ML Model**: Facebook Prophet
 
 
 
 
 
 
 
 
43
  - **Data Processing**: Pandas, NumPy
 
44
  - **Deployment**: Hugging Face Spaces
45
 
46
- ## 📊 Model Performance
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47
 
48
- The model is validated through backtesting on historical data:
 
 
 
 
49
  - Mean Absolute Error (MAE): ~31 students
50
  - Root Mean Squared Error (RMSE): ~49 students
9
  pinned: false
10
  ---
11
 
12
+ # SKS Course Enrollment & Class Capacity Prediction System
13
 
14
+ Predicts the **number of classes to open** by forecasting enrollment with Prophet time series forecasting, taking the maximum capacity per class into account.
15
 
16
+ ## How It Works
17
 
18
+ ### Single Semester Prediction
19
+ 1. **Select Year and Semester**: Choose the academic period to predict
20
+ 2. **Generate Predictions**: AI analyzes historical enrollment patterns
21
+ 3. **View Class Recommendations**: See how many classes to open for each course
22
+ 4. **Review Utilization**: Check the class capacity utilization rate
23
 
24
+ ### Multi-Year Forecasting
25
+ 1. **Set the Starting Period**: The year and semester where the projection begins
26
+ 2. **Choose the Forecast Horizon**: How many years ahead (1-5 years)
27
+ 3. **View Trends**: See how class demand evolves over time
28
 
29
+ ## Class Capacity Logic
 
 
 
30
 
31
+ Example scenario:
32
+ - **PDST Course**: 10-15 students (2023) → 40 students (2024) → projected to keep rising
33
+ - **Max Capacity**: 50 students per class
34
+ - **Recommendation**: 1 class (if ≤50), 2 classes (if 51-100), and so on.
35
 
36
+ ### Calculation Formula
37
+ ```
38
+ Number of Classes = ⌈Predicted Enrollment / Capacity per Class⌉
39
+ ```
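
The ceiling formula in the README can be sketched in a few lines of Python. This is a minimal illustration only: the function name `classes_needed` and the handling of non-positive predictions are assumptions, not code taken from the commit.

```python
import math


def classes_needed(predicted_enrollment: float, capacity_per_class: int = 50) -> int:
    """Return ceil(predicted_enrollment / capacity_per_class).

    Hypothetical helper: the default of 50 mirrors the README example,
    and returning 0 for non-positive predictions is an assumption.
    """
    if capacity_per_class <= 0:
        raise ValueError("capacity_per_class must be positive")
    if predicted_enrollment <= 0:
        return 0
    return math.ceil(predicted_enrollment / capacity_per_class)


if __name__ == "__main__":
    # Walk through the README scenario: up to 50 -> 1 class, 51-100 -> 2 classes.
    for students in (40, 50, 51, 100, 101):
        print(students, classes_needed(students))
```

Using `math.ceil` on the ratio keeps the boundary cases aligned with the README example, where exactly 50 students still fit in one class.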
 
40
 
41
+ ## Prediction Strategy
42
 
43
+ The system uses multiple forecasting strategies:
44
+ - **Prophet Logistic Growth**: For courses with sufficient historical data, using capacity as the upper bound (cap)
45
+ - **Trend-Based Fallback**: For courses with unrealistic Prophet predictions
46
+ - **Mean Fallback**: For courses with limited history
47
+ - **Cold Start**: For new courses without historical data
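
One way to picture how the four strategies listed above might be chosen is the dispatcher below. It is a hedged sketch: the function `choose_strategy`, the `min_points` threshold, and the `max_ratio` test for an "unrealistic" forecast are all hypothetical and are not taken from `prophet_predictor.py`.

```python
from typing import Optional, Sequence


def choose_strategy(
    history: Sequence[float],
    prophet_forecast: Optional[float],
    min_points: int = 4,      # assumed minimum history length for Prophet
    max_ratio: float = 3.0,   # assumed cutoff for an "unrealistic" forecast
) -> str:
    """Pick a forecasting strategy label for one course (illustrative only)."""
    if len(history) == 0:
        return "cold_start"          # new course, no history at all
    if len(history) < min_points or prophet_forecast is None:
        return "mean_fallback"       # too little data to trust a fitted model
    ceiling = max_ratio * max(max(history), 1)
    if prophet_forecast < 0 or prophet_forecast > ceiling:
        return "trend_fallback"      # Prophet output looks implausible
    return "prophet_logistic"
```

The ordering matters in this sketch: the cold-start and limited-history checks run before any sanity check on the Prophet output, so a course with no data never reaches the model.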
48
+
49
+ ## Technical Stack
50
+
51
+ - **Framework**: Gradio for UI
52
+ - **ML Model**: Facebook Prophet with logistic growth
53
  - **Data Processing**: Pandas, NumPy
54
+ - **Visualization**: Matplotlib, Seaborn
55
  - **Deployment**: Hugging Face Spaces
56
 
57
+ ## Configuration
58
+
59
+ ### Class Capacity Settings (config.py)
60
+ ```python
61
+ @dataclass
62
+ class ClassCapacityConfig:
63
+ DEFAULT_CLASS_CAPACITY: int = 50 # Max students per class
64
+ MIN_STUDENTS_TO_OPEN_CLASS: int = 10 # Minimum to open a class
65
+ CAPACITY_WARNING_THRESHOLD: float = 0.8 # 80% utilization warning
66
+ ENABLE_CAPACITY_CONSTRAINTS: bool = True
67
+ ```
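
To show how these settings could interact, here is a small sketch that turns a predicted enrollment into an open/close decision, a total quota, and a utilization figure. The dataclass fields are copied from the README; the `recommend` function and the exact rules it applies (close below the minimum, warn at or above the threshold) are assumptions about how the fields are meant to be used.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassCapacityConfig:
    DEFAULT_CLASS_CAPACITY: int = 50
    MIN_STUDENTS_TO_OPEN_CLASS: int = 10
    CAPACITY_WARNING_THRESHOLD: float = 0.8
    ENABLE_CAPACITY_CONSTRAINTS: bool = True


def recommend(predicted: float, cfg: Optional[ClassCapacityConfig] = None) -> dict:
    """Hypothetical helper: derive status, class count, quota and utilization."""
    cfg = cfg or ClassCapacityConfig()
    if predicted < cfg.MIN_STUDENTS_TO_OPEN_CLASS:
        # Assumed rule: too few students, so the course stays closed (TUTUP).
        return {"status": "TUTUP", "classes": 0, "total_quota": 0,
                "utilization": 0.0, "warning": False}
    classes = math.ceil(predicted / cfg.DEFAULT_CLASS_CAPACITY)
    total_quota = classes * cfg.DEFAULT_CLASS_CAPACITY
    utilization = predicted / total_quota
    return {
        "status": "BUKA",
        "classes": classes,
        "total_quota": total_quota,
        "utilization": utilization,
        # Assumed rule: flag classes that are at or above the warning threshold.
        "warning": utilization >= cfg.CAPACITY_WARNING_THRESHOLD,
    }
```

For example, under these assumptions a prediction of 45 students yields one class with a quota of 50 and a utilization of 0.9, which would trigger the 80% warning.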
68
+
69
+ ### Multi-Year Forecast Settings
70
+ ```python
71
+ @dataclass
72
+ class MultiYearForecastConfig:
73
+ FORECAST_YEARS_AHEAD: int = 3 # Years to forecast
74
+ MAX_YEARLY_GROWTH_RATE: float = 0.5 # 50% max growth/year
75
+ MIN_YEARLY_GROWTH_RATE: float = -0.3 # 30% max decline/year
76
+ ```
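
The two growth-rate bounds suggest that year-over-year changes are clamped before being compounded. The sketch below illustrates that idea only; `project_enrollment` is a hypothetical function, and whether the backend clamps raw rates exactly this way is an assumption.

```python
from typing import List, Sequence


def project_enrollment(
    base: float,
    raw_growth_rates: Sequence[float],
    max_rate: float = 0.5,    # mirrors MAX_YEARLY_GROWTH_RATE
    min_rate: float = -0.3,   # mirrors MIN_YEARLY_GROWTH_RATE
) -> List[float]:
    """Compound a base enrollment over several years with clamped growth."""
    projections = []
    current = base
    for rate in raw_growth_rates:
        clamped = max(min_rate, min(max_rate, rate))  # keep rate within bounds
        current = current * (1 + clamped)
        projections.append(current)
    return projections
```

With a base of 40 students and raw rates of +100%, +10%, and -90%, the clamped rates become +50%, +10%, and -30%, so extreme single-year swings cannot dominate the multi-year projection.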
77
 
78
+ ## Model Performance
79
+
80
+ The model is validated through backtesting on historical data:
81
+
82
+ ### Enrollment Prediction
83
  - Mean Absolute Error (MAE): ~31 students
84
  - Root Mean Squared Error (RMSE): ~49 students
85
+
86
+ ### Class Count Prediction
87
+ - Class MAE: ~0.5 classes
88
+ - Exact Class Match: ~70%
89
+ - Within ±1 Class: ~95%
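
The three class-count metrics above can be computed from paired actual and predicted class counts roughly as follows. This is a sketch under assumptions: the function name `class_count_metrics` and the returned key names are illustrative and may differ from what `evaluator.py` produces.

```python
from typing import Dict, Sequence


def class_count_metrics(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute class MAE, exact-match rate and within-one rate (illustrative)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")
    diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(diffs)
    return {
        "class_mae": sum(diffs) / n,                      # mean absolute error in classes
        "exact_match": sum(d == 0 for d in diffs) / n,    # share predicted exactly
        "within_one": sum(d <= 1 for d in diffs) / n,     # share off by at most one class
    }
```

Reporting the within-one rate alongside the exact-match rate is useful here because an error of a single class is often tolerable for scheduling, while larger misses are not.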
90
+
91
+ ## 🔧 Usage
92
+
93
+ ### Local Development
94
+ ```bash
95
+ # Install dependencies
96
+ pip install -r requirements.txt
97
+
98
+ # Run the app
99
+ python app.py
100
+ ```
101
+
102
+ ### Environment Variables
103
+ - `HF_TOKEN`: Hugging Face token for accessing the private dataset
104
+
105
+ ## 📝 Output Columns
106
+
107
+ ### Semester Prediction
108
+ | Column | Description |
109
+ |--------|-------------|
110
+ | Kode MK | Course code |
111
+ | Nama MK | Course name |
112
+ | Prediksi | Predicted number of students |
113
+ | Jumlah Kelas | Recommended number of classes to open |
114
+ | Kapasitas/Kelas | Maximum capacity per class |
115
+ | Total Kuota | Total capacity (Jumlah Kelas × Kapasitas/Kelas) |
116
+ | Utilization % | Capacity utilization percentage |
117
+ | Status | BUKA (open) / TUTUP (closed) |
118
+ | Confidence | high/medium/low |
119
+ | Strategy | Prediction method used |
120
+
121
+ ### Multi-Year Projection
122
+ | Column | Description |
123
+ |--------|-------------|
124
+ | Tahun | Prediction year |
125
+ | Kode MK | Course code |
126
+ | Nama MK | Course name |
127
+ | Prediksi | Predicted enrollment |
128
+ | Kelas | Number of classes needed |
129
+ | Kapasitas | Total available capacity |
app.py CHANGED
@@ -1,616 +1,232 @@
1
- # Version: 3.1 - Dark theme UI with white text
2
  import logging
3
- from typing import Optional, Tuple
4
 
5
  import gradio as gr
6
- import pandas as pd
7
 
8
- from config import Config
9
- from data_processor import DataProcessor
10
- from evaluator import Evaluator
11
- from prophet_predictor import ProphetPredictor
 
 
 
 
12
  from utils import setup_logging
13
 
14
  setup_logging("INFO")
15
  logger = logging.getLogger("GradioApp")
16
 
17
- _processor: Optional[DataProcessor] = None
18
- _predictor: Optional[ProphetPredictor] = None
19
- _config: Optional[Config] = None
20
- _df_enrollment: Optional[pd.DataFrame] = None
21
- _elective_codes: Optional[set] = None
22
- _backtest_metrics: Optional[dict] = None
23
-
24
-
25
- def initialize_system():
26
- """Initialize the prediction system (called once at startup)."""
27
- global \
28
- _processor, \
29
- _predictor, \
30
- _config, \
31
- _df_enrollment, \
32
- _elective_codes, \
33
- _backtest_metrics
34
-
35
- try:
36
- logger.info("Initializing prediction system...")
37
- _config = Config()
38
-
39
- _processor = DataProcessor(_config)
40
- _df_enrollment, _elective_codes = _processor.load_and_process()
41
-
42
- _predictor = ProphetPredictor(_config)
43
- _predictor.train_student_population_model(
44
- _processor.raw_data["students_yearly"]
45
- )
46
 
47
- logger.info("System initialized successfully")
48
- return True
49
- except Exception as e:
50
- logger.error(f"Failed to initialize system: {e}", exc_info=True)
51
- return False
52
-
53
-
54
- def generate_predictions(
55
- year: int, semester: int
56
- ) -> Tuple[str, Optional[pd.DataFrame], Optional[pd.DataFrame]]:
57
- """
58
- Generate enrollment predictions for a given year and semester.
59
-
60
- Args:
61
- year: Target year (e.g., 2025)
62
- semester: Target semester (1 = Ganjil, 2 = Genap)
63
-
64
- Returns:
65
- Tuple of (summary_text, all_predictions_df, comparison_df)
66
- """
67
- global \
68
- _processor, \
69
- _predictor, \
70
- _config, \
71
- _df_enrollment, \
72
- _elective_codes, \
73
- _backtest_metrics
74
-
75
- try:
76
- if semester not in [1, 2]:
77
- return (
78
- "Error: Semester harus 1 (Ganjil) atau 2 (Genap)",
79
- None,
80
- None,
81
- )
82
-
83
- if year < 2020 or year > 2030:
84
- return "Error: Year must be between 2020 and 2030", None, None
85
-
86
- if (
87
- _config is None
88
- or _predictor is None
89
- or _processor is None
90
- or _df_enrollment is None
91
- or _elective_codes is None
92
- ):
93
- return (
94
- "Error: System not initialized. Please restart the app.",
95
- None,
96
- None,
97
- )
98
-
99
- logger.info(f"Generating predictions for {year} Semester {semester}...")
100
-
101
- _config.prediction.PREDICT_YEAR = year
102
- _config.prediction.PREDICT_SEMESTER = semester
103
-
104
- # Check if actual data exists for this year/semester
105
- actual_data = _df_enrollment[
106
- (_df_enrollment["thn"] == year) & (_df_enrollment["smt"] == semester)
107
- ]
108
- has_actual_data = len(actual_data) > 0
109
-
110
- if has_actual_data:
111
- logger.info(
112
- f"Found actual enrollment data for {year} Semester {semester} - will compare predictions vs actual"
113
- )
114
- else:
115
- logger.info(
116
- f"No actual data for {year} Semester {semester} - generating future predictions"
117
- )
118
-
119
- if _backtest_metrics is None:
120
- logger.info("Running backtest for the first time...")
121
- evaluator = Evaluator(_config)
122
- backtest_results = evaluator.run_backtest(_df_enrollment, _predictor)
123
-
124
- if backtest_results is None or len(backtest_results) == 0:
125
- logger.warning("Backtest returned no results, using defaults")
126
- _backtest_metrics = {"mae": 0, "rmse": 0}
127
- else:
128
- metrics_result = evaluator.generate_metrics(backtest_results)
129
- if metrics_result is None:
130
- logger.warning("Metrics calculation failed, using defaults")
131
- _backtest_metrics = {"mae": 0, "rmse": 0}
132
- else:
133
- _backtest_metrics = metrics_result
134
- else:
135
- logger.info("Using cached backtest metrics")
136
-
137
- metrics = _backtest_metrics
138
-
139
- predictions = _predictor.generate_batch_predictions(
140
- _df_enrollment,
141
- _processor.raw_data["courses"],
142
- _elective_codes,
143
- year,
144
- semester,
145
- )
146
 
147
- semester_name = "1 (Ganjil)" if semester == 1 else "2 (Genap)"
148
- total_to_open = len(predictions[predictions["recommendation"] == "BUKA"])
149
- total_seats = (
150
- int(
151
- predictions[predictions["recommendation"] == "BUKA"][
152
- "recommended_quota"
153
- ].sum()
154
- )
155
- if total_to_open > 0
156
- else 0
157
- )
158
 
159
- # Build summary with actual vs prediction comparison if data exists
160
- if has_actual_data:
161
- # Merge predictions with actual data
162
- comparison = predictions.merge(
163
- actual_data[["kode_mk", "enrollment"]], on="kode_mk", how="left"
164
- )
165
- comparison = comparison.rename(columns={"enrollment": "actual_enrollment"})
166
-
167
- # Calculate comparison metrics only for courses with actual data
168
- courses_with_actual = comparison[
169
- comparison["actual_enrollment"].notna()
170
- ].copy()
171
-
172
- if len(courses_with_actual) > 0:
173
- comparison_mae = abs(
174
- courses_with_actual["predicted_enrollment"]
175
- - courses_with_actual["actual_enrollment"]
176
- ).mean()
177
- comparison_rmse = (
178
- (
179
- courses_with_actual["predicted_enrollment"]
180
- - courses_with_actual["actual_enrollment"]
181
- )
182
- ** 2
183
- ).mean() ** 0.5
184
- total_actual = courses_with_actual["actual_enrollment"].sum()
185
- total_predicted = courses_with_actual["predicted_enrollment"].sum()
186
- accuracy_pct = (
187
- 1 - abs(total_predicted - total_actual) / total_actual
188
- ) * 100
189
-
190
- diff_color = (
191
- "#4ade80" if total_predicted - total_actual >= 0 else "#f87171"
192
- )
193
 
194
- summary = f"""
195
- <div style="padding: 24px;">
196
- <div style="margin-bottom: 24px;">
197
- <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{year} Semester {semester_name}</h2>
198
- <p style="color: #9ca3af; margin: 0; font-size: 14px;">Validasi prediksi terhadap data aktual</p>
199
- </div>
200
-
201
- <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; margin-bottom: 24px;">
202
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #4ade80;">
203
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">Akurasi</div>
204
- <div style="font-size: 28px; font-weight: 700; color: #4ade80;">{accuracy_pct:.1f}%</div>
205
- </div>
206
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #60a5fa;">
207
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MAE</div>
208
- <div style="font-size: 28px; font-weight: 700; color: #60a5fa;">{comparison_mae:.2f}</div>
209
- </div>
210
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #a78bfa;">
211
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">RMSE</div>
212
- <div style="font-size: 28px; font-weight: 700; color: #a78bfa;">{comparison_rmse:.2f}</div>
213
- </div>
214
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #fb923c;">
215
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MK Divalidasi</div>
216
- <div style="font-size: 28px; font-weight: 700; color: #fb923c;">{len(courses_with_actual)}</div>
217
- </div>
218
- </div>
219
-
220
- <div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 16px;">
221
- <div style="background: #1e293b; padding: 20px; border-radius: 12px;">
222
- <h4 style="margin: 0 0 16px 0; color: #fff; font-size: 14px; font-weight: 600;">Ringkasan Enrollment</h4>
223
- <div style="display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid #334155;">
224
- <span style="color: #9ca3af;">Total Aktual</span>
225
- <span style="font-weight: 600; color: #fff;">{int(total_actual)}</span>
226
- </div>
227
- <div style="display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid #334155;">
228
- <span style="color: #9ca3af;">Total Prediksi</span>
229
- <span style="font-weight: 600; color: #fff;">{int(total_predicted)}</span>
230
- </div>
231
- <div style="display: flex; justify-content: space-between; padding: 12px 0;">
232
- <span style="color: #9ca3af;">Selisih</span>
233
- <span style="font-weight: 600; color: {diff_color};">{int(total_predicted - total_actual):+d}</span>
234
- </div>
235
- </div>
236
-
237
- <div style="background: #1e293b; padding: 20px; border-radius: 12px;">
238
- <h4 style="margin: 0 0 16px 0; color: #fff; font-size: 14px; font-weight: 600;">Rekomendasi</h4>
239
- <div style="display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid #334155;">
240
- <span style="color: #9ca3af;">MK Dibuka</span>
241
- <span style="font-weight: 600; color: #fff;">{total_to_open}</span>
242
- </div>
243
- <div style="display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid #334155;">
244
- <span style="color: #9ca3af;">Total Kuota</span>
245
- <span style="font-weight: 600; color: #fff;">{total_seats}</span>
246
- </div>
247
- <div style="display: flex; justify-content: space-between; padding: 12px 0;">
248
- <span style="color: #9ca3af;">Backtest MAE</span>
249
- <span style="font-weight: 600; color: #fff;">{metrics["mae"]:.2f}</span>
250
- </div>
251
- </div>
252
- </div>
253
- </div>
254
- """
255
- else:
256
- summary = f"""
257
- <div style="padding: 24px;">
258
- <div style="margin-bottom: 24px;">
259
- <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{year} Semester {semester_name}</h2>
260
- <p style="color: #9ca3af; margin: 0; font-size: 14px;">Data semester ada, tetapi tidak ditemukan MK pilihan yang cocok</p>
261
- </div>
262
-
263
- <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px;">
264
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #60a5fa;">
265
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MAE (Backtest)</div>
266
- <div style="font-size: 28px; font-weight: 700; color: #60a5fa;">{metrics["mae"]:.2f}</div>
267
- </div>
268
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #a78bfa;">
269
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">RMSE (Backtest)</div>
270
- <div style="font-size: 28px; font-weight: 700; color: #a78bfa;">{metrics["rmse"]:.2f}</div>
271
- </div>
272
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #4ade80;">
273
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MK Dibuka</div>
274
- <div style="font-size: 28px; font-weight: 700; color: #4ade80;">{total_to_open}</div>
275
- </div>
276
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #fb923c;">
277
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">Total Kuota</div>
278
- <div style="font-size: 28px; font-weight: 700; color: #fb923c;">{total_seats}</div>
279
- </div>
280
- </div>
281
- </div>
282
- """
283
- else:
284
- summary = f"""
285
- <div style="padding: 24px;">
286
- <div style="margin-bottom: 24px;">
287
- <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{year} Semester {semester_name}</h2>
288
- <p style="color: #9ca3af; margin: 0; font-size: 14px;">Prediksi masa depan berdasarkan tren historis</p>
289
- </div>
290
-
291
- <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; margin-bottom: 24px;">
292
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #60a5fa;">
293
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MAE (Backtest)</div>
294
- <div style="font-size: 28px; font-weight: 700; color: #60a5fa;">{metrics["mae"]:.2f}</div>
295
- </div>
296
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #a78bfa;">
297
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">RMSE (Backtest)</div>
298
- <div style="font-size: 28px; font-weight: 700; color: #a78bfa;">{metrics["rmse"]:.2f}</div>
299
- </div>
300
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #4ade80;">
301
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">MK Dibuka</div>
302
- <div style="font-size: 28px; font-weight: 700; color: #4ade80;">{total_to_open}</div>
303
- </div>
304
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid #fb923c;">
305
- <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">Total Kuota</div>
306
- <div style="font-size: 28px; font-weight: 700; color: #fb923c;">{total_seats}</div>
307
- </div>
308
- </div>
309
-
310
- <div style="background: #1e293b; padding: 20px; border-radius: 12px; max-width: 300px;">
311
- <h4 style="margin: 0 0 16px 0; color: #fff; font-size: 14px; font-weight: 600;">Estimasi Total</h4>
312
- <div style="display: flex; justify-content: space-between; padding: 12px 0;">
313
- <span style="color: #9ca3af;">Total Mahasiswa</span>
314
- <span style="font-weight: 600; color: #fff;">{int(predictions["predicted_enrollment"].sum())}</span>
315
- </div>
316
- </div>
317
- </div>
318
- """
319
-
320
- # Prepare all predictions display
321
- all_predictions_display = predictions[
322
- [
323
- "kode_mk",
324
- "nama_mk",
325
- "predicted_enrollment",
326
- "recommended_quota",
327
- "recommendation",
328
- "confidence",
329
- "strategy",
330
- ]
331
- ].copy()
332
- all_predictions_display.columns = [
333
- "Kode MK",
334
- "Nama MK",
335
- "Prediksi",
336
- "Kuota",
337
- "Status",
338
- "Confidence",
339
- "Strategy",
340
- ]
341
- all_predictions_display["Prediksi"] = all_predictions_display["Prediksi"].round(
342
- 1
343
- )
344
- all_predictions_display["Kuota"] = all_predictions_display["Kuota"].astype(int)
345
 
346
- # Map status to plain text
347
- all_predictions_display["Status"] = all_predictions_display["Status"].map(
348
- {"BUKA": "BUKA", "TUTUP": "TUTUP"}
349
- )
 
 
 
350
 
351
- all_predictions_display = all_predictions_display.sort_values(
352
- "Prediksi", ascending=False
353
  )
354
 
355
- # Prepare comparison table if actual data exists
356
- comparison_display = None
357
- if has_actual_data:
358
- logger.info(
359
- f"Building comparison table - Actual data has {len(actual_data)} courses"
360
- )
361
- logger.info(f"Predictions has {len(predictions)} courses")
362
-
363
- comparison = predictions.merge(
364
- actual_data[["kode_mk", "enrollment"]], on="kode_mk", how="left"
365
- )
366
- comparison = comparison.rename(columns={"enrollment": "actual_enrollment"})
367
-
368
- # Filter to courses with actual data and calculate error
369
- courses_with_actual = comparison[
370
- comparison["actual_enrollment"].notna()
371
- ].copy()
372
-
373
- logger.info(
374
- f"Courses with matching actual data: {len(courses_with_actual)}"
375
- )
376
- if len(courses_with_actual) > 0:
377
- logger.info(
378
- f"Matching courses: {courses_with_actual['kode_mk'].tolist()}"
379
- )
380
 
381
- if len(courses_with_actual) > 0:
382
- courses_with_actual["error"] = (
383
- courses_with_actual["predicted_enrollment"]
384
- - courses_with_actual["actual_enrollment"]
385
- )
386
- courses_with_actual["abs_error"] = abs(courses_with_actual["error"])
387
- courses_with_actual["accuracy_%"] = 100 * (
388
- 1
389
- - courses_with_actual["abs_error"]
390
- / courses_with_actual["actual_enrollment"].replace(0, 1)
391
- )
392
 
393
- comparison_display = courses_with_actual[
394
- [
395
- "kode_mk",
396
- "nama_mk",
397
- "actual_enrollment",
398
- "predicted_enrollment",
399
- "error",
400
- "abs_error",
401
- "accuracy_%",
402
- "strategy",
403
- ]
404
- ].copy()
405
-
406
- comparison_display.columns = [
407
- "Kode MK",
408
- "Nama MK",
409
- "Aktual",
410
- "Prediksi",
411
- "Error",
412
- "Abs Error",
413
- "Akurasi %",
414
- "Strategy",
415
- ]
416
-
417
- comparison_display["Aktual"] = comparison_display["Aktual"].astype(int)
418
- comparison_display["Prediksi"] = comparison_display["Prediksi"].round(1)
419
- comparison_display["Error"] = comparison_display["Error"].round(1)
420
- comparison_display["Abs Error"] = comparison_display["Abs Error"].round(
421
- 1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
422
  )
423
- comparison_display["Akurasi %"] = comparison_display["Akurasi %"].round(
424
- 1
 
 
425
  )
426
 
427
- comparison_display = comparison_display.sort_values(
428
- "Abs Error", ascending=False
429
- )
 
 
 
 
 
 
 
 
430
 
431
- logger.info(
432
- f"Comparison table created with {len(comparison_display)} courses"
 
 
 
 
433
  )
434
- else:
435
- logger.warning(
436
- "Actual data exists but no matching courses found for comparison"
437
  )
438
- logger.warning(f"Predicted courses: {predictions['kode_mk'].tolist()}")
439
- logger.warning(f"Actual courses: {actual_data['kode_mk'].tolist()}")
440
 
441
- logger.info(
442
- f"Predictions generated successfully (comparison_display: {comparison_display is not None})"
 
 
 
 
 
 
 
 
443
  )
444
- return summary, all_predictions_display, comparison_display
445
 
446
- except Exception as e:
447
- error_msg = f"Error generating predictions: {str(e)}"
448
- logger.error(error_msg, exc_info=True)
449
- return error_msg, None, None
 
 
 
450
 
451
 
452
- def get_data_info() -> str:
453
- """Get information about the loaded dataset."""
454
- global _processor, _config
455
-
456
- try:
457
- if _processor is None or _config is None:
458
- return "System not initialized"
459
-
460
- courses = _processor.raw_data.get("courses")
461
- students = _processor.raw_data.get("students_yearly")
462
-
463
- if courses is None or students is None:
464
- return "Data not loaded"
465
-
466
- # Get elective courses
467
- elective_courses = courses[courses["kategori_mk"] == "P"]
468
-
469
- info = f"""
470
- <div style="padding: 8px 0;">
471
- <div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 12px;">
472
- <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
473
- <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">Total MK</div>
474
- <div style="font-size: 20px; font-weight: 700; color: #fff;">{len(courses)}</div>
475
- </div>
476
- <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
477
- <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">MK Pilihan</div>
478
- <div style="font-size: 20px; font-weight: 700; color: #4ade80;">{len(elective_courses)}</div>
479
- </div>
480
- <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
481
- <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">MK Wajib</div>
482
- <div style="font-size: 20px; font-weight: 700; color: #fff;">{len(courses) - len(elective_courses)}</div>
483
- </div>
484
- <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
485
- <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">Tahun Data</div>
486
- <div style="font-size: 20px; font-weight: 700; color: #60a5fa;">{students["thn"].min()}-{students["thn"].max()}</div>
487
- </div>
488
- </div>
489
- </div>
490
- """
491
- return info
492
-
493
- except Exception as e:
494
- return f"Error getting data info: {str(e)}"
495
-
496
-
497
- # Initialize system at startup
498
  logger.info("Starting Gradio app...")
499
- init_success = initialize_system()
 
500
 
501
  if not init_success:
502
  logger.error("Failed to initialize system. App may not work correctly.")
503
 
504
- # Create Gradio Interface
505
- with gr.Blocks(title="SKS Enrollment Predictor") as demo:
506
- # Header
507
- gr.Markdown("# Course Enrollment Predictor")
508
-
509
- with gr.Row():
510
- # Left panel - Controls
511
- with gr.Column(scale=1, min_width=300):
512
- year_input = gr.Number(
513
- label="Tahun",
514
- value=2025,
515
- precision=0,
516
- minimum=2020,
517
- maximum=2030,
518
- )
519
-
520
- semester_input = gr.Radio(
521
- choices=[("1 (Ganjil)", 1), ("2 (Genap)", 2)],
522
- label="Semester",
523
- value=2,
524
- )
525
-
526
- predict_btn = gr.Button(
527
- "Generate Predictions",
528
- variant="primary",
529
- size="lg",
530
- )
531
-
532
- gr.Markdown("---")
533
-
534
- # Data info section
535
- with gr.Accordion("Dataset Info", open=False):
536
- data_info_output = gr.HTML()
537
- demo.load(fn=get_data_info, inputs=[], outputs=data_info_output)
538
-
539
- # Right panel - Results
540
- with gr.Column(scale=3):
541
- summary_output = gr.HTML(
542
- value="""
543
- <div style="padding: 60px 40px; text-align: center; background: #1e293b; border-radius: 12px;">
544
- <h3 style="color: #fff; margin: 0 0 8px 0; font-size: 18px; font-weight: 600;">Pilih tahun dan semester</h3>
545
- <p style="color: #9ca3af; margin: 0; font-size: 14px;">Klik Generate Predictions untuk melihat hasil</p>
546
- </div>
547
- """
548
- )
549
-
550
- gr.Markdown("---")
551
-
552
- # Predictions table
553
- gr.Markdown("### Daftar Prediksi Mata Kuliah")
554
- all_predictions_output = gr.Dataframe(
555
- label="",
556
- wrap=True,
557
- interactive=False,
558
- )
559
-
560
- # Comparison section
561
- with gr.Accordion("Detail Validasi", open=False) as comparison_accordion:
562
- comparison_info = gr.Markdown(
563
- value="Data validasi muncul ketika data aktual tersedia",
564
- )
565
- comparison_output = gr.Dataframe(
566
- label="",
567
- wrap=True,
568
- interactive=False,
569
- )
570
-
571
- def update_ui_with_predictions(year, semester):
572
- """Wrapper to handle UI updates based on whether comparison data exists."""
573
- summary, all_predictions, comparison = generate_predictions(year, semester)
574
-
575
- logger.info(
576
- f"UI Update: comparison is None: {comparison is None}, empty: {comparison.empty if comparison is not None else 'N/A'}"
577
- )
578
-
579
- if comparison is not None and not comparison.empty:
580
- logger.info(f"Showing comparison table with {len(comparison)} rows")
581
- return (
582
- summary,
583
- all_predictions,
584
- gr.update(open=True),
585
- gr.update(
586
- value=f"Validasi terhadap {len(comparison)} mata kuliah",
587
- ),
588
- gr.update(value=comparison),
589
- )
590
- else:
591
- logger.info("Hiding comparison table - no data available")
592
- return (
593
- summary,
594
- all_predictions,
595
- gr.update(open=False),
596
- gr.update(
597
- value="Tidak ada data validasi untuk prediksi masa depan",
598
- ),
599
- gr.update(value=None),
600
- )
601
-
602
- predict_btn.click(
603
- fn=update_ui_with_predictions,
604
- inputs=[year_input, semester_input],
605
- outputs=[
606
- summary_output,
607
- all_predictions_output,
608
- comparison_accordion,
609
- comparison_info,
610
- comparison_output,
611
- ],
612
- )
613
-
614
 
615
  # Launch the app
616
  if __name__ == "__main__":
 
 
1
  import logging
 
2
 
3
  import gradio as gr
 
4
 
5
+ from backend import get_backend
6
+ from ui_components import (
7
+ build_data_info,
8
+ build_multi_year_summary,
9
+ build_prediction_summary,
10
+ get_forecast_placeholder,
11
+ get_prediction_placeholder,
12
+ )
13
  from utils import setup_logging
14
 
15
  setup_logging("INFO")
16
  logger = logging.getLogger("GradioApp")
17
 
18
 
19
+ # Backend Interface
20
+ def get_data_info() -> str:
21
+ backend = get_backend()
22
+ data = backend.get_data_info()
23
+ return build_data_info(data)
24
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
+ def generate_predictions(year: int, semester: int):
27
+ backend = get_backend()
28
+ result = backend.generate_predictions(year, semester)
29
 
30
+ if result.error:
31
+ return f"Error: {result.error}", None, None
32
 
33
+ summary_html = build_prediction_summary(result.summary_data)
34
+ return summary_html, result.predictions_df, result.comparison_df
35
+
36
+
37
+ def generate_multi_year_forecast(year: int, semester: int, years_ahead: int = 3):
38
+ backend = get_backend()
39
+ result = backend.generate_multi_year_forecast(year, semester, years_ahead)
40
 
41
+ if result.error:
42
+ return f"Error: {result.error}", None
43
+
44
+ summary_html = build_multi_year_summary(result.summary_data)
45
+ return summary_html, result.forecast_df
46
+
47
+
48
+ def update_ui_with_predictions(year: int, semester: int):
49
+ summary, all_predictions, comparison = generate_predictions(year, semester)
50
+
51
+ logger.info(
52
+ f"UI Update: comparison is None: {comparison is None}, "
53
+ f"empty: {comparison.empty if comparison is not None else 'N/A'}"
54
+ )
55
+
56
+ if comparison is not None and not comparison.empty:
57
+ logger.info(f"Showing comparison table with {len(comparison)} rows")
58
+ return (
59
+ summary,
60
+ all_predictions,
61
+ gr.update(open=True),
62
+ gr.update(
63
+ value=f"Validasi terhadap {len(comparison)} mata kuliah - "
64
+ "termasuk perbandingan jumlah kelas aktual vs prediksi"
65
+ ),
66
+ gr.update(value=comparison),
67
+ )
68
+ else:
69
+ logger.info("Hiding comparison table - no data available")
70
+ return (
71
+ summary,
72
+ all_predictions,
73
+ gr.update(open=False),
74
+ gr.update(value="Tidak ada data validasi untuk prediksi masa depan"),
75
+ gr.update(value=None),
76
  )
77

78
 
79
+ # Gradio UI
80
+ def create_gradio_app() -> gr.Blocks:
81
+ """Create and configure the Gradio application."""
82
 
83
+ with gr.Blocks(title="SKS Enrollment Predictor") as demo:
84
+ # Header
85
+ gr.Markdown("# Course Enrollment & Class Capacity Predictor")
86
+ gr.Markdown(
87
+ "Sistem prediksi **jumlah kelas yang perlu dibuka** berdasarkan "
88
+ "forecasting enrollment dengan mempertimbangkan kapasitas maksimum per kelas."
89
+ )
90
+
91
+ with gr.Tabs():
92
+ # Single Year
93
+ with gr.TabItem("Prediksi Semester"):
94
+ with gr.Row():
95
+ with gr.Column(scale=1, min_width=300):
96
+ year_input = gr.Number(
97
+ label="Tahun",
98
+ value=2025,
99
+ precision=0,
100
+ minimum=2020,
101
+ maximum=2030,
102
+ )
103
+
104
+ semester_input = gr.Radio(
105
+ choices=[("1 (Ganjil)", 1), ("2 (Genap)", 2)],
106
+ label="Semester",
107
+ value=2,
108
+ )
109
+
110
+ predict_btn = gr.Button(
111
+ "Generate Predictions",
112
+ variant="primary",
113
+ size="lg",
114
+ )
115
+
116
+ gr.Markdown("---")
117
+
118
+ with gr.Accordion("Dataset Info", open=False):
119
+ data_info_output = gr.HTML()
120
+ demo.load(
121
+ fn=get_data_info, inputs=[], outputs=data_info_output
122
+ )
123
+
124
+ with gr.Column(scale=3):
125
+ summary_output = gr.HTML(value=get_prediction_placeholder())
126
+
127
+ gr.Markdown("---")
128
+
129
+ gr.Markdown("### Rekomendasi Jumlah Kelas per Mata Kuliah")
130
+ gr.Markdown(
131
+ "*Jumlah kelas dihitung berdasarkan prediksi enrollment ÷ "
132
+ "kapasitas per kelas*"
133
  )
134
+ all_predictions_output = gr.Dataframe(
135
+ label="",
136
+ wrap=True,
137
+ interactive=False,
138
  )
139
 
140
+ with gr.Accordion(
141
+ "Detail Validasi", open=False
142
+ ) as comparison_accordion:
143
+ comparison_info = gr.Markdown(
144
+ value="Data validasi muncul ketika data aktual tersedia"
145
+ )
146
+ comparison_output = gr.Dataframe(
147
+ label="",
148
+ wrap=True,
149
+ interactive=False,
150
+ )
151
 
152
+ # Multi-Year Forecast
153
+ with gr.TabItem("Proyeksi Multi-Tahun"):
154
+ gr.Markdown("### Forecasting Kebutuhan Kelas Beberapa Tahun ke Depan")
155
+ gr.Markdown(
156
+ "Memprediksi tren jumlah mahasiswa dan kebutuhan kelas "
157
+ "untuk perencanaan jangka panjang."
158
  )
159
+
160
+ with gr.Row():
161
+ with gr.Column(scale=1):
162
+ forecast_year = gr.Number(
163
+ label="Tahun Mulai",
164
+ value=2025,
165
+ precision=0,
166
+ minimum=2020,
167
+ maximum=2030,
168
+ )
169
+
170
+ forecast_semester = gr.Radio(
171
+ choices=[("1 (Ganjil)", 1), ("2 (Genap)", 2)],
172
+ label="Semester",
173
+ value=2,
174
+ )
175
+
176
+ forecast_years = gr.Slider(
177
+ label="Tahun ke Depan",
178
+ minimum=1,
179
+ maximum=5,
180
+ value=3,
181
+ step=1,
182
+ )
183
+
184
+ forecast_btn = gr.Button(
185
+ "Generate Forecast",
186
+ variant="primary",
187
+ size="lg",
188
+ )
189
+
190
+ with gr.Column(scale=3):
191
+ forecast_summary = gr.HTML(value=get_forecast_placeholder())
192
+
193
+ gr.Markdown("---")
194
+ gr.Markdown("### Detail Proyeksi per Mata Kuliah per Tahun")
195
+ forecast_table = gr.Dataframe(
196
+ label="",
197
+ wrap=True,
198
+ interactive=False,
199
  )
 
 
200
 
201
+ predict_btn.click(
202
+ fn=update_ui_with_predictions,
203
+ inputs=[year_input, semester_input],
204
+ outputs=[
205
+ summary_output,
206
+ all_predictions_output,
207
+ comparison_accordion,
208
+ comparison_info,
209
+ comparison_output,
210
+ ],
211
  )
 
212
 
213
+ forecast_btn.click(
214
+ fn=generate_multi_year_forecast,
215
+ inputs=[forecast_year, forecast_semester, forecast_years],
216
+ outputs=[forecast_summary, forecast_table],
217
+ )
218
+
219
+ return demo
220
 
221

222
  logger.info("Starting Gradio app...")
223
+ backend = get_backend()
224
+ init_success = backend.initialize()
225
 
226
  if not init_success:
227
  logger.error("Failed to initialize system. App may not work correctly.")
228
 
229
+ demo = create_gradio_app()
230
 
231
  # Launch the app
232
  if __name__ == "__main__":
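Review note: the UI text above states that the class count is derived as predicted enrollment ÷ capacity per class. A minimal sketch of that rule follows, assuming plain ceiling division. The helper name `classes_needed` and the capacity of 40 are hypothetical; the project's own `Config.calculate_classes_needed` takes additional arguments and its body is not shown in this part of the diff, so it may behave differently.

```python
import math


def classes_needed(predicted_enrollment: float, class_capacity: int) -> int:
    """Hypothetical helper: classes required to seat the predicted enrollment.

    Assumes simple ceiling division; the real project logic may add
    per-course capacities or minimum thresholds.
    """
    if class_capacity <= 0:
        raise ValueError("class_capacity must be positive")
    if predicted_enrollment <= 0:
        return 0
    # Round up so that a partially filled class still counts as one class.
    return math.ceil(predicted_enrollment / class_capacity)


if __name__ == "__main__":
    # Illustrative values only: 85 predicted students, 40 seats per class.
    print(classes_needed(85, 40))
```

Rounding up rather than to the nearest integer means the total quota is never below the predicted enrollment, at the cost of lower utilization for courses that sit just above a capacity boundary.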
backend.py ADDED
@@ -0,0 +1,674 @@
1
+ import logging
2
+ from dataclasses import dataclass
3
+ from typing import Dict, Optional, Tuple
4
+
5
+ import pandas as pd
6
+
7
+ from config import Config
8
+ from data_processor import DataProcessor
9
+ from evaluator import Evaluator
10
+ from prophet_predictor import ProphetPredictor
11
+ from utils import setup_logging
12
+
13
+ setup_logging("INFO")
14
+ logger = logging.getLogger("Backend")
15
+
16
+
17
+ @dataclass
18
+ class PredictionResult:
19
+ summary_data: Dict
20
+ predictions_df: pd.DataFrame
21
+ comparison_df: Optional[pd.DataFrame]
22
+ has_actual_data: bool
23
+ error: Optional[str] = None
24
+
25
+
26
+ @dataclass
27
+ class ForecastResult:
28
+ summary_data: Dict
29
+ forecast_df: pd.DataFrame
30
+ yearly_summary: pd.DataFrame
31
+ error: Optional[str] = None
32
+
33
+
34
+ class PredictionBackend:
35
+ def __init__(self):
36
+ self._processor: Optional[DataProcessor] = None
37
+ self._predictor: Optional[ProphetPredictor] = None
38
+ self._config: Optional[Config] = None
39
+ self._df_enrollment: Optional[pd.DataFrame] = None
40
+ self._elective_codes: Optional[set] = None
41
+ self._backtest_metrics: Optional[dict] = None
42
+ self._initialized: bool = False
43
+
44
+ @property
45
+ def is_initialized(self) -> bool:
46
+ return self._initialized
47
+
48
+ @property
49
+ def config(self) -> Optional[Config]:
50
+ return self._config
51
+
52
+ def initialize(self) -> bool:
53
+ try:
54
+ logger.info("Initializing prediction system...")
55
+ self._config = Config()
56
+
57
+ self._processor = DataProcessor(self._config)
58
+ self._df_enrollment, self._elective_codes = (
59
+ self._processor.load_and_process()
60
+ )
61
+
62
+ self._predictor = ProphetPredictor(self._config)
63
+ self._predictor.train_student_population_model(
64
+ self._processor.raw_data["students_yearly"]
65
+ )
66
+
67
+ self._initialized = True
68
+ logger.info("System initialized successfully")
69
+ return True
70
+
71
+ except Exception as e:
72
+ logger.error(f"Failed to initialize system: {e}", exc_info=True)
73
+ self._initialized = False
74
+ return False
75
+
76
+ def get_data_info(self) -> Dict:
77
+ if not self._initialized or self._processor is None or self._config is None:
78
+ return {"error": "System not initialized"}
79
+
80
+ try:
81
+ courses = self._processor.raw_data.get("courses")
82
+ students = self._processor.raw_data.get("students_yearly")
83
+
84
+ if courses is None or students is None:
85
+ return {"error": "Data not loaded"}
86
+
87
+ elective_courses = courses[courses["kategori_mk"] == "P"]
88
+
89
+ return {
90
+ "total_courses": len(courses),
91
+ "elective_courses": len(elective_courses),
92
+ "class_capacity": self._config.class_capacity.DEFAULT_CLASS_CAPACITY,
93
+ "year_min": int(students["thn"].min()),
94
+ "year_max": int(students["thn"].max()),
95
+ }
96
+
97
+ except Exception as e:
98
+ return {"error": str(e)}
99
+
100
+ def _run_backtest_if_needed(self) -> Dict:
101
+ if self._backtest_metrics is not None:
102
+ return self._backtest_metrics
103
+
104
+ if (
105
+ self._config is None
106
+ or self._df_enrollment is None
107
+ or self._predictor is None
108
+ ):
109
+ logger.warning("System not initialized, using default metrics")
110
+ self._backtest_metrics = {"mae": 0, "rmse": 0}
111
+ return self._backtest_metrics
112
+
113
+ logger.info("Running backtest for the first time...")
114
+ evaluator = Evaluator(self._config)
115
+ backtest_results = evaluator.run_backtest(self._df_enrollment, self._predictor)
116
+
117
+ if backtest_results is None or len(backtest_results) == 0:
118
+ logger.warning("Backtest returned no results, using defaults")
119
+ self._backtest_metrics = {"mae": 0, "rmse": 0}
120
+ else:
121
+ metrics_result = evaluator.generate_metrics(backtest_results)
122
+ if metrics_result is None:
123
+ logger.warning("Metrics calculation failed, using defaults")
124
+ self._backtest_metrics = {"mae": 0, "rmse": 0}
125
+ else:
126
+ self._backtest_metrics = metrics_result
127
+
128
+ return self._backtest_metrics
129
+
130
+ def _get_actual_data(self, year: int, semester: int) -> Tuple[pd.DataFrame, bool]:
131
+ if self._df_enrollment is None:
132
+ return pd.DataFrame(), False
133
+
134
+ actual_data = self._df_enrollment[
135
+ (self._df_enrollment["thn"] == year)
136
+ & (self._df_enrollment["smt"] == semester)
137
+ ]
138
+ return actual_data, len(actual_data) > 0
139
+
140
+ def _calculate_class_metrics(
141
+ self,
142
+ courses_with_actual: pd.DataFrame,
143
+ year: int,
144
+ semester: int,
145
+ ) -> Dict:
146
+ if self._processor is None or self._config is None:
147
+ return {
148
+ "class_matches": 0,
149
+ "class_within_one": 0,
150
+ "total_for_class_accuracy": 0,
151
+ "class_accuracy_pct": 0,
152
+ "class_within_one_pct": 0,
153
+ "has_actual_class_data": False,
154
+ "data_source": "kalkulasi",
155
+ }
156
+
157
+ actual_classes_df = self._processor.get_class_count_for_validation(
158
+ year, semester
159
+ )
160
+
161
+ has_actual_class_data = False
162
+ courses_with_class_data: Optional[pd.DataFrame] = None
163
+
164
+ if len(actual_classes_df) > 0:
165
+ courses_with_actual = courses_with_actual.merge(
166
+ actual_classes_df, on="kode_mk", how="left"
167
+ )
168
+ has_actual_class_data = courses_with_actual["actual_classes"].notna().any()
169
+
170
+ if has_actual_class_data:
171
+ courses_with_class_data = courses_with_actual[
172
+ courses_with_actual["actual_classes"].notna()
173
+ ].copy()
174
+ courses_with_class_data["actual_classes"] = courses_with_class_data[
175
+ "actual_classes"
176
+ ].astype(int)
177
+
178
+ class_matches = (
179
+ courses_with_class_data["classes_needed"]
180
+ == courses_with_class_data["actual_classes"]
181
+ ).sum()
182
+ total_for_class_accuracy = len(courses_with_class_data)
183
+
184
+ else:
185
+ config = self._config
186
+ courses_with_actual["actual_classes_calc"] = courses_with_actual.apply(
187
+ lambda row: config.calculate_classes_needed(
188
+ row["actual_enrollment"],
189
+ row["kode_mk"],
190
+ has_historical_data=True,
191
+ ),
192
+ axis=1,
193
+ )
194
+ class_matches = (
195
+ courses_with_actual["classes_needed"]
196
+ == courses_with_actual["actual_classes_calc"]
197
+ ).sum()
198
+ total_for_class_accuracy = len(courses_with_actual)
199
+
200
+ class_accuracy_pct = (
201
+ (class_matches / total_for_class_accuracy) * 100
202
+ if total_for_class_accuracy > 0
203
+ else 0
204
+ )
205
+
206
+ if has_actual_class_data and courses_with_class_data is not None:
207
+ class_within_one = (
208
+ abs(
209
+ courses_with_class_data["classes_needed"]
210
+ - courses_with_class_data["actual_classes"]
211
+ )
212
+ <= 1
213
+ ).sum()
214
+ else:
215
+ class_within_one = (
216
+ abs(
217
+ courses_with_actual["classes_needed"]
218
+ - courses_with_actual["actual_classes_calc"]
219
+ )
220
+ <= 1
221
+ ).sum()
222
+
223
+ class_within_one_pct = (
224
+ (class_within_one / total_for_class_accuracy) * 100
225
+ if total_for_class_accuracy > 0
226
+ else 0
227
+ )
228
+
229
+ return {
230
+ "class_matches": int(class_matches),
231
+ "class_within_one": int(class_within_one),
232
+ "total_for_class_accuracy": total_for_class_accuracy,
233
+ "class_accuracy_pct": class_accuracy_pct,
234
+ "class_within_one_pct": class_within_one_pct,
235
+ "has_actual_class_data": has_actual_class_data,
236
+ "data_source": "tabel2" if has_actual_class_data else "kalkulasi",
237
+ }
238
+
239
+ def _prepare_comparison_table(
240
+ self,
241
+ predictions: pd.DataFrame,
242
+ actual_data: pd.DataFrame,
243
+ year: int,
244
+ semester: int,
245
+ ) -> Optional[pd.DataFrame]:
246
+ if self._processor is None or self._config is None:
247
+ return None
248
+
249
+ comparison = predictions.merge(
250
+ actual_data[["kode_mk", "enrollment"]], on="kode_mk", how="left"
251
+ )
252
+ comparison = comparison.rename(columns={"enrollment": "actual_enrollment"})
253
+
254
+ actual_classes_df = self._processor.get_class_count_for_validation(
255
+ year, semester
256
+ )
257
+ if len(actual_classes_df) > 0:
258
+ comparison = comparison.merge(actual_classes_df, on="kode_mk", how="left")
259
+ else:
260
+ comparison["actual_classes"] = None
261
+
262
+ courses_with_actual = comparison[comparison["actual_enrollment"].notna()].copy()
263
+
264
+ if len(courses_with_actual) == 0:
265
+ return None
266
+
267
+ courses_with_actual["error"] = (
268
+ courses_with_actual["predicted_enrollment"]
269
+ - courses_with_actual["actual_enrollment"]
270
+ )
271
+ courses_with_actual["abs_error"] = abs(courses_with_actual["error"])
272
+ courses_with_actual["accuracy_%"] = 100 * (
273
+ 1
274
+ - courses_with_actual["abs_error"]
275
+ / courses_with_actual["actual_enrollment"].replace(0, 1)
276
+ )
277
+
278
+ if (
279
+ "actual_classes" not in courses_with_actual.columns
280
+ or courses_with_actual["actual_classes"].isna().all()
281
+ ):
282
+ config_ref = self._config
283
+ courses_with_actual["actual_classes"] = courses_with_actual.apply(
284
+ lambda row: config_ref.calculate_classes_needed(
285
+ row["actual_enrollment"],
286
+ row["kode_mk"],
287
+ has_historical_data=True,
288
+ ),
289
+ axis=1,
290
+ )
291
+ else:
292
+ config_ref = self._config
293
+ courses_with_actual["actual_classes"] = courses_with_actual.apply(
294
+ lambda row: (
295
+ int(row["actual_classes"])
296
+ if pd.notna(row["actual_classes"])
297
+ else config_ref.calculate_classes_needed(
298
+ row["actual_enrollment"],
299
+ row["kode_mk"],
300
+ has_historical_data=True,
301
+ )
302
+ ),
303
+ axis=1,
304
+ )
305
+
306
+ courses_with_actual["class_diff"] = (
307
+ courses_with_actual["classes_needed"]
308
+ - courses_with_actual["actual_classes"]
309
+ )
310
+
311
+ comparison_display = courses_with_actual[
312
+ [
313
+ "kode_mk",
314
+ "nama_mk",
315
+ "actual_enrollment",
316
+ "predicted_enrollment",
317
+ "actual_classes",
318
+ "classes_needed",
319
+ "class_diff",
320
+ "error",
321
+ "accuracy_%",
322
+ "strategy",
323
+ ]
324
+ ].copy()
325
+
326
+ comparison_display.columns = [
327
+ "Kode MK",
328
+ "Nama MK",
329
+ "Aktual",
330
+ "Prediksi",
331
+ "Kelas Aktual",
332
+ "Kelas Prediksi",
333
+ "Selisih Kelas",
334
+ "Error",
335
+ "Akurasi %",
336
+ "Strategy",
337
+ ]
338
+
339
+ comparison_display["Aktual"] = comparison_display["Aktual"].astype(int)
340
+ comparison_display["Prediksi"] = comparison_display["Prediksi"].round(1)
341
+ comparison_display["Error"] = comparison_display["Error"].round(1)
342
+ comparison_display["Akurasi %"] = comparison_display["Akurasi %"].round(1)
343
+ comparison_display["Kelas Aktual"] = comparison_display["Kelas Aktual"].astype(
344
+ int
345
+ )
346
+ comparison_display["Kelas Prediksi"] = comparison_display[
347
+ "Kelas Prediksi"
348
+ ].astype(int)
349
+ comparison_display["Selisih Kelas"] = comparison_display[
350
+ "Selisih Kelas"
351
+ ].astype(int)
352
+
353
+ return comparison_display.sort_values("Aktual", ascending=False)
354
+
355
+ def _prepare_predictions_display(self, predictions: pd.DataFrame) -> pd.DataFrame:
356
+ """Prepare predictions dataframe for display."""
357
+ display_df = predictions[
358
+ [
359
+ "kode_mk",
360
+ "nama_mk",
361
+ "predicted_enrollment",
362
+ "classes_needed",
363
+ "class_capacity",
364
+ "total_quota",
365
+ "utilization_pct",
366
+ "recommendation",
367
+ "confidence",
368
+ "strategy",
369
+ ]
370
+ ].copy()
371
+
372
+ display_df.columns = [
373
+ "Kode MK",
374
+ "Nama MK",
375
+ "Prediksi",
376
+ "Jumlah Kelas",
377
+ "Kapasitas/Kelas",
378
+ "Total Kuota",
379
+ "Utilization %",
380
+ "Status",
381
+ "Confidence",
382
+ "Strategy",
383
+ ]
384
+
385
+ display_df["Prediksi"] = display_df["Prediksi"].round(1)
386
+ display_df["Jumlah Kelas"] = display_df["Jumlah Kelas"].astype(int)
387
+ display_df["Total Kuota"] = display_df["Total Kuota"].astype(int)
388
+
389
+ display_df["Status"] = display_df["Status"].map(
390
+ {"BUKA": "BUKA", "TUTUP": "TUTUP"}
391
+ )
392
+
393
+ display_df = display_df[display_df["Confidence"] == "high"]
394
+ display_df = display_df[display_df["Status"] == "BUKA"]
395
+
396
+ display_df = display_df.sort_values("Prediksi", ascending=False)
397
+ display_df = display_df.drop(columns=["Confidence", "Status"])
398
+
399
+ return display_df
400
+
401
+ def generate_predictions(self, year: int, semester: int) -> PredictionResult:
402
+ if semester not in [1, 2]:
403
+ return PredictionResult(
404
+ summary_data={},
405
+ predictions_df=pd.DataFrame(),
406
+ comparison_df=None,
407
+ has_actual_data=False,
408
+ error="Semester harus 1 (Ganjil) atau 2 (Genap)",
409
+ )
410
+
411
+ if year < 2020 or year > 2030:
412
+ return PredictionResult(
413
+ summary_data={},
414
+ predictions_df=pd.DataFrame(),
415
+ comparison_df=None,
416
+ has_actual_data=False,
417
+ error="Year must be between 2020 and 2030",
418
+ )
419
+
420
+ if not self._initialized:
421
+ return PredictionResult(
422
+ summary_data={},
423
+ predictions_df=pd.DataFrame(),
424
+ comparison_df=None,
425
+ has_actual_data=False,
426
+ error="System not initialized. Please restart the app.",
427
+ )
428
+
429
+ try:
430
+ logger.info(f"Generating predictions for {year} Semester {semester}...")
431
+
432
+ assert self._config is not None
433
+ assert self._predictor is not None
434
+ assert self._processor is not None
435
+ assert self._df_enrollment is not None
436
+ assert self._elective_codes is not None
437
+
438
+ self._config.prediction.PREDICT_YEAR = year
439
+ self._config.prediction.PREDICT_SEMESTER = semester
440
+
441
+ actual_data, has_actual_data = self._get_actual_data(year, semester)
442
+
443
+ if has_actual_data:
444
+ logger.info(
445
+ f"Found actual enrollment data for {year} Semester {semester}"
446
+ )
447
+ else:
448
+ logger.info(f"No actual data for {year} Semester {semester}")
449
+
450
+ metrics = self._run_backtest_if_needed()
451
+
452
+ predictions = self._predictor.generate_batch_predictions(
453
+ self._df_enrollment,
454
+ self._processor.raw_data["courses"],
455
+ self._elective_codes,
456
+ year,
457
+ semester,
458
+ )
459
+
460
+ open_courses = predictions[predictions["recommendation"] == "BUKA"]
461
+ total_to_open = len(open_courses)
462
+ total_classes = int(open_courses["classes_needed"].sum())
463
+ total_predicted_students = int(open_courses["predicted_enrollment"].sum())
464
+ total_capacity = int(open_courses["total_quota"].sum())
465
+ class_capacity = self._config.class_capacity.DEFAULT_CLASS_CAPACITY
466
+
467
+ summary_data = {
468
+ "year": year,
469
+ "semester": semester,
470
+ "semester_name": "1 (Ganjil)" if semester == 1 else "2 (Genap)",
471
+ "total_to_open": total_to_open,
472
+ "total_classes": total_classes,
473
+ "total_predicted_students": total_predicted_students,
474
+ "total_capacity": total_capacity,
475
+ "class_capacity": class_capacity,
476
+ "metrics": metrics,
477
+ "has_actual_data": has_actual_data,
478
+ }
479
+
480
+ comparison_df = None
481
+ if has_actual_data:
482
+ comparison = predictions.merge(
483
+ actual_data[["kode_mk", "enrollment"]], on="kode_mk", how="left"
484
+ )
485
+ comparison = comparison.rename(
486
+ columns={"enrollment": "actual_enrollment"}
487
+ )
488
+
489
+ courses_with_actual = comparison[
490
+ comparison["actual_enrollment"].notna()
491
+ ].copy()
492
+
493
+ if len(courses_with_actual) > 0:
494
+ comparison_mae = abs(
495
+ courses_with_actual["predicted_enrollment"]
496
+ - courses_with_actual["actual_enrollment"]
497
+ ).mean()
498
+ comparison_rmse = (
499
+ (
500
+ courses_with_actual["predicted_enrollment"]
501
+ - courses_with_actual["actual_enrollment"]
502
+ )
503
+ ** 2
504
+ ).mean() ** 0.5
505
+
506
+ total_actual = courses_with_actual["actual_enrollment"].sum()
507
+ total_predicted = courses_with_actual["predicted_enrollment"].sum()
508
+ accuracy_pct = (
509
+ 1 - abs(total_predicted - total_actual) / total_actual
510
+ ) * 100
511
+
512
+ class_metrics = self._calculate_class_metrics(
513
+ courses_with_actual.copy(), year, semester
514
+ )
515
+
516
+ summary_data.update(
517
+ {
518
+ "comparison_mae": comparison_mae,
519
+ "comparison_rmse": comparison_rmse,
520
+ "total_actual": total_actual,
521
+ "total_predicted": total_predicted,
522
+ "accuracy_pct": accuracy_pct,
523
+ **class_metrics,
524
+ }
525
+ )
526
+
527
+ comparison_df = self._prepare_comparison_table(
528
+ predictions, actual_data, year, semester
529
+ )
530
+
531
+ predictions_display = self._prepare_predictions_display(predictions)
532
+
533
+ return PredictionResult(
534
+ summary_data=summary_data,
535
+ predictions_df=predictions_display,
536
+ comparison_df=comparison_df,
537
+ has_actual_data=has_actual_data,
538
+ )
539
+
540
+ except Exception as e:
541
+ logger.error(f"Error generating predictions: {e}", exc_info=True)
542
+ return PredictionResult(
543
+ summary_data={},
544
+ predictions_df=pd.DataFrame(),
545
+ comparison_df=None,
546
+ has_actual_data=False,
547
+ error=str(e),
548
+ )
549
+
550
+ def generate_multi_year_forecast(
551
+ self, year: int, semester: int, years_ahead: int = 3
552
+ ) -> ForecastResult:
553
+ if not self._initialized:
554
+ return ForecastResult(
555
+ summary_data={},
556
+ forecast_df=pd.DataFrame(),
557
+ yearly_summary=pd.DataFrame(),
558
+ error="System not initialized.",
559
+ )
560
+
561
+ try:
562
+ logger.info(f"Generating {years_ahead}-year forecast from {year}...")
563
+
564
+ assert self._config is not None
565
+ assert self._predictor is not None
566
+ assert self._processor is not None
567
+ assert self._df_enrollment is not None
568
+ assert self._elective_codes is not None
569
+
570
+ forecast_df = self._predictor.generate_multi_year_forecast(
571
+ self._df_enrollment,
572
+ self._processor.raw_data["courses"],
573
+ self._elective_codes,
574
+ year,
575
+ semester,
576
+ years_ahead,
577
+ )
578
+
579
+ if forecast_df.empty:
580
+ return ForecastResult(
581
+ summary_data={},
582
+ forecast_df=pd.DataFrame(),
583
+ yearly_summary=pd.DataFrame(),
584
+ error="Tidak ada data untuk forecast.",
585
+ )
586
+
587
+ yearly_summary = (
588
+ forecast_df.groupby("year")
589
+ .agg(
590
+ {
591
+ "predicted_enrollment": "sum",
592
+ "classes_needed": "sum",
593
+ "total_capacity": "sum",
594
+ "kode_mk": "count",
595
+ }
596
+ )
597
+ .reset_index()
598
+ )
599
+ yearly_summary.columns = [
600
+ "Tahun",
601
+ "Total Prediksi",
602
+ "Total Kelas",
603
+ "Total Kapasitas",
604
+ "Jumlah MK",
605
+ ]
606
+
607
+ class_capacity = self._config.class_capacity.DEFAULT_CLASS_CAPACITY
608
+ semester_name = "Ganjil" if semester == 1 else "Genap"
609
+
610
+ first_year = yearly_summary.iloc[0]
611
+ last_year = yearly_summary.iloc[-1]
612
+ growth_classes = int(last_year["Total Kelas"] - first_year["Total Kelas"])
613
+ growth_students = int(
614
+ last_year["Total Prediksi"] - first_year["Total Prediksi"]
615
+ )
616
+
617
+ summary_data = {
618
+ "year": year,
619
+ "semester": semester,
620
+ "semester_name": semester_name,
621
+ "years_ahead": years_ahead,
622
+ "class_capacity": class_capacity,
623
+ "first_year_classes": int(first_year["Total Kelas"]),
624
+ "last_year_classes": int(last_year["Total Kelas"]),
625
+ "growth_classes": growth_classes,
626
+ "growth_students": growth_students,
627
+ }
628
+
629
+ display_df = forecast_df[
630
+ [
631
+ "year",
632
+ "kode_mk",
633
+ "nama_mk",
634
+ "predicted_enrollment",
635
+ "classes_needed",
636
+ "total_capacity",
637
+ ]
638
+ ].copy()
639
+ display_df.columns = [
640
+ "Tahun",
641
+ "Kode MK",
642
+ "Nama MK",
643
+ "Prediksi",
644
+ "Kelas",
645
+ "Kapasitas",
646
+ ]
647
+ display_df["Prediksi"] = display_df["Prediksi"].round(0).astype(int)
648
+ display_df = display_df.sort_values(["Kode MK", "Tahun"])
649
+
650
+ return ForecastResult(
651
+ summary_data=summary_data,
652
+ forecast_df=display_df,
653
+ yearly_summary=yearly_summary,
654
+ )
655
+
656
+ except Exception as e:
657
+ logger.error(f"Error generating forecast: {e}", exc_info=True)
658
+ return ForecastResult(
659
+ summary_data={},
660
+ forecast_df=pd.DataFrame(),
661
+ yearly_summary=pd.DataFrame(),
662
+ error=str(e),
663
+ )
664
+
665
+
666
+ _backend_instance: Optional[PredictionBackend] = None
667
+
668
+
669
+ def get_backend() -> PredictionBackend:
670
+ """Get the singleton backend instance."""
671
+ global _backend_instance
672
+ if _backend_instance is None:
673
+ _backend_instance = PredictionBackend()
674
+ return _backend_instance
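Review note: `generate_predictions` in the file above validates predictions against actual enrollment with three figures: MAE, RMSE, and an aggregate accuracy based on total predicted versus total actual enrollment. The sketch below restates those formulas without pandas so they can be checked by hand. The function name `validation_metrics` is hypothetical, and the guard for a zero actual total is an addition of this sketch, not something the diff does.

```python
import math
from typing import Dict, Optional, Sequence


def validation_metrics(
    predicted: Sequence[float], actual: Sequence[float]
) -> Dict[str, Optional[float]]:
    """Hypothetical restatement of the per-semester validation metrics."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("predicted and actual must be non-empty and equal length")

    errors = [p - a for p, a in zip(predicted, actual)]
    # Mean absolute error across courses.
    mae = sum(abs(e) for e in errors) / len(errors)
    # Root mean squared error across courses.
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

    total_actual = sum(actual)
    total_predicted = sum(predicted)
    # Aggregate accuracy compares totals, so over- and under-predictions
    # on individual courses can cancel out.
    accuracy_pct = (
        (1 - abs(total_predicted - total_actual) / total_actual) * 100
        if total_actual != 0
        else None  # assumption: undefined when no actual enrollment exists
    )
    return {"mae": mae, "rmse": rmse, "accuracy_pct": accuracy_pct}


if __name__ == "__main__":
    # Illustrative values only.
    print(validation_metrics([30, 50], [40, 50]))
```

Because the aggregate accuracy works on totals, it can look high even when individual courses are far off, which is why reading it together with MAE and RMSE gives a fairer picture.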
config.py CHANGED
@@ -1,29 +1,21 @@
 
1
  from dataclasses import dataclass, field
2
  from typing import Dict, List
3
- import os
4
 
5
- # Import data loader for private HF dataset support
6
  try:
7
  from data_loader import load_data_file
 
8
  DATA_LOADER_AVAILABLE = True
9
  except ImportError:
10
  DATA_LOADER_AVAILABLE = False
 
11
  def load_data_file() -> str:
12
- """Fallback if data_loader not available."""
13
  return "data/optimized_data.xlsx"
14
 
15
 
16
  def _get_data_file_path() -> str:
17
- """
18
- Get data file path based on environment.
19
-
20
- Priority:
21
- 1. If HF_TOKEN set: Load from private HF dataset (muhalwan/optimized_data_mhs)
22
- 2. If DEMO_MODE=true: Use demo_data.xlsx (anonymized)
23
- 3. Otherwise: Use local optimized_data.xlsx
24
- """
25
  if os.getenv("HF_TOKEN"):
26
- return load_data_file() # Loads from HF dataset if HF_TOKEN is set
27
  elif os.getenv("DEMO_MODE", "false").lower() == "true":
28
  return "data/demo_data.xlsx"
29
  else:
@@ -32,12 +24,8 @@ def _get_data_file_path() -> str:
32
 
33
  @dataclass
34
  class DataConfig:
35
- """Data source configuration and validation rules."""
36
-
37
- # Data file path - automatically determined based on environment
38
  FILE_PATH: str = field(default_factory=_get_data_file_path)
39
 
40
- # Sheet mappings
41
  SHEET_COURSES: str = "tabel1_data_matkul"
42
  SHEET_OFFERINGS: str = "tabel2_data_matkul_dibuka"
43
  SHEET_STUDENTS_YEARLY: str = "tabel3_data_mahasiswa_per_tahun"
@@ -48,20 +36,55 @@ class DataConfig:
48
  default_factory=lambda: {"tahun": "thn", "semester": "smt"}
49
  )
50
 
51
- # Elective Course Identification
52
- # IMPORTANT: Elective courses are identified by kategori_mk = 'P' in tabel1
53
- # Mandatory/Required courses have kategori_mk = 'W'
54
  ELECTIVE_CATEGORY: str = "P"
55
  MANDATORY_CATEGORY: str = "W"
56
 
57
- # Valid category values (will be normalized to uppercase)
58
  VALID_CATEGORIES: List[str] = field(default_factory=lambda: ["P", "W"])
59
 
60
 
61
  @dataclass
62
- class ModelConfig:
63
- """Prophet model hyperparameters and prediction strategies."""
64

65
  # Prophet Hyperparameters
66
  GROWTH_MODE: str = "logistic"
67
  CHANGEPOINT_SCALE: float = 0.01
@@ -75,6 +98,11 @@ class ModelConfig:
75
  # Minimum historical data points required for reliable prediction
76
  MIN_HISTORY_POINTS: int = 3
77

78
 
79
  @dataclass
80
  class PredictionConfig:
@@ -83,7 +111,6 @@ class PredictionConfig:
83
  PREDICT_YEAR: int = 2025
84
  PREDICT_SEMESTER: int = 2
85
 
86
- # Buffer Calculations
87
  BUFFER_PERCENT: float = 0.20
88
  MIN_QUOTA_OPEN: int = 25
89
  MIN_PREDICT_THRESHOLD: int = 15
@@ -101,8 +128,6 @@ class PredictionConfig:
101
 
102
  @dataclass
103
  class OutputConfig:
104
- """Output settings."""
105
-
106
  OUTPUT_DIR: str = "output"
107
  LOG_LEVEL: str = "INFO"
108
  TOP_N_DISPLAY: int = 30
@@ -110,8 +135,6 @@ class OutputConfig:
110
 
111
  @dataclass
112
  class BacktestConfig:
113
- """Backtest settings and validation."""
114
-
115
  START_YEAR: int = 2010
116
  END_YEAR: int = 2024
117
  VERBOSE: bool = True
@@ -123,48 +146,65 @@ class BacktestConfig:
123
 
124
 
125
  class Config:
126
- """
127
- Master Config Object.
128
-
129
- ELECTIVE COURSE DEFINITION:
130
- ---------------------------
131
- Elective courses are identified by kategori_mk = 'P' in tabel1_data_matkul.
132
- This is the ONLY source of truth for course categories.
133
-
134
- Examples of elective courses (kategori_mk = 'P'):
135
- - EF234607: Keamanan Aplikasi
136
- - EF234613: Game Edukasi dan Simulasi
137
- - UG234922: Kebudayaan dan Kebangsaan
138
- - IW184301: Sistem Basis Data
139
- - KI series: Various computer science electives
140
-
141
- Mandatory courses have kategori_mk = 'W' (Wajib).
142
-
143
- DATA REQUIREMENTS FOR BACKTESTING:
144
- -----------------------------------
145
- To backtest a semester, you need:
146
- 1. Course catalog (tabel1) with kategori_mk properly set
147
- 2. ACTUAL student enrollments (tabel4) for that semester
148
- 3. At least one elective course with enrollments
149
-
150
- Note: Course offerings (tabel2) alone are NOT sufficient for backtesting.
151
- You must have actual enrollment data (tabel4) to validate predictions.
152
- """
153
-
154
  def __init__(self):
155
  self.data: DataConfig = DataConfig()
156
  self.model: ModelConfig = ModelConfig()
157
  self.prediction: PredictionConfig = PredictionConfig()
158
  self.output: OutputConfig = OutputConfig()
159
  self.backtest: BacktestConfig = BacktestConfig()
 
 
160
 
161
  def get_prediction_target_name(self) -> str:
162
  sem = "Ganjil" if self.prediction.PREDICT_SEMESTER == 1 else "Genap"
163
  return f"{self.prediction.PREDICT_YEAR} Semester {sem}"
164
 
165
  def get_elective_filter_description(self) -> str:
166
- """Get human-readable description of elective identification."""
167
  return f"kategori_mk = '{self.data.ELECTIVE_CATEGORY}' in {self.data.SHEET_COURSES}"
168
 
169
 
170
  default_config = Config()
 
1
+ import os
2
  from dataclasses import dataclass, field
3
  from typing import Dict, List
 
4
 
 
5
  try:
6
  from data_loader import load_data_file
7
+
8
  DATA_LOADER_AVAILABLE = True
9
  except ImportError:
10
  DATA_LOADER_AVAILABLE = False
11
+
12
  def load_data_file() -> str:
 
13
  return "data/optimized_data.xlsx"
14
 
15
 
16
  def _get_data_file_path() -> str:
 
 
 
 
 
 
 
 
17
  if os.getenv("HF_TOKEN"):
18
+ return load_data_file()
19
  elif os.getenv("DEMO_MODE", "false").lower() == "true":
20
  return "data/demo_data.xlsx"
21
  else:
 
24
 
25
  @dataclass
26
  class DataConfig:
 
 
 
27
  FILE_PATH: str = field(default_factory=_get_data_file_path)
28
 
 
29
  SHEET_COURSES: str = "tabel1_data_matkul"
30
  SHEET_OFFERINGS: str = "tabel2_data_matkul_dibuka"
31
  SHEET_STUDENTS_YEARLY: str = "tabel3_data_mahasiswa_per_tahun"
 
36
  default_factory=lambda: {"tahun": "thn", "semester": "smt"}
37
  )
38
 
 
 
 
39
  ELECTIVE_CATEGORY: str = "P"
40
  MANDATORY_CATEGORY: str = "W"
41
 
 
42
  VALID_CATEGORIES: List[str] = field(default_factory=lambda: ["P", "W"])
43
 
44
 
45
  @dataclass
46
+ class ClassCapacityConfig:
47
+ # Default maximum students per class
48
+ DEFAULT_CLASS_CAPACITY: int = 50
49
+
50
+ # Minimum students required to open a class
51
+ MIN_STUDENTS_TO_OPEN_CLASS: int = 1
52
+
53
+ # Threshold for opening additional classes
54
+ ADDITIONAL_CLASS_THRESHOLD: float = 0.7
55
+
56
+ # Always open at least 1 class if there's any historical enrollment
57
+ OPEN_CLASS_IF_HAS_HISTORY: bool = True
58
+
59
+ # Course-specific capacity overrides (kode_mk -> max_capacity)
60
+ COURSE_CAPACITY_OVERRIDES: Dict[str, int] = field(default_factory=dict)
61
+
62
+ # Warning threshold - if predicted > capacity * threshold, warn about capacity
63
+ CAPACITY_WARNING_THRESHOLD: float = 0.8
64
+
65
+ # Enable capacity-aware prediction
66
+ # When True, predictions will be bounded by realistic capacity constraints
67
+ ENABLE_CAPACITY_CONSTRAINTS: bool = True
68
+
69
+
70
+ @dataclass
71
+ class MultiYearForecastConfig:
72
+ # How many years ahead to forecast
73
+ FORECAST_YEARS_AHEAD: int = 3
74
+
75
+ # Include trend analysis in output
76
+ SHOW_TREND_ANALYSIS: bool = True
77
+
78
+ # Confidence interval for forecasts (0-1)
79
+ CONFIDENCE_INTERVAL: float = 0.95
80
+
81
+ # Growth rate limits for sanity checking
82
+ MAX_YEARLY_GROWTH_RATE: float = 0.5 # 50% max growth per year
83
+ MIN_YEARLY_GROWTH_RATE: float = -0.3 # 30% max decline per year
84
 
85
+
86
+ @dataclass
87
+ class ModelConfig:
88
  # Prophet Hyperparameters
89
  GROWTH_MODE: str = "logistic"
90
  CHANGEPOINT_SCALE: float = 0.01
 
98
  # Minimum historical data points required for reliable prediction
99
  MIN_HISTORY_POINTS: int = 3
100
 
101
+ # Use student population as regressor
102
+ USE_POPULATION_REGRESSOR: bool = True
103
+ # Use capacity as upper bound (cap in logistic growth)
104
+ USE_CAPACITY_AS_CAP: bool = True
105
+
106
 
107
  @dataclass
108
  class PredictionConfig:
 
111
  PREDICT_YEAR: int = 2025
112
  PREDICT_SEMESTER: int = 2
113
 
 
114
  BUFFER_PERCENT: float = 0.20
115
  MIN_QUOTA_OPEN: int = 25
116
  MIN_PREDICT_THRESHOLD: int = 15
 
128
 
129
  @dataclass
130
  class OutputConfig:
 
 
131
  OUTPUT_DIR: str = "output"
132
  LOG_LEVEL: str = "INFO"
133
  TOP_N_DISPLAY: int = 30
 
135
 
136
  @dataclass
137
  class BacktestConfig:
 
 
138
  START_YEAR: int = 2010
139
  END_YEAR: int = 2024
140
  VERBOSE: bool = True
 
146
 
147
 
148
  class Config:
149
  def __init__(self):
150
  self.data: DataConfig = DataConfig()
151
  self.model: ModelConfig = ModelConfig()
152
  self.prediction: PredictionConfig = PredictionConfig()
153
  self.output: OutputConfig = OutputConfig()
154
  self.backtest: BacktestConfig = BacktestConfig()
155
+ self.class_capacity: ClassCapacityConfig = ClassCapacityConfig()
156
+ self.multi_year: MultiYearForecastConfig = MultiYearForecastConfig()
157
 
158
  def get_prediction_target_name(self) -> str:
159
  sem = "Ganjil" if self.prediction.PREDICT_SEMESTER == 1 else "Genap"
160
  return f"{self.prediction.PREDICT_YEAR} Semester {sem}"
161
 
162
  def get_elective_filter_description(self) -> str:
 
163
  return f"kategori_mk = '{self.data.ELECTIVE_CATEGORY}' in {self.data.SHEET_COURSES}"
164
 
165
+ def get_class_capacity(self, course_code: str) -> int:
166
+ if course_code in self.class_capacity.COURSE_CAPACITY_OVERRIDES:
167
+ return self.class_capacity.COURSE_CAPACITY_OVERRIDES[course_code]
168
+ return self.class_capacity.DEFAULT_CLASS_CAPACITY
169
+
170
+ def calculate_classes_needed(
171
+ self,
172
+ predicted_enrollment: float,
173
+ course_code: str,
174
+ has_historical_data: bool = True,
175
+ ) -> int:
176
+ import math
177
+
178
+ capacity = self.get_class_capacity(course_code)
179
+
180
+ if predicted_enrollment <= 0:
181
+ return 0
182
+
183
+ if predicted_enrollment < 1 and has_historical_data:
184
+ return 1
185
+
186
+ classes = math.ceil(predicted_enrollment / capacity)
187
+
188
+ return max(1, classes)
189
+
190
+ def get_capacity_status(self, predicted_enrollment: float, course_code: str) -> str:
191
+ capacity = self.get_class_capacity(course_code)
192
+ classes_needed = self.calculate_classes_needed(
193
+ predicted_enrollment, course_code
194
+ )
195
+
196
+ if classes_needed == 0:
197
+ return "UNDER"
198
+
199
+ total_capacity = classes_needed * capacity
200
+ utilization = predicted_enrollment / total_capacity
201
+
202
+ if utilization >= 1.0:
203
+ return "OVER"
204
+ elif utilization >= self.class_capacity.CAPACITY_WARNING_THRESHOLD:
205
+ return "WARNING"
206
+ else:
207
+ return "NORMAL"
208
+
209
 
210
  default_config = Config()
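
The capacity helpers added to `Config` in this file reduce to a ceiling division followed by a utilization check. The following is a minimal standalone sketch of that arithmetic, assuming a flat capacity of 50 and a warning threshold of 0.8 (the defaults shown in `ClassCapacityConfig`). The function names `classes_needed` and `capacity_status` are hypothetical and are not part of the repository; per-course overrides are left out for brevity.

```python
import math

DEFAULT_CLASS_CAPACITY = 50       # mirrors the ClassCapacityConfig default
CAPACITY_WARNING_THRESHOLD = 0.8  # mirrors the ClassCapacityConfig default


def classes_needed(predicted: float, capacity: int = DEFAULT_CLASS_CAPACITY) -> int:
    """Hypothetical helper: classes required for a predicted enrollment."""
    if predicted <= 0:
        return 0
    # Round up so that every predicted student has a seat.
    return max(1, math.ceil(predicted / capacity))


def capacity_status(predicted: float, capacity: int = DEFAULT_CLASS_CAPACITY) -> str:
    """Hypothetical helper: classify how full the opened classes would be."""
    n = classes_needed(predicted, capacity)
    if n == 0:
        return "UNDER"
    utilization = predicted / (n * capacity)
    if utilization >= 1.0:
        return "OVER"
    if utilization >= CAPACITY_WARNING_THRESHOLD:
        return "WARNING"
    return "NORMAL"


if __name__ == "__main__":
    for students in (0, 30, 45, 50, 51, 130):
        print(students, classes_needed(students), capacity_status(students))
```

One consequence of rounding the class count up is worth noting: in this sketch utilization can never exceed 1.0, so "OVER" is only reported when the prediction is an exact multiple of the capacity. Whether that matches the intended behaviour of `get_capacity_status` is a question for the author.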
data_loader.py CHANGED
@@ -12,8 +12,7 @@ def load_data_file() -> str:
12
  try:
13
  from huggingface_hub import hf_hub_download
14
 
15
- logger.info("🔐 Loading data from private Hugging Face dataset...")
16
- logger.info(" Dataset: muhalwan/optimized_data_mhs")
17
 
18
  file_path = hf_hub_download(
19
  repo_id="muhalwan/optimized_data_mhs",
@@ -23,34 +22,22 @@ def load_data_file() -> str:
23
  cache_dir="./hf_cache",
24
  )
25
 
26
- logger.info("Data loaded successfully from HF dataset")
27
- logger.info(f" Cached at: {file_path}")
28
  return file_path
29
 
30
- except ImportError:
31
- logger.error(
32
- "huggingface_hub not installed. Install with: pip install huggingface_hub"
33
- )
34
- raise
35
-
36
  except Exception as e:
37
  logger.error(f"Failed to download from HF dataset: {e}")
38
- logger.error("Falling back to local file if available...")
39
 
40
  local_path = "data/optimized_data.xlsx"
41
 
42
  if Path(local_path).exists():
43
- logger.info(f"📁 Loading data from local file: {local_path}")
44
  return local_path
45
 
46
- error_msg = (
47
- "No data file found!\n"
48
- "Options:\n"
49
- "1. Set HF_TOKEN environment variable to load from private dataset\n"
50
- "2. Place optimized_data.xlsx in data/ folder for local development\n"
51
  )
52
- logger.error(error_msg)
53
- raise FileNotFoundError(error_msg)
54
 
55
 
56
  def get_data_source_info() -> dict:
@@ -69,21 +56,14 @@ def get_data_source_info() -> dict:
69
 
70
  if __name__ == "__main__":
71
  logging.basicConfig(level=logging.INFO)
72
-
73
- print("=" * 80)
74
- print("Data Source Information")
75
- print("=" * 80)
76
 
77
  info = get_data_source_info()
78
  for key, value in info.items():
79
  print(f" {key}: {value}")
80
 
81
- print("\n" + "=" * 80)
82
- print("Attempting to load data...")
83
- print("=" * 80)
84
-
85
  try:
86
  file_path = load_data_file()
87
- print(f"\n✓ Success! Data file: {file_path}")
88
  except Exception as e:
89
- print(f"\n✗ Failed: {e}")
 
12
  try:
13
  from huggingface_hub import hf_hub_download
14
 
15
+ logger.info("Dataset: muhalwan/optimized_data_mhs")
 
16
 
17
  file_path = hf_hub_download(
18
  repo_id="muhalwan/optimized_data_mhs",
 
22
  cache_dir="./hf_cache",
23
  )
24
 
25
+ logger.info("Data loaded successfully from HF dataset")
 
26
  return file_path
27
 
 
 
 
 
 
 
28
  except Exception as e:
29
  logger.error(f"Failed to download from HF dataset: {e}")
 
30
 
31
  local_path = "data/optimized_data.xlsx"
32
 
33
  if Path(local_path).exists():
34
+ logger.info(f"Loading data from local file: {local_path}")
35
  return local_path
36
 
37
+ raise FileNotFoundError(
38
+ "No data source available. Either set HF_TOKEN environment variable "
39
+ "or place data file at 'data/optimized_data.xlsx'"
 
 
40
  )
 
 
41
 
42
 
43
  def get_data_source_info() -> dict:
 
56
 
57
  if __name__ == "__main__":
58
  logging.basicConfig(level=logging.INFO)
59
+ print("Data Information")
 
 
 
60
 
61
  info = get_data_source_info()
62
  for key, value in info.items():
63
  print(f" {key}: {value}")
64
 
 
 
 
 
65
  try:
66
  file_path = load_data_file()
67
+ print(f"\nSuccess! Data file: {file_path}")
68
  except Exception as e:
69
+ print(f"\nFailed: {e}")
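
The revised `load_data_file` follows a simple fallback order: try the remote download, fall back to a local copy, and raise `FileNotFoundError` if neither is available. The sketch below illustrates that order in isolation. It assumes the download step is passed in as a callable (the hypothetical `download` parameter stands in for the `hf_hub_download` call), so it runs without network access or the `huggingface_hub` package; `load_with_fallback` is an illustrative name, not a function in the repository.

```python
from pathlib import Path
from typing import Callable


def load_with_fallback(download: Callable[[], str], local_path: str) -> str:
    """Return a data file path: remote download first, then a local copy."""
    try:
        return download()
    except Exception as exc:
        # Broad on purpose: any download failure should trigger the fallback.
        print(f"Failed to download from remote dataset: {exc}")

    if Path(local_path).exists():
        return local_path

    raise FileNotFoundError(
        f"No data source available; expected a local file at '{local_path}'"
    )


if __name__ == "__main__":
    def offline() -> str:
        raise RuntimeError("no network")

    try:
        load_with_fallback(offline, "data/optimized_data.xlsx")
    except FileNotFoundError as err:
        print(err)
```

Injecting the download step keeps the fallback logic testable; the real function would instead catch the failure of the Hugging Face download directly.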
data_processor.py CHANGED
@@ -1,5 +1,5 @@
1
  import logging
2
- from typing import Dict, Set, Tuple
3
 
4
  import numpy as np
5
  import pandas as pd
@@ -22,7 +22,6 @@ class DataProcessor:
22
  return self._preprocess()
23
 
24
  def _load_excel(self):
25
- logger.info(f"Loading {self.config.data.FILE_PATH}...")
26
  try:
27
  sheets = pd.read_excel(self.config.data.FILE_PATH, sheet_name=None)
28
  self.raw_data = {
@@ -36,7 +35,6 @@ class DataProcessor:
36
  raise
37
 
38
  def _validate_raw_data(self):
39
- """Validate required columns and log data quality metrics."""
40
  req_cols = {
41
  "courses": ["kode_mk", "kategori_mk"],
42
  "students_ind": ["kode_mk", "thn", "smt", "kode_mhs"],
@@ -47,46 +45,146 @@ class DataProcessor:
47
  if not all(col in self.raw_data[key].columns for col in cols):
48
  raise ValueError(f"Missing columns in {key}: {cols}")
49
 
50
- # Log data quality metrics
51
- self._log_data_quality()
52
 
53
- def _log_data_quality(self):
54
- """Log data quality metrics for monitoring."""
55
- courses_df = self.raw_data["courses"]
56
- students_df = self.raw_data["students_ind"]
57
 
58
- logger.info("=" * 60)
59
- logger.info("Data Quality Report:")
60
- logger.info(f" Courses (tabel1): {len(courses_df)} records")
61
- logger.info(f" - Unique courses: {courses_df['kode_mk'].nunique()}")
62
  logger.info(
63
- f" - Duplicates: {len(courses_df) - courses_df['kode_mk'].nunique()}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
64
  )
65
- logger.info(f" Students (tabel4): {len(students_df)} records")
66
- logger.info(f" - Unique students: {students_df['kode_mhs'].nunique()}")
67
- logger.info("=" * 60)
68
 
69
  def _clean_courses_data(self, courses: pd.DataFrame) -> pd.DataFrame:
70
- """
71
- Clean and standardize course catalog data.
72
-
73
- Cleaning steps:
74
- 1. Remove exact duplicates
75
- 2. Standardize kategori_mk values (uppercase, strip whitespace)
76
- 3. Remove courses with invalid/missing data
77
- 4. Keep first occurrence for duplicate course codes
78
- 5. Validate kategori_mk values
79
- """
80
  initial_count = len(courses)
81
 
82
- # Step 1: Remove exact duplicate rows
83
  courses = courses.drop_duplicates()
84
  if len(courses) < initial_count:
85
  logger.info(
86
  f" Removed {initial_count - len(courses)} exact duplicate rows"
87
  )
88
 
89
- # Step 2: Standardize kategori_mk
90
  courses["kategori_mk"] = (
91
  courses["kategori_mk"]
92
  .astype(str)
@@ -95,7 +193,7 @@ class DataProcessor:
95
  .replace("", np.nan)
96
  )
97
 
98
- # Step 3: Remove rows with missing critical data
99
  before_dropna = len(courses)
100
  courses = courses.dropna(subset=["kode_mk", "kategori_mk"])
101
  if len(courses) < before_dropna:
@@ -103,7 +201,7 @@ class DataProcessor:
103
  f" Removed {before_dropna - len(courses)} rows with missing kode_mk or kategori_mk"
104
  )
105
 
106
- # Step 4: Validate kategori_mk values (should be P or W)
107
  valid_categories = {"P", "W"}
108
  invalid_mask = ~courses["kategori_mk"].isin(valid_categories)
109
  if invalid_mask.any():
@@ -114,7 +212,7 @@ class DataProcessor:
114
  logger.warning(" Keeping only valid categories (P, W)")
115
  courses = courses[~invalid_mask]
116
 
117
- # Step 5: Remove duplicate course codes (keep first)
118
  before_dedup = len(courses)
119
  courses = courses.drop_duplicates(subset="kode_mk", keep="first")
120
  if len(courses) < before_dedup:
@@ -127,29 +225,20 @@ class DataProcessor:
127
  return courses
128
 
129
  def _clean_students_data(self, students: pd.DataFrame) -> pd.DataFrame:
130
- """
131
- Clean and validate student enrollment data.
132
-
133
- Cleaning steps:
134
- 1. Remove rows with missing critical data
135
- 2. Standardize data types
136
- 3. Remove invalid year/semester values
137
- 4. Remove duplicate enrollment records
138
- """
139
  initial_count = len(students)
140
 
141
- # Step 1: Remove rows with missing critical data
142
  students = students.dropna(subset=["kode_mk", "thn", "smt", "kode_mhs"])
143
  if len(students) < initial_count:
144
  logger.info(
145
  f" Removed {initial_count - len(students)} rows with missing critical data"
146
  )
147
 
148
- # Step 2: Ensure correct data types
149
  students["thn"] = pd.to_numeric(students["thn"], errors="coerce")
150
  students["smt"] = pd.to_numeric(students["smt"], errors="coerce")
151
 
152
- # Step 3: Remove rows with invalid year/semester after conversion
153
  before_invalid = len(students)
154
  students = students.dropna(subset=["thn", "smt"])
155
  if len(students) < before_invalid:
@@ -157,8 +246,8 @@ class DataProcessor:
157
  f" Removed {before_invalid - len(students)} rows with invalid year/semester values"
158
  )
159
 
160
- # Step 4: Validate semester values (should be 1, 2, or 3)
161
- valid_semesters = {1, 2, 3}
162
  invalid_sem = ~students["smt"].isin(valid_semesters)
163
  if invalid_sem.any():
164
  logger.warning(
@@ -166,7 +255,7 @@ class DataProcessor:
166
  )
167
  students = students[~invalid_sem]
168
 
169
- # Step 5: Validate year range (reasonable academic years)
170
  current_year = pd.Timestamp.now().year
171
  invalid_year = (students["thn"] < 2000) | (students["thn"] > current_year + 1)
172
  if invalid_year.any():
@@ -175,7 +264,7 @@ class DataProcessor:
175
  )
176
  students = students[~invalid_year]
177
 
178
- # Step 6: Remove exact duplicate enrollments (same student, course, semester)
179
  before_dedup = len(students)
180
  students = students.drop_duplicates(
181
  subset=["kode_mhs", "kode_mk", "thn", "smt"], keep="first"
@@ -190,14 +279,6 @@ class DataProcessor:
190
  return students
191
 
192
  def _clean_yearly_population(self, yearly_pop: pd.DataFrame) -> pd.DataFrame:
193
- """
194
- Clean and validate yearly student population data.
195
-
196
- Cleaning steps:
197
- 1. Remove duplicates
198
- 2. Validate and fill missing population data
199
- 3. Ensure chronological order
200
- """
201
  # Remove duplicate year-semester combinations
202
  before_dedup = len(yearly_pop)
203
  yearly_pop = yearly_pop.drop_duplicates(subset=["thn", "smt"], keep="first")
@@ -211,7 +292,7 @@ class DataProcessor:
211
  yearly_pop["jumlah_aktif"], errors="coerce"
212
  )
213
 
214
- # Replace zero or negative values with NaN (will be filled later)
215
  yearly_pop.loc[yearly_pop["jumlah_aktif"] <= 0, "jumlah_aktif"] = np.nan
216
 
217
  # Sort by year and semester
@@ -222,20 +303,14 @@ class DataProcessor:
222
  return yearly_pop
223
 
224
  def _preprocess(self) -> Tuple[pd.DataFrame, Set[str]]:
225
- """Clean, merge, and aggregate data with comprehensive cleaning."""
226
- logger.info("Preprocessing data...")
227
- logger.info("-" * 60)
228
-
229
- # Step 1: Clean course catalog
230
- logger.info("Step 1: Cleaning course catalog...")
231
  courses = self._clean_courses_data(self.raw_data["courses"].copy())
232
 
233
- # Step 2: Identify elective courses
234
  elective_category = self.config.data.ELECTIVE_CATEGORY
235
  self.elective_codes = set(
236
  courses[courses["kategori_mk"] == elective_category]["kode_mk"]
237
  )
238
- logger.info(f"Step 2: Identified {len(self.elective_codes)} elective courses")
239
 
240
  if len(self.elective_codes) == 0:
241
  logger.warning(
@@ -246,88 +321,52 @@ class DataProcessor:
246
  )
247
  return pd.DataFrame(), set()
248
 
249
- # Step 3: Clean student enrollment data
250
- logger.info("Step 3: Cleaning student enrollment data...")
251
  students = self._clean_students_data(self.raw_data["students_ind"].copy())
252
 
253
- # Step 4: Filter for elective courses only
254
  students = students[students["kode_mk"].isin(self.elective_codes)]
255
- logger.info(f"Step 4: Filtered to {len(students)} elective enrollment records")
256
 
257
  if len(students) == 0:
258
  logger.warning("No enrollment data found for elective courses!")
259
  return pd.DataFrame(), self.elective_codes
260
 
261
- # Step 5: Aggregate enrollment by course-semester
262
- logger.info("Step 5: Aggregating enrollment data...")
263
  enrollment = (
264
  students.groupby(["kode_mk", "thn", "smt"])["kode_mhs"]
265
  .nunique()
266
  .reset_index(name="enrollment")
267
  )
268
- logger.info(f" Created {len(enrollment)} course-semester enrollment records")
269
 
270
- # Step 6: Clean yearly population data
271
- logger.info("Step 6: Cleaning yearly population data...")
272
  yearly_pop = self._clean_yearly_population(
273
  self.raw_data["students_yearly"][["thn", "smt", "jumlah_aktif"]].copy()
274
  )
275
 
276
- # Step 7: Merge enrollment with population data
277
- logger.info("Step 7: Merging enrollment with population data...")
278
  df = enrollment.merge(yearly_pop, on=["thn", "smt"], how="left")
279
 
280
- # Step 8: Handle missing population data
281
  missing_pop = df["jumlah_aktif"].isna().sum()
282
  if missing_pop > 0:
283
- logger.warning(
284
- f" {missing_pop} records missing population data - filling with interpolation"
285
- )
286
  df["jumlah_aktif"] = df["jumlah_aktif"].ffill().bfill()
287
 
288
- # If still missing, use a reasonable default
289
  if df["jumlah_aktif"].isna().any():
290
- default_pop = 500 # Reasonable default student population
291
- logger.warning(
292
- f" Some population data still missing - using default: {default_pop}"
293
- )
294
  df["jumlah_aktif"] = df["jumlah_aktif"].fillna(default_pop)
295
 
296
- # Step 9: Validate enrollment data
297
- logger.info("Step 8: Validating final enrollment data...")
298
  df = self._validate_enrollment_data(df)
299
 
300
- # Step 10: Sort and finalize
301
  df = df.sort_values(["kode_mk", "thn", "smt"]).reset_index(drop=True)
302
  self.processed_data = df
303
 
304
- logger.info("-" * 60)
305
- logger.info(
306
- f"✓ Preprocessing complete. {len(df)} enrollment records generated."
307
- )
308
- logger.info(f"✓ Year range: {df['thn'].min():.0f} - {df['thn'].max():.0f}")
309
- logger.info(f"✓ Courses with data: {df['kode_mk'].nunique()}")
310
- logger.info("-" * 60)
311
-
312
  return df, self.elective_codes
313
 
314
  def _validate_enrollment_data(self, df: pd.DataFrame) -> pd.DataFrame:
315
- """
316
- Validate and clean the final enrollment dataset.
317
-
318
- Checks:
319
- 1. Remove records with zero enrollment
320
- 2. Check for outliers
321
- 3. Validate population data
322
- """
323
- initial_count = len(df)
324
-
325
  # Remove zero enrollments
326
  df = df[df["enrollment"] > 0]
327
- if len(df) < initial_count:
328
- logger.info(
329
- f" Removed {initial_count - len(df)} records with zero enrollment"
330
- )
331
 
332
  # Check for extreme outliers in enrollment
333
  for course in df["kode_mk"].unique():
@@ -335,7 +374,7 @@ class DataProcessor:
335
  if len(course_data) > 1:
336
  q75, q25 = course_data.quantile([0.75, 0.25])
337
  iqr = q75 - q25
338
- upper_bound = q75 + (3 * iqr) # Using 3*IQR for outliers
339
 
340
  outliers = course_data > upper_bound
341
  if outliers.any():
 
1
  import logging
2
+ from typing import Dict, Optional, Set, Tuple
3
 
4
  import numpy as np
5
  import pandas as pd
 
22
  return self._preprocess()
23
 
24
  def _load_excel(self):
 
25
  try:
26
  sheets = pd.read_excel(self.config.data.FILE_PATH, sheet_name=None)
27
  self.raw_data = {
 
35
  raise
36
 
37
  def _validate_raw_data(self):
 
38
  req_cols = {
39
  "courses": ["kode_mk", "kategori_mk"],
40
  "students_ind": ["kode_mk", "thn", "smt", "kode_mhs"],
 
45
  if not all(col in self.raw_data[key].columns for col in cols):
46
  raise ValueError(f"Missing columns in {key}: {cols}")
47
 
48
+ def get_actual_classes_opened(
49
+ self, year: int, semester: int, course_code: Optional[str] = None
50
+ ) -> Dict[str, int]:
51
+ offerings = self.raw_data.get("offerings")
52
+ if offerings is None or len(offerings) == 0:
53
+ logger.warning("No offerings data (tabel2) available")
54
+ return {}
55
+
56
+ # Standardize column names
57
+ offerings = offerings.copy()
58
+ for old_col, new_col in self.config.data.OFFERINGS_RENAME.items():
59
+ if old_col in offerings.columns and new_col not in offerings.columns:
60
+ offerings = offerings.rename(columns={old_col: new_col})
61
+
62
+ # Log column names for debugging
63
+ logger.debug(f"Offerings columns: {offerings.columns.tolist()}")
64
+
65
+ # Filter by year and semester
66
+ mask = (offerings["thn"] == year) & (offerings["smt"] == semester)
67
+ if course_code:
68
+ mask = mask & (offerings["kode_mk"] == course_code)
69
+
70
+ filtered = offerings[mask]
71
+
72
+ if len(filtered) == 0:
73
+ logger.info(f"No class offerings found for {year} semester {semester}")
74
+ return {}
75
+
76
+ class_id_candidates = [
77
+ "kelas_id",
78
+ "id_kelas",
79
+ "kode_kelas",
80
+ "class_id",
81
+ "kelas",
82
+ "section_id",
83
+ "section",
84
+ ]
85
+ class_id_col = None
86
+
87
+ for col in class_id_candidates:
88
+ if col in filtered.columns:
89
+ class_id_col = col
90
+ logger.debug(f"Using class ID column: {col}")
91
+ break
92
+
93
+ if class_id_col is None:
94
+ cols = filtered.columns.tolist()
95
+ if len(cols) > 2:
96
+ potential_id_col = cols[2]
97
+ non_id_cols = [
98
+ "nama_mk",
99
+ "smt",
100
+ "thn",
101
+ "semester",
102
+ "tahun",
103
+ "kuota",
104
+ "kapasitas",
105
+ ]
106
+ if potential_id_col.lower() not in non_id_cols:
107
+ class_id_col = potential_id_col
108
+ logger.debug(
109
+ f"Using positional class ID column (index 2): {potential_id_col}"
110
+ )
111
+
112
+ result = {}
113
+
114
+ for kode_mk in filtered["kode_mk"].unique():
115
+ course_data = filtered[filtered["kode_mk"] == kode_mk]
116
+
117
+ if class_id_col and class_id_col in course_data.columns:
118
+ unique_classes = course_data[class_id_col].nunique()
119
+ logger.debug(
120
+ f"Course {kode_mk}: {len(course_data)} rows, {unique_classes} unique classes (by {class_id_col})"
121
+ )
122
+ else:
123
+ all_cols = course_data.columns.tolist()
124
+
125
+ dosen_cols = [
126
+ col
127
+ for col in all_cols
128
+ if "dosen" in col.lower()
129
+ or "pengajar" in col.lower()
130
+ or "teacher" in col.lower()
131
+ ]
132
+
133
+ if len(all_cols) > 0:
134
+ last_col = all_cols[-1]
135
+ if last_col not in dosen_cols:
136
+ non_last_cols = [c for c in all_cols if c != last_col]
137
+ if len(non_last_cols) > 0:
138
+ grouped = course_data.groupby(non_last_cols)[
139
+ last_col
140
+ ].nunique()
141
+ if (grouped > 1).any():
142
+ dosen_cols.append(last_col)
143
+
144
+ non_dosen_cols = [col for col in all_cols if col not in dosen_cols]
145
+
146
+ if non_dosen_cols:
147
+ unique_classes = len(
148
+ course_data.drop_duplicates(subset=non_dosen_cols)
149
+ )
150
+ else:
151
+ unique_classes = len(course_data.drop_duplicates())
152
+
153
+ logger.debug(
154
+ f"Course {kode_mk}: {len(course_data)} rows, {unique_classes} unique classes (fallback method)"
155
+ )
156
 
157
+ result[kode_mk] = max(1, unique_classes)
 
 
 
158
 
 
 
 
 
159
  logger.info(
160
+ f"Found {len(result)} courses with {sum(result.values())} total classes for {year} sem {semester}"
161
+ )
162
+ return result
163
+
164
+ def get_class_count_for_validation(self, year: int, semester: int) -> pd.DataFrame:
165
+ actual_classes = self.get_actual_classes_opened(year, semester)
166
+
167
+ if not actual_classes:
168
+ return pd.DataFrame(columns=["kode_mk", "actual_classes"])
169
+
170
+ return pd.DataFrame(
171
+ [
172
+ {"kode_mk": kode, "actual_classes": count}
173
+ for kode, count in actual_classes.items()
174
+ ]
175
  )
 
 
 
176
 
177
  def _clean_courses_data(self, courses: pd.DataFrame) -> pd.DataFrame:
 
 
 
 
 
 
 
 
 
 
178
  initial_count = len(courses)
179
 
180
+ # Remove exact duplicate rows
181
  courses = courses.drop_duplicates()
182
  if len(courses) < initial_count:
183
  logger.info(
184
  f" Removed {initial_count - len(courses)} exact duplicate rows"
185
  )
186
 
187
+ # Standardize kategori_mk
188
  courses["kategori_mk"] = (
189
  courses["kategori_mk"]
190
  .astype(str)
 
193
  .replace("", np.nan)
194
  )
195
 
196
+ # Remove rows with missing critical data
197
  before_dropna = len(courses)
198
  courses = courses.dropna(subset=["kode_mk", "kategori_mk"])
199
  if len(courses) < before_dropna:
 
201
  f" Removed {before_dropna - len(courses)} rows with missing kode_mk or kategori_mk"
202
  )
203
 
204
+ # Validate kategori_mk values
205
  valid_categories = {"P", "W"}
206
  invalid_mask = ~courses["kategori_mk"].isin(valid_categories)
207
  if invalid_mask.any():
 
212
  logger.warning(" Keeping only valid categories (P, W)")
213
  courses = courses[~invalid_mask]
214
 
215
+ # Remove duplicate course codes (keep first)
216
  before_dedup = len(courses)
217
  courses = courses.drop_duplicates(subset="kode_mk", keep="first")
218
  if len(courses) < before_dedup:
 
225
  return courses
226
 
227
  def _clean_students_data(self, students: pd.DataFrame) -> pd.DataFrame:
 
 
 
 
 
 
 
 
 
228
  initial_count = len(students)
229
 
230
+ # Remove rows with missing critical data
231
  students = students.dropna(subset=["kode_mk", "thn", "smt", "kode_mhs"])
232
  if len(students) < initial_count:
233
  logger.info(
234
  f" Removed {initial_count - len(students)} rows with missing critical data"
235
  )
236
 
237
+ # Ensure correct data types
238
  students["thn"] = pd.to_numeric(students["thn"], errors="coerce")
239
  students["smt"] = pd.to_numeric(students["smt"], errors="coerce")
240
 
241
+ # Remove rows with invalid year/semester after conversion
242
  before_invalid = len(students)
243
  students = students.dropna(subset=["thn", "smt"])
244
  if len(students) < before_invalid:
 
246
  f" Removed {before_invalid - len(students)} rows with invalid year/semester values"
247
  )
248
 
249
+ # Validate semester values (1 = Ganjil/odd, 2 = Genap/even)
250
+ valid_semesters = {1, 2}
251
  invalid_sem = ~students["smt"].isin(valid_semesters)
252
  if invalid_sem.any():
253
  logger.warning(
 
255
  )
256
  students = students[~invalid_sem]
257
 
258
+ # Validate year range
259
  current_year = pd.Timestamp.now().year
260
  invalid_year = (students["thn"] < 2000) | (students["thn"] > current_year + 1)
261
  if invalid_year.any():
 
264
  )
265
  students = students[~invalid_year]
266
 
267
+ # Remove exact duplicate enrollments (same student, course, semester)
268
  before_dedup = len(students)
269
  students = students.drop_duplicates(
270
  subset=["kode_mhs", "kode_mk", "thn", "smt"], keep="first"
 
279
  return students
280
 
281
  def _clean_yearly_population(self, yearly_pop: pd.DataFrame) -> pd.DataFrame:
 
 
 
 
 
 
 
 
282
  # Remove duplicate year-semester combinations
283
  before_dedup = len(yearly_pop)
284
  yearly_pop = yearly_pop.drop_duplicates(subset=["thn", "smt"], keep="first")
 
292
  yearly_pop["jumlah_aktif"], errors="coerce"
293
  )
294
 
295
+ # Replace zero or negative values with NaN
296
  yearly_pop.loc[yearly_pop["jumlah_aktif"] <= 0, "jumlah_aktif"] = np.nan
297
 
298
  # Sort by year and semester
 
303
  return yearly_pop
304
 
305
  def _preprocess(self) -> Tuple[pd.DataFrame, Set[str]]:
306
+ # Clean course catalog
 
 
 
 
 
307
  courses = self._clean_courses_data(self.raw_data["courses"].copy())
308
 
309
+ # Identify elective courses
310
  elective_category = self.config.data.ELECTIVE_CATEGORY
311
  self.elective_codes = set(
312
  courses[courses["kategori_mk"] == elective_category]["kode_mk"]
313
  )
 
314
 
315
  if len(self.elective_codes) == 0:
316
  logger.warning(
 
321
  )
322
  return pd.DataFrame(), set()
323
 
324
+ # Clean student enrollment data
 
325
  students = self._clean_students_data(self.raw_data["students_ind"].copy())
326
 
327
+ # Filter for elective courses only
328
  students = students[students["kode_mk"].isin(self.elective_codes)]
 
329
 
330
  if len(students) == 0:
331
  logger.warning("No enrollment data found for elective courses!")
332
  return pd.DataFrame(), self.elective_codes
333
 
334
+ # Aggregate enrollment by course-semester
 
335
  enrollment = (
336
  students.groupby(["kode_mk", "thn", "smt"])["kode_mhs"]
337
  .nunique()
338
  .reset_index(name="enrollment")
339
  )
 
340
 
341
+ # Clean yearly population data
 
342
  yearly_pop = self._clean_yearly_population(
343
  self.raw_data["students_yearly"][["thn", "smt", "jumlah_aktif"]].copy()
344
  )
345
 
346
+ # Merge enrollment with population data
 
347
  df = enrollment.merge(yearly_pop, on=["thn", "smt"], how="left")
348
 
349
+ # Handle missing population data
350
  missing_pop = df["jumlah_aktif"].isna().sum()
351
  if missing_pop > 0:
 
 
 
352
  df["jumlah_aktif"] = df["jumlah_aktif"].ffill().bfill()
353
 
 
354
  if df["jumlah_aktif"].isna().any():
355
+ default_pop = 500  # Fallback student population when none is recorded
 
 
 
356
  df["jumlah_aktif"] = df["jumlah_aktif"].fillna(default_pop)
357
 
358
+ # Validate enrollment data
 
359
  df = self._validate_enrollment_data(df)
360
 
361
+ # Sort and finalize
362
  df = df.sort_values(["kode_mk", "thn", "smt"]).reset_index(drop=True)
363
  self.processed_data = df
364
 
 
 
 
 
 
 
 
 
365
  return df, self.elective_codes
366
 
367
  def _validate_enrollment_data(self, df: pd.DataFrame) -> pd.DataFrame:
 
 
 
 
 
 
 
 
 
 
368
  # Remove zero enrollments
369
  df = df[df["enrollment"] > 0]
 
 
 
 
370
 
371
  # Check for extreme outliers in enrollment
372
  for course in df["kode_mk"].unique():
 
374
  if len(course_data) > 1:
375
  q75, q25 = course_data.quantile([0.75, 0.25])
376
  iqr = q75 - q25
377
+ upper_bound = q75 + (3 * iqr)  # Values above Q3 + 3 * IQR count as extreme outliers
378
 
379
  outliers = course_data > upper_bound
380
  if outliers.any():
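
The outlier check in `_validate_enrollment_data` flags enrollment counts above Q3 + 3 × IQR for each course. The sketch below reproduces that rule in plain Python so it can be tried without pandas. It assumes linear interpolation between order statistics, which is the default of pandas `Series.quantile`; the helper names `quantile` and `extreme_outliers` are hypothetical. The diff does not show what the processor does with flagged values, so the sketch only reports them.

```python
from typing import List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def extreme_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Return the values lying above Q3 + k * IQR."""
    q25 = quantile(values, 0.25)
    q75 = quantile(values, 0.75)
    upper_bound = q75 + k * (q75 - q25)
    return [v for v in values if v > upper_bound]


if __name__ == "__main__":
    enrollment_history = [20, 22, 25, 24, 23, 200]
    print(extreme_outliers(enrollment_history))
```

A multiplier of 3 is stricter than the common 1.5 × IQR rule, so only very large spikes in a course's enrollment history are flagged.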
data_validator.py DELETED
@@ -1,467 +0,0 @@
1
- """
2
- Data Validation Utility
3
-
4
- Provides pre-flight checks and data quality validation for the enrollment prediction system.
5
- This module validates data availability, quality, and completeness before processing.
6
- """
7
-
8
- import logging
9
- from dataclasses import dataclass
10
- from typing import Dict, List, Optional, Tuple
11
-
12
- import pandas as pd
13
-
14
- logger = logging.getLogger(__name__)
15
-
16
-
17
- @dataclass
18
- class ValidationResult:
19
- """Result of a validation check."""
20
-
21
- passed: bool
22
- message: str
23
- severity: str = "INFO" # INFO, WARNING, ERROR
24
- details: Optional[Dict] = None
25
-
26
-
27
- @dataclass
28
- class SemesterDataStatus:
29
- """Status of data availability for a specific semester."""
30
-
31
- year: int
32
- semester: int
33
- has_offerings: bool
34
- has_enrollments: bool
35
- has_elective_enrollments: bool
36
- total_enrollments: int
37
- elective_enrollments: int
38
- elective_courses: Dict[str, int]
39
-
40
-
41
- class DataValidator:
42
- """Validates data quality and availability for the enrollment prediction system."""
43
-
44
- def __init__(self, file_path: str):
45
- """
46
- Initialize the validator.
47
-
48
- Args:
49
- file_path: Path to the Excel data file
50
- """
51
- self.file_path = file_path
52
- self.validation_results: List[ValidationResult] = []
53
-
54
- def validate_all(self) -> Tuple[bool, List[ValidationResult]]:
55
- """
56
- Run all validation checks.
57
-
58
- Returns:
59
- Tuple of (all_passed, list of validation results)
60
- """
61
- logger.info("Running comprehensive data validation...")
62
-
63
- # Load raw data
64
- try:
65
- self.raw_data = self._load_raw_data()
66
- except Exception as e:
67
- self.validation_results.append(
68
- ValidationResult(
69
- passed=False,
70
- message=f"Failed to load data: {str(e)}",
71
- severity="ERROR",
72
- )
73
- )
74
- return False, self.validation_results
75
-
76
- # Run validation checks
77
- self._validate_file_structure()
78
- self._validate_course_catalog()
79
- self._validate_elective_courses()
80
- self._validate_enrollment_data()
81
- self._validate_population_data()
82
-
83
- # Overall result
84
- all_passed = all(
85
- r.passed for r in self.validation_results if r.severity == "ERROR"
86
- )
87
-
88
- return all_passed, self.validation_results
89
-
90
- def check_semester_data_availability(
91
- self, year: int, semester: int
92
- ) -> SemesterDataStatus:
93
- """
94
- Check data availability for a specific semester.
95
-
96
- Args:
97
- year: Academic year
98
- semester: Semester (1 or 2)
99
-
100
- Returns:
101
- SemesterDataStatus object with detailed availability info
102
- """
103
- if not hasattr(self, "raw_data"):
104
- self.raw_data = self._load_raw_data()
105
-
106
- # Check course offerings (tabel2)
107
- offerings = self.raw_data["offerings"]
108
- has_offerings = (
109
- len(
110
- offerings[
111
- (offerings["tahun"] == year) & (offerings["semester"] == semester)
112
- ]
113
- )
114
- > 0
115
- )
116
-
117
- # Check enrollments (tabel4)
118
- students = self.raw_data["students"]
119
- semester_enrollments = students[
120
- (students["thn"] == year) & (students["smt"] == semester)
121
- ]
122
- has_enrollments = len(semester_enrollments) > 0
123
-
124
- # Check elective enrollments
125
- elective_codes = self._get_elective_codes()
126
- elective_enrollments = semester_enrollments[
127
- semester_enrollments["kode_mk"].isin(elective_codes)
128
- ]
129
- has_elective_enrollments = len(elective_enrollments) > 0
130
-
131
- # Get elective courses for this semester
132
- elective_courses: Dict[str, int] = {}
133
- if has_elective_enrollments:
134
- elective_courses = (
135
- elective_enrollments.groupby("kode_mk")["kode_mhs"]
136
- .nunique()
137
- .sort_values(ascending=False)
138
- .to_dict()
139
- )
140
-
141
- return SemesterDataStatus(
142
- year=year,
143
- semester=semester,
144
- has_offerings=has_offerings,
145
- has_enrollments=has_enrollments,
146
- has_elective_enrollments=has_elective_enrollments,
147
- total_enrollments=len(semester_enrollments),
148
- elective_enrollments=len(elective_enrollments),
149
- elective_courses=elective_courses,
150
- )
151
-
152
- def get_available_semesters_for_backtesting(self) -> List[Tuple[int, int]]:
153
- """
154
- Get list of semesters that have elective enrollment data (suitable for backtesting).
155
-
156
- Returns:
157
- List of (year, semester) tuples
158
- """
159
- if not hasattr(self, "raw_data"):
160
- self.raw_data = self._load_raw_data()
161
-
162
- students = self.raw_data["students"]
163
- elective_codes = self._get_elective_codes()
164
-
165
- # Filter to elective enrollments only
166
- elective_students = students[students["kode_mk"].isin(elective_codes)]
167
-
168
- # Get unique year-semester combinations
169
- available = (
170
- elective_students.groupby(["thn", "smt"]).size().reset_index(name="count")
171
- )
172
- available = available[available["count"] > 0]
173
-
174
- semesters = [
175
- (int(row["thn"]), int(row["smt"])) for _, row in available.iterrows()
176
- ]
177
- semesters.sort(reverse=True) # Most recent first
178
-
179
- return semesters
180
-
181
- def print_validation_summary(self):
182
- """Print a summary of validation results."""
183
- if not self.validation_results:
184
- print("\nWARNING: No validation has been run yet.")
185
- return
186
-
187
- print("\n" + "=" * 80)
188
- print("DATA VALIDATION SUMMARY")
189
- print("=" * 80)
190
-
191
- errors = [r for r in self.validation_results if r.severity == "ERROR"]
192
- warnings = [r for r in self.validation_results if r.severity == "WARNING"]
193
- info = [r for r in self.validation_results if r.severity == "INFO"]
194
-
195
- if errors:
196
- print(f"\nERROR ({len(errors)}):")
197
- for result in errors:
198
- print(f" - {result.message}")
199
-
200
- if warnings:
201
- print(f"\nWARNING ({len(warnings)}):")
202
- for result in warnings:
203
- print(f" - {result.message}")
204
-
205
- if info:
206
- print(f"\nINFO ({len(info)}):")
207
- for result in info:
208
- print(f" - {result.message}")
209
-
210
- print("\n" + "=" * 80)
211
- if not errors:
212
- print("VALIDATION PASSED - Data is ready for processing")
213
- else:
214
- print("VALIDATION FAILED - Please fix errors before proceeding")
215
- print("=" * 80)
216
-
217
- def _load_raw_data(self) -> Dict[str, pd.DataFrame]:
218
- """Load raw data from Excel file."""
219
- logger.info(f"Loading data from {self.file_path}...")
220
-
221
- return {
222
- "courses": pd.read_excel(self.file_path, sheet_name="tabel1_data_matkul"),
223
- "offerings": pd.read_excel(
224
- self.file_path, sheet_name="tabel2_data_matkul_dibuka"
225
- ),
226
- "population": pd.read_excel(
227
- self.file_path, sheet_name="tabel3_data_mahasiswa_per_tahun"
228
- ),
229
- "students": pd.read_excel(
230
- self.file_path, sheet_name="tabel4_data_individu_mahasiswa"
231
- ),
232
- }
233
-
234
- def _validate_file_structure(self):
235
- """Validate that all required sheets and columns exist."""
236
- required_sheets = {
237
- "courses": ["kode_mk", "nama_mk", "kategori_mk"],
238
- "offerings": ["kode_mk", "tahun", "semester"],
239
- "students": ["kode_mk", "kode_mhs", "thn", "smt"],
240
- "population": ["jumlah_aktif"], # tahun_ajaran and semester may vary
241
- }
242
-
243
- for sheet_name, required_cols in required_sheets.items():
244
- df = self.raw_data.get(sheet_name)
245
- if df is None:
246
- self.validation_results.append(
247
- ValidationResult(
248
- passed=False,
249
- message=f"Sheet '{sheet_name}' not found",
250
- severity="ERROR",
251
- )
252
- )
253
- continue
254
-
255
- missing_cols = [col for col in required_cols if col not in df.columns]
256
- if missing_cols:
257
- self.validation_results.append(
258
- ValidationResult(
259
- passed=False,
260
- message=f"Missing columns in {sheet_name}: {missing_cols}",
261
- severity="ERROR",
262
- )
263
- )
264
- else:
265
- self.validation_results.append(
266
- ValidationResult(
267
- passed=True,
268
- message=f"Sheet '{sheet_name}' has all required columns",
269
- severity="INFO",
270
- )
271
- )
272
-
273
- def _validate_course_catalog(self):
274
- """Validate course catalog (tabel1)."""
275
- courses = self.raw_data["courses"]
276
-
277
- # Check for duplicates
278
- total_records = len(courses)
279
- unique_courses = courses["kode_mk"].nunique()
280
- duplicate_count = total_records - unique_courses
281
-
282
- if duplicate_count > 0:
283
- self.validation_results.append(
284
- ValidationResult(
285
- passed=True,
286
- message=f"Course catalog has {duplicate_count:,} duplicate records (will be cleaned)",
287
- severity="WARNING",
288
- details={"total": total_records, "unique": unique_courses},
289
- )
290
- )
291
-
292
- # Check for category consistency
293
- categories = courses["kategori_mk"].unique()
294
- non_standard = [c for c in categories if c not in ["W", "P"]]
295
- if non_standard:
296
- self.validation_results.append(
297
- ValidationResult(
298
- passed=True,
299
- message=f"Non-standard categories found: {non_standard} (will be normalized)",
300
- severity="WARNING",
301
- )
302
- )
303
-
304
- def _validate_elective_courses(self):
305
- """Validate elective course identification."""
306
- courses = self.raw_data["courses"]
307
-
308
- # Clean and identify electives
309
- courses_clean = courses.drop_duplicates(subset="kode_mk").copy()
310
- courses_clean["kategori_mk"] = (
311
- courses_clean["kategori_mk"].astype(str).str.upper().str.strip()
312
- )
313
-
314
- electives = courses_clean[courses_clean["kategori_mk"] == "P"]
315
- elective_count = len(electives)
316
-
317
- if elective_count == 0:
318
- self.validation_results.append(
319
- ValidationResult(
320
- passed=False,
321
- message="No elective courses found (kategori_mk = 'P')",
322
- severity="ERROR",
323
- )
324
- )
325
- else:
326
- self.validation_results.append(
327
- ValidationResult(
328
- passed=True,
329
- message=f"Found {elective_count} elective courses",
330
- severity="INFO",
331
- details={"electives": electives["kode_mk"].tolist()},
332
- )
333
- )
334
-
335
- def _validate_enrollment_data(self):
336
- """Validate student enrollment data (tabel4)."""
337
- students = self.raw_data["students"]
338
-
339
- # Check for missing critical data
340
- critical_fields = ["kode_mk", "kode_mhs", "thn", "smt"]
341
- missing_data = students[critical_fields].isnull().any(axis=1).sum()
342
-
343
- if missing_data > 0:
344
- self.validation_results.append(
345
- ValidationResult(
346
- passed=True,
347
- message=f"{missing_data} enrollment records have missing data (will be cleaned)",
348
- severity="WARNING",
349
- )
350
- )
351
-
352
- # Check for duplicates
353
- duplicate_enrollments = students.duplicated(
354
- subset=["kode_mhs", "kode_mk", "thn", "smt"]
355
- ).sum()
356
-
357
- if duplicate_enrollments > 0:
358
- self.validation_results.append(
359
- ValidationResult(
360
- passed=True,
361
- message=f"{duplicate_enrollments:,} duplicate enrollment records (will be cleaned)",
362
- severity="WARNING",
363
- )
364
- )
365
-
366
- # Check year range
367
- min_year = students["thn"].min()
368
- max_year = students["thn"].max()
369
-
370
- self.validation_results.append(
371
- ValidationResult(
372
- passed=True,
373
- message=f"Enrollment data spans {int(min_year)} to {int(max_year)}",
374
- severity="INFO",
375
- )
376
- )
377
-
378
- def _validate_population_data(self):
379
- """Validate yearly population data (tabel3)."""
380
- population = self.raw_data["population"]
381
-
382
- if len(population) == 0:
383
- self.validation_results.append(
384
- ValidationResult(
385
- passed=False,
386
- message="No population data found",
387
- severity="ERROR",
388
- )
389
- )
390
- return
391
-
392
- # Check for required fields (note: actual columns are tahun_ajaran/semester, not in sheet_name definition)
393
- if "jumlah_aktif" in population.columns:
394
- min_pop = population["jumlah_aktif"].min()
395
- max_pop = population["jumlah_aktif"].max()
396
-
397
- self.validation_results.append(
398
- ValidationResult(
399
- passed=True,
400
- message=f"Population data: {len(population)} records, range {int(min_pop)}-{int(max_pop)} students",
401
- severity="INFO",
402
- )
403
- )
404
- else:
405
- self.validation_results.append(
406
- ValidationResult(
407
- passed=False,
408
- message="Population data missing 'jumlah_aktif' column",
409
- severity="ERROR",
410
- )
411
- )
412
-
413
- def _get_elective_codes(self) -> set:
414
- """Get set of elective course codes."""
415
- courses = self.raw_data["courses"]
416
- courses_clean = courses.drop_duplicates(subset="kode_mk").copy()
417
- courses_clean["kategori_mk"] = (
418
- courses_clean["kategori_mk"].astype(str).str.upper().str.strip()
419
- )
420
- return set(courses_clean[courses_clean["kategori_mk"] == "P"]["kode_mk"])
421
-
422
-
423
- if __name__ == "__main__":
424
- # Example usage
425
- logging.basicConfig(
426
- level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
427
- )
428
-
429
- validator = DataValidator(
430
- "data/Data Perkuliahan Mahasiswa untuk Penelitian (8 Oktober 2025).xlsx"
431
- )
432
-
433
- # Run validation
434
- passed, results = validator.validate_all()
435
- validator.print_validation_summary()
436
-
437
- # Check specific semesters
438
- print("\n" + "=" * 80)
439
- print("SEMESTER DATA AVAILABILITY")
440
- print("=" * 80)
441
-
442
- for year, semester in [(2024, 2), (2025, 1)]:
443
- status = validator.check_semester_data_availability(year, semester)
444
- print(f"\n{year} Semester {semester}:")
445
- print(f" Offerings: {'Yes' if status.has_offerings else 'No'}")
446
- print(
447
- f" Enrollments: {'Yes' if status.has_enrollments else 'No'} ({status.total_enrollments} records)"
448
- )
449
- print(
450
- f" Elective Enrollments: {'Yes' if status.has_elective_enrollments else 'No'} ({status.elective_enrollments} records)"
451
- )
452
- if status.elective_courses:
453
- print(f" Elective courses: {len(status.elective_courses)}")
454
- for code, count in list(status.elective_courses.items())[:5]:
455
- print(f" - {code}: {count} students")
456
-
457
- # Show available semesters for backtesting
458
- print("\n" + "=" * 80)
459
- print("SEMESTERS AVAILABLE FOR BACKTESTING")
460
- print("=" * 80)
461
- available = validator.get_available_semesters_for_backtesting()
462
- if available:
463
- print(f"\nFound {len(available)} semesters with elective enrollment data:")
464
- for year, sem in available:
465
- print(f" • {year} Semester {sem}")
466
- else:
467
- print("\nERROR: No semesters with elective enrollment data found!")
evaluator.py CHANGED
@@ -17,12 +17,11 @@ class Evaluator:
17
  self.config = config
18
 
19
  def run_backtest(self, full_data: pd.DataFrame, predictor):
20
- """Simulate past semesters to check accuracy."""
21
- logger.info("Starting Backtest...")
22
  results = []
23
 
24
  start_year: int = self.config.backtest.START_YEAR
25
  end_year: int = self.config.backtest.END_YEAR
 
26
 
27
  for year in range(start_year, end_year + 1):
28
  for smt in [1, 2]:
@@ -47,53 +46,251 @@ class Evaluator:
47
  row["kode_mk"], train_set, year, smt, pop_est
48
  )
49
 
 
 
 
 
 
 
 
 
 
 
 
50
  results.append(
51
  {
52
  "year": year,
53
  "semester": smt,
54
  "kode_mk": row["kode_mk"],
55
- "actual": row["enrollment"],
56
- "predicted": pred["val"],
 
 
57
  "strategy": pred["strategy"],
58
- "error": abs(row["enrollment"] - pred["val"]),
 
59
  }
60
  )
61
 
62
  return pd.DataFrame(results)
63
 
 
 
 
 
 
64
  def generate_metrics(self, results: pd.DataFrame):
65
- """Calculate and log performance metrics."""
 
 
 
66
  results["error"] = abs(results["predicted"] - results["actual"])
 
 
 
67
 
 
68
  mae = mean_absolute_error(results["actual"], results["predicted"])
69
  rmse = np.sqrt(mean_squared_error(results["actual"], results["predicted"]))
70
 
71
- logger.info("\n" + "=" * 40)
72
  logger.info("BACKTEST METRICS")
73
- logger.info("=" * 40)
74
- logger.info(f"Overall MAE: {mae:.2f}")
75
- logger.info(f"Overall RMSE: {rmse:.2f}")
 
 
 
 
 
76
 
77
  logger.info("\nPerformance by Strategy:")
78
- strat_perf = results.groupby("strategy")["error"].mean()
 
 
 
 
 
79
  logger.info(strat_perf.to_string())
80
 
 
 
81
  self._plot_results(results)
 
82
 
83
- return {"mae": mae, "rmse": rmse}
 
 
 
 
 
 
84
 
85
  def _plot_results(self, df):
86
- """Generate simple Actual vs Predicted scatter plot."""
87
  Path(self.config.output.OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
88
 
89
  plt.figure(figsize=(10, 6))
90
  sns.scatterplot(
91
- data=df, x="actual", y="predicted", hue="strategy", style="strategy"
 
 
 
 
 
92
  )
93
 
94
  limit = max(df["actual"].max(), df["predicted"].max())
95
- plt.plot([0, limit], [0, limit], "r--", alpha=0.5)
96
 
97
  plt.title("Actual vs Predicted Enrollment")
98
- plt.savefig(f"{self.config.output.OUTPUT_DIR}/backtest_scatter.png")
 
 
 
 
 
 
99
  plt.close()
17
  self.config = config
18
 
19
  def run_backtest(self, full_data: pd.DataFrame, predictor):
 
 
20
  results = []
21
 
22
  start_year: int = self.config.backtest.START_YEAR
23
  end_year: int = self.config.backtest.END_YEAR
24
+ class_capacity = self.config.class_capacity.DEFAULT_CLASS_CAPACITY
25
 
26
  for year in range(start_year, end_year + 1):
27
  for smt in [1, 2]:
 
46
  row["kode_mk"], train_set, year, smt, pop_est
47
  )
48
 
49
+ actual_enrollment = row["enrollment"]
50
+ predicted_enrollment = pred["val"]
51
+
52
+ actual_classes = self._calculate_classes(
53
+ actual_enrollment, class_capacity
54
+ )
55
+ predicted_classes = pred.get(
56
+ "classes_needed",
57
+ self._calculate_classes(predicted_enrollment, class_capacity),
58
+ )
59
+
60
  results.append(
61
  {
62
  "year": year,
63
  "semester": smt,
64
  "kode_mk": row["kode_mk"],
65
+ "actual": actual_enrollment,
66
+ "predicted": predicted_enrollment,
67
+ "actual_classes": actual_classes,
68
+ "predicted_classes": predicted_classes,
69
  "strategy": pred["strategy"],
70
+ "error": abs(actual_enrollment - predicted_enrollment),
71
+ "class_error": abs(actual_classes - predicted_classes),
72
  }
73
  )
74
 
75
  return pd.DataFrame(results)
76
 
77
+ def _calculate_classes(self, enrollment: float, capacity: int) -> int:
78
+ if enrollment < self.config.class_capacity.MIN_STUDENTS_TO_OPEN_CLASS:
79
+ return 0
80
+ return int(np.ceil(enrollment / capacity))
81
+
82
  def generate_metrics(self, results: pd.DataFrame):
83
+ if results.empty:
84
+ logger.warning("No results to generate metrics from")
85
+ return {"mae": 0, "rmse": 0, "class_mae": 0, "class_accuracy": 0, "class_accuracy_within_1": 0}
86
+
87
  results["error"] = abs(results["predicted"] - results["actual"])
88
+ results["class_error"] = abs(
89
+ results["predicted_classes"] - results["actual_classes"]
90
+ )
91
 
92
+ # Enrollment metrics
93
  mae = mean_absolute_error(results["actual"], results["predicted"])
94
  rmse = np.sqrt(mean_squared_error(results["actual"], results["predicted"]))
95
 
96
+ # Class count metrics
97
+ class_mae = results["class_error"].mean()
98
+
99
+ # Class accuracy: percentage of predictions with correct class count
100
+ class_correct = (results["class_error"] == 0).sum()
101
+ class_accuracy = (class_correct / len(results)) * 100 if len(results) > 0 else 0
102
+
103
+ # Class accuracy within 1: predictions within ±1 class
104
+ class_within_1 = (results["class_error"] <= 1).sum()
105
+ class_accuracy_within_1 = (
106
+ (class_within_1 / len(results)) * 100 if len(results) > 0 else 0
107
+ )
108
+
109
  logger.info("BACKTEST METRICS")
110
+ logger.info("\nEnrollment Prediction Metrics:")
111
+ logger.info(f" Overall MAE: {mae:.2f} students")
112
+ logger.info(f" Overall RMSE: {rmse:.2f} students")
113
+
114
+ logger.info("\nClass Count Prediction Metrics:")
115
+ logger.info(f" Class MAE: {class_mae:.2f} classes")
116
+ logger.info(f" Exact Class Match: {class_accuracy:.1f}%")
117
+ logger.info(f" Within ±1 Class: {class_accuracy_within_1:.1f}%")
118
 
119
  logger.info("\nPerformance by Strategy:")
120
+ strat_perf = (
121
+ results.groupby("strategy")
122
+ .agg({"error": "mean", "class_error": "mean"})
123
+ .round(2)
124
+ )
125
+ strat_perf.columns = ["Avg Enrollment Error", "Avg Class Error"]
126
  logger.info(strat_perf.to_string())
127
 
128
+ logger.info("=" * 50)
129
+
130
  self._plot_results(results)
131
+ self._plot_class_results(results)
132
 
133
+ return {
134
+ "mae": mae,
135
+ "rmse": rmse,
136
+ "class_mae": class_mae,
137
+ "class_accuracy": class_accuracy,
138
+ "class_accuracy_within_1": class_accuracy_within_1,
139
+ }
140
 
141
  def _plot_results(self, df):
 
142
  Path(self.config.output.OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
143
 
144
  plt.figure(figsize=(10, 6))
145
  sns.scatterplot(
146
+ data=df,
147
+ x="actual",
148
+ y="predicted",
149
+ hue="strategy",
150
+ style="strategy",
151
+ alpha=0.7,
152
  )
153
 
154
  limit = max(df["actual"].max(), df["predicted"].max())
155
+ plt.plot([0, limit], [0, limit], "r--", alpha=0.5, label="Perfect Prediction")
156
 
157
  plt.title("Actual vs Predicted Enrollment")
158
+ plt.xlabel("Actual Enrollment")
159
+ plt.ylabel("Predicted Enrollment")
160
+ plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
161
+ plt.tight_layout()
162
+ plt.savefig(
163
+ f"{self.config.output.OUTPUT_DIR}/backtest_enrollment_scatter.png", dpi=150
164
+ )
165
  plt.close()
166
+
167
+ def _plot_class_results(self, df):
168
+ Path(self.config.output.OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
169
+
170
+ plt.figure(figsize=(10, 6))
171
+
172
+ jitter_strength = 0.1
173
+ df_plot = df.copy()
174
+ df_plot["actual_jitter"] = df_plot["actual_classes"] + np.random.uniform(
175
+ -jitter_strength, jitter_strength, len(df_plot)
176
+ )
177
+ df_plot["predicted_jitter"] = df_plot["predicted_classes"] + np.random.uniform(
178
+ -jitter_strength, jitter_strength, len(df_plot)
179
+ )
180
+
181
+ sns.scatterplot(
182
+ data=df_plot,
183
+ x="actual_jitter",
184
+ y="predicted_jitter",
185
+ hue="strategy",
186
+ style="strategy",
187
+ alpha=0.7,
188
+ )
189
+
190
+ limit = max(df["actual_classes"].max(), df["predicted_classes"].max()) + 1
191
+ plt.plot([0, limit], [0, limit], "r--", alpha=0.5, label="Perfect Prediction")
192
+
193
+ plt.title("Actual vs Predicted Number of Classes")
194
+ plt.xlabel("Actual Classes Needed")
195
+ plt.ylabel("Predicted Classes Needed")
196
+ plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
197
+ plt.tight_layout()
198
+ plt.savefig(
199
+ f"{self.config.output.OUTPUT_DIR}/backtest_classes_scatter.png", dpi=150
200
+ )
201
+ plt.close()
202
+
203
+ def generate_class_capacity_report(self, results: pd.DataFrame) -> pd.DataFrame:
204
+ if results.empty:
205
+ return pd.DataFrame()
206
+
207
+ course_summary = (
208
+ results.groupby("kode_mk")
209
+ .agg(
210
+ {
211
+ "actual": ["mean", "sum", "count"],
212
+ "predicted": ["mean", "sum"],
213
+ "actual_classes": ["mean", "sum"],
214
+ "predicted_classes": ["mean", "sum"],
215
+ "class_error": ["mean", "sum"],
216
+ }
217
+ )
218
+ .round(2)
219
+ )
220
+
221
+ course_summary.columns = [
222
+ "avg_actual_enrollment",
223
+ "total_actual_enrollment",
224
+ "n_semesters",
225
+ "avg_predicted_enrollment",
226
+ "total_predicted_enrollment",
227
+ "avg_actual_classes",
228
+ "total_actual_classes",
229
+ "avg_predicted_classes",
230
+ "total_predicted_classes",
231
+ "avg_class_error",
232
+ "total_class_error",
233
+ ]
234
+
235
+ course_summary = course_summary.reset_index()
236
+ course_summary = course_summary.sort_values(
237
+ "total_class_error", ascending=False
238
+ )
239
+
240
+ return course_summary
241
+
242
+ def analyze_capacity_trends(self, full_data: pd.DataFrame) -> pd.DataFrame:
243
+ class_capacity = self.config.class_capacity.DEFAULT_CLASS_CAPACITY
244
+
245
+ trend_data = full_data.copy()
246
+ trend_data["classes_needed"] = trend_data["enrollment"].apply(
247
+ lambda x: self._calculate_classes(x, class_capacity)
248
+ )
249
+
250
+ course_trends = []
251
+
252
+ for course in trend_data["kode_mk"].unique():
253
+ course_data = trend_data[trend_data["kode_mk"] == course].sort_values(
254
+ ["thn", "smt"]
255
+ )
256
+
257
+ if len(course_data) < 2:
258
+ continue
259
+
260
+ first_year = course_data.iloc[0]
261
+ last_year = course_data.iloc[-1]
262
+
263
+ enrollment_growth = last_year["enrollment"] - first_year["enrollment"]
264
+ class_growth = last_year["classes_needed"] - first_year["classes_needed"]
265
+
266
+ years_diff = last_year["thn"] - first_year["thn"]
267
+ if years_diff > 0 and first_year["enrollment"] > 0:
268
+ annual_growth_rate = (
269
+ (last_year["enrollment"] / first_year["enrollment"])
270
+ ** (1 / years_diff)
271
+ - 1
272
+ ) * 100
273
+ else:
274
+ annual_growth_rate = 0
275
+
276
+ course_trends.append(
277
+ {
278
+ "kode_mk": course,
279
+ "first_enrollment": first_year["enrollment"],
280
+ "last_enrollment": last_year["enrollment"],
281
+ "enrollment_growth": enrollment_growth,
282
+ "first_classes": first_year["classes_needed"],
283
+ "last_classes": last_year["classes_needed"],
284
+ "class_growth": class_growth,
285
+ "annual_growth_rate": round(annual_growth_rate, 1),
286
+ "data_points": len(course_data),
287
+ "year_range": f"{int(first_year['thn'])}-{int(last_year['thn'])}",
288
+ }
289
+ )
290
+
291
+ trends_df = pd.DataFrame(course_trends)
292
+
293
+ if not trends_df.empty:
294
+ trends_df = trends_df.sort_values("annual_growth_rate", ascending=False)
295
+
296
+ return trends_df
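The class-count metrics added to `evaluator.py` (class MAE, exact match rate, and within ±1 class rate) can be illustrated on a small table. This is a sketch under stated assumptions: `CAPACITY = 40` and `MIN_TO_OPEN = 10` are hypothetical stand-ins for `DEFAULT_CLASS_CAPACITY` and `MIN_STUDENTS_TO_OPEN_CLASS`, whose real values live in `config.py` and are not shown in this part of the diff. The sample enrollments are made up.

```python
import numpy as np
import pandas as pd

CAPACITY = 40     # assumed stand-in for DEFAULT_CLASS_CAPACITY
MIN_TO_OPEN = 10  # assumed stand-in for MIN_STUDENTS_TO_OPEN_CLASS


def classes_needed(enrollment: float) -> int:
    """Same rule as Evaluator._calculate_classes: 0 below the minimum, else ceil."""
    if enrollment < MIN_TO_OPEN:
        return 0
    return int(np.ceil(enrollment / CAPACITY))


# Made-up backtest rows: actual vs predicted enrollment for four courses.
results = pd.DataFrame(
    {"actual": [35, 85, 8, 120], "predicted": [42.0, 78.5, 12.0, 119.0]}
)
results["actual_classes"] = results["actual"].apply(classes_needed)
results["predicted_classes"] = results["predicted"].apply(classes_needed)
results["class_error"] = (
    results["predicted_classes"] - results["actual_classes"]
).abs()

class_mae = results["class_error"].mean()
exact_pct = (results["class_error"] == 0).mean() * 100
within_1_pct = (results["class_error"] <= 1).mean() * 100

print(class_mae, exact_pct, within_1_pct)  # 0.75 25.0 100.0
```

Enrollment error and class error can tell different stories: the first row is off by only 7 students yet crosses a capacity boundary and costs a whole extra class, which is presumably why the commit reports both kinds of metric.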
prophet_predictor.py CHANGED
@@ -1,5 +1,5 @@
1
  import logging
2
- from typing import Optional
3
 
4
  import numpy as np
5
  import pandas as pd
@@ -24,8 +24,13 @@ class ProphetPredictor:
24
  )
25
  df["y"] = df["jumlah_aktif"]
26
 
27
- self.student_model = Prophet(daily_seasonality=False, weekly_seasonality=False) # type: ignore[arg-type]
28
- self.student_model.fit(df)
 
 
 
 
 
29
  logger.info("Student population model trained.")
30
 
31
  def get_student_forecast(self, year: int, semester: int) -> float:
@@ -37,6 +42,19 @@ class ProphetPredictor:
37
  forecast = self.student_model.predict(future)
38
  return max(forecast["yhat"].values[0], 100)
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  def predict_course(
41
  self,
42
  course_code: str,
@@ -46,23 +64,41 @@ class ProphetPredictor:
46
  student_pop: float,
47
  ) -> dict:
48
  hist = df_history[
49
- (df_history["kode_mk"] == course_code) &
50
- (df_history["smt"] == target_smt)
51
  ].sort_values(["thn", "smt"])
52
 
53
- if len(hist) == 0:
 
 
54
  return {
55
  "val": self.config.model.FALLBACK_DEFAULT,
56
  "strategy": "cold_start",
57
  "confidence": "low",
 
 
 
 
 
 
 
 
58
  }
59
 
60
- return self._predict_prophet_logistic(
61
- hist, target_year, target_smt, student_pop
 
 
 
 
 
 
 
62
  )
63
 
64
- def _predict_prophet_logistic(
65
- self, hist: pd.DataFrame, year: int, smt: int, pop: float
 
 
66
  ) -> dict:
67
  df = hist.copy()
68
  df["ds"] = pd.to_datetime(
@@ -89,14 +125,20 @@ class ProphetPredictor:
89
  "confidence": "low",
90
  }
91
 
92
- hist_max = df["y"].max()
93
- hist_mean = df["y"].mean()
 
 
94
 
95
  cap_value = min(
96
  hist_max * self.config.prediction.MAX_CAPACITY_MULTIPLIER,
97
  self.config.prediction.ABSOLUTE_MAX_STUDENTS,
98
  )
99
 
 
 
 
 
100
  df["cap"] = cap_value
101
  df["floor"] = 0
102
 
@@ -109,8 +151,11 @@ class ProphetPredictor:
109
  weekly_seasonality=False, # type: ignore[arg-type]
110
  )
111
 
112
- m.add_regressor("jumlah_aktif", mode="multiplicative")
113
- m.fit(df[["ds", "y", "cap", "floor", "jumlah_aktif"]])
 
 
 
114
 
115
  future_date = pd.to_datetime(
116
  f"{year}-{self.config.prediction.SEMESTER_TO_MONTH[smt]}"
@@ -121,10 +166,12 @@ class ProphetPredictor:
121
  "ds": [future_date],
122
  "cap": [cap_value],
123
  "floor": [0],
124
- "jumlah_aktif": [pop],
125
  }
126
  )
127
 
 
 
 
128
  forecast = m.predict(future)
129
  raw_pred = forecast["yhat"].values[0]
130
 
@@ -135,18 +182,17 @@ class ProphetPredictor:
135
  or raw_pred > cap_value * 2
136
  ):
137
  logger.warning(
138
- f"Prophet prediction ({raw_pred:.1f}) unrealistic. "
139
  f"Using trend-based fallback. (hist_max={hist_max}, cap={cap_value})"
140
  )
 
141
  if len(df) >= 3:
142
- recent_trend = df["y"].tail(3).mean()
143
- pop_growth_factor = pop / df["jumlah_aktif"].mean()
144
- growth_factor = min(
145
- max(pop_growth_factor, 0.8), 1.3
146
- )
147
  pred = recent_trend * growth_factor
148
  else:
149
- pop_growth_factor = pop / df["jumlah_aktif"].mean()
150
  pred = hist_mean * min(max(pop_growth_factor, 0.8), 1.3)
151
 
152
  pred = min(max(pred, 0), cap_value)
@@ -166,13 +212,37 @@ class ProphetPredictor:
166
  }
167
 
168
  except Exception as e:
169
- logger.warning(f"Prophet failed for course. Error: {e}. Using fallback.")
 
 
170
  return {
171
  "val": hist["enrollment"].mean(),
172
  "strategy": "fallback_mean",
173
  "confidence": "medium",
174
  }
175
 
176
  def generate_batch_predictions(
177
  self,
178
  full_data: pd.DataFrame,
@@ -180,8 +250,7 @@ class ProphetPredictor:
180
  electives: set,
181
  year: int,
182
  smt: int,
183
- ):
184
- """Generate predictions for all courses."""
185
  student_pop = self.get_student_forecast(year, smt)
186
  results = []
187
 
@@ -190,35 +259,57 @@ class ProphetPredictor:
190
  )
191
 
192
  for code in electives:
193
- meta = course_metadata[course_metadata["kode_mk"] == code].iloc[0]
 
 
 
 
194
 
195
  pred_result = self.predict_course(code, full_data, year, smt, student_pop)
196
  pred_val = pred_result["val"]
197
-
198
- rec_quota = int(
199
- np.ceil(pred_val * (1 + self.config.prediction.BUFFER_PERCENT))
 
 
 
 
 
200
  )
201
- rec_quota = max(rec_quota, self.config.prediction.MIN_QUOTA_OPEN)
202
 
203
- status = (
204
- "BUKA"
205
- if pred_val >= self.config.prediction.MIN_PREDICT_THRESHOLD
206
- else "TUTUP"
 
 
 
 
 
 
207
  )
 
 
 
 
 
 
 
208
 
209
  results.append(
210
  {
211
  "kode_mk": code,
212
  "nama_mk": meta["nama_mk"],
213
- "sks": meta["sks_mk"],
214
  "predicted_enrollment": round(pred_val, 1),
215
- "recommended_quota": rec_quota if status == "BUKA" else 0,
 
 
 
216
  "recommendation": status,
 
217
  "strategy": pred_result["strategy"],
218
  "confidence": pred_result["confidence"],
219
- "classes_est": int(np.ceil(rec_quota / 40))
220
- if status == "BUKA"
221
- else 0,
222
  }
223
  )
224
 
@@ -226,6 +317,98 @@ class ProphetPredictor:
226
  "predicted_enrollment", ascending=False
227
  )
228
 
229
  def predict_course_enrollment(
230
  self,
231
  course_code: str,
@@ -233,7 +416,7 @@ class ProphetPredictor:
233
  test_year: int,
234
  test_semester: int,
235
  test_student_count: float,
236
- ) -> tuple[float, str]:
237
  result = self.predict_course(
238
  course_code=course_code,
239
  df_history=train_data,
 
1
  import logging
2
+ from typing import Dict, List, Optional, Tuple
3
 
4
  import numpy as np
5
  import pandas as pd
 
24
  )
25
  df["y"] = df["jumlah_aktif"]
26
 
27
+ self.student_model = Prophet(
28
+ growth="linear",
29
+ daily_seasonality=False, # type: ignore[arg-type]
30
+ weekly_seasonality=False, # type: ignore[arg-type]
31
+ yearly_seasonality=True, # type: ignore[arg-type]
32
+ )
33
+ self.student_model.fit(df[["ds", "y"]])
34
  logger.info("Student population model trained.")
35
 
36
  def get_student_forecast(self, year: int, semester: int) -> float:
 
42
  forecast = self.student_model.predict(future)
43
  return max(forecast["yhat"].values[0], 100)
44
 
45
+ def get_multi_year_student_forecast(
46
+ self, start_year: int, semester: int, years_ahead: int
47
+ ) -> List[Tuple[int, float]]:
48
+ assert self.student_model is not None, "Student model must be trained first"
49
+
50
+ forecasts = []
51
+ for i in range(years_ahead + 1):
52
+ year = start_year + i
53
+ pop = self.get_student_forecast(year, semester)
54
+ forecasts.append((year, pop))
55
+
56
+ return forecasts
57
+
58
  def predict_course(
59
  self,
60
  course_code: str,
 
64
  student_pop: float,
65
  ) -> dict:
66
  hist = df_history[
67
+ (df_history["kode_mk"] == course_code) & (df_history["smt"] == target_smt)
 
68
  ].sort_values(["thn", "smt"])
69
 
70
+ has_historical_data = len(hist) > 0
71
+
72
+ if not has_historical_data:
73
  return {
74
  "val": self.config.model.FALLBACK_DEFAULT,
75
  "strategy": "cold_start",
76
  "confidence": "low",
77
+ "classes_needed": self.config.calculate_classes_needed(
78
+ self.config.model.FALLBACK_DEFAULT,
79
+ course_code,
80
+ has_historical_data=False,
81
+ ),
82
+ "capacity_status": self.config.get_capacity_status(
83
+ self.config.model.FALLBACK_DEFAULT, course_code
84
+ ),
85
  }
86
 
87
+ result = self._predict_prophet_with_capacity(
88
+ hist, target_year, target_smt, student_pop, course_code
89
+ )
90
+
91
+ result["classes_needed"] = self.config.calculate_classes_needed(
92
+ result["val"], course_code, has_historical_data=has_historical_data
93
+ )
94
+ result["capacity_status"] = self.config.get_capacity_status(
95
+ result["val"], course_code
96
  )
97
 
98
+ return result
99
+
100
+ def _predict_prophet_with_capacity(
101
+ self, hist: pd.DataFrame, year: int, smt: int, pop: float, course_code: str
102
  ) -> dict:
103
  df = hist.copy()
104
  df["ds"] = pd.to_datetime(
 
125
  "confidence": "low",
126
  }
127
 
128
+ hist_max = float(df["y"].max())
129
+ hist_mean = float(df["y"].mean())
130
+
131
+ class_capacity = self.config.get_class_capacity(course_code)
132
 
133
  cap_value = min(
134
  hist_max * self.config.prediction.MAX_CAPACITY_MULTIPLIER,
135
  self.config.prediction.ABSOLUTE_MAX_STUDENTS,
136
  )
137
 
138
+ if self.config.class_capacity.ENABLE_CAPACITY_CONSTRAINTS:
139
+ max_realistic_cap = class_capacity * 4
140
+ cap_value = min(cap_value, max_realistic_cap)
141
+
142
  df["cap"] = cap_value
143
  df["floor"] = 0
144
 
 
151
  weekly_seasonality=False, # type: ignore[arg-type]
152
  )
153
 
154
+ if self.config.model.USE_POPULATION_REGRESSOR:
155
+ m.add_regressor("jumlah_aktif", mode="multiplicative")
156
+ m.fit(df[["ds", "y", "cap", "floor", "jumlah_aktif"]])
157
+ else:
158
+ m.fit(df[["ds", "y", "cap", "floor"]])
159
 
160
  future_date = pd.to_datetime(
161
  f"{year}-{self.config.prediction.SEMESTER_TO_MONTH[smt]}"
 
166
  "ds": [future_date],
167
  "cap": [cap_value],
168
  "floor": [0],
 
169
  }
170
  )
171
 
172
+ if self.config.model.USE_POPULATION_REGRESSOR:
173
+ future["jumlah_aktif"] = pop
174
+
175
  forecast = m.predict(future)
176
  raw_pred = forecast["yhat"].values[0]
177
 
 
182
  or raw_pred > cap_value * 2
183
  ):
184
  logger.warning(
185
+ f"Prophet prediction ({raw_pred:.1f}) unrealistic for {course_code}. "
186
  f"Using trend-based fallback. (hist_max={hist_max}, cap={cap_value})"
187
  )
188
+ pop_mean = float(df["jumlah_aktif"].mean())
189
  if len(df) >= 3:
190
+ recent_trend = float(df["y"].tail(3).mean())
191
+ pop_growth_factor = pop / pop_mean if pop_mean > 0 else 1.0
192
+ growth_factor = min(max(pop_growth_factor, 0.8), 1.3)
193
  pred = recent_trend * growth_factor
194
  else:
195
+ pop_growth_factor = pop / pop_mean if pop_mean > 0 else 1.0
196
  pred = hist_mean * min(max(pop_growth_factor, 0.8), 1.3)
197
 
198
  pred = min(max(pred, 0), cap_value)
 
212
  }
213
 
214
  except Exception as e:
215
+ logger.warning(
216
+ f"Prophet failed for course {course_code}. Error: {e}. Using fallback."
217
+ )
218
  return {
219
  "val": hist["enrollment"].mean(),
220
  "strategy": "fallback_mean",
221
  "confidence": "medium",
222
  }
223
 
224
+ def predict_multi_year(
225
+ self,
226
+ course_code: str,
227
+ df_history: pd.DataFrame,
228
+ start_year: int,
229
+ target_smt: int,
230
+ years_ahead: int = 3,
231
+ ) -> List[Dict]:
232
+ predictions = []
233
+
234
+ for i in range(years_ahead + 1):
235
+ year = start_year + i
236
+ pop = self.get_student_forecast(year, target_smt)
237
+
238
+ pred = self.predict_course(course_code, df_history, year, target_smt, pop)
239
+ pred["year"] = year
240
+ pred["semester"] = target_smt
241
+ pred["student_population"] = pop
242
+ predictions.append(pred)
243
+
244
+ return predictions
245
+
246
  def generate_batch_predictions(
247
  self,
248
  full_data: pd.DataFrame,
 
250
  electives: set,
251
  year: int,
252
  smt: int,
253
+ ) -> pd.DataFrame:
 
254
  student_pop = self.get_student_forecast(year, smt)
255
  results = []
256
 
 
259
  )
260
 
261
  for code in electives:
262
+ meta_rows = course_metadata[course_metadata["kode_mk"] == code]
263
+ if len(meta_rows) == 0:
264
+ logger.warning(f"No metadata found for course {code}, skipping")
265
+ continue
266
+ meta = meta_rows.iloc[0]
267
 
268
  pred_result = self.predict_course(code, full_data, year, smt, student_pop)
269
  pred_val = pred_result["val"]
270
+ course_history = full_data[full_data["kode_mk"] == code]
271
+ has_history = len(course_history) > 0
272
+
273
+ classes_needed = pred_result.get(
274
+ "classes_needed",
275
+ self.config.calculate_classes_needed(
276
+ pred_val, code, has_historical_data=has_history
277
+ ),
278
  )
 
279
 
280
+ course_capacity = self.config.get_class_capacity(code)
281
+
282
+ if classes_needed > 0:
283
+ rec_quota = classes_needed * course_capacity
284
+ else:
285
+ rec_quota = 0
286
+
287
+ min_threshold = self.config.class_capacity.MIN_STUDENTS_TO_OPEN_CLASS
288
+ should_open = pred_val >= min_threshold or (
289
+ has_history and self.config.class_capacity.OPEN_CLASS_IF_HAS_HISTORY
290
  )
291
+ status = "BUKA" if should_open else "TUTUP"
292
+
293
+ if classes_needed > 0:
294
+ total_capacity = classes_needed * course_capacity
295
+ utilization = (pred_val / total_capacity) * 100
296
+ else:
297
+ utilization = 0
298
 
299
  results.append(
300
  {
301
  "kode_mk": code,
302
  "nama_mk": meta["nama_mk"],
303
+ "sks": meta.get("sks_mk", 0),
304
  "predicted_enrollment": round(pred_val, 1),
305
+ "class_capacity": course_capacity,
306
+ "classes_needed": classes_needed,
307
+ "total_quota": rec_quota,
308
+ "utilization_pct": round(utilization, 1),
309
  "recommendation": status,
310
+ "capacity_status": pred_result.get("capacity_status", "NORMAL"),
311
  "strategy": pred_result["strategy"],
312
  "confidence": pred_result["confidence"],
313
  }
314
  )
315
 
 
317
  "predicted_enrollment", ascending=False
318
  )
319
 
320
+ def generate_multi_year_forecast(
321
+ self,
322
+ full_data: pd.DataFrame,
323
+ course_metadata: pd.DataFrame,
324
+ electives: set,
325
+ start_year: int,
326
+ smt: int,
327
+ years_ahead: int = 3,
328
+ ) -> pd.DataFrame:
329
+ all_results = []
330
+
331
+ for code in electives:
332
+ meta_rows = course_metadata[course_metadata["kode_mk"] == code]
333
+ if len(meta_rows) == 0:
334
+ continue
335
+ meta = meta_rows.iloc[0]
336
+
337
+ year_predictions = self.predict_multi_year(
338
+ code, full_data, start_year, smt, years_ahead
339
+ )
340
+
341
+ for pred in year_predictions:
342
+ course_capacity = self.config.get_class_capacity(code)
343
+ classes_needed = pred.get("classes_needed", 0)
344
+
345
+ all_results.append(
346
+ {
347
+ "kode_mk": code,
348
+ "nama_mk": meta["nama_mk"],
349
+ "year": pred["year"],
350
+ "semester": pred["semester"],
351
+ "predicted_enrollment": round(pred["val"], 1),
352
+ "classes_needed": classes_needed,
353
+ "total_capacity": classes_needed * course_capacity,
354
+ "student_population": round(pred["student_population"], 0),
355
+ "strategy": pred["strategy"],
356
+ "confidence": pred["confidence"],
357
+ }
358
+ )
359
+
360
+ return pd.DataFrame(all_results).sort_values(["kode_mk", "year"])
361
+
362
+ def get_course_trend_analysis(
363
+ self,
364
+ course_code: str,
365
+ df_history: pd.DataFrame,
366
+ target_smt: int,
367
+ ) -> Dict:
368
+ hist = df_history[
369
+ (df_history["kode_mk"] == course_code) & (df_history["smt"] == target_smt)
370
+ ].sort_values("thn")
371
+
372
+ if len(hist) < 2:
373
+ return {
374
+ "has_sufficient_data": False,
375
+ "data_points": len(hist),
376
+ }
377
+
378
+ enrollments = np.array(hist["enrollment"].values, dtype=float)
379
+ years = np.array(hist["thn"].values, dtype=float)
380
+
381
+ growth_rates = []
382
+ for i in range(1, len(enrollments)):
383
+ if enrollments[i - 1] > 0:
384
+ rate = (enrollments[i] - enrollments[i - 1]) / enrollments[i - 1]
385
+ growth_rates.append(rate)
386
+
387
+ avg_growth_rate = float(np.mean(growth_rates)) if growth_rates else 0.0
388
+
389
+ if len(years) >= 2:
390
+ coeffs = np.polyfit(years, enrollments, 1)
391
+ trend_slope = float(coeffs[0])
392
+ else:
393
+ trend_slope = 0.0
394
+
395
+ return {
396
+ "has_sufficient_data": True,
397
+ "data_points": len(hist),
398
+ "min_enrollment": int(enrollments.min()),
399
+ "max_enrollment": int(enrollments.max()),
400
+ "avg_enrollment": round(float(enrollments.mean()), 1),
401
+ "latest_enrollment": int(enrollments[-1]),
402
+ "avg_growth_rate": round(avg_growth_rate * 100, 1), # as percentage
403
+ "trend_slope": round(trend_slope, 2),
404
+ "trend_direction": "increasing"
405
+ if trend_slope > 0
406
+ else "decreasing"
407
+ if trend_slope < 0
408
+ else "stable",
409
+ "year_range": f"{int(years.min())}-{int(years.max())}",
410
+ }
411
+
412
  def predict_course_enrollment(
413
  self,
414
  course_code: str,
 
416
  test_year: int,
417
  test_semester: int,
418
  test_student_count: float,
419
+ ) -> tuple:
420
  result = self.predict_course(
421
  course_code=course_code,
422
  df_history=train_data,
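
Note on the fallback path in `_predict_prophet_with_capacity`: when the Prophet output is judged unrealistic, the revised code scales a recent enrollment mean by a population ratio clamped to the range 0.8 to 1.3, then clips the result to `[0, cap_value]`. The sketch below restates that arithmetic in isolation. The function name `trend_fallback` and the use of plain lists instead of a DataFrame are assumptions made for illustration; only the clamp bounds and the three-semester window are taken from the diff.

```python
from typing import Sequence


def trend_fallback(
    enrollments: Sequence[float],
    pop_history: Sequence[float],
    pop: float,
    cap_value: float,
    lo: float = 0.8,
    hi: float = 1.3,
) -> float:
    """Hypothetical standalone version of the trend-based fallback.

    enrollments: historical enrollment per semester, oldest first (non-empty).
    pop_history: active-student counts for the same semesters.
    pop: forecast student population for the target semester.
    """
    # Mean historical population; guard against an empty or zero history.
    pop_mean = sum(pop_history) / len(pop_history) if pop_history else 0.0
    ratio = pop / pop_mean if pop_mean > 0 else 1.0

    # Clamp the growth factor so population swings cannot dominate.
    growth = min(max(ratio, lo), hi)

    # With three or more points use the last three, otherwise the full mean.
    if len(enrollments) >= 3:
        base = sum(enrollments[-3:]) / 3
    else:
        base = sum(enrollments) / len(enrollments)

    # Clip to the same [0, cap] range the diff applies.
    return min(max(base * growth, 0.0), cap_value)


if __name__ == "__main__":
    print(trend_fallback([10, 20, 30, 40], [1000] * 4, pop=1500, cap_value=100))
```

In the usage line, the population ratio of 1.5 is clamped to 1.3 before it multiplies the three-semester mean of 30, so a single large jump in forecast population cannot inflate the estimate by more than 30 percent.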
ui_components.py ADDED
@@ -0,0 +1,322 @@
1
+ from typing import Dict
2
+
3
+
4
+ def get_color(value: float, thresholds: tuple = (50, 25)) -> str:
5
+ high, low = thresholds
6
+ if value >= high:
7
+ return "#4ade80"
8
+ elif value >= low:
9
+ return "#fb923c"
10
+ else:
11
+ return "#f87171"
12
+
13
+
14
+ def get_diff_color(value: float) -> str:
15
+ return "#4ade80" if value >= 0 else "#f87171"
16
+
17
+
18
+ # Card Components
19
+ def metric_card(title: str, value: str, color: str, subtitle: str = "") -> str:
20
+ subtitle_html = (
21
+ f'<div style="font-size: 11px; color: #9ca3af; margin-top: 4px;">{subtitle}</div>'
22
+ if subtitle
23
+ else ""
24
+ )
25
+ return f"""
26
+ <div style="background: #1e293b; padding: 20px; border-radius: 12px; border-left: 4px solid {color};">
27
+ <div style="font-size: 12px; color: #9ca3af; text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px;">{title}</div>
28
+ <div style="font-size: 28px; font-weight: 700; color: {color};">{value}</div>
29
+ {subtitle_html}
30
+ </div>
31
+ """
32
+
33
+
34
+ def info_row(label: str, value: str, color: str = "#fff", border: bool = True) -> str:
35
+ border_style = "border-bottom: 1px solid #334155;" if border else ""
36
+ return f"""
37
+ <div style="display: flex; justify-content: space-between; padding: 12px 0; {border_style}">
38
+ <span style="color: #9ca3af;">{label}</span>
39
+ <span style="font-weight: 600; color: {color};">{value}</span>
40
+ </div>
41
+ """
42
+
43
+
44
+ def info_card(title: str, rows: list) -> str:
45
+ rows_html = "".join(rows)
46
+ return f"""
47
+ <div style="background: #1e293b; padding: 20px; border-radius: 12px;">
48
+ <h4 style="margin: 0 0 16px 0; color: #fff; font-size: 14px; font-weight: 600;">{title}</h4>
49
+ {rows_html}
50
+ </div>
51
+ """
52
+
53
+
54
+ # Summary Templates
55
+ def build_validation_summary(data: Dict) -> str:
56
+ """Build summary HTML for validation mode (when actual data exists)."""
57
+ year = data["year"]
58
+ semester_name = data["semester_name"]
59
+ class_capacity = data["class_capacity"]
60
+ data_source = data.get("data_source", "kalkulasi")
61
+
62
+ # Metrics
63
+ class_accuracy_pct = data.get("class_accuracy_pct", 0)
64
+ class_within_one_pct = data.get("class_within_one_pct", 0)
65
+ total_classes = data.get("total_classes", 0)
66
+ comparison_mae = data.get("comparison_mae", 0)
67
+ comparison_rmse = data.get("comparison_rmse", 0)
68
+ total_for_class_accuracy = data.get("total_for_class_accuracy", 0)
69
+
70
+ # Enrollment metrics
71
+ total_actual = data.get("total_actual", 0)
72
+ total_predicted = data.get("total_predicted", 0)
73
+ accuracy_pct = data.get("accuracy_pct", 0)
74
+ class_matches = data.get("class_matches", 0)
75
+ class_within_one = data.get("class_within_one", 0)
76
+
77
+ # Colors
78
+ class_accuracy_color = get_color(class_accuracy_pct)
79
+ diff_color = get_diff_color(total_predicted - total_actual)
80
+
81
+ return f"""
82
+ <div style="padding: 24px;">
83
+ <div style="margin-bottom: 24px;">
84
+ <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{
85
+ year
86
+ } Semester {semester_name}</h2>
87
+ <p style="color: #9ca3af; margin: 0; font-size: 14px;">Validasi prediksi terhadap data aktual | Kapasitas per kelas: {
88
+ class_capacity
89
+ } mahasiswa | Sumber kelas aktual: {data_source}</p>
90
+ </div>
91
+
92
+ <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; margin-bottom: 24px;">
93
+ {
94
+ metric_card(
95
+ "Akurasi Kelas",
96
+ f"{class_accuracy_pct:.1f}%",
97
+ class_accuracy_color,
98
+ f"±1 kelas: {class_within_one_pct:.1f}%",
99
+ )
100
+ }
101
+ {metric_card("Total Kelas Prediksi", str(total_classes), "#60a5fa")}
102
+ {
103
+ metric_card(
104
+ "MAE / RMSE", f"{comparison_mae:.1f} / {comparison_rmse:.1f}", "#a78bfa"
105
+ )
106
+ }
107
+ {metric_card("MK Divalidasi", str(total_for_class_accuracy), "#fb923c")}
108
+ </div>
109
+
110
+ <div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 16px;">
111
+ {
112
+ info_card(
113
+ "Ringkasan Enrollment",
114
+ [
115
+ info_row("Total Aktual", str(int(total_actual))),
116
+ info_row("Total Prediksi", str(int(total_predicted))),
117
+ info_row(
118
+ "Selisih",
119
+ f"{int(total_predicted - total_actual):+d}",
120
+ diff_color,
121
+ border=False,
122
+ ),
123
+ ],
124
+ )
125
+ }
126
+ {
127
+ info_card(
128
+ f"Akurasi Prediksi Kelas (dari {data_source})",
129
+ [
130
+ info_row(
131
+ "Kelas Tepat",
132
+ f"{class_matches}/{total_for_class_accuracy}",
133
+ "#4ade80",
134
+ ),
135
+ info_row(
136
+ "Selisih ±1 Kelas",
137
+ f"{class_within_one}/{total_for_class_accuracy}",
138
+ "#60a5fa",
139
+ ),
140
+ info_row("Akurasi Enrollment", f"{accuracy_pct:.1f}%", border=False),
141
+ ],
142
+ )
143
+ }
144
+ </div>
145
+ </div>
146
+ """
147
+
148
+
149
+ def build_no_match_summary(data: Dict) -> str:
150
+ year = data["year"]
151
+ semester_name = data["semester_name"]
152
+ metrics = data.get("metrics", {"mae": 0, "rmse": 0})
153
+ total_to_open = data.get("total_to_open", 0)
154
+ total_classes = data.get("total_classes", 0)
155
+
156
+ return f"""
157
+ <div style="padding: 24px;">
158
+ <div style="margin-bottom: 24px;">
159
+ <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{year} Semester {semester_name}</h2>
160
+ <p style="color: #9ca3af; margin: 0; font-size: 14px;">Data semester ada, tetapi tidak ditemukan MK pilihan yang cocok</p>
161
+ </div>
162
+
163
+ <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px;">
164
+ {metric_card("MAE (Backtest)", f"{metrics['mae']:.2f}", "#60a5fa")}
165
+ {metric_card("RMSE (Backtest)", f"{metrics['rmse']:.2f}", "#a78bfa")}
166
+ {metric_card("MK Dibuka", str(total_to_open), "#4ade80")}
167
+ {metric_card("Total Kelas", str(total_classes), "#fb923c")}
168
+ </div>
169
+ </div>
170
+ """
171
+
172
+
173
+ def build_future_prediction_summary(data: Dict) -> str:
174
+ year = data["year"]
175
+ semester_name = data["semester_name"]
176
+ class_capacity = data["class_capacity"]
177
+ metrics = data.get("metrics", {"mae": 0, "rmse": 0})
178
+
179
+ total_to_open = data.get("total_to_open", 0)
180
+ total_classes = data.get("total_classes", 0)
181
+ total_predicted_students = data.get("total_predicted_students", 0)
182
+ total_capacity = data.get("total_capacity", 0)
183
+
184
+ avg_utilization = (
185
+ (total_predicted_students / total_capacity * 100) if total_capacity > 0 else 0
186
+ )
187
+
188
+ return f"""
189
+ <div style="padding: 24px;">
190
+ <div style="margin-bottom: 24px;">
191
+ <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">{
192
+ year
193
+ } Semester {semester_name}</h2>
194
+ <p style="color: #9ca3af; margin: 0; font-size: 14px;">Prediksi masa depan berdasarkan tren historis | Kapasitas per kelas: {
195
+ class_capacity
196
+ } mahasiswa</p>
197
+ </div>
198
+
199
+ <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; margin-bottom: 24px;">
200
+ {metric_card("MK Dibuka", str(total_to_open), "#4ade80")}
201
+ {metric_card("Total Kelas Dibuka", str(total_classes), "#60a5fa")}
202
+ {metric_card("Prediksi Mahasiswa", str(total_predicted_students), "#a78bfa")}
203
+ {metric_card("Total Kuota", str(total_capacity), "#fb923c")}
204
+ </div>
205
+
206
+ <div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 16px;">
207
+ {
208
+ info_card(
209
+ "Backtest Metrics",
210
+ [
211
+ info_row("MAE", f"{metrics['mae']:.2f}"),
212
+ info_row("RMSE", f"{metrics['rmse']:.2f}", border=False),
213
+ ],
214
+ )
215
+ }
216
+ {
217
+ info_card(
218
+ "Kapasitas Info",
219
+ [
220
+ info_row("Kapasitas/Kelas", f"{class_capacity} mhs"),
221
+ info_row("Avg Utilization", f"{avg_utilization:.1f}%", border=False),
222
+ ],
223
+ )
224
+ }
225
+ </div>
226
+ </div>
227
+ """
228
+
229
+
230
+ def build_prediction_summary(data: Dict) -> str:
231
+ has_actual_data = data.get("has_actual_data", False)
232
+
233
+ if has_actual_data:
234
+ if "comparison_mae" in data:
235
+ return build_validation_summary(data)
236
+ else:
237
+ return build_no_match_summary(data)
238
+ else:
239
+ return build_future_prediction_summary(data)
240
+
241
+
242
+ def build_multi_year_summary(data: Dict) -> str:
243
+ year = data["year"]
244
+ years_ahead = data["years_ahead"]
245
+ semester_name = data["semester_name"]
246
+ class_capacity = data["class_capacity"]
247
+
248
+ first_year_classes = data["first_year_classes"]
249
+ last_year_classes = data["last_year_classes"]
250
+ growth_classes = data["growth_classes"]
251
+ growth_students = data["growth_students"]
252
+
253
+ growth_class_color = get_diff_color(growth_classes)
254
+ growth_student_color = get_diff_color(growth_students)
255
+
256
+ return f"""
257
+ <div style="padding: 24px;">
258
+ <div style="margin-bottom: 24px;">
259
+ <h2 style="margin: 0 0 8px 0; color: #fff; font-size: 24px; font-weight: 600;">Proyeksi {years_ahead} Tahun ke Depan - Semester {semester_name}</h2>
260
+ <p style="color: #9ca3af; margin: 0; font-size: 14px;">Forecasting kebutuhan kelas {year} - {year + years_ahead} | Kapasitas per kelas: {class_capacity} mahasiswa</p>
261
+ </div>
262
+
263
+ <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; margin-bottom: 24px;">
264
+ {metric_card(f"Kelas ({year})", str(first_year_classes), "#4ade80")}
265
+ {metric_card(f"Kelas ({year + years_ahead})", str(last_year_classes), "#60a5fa")}
266
+ {metric_card("Pertumbuhan Kelas", f"{growth_classes:+d}", growth_class_color)}
267
+ {metric_card("Pertumbuhan Mhs", f"{growth_students:+d}", growth_student_color)}
268
+ </div>
269
+ </div>
270
+ """
271
+
272
+
273
+ # Placeholder Templates
274
+ def placeholder_card(title: str, subtitle: str) -> str:
275
+ return f"""
276
+ <div style="padding: 60px 40px; text-align: center; background: #1e293b; border-radius: 12px;">
277
+ <h3 style="color: #fff; margin: 0 0 8px 0; font-size: 18px; font-weight: 600;">{title}</h3>
278
+ <p style="color: #9ca3af; margin: 0; font-size: 14px;">{subtitle}</p>
279
+ </div>
280
+ """
281
+
282
+
283
+ def get_prediction_placeholder() -> str:
284
+ return placeholder_card(
285
+ "Pilih tahun dan semester",
286
+ "Klik Generate Predictions untuk melihat rekomendasi jumlah kelas",
287
+ )
288
+
289
+
290
+ def get_forecast_placeholder() -> str:
291
+ return placeholder_card(
292
+ "Proyeksi Multi-Tahun", "Lihat tren kebutuhan kelas beberapa tahun ke depan"
293
+ )
294
+
295
+
296
+ # Data Info Component
297
+ def build_data_info(data: Dict) -> str:
298
+ if "error" in data:
299
+ return f"<p style='color: #f87171;'>{data['error']}</p>"
300
+
301
+ return f"""
302
+ <div style="padding: 8px 0;">
303
+ <div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 12px;">
304
+ <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
305
+ <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">Total MK</div>
306
+ <div style="font-size: 20px; font-weight: 700; color: #fff;">{data["total_courses"]}</div>
307
+ </div>
308
+ <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
309
+ <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">MK Pilihan</div>
310
+ <div style="font-size: 20px; font-weight: 700; color: #4ade80;">{data["elective_courses"]}</div>
311
+ </div>
312
+ <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
313
+ <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">Kapasitas/Kelas</div>
314
+ <div style="font-size: 20px; font-weight: 700; color: #60a5fa;">{data["class_capacity"]}</div>
315
+ </div>
316
+ <div style="background: #1e293b; padding: 16px; border-radius: 8px; text-align: center;">
317
+ <div style="font-size: 11px; color: #9ca3af; margin-bottom: 6px;">Tahun Data</div>
318
+ <div style="font-size: 20px; font-weight: 700; color: #fb923c;">{data["year_min"]}-{data["year_max"]}</div>
319
+ </div>
320
+ </div>
321
+ </div>
322
+ """
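
Note on `build_prediction_summary`: the function routes to one of three templates based on two keys in `data`. The sketch below isolates that routing so it can be checked without rendering any HTML. The helper name `pick_summary_template` and the returned labels are hypothetical; the key names `has_actual_data` and `comparison_mae` are the ones used in the diff.

```python
from typing import Dict


def pick_summary_template(data: Dict) -> str:
    """Hypothetical helper mirroring the branch logic of build_prediction_summary."""
    if data.get("has_actual_data", False):
        # Actual data exists: validate only if comparison metrics were computed.
        return "validation" if "comparison_mae" in data else "no_match"
    # No actual data for the target semester: pure forward prediction.
    return "future"


if __name__ == "__main__":
    print(pick_summary_template({"has_actual_data": True, "comparison_mae": 3.2}))
    print(pick_summary_template({"has_actual_data": True}))
    print(pick_summary_template({}))
```

One consequence of this routing is that a payload carrying `comparison_mae` but lacking `has_actual_data` still falls through to the future-prediction template, because the missing flag defaults to `False`.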