# What-if Schedule Predictor

A machine-learning-powered schedule prediction and scenario analysis tool for renovation and construction projects. It ingests real project data, learns delay patterns from completed work, predicts when in-progress activities will finish, and lets a project manager simulate "what if" scenarios interactively through a browser dashboard.

---

## Table of Contents

1. [The Problem This Solves](#the-problem-this-solves)
2. [What This System Does](#what-this-system-does)
3. [How the Data Is Organized](#how-the-data-is-organized)
4. [How Everything Fits Together](#how-everything-fits-together)
5. [Component Deep-Dive](#component-deep-dive)
   - [Data Generation and the Database](#data-generation-and-the-database)
   - [The DataLoader](#the-dataloader)
   - [Feature Engineering](#feature-engineering)
   - [Prediction Method A: Earned Value Extrapolation](#prediction-method-a-earned-value-extrapolation)
   - [Prediction Method B: Gradient Boosting Regressor](#prediction-method-b-gradient-boosting-regressor)
   - [Ensemble Prediction](#ensemble-prediction)
   - [Monte Carlo Simulation](#monte-carlo-simulation)
   - [Activity Dependency Graph (DAG)](#activity-dependency-graph-dag)
   - [Ripple Engine: Delay Propagation](#ripple-engine-delay-propagation)
   - [What-if Scenario Engine](#what-if-scenario-engine)
   - [Schedule Optimizer and Critical Path Method](#schedule-optimizer-and-critical-path-method)
   - [Visualizations](#visualizations)
   - [The Streamlit Dashboard](#the-streamlit-dashboard)
6. [Screenshots](#screenshots)
7. [Project Structure](#project-structure)
8. [Installation](#installation)
9. [Running the System](#running-the-system)
10. [Glossary](#glossary)

---

## The Problem This Solves

When a renovation or construction project is running, things almost never go exactly to plan. Workers get sick, materials arrive late, an inspection fails, or a structural issue is discovered behind a wall. A project manager needs to answer questions like:

- "We are 3 weeks into tiling and only 40% done. When will it actually finish?"
- "If the plumbing rough-in finishes 2 weeks late, what does that do to the rest of the schedule?"
- "If we put extra crew on the electrical work, can we recover the delay from the structural phase?"
- "Which activities, if they slip, will push the entire project end date?"

Traditional tools answer these questions poorly. A spreadsheet can track what has happened but cannot predict what will happen or simulate alternatives. This system was built to fill that gap.

It learns from completed projects how activities tend to behave (how much they deviate from their planned duration, what types of issues cause the most delay, which categories of work run fastest) and then applies that knowledge to in-progress work.

---

## What This System Does

At a high level, the system does six things:

**1. Stores and organizes all project data** in a SQLite database. It can ingest existing CSV files from a data folder and also generate synthetic historical projects to supplement the training dataset.

**2. Predicts completion dates** for every in-progress activity using three independent methods:
- A simple earned-value extrapolation based on current progress rate.
- A trained machine learning model (GradientBoostingRegressor) that has learned from 119 completed activities.
- A Monte Carlo simulation that runs 1,000 random scenarios to produce a probability distribution of completion dates (P50, P80, P90).

**3. Builds a dependency map** of all activities as a graph (called a Directed Acyclic Graph or DAG). This captures the fact that, for example, painting cannot start until plastering finishes.

**4. Propagates delays through the dependency map**. If you tell the system "Site Preparation will be 2 weeks late", it automatically calculates which downstream activities shift, by how much, and what the new project end date will be.

**5. Runs what-if scenarios** so a manager can compare the impact of different decisions: delay an activity, add resources to speed one up, resolve a blocking issue, or run two activities in parallel.

**6. Applies Critical Path Method (CPM) analysis** to find which activities have zero scheduling flexibility (the critical path) and generate rule-based optimization suggestions.

All of this is presented in a browser dashboard built with Streamlit, requiring no coding knowledge to use.

---

## How the Data Is Organized

The system uses six tables, which map directly to six original CSV files:

### projects

Each row is one construction or renovation project.

| Column | What it means |
|---|---|
| id | Unique identifier, e.g. proj_008 |
| name | Human-readable name, e.g. "Lakeview Villa Renovation" |
| planned_start_date | The date the project was supposed to begin |
| planned_end_date | The date the project was supposed to finish |
| actual_start_date | The date it actually began |
| actual_end_date | The date it actually finished (null if still running) |
| type | residential or commercial |
| city | Location |
| status | completed, in_progress, or not_started |

Projects with status "completed" are used to train the machine learning model. Projects with status "in_progress" are what the predictions and scenario tools operate on.

### activities

Each row is one construction activity within a project.

| Column | What it means |
|---|---|
| id | Unique identifier, e.g. act_008_04 |
| project_id | Which project this belongs to |
| name | e.g. "Waterproofing and Damp Proofing" |
| category | e.g. structural, mep (mechanical/electrical/plumbing), finishing |
| planned_start_date | Originally scheduled start |
| planned_end_date | Originally scheduled finish |
| planned_duration_days | planned_end - planned_start |
| actual_start_date | When work actually began |
| actual_end_date | When work actually finished (null if still running) |
| progress | Percentage complete, 0 to 100 |
| status | completed, in_progress, or not_started |
| depends_on | Name of the predecessor activity (or null) |
| schedule_variance_days | How many days late the actual start was vs planned start |

The `depends_on` field is the key input to the dependency graph. It captures the construction sequence: e.g. "Painting depends on Wall Plastering".

### daily_updates



Each row is one day's progress log for an activity.

| Column | What it means |
|---|---|
| activity_id | Which activity was updated |
| date | The date of the update |
| reported_progress | Cumulative percentage complete as of that day |
| daily_increment | How many percentage points were completed that day |
| crew_size | Number of workers on site that day |
| weather_event | Any weather disruption (rain, heat, etc.) |
| notes | Free-text notes from the site manager |

This table is the primary source for understanding how fast an activity is actually progressing day by day. The system uses the last 14 days of daily increments to fit the Monte Carlo simulation distribution.

### issues

Each row is one problem logged against an activity.

| Column | What it means |
|---|---|
| id | Unique issue ID |
| activity_id | Which activity this issue is blocking |
| category | material_delay, inspection_fail, design_change, labor_shortage, weather, equipment_breakdown, scope_creep, or safety |
| severity | low, medium, high, or critical |
| status | open or resolved |
| delay_impact_days | Estimated number of days this issue adds to the activity |
| assigned_to | Person responsible for resolving it |

Issues feed directly into the feature engineering and what-if calculations.

### boq (Bill of Quantities)

Each row is one material or labor line item in the project cost breakdown.

| Column | What it means |
|---|---|
| activity_id | Which activity this item belongs to |
| name | e.g. "Ceramic tiles 600x600mm" |
| unit | m2, kg, nos, etc. |
| quantity | How many of this unit |
| unit_price | Cost per unit |
| total_price | quantity x unit_price |

The number of BOQ line items per activity is used as a complexity proxy in feature engineering.

### resources

Each row is one resource allocation record.

| Column | What it means |
|---|---|
| activity_id | Which activity this resource is allocated to |
| contractor | The firm supplying the resource |
| resource_type | labour, equipment, or material |
| allocated_workers | Number of workers assigned |
| daily_cost | Cost per day for this resource |
| allocation_date | When the allocation was made |



Resources are used to estimate cost impact in the resource-boost what-if scenario.



---



## How Everything Fits Together



Here is the full data flow from raw files to the dashboard:



```
CSV files (6 tables)
        |
        v
  dataset.py  <-- reads and normalizes CSVs, adds 2 synthetic projects
        |
        v
   data/data.db  (SQLite database with all 7 tables including activity_dependencies)
        |
        v
  data_loader.py  <-- typed accessors, date parsing, method shortcuts
        |
        +----> feature_engineering.py  <-- 12 features per activity
        |               |
        |               v
        |      completion_predictor.py  <-- trains GBR on historical activities
        |               |
        |      monte_carlo.py           <-- 1000-sim P50/P80/P90 per activity
        |
        +----> dag_builder.py           <-- builds networkx DAG from depends_on edges
                        |
               ripple_engine.py         <-- BFS cascade delay propagation
                        |
               whatif_scenarios.py      <-- 4 scenario types with comparison
                        |
               schedule_optimizer.py    <-- CPM forward/backward pass + 6 rules
                        |
               gantt.py / dag_viz.py    <-- Plotly interactive charts
                        |
                        v
                     app.py             <-- 7-tab Streamlit dashboard
```


Each stage is a separate Python module. They can be imported and tested independently. The Streamlit app wires them together.

---

## Component Deep-Dive

### Data Generation and the Database

**File: `dataset.py`**

This script is run once before the app starts. It does two things:

**Ingesting existing CSVs**: It reads all six CSV files from the `data/` folder into a SQLite database using SQLAlchemy. Each CSV becomes a table. Date columns are normalized, and the `depends_on` field in the activities table is used to build a seventh table called `activity_dependencies`, which stores predecessor-successor pairs as explicit rows. This makes graph queries much faster than string parsing at runtime.
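
As a rough illustration of the `activity_dependencies` step, the sketch below turns each activity's `depends_on` name into an explicit predecessor-successor row. It is not the code in `dataset.py`: the function name `build_dependency_rows`, the output column names (`predecessor_id`, `successor_id`), and the choice to skip unresolved names are all assumptions made for the example.

```python
def build_dependency_rows(activities):
    """Resolve each depends_on name to an ID and return explicit edge rows.

    `activities` is a list of dicts with id, project_id, name, depends_on.
    Output column names are hypothetical.
    """
    # Names are only unique within a project, so key the lookup on both.
    id_by_name = {(a["project_id"], a["name"]): a["id"] for a in activities}

    rows = []
    for act in activities:
        pred_name = act.get("depends_on")
        if not pred_name:
            continue  # no predecessor recorded for this activity
        pred_id = id_by_name.get((act["project_id"], pred_name))
        if pred_id is None:
            continue  # assumption: unresolved names are skipped, not raised
        rows.append(
            {
                "project_id": act["project_id"],
                "predecessor_id": pred_id,
                "successor_id": act["id"],
            }
        )
    return rows
```

Storing edges as rows like this means the graph builder can read pairs directly instead of matching activity names every time a project is loaded.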

**Generating synthetic data**: To give the machine learning model more training data, the script generates two additional completed projects using a realistic delay distribution. Each synthetic project follows the same 15-activity renovation sequence and applies randomized delays that reflect how real renovations behave: most activities finish roughly on time, some finish a few days late, and a small number run significantly over:

```
Delay in days:  -1    0    0    1    2    3    5    7
Probability:    5%  20%  20%  20%  15%  10%   7%   3%
```

The distribution is skewed positive (more likely to be late than early) because that is what the historical data shows. Each synthetic project also gets realistic issues, BOQ items, daily updates, and resource allocations generated to match the activity duration.
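
A minimal sketch of sampling from the delay table above, using only the standard library. The names `DELAY_DAYS`, `DELAY_WEIGHTS`, `sample_delays`, and `expected_delay` are illustrative and do not come from `dataset.py`; the values and weights are copied from the table as listed, including the repeated 0.

```python
import random

# Values and percentage weights copied from the table above.
DELAY_DAYS = [-1, 0, 0, 1, 2, 3, 5, 7]
DELAY_WEIGHTS = [5, 20, 20, 20, 15, 10, 7, 3]


def sample_delays(n_activities, seed=None):
    """Draw one delay (in days) per activity from the weighted table."""
    rng = random.Random(seed)  # a seed makes the synthetic data reproducible
    return rng.choices(DELAY_DAYS, weights=DELAY_WEIGHTS, k=n_activities)


def expected_delay():
    """Weighted mean delay in days implied by the table."""
    total = sum(DELAY_WEIGHTS)
    return sum(d * w for d, w in zip(DELAY_DAYS, DELAY_WEIGHTS)) / total


if __name__ == "__main__":
    print(sample_delays(15, seed=1))
    print(expected_delay())
```

Under these weights the mean works out to 1.31 days of delay per activity, which is the positive skew described above expressed as a single number.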

The final database contains 12 projects, 180 activities, 2,354 daily updates, 503 issues, 670 BOQ items, 315 resources, and 168 dependency edges.

---

### The DataLoader

**File: `data_loader.py`**



The DataLoader is the single point of access to data for every other module. At initialization it checks whether `data/data.db` exists. If it does, it reads all tables from the database. If not, it falls back to reading the CSV files directly. Either way, it returns the same pandas DataFrames.



It exposes shortcut methods so modules do not need to write their own filter logic:



- `loader.get_historical_activities()` -- returns all completed activities (used for training)

- `loader.get_active_activities()` -- returns all in-progress activities (used for prediction)

- `loader.get_project_activities(project_id)` -- all activities for one project

- `loader.get_inprogress_projects()` -- projects currently running

- `loader.get_activity_issues(activity_id, project_id)` -- issues for one activity or project

- `loader.get_daily_updates(activity_id)` -- time series of daily progress for one activity

- `loader.get_project_boq(project_id)` -- bill of quantities for a project

- `loader.get_dependencies(project_id)` -- predecessor-successor pairs for the DAG



A `REFERENCE_DATE` of 2024-06-01 is used as the "today" anchor for all calculations on in-progress activities. This is overridable from the dashboard sidebar.



---



### Feature Engineering



**File: `features/feature_engineering.py`**

Machine learning models cannot work directly with dates and status strings. Feature engineering converts raw activity data into 12 numbers that capture the dynamics of schedule performance.

For each activity, the following features are computed:

**planned_duration**

Number of days between planned start and planned end. Longer activities tend to accumulate more delay simply because there is more time for things to go wrong.



**elapsed_days**
For completed activities: actual end minus actual start. For in-progress: today minus actual start. Captures how long the activity has been running so far.

**progress_rate**

Progress divided by elapsed days. If an activity is 40% done after 10 days, the rate is 4% per day. Activities with very low rates relative to their planned pace are likely to finish late.



**schedule_variance**
Actual start minus planned start, in days. A positive value means the activity started late. A late start often predicts a late finish, even if work proceeds at full pace afterward.

**delay_ratio** (training target)

Actual duration divided by planned duration. A ratio of 1.0 means finished on time. A ratio of 1.5 means took 50% longer than planned. This is the value the machine learning model is trained to predict. For in-progress activities it is predicted, not observed.



**issue_count**
Number of open issues against this activity. More issues generally means more risk of delay.

**issue_severity_score**
A weighted count of issues by category. Different categories have different weights based on how much disruption they typically cause:

```
design_change:         3 points
inspection_fail:       2 points
equipment_breakdown:   2 points
material_delay:        2 points
labor_shortage:        1.5 points
weather:               1 point
scope_creep:           1.5 points
safety:                2.5 points
```

An activity with one design_change issue scores 3, while an activity with three weather issues scores 3 as well, capturing that type of issue matters as much as count.
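
The weighted count can be sketched in a few lines. The weights are the ones listed above; the function name `issue_severity_score` and the `default_weight` used for an unknown category are assumptions for this example. The README does not say whether resolved issues are included, so the sketch simply scores whatever list of categories it is given.

```python
# Category weights as listed above.
SEVERITY_WEIGHTS = {
    "design_change": 3.0,
    "inspection_fail": 2.0,
    "equipment_breakdown": 2.0,
    "material_delay": 2.0,
    "labor_shortage": 1.5,
    "weather": 1.0,
    "scope_creep": 1.5,
    "safety": 2.5,
}


def issue_severity_score(issue_categories, default_weight=1.0):
    """Sum the category weights for one activity's issues.

    default_weight is a hypothetical fallback for unlisted categories.
    """
    return sum(SEVERITY_WEIGHTS.get(cat, default_weight) for cat in issue_categories)


if __name__ == "__main__":
    print(issue_severity_score(["design_change"]))
    print(issue_severity_score(["weather", "weather", "weather"]))
```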



**boq_complexity**

Number of BOQ line items for the activity plus a cost-variance component. Activities with many distinct materials or subcontractors are harder to coordinate and more likely to have procurement delays.



**parent_delay**

A binary flag (0 or 1). It is set to 1 if the predecessor activity started more than 2 days late. This captures the fact that late handoffs tend to cause chain reactions.



**historical_avg_delay**

The average delay_ratio for all completed activities in the same category across all training projects. For example, if all past "Tiling" activities took on average 1.3 times their planned duration, then any current tiling activity gets a historical baseline of 1.3. This is the most predictive single feature because construction categories have consistent delay tendencies.

**progress_velocity_7d**
The rolling 7-day average of daily progress increments from the daily_updates table. If the last 7 daily updates show increments of 1, 2, 0, 3, 1, 2, 1, the velocity is approximately 1.4% per day. This is the most up-to-date signal about how fast work is currently proceeding.



**progress_acceleration**

The change in velocity between the most recent 7 days and the prior 7 days. Positive means the activity is speeding up; negative means it is slowing down. Deceleration is a danger signal that often precedes a stall.
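
The two velocity features can be sketched as plain averages over the most recent daily increments. This is an illustration under assumptions: the function names are hypothetical, increments are passed oldest-first, and acceleration falls back to 0.0 when there is no prior window to compare against.

```python
def rolling_velocity(increments, window=7):
    """Mean daily increment over the most recent `window` updates."""
    recent = increments[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def progress_acceleration(increments, window=7):
    """Recent-window velocity minus the velocity of the window before it."""
    recent = increments[-window:]
    prior = increments[-2 * window:-window]
    if not recent or not prior:
        return 0.0  # assumption: not enough history means no measurable change
    return sum(recent) / len(recent) - sum(prior) / len(prior)


if __name__ == "__main__":
    last_week = [1, 2, 0, 3, 1, 2, 1]       # the example from the text
    week_before = [2, 2, 2, 2, 2, 2, 2]
    print(rolling_velocity(last_week))
    print(progress_acceleration(week_before + last_week))
```

With the example increments the velocity is 10/7, about 1.43 percentage points per day, and the acceleration against a steady 2-per-day prior week is negative, which is the deceleration signal described above.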



After computing these 12 features, the engineering module also label-encodes two categorical columns (activity category and project type) and appends them, giving the model 14 total input dimensions.



---



### Prediction Method A: Earned Value Extrapolation



**File: `models/completion_predictor.py`, function `predict_method_a`**



This method requires no training data. It is a purely mathematical extrapolation based on the activity's current pace.



The logic:



1. Calculate the actual rate of progress: `actual_rate = progress / elapsed_days`

2. Calculate the theoretical planned rate: `planned_rate = 100 / planned_duration`

3. Blend the two rates. Early in an activity (few elapsed days) the actual rate is unreliable, so the planned rate gets more weight. Later, the actual rate dominates:

   ```
   weight = min(elapsed_days / 14.0, 0.85)
   blended_rate = weight * actual_rate + (1 - weight) * planned_rate
   ```

4. Project days remaining: `days_remaining = (100 - progress) / blended_rate`

5. Predicted end: `today + days_remaining`

This method works well when an activity is on track. It struggles when an activity has stalled (rate near zero would predict an infinite end date) or when there are invisible blockers that the rate does not yet reflect. That is why it is blended with Method B.
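
The five steps can be sketched as a single function. This is an illustration, not the body of `predict_method_a`: the name `earned_value_estimate`, the guard clauses, and rounding up to whole days are assumptions added for the example.

```python
import math
from datetime import date, timedelta


def earned_value_estimate(progress, elapsed_days, planned_duration, today):
    """Return (days_remaining, predicted_end) using the blended-rate steps."""
    if planned_duration <= 0:
        raise ValueError("planned_duration must be positive")
    if progress >= 100:
        return 0.0, today

    # Steps 1 and 2: observed pace and planned pace, in percent per day.
    actual_rate = progress / elapsed_days if elapsed_days > 0 else 0.0
    planned_rate = 100.0 / planned_duration

    # Step 3: trust the observed pace more as days accumulate, capped at 0.85.
    weight = min(elapsed_days / 14.0, 0.85)
    blended_rate = weight * actual_rate + (1 - weight) * planned_rate

    # Steps 4 and 5: remaining work divided by the blended pace.
    days_remaining = (100 - progress) / blended_rate
    # Assumption: partial days are rounded up to the next calendar day.
    predicted_end = today + timedelta(days=math.ceil(days_remaining))
    return days_remaining, predicted_end


if __name__ == "__main__":
    # "3 weeks into tiling and only 40% done", assuming a 30-day plan.
    print(earned_value_estimate(40, 21, 30, date(2024, 6, 1)))
```

For the tiling question from the introduction, with an assumed 30-day plan, the blended rate is about 2.12% per day, which leaves roughly 28.3 days of work.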

---

### Prediction Method B: Gradient Boosting Regressor

**File: `models/completion_predictor.py`, class `CompletionPredictor`**



This method is a trained machine learning model. It learns from patterns in the 119 completed historical activities and applies that knowledge to predict the `delay_ratio` for each in-progress activity.



**What is a Gradient Boosting Regressor?**

Gradient boosting is an ensemble technique that builds a sequence of decision trees. Each tree is trained to correct the errors of all the trees before it. The "gradient" refers to how each new tree is fit: it targets the negative gradient of the loss function with respect to the current predictions, which for squared-error loss is simply the remaining residual. The result is a flexible predictor that can capture non-linear relationships between features and the target.

In simple terms: the model learns rules like "activities with low progress velocity AND high issue severity score tend to finish 40% later than planned" and combines many such rules into a final prediction.

**Training process:**

The model is trained on the 119 completed activities in the dataset. The features described above are the inputs (X). The `delay_ratio` for each completed activity is the target (y). Hyperparameters:

- 200 trees in the ensemble
- Learning rate of 0.05 (each tree's contribution is scaled by 0.05, which helps limit overfitting)
- Maximum tree depth of 4
- 80% of training data sampled per tree (subsample = 0.8)

The model is evaluated with 5-fold cross-validation and achieves a mean absolute error of 0.001 on the delay multiplier scale. In practical terms, its predictions match the observed delay ratios of the historical activities almost exactly.

**Making a prediction:**

For each in-progress activity, the same 14 features are computed and fed into the trained model. The model outputs a `delay_multiplier` (the predicted delay_ratio). The predicted end date is then:



```
remaining_fraction = (100 - progress) / 100
predicted_end = today + planned_duration * delay_multiplier * remaining_fraction
```



---



### Ensemble Prediction



The final prediction combines Method A and Method B with fixed weights:



```
ensemble_days_remaining = 0.4 * methodA_days_remaining + 0.6 * methodB_days_remaining
predicted_end = today + ensemble_days_remaining
```



Method B receives a 60% weight because it uses 14 features including issue data, historical baselines, and velocity trends that Method A ignores. Method A receives 40% weight to keep the prediction grounded in the activity's observed pace, preventing the model from producing wildly different estimates based on historical patterns alone.
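
The arithmetic for Method B and the ensemble is short enough to sketch directly. The function names are hypothetical; the 0.4 and 0.6 weights and the remaining-fraction formula are taken from the text, and the input values in the usage lines are made up for illustration.

```python
def method_b_days_remaining(progress, planned_duration, delay_multiplier):
    """Planned duration scaled by the predicted multiplier and the work left."""
    remaining_fraction = (100 - progress) / 100
    return planned_duration * delay_multiplier * remaining_fraction


def ensemble_days_remaining(method_a_days, method_b_days, weight_a=0.4, weight_b=0.6):
    """Fixed-weight blend of the two estimates."""
    return weight_a * method_a_days + weight_b * method_b_days


if __name__ == "__main__":
    # Illustrative inputs: 40% done, 30-day plan, predicted multiplier of 1.3,
    # and a Method A estimate of 28 days.
    b_days = method_b_days_remaining(40, 30, 1.3)
    print(b_days)
    print(ensemble_days_remaining(28.0, b_days))
```

With these inputs Method B gives 23.4 days and the blend gives about 25.2 days, between the two individual estimates and closer to Method B because of its larger weight.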



The dashboard shows all three estimates side by side, so a planner can see where the methods agree and where they diverge.



---



### Monte Carlo Simulation



**File: `models/monte_carlo.py`**



Monte Carlo simulation is a technique for quantifying uncertainty. Instead of producing one prediction, it runs thousands of hypothetical futures and reports how often each outcome occurs.



**How it works for this system:**



1. For each in-progress activity, retrieve the last 14 days of `daily_increment` values from daily_updates.

2. Fit a normal distribution to those increments: calculate mean and standard deviation. This characterizes the activity's recent progress behavior.

3. Run 1,000 simulations. In each simulation:

   - Start at the current progress level.

   - Each simulated day, draw a random daily increment from the fitted distribution (using `scipy.stats.norm.rvs`). Increments are clipped to a minimum of 0 (no backward progress).

   - Accumulate progress until it reaches 100%.

   - Record how many days this simulation took.

4. From the 1,000 completion-day counts, compute:

   - **P50** (50th percentile): The date by which 50% of simulations finish. This is the median estimate.

   - **P80** (80th percentile): The date by which 80% of simulations finish. Use this for cautious planning.

   - **P90** (90th percentile): The date by which 90% of simulations finish. Use this for conservative budget reserve estimates.
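
The simulation loop above can be sketched with the standard library. Several things here are assumptions rather than the contents of `monte_carlo.py`: `random.gauss` stands in for `scipy.stats.norm.rvs`, the sample standard deviation is used for the fit, percentiles use the nearest-rank rule, and `max_days` is a safety cap so a stalled activity cannot loop forever.

```python
import math
import random
import statistics


def simulate_days_to_finish(current_progress, recent_increments,
                            n_sims=1000, max_days=3650, seed=None):
    """Return a list with the number of days each simulation needed."""
    rng = random.Random(seed)
    mean = statistics.fmean(recent_increments)
    std = statistics.stdev(recent_increments) if len(recent_increments) > 1 else 0.0

    results = []
    for _ in range(n_sims):
        progress, days = current_progress, 0
        while progress < 100 and days < max_days:
            # Clip at zero: a simulated day never undoes earlier progress.
            progress += max(0.0, rng.gauss(mean, std))
            days += 1
        results.append(days)
    return results


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    last_14_days = [1, 2, 0, 3, 1, 2, 1] * 2   # illustrative increments
    days = simulate_days_to_finish(40, last_14_days, seed=42)
    for p in (50, 80, 90):
        print(f"P{p}: {percentile(days, p)} days from today")
```

Adding each percentile's day count to the reference date gives the P50, P80, and P90 completion dates.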



**Why this matters:**



If P50 is June 15 and P90 is July 10, that means there is a 50% chance of finishing by June 15 but a 10% chance of finishing after July 10. A contract manager can use the P90 date to set penalties; a scheduler can use P50 for optimistic planning. Single-point predictions cannot communicate this uncertainty at all.



---



### Activity Dependency Graph (DAG)



**File: `engine/dag_builder.py`**



A Directed Acyclic Graph (DAG) is a data structure where nodes are connected by arrows, there are no loops, and you can traverse from start to finish following the arrows.



In this system:

- Each **node** is one activity (identified by its ID).

- Each **directed edge** points from a predecessor to a successor. If "Painting depends on Wall Plastering", there is an arrow from Wall Plastering to Painting.

- All activity attributes (name, status, progress, planned dates, schedule variance) are stored on each node so they can be retrieved during graph traversal.



The DAG is built using the `networkx` Python library, which provides efficient graph algorithms.



**Why a DAG specifically?**



The "acyclic" property (no cycles) follows from the nature of construction sequences: you cannot have Activity A depending on Activity B while Activity B also depends on Activity A. Data-entry errors could still introduce a cycle, so the graph is validated: `networkx` can check the property (`nx.is_directed_acyclic_graph`), and its topological sort raises an error if a cycle is present.



The DAG enables two key capabilities:



1. **Topological sort**: Order activities from first to last such that every predecessor comes before its successors. This is the correct order for CPM calculations.



2. **Descendant queries**: Given any activity, instantly find all activities downstream of it. This is used by the ripple engine to know which activities need to be recalculated when a delay occurs.
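
Both capabilities can be illustrated without any third-party library. The sketch below is a stand-in for what `networkx` provides, not the code in `dag_builder.py`: it uses Kahn's algorithm for the topological order and a breadth-first walk for descendants, and the function names and activity names are made up for the example.

```python
from collections import defaultdict, deque


def _index(edges):
    """Build successor lists, in-degrees, and the node set from (pred, succ) pairs."""
    successors, in_degree, nodes = defaultdict(list), defaultdict(int), set()
    for pred, succ in edges:
        successors[pred].append(succ)
        in_degree[succ] += 1
        nodes.update((pred, succ))
    return successors, in_degree, nodes


def topological_order(edges):
    """Kahn's algorithm: every predecessor appears before its successors."""
    successors, in_degree, nodes = _index(edges)
    queue = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in successors[node]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                queue.append(succ)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


def descendants(edges, start):
    """All activities reachable downstream of `start`."""
    successors, _, _ = _index(edges)
    seen, queue = set(), deque([start])
    while queue:
        for succ in successors[queue.popleft()]:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


if __name__ == "__main__":
    edges = [("Plumbing", "Wall Plastering"), ("Electrical", "Wall Plastering"),
             ("Wall Plastering", "Painting")]
    print(topological_order(edges))
    print(descendants(edges, "Plumbing"))
```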



---



### Ripple Engine: Delay Propagation



**File: `engine/ripple_engine.py`**



The ripple engine answers the question: "If activity X finishes N days late, what happens to everything that depends on it?"



**The algorithm (BFS-based forward propagation):**



1. Mark the directly affected activity with its new end date (original end plus delta_days).
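2. Use Breadth-First Search (BFS) to collect the descendants of the affected activity, then process them in topological order, starting from its immediate successors, so that every predecessor is updated before the activities that depend on it.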
3. For each downstream activity encountered:
   - Its new start date is the maximum of:
     - Its original planned start date (it cannot start before it was planned)
     - The maximum end date of all of its now-shifted predecessors
   - Its new end date is new_start_date plus its planned duration.
   - Its cascade delay is new_end_date minus original_end_date.
4. Continue traversal until all descendants have been updated.
5. The new project end date is the maximum end date across all activities after propagation.
6. Total project delay is new_project_end minus original_project_end.

**Example:**

Suppose the project sequence is: A -> B -> C -> D (each depends on the previous).
- A finishes 7 days late.
- B must start 7 days later, so it also finishes 7 days later.
- C must start 7 days later, so it also finishes 7 days later.
- D must start 7 days later, so it also finishes 7 days later.
- Total project delay: 7 days.

Real projects have parallel branches, so the actual cascade is more complex. When two paths merge at one activity, that activity can only start when both predecessors finish. The ripple engine handles merge points correctly by taking the maximum predecessor end date.
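
A compact sketch of the propagation rule, including the merge-point behaviour. It makes several simplifying assumptions that are not taken from `ripple_engine.py`: dates are integer day offsets rather than calendar dates, every edge refers to a known activity, the graph is already known to be acyclic, and the function name and return shape are invented for the example.

```python
from collections import defaultdict, deque


def propagate_delay(activities, edges, delayed_id, delta_days):
    """activities: {id: (planned_start, planned_end)} as day offsets.

    Returns (shifted_schedule, cascade_delay_per_activity, project_delay).
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for pred, succ in edges:
        preds[succ].append(pred)
        succs[pred].append(succ)

    # Kahn's algorithm: an order in which predecessors always come first.
    in_degree = {a: len(preds[a]) for a in activities}
    queue = deque(a for a in activities if in_degree[a] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in succs[node]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                queue.append(succ)

    # BFS from the delayed activity finds everything downstream of it.
    affected, queue = set(), deque([delayed_id])
    while queue:
        for succ in succs[queue.popleft()]:
            if succ not in affected:
                affected.add(succ)
                queue.append(succ)

    shifted = dict(activities)
    start, end = activities[delayed_id]
    shifted[delayed_id] = (start, end + delta_days)
    for act in order:
        if act not in affected:
            continue
        planned_start, planned_end = activities[act]
        # Start no earlier than planned, and only after every predecessor ends.
        new_start = max([planned_start] + [shifted[p][1] for p in preds[act]])
        shifted[act] = (new_start, new_start + (planned_end - planned_start))

    cascade = {a: shifted[a][1] - activities[a][1] for a in activities}
    project_delay = (max(e for _, e in shifted.values())
                     - max(e for _, e in activities.values()))
    return shifted, cascade, project_delay


if __name__ == "__main__":
    chain = {"A": (0, 10), "B": (10, 20), "C": (20, 30), "D": (30, 40)}
    links = [("A", "B"), ("B", "C"), ("C", "D")]
    print(propagate_delay(chain, links, "A", 7))
```

One consequence of the "cannot start before it was planned" rule is that any gap already present between a predecessor's end and a successor's planned start absorbs part of the delay.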

**High-impact activity identification:**

The engine also computes which activities have the most downstream dependencies. An activity that 10 others depend on (directly or indirectly) is more dangerous to delay than an activity with no dependents. The dashboard surfaces the top 5 highest-impact activities for each project.

---

### What-if Scenario Engine

**File: `engine/whatif_scenarios.py`**



The scenario engine lets a user build a collection of hypothetical interventions and compare their effects side by side. It supports four types:



**Delay scenario**

Models the question: "What if this activity finishes N days late?"



The engine calls the ripple engine with the specified activity and delta, records the cascade table and new project end date, and stores the result labeled "delay".



**Resource boost scenario**

Models the question: "What if we add extra workers to this activity to reduce its duration by X%?"



The engine reduces the activity's remaining duration by the specified percentage (e.g. 25% reduction means a 30-day activity becomes a 22.5-day activity). It then runs the ripple engine with a negative delta (negative delay = time saved). The cost impact is estimated as the additional cost at a 40% overtime premium over the per-day resource cost from the resources table.



**Issue resolved scenario**

Models the question: "What if we resolve this specific blocking issue immediately?"



Each issue in the issues table has a `delay_impact_days` field. The engine passes a negative delta equal to that value through the ripple engine, computing the schedule recovery that would result from fixing the issue.



**Parallelize scenario**

Models the question: "What if we run these two activities at the same time instead of sequentially?"



The engine removes the dependency edge between Activity A and Activity B (if one exists) and shifts Activity B's start to overlap with Activity A's start. It then recalculates end dates and propagates any changes downstream.
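
A minimal sketch of this idea, under stated assumptions: dependencies are finish-to-start, times are day offsets from project start, and Activity B inherits Activity A's predecessors so that it starts alongside A rather than before it. The helper names `finish_days` and `parallelize` are hypothetical.

```python
def finish_days(durations, predecessors):
    """Forward pass: earliest finish day of each activity."""
    end = {}

    def finish(act):
        if act not in end:
            start = max((finish(p) for p in predecessors.get(act, [])), default=0)
            end[act] = start + durations[act]
        return end[act]

    for act in durations:
        finish(act)
    return end

def parallelize(predecessors, a, b):
    """Return a new predecessor map in which b no longer waits for a."""
    new_preds = {act: list(preds) for act, preds in predecessors.items()}
    remaining = [p for p in new_preds.get(b, []) if p != a]
    # ASSUMPTION: b inherits a's predecessors so it cannot start before a can.
    for p in new_preds.get(a, []):
        if p not in remaining:
            remaining.append(p)
    new_preds[b] = remaining
    return new_preds

durations = {"P": 4, "A": 10, "B": 8, "C": 5}
predecessors = {"A": ["P"], "B": ["A"], "C": ["B"]}

before = max(finish_days(durations, predecessors).values())
after = max(finish_days(durations, parallelize(predecessors, "A", "B")).values())
print(before, after)   # 27 17
```

The original map is left untouched, so the baseline and the scenario can be compared side by side.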



Each scenario result records: the scenario type, description, original project end date, new project end date, total delay days (negative = time saved), days saved, and cost impact. The comparison table lets a manager immediately see which intervention gives the best schedule recovery per rupee of extra cost.
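
A scenario record and the "recovery per rupee" ranking might look like the sketch below. The `ScenarioResult` class, its field names, and the scoring rule are assumptions made for illustration, not the structures used in `whatif_scenarios.py`.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ScenarioResult:
    scenario_type: str
    description: str
    original_end: date
    new_end: date
    cost_impact: float

    @property
    def total_delay_days(self) -> int:
        # Negative = time saved.
        return (self.new_end - self.original_end).days

    @property
    def days_saved(self) -> int:
        return max(0, -self.total_delay_days)

def rank_by_recovery_per_cost(results):
    """Sort scenarios by days saved per unit of extra cost, best first."""
    def score(r):
        if r.days_saved == 0:
            return 0.0
        if r.cost_impact <= 0:
            return float("inf")   # free recovery always ranks first
        return r.days_saved / r.cost_impact
    return sorted(results, key=score, reverse=True)

baseline = date(2025, 3, 31)
results = [
    ScenarioResult("delay", "Tiling 7 days late", baseline, date(2025, 4, 7), 0.0),
    ScenarioResult("resource_boost", "Boost tiling by 25%", baseline, date(2025, 3, 24), 9000.0),
    ScenarioResult("issue_resolved", "Fix material issue", baseline, date(2025, 3, 26), 2000.0),
]
for r in rank_by_recovery_per_cost(results):
    print(r.scenario_type, r.days_saved, r.cost_impact)
```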



---



### Schedule Optimizer and Critical Path Method



**File: `optimizer/schedule_optimizer.py`**

The Critical Path Method (CPM) is a standard project management algorithm. It identifies the longest chain of dependent activities, which sets the minimum possible project duration: delaying any activity on that chain delays the entire project. These activities form the "critical path" and have zero scheduling slack.

**The CPM algorithm:**

**Step 1: Forward pass (computing Early Start and Early Finish)**

Process activities in topological order:
- For each activity with no predecessors, Early Start (ES) = 0.
- For each other activity, ES = maximum Early Finish of all its predecessors.
- Early Finish (EF) = ES + planned_duration.



This tells us the earliest possible time each activity can start and finish given all dependencies.



**Step 2: Backward pass (computing Late Start and Late Finish)**



Process activities in reverse topological order:

- For each activity with no successors (project end), Late Finish (LF) = total project duration, i.e. the largest Early Finish from the forward pass.

- For each other activity, LF = minimum Late Start of all its successors.

- Late Start (LS) = LF - planned_duration.

This tells us the latest each activity can start and finish without delaying the project.

**Step 3: Float calculation**

Total Float = Late Start - Early Start.

Float represents scheduling flexibility. An activity with 5 days of float can start up to 5 days after its earliest possible start without delaying the project end. An activity with 0 days of float is on the critical path and has no flexibility at all.
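
The three steps can be sketched in a few lines of standard-library Python. This is an illustrative version that assumes finish-to-start dependencies and durations in whole days; the function name `cpm` and the dict-based inputs are hypothetical and do not reflect the actual `schedule_optimizer.py` interface.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def cpm(durations, predecessors):
    """Run the forward pass, backward pass, and float calculation.

    durations:    {activity: planned duration in days}
    predecessors: {activity: [activities that must finish first]}
    """
    graph = {a: predecessors.get(a, []) for a in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Step 1: forward pass (Early Start / Early Finish).
    es, ef = {}, {}
    for a in order:
        es[a] = max((ef[p] for p in graph[a]), default=0)
        ef[a] = es[a] + durations[a]
    project_duration = max(ef.values())

    # Step 2: backward pass (Late Finish / Late Start).
    successors = {a: [] for a in durations}
    for a, preds in graph.items():
        for p in preds:
            successors[p].append(a)
    ls, lf = {}, {}
    for a in reversed(order):
        lf[a] = min((ls[s] for s in successors[a]), default=project_duration)
        ls[a] = lf[a] - durations[a]

    # Step 3: total float; zero float marks the critical path.
    total_float = {a: ls[a] - es[a] for a in durations}
    critical = [a for a in order if total_float[a] == 0]
    return total_float, critical, project_duration

durations = {"A": 3, "B": 5, "C": 2, "D": 4}
predecessors = {"C": ["A", "B"], "D": ["C"]}
total_float, critical, duration = cpm(durations, predecessors)
print(duration)      # 11
print(total_float)   # {'A': 2, 'B': 0, 'C': 0, 'D': 0}
print(critical)      # ['B', 'C', 'D']
```

Here A can slip by 2 days without consequence because the parallel activity B takes 2 days longer, while B, C, and D have no slack.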

The dashboard displays float as a color-coded bar chart: red for zero float (critical), green for high float (safe).

**Rule-Based Optimization Suggestions:**

The optimizer evaluates 6 rules against the current project state and produces prioritized suggestions:

| Rule | Condition that triggers it | Suggested action |
|---|---|---|
| Slow Critical Activity | Activity is on the critical path, in progress, and progressing at under 50% of the planned rate | Add crew or shift to overtime |
| High Impact Delay | Activity has a large schedule variance AND many downstream dependents | Escalate to senior management immediately |
| Material Delay Risk | Activity has not yet started AND has open material_delay issues | Pre-order materials now to prevent future stoppage |
| Parallelization Opportunity | Two not-yet-started activities have no dependency between them | Schedule them to run concurrently |
| Stalled Activity | Activity is in progress but has recorded zero daily progress for 3 or more consecutive days | Investigate the cause immediately |
| Resource Reallocation | Activity is in progress, NOT on the critical path, and has more than 10 days of float | Move its resources to critical path activities |



Suggestions are deduplicated (one per activity-rule pair) and sorted by priority: Critical first, then High, then Medium, then Opportunity.
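
Deduplication and priority ordering could be implemented along these lines. The dict keys (`activity_id`, `rule`, `priority`) and the priorities assigned to each rule in the sample data are assumptions for illustration only.

```python
PRIORITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Opportunity": 3}

def dedupe_and_sort(suggestions):
    """Keep one suggestion per (activity, rule) pair, then sort by priority."""
    seen = set()
    unique = []
    for s in suggestions:
        key = (s["activity_id"], s["rule"])
        if key not in seen:
            seen.add(key)
            unique.append(s)
    # sorted() is stable, so suggestions of equal priority keep their input order.
    return sorted(unique, key=lambda s: PRIORITY_ORDER[s["priority"]])

suggestions = [
    {"activity_id": "A7", "rule": "Stalled Activity", "priority": "High"},
    {"activity_id": "A3", "rule": "Slow Critical Activity", "priority": "Critical"},
    {"activity_id": "A7", "rule": "Stalled Activity", "priority": "High"},
    {"activity_id": "A9", "rule": "Parallelization Opportunity", "priority": "Opportunity"},
]
for s in dedupe_and_sort(suggestions):
    print(s["priority"], s["activity_id"], s["rule"])
```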



---



### Visualizations



**File: `visualization/gantt.py`**



The Gantt chart uses Plotly to draw horizontal bars for each activity. Three bars are drawn per activity when data is available:



- **Planned** (indigo/blue): from planned_start_date to planned_end_date.

- **Actual** (green for completed, amber for in-progress): from actual_start_date to actual_end_date (or today).

- **Forecasted** (amber/orange): from today to the ensemble predicted end date.



Activities on the critical path are marked with a "CRIT:" prefix. A red dashed vertical line marks the reference date ("today").



**File: `visualization/dag_viz.py`**



The DAG is visualized as an interactive Plotly scatter plot with nodes positioned using a hierarchical layout algorithm. Each node is a circle colored on a green-to-red scale based on schedule variance (green = on schedule, red = significantly late). Edges representing critical path connections are drawn in red at double thickness. Hovering over any node shows its name, status, progress, and schedule variance.



---



### The Streamlit Dashboard



**File: `app.py`**



Streamlit is a Python library that turns Python scripts into interactive web applications. The dashboard is organized into 7 tabs, navigable by clicking at the top of the page.



**Performance**: All heavy computations (database loading, model training, CPM calculation, Monte Carlo simulation) are wrapped in `@st.cache_resource`, so they run once when the app first loads and their results are reused across tab switches. After the initial startup, which takes a few seconds, the app feels instant.

**Sidebar**: Always visible. Contains project selection (dropdown of all 12 projects), a reference date picker, project metadata (name, status, type, city), and a model health summary (training sample count, cross-validation MAE).

**Tab 1 - Overview**: Six KPI cards at the top (overall progress, activities done, in progress, open issues, critical issues, average schedule variance). Below that, a status donut chart and a category-level progress bar chart side by side. At the bottom, the full activity list as a sortable table.

**Tab 2 - Gantt Chart**: Full project Gantt with planned vs actual vs forecasted bars for every activity.

**Tab 3 - Predictions**: A table comparing Method A, Method B, and ensemble completion dates for every in-progress activity. Below it, a Monte Carlo histogram for any selected activity showing the distribution of 500 simulated completion times with P50, P80, P90 markers.

**Tab 4 - Ripple Analysis**: A dropdown to select any activity and a number input for delay days. Clicking "Run Ripple Simulation" computes and displays: number of activities affected, original vs new project end date, cascade impact table (original and shifted dates per downstream activity), and a horizontal bar chart of cascade delay magnitude.

**Tab 5 - What-if Scenarios**: A radio button to choose scenario type. Inputs change depending on type. Each submitted scenario is added to a persistent list and shown in a cumulative comparison table and bar chart.

**Tab 6 - Optimization**: CPM results table with float values and critical path flags. Float bar chart. Rule-based suggestion cards color-coded by priority.

**Tab 7 - DAG View**: Interactive dependency graph with hover tooltips. Below it, a table listing every dependency edge in the project with a flag for critical path membership.

---

## Screenshots

### Overview Dashboard

The Overview tab shows the full project status at a glance, with KPI cards at the top and charts breaking down activity status and category-level progress.

![Overview](screenshots/overview.png)

### Gantt Chart

The Gantt chart compares planned (blue), actual (green), and forecasted (amber) timelines for every activity. The dashed red line is "today". Critical path activities are labeled.

![Gantt Chart](screenshots/gantt.png)

### Completion Date Predictions

The Predictions tab shows the three prediction methods side by side and provides a Monte Carlo histogram for any selected activity, with P50, P80, and P90 completion date markers.

![Predictions](screenshots/predictions.png)

### Ripple Analysis

After clicking "Run Ripple Simulation", the tool shows exactly which downstream activities shifted, by how much, and how the project end date changed.

![Ripple Analysis](screenshots/ripple.png)

### What-if Scenarios

The scenario builder lets a planner define multiple interventions and compare their schedule and cost impacts in one table.

![What-if Scenarios](screenshots/whatif.png)

### Optimization and Critical Path

The Optimization tab shows CPM float values per activity (red = zero float = critical) and generates prioritized, rule-based suggestions for schedule recovery.

![Optimization](screenshots/optimization.png)

### Dependency DAG

The DAG view renders the full activity dependency graph as an interactive diagram. Nodes are colored by schedule variance, and critical path edges are drawn in red.

![DAG View](screenshots/dag.png)

---

## Project Structure

```
Assignment/
|
|-- app.py                         Main Streamlit dashboard (7 tabs)
|-- dataset.py                     Database builder: ingests CSVs + generates synthetic data
|-- data_loader.py                 Central data access layer with typed shortcuts
|-- test_pipeline.py               End-to-end integration test for all modules
|-- requirements.txt               Python package dependencies
|-- README.md                      This file
|
|-- data/
|   |-- projects.csv               10 real projects (input)
|   |-- activities.csv             150 activities with dependencies (input)
|   |-- daily_updates.csv          1,950 daily progress logs (input)
|   |-- issues.csv                 395 issues with severity and delay impact (input)
|   |-- boq.csv                    540 bill-of-quantities line items (input)
|   |-- resources.csv              255 resource allocation records (input)
|   |-- data.db                    SQLite database generated by dataset.py (output)
|
|-- features/
|   |-- __init__.py
|   |-- feature_engineering.py     Computes 12 engineered features per activity
|
|-- models/
|   |-- __init__.py
|   |-- completion_predictor.py    Method A + Method B + Ensemble prediction
|   |-- monte_carlo.py             1000-simulation P50/P80/P90 predictor
|
|-- engine/
|   |-- __init__.py
|   |-- dag_builder.py             Builds networkx DAG from activity dependencies
|   |-- ripple_engine.py           BFS delay cascade propagation
|   |-- whatif_scenarios.py        4 what-if scenario types with comparison
|
|-- optimizer/
|   |-- __init__.py
|   |-- schedule_optimizer.py      CPM algorithm and 6 rule-based suggestion rules
|
|-- visualization/
|   |-- __init__.py
|   |-- gantt.py                   Plotly Gantt chart (planned vs actual vs forecast)
|   |-- dag_viz.py                 Plotly interactive DAG diagram
|
|-- screenshots/
|   |-- overview.png
|   |-- gantt.png
|   |-- predictions.png
|   |-- ripple.png
|   |-- whatif.png
|   |-- optimization.png
|   |-- dag.png
```

---

## Installation

Python 3.9 or later is required.

**Step 1: Install dependencies**

```bash
pip install -r requirements.txt
```

Contents of `requirements.txt`:

```
pandas>=1.5.0
numpy>=1.23.0
scikit-learn>=1.1.0
scipy>=1.9.0
networkx>=2.8.0
plotly>=5.11.0
matplotlib>=3.6.0
sqlalchemy>=1.4.0
streamlit>=1.20.0
joblib>=1.2.0
```

Approximate install size: 800 MB (primarily due to scikit-learn and plotly). On a typical connection this takes 2 to 5 minutes.

---

## Running the System

### Step 1: Build the database

This reads all six CSV files and creates `data/data.db`. It also generates two additional synthetic historical projects. Run this once before starting the app, and again any time the CSV files are updated.

```bash
python dataset.py
```

Expected output:

```
Database: data/data.db

Ingesting CSVs...
  Ingested projects.csv    -> projects         (10 rows)
  Ingested activities.csv  -> activities       (150 rows)
  Ingested daily_updates.csv -> daily_updates  (1950 rows)
  Ingested issues.csv      -> issues           (395 rows)
  Ingested boq.csv         -> boq              (540 rows)
  Ingested resources.csv   -> resources        (255 rows)

Generating synthetic projects...
  Generated Beachfront Bungalow Reno  (15 activities, ~200 updates)
  Generated Shopping Mall Fit-out     (15 activities, ~203 updates)

Building dependency graph table...
  Built activity_dependencies table (168 rows)

Summary:
  projects: 12 rows
  activities: 180 rows
  daily_updates: 2354 rows
  issues: 503 rows
  boq: 670 rows
  resources: 315 rows
  activity_dependencies: 168 rows

Database ready.
```

### Step 2: Verify the pipeline (optional but recommended)

This runs all modules in sequence and prints results to the terminal. If any module is broken it will show an error. If everything is working it prints "ALL TESTS PASSED" at the end.

```bash
python test_pipeline.py
```

### Step 3: Launch the dashboard

```bash
streamlit run app.py
```

Open your browser and go to `http://localhost:8501`.

The app takes 3 to 5 seconds to start because it trains the machine learning model. After that, all tab switches are instant because results are cached.

### Step 4: Use the dashboard

1. **Select a project** from the sidebar dropdown. The two in-progress projects (Lakeview Villa Renovation and Metro Commercial Tower Fit-Out) show the full prediction and scenario functionality.

2. **Change the Reference Date** if you want to simulate the project state as of a different date. The date affects all elapsed-day calculations, progress rates, and predictions.

3. **Browse the Overview tab** to get a summary of current project health.

4. **Go to the Gantt Chart** to see the timeline comparison of planned vs actual vs forecasted dates.

5. **Go to Predictions** to see when each activity is expected to finish and explore the Monte Carlo uncertainty ranges.

6. **Go to Ripple Analysis** to select an activity, enter a delay, and click "Run Ripple Simulation" to see the cascade impact.

7. **Go to What-if Scenarios** to build and compare alternative interventions.

8. **Go to Optimization** to see the critical path and receive prioritized scheduling recommendations.

9. **Go to DAG View** to explore the dependency graph interactively.

---

## Glossary

**Activity**: One distinct work package within a project, e.g. "Tiling" or "Electrical Wiring".

**BOQ (Bill of Quantities)**: A detailed list of materials, labor, and other items required to complete an activity, with quantities and unit prices.

**CPM (Critical Path Method)**: A project management algorithm that identifies which sequence of activities determines the minimum possible project duration. Activities on the critical path have zero scheduling slack.

**DAG (Directed Acyclic Graph)**: A data structure where nodes are connected by arrows (directed), and there are no circular paths (acyclic). Used here to model which activities must finish before others can start.

**Delay Ratio**: Actual duration divided by planned duration. A ratio of 1.0 means on-time; 1.5 means 50% over schedule.

**Ensemble**: A prediction that combines multiple independent models or methods by averaging or weighting their outputs. An ensemble is often more accurate than any single method alone.

**Earned Value**: A project management concept where "earned value" represents the planned value of work actually completed. Used here as the basis for Method A extrapolation.

**Float (Total Float)**: The number of days an activity can be delayed without pushing the project end date. Critical path activities have a float of zero.

**GradientBoostingRegressor**: A machine learning algorithm that builds an ensemble of decision trees sequentially, each one correcting the errors of the previous. Effective for tabular data with numerical targets.

**Monte Carlo Simulation**: A computational technique that runs a large number of random scenarios to estimate the probability distribution of an outcome, rather than producing a single point prediction.

**P50 / P80 / P90**: Percentile values from a probability distribution. P80 means 80% of simulated scenarios finish by that date.

**Predecessor**: An activity that must finish before a given activity can start.

**Ripple Effect**: The cascade of delays that propagates through downstream activities when one activity is delayed.

**SQLite**: A file-based relational database engine. The entire database lives in a single file (`data.db`) and requires no server installation.

**Streamlit**: A Python library that converts a Python script into a browser-accessible interactive web application.

**Topological Sort**: An ordering of nodes in a DAG such that for every directed edge from A to B, A comes before B in the ordering. Used to ensure activities are processed in dependency order.