Global Terrorism Database: Classification, Regression & Clustering
Video Presentation
Project Overview
This project applies a full end-to-end data science pipeline to the Global Terrorism Database (GTD), one of the most comprehensive open-source datasets on terrorism, containing over 180,000 attacks worldwide from 1970 to 2017.
"What makes a terrorist attack deadly? Can we predict casualty counts, and classify attack severity, using historical patterns from 47 years of global terrorism data?"
| Item | Details |
|---|---|
| Research Question | Can we predict the number of casualties in a terrorist attack based on its characteristics? |
| Dataset | Global Terrorism Database (GTD) |
| Dataset Size | 181,691 rows × 135 columns (sampled 50,000 rows) |
| Target Variable | nkill: number of people killed per attack |
| Task Types | Regression + Classification + Clustering |
Part 2: Data Cleaning & EDA
Data Cleaning
Out of 135 raw columns, we carefully selected 20 that were most relevant to predicting casualties. The cleaning process preserved as much data as possible while ensuring model reliability.
- Selected 20 relevant columns out of 135 (date, location, attack type, weapon, target, casualties)
- Dropped rows where `nkill` (our target variable) was missing (~2,795 rows removed)
- Filled missing values in `nwound`, `nhostkid`, and `claimed` with 0 (assumed no data = none reported)
- Dropped rows with missing coordinates (~1,265 rows removed)
- Final clean dataset: ~50,000 rows ready for modeling
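The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_gtd` helper and the toy rows are made up for the example, and the coordinate column names `latitude` and `longitude` are an assumption about how the coordinates are stored.

```python
import numpy as np
import pandas as pd


def clean_gtd(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a GTD-style DataFrame (hypothetical helper)."""
    # 1. The target must be present, so rows without nkill are dropped.
    out = df.dropna(subset=["nkill"]).copy()
    # 2. Missing counts/flags are treated as "none reported" and filled with 0.
    fill_cols = ["nwound", "nhostkid", "claimed"]
    out[fill_cols] = out[fill_cols].fillna(0)
    # 3. Rows without coordinates are dropped (column names are assumed).
    out = out.dropna(subset=["latitude", "longitude"])
    return out.reset_index(drop=True)


# Tiny made-up frame, only to exercise the helper.
toy = pd.DataFrame({
    "nkill": [0, np.nan, 3, 1],
    "nwound": [np.nan, 2, 5, 0],
    "nhostkid": [np.nan, np.nan, 1, 0],
    "claimed": [1, 0, np.nan, 0],
    "latitude": [33.3, 31.5, np.nan, 34.5],
    "longitude": [44.4, 74.3, 69.2, 69.2],
})
cleaned = clean_gtd(toy)
print(cleaned)
```

In the toy frame, the second row is dropped for a missing target and the third for a missing latitude, leaving two fully populated rows.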
Outlier Detection
The nkill distribution is heavily right-skewed. The median is 0 killed (at least half of attacks kill nobody), but the maximum is 1,384 (9/11). Decision: keep all outliers, because they represent real historical events, not data errors.
Question 1: How has the number of terrorist attacks changed over time?
Answer: Terrorist attacks increased dramatically over the past five decades, with a massive surge after 2010 driven by the rise of ISIS and instability across the Middle East. The peak came in 2014 with over 4,500 attacks in our sample alone. Notably, attacks dropped sharply right after 9/11, likely due to heightened global security, before climbing again to unprecedented levels. The red dashed line marks 9/11 as a turning point in the history of global counter-terrorism.
Question 2: Which countries are most affected by terrorism?
Answer: Iraq dominates as the most affected country with 6,552 attacks, about 1.7 times second-place Pakistan (3,911) and nearly double third-place Afghanistan (3,337). A clear pattern emerges: every country in the top 10 is a developing nation experiencing political instability, civil conflict, or foreign military presence. Notably, Israel ranks #21 globally with 592 recorded attacks (highlighted in red as a personal point of context), showing that even countries
Question 3: What are the most common types of terrorist attacks?
Answer: Bombing/Explosion is by far the most common attack type (22,890 attacks, nearly half of all attacks), followed by Armed Assault (10,665) and Assassination (5,107). Hijacking is the rarest but caused the single deadliest attack in history: 9/11.
Question 4: Which attack types are most deadly?
Answer: Frequency and lethality don't always go together. Bombing/Explosion dominates in frequency (22,000+ attacks) but has moderate lethality (2.0 avg killed). Hijacking is the rarest attack type but the most lethal on average (9.8 avg killed per attack), driven by 9/11. Armed Assault sits in the middle with high frequency and high total casualties.
Question 5: Who are the deadliest terrorist organizations?
Answer: ISIL is the deadliest organization with 9,752 people killed, followed by the Taliban (8,263) and Boko Haram (6,399). Together, these three account for 24,414 deaths. Notably, all top 3 are modern organizations active post-2000, reflecting the dramatic rise of religiously motivated terrorism in the 21st century.
BONUS: Interactive World Map
Note: The interactive version of this map (with hover tooltips showing exact attack counts per country) is available in the notebook. The image above is a static preview.
Global distribution of terrorist attacks from 1970 to 2017. Iraq dominates in dark red, with the Middle East and South Asia clearly the most affected regions.
Part 3: Baseline Model
Before building complex models, we established a Linear Regression baseline using only raw features: no engineering, no transformations. This gives us a reference point to measure how much our improvements actually help.
| Metric | Value |
|---|---|
| MAE | 2.49 |
| RMSE | 6.89 |
| R² | 0.145 |
The baseline explains only 14.5% of the variance in casualties, a humble but expected result. Predicting terrorism casualties is inherently difficult: about half of attacks kill nobody, while a handful of extreme outliers like 9/11 kill hundreds or more. No simple linear model can capture that dynamic.
This result set a clear challenge: can we do significantly better with feature engineering and more powerful models?
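As a rough sketch of how a baseline like this can be fit and scored, the snippet below trains scikit-learn's `LinearRegression` and computes MAE, RMSE, and R². The data is synthetic (a zero-heavy, right-skewed Poisson target), so the printed numbers are not the project's results; the feature matrix and random seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the raw features and a skewed, zero-heavy target.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0] - 0.5)).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Plain linear regression on untransformed features and target.
baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE; a wide gap between the two, as in the table above, usually signals that a few large errors dominate.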
Part 4: Feature Engineering
Raw data alone is rarely enough. We engineered 4 new features designed to capture patterns that the original columns couldn't express on their own:
| Feature | Description | Intuition |
|---|---|---|
| `decade` | Which decade the attack occurred | Terrorism evolves over time: the 2010s look very different from the 1980s |
| `suicide_success` | Suicide attack that also succeeded | A suicide attacker who succeeds is far more lethal than one who doesn't |
| `has_hostages` | Whether hostages were taken | Hostage situations tend to escalate and involve more casualties |
| `cluster` | K-Means cluster ID (k=4) | Groups attacks by behavioral profile (see below) |
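The first three features in the table can be derived with a few vectorized pandas operations. The sketch below is an assumption about how they might be computed: `add_features` is a hypothetical helper, and it assumes the source columns are named `iyear`, `suicide`, `success`, and `nhostkid`.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add decade, suicide_success, and has_hostages (hypothetical helper)."""
    out = df.copy()
    # Floor the year to its decade, e.g. 1985 -> 1980.
    out["decade"] = (out["iyear"] // 10) * 10
    # 1 only when the attack was both a suicide attack and successful.
    out["suicide_success"] = (
        (out["suicide"] == 1) & (out["success"] == 1)
    ).astype(int)
    # 1 when at least one hostage was recorded.
    out["has_hostages"] = (out["nhostkid"] > 0).astype(int)
    return out


# Made-up rows, only to show the transformation.
toy = pd.DataFrame({
    "iyear": [1985, 2014, 2001],
    "suicide": [0, 1, 1],
    "success": [1, 1, 0],
    "nhostkid": [0, 3, 0],
})
engineered = add_features(toy)
print(engineered[["decade", "suicide_success", "has_hostages"]])
```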
K-Means Clustering (k=4)
We applied K-Means clustering on numeric features to automatically group attacks into 4 behavioral profiles. This is unsupervised learning working alongside our supervised models: the cluster label becomes a feature that carries rich information about the attack's overall pattern.
- Cluster 2 captured the most deadly attacks with the highest average `nkill`
- The `cluster` feature ranked #4 in feature importance out of 69 total features
- This suggests that clustering added real, measurable predictive value, not just noise
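A minimal sketch of turning K-Means labels into a model feature is shown below. It uses synthetic blobs rather than GTD columns, and the scaling step, `n_init`, and seeds are assumptions; the original pipeline may differ. In a real pipeline the scaler and K-Means would typically be fit on the training split only, to avoid leaking test information.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the attack attributes.
X_num, _ = make_blobs(
    n_samples=400, centers=4, n_features=3, cluster_std=0.5, random_state=0
)
features = pd.DataFrame(X_num, columns=["f1", "f2", "f3"])

# K-Means is distance-based, so features are standardized first.
scaled = StandardScaler().fit_transform(features)

# k=4 as in the text; the label is stored as a new column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
features["cluster"] = kmeans.fit_predict(scaled)

print(features["cluster"].value_counts().sort_index())
```

Because cluster IDs are arbitrary labels rather than ordered values, one-hot encoding them is a common choice for linear models, while tree-based models can often use them directly.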

Part 5: Regression Models
We trained 6 models in total: first 3 on the original nkill target, then 3 more on a log-transformed target (log1p(nkill)) to handle the extreme right skew of the data. Log transformation compresses the scale of outliers, making it much easier for tree-based models to learn meaningful patterns.
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 2.49 | 6.89 | 0.145 |
| LR + Features | 2.49 | 6.89 | 0.147 |
| Random Forest | 2.13 | 7.05 | 0.106 |
| XGBoost | 2.18 | 6.97 | 0.127 |
| LR (log) | 1.98 | 7.22 | 0.061 |
| Random Forest (log) | 1.71 | 6.62 | 0.211 |
| XGBoost (log) (winner) | 1.72 | 6.52 | 0.235 |
Winner: XGBoost with log-transformed target (R² = 0.235, MAE = 1.72)
Key takeaways:
- Log transformation significantly improved both Random Forest and XGBoost
- XGBoost (log) reduced MAE by 31% compared to the baseline (2.49 → 1.72)
- R² improved from 0.145 to 0.235, a 62% relative improvement over baseline
- Linear Regression showed almost no improvement from feature engineering, suggesting the relationships are non-linear
An R² of 0.235 may seem modest, but terrorism casualty prediction is a very hard regression problem, with extreme outliers, sparse data, and inherently random violence, so this result is meaningful and defensible.
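The log-target approach can be sketched as follows: fit on `log1p(y)`, then map predictions back to the original scale with `expm1` before scoring. This example uses scikit-learn's `RandomForestRegressor` as a stand-in (not XGBoost) and synthetic data, so it only illustrates the transform, not the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic right-skewed, zero-heavy count target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 4))
y = rng.poisson(lam=np.exp(1.0 * X[:, 0])).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on log1p(y): log1p(0) = 0, so zero-casualty rows are handled safely.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, np.log1p(y_train))

# Invert the transform so errors are measured in the original units.
pred = np.expm1(model.predict(X_test))
mae = mean_absolute_error(y_test, pred)
print(f"MAE on original scale: {mae:.2f}")
```

Metrics should be computed after the inverse transform; otherwise they are not comparable with models trained on the raw target.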
Feature Importance
The top predictors reveal a clear story about what makes attacks deadly:
- `suicide` (#1): Suicide attacks are significantly more lethal by design
- `Assassination` (#2): Targeted killings are planned for maximum impact
- `Firearms` (#3): Weapon type is a strong predictor of lethality
- `cluster` (#4): Our engineered feature! Suggests that unsupervised clustering added real signal
- `nwound` (#5): Injuries and deaths are strongly correlated
Residual Analysis
The residuals are approximately centered around 0 with a near-normal distribution, suggesting the model has little systematic bias. The fan-shaped spread in the left plot (increasing variance at higher predicted values) is expected: extreme events like mass casualty attacks are inherently harder to predict precisely, even with the best model.
Part 6: Model Saving
The winning model, XGBoost with log-transformed target, was saved and uploaded to this HuggingFace repository for reproducibility and deployment.
| File | Size | Description |
|---|---|---|
| `regression_model.pkl` | 413 KB | XGBoost regression model (log-transformed) |
| `classification_model.pkl` | 191 MB | Tuned Random Forest classifier |
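Saving and reloading a fitted model as a `.pkl` file can be sketched with the standard-library `pickle` module, as below. The serialization library actually used in the project is not stated, so `pickle` is an assumption, and the small linear model here is only a stand-in for the saved models.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model; the real file holds the tuned project model.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "regression_model.pkl"
    # Serialize the fitted estimator to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a consumer of the repository would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Two practical notes: a model trained on `log1p(nkill)` returns log-scale predictions, so `np.expm1` must be applied after loading; and pickle files can execute arbitrary code on load, so they should only be loaded from trusted sources.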
Part 7: Regression → Classification
To make the problem more actionable, we converted the continuous nkill target into 3 meaningful severity classes. This allows emergency responders, intelligence analysts, and policymakers to think in terms of attack severity rather than exact body counts.
| Class | Definition | % of Data | Interpretation |
|---|---|---|---|
| 0 | 0 killed | 51.5% | No casualties: attack failed or was non-lethal |
| 1 | 1-3 killed | 34.2% | Low casualties: typical small-scale attack |
| 2 | 4+ killed | 14.3% | High casualties: mass casualty event |
The dataset is significantly imbalanced: Class 2 (high casualties) represents only 14.3% of attacks, yet it is by far the most important class to predict correctly. Missing a high-casualty attack is far more costly than a false alarm.
For this reason, we focused on F1 score and Recall for Class 2 as our primary evaluation metrics, not overall accuracy, which can be misleadingly high on imbalanced data.
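The binning rule in the table (0 killed, 1-3 killed, 4+ killed) can be expressed in a few lines. The `to_severity` helper below is a hypothetical name used for illustration, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def to_severity(nkill: pd.Series) -> pd.Series:
    """Map kill counts to severity classes 0 / 1 / 2 (hypothetical helper)."""
    # Conditions are checked in order: 0 killed, then up to 3, else 4+.
    classes = np.select([nkill <= 0, nkill <= 3], [0, 1], default=2)
    return pd.Series(classes, index=nkill.index, name="severity")


toy = pd.Series([0, 1, 3, 4, 1384], name="nkill")
severity = to_severity(toy)

# Class shares, useful for checking how imbalanced the target is.
shares = severity.value_counts(normalize=True).sort_index()
print(severity.tolist())
print(shares)
```

Checking the class shares right after binning is a quick way to confirm the imbalance before choosing evaluation metrics.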
Part 8: Classification Models
We trained 3 classification models to predict attack severity class (0/1/2). Each model brings a different approach, from simple linear boundaries to complex ensemble methods.
| Model | Accuracy | F1 (High casualties) |
|---|---|---|
| Logistic Regression | 64% | 0.48 |
| Gradient Boosting | 72% | 0.50 |
| Random Forest (winner) | 74% | 0.54 |
Winner: Random Forest, with the best accuracy and the best F1 for the critical "High casualties" class.
The confusion matrices below reveal how each model handles the 3 classes:
Key observations:
- Logistic Regression struggles with Class 2: only 66% recall for high casualties
- Random Forest achieves the best balance across all 3 classes
- Gradient Boosting performs well overall but falls short of Random Forest on Class 2
- All models correctly identify "No casualties" attacks most reliably (majority class)
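To make the class-specific metrics concrete, the sketch below computes accuracy, the confusion matrix, and F1 and recall restricted to Class 2 with scikit-learn. The label arrays are hand-made for illustration and are not the project's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    recall_score,
)

# Hand-made labels: 4 attacks of class 0, 3 of class 1, 3 of class 2.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Restricting labels to [2] scores only the high-casualty class.
f1_high = f1_score(y_true, y_pred, labels=[2], average="macro")
recall_high = recall_score(y_true, y_pred, labels=[2], average="macro")

print(cm)
print(f"accuracy={acc:.2f}  F1(class 2)={f1_high:.2f}  recall(class 2)={recall_high:.2f}")
```

In this toy example one of the three high-casualty attacks is missed, so Class 2 recall is 2/3 even though no other attack is wrongly flagged as Class 2.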
BONUS: Hyperparameter Tuning
We applied GridSearchCV to tune the Random Forest hyperparameters, systematically testing 12 parameter combinations with 3-fold cross-validation (36 total fits).
- Best parameters found: `n_estimators=200`, `max_depth=None`, `min_samples_split=5`
- F1 (High casualties) improved: 0.54 → 0.58 (+7.4%)
- Overall accuracy remained stable at 74%, so the tuning improved the high-casualty F1 without sacrificing overall accuracy
The tuned model was saved as classification_model.pkl and uploaded to this repository.
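A grid search of this shape (12 combinations, 3 folds, 36 fits) could look like the sketch below. The specific grid values are assumptions chosen so that the grid has 12 combinations and contains the reported best parameters; the scorer targeting Class 2 F1 is also an assumption, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the severity classes.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# 2 x 3 x 2 = 12 combinations (grid values are illustrative assumptions).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

# Score each candidate by F1 on class 2 only.
f1_high = make_scorer(f1_score, labels=[2], average="macro")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=f1_high,
    cv=3,  # 12 combinations x 3 folds = 36 fits, plus one final refit
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV F1 (class 2): {search.best_score_:.3f}")
```

Scoring on the class of interest, rather than on accuracy, keeps the search aligned with the evaluation priorities stated in Part 7.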
Summary
This project demonstrates a complete data science pipeline applied to one of the most challenging real-world datasets available. Starting from raw terrorism records, we built a system that can predict casualty counts and classify attack severity with meaningful accuracy. The combination of feature engineering, log transformation, clustering, and hyperparameter tuning resulted in a 62% relative improvement in R² over the baseline model (0.145 → 0.235).
Author: Omer Inbar | Reichman University, Adelson School of Entrepreneurship | Introduction to Data Science | 2026
AI Usage Disclosure
This project was completed with assistance from Claude (Anthropic) for code debugging, chart design, and README writing. All analysis, decisions, and interpretations are my own.














