🌍 Global Terrorism Database — Classification, Regression & Clustering

Video Presentation

Project Overview

This project applies a full end-to-end data science pipeline to the Global Terrorism Database (GTD) — one of the most comprehensive open-source datasets on terrorism, containing over 180,000 attacks worldwide from 1970 to 2017.

"What makes a terrorist attack deadly? Can we predict casualty counts — and classify attack severity — using historical patterns from 47 years of global terrorism data?"


Research Question	Can we predict the number of casualties in a terrorist attack based on its characteristics?
Dataset	Global Terrorism Database — GTD
Dataset Size	181,691 rows × 135 columns (sampled 50,000 rows)
Target Variable	`nkill` — number of people killed per attack
Task Types	Regression + Classification + Clustering

Part 2: Data Cleaning & EDA

Data Cleaning

Out of 135 raw columns, we carefully selected 20 that were most relevant to predicting casualties. The cleaning process preserved as much data as possible while ensuring model reliability.

Selected 20 relevant columns out of 135 (date, location, attack type, weapon, target, casualties)
Dropped rows where nkill was missing — our target variable (~2,795 rows removed)
Filled missing values: nwound, nhostkid, claimed → filled with 0 (assumed no data = none reported)
Dropped rows with missing coordinates (~1,265 rows removed)
Final clean dataset: ~50,000 rows ready for modeling
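
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the coordinate column names `latitude` and `longitude`, the helper name `clean_gtd`, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd

def clean_gtd(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a GTD-like frame."""
    # Drop rows where the target variable is missing.
    out = df.dropna(subset=["nkill"]).copy()
    # Treat missing counts/flags as "none reported".
    fill_cols = ["nwound", "nhostkid", "claimed"]
    out[fill_cols] = out[fill_cols].fillna(0)
    # Drop rows without coordinates (column names are assumed).
    out = out.dropna(subset=["latitude", "longitude"])
    return out.reset_index(drop=True)

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    "nkill": [0, None, 3, 1],
    "nwound": [None, 2, 5, None],
    "nhostkid": [None, None, 1, 0],
    "claimed": [1, None, None, 0],
    "latitude": [33.3, 31.5, None, 34.5],
    "longitude": [44.4, 74.3, 69.2, 69.2],
})
cleaned = clean_gtd(toy)
```

In the toy frame, one row is dropped for a missing `nkill` and one for a missing coordinate, leaving two fully populated rows.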

Outlier Detection

The nkill distribution is heavily right-skewed. The median is 0 killed (about half of all attacks kill nobody), but the maximum is 1,384 (9/11). Decision: keep all outliers, since these represent real historical events, not data errors.

Question 1: How has the number of terrorist attacks changed over time?

Answer: Terrorist attacks increased dramatically over the past 5 decades, with a massive surge after 2010 driven by the rise of ISIS and instability across the Middle East. The peak came in 2014 with over 4,500 attacks in our sample alone. Notably, attacks actually dropped sharply right after 9/11 — likely due to heightened global security — before climbing again to unprecedented levels. The red dashed line marks 9/11 as a turning point in the history of global counter-terrorism.

Question 2: Which countries are most affected by terrorism?

Answer: Iraq is the most affected country with 6,552 attacks, about 1.7 times the count for second-place Pakistan (3,911) and nearly double that of third-place Afghanistan (3,337). A clear pattern emerges: every country in the top 10 is a developing nation experiencing political instability, civil conflict, or foreign military presence. Notably, Israel ranks #21 globally with 592 recorded attacks, highlighted in red as a personal point of context, showing that even countries

Question 3: What are the most common types of terrorist attacks?

Answer: Bombing/Explosion is by far the most common attack type (22,890 attacks — nearly half of all attacks), followed by Armed Assault (10,665) and Assassination (5,107). Hijacking is the rarest but caused the single deadliest attack in history — 9/11.

Question 4: Which attack types are most deadly?

Answer: Frequency and lethality don't always go together. Bombing/Explosion dominates in frequency (22,000+ attacks) but has moderate lethality (2.0 avg killed). Hijacking is the rarest attack type but the most lethal on average (9.8 avg killed per attack) — driven by 9/11. Armed Assault sits in the middle with high frequency and high total casualties.
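
The frequency-versus-lethality comparison boils down to counting attacks and averaging deaths per attack type. A minimal sketch with made-up records (the rows below are illustrative, not GTD data, and the variable names are assumptions):

```python
from collections import defaultdict

# Toy (attack_type, nkill) records, not real GTD rows.
records = [
    ("Bombing/Explosion", 0), ("Bombing/Explosion", 2), ("Bombing/Explosion", 4),
    ("Armed Assault", 3), ("Armed Assault", 1),
    ("Hijacking", 10),
]

# attack type -> [number of attacks, total killed]
totals = defaultdict(lambda: [0, 0])
for attack_type, nkill in records:
    totals[attack_type][0] += 1
    totals[attack_type][1] += nkill

summary = {
    name: {"attacks": count, "avg_killed": killed / count}
    for name, (count, killed) in totals.items()
}
most_frequent = max(summary, key=lambda name: summary[name]["attacks"])
most_lethal = max(summary, key=lambda name: summary[name]["avg_killed"])
```

An average over a rare category is sensitive to a single extreme event, which is why the hijacking figure is described as driven by 9/11.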

Question 5: Who are the deadliest terrorist organizations?

Answer: ISIL is the deadliest organization with 9,752 people killed, followed by the Taliban (8,263) and Boko Haram (6,399). Together these three account for 24,414 deaths in the sample. Notably, all top 3 are modern organizations active post-2000, reflecting the dramatic rise of religiously-motivated terrorism in the 21st century.

BONUS: Interactive World Map

Note: The interactive version of this map (with hover tooltips showing exact attack counts per country) is available in the notebook. The image above is a static preview.

Global distribution of terrorist attacks from 1970–2017. Iraq dominates in dark red, with the Middle East and South Asia clearly the most affected regions.

Part 3: Baseline Model

Before building complex models, we established a Linear Regression baseline using only raw features — no engineering, no transformations. This gives us a reference point to measure how much our improvements actually help.

Metric	Value
MAE	2.49
RMSE	6.89
R²	0.145

The baseline explains only 14.5% of the variance in casualties — a humble but expected result. Predicting terrorism casualties is inherently difficult: 50% of attacks kill nobody, while a handful of extreme outliers like 9/11 kill hundreds. No simple linear model can capture that dynamic.
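
For reference, the three metrics in the table can be computed from predictions as sketched below. The function name `regression_metrics` and the toy numbers are illustrative assumptions, not values from the project.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # R^2 compares the model's squared error with that of always
    # predicting the mean of the true values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy data: mostly small counts plus one large event that is under-predicted.
y_true = [0, 0, 1, 3, 10]
y_pred = [1, 1, 1, 2, 4]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

Because RMSE squares each error, the single large miss in the toy data pushes RMSE well above MAE, the same pattern seen in the baseline table.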

This result set a clear challenge: can we do significantly better with feature engineering and more powerful models?

Part 4: Feature Engineering

Raw data alone is rarely enough. We engineered 4 new features designed to capture patterns that the original columns couldn't express on their own:

Feature	Description	Intuition
`decade`	Which decade the attack occurred	Terrorism evolves over time — the 2010s look very different from the 1980s
`suicide_success`	Suicide attack that also succeeded	A suicide attacker who succeeds is far more lethal than one who doesn't
`has_hostages`	Whether hostages were taken	Hostage situations tend to escalate and involve more casualties
`cluster`	K-Means cluster ID (k=4)	Groups attacks by behavioral profile — see below
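
The first three features in the table could be derived along these lines. This is a sketch under assumptions: the input column names `iyear`, `suicide`, `success`, and `nhostkid`, the helper name `add_engineered_features`, and the exact definitions are inferred from the descriptions above rather than taken from the notebook. The `cluster` feature is not covered here.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add decade, suicide_success and has_hostages columns."""
    out = df.copy()
    # Decade of the attack, e.g. 1987 -> 1980.
    out["decade"] = (out["iyear"] // 10) * 10
    # 1 only when the attack was both a suicide attack and successful.
    out["suicide_success"] = (
        (out["suicide"] == 1) & (out["success"] == 1)
    ).astype(int)
    # 1 when at least one hostage was recorded.
    out["has_hostages"] = (out["nhostkid"] > 0).astype(int)
    return out

# Toy rows standing in for the cleaned data.
toy = pd.DataFrame({
    "iyear": [1987, 2014, 2016],
    "suicide": [0, 1, 1],
    "success": [1, 1, 0],
    "nhostkid": [0, 0, 12],
})
features = add_engineered_features(toy)
```

Comparing `nhostkid` with `> 0` rather than `!= 0` also keeps any negative placeholder codes from being counted as hostage situations.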

K-Means Clustering (k=4)

We applied K-Means clustering on numeric features to automatically group attacks into 4 behavioral profiles. This is unsupervised learning working alongside our supervised models — the cluster label becomes a feature that carries rich information about the attack's overall pattern.

Cluster 2 captured the most deadly attacks with the highest average nkill
The cluster feature ranked #4 in feature importance out of 69 total features
This suggests that clustering added measurable predictive value rather than noise, although an importance ranking alone does not establish it
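
A minimal sketch of turning K-Means labels into a feature, using scikit-learn on synthetic data. The standardization step, the `random_state` and `n_init` values, and the random input matrix are assumptions; only k=4 comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the numeric attack features (not real GTD data).
X = rng.normal(size=(200, 3))

# Scaling is an assumption here: K-Means is distance-based, so unscaled
# columns with large ranges would otherwise dominate the grouping.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Append the cluster ID as one extra feature column for the supervised models.
X_with_cluster = np.column_stack([X, cluster_labels])
```

If the target or other post-attack outcome columns are among the clustering inputs, the cluster ID can leak target information into the model, so the input column list is worth checking.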

Part 5: Regression Models

We trained 6 models in total — first 3 on the original nkill target, then 3 more on a log-transformed target (log1p(nkill)) to handle the extreme right skew of the data. Log transformation compresses the scale of outliers, making it much easier for tree-based models to learn meaningful patterns.
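
The log-target approach amounts to training on log1p(nkill) and mapping predictions back with expm1. A minimal sketch of the transform pair with illustrative values (whether the notebook inverted predictions before computing the reported metrics is not stated here):

```python
import math

# Illustrative death counts spanning the skewed range described above.
nkill = [0, 1, 3, 50, 1384]

# Forward transform used as the training target.
log_target = [math.log1p(k) for k in nkill]

# Inverse transform applied to model outputs to get back to a death count.
recovered = [math.expm1(v) for v in log_target]
```

On the log scale the largest value sits a little above 7 instead of 1,384, so a single extreme event no longer dominates a squared-error loss.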

Model	MAE	RMSE	R²
Baseline Linear Regression	2.49	6.89	0.145
LR + Features	2.49	6.89	0.147
Random Forest	2.13	7.05	0.106
XGBoost	2.18	6.97	0.127
LR (log)	1.98	7.22	0.061
Random Forest (log)	1.71	6.62	0.211
XGBoost (log) ⭐	1.72	6.52	0.235

Winner: XGBoost with log-transformed target (R² = 0.235, MAE = 1.72)

Key takeaways:

Log transformation significantly improved both Random Forest and XGBoost
XGBoost (log) reduced MAE by 31% compared to the baseline (2.49 → 1.72)
R² improved from 0.145 to 0.235, a 62% relative improvement over the baseline (+0.09 in absolute terms)
Linear Regression showed almost no improvement from feature engineering, which is consistent with the relationships being non-linear
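
The percentages in the takeaways are relative changes against the baseline. The arithmetic can be checked with a small helper (the name `relative_change` is illustrative):

```python
def relative_change(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old

mae_change = relative_change(2.49, 1.72)   # negative: error went down
r2_change = relative_change(0.145, 0.235)  # positive: R^2 went up

mae_pct = round(mae_change * 100)
r2_pct = round(r2_change * 100)
```

Here `mae_pct` is -31 and `r2_pct` is 62. The R² figure is a relative gain; in absolute terms the model explains 9 more percentage points of variance than the baseline.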

An R² of 0.235 is modest, but terrorism casualty prediction is a hard regression problem, with extreme outliers, sparse data, and inherently unpredictable violence, so the gain over the baseline is still meaningful and defensible.

Feature Importance

The top predictors reveal a clear story about what makes attacks deadly:

suicide (#1): Suicide attacks are significantly more lethal by design
Assassination (#2): Targeted killings are planned for maximum impact
Firearms (#3): Weapon type is a strong predictor of lethality
cluster (#4): Our engineered feature, which suggests that unsupervised clustering added real signal
nwound (#5): Injuries and deaths are strongly correlated

Residual Analysis

The residuals are approximately centered around 0 with a roughly bell-shaped distribution, which suggests the model has little systematic bias. The fan-shaped spread in the left plot (increasing variance at higher predicted values) indicates heteroscedasticity, which is expected: extreme events like mass casualty attacks are inherently harder to predict precisely, even with the best model.

Part 6: Model Saving

The winning model — XGBoost with log-transformed target — was saved and uploaded to this HuggingFace repository for reproducibility and deployment.

File	Size	Description
`regression_model.pkl`	413 KB	XGBoost regression model (log-transformed)
`classification_model.pkl`	191 MB	Tuned Random Forest classifier

Part 7: Regression → Classification

To make the problem more actionable, we converted the continuous nkill target into 3 meaningful severity classes. This allows emergency responders, intelligence analysts, and policymakers to think in terms of attack severity rather than exact body counts.

Class	Definition	% of Data	Interpretation
0	0 killed	51.5%	No casualties — attack failed or was non-lethal
1	1–3 killed	34.2%	Low casualties — typical small-scale attack
2	4+ killed	14.3%	High casualties — mass casualty event
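
The class definitions in the table translate into a simple binning rule. A minimal sketch (the function name `severity_class` is an assumption):

```python
def severity_class(nkill):
    """Map a death count to the 3 severity classes defined above."""
    if nkill == 0:
        return 0  # 0 killed
    if nkill <= 3:
        return 1  # 1-3 killed
    return 2      # 4+ killed

labels = [severity_class(k) for k in [0, 1, 3, 4, 1384]]
```

The boundary cases matter most here: 3 deaths still falls in class 1, while 4 deaths is the first value in class 2.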

The dataset is significantly imbalanced — Class 2 (high casualties) represents only 14.3% of attacks, yet it is by far the most important class to predict correctly. Missing a high-casualty attack is far more costly than a false alarm.

For this reason, we focused on F1 score and Recall for Class 2 as our primary evaluation metrics — not overall accuracy, which can be misleadingly high on imbalanced data.
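
To see why accuracy can mislead on imbalanced classes, consider a toy example in which a model never predicts Class 2. The helper name `class_metrics` and the labels below are illustrative, not project results.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
# A lazy model that never predicts class 2.
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = class_metrics(y_true, y_pred, positive=2)
```

The toy model reaches 70% accuracy while its recall and F1 for Class 2 are both 0, which is exactly the failure that overall accuracy hides.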

Part 8: Classification Models

We trained 3 classification models to predict attack severity class (0/1/2). Each model brings a different approach — from simple linear boundaries to complex ensemble methods.

Model	Accuracy	F1 (High casualties)
Logistic Regression	64%	0.48
Gradient Boosting	72%	0.50
Random Forest ⭐	74%	0.54

Winner: Random Forest — best accuracy AND best F1 for the critical "High casualties" class.

The confusion matrices below reveal how each model handles the 3 classes:

Key observations:

Logistic Regression struggles with Class 2 — only 66% recall for high casualties
Random Forest achieves the best balance across all 3 classes
Gradient Boosting performs well overall but falls short of Random Forest on Class 2
All models correctly identify "No casualties" attacks most reliably (majority class)

BONUS: Hyperparameter Tuning

We applied GridSearchCV to find the optimal hyperparameters for the Random Forest — systematically testing 12 parameter combinations with 3-fold cross validation (36 total fits).

Best parameters found: n_estimators=200, max_depth=None, min_samples_split=5
F1 (High casualties) improved: 0.54 → 0.58 (+7.4% improvement)
Overall accuracy remained stable at 74%, so the tuning improved F1 for the high-casualty class without sacrificing overall accuracy
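
A grid search of this shape could be set up as sketched below. Only the winning values (n_estimators=200, max_depth=None, min_samples_split=5) and the 12 x 3 = 36 fit count come from the description above; the other grid values, the Class 2 F1 scorer, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic 3-class data standing in for the real features.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Assumed grid: 2 x 3 x 2 = 12 combinations.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

# Score on F1 for class 2 only (an assumed choice of tuning metric).
f1_class2 = make_scorer(f1_score, labels=[2], average="macro")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=f1_class2,
    cv=3,
)
search.fit(X, y)

# 12 combinations x 3 folds.
n_fits = len(ParameterGrid(param_grid)) * 3
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` is refit on the full training data by default.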

The tuned model was saved as `classification_model.pkl` and uploaded to this repository.

Summary

This project demonstrates a complete data science pipeline applied to one of the most challenging real-world datasets available. Starting from raw terrorism records, we built a system that can predict casualty counts and classify attack severity with meaningful accuracy. The combination of feature engineering, log transformation, clustering, and hyperparameter tuning raised the regression R² from 0.145 to 0.235, a 62% relative improvement over the baseline model.

Author: Omer Inbar | Reichman University — Adelson School of Entrepreneurship | Introduction to Data Science | 2026

AI Usage Disclosure

This project was completed with assistance from Claude (Anthropic) for code debugging, chart design, and README writing. All analysis, decisions, and interpretations are my own.
