🌍 Global Terrorism Database: Classification, Regression & Clustering


Video Presentation


Project Overview

This project applies a full end-to-end data science pipeline to the Global Terrorism Database (GTD), one of the most comprehensive open-source datasets on terrorism, containing over 180,000 attacks worldwide from 1970 to 2017.

"What makes a terrorist attack deadly? Can we predict casualty counts, and classify attack severity, using historical patterns from 47 years of global terrorism data?"

Research Question: Can we predict the number of casualties in a terrorist attack based on its characteristics?
Dataset: Global Terrorism Database (GTD)
Dataset Size: 181,691 rows × 135 columns (sampled 50,000 rows)
Target Variable: nkill, the number of people killed per attack
Task Types: Regression + Classification + Clustering

Part 2: Data Cleaning & EDA

Data Cleaning

Out of 135 raw columns, we carefully selected 20 that were most relevant to predicting casualties. The cleaning process preserved as much data as possible while ensuring model reliability.

  • Selected 20 relevant columns out of 135 (date, location, attack type, weapon, target, casualties)
  • Dropped rows where nkill, our target variable, was missing (~2,795 rows removed)
  • Filled missing values in nwound, nhostkid, and claimed with 0 (assuming that no data means none reported)
  • Dropped rows with missing coordinates (~1,265 rows removed)
  • Final clean dataset: ~50,000 rows ready for modeling
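
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names follow the public GTD codebook (`nkill`, `nwound`, `nhostkid`, `claimed`, `latitude`, `longitude`), the helper name `clean_gtd` is hypothetical, and the demo rows are synthetic.

```python
import pandas as pd

# Column names follow the GTD codebook; the exact 20-column selection used
# in the project is not listed, so this subset is an assumption.
ZERO_FILL_COLS = ["nwound", "nhostkid", "claimed"]
COORD_COLS = ["latitude", "longitude"]


def clean_gtd(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a GTD-like frame (hypothetical helper)."""
    out = df.copy()
    # 1. Drop rows with no target value.
    out = out.dropna(subset=["nkill"])
    # 2. Treat missing counts/flags as "none reported".
    out[ZERO_FILL_COLS] = out[ZERO_FILL_COLS].fillna(0)
    # 3. Drop rows without usable coordinates.
    out = out.dropna(subset=COORD_COLS)
    return out.reset_index(drop=True)


# Tiny synthetic frame for illustration only (not real GTD records).
demo = pd.DataFrame({
    "nkill": [0, None, 3, 1],
    "nwound": [None, 2, 5, None],
    "nhostkid": [None, None, 1, 0],
    "claimed": [1, None, None, 0],
    "latitude": [33.3, 31.5, None, 34.5],
    "longitude": [44.4, 74.3, 69.2, 69.2],
})
cleaned = clean_gtd(demo)
print(cleaned)
```

In the demo, one row is dropped for a missing target and one for a missing latitude, leaving two rows with no remaining gaps in the zero-filled columns.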

Outlier Detection

Screenshot 2026-05-03 at 14.02.40

The nkill distribution is heavily right-skewed. The median is 0 killed (at least half of attacks kill nobody), but the maximum is 1,384 (9/11). Decision: keep all outliers, because they represent real historical events, not data errors.


Question 1: How has the number of terrorist attacks changed over time?

Screenshot 2026-05-03 at 14.04.11

Answer: Terrorist attacks increased dramatically over the past 5 decades, with a massive surge after 2010 driven by the rise of ISIS and instability across the Middle East. The peak came in 2014, with over 4,500 attacks in our sample alone. Notably, attacks dropped sharply right after 9/11 (likely due to heightened global security) before climbing again to unprecedented levels. The red dashed line marks 9/11 as a turning point in the history of global counter-terrorism.


Question 2: Which countries are most affected by terrorism?

Screenshot 2026-05-03 at 14.04.30

Answer: Iraq is the most affected country with 6,552 attacks, about 1.7 times second-place Pakistan (3,911) and nearly double third-place Afghanistan (3,337). A clear pattern emerges: every country in the top 10 is a developing nation experiencing political instability, civil conflict, or foreign military presence. Notably, Israel ranks #21 globally with 592 recorded attacks (highlighted in red as a personal point of context), showing that even countries


Question 3: What are the most common types of terrorist attacks?

Screenshot 2026-05-03 at 14.18.40

Answer: Bombing/Explosion is by far the most common attack type (22,890 attacks, nearly half of the sample), followed by Armed Assault (10,665) and Assassination (5,107). Hijacking is the rarest type but caused the single deadliest attack in history, 9/11.


Question 4: Which attack types are most deadly?

Screenshot 2026-05-03 at 14.19.02

Answer: Frequency and lethality don't always go together. Bombing/Explosion dominates in frequency (22,000+ attacks) but has moderate lethality (2.0 killed per attack on average). Hijacking is the rarest attack type but the most lethal on average (9.8 killed per attack), a figure driven by 9/11. Armed Assault sits in the middle, with high frequency and high total casualties.
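
A frequency-versus-lethality comparison like this one can be computed with a single groupby. The sketch below uses synthetic rows; `attacktype1_txt` is the GTD codebook name for the attack-type label and is assumed to be among the selected columns.

```python
import pandas as pd

# Synthetic rows for illustration only (not real GTD records).
demo = pd.DataFrame({
    "attacktype1_txt": ["Bombing/Explosion", "Bombing/Explosion",
                        "Armed Assault", "Hijacking"],
    "nkill": [0, 4, 3, 10],
})

# One row per attack type: how often it occurs and how deadly it is.
summary = (
    demo.groupby("attacktype1_txt")["nkill"]
    .agg(attacks="count", avg_killed="mean", total_killed="sum")
    .sort_values("avg_killed", ascending=False)
)
print(summary)
```

Sorting by `avg_killed` rather than `attacks` is what surfaces rare but lethal categories at the top.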


Question 5: Who are the deadliest terrorist organizations?

Screenshot 2026-05-04 at 12.45.57

Answer: ISIL is the deadliest organization with 9,752 people killed, followed by the Taliban (8,263) and Boko Haram (6,399). Together these three account for 24,414 deaths in the sample. Notably, all top 3 are modern organizations active post-2000, reflecting the dramatic rise of religiously motivated terrorism in the 21st century.


BONUS: Interactive World Map

Screenshot 2026-04-30 at 14.42.20

Note: The interactive version of this map (with hover tooltips showing exact attack counts per country) is available in the notebook. The image above is a static preview.

Global distribution of terrorist attacks from 1970–2017. Iraq dominates in dark red, with the Middle East and South Asia clearly the most affected regions.


Part 3: Baseline Model

Before building complex models, we established a Linear Regression baseline using only raw features: no engineering, no transformations. This gives us a reference point to measure how much our improvements actually help.

| Metric | Value |
|---|---|
| MAE | 2.49 |
| RMSE | 6.89 |
| R² | 0.145 |

The baseline explains only 14.5% of the variance in casualties, a humble but expected result. Predicting terrorism casualties is inherently difficult: about half of attacks kill nobody, while a handful of extreme outliers like 9/11 kill hundreds or more. A simple linear model cannot capture that dynamic.

This result set a clear challenge: can we do significantly better with feature engineering and more powerful models?
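
For reference, the three baseline metrics are defined as follows. This is a dependency-free sketch with toy numbers (not project data); the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # MAE: average absolute error, less sensitive to a few huge misses.
    mae = sum(abs(e) for e in errors) / n
    # RMSE: squares errors first, so large misses dominate.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R^2: share of variance explained relative to predicting the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Toy values: mostly small counts plus one large event.
y_true = [0, 0, 1, 2, 20]
y_pred = [1, 1, 1, 2, 5]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

The single large miss barely moves MAE but dominates RMSE, which is the same pattern seen in heavily skewed casualty data.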

Screenshot 2026-05-03 at 14.36.45


Part 4: Feature Engineering

Raw data alone is rarely enough. We engineered 4 new features designed to capture patterns that the original columns couldn't express on their own:

| Feature | Description | Intuition |
|---|---|---|
| decade | Which decade the attack occurred | Terrorism evolves over time: the 2010s look very different from the 1980s |
| suicide_success | Suicide attack that also succeeded | A suicide attacker who succeeds is far more lethal than one who doesn't |
| has_hostages | Whether hostages were taken | Hostage situations tend to escalate and involve more casualties |
| cluster | K-Means cluster ID (k=4) | Groups attacks by behavioral profile (see below) |
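
The first three features can be derived with a few vectorized expressions. The sketch below is an assumption about how they might be built: `iyear`, `suicide`, `success`, and `nhostkid` are GTD codebook column names, the helper `add_engineered_features` is hypothetical, and the exact definitions used in the project may differ. The cluster feature is not included here.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add decade, suicide_success and has_hostages (hypothetical helper)."""
    out = df.copy()
    # decade: floor the year to its decade, e.g. 1987 -> 1980.
    out["decade"] = (out["iyear"] // 10) * 10
    # suicide_success: 1 only when both binary flags are 1 (assumed definition).
    out["suicide_success"] = (
        (out["suicide"] == 1) & (out["success"] == 1)
    ).astype(int)
    # has_hostages: 1 when at least one hostage/kidnap victim was recorded.
    out["has_hostages"] = (out["nhostkid"] > 0).astype(int)
    return out


# Synthetic rows for illustration only.
demo = pd.DataFrame({
    "iyear": [1987, 2014, 2001],
    "suicide": [0, 1, 1],
    "success": [1, 1, 0],
    "nhostkid": [0, 0, 3],
})
features = add_engineered_features(demo)
print(features)
```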

K-Means Clustering (k=4)

We applied K-Means clustering on numeric features to automatically group attacks into 4 behavioral profiles. This is unsupervised learning working alongside our supervised models: the cluster label becomes a feature that carries information about the attack's overall pattern.

  • Cluster 2 captured the most deadly attacks with the highest average nkill
  • The cluster feature ranked #4 in feature importance out of 69 total features
  • This suggests that clustering added measurable predictive value rather than noise

Screenshot 2026-05-03 at 14.38.07
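
Turning K-Means output into a model feature can look like the sketch below. It uses scikit-learn with synthetic data; the scaling step and the `random_state` values are assumptions, since the write-up only states that k=4 was applied to numeric features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the numeric attack features (not GTD data).
X_numeric = rng.normal(size=(200, 5))

# Scaling is an assumption: K-Means is distance-based, so columns with
# large ranges would otherwise dominate the cluster assignment.
X_scaled = StandardScaler().fit_transform(X_numeric)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Append the cluster ID as one more feature column.
X_with_cluster = np.column_stack([X_numeric, cluster_labels])
print(X_with_cluster.shape)
```

In a train/test setup, fitting the scaler and K-Means on the training split only, then calling `predict` on the test split, keeps test information out of the feature.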

Part 5: Regression Models

We trained 6 models beyond the baseline: first 3 on the original nkill target, then 3 more on a log-transformed target (log1p(nkill)) to handle the extreme right skew of the data. The log transformation compresses the scale of outliers, making it easier for tree-based models to learn meaningful patterns.

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 2.49 | 6.89 | 0.145 |
| LR + Features | 2.49 | 6.89 | 0.147 |
| Random Forest | 2.13 | 7.05 | 0.106 |
| XGBoost | 2.18 | 6.97 | 0.127 |
| LR (log) | 1.98 | 7.22 | 0.061 |
| Random Forest (log) | 1.71 | 6.62 | 0.211 |
| XGBoost (log) ⭐ | 1.72 | 6.52 | 0.235 |

Winner: XGBoost with log-transformed target (R² = 0.235, MAE = 1.72)

Key takeaways:

  • Log transformation improved both Random Forest (R² 0.106 → 0.211) and XGBoost (R² 0.127 → 0.235), but not Linear Regression (R² 0.147 → 0.061)
  • XGBoost (log) reduced MAE by 31% compared to the baseline (2.49 → 1.72)
  • R² improved from 0.145 to 0.235, a 62% relative improvement over the baseline
  • Linear Regression showed almost no improvement from feature engineering (R² 0.145 → 0.147), suggesting the relationships are non-linear
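
The two percentages in the takeaways are relative changes against the baseline, which can be checked directly from the table values. The helper name `relative_change` is illustrative.

```python
def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100


mae_change = relative_change(2.49, 1.72)   # negative: MAE went down
r2_change = relative_change(0.145, 0.235)  # positive: R^2 went up

print(f"MAE change: {mae_change:.1f}%")
print(f"R^2 change: {r2_change:.1f}%")
```

Note that the R² figure is a relative gain: in absolute terms, explained variance rose by 9 percentage points (from 14.5% to 23.5%).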

An R² of 0.235 may seem modest, but terrorism casualty prediction is a hard regression problem, with extreme outliers, sparse data, and inherently random violence, so this result is meaningful and defensible.
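
Training on a log-transformed target and reporting errors on the original scale can be sketched with scikit-learn's `TransformedTargetRegressor`. This is an illustration under stated assumptions: a Random Forest stands in for XGBoost to keep the example to one library, the data is synthetic, and it is assumed (not stated in the write-up) that the reported metrics were computed after mapping predictions back with `expm1`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, right-skewed counts standing in for nkill (not GTD data).
X = rng.normal(size=(500, 4))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0] + 0.3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The wrapper fits the regressor on log1p(y) and applies expm1 to its
# predictions, so everything downstream stays on the original scale.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
print(f"MAE on original scale: {mae:.2f}")
```

Using `log1p` rather than `log` matters here because many targets are exactly 0, where a plain logarithm is undefined.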

Screenshot 2026-05-03 at 14.39.03

Feature Importance

Screenshot 2026-05-03 at 14.39.59

The top predictors reveal a clear story about what makes attacks deadly:

  • suicide (#1): Suicide attacks are significantly more lethal by design
  • Assassination (#2): Targeted killings are planned for maximum impact
  • Firearms (#3): Weapon type is a strong predictor of lethality
  • cluster (#4): Our engineered feature, which suggests that unsupervised clustering added real signal
  • nwound (#5): Injuries and deaths are strongly correlated

Residual Analysis

Screenshot 2026-05-03 at 14.40.44

The residuals are approximately centered around 0 with a near-normal distribution, which suggests the model has little systematic bias. The fan-shaped spread in the left plot (increasing variance at higher predicted values) is expected: extreme events like mass-casualty attacks are inherently harder to predict precisely, even with the best model.

Part 6: Model Saving

The winning model, XGBoost with a log-transformed target, was saved and uploaded to this HuggingFace repository for reproducibility and deployment.

| File | Size | Description |
|---|---|---|
| regression_model.pkl | 413 KB | XGBoost regression model (log-transformed) |
| classification_model.pkl | 191 MB | Tuned Random Forest classifier |
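
A save-and-reload round trip for a log-target model might look like the sketch below. It assumes the `.pkl` files were written with Python's `pickle` module (the write-up does not say whether `pickle` or `joblib` was used), and it uses a small scikit-learn model as a stand-in for the XGBoost regressor.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on log1p(y); the real file holds an XGBoost model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 3.0, 7.0])
model = LinearRegression().fit(X, np.log1p(y))

# Save to a temporary directory so the example leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "regression_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# The model predicts log1p(nkill), so map back to a casualty count.
pred_nkill = np.expm1(loaded.predict(X))
print(pred_nkill.round(2))
```

Forgetting the `expm1` step is an easy mistake when reusing a log-target model, since the raw predictions look like plausible small counts. Pickle files can execute code on load, so only unpickle files from a source you trust.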

Part 7: Regression → Classification

To make the problem more actionable, we converted the continuous nkill target into 3 meaningful severity classes. This allows emergency responders, intelligence analysts, and policymakers to think in terms of attack severity rather than exact body counts.

| Class | Definition | % of Data | Interpretation |
|---|---|---|---|
| 0 | 0 killed | 51.5% | No casualties: attack failed or was non-lethal |
| 1 | 1–3 killed | 34.2% | Low casualties: typical small-scale attack |
| 2 | 4+ killed | 14.3% | High casualties: mass casualty event |

The dataset is significantly imbalanced: Class 2 (high casualties) represents only 14.3% of attacks, yet it is by far the most important class to predict correctly. Missing a high-casualty attack is far more costly than a false alarm.
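
The class definitions translate into a small mapping function. The name `severity_class` is illustrative; the thresholds come from the class definitions above.

```python
def severity_class(nkill: int) -> int:
    """Map a fatality count to one of the 3 severity classes."""
    if nkill == 0:
        return 0  # no one killed
    if nkill <= 3:
        return 1  # 1 to 3 killed
    return 2      # 4 or more killed


counts = [0, 1, 3, 4, 25]
labels = [severity_class(n) for n in counts]
print(labels)
```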

Screenshot 2026-05-03 at 14.42.26

For this reason, we focused on F1 score and Recall for Class 2 as our primary evaluation metrics, not overall accuracy, which can be misleadingly high on imbalanced data.
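
To see why accuracy alone misleads on imbalanced labels, consider a toy classifier that always predicts the majority class. The sketch below computes per-class precision, recall, and F1 from scratch; the labels are made up and the function name `class_scores` is hypothetical.

```python
def class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: a model that always predicts the majority class 0.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = class_scores(y_true, y_pred, positive=2)
print(f"accuracy={accuracy:.2f} recall(class 2)={recall:.2f} F1(class 2)={f1:.2f}")
```

The model scores 60% accuracy while never detecting a single high-casualty attack, which is exactly the failure that Class 2 recall and F1 expose.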


Part 8: Classification Models

We trained 3 classification models to predict attack severity class (0/1/2). Each model brings a different approach, from simple linear boundaries to complex ensemble methods.

| Model | Accuracy | F1 (High casualties) |
|---|---|---|
| Logistic Regression | 64% | 0.48 |
| Gradient Boosting | 72% | 0.50 |
| Random Forest ⭐ | 74% | 0.54 |

Winner: Random Forest, with the best accuracy AND the best F1 for the critical "High casualties" class.

The confusion matrices below reveal how each model handles the 3 classes:

Screenshot 2026-05-03 at 14.43.52

Key observations:

  • Logistic Regression struggles with Class 2: only 66% recall for high casualties
  • Random Forest achieves the best balance across all 3 classes
  • Gradient Boosting performs well overall but falls short of Random Forest on Class 2
  • All models correctly identify "No casualties" attacks most reliably (majority class)

Screenshot 2026-05-03 at 14.44.04

BONUS: Hyperparameter Tuning

We applied GridSearchCV to find the optimal hyperparameters for the Random Forest, systematically testing 12 parameter combinations with 3-fold cross-validation (36 total fits).

  • Best parameters found: n_estimators=200, max_depth=None, min_samples_split=5
  • F1 (High casualties) improved: 0.54 → 0.58 (+7.4% relative improvement)
  • Overall accuracy remained stable at 74%, so tuning improved the Class 2 F1 score without sacrificing overall accuracy
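
A grid search of this shape can be sketched as follows. Only the winning values (200, None, 5) come from the results above; the other grid values, the `f1_macro` scoring choice, and the synthetic data are assumptions made so that the grid has 12 combinations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the severity labels (not GTD data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# Hypothetical grid: 2 x 3 x 2 = 12 combinations.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 12 combinations x 3 folds = 36 fits
    scoring="f1_macro",  # assumed; the scoring metric is not stated
)
search.fit(X, y)
print(search.best_params_)
```

By default `GridSearchCV` refits the best combination on the full training data, so `search.best_estimator_` is ready to evaluate on a held-out test set.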

Screenshot 2026-05-03 at 14.44.14

The tuned model was saved as classification_model.pkl and uploaded to this repository.

Summary

This project demonstrates a complete data science pipeline applied to a challenging real-world dataset. Starting from raw terrorism records, we built a system that can predict casualty counts and classify attack severity with meaningful accuracy. The combination of feature engineering, log transformation, clustering, and hyperparameter tuning resulted in a 62% relative improvement in R² over the baseline regression model (0.145 → 0.235).

Author: Omer Inbar | Reichman University, Adelson School of Entrepreneurship | Introduction to Data Science | 2026

AI Usage Disclosure

This project was completed with assistance from Claude (Anthropic) for code debugging, chart design, and README writing. All analysis, decisions, and interpretations are my own.
