CO₂ emissions and greenhouse gases over the past 70 years:
We wanted to examine how greenhouse gas emissions and CO₂ emissions per capita are distributed around the world.
Raw dataset features:
📘 Dataset Feature Dictionary (Lite Version)
🧭 Identifiers
- Description — Type of entry (Country, Region, World)
- Name — Country name
- year — Year of observation
- iso_code — ISO 3-letter country code
👥 Population & Economy
- population — Total population
- gdp — Gross Domestic Product (USD)
- energy_per_capita — Energy use per person
- energy_per_gdp — Energy intensity of the economy
🏭 CO₂ Emissions (Fossil + Industry)
- co2 — Total CO₂ emissions
- co2_per_capita — CO₂ per person
- co2_growth_abs / co2_growth_prct — Year-to-year emission change
- co2_per_gdp — CO₂ per unit of GDP
- co2_per_unit_energy — Emissions per unit of energy
🌱 CO₂ Including Land-Use Change (LUC)
- co2_including_luc — Total CO₂ including deforestation
- co2_including_luc_per_capita — Per-person CO₂ including LUC
- land_use_change_co2 — CO₂ from land-use activities
- land_use_change_co2_per_capita — Per-person land-use emissions
🔥 Sector Emissions
- coal_co2 / oil_co2 / gas_co2 / flaring_co2 — Emissions by fuel type
- cement_co2 — Emissions from cement production
- other_industry_co2 — Miscellaneous industrial CO₂
🧮 Cumulative (Historical) Emissions
- cumulative_co2 — Total historical fossil CO₂
- cumulative_co2_including_luc — Historical CO₂ incl. LUC
- cumulative_*_co2 — Historical totals by sector (coal, oil, gas, cement, etc.)
🌿 Other Greenhouse Gases
- methane / methane_per_capita — CH₄ emissions
- nitrous_oxide / nitrous_oxide_per_capita — N₂O emissions
- ghg_per_capita — All GHGs per person
- total_ghg — All greenhouse gases combined
🌡️ Temperature Impact
- temperature_change_from_co2 — Warming from CO₂
- temperature_change_from_ch4 — Warming from CH₄
- temperature_change_from_n2o — Warming from N₂O
- temperature_change_from_ghg — Warming from all GHGs
🌍 Global Shares
- share_global_co2 — Country’s share of global CO₂
- share_global_cumulative_co2 — Share of historical CO₂ responsibility
🔄 Trade
- trade_co2 — Net imported/exported CO₂ via trade
- trade_co2_share — Share of emissions affected by trade
Many of the features had large amounts of missing data, so we mapped them by completeness:
Head:

Handling unusable data: We removed features with more than 40% missing values. We also removed locations with less than 30% data completeness across years and features (e.g. aggregated GCP regions, international transport, microstates), since imputing them would introduce noise rather than signal.
Every location that fell below this 30% threshold was removed.
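The two filtering steps can be sketched with pandas as follows. This is a minimal sketch on toy data, not the project's code: the function name `filter_by_completeness`, the order of the two steps, and measuring location completeness only over the features that survive step 1 are all assumptions.

```python
import numpy as np
import pandas as pd

def filter_by_completeness(df, id_cols, name_col="Name",
                           max_feature_missing=0.40, min_location_complete=0.30):
    """Drop sparse feature columns first, then sparse locations (assumed order)."""
    feature_cols = [c for c in df.columns if c not in id_cols]
    # Step 1: keep features whose share of missing values is at most 40%.
    missing_share = df[feature_cols].isna().mean()
    kept = missing_share[missing_share <= max_feature_missing].index.tolist()
    # Step 2: per location, share of non-missing cells across years and kept features.
    row_completeness = df[kept].notna().mean(axis=1)
    completeness = row_completeness.groupby(df[name_col]).mean()
    good = completeness[completeness >= min_location_complete].index
    clean = df.loc[df[name_col].isin(good), id_cols + kept].reset_index(drop=True)
    return clean, completeness

# Toy data (hypothetical values): location C is mostly empty, trade_co2 is mostly empty.
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "co2": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
    "gdp": [5.0, 6.0, 7.0, 8.0, np.nan, 9.0],
    "trade_co2": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})
clean, completeness = filter_by_completeness(toy, id_cols=["Name", "year"])
print(clean)
print(completeness)
```

On the toy frame, `trade_co2` is dropped as a feature and location C is dropped as a row group, while A and B are kept in full.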
--------------------------------------------------------------------------------------------------
Global CO₂ trend over time (are we still going up?)
The top plot shows that total mean CO₂ emissions have risen steadily and strongly from 1950 to today, more than quadrupling over the period. The bottom plot shows that CO₂ per capita has stayed in a narrower band, with peaks in the 1950s–70s and a slight decline/flattening in recent decades, meaning total emissions growth is driven largely by more people and more emitting countries rather than a huge jump in emissions per person.
The top plot (global CO₂) can be explained by industrial development, growing car production, worldwide population growth, and the extraction of oil in large (and still growing) quantities over the past century.
The bottom plot (CO₂ per capita) can be explained by the global community's ongoing attempts to curb CO₂ growth and the shift to cleaner energy sources.

Top 10 emitters across countries:
The United States was the dominant CO₂ emitter from 1950 through the early 2000s, with emissions rising, then roughly plateauing and slightly declining after about 2005.
China shows a dramatic surge starting in the 1990s, overtaking the U.S. in the mid-2000s and continuing to rise steeply, while India’s emissions increase more gradually but steadily.
Russia peaks around the late 1980s/early 1990s and then drops, and most European countries (UK, France, Germany) show flatter or declining trends, suggesting some stabilization or reductions compared to the rapid growth in China and India.

Top 10 emitters across large regions:
All regions show rising CO₂ emissions over time, but Asia’s growth is by far the steepest, especially after about 1990, driving much of the global increase.
Europe and North America rise until around the 1970s–2000s and then flatten or decline, suggesting some stabilization or reductions, while Africa and South America grow steadily from a much lower base.
The world curve keeps climbing almost linearly, meaning reductions in Europe/North America have so far been more than offset by rapid growth in Asia and other developing regions.

CO₂ per capita across countries
Insights from the separate graphs (they are split only for readability):
These countries all have very high per-capita emissions, with sharp increases starting in the 1960s–1970s, especially in oil-exporting states like Qatar, Kuwait, Saudi Arabia, and the UAE.
Many show a peak followed by a decline or plateau (e.g., Qatar, UAE, Luxembourg, Canada, United States), suggesting some improvement in efficiency or climate policies, but from very high levels.
A few (like Trinidad and Tobago or Bahrain) show sustained high or rising emissions over long periods, indicating continued heavy dependence on fossil-fuel-intensive activity.

CO₂ per capita across regions:
North America and Oceania have had the highest CO₂ emissions per capita, peaking around the 1970s–2000s and then gradually declining, while Europe also peaks in the 1980s–1990s and then clearly trends downward.
Asia starts from very low per-capita levels but rises steadily, especially after 2000, eventually approaching or exceeding the world average, whereas Asia excluding China and India grows more moderately.
Africa and South America stay at the lowest per-capita levels, increasing slowly over time, and the world average climbs but begins to flatten in recent decades, reflecting declines in rich regions partly offset by growth in developing ones.

GHG vs CO₂
We can see a very strong, almost perfect correlation between the two.
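The strength of this relationship can be quantified with a Pearson correlation coefficient. A minimal sketch, assuming the two series come from the `co2` and `total_ghg` columns; the numbers below are made up for illustration and are not project data.

```python
import numpy as np

# Hypothetical totals for five country-years (same units for both series).
co2 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
total_ghg = np.array([13.0, 25.0, 38.0, 49.0, 63.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(co2, total_ghg)[0, 1]
print(f"Pearson r = {r:.4f}")
```

A value close to 1 is expected here because CO₂ is itself the largest component of total GHG, so the two series are not independent measurements.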

Methane and nitrous oxide and their connection to GHG:
For all countries, total GHG emissions rise strongly with both methane and nitrous oxide, showing that these gases are major contributors to overall climate impact, not just CO₂.
China and India occupy the upper-right parts of both plots, indicating very high absolute levels of methane and N₂O that scale with their large total GHG emissions.
Other large economies like the US, Russia, and Indonesia form separate clusters at lower but still substantial levels, suggesting different mixes of methane- and N₂O-intensive activities (e.g. agriculture, energy, land use) across countries.

Per capita:
Both methane and nitrous oxide per capita show strong positive relationships with total GHG per capita. Countries that emit more CH₄ or N₂O per person also tend to have higher overall greenhouse gas emissions.
Qatar, Kuwait, and the UAE stand out as extreme high emitters. These countries consistently appear in the top-right areas, indicating unusually high CH₄/N₂O per capita and very high GHG per capita—likely due to energy-intensive economies and oil/gas extraction.
Other countries cluster at much lower emission levels. Nations like New Zealand, Canada, Luxembourg, and Australia show moderate CH₄/N₂O but still elevated GHG per capita, while Mongolia and Brunei show different patterns but remain well below the Gulf states’ extremes.
The CH₄ scatter shows a stronger linear trend than N₂O. Methane emissions per capita more clearly predict higher total GHG per capita than nitrous oxide, suggesting methane’s larger contribution in these countries’ emission profiles.

After removing the outlier we can see a cleaner distribution.

Land use change over time, separated by country

Temperature change from each greenhouse gas, across time
temperature_change_from_co2: Across the whole period the United States has the largest CO₂-driven temperature contribution, rising almost linearly, while China starts very low but accelerates sharply after ~1990 and becomes the second-largest contributor. Other major economies (Russia, India, Brazil, etc.) show smoother, slower increases, highlighting that most of the historical CO₂ warming is concentrated in a few countries.
temperature_change_from_ch4: Methane-driven warming grows steadily for all countries, but China and the United States stand out with the steepest and highest trajectories, especially after 1980. India, Brazil, Indonesia and Nigeria also show clear upward trends, while some developed countries like the UK, Germany and Australia stay comparatively flat or even slightly decline in recent decades.
temperature_change_from_ghg (all gases): Total GHG-driven temperature change is dominated by the United States for the entire record, with a smooth, persistent rise that keeps it clearly above other countries. China’s contribution grows slowly at first then steepens strongly after about 1990, while Russia, India, Brazil and Indonesia show moderate but steadily increasing warming impacts.
temperature_change_from_n2o: Nitrous-oxide-driven warming increases in a step-like pattern, with the United States again having the largest cumulative contribution and continuing to rise over time. China and India show noticeable growth starting around the 1970s–1980s, whereas countries like Canada, France, Germany and the UK contribute smaller but gradually increasing amounts.

Population changes in the top 10 emitters, and their CO₂ emissions over time:
China and India’s populations grow very steeply and are now far larger than any other country, while the U.S., Russia, Japan and Europe’s big economies grow slowly or flatten.
CO₂ trends don’t just follow population: the U.S. CO₂ rises until ~2005 and then declines even as its population keeps growing, which suggests some decoupling via efficiency, fuel mix changes, etc.
China’s CO₂ explodes after the 1990s, far faster than its population, showing how industrialization and energy intensity drive emissions; India’s CO₂ also rises steadily but from a much lower level.
Several developed countries (UK, Germany, Japan) show flat or falling CO₂ with almost stable populations, suggesting successful emission reductions compared with the rapid growth in emerging economies.

--------------------------------------------------------------------------------------------------
Define and Train a baseline model
Goal: Predict a country–year’s CO₂ emissions per capita (co2_per_capita) from demographic, emissions, and temperature–related features (population, other gases, land-use change, etc.).

Evaluation (MAE, MSE, RMSE, R²)
The baseline linear regression explains almost all of the variation in CO₂ per capita (R² ≈ 0.997), with a very small typical error (RMSE ≈ 0.34 tons per person, MAE ≈ 0.15). This suggests that CO₂ per capita is almost perfectly determined by the other emissions-related features in the dataset.
Printed metrics (these are smaller than the rounded values quoted above and in the model summary, so the two sets of numbers should be reconciled against the final run):
MAE : 0.012955227058249101
MSE : 0.0021803763327689954
RMSE: 0.04669450002697315
R² : 0.9999574649305762
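For reference, the four metrics are defined as below. This is a plain-Python sketch on toy values; the function name `regression_metrics` is hypothetical, and the project most likely used library functions instead. The last lines check that the printed RMSE is the square root of the printed MSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n                               # mean squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Toy values, not project data.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)

# Consistency check on the printed values from the report.
printed_mse = 0.0021803763327689954
printed_rmse = 0.04669450002697315
print(math.isclose(math.sqrt(printed_mse), printed_rmse, rel_tol=1e-9))
```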
Feature importance from coefficients
The largest coefficients are attached to different CO₂ measures (total CO₂, CO₂ including land-use change, and CO₂ per capita including land-use change), indicating that the target is primarily determined by other highly related CO₂ variables.
Land-use change CO₂ and its per-capita version also have strong effects, showing that land-use emissions play an important role. Other gases (methane, nitrous oxide, total GHG) contribute additional predictive power, but their coefficients are smaller.
The mix of positive and negative signs among similar variables suggests strong multicollinearity, so individual coefficients should be interpreted with caution.
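One likely reason for both the near-perfect fit and the opposite-sign coefficients: if co2_including_luc_per_capita equals co2_per_capita plus land_use_change_co2_per_capita (which the feature dictionary suggests, but which is not verified here), then the target is an exact linear combination of two input features. A minimal sketch on synthetic numbers, not project data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic per-capita values (assumption: purely illustrative ranges).
co2_pc = rng.uniform(0.1, 20.0, n)        # target: fossil CO2 per capita
luc_pc = rng.normal(0.5, 1.0, n)          # land-use-change CO2 per capita
co2_incl_luc_pc = co2_pc + luc_pc         # assumed identity between the columns

# Ordinary least squares with an intercept column.
X = np.column_stack([co2_incl_luc_pc, luc_pc, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, co2_pc, rcond=None)
print(coef)
```

Under that assumption the fit recovers coefficients of about +1 and -1 with a zero intercept, i.e. the model simply subtracts the land-use term. If this holds in the real data, it would be worth reporting a second baseline that excludes the "including LUC" columns.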

Feature Engineering
New features:
1. Log population (handles the huge scale)
2. Land-use share of CO₂: how much of CO₂ incl. land-use is from land-use change?
3. Methane & N₂O share of total GHG
4. Time feature: years since 1950
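These four features could be derived along the following lines. A minimal sketch on toy data: the input column names come from the feature dictionary, `land_use_share` is the name used later in this report, and the other output names, the use of `log1p`, and the zero-denominator handling are assumptions.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the four engineered features added."""
    out = df.copy()
    # log1p guards against zero populations; a plain log would also fit the description.
    out["log_population"] = np.log1p(out["population"])
    # Zero denominators become NaN so the shares are NaN instead of inf.
    luc_denom = out["co2_including_luc"].replace(0, np.nan)
    ghg_denom = out["total_ghg"].replace(0, np.nan)
    out["land_use_share"] = out["land_use_change_co2"] / luc_denom
    out["methane_share"] = out["methane"] / ghg_denom
    out["n2o_share"] = out["nitrous_oxide"] / ghg_denom
    out["years_since_1950"] = out["year"] - 1950
    return out

# Toy rows (hypothetical values).
toy = pd.DataFrame({
    "year": [1950, 2020],
    "population": [1_000_000, 50_000_000],
    "co2_including_luc": [10.0, 0.0],
    "land_use_change_co2": [4.0, 0.0],
    "total_ghg": [20.0, 100.0],
    "methane": [5.0, 20.0],
    "nitrous_oxide": [1.0, 10.0],
})
features = add_engineered_features(toy)
print(features[["log_population", "land_use_share", "methane_share",
                "n2o_share", "years_since_1950"]])
```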
Applying clustering (unsupervised learning): creating a new feature called 'cluster'
Some visuals and interpretation of the model
K-Means clustering on per-capita emissions and population produced four distinct 'emission profiles' for countries. Cluster 3 groups a few small but extremely high-emission countries (extreme outliers that were candidates for removal), cluster 2 contains high mixed-GHG emitters, cluster 1 contains very populous countries with moderate per-capita emissions, and cluster 0 contains low-industrial countries where land-use change dominates. We then use the cluster label (emission_cluster) as an additional feature in our regression models.
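To make the clustering step concrete, here is a small NumPy version of the k-means (Lloyd's) algorithm with k = 4. It is a sketch under stated assumptions, not the project's code: the project presumably used a library implementation, and the two toy columns (CO₂ per capita, log population), the standardization, and the deterministic farthest-point seeding are all choices made for this illustration.

```python
import numpy as np

def standardize(X):
    """Zero mean, unit variance per column so no feature dominates the distance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def farthest_point_init(X, k):
    """Deterministic seeding: row 0, then repeatedly the row farthest from chosen seeds."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every row.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid moves to the mean of its rows (kept if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy rows: [co2_per_capita, log_population] (hypothetical values).
toy = np.array([
    [0.5, 16.0], [0.7, 16.5],    # low emitters
    [8.0, 17.0], [9.0, 17.5],    # high emitters
    [2.0, 21.0], [2.5, 20.8],    # very populous, moderate per capita
    [35.0, 14.0], [38.0, 14.3],  # small population, extreme per capita
])
labels, centroids = kmeans(standardize(toy), k=4)
print(labels)
```

The resulting label per row is what would be stored as the emission_cluster feature. Cluster numbers themselves are arbitrary, so the numbering in this toy run need not match the cluster 0 to 3 descriptions above.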

We wanted to see what cluster 3 consists of, and whether we can explain its extreme CO₂ emissions relative to population. Indeed, cluster 3 contains Bahrain, Kuwait and Qatar, which are among the world's major oil and gas exporters and have small populations; this explains the large gap between them and other big oil producers. In the end I did not drop them, as they are large contributors to total global emissions.
['Bahrain', 'Kuwait', 'Qatar']
Train and Evaluate Three Improved Models
Summary on regression models:
Metrics:
Baseline linear (raw features) – MAE ≈ 0.15, RMSE ≈ 0.34, R² ≈ 0.997
Model 1 – Linear (engineered) – MAE = 0.019, RMSE = 0.070, R² = 0.9999
Model 2 – Random Forest – MAE = 0.097, RMSE = 0.342, R² = 0.9977
Model 3 – Gradient Boosting – MAE = 0.243, RMSE = 0.475, R² = 0.9956
And the Grammy for the best regression model goes to... Model 1 - Linear (engineered)
Conclusion:
Model 1 is by far the best: errors are much smaller than baseline and R² is almost 1.
Random Forest (Model 2) is only slightly better than baseline: its MAE is lower (0.097 vs 0.15), but its RMSE is essentially the same (0.342 vs 0.34).
Gradient Boosting (Model 3) is actually worse than baseline.
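The ranking can be reproduced directly from the rounded metrics listed above. A small sketch; the dictionary layout and the choice of RMSE as the sort key are assumptions made for illustration.

```python
# Rounded metrics copied from the summary above.
results = {
    "Baseline linear (raw features)": {"MAE": 0.15, "RMSE": 0.34, "R2": 0.997},
    "Model 1 - Linear (engineered)": {"MAE": 0.019, "RMSE": 0.070, "R2": 0.9999},
    "Model 2 - Random Forest": {"MAE": 0.097, "RMSE": 0.342, "R2": 0.9977},
    "Model 3 - Gradient Boosting": {"MAE": 0.243, "RMSE": 0.475, "R2": 0.9956},
}

# Lower RMSE is better, so sort ascending.
ranking = sorted(results, key=lambda name: results[name]["RMSE"])
best = ranking[0]
for name in ranking:
    print(f"{name}: RMSE={results[name]['RMSE']}, MAE={results[name]['MAE']}")
```

Sorting by MAE instead would swap the baseline and Random Forest, which is why those two are best described as roughly tied.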
Feature importance:
All models agree that CO₂-related variables (especially per-capita and including land-use) and GHG per capita are the key drivers of CO₂ per capita; engineered features like land_use_share and gas shares also matter, but less.
Model 2 – RandomForestRegressor

Model 3 – GradientBoostingRegressor

--------------------------------------------------------------------------------------------------
Regression-to-Classification
I converted co2_per_capita into three classes using quantile binning (bottom 33%, middle 33%, top 33%). This gives interpretable classes of low, medium, and high per-capita emitters while keeping the class sizes relatively balanced. I use the same engineered features as in the regression models.
The three emission classes are roughly balanced (each around 1/3 of the data), so I can use accuracy along with macro F1 as evaluation metrics in the classification part. Class imbalance is not a major issue here.
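Quantile binning into three classes can be sketched with pandas `qcut`. The values below are toy numbers, and the mapping 0 = low, 1 = medium, 2 = high is an assumption based on the class labels in the reports that follow.

```python
import pandas as pd

# Toy per-capita values (hypothetical), one per country-year.
co2_pc = pd.Series([0.2, 0.5, 1.1, 2.3, 3.8, 4.9, 7.5, 12.0, 30.1],
                   name="co2_per_capita")

# q=3 cuts at the 1/3 and 2/3 quantiles, so each class gets about a third of the rows.
emission_class = pd.qcut(co2_pc, q=3, labels=[0, 1, 2])
counts = emission_class.value_counts().sort_index()
print(counts)
```

Because the cut points are quantiles of the data itself, the classes stay balanced even though the raw values are heavily skewed toward a few very high emitters.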

Train & Eval Classification Models
In this task, recall for the high-emission class is more important than precision. I prefer to catch as many truly high emitters as possible, even if I sometimes mislabel medium emitters as high. False negatives are more critical than false positives: missing a high-emission country is worse than mistakenly flagging a medium-emission one.
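The per-class precision, recall and F1 values in the reports below, and the macro F1 used as a summary, follow these definitions. A plain-Python sketch on toy labels; the function name `per_class_scores` is hypothetical and class 2 is assumed to be the high-emission class.

```python
def per_class_scores(y_true, y_pred, classes):
    """Precision, recall and F1 for each class, computed one-vs-rest."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly flagged as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # missed members of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy labels (hypothetical): 0 = low, 1 = medium, 2 = high.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2, 2, 1]
scores = per_class_scores(y_true, y_pred, classes=[0, 1, 2])

# Macro F1 is the unweighted mean of the per-class F1 values.
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
high_recall = scores[2]["recall"]
print(scores)
print(macro_f1, high_recall)
```

In the toy example one of the four true high emitters is predicted as medium, so recall for class 2 is 3/4; that missed country is exactly the kind of false negative this section treats as the costlier error.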
Model A – Logistic Regression
Accuracy: 0.9730025538124772
              precision    recall  f1-score   support
           0      0.963     0.977     0.970       920
           1      0.966     0.955     0.960       940
           2      0.992     0.988     0.990       881
    accuracy                          0.973      2741
   macro avg      0.973     0.973     0.973      2741
weighted avg      0.973     0.973     0.973      2741

Model B – Random Forest Classifier
Accuracy: 0.9839474644290405
              precision    recall  f1-score   support
           0      0.991     0.984     0.987       920
           1      0.974     0.980     0.977       940
           2      0.988     0.989     0.988       881
    accuracy                          0.984      2741
   macro avg      0.984     0.984     0.984      2741
weighted avg      0.984     0.984     0.984      2741

Feature importance for the best classifier - Random Forest:

Model C – Gradient Boosting Classifier
Accuracy: 0.9795695001824152
              precision    recall  f1-score   support
           0      0.987     0.975     0.981       920
           1      0.965     0.976     0.970       940
           2      0.988     0.989     0.988       881
    accuracy                          0.980      2741
   macro avg      0.980     0.980     0.980      2741
weighted avg      0.980     0.980     0.980      2741

Evaluation:
Logistic Regression achieves about 97% accuracy with balanced precision and recall across all three emission classes. The confusion matrix shows that almost all errors are between neighbouring classes, and recall for the high-emitter class is very high (0.988), which is important for our use case.
The Random Forest classifier achieves about 98% accuracy and very high F1 scores for all three classes. The confusion matrix shows only a handful of mistakes, mostly between neighbouring classes (low vs medium and medium vs high), and recall for the high-emitter class is ~0.99, so it almost never misses a high-emission country. This makes Random Forest the best classification model among the three.
Gradient Boosting reaches about 98.0% accuracy with very balanced performance across the three classes (macro F1 ≈ 0.980). Most mistakes are small: it only confuses neighbouring classes (low↔medium, medium↔high) and never predicts low as high or the opposite. It’s especially strong on the high-emitter class, with precision ≈ 0.988 and recall ≈ 0.989, so it almost never misses a high-emission country.
Although all of the models achieve very high scores, the Random Forest classifier has the best overall accuracy and macro F1, so it is the one selected to be uploaded.




