
CO₂ emissions and greenhouse gases across the past 70 years:

We wanted to examine the spread of greenhouse gases and CO₂ emissions per capita around the world.

Raw dataset features:

📘 Dataset Feature Dictionary (Lite Version)

🧭 Identifiers

Description — Type of entry (Country, Region, World)
Name — Country name
year — Year of observation
iso_code — ISO 3-letter country code

👥 Population & Economy

population — Total population
gdp — Gross Domestic Product (USD)
energy_per_capita — Energy use per person
energy_per_gdp — Energy intensity of the economy

🏭 CO₂ Emissions (Fossil + Industry)

co2 — Total CO₂ emissions
co2_per_capita — CO₂ per person
co2_growth_abs / co2_growth_prct — Year-to-year emission change
co2_per_gdp — CO₂ per unit of GDP
co2_per_unit_energy — Emissions per unit of energy

🌱 CO₂ Including Land-Use Change (LUC)

co2_including_luc — Total CO₂ including deforestation
co2_including_luc_per_capita — Per-person CO₂ including LUC
land_use_change_co2 — CO₂ from land-use activities
land_use_change_co2_per_capita — Per-person land-use emissions

🔥 Sector Emissions

coal_co2 / oil_co2 / gas_co2 / flaring_co2 — Emissions by fuel type
cement_co2 — Emissions from cement production
other_industry_co2 — Miscellaneous industrial CO₂

🧮 Cumulative (Historical) Emissions

cumulative_co2 — Total historical fossil CO₂
cumulative_co2_including_luc — Historical CO₂ incl. LUC
cumulative_*_co2 — Historical totals by sector (coal, oil, gas, cement, etc.)

🌿 Other Greenhouse Gases

methane / methane_per_capita — CH₄ emissions
nitrous_oxide / nitrous_oxide_per_capita — N₂O emissions
ghg_per_capita — All GHGs per person
total_ghg — All greenhouse gases combined

🌡️ Temperature Impact

temperature_change_from_co2 — Warming from CO₂
temperature_change_from_ch4 — Warming from CH₄
temperature_change_from_n2o — Warming from N₂O
temperature_change_from_ghg — Warming from all GHGs

🌍 Global Shares

share_global_co2 — Country’s share of global CO₂
share_global_cumulative_co2 — Share of historical CO₂ responsibility

🔄 Trade

trade_co2 — Net imported/exported CO₂ via trade
trade_co2_share — Share of emissions affected by trade

A lot of the features had large amounts of missing data, so we mapped them by completeness. Head:

Handling unusable data: We removed entire features with more than 40% missing values. We also removed locations with less than 30% data completeness across features (e.g. aggregated GCP regions, international transport, microstates), since imputing them would introduce noise rather than signal.

Data completeness by year:

and after removing:

Missing values by feature:

After the drop:

Countries with less than 30% of data across years and features (all of these were removed):
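The two cleaning thresholds described above (drop features with more than 40% missing values, then drop locations with less than 30% completeness) can be sketched roughly as follows. This is a minimal sketch on toy data, not the original notebook code: the helper name `drop_sparse`, its parameters, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, id_cols, location_col="Name",
                max_feature_missing=0.40, min_location_completeness=0.30):
    """Hypothetical helper: drop sparse feature columns, then sparse locations."""
    feature_cols = [c for c in df.columns if c not in id_cols]
    # Share of missing values per feature column.
    missing_share = df[feature_cols].isna().mean()
    kept = missing_share[missing_share <= max_feature_missing].index.tolist()
    df = df[id_cols + kept]
    # Share of non-missing cells per location, over all years and kept features.
    completeness = df[kept].notna().groupby(df[location_col]).mean().mean(axis=1)
    keep_locations = completeness[completeness >= min_location_completeness].index
    return df[df[location_col].isin(keep_locations)].reset_index(drop=True)

# Toy data (made up): trade_co2 is mostly missing, location "C" is mostly empty.
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "co2": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
    "gdp": [5.0, 6.0, 7.0, np.nan, np.nan, 1.0],
    "trade_co2": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})
cleaned = drop_sparse(toy, id_cols=["Name", "year"])
print(cleaned)
```

Features are filtered first so that a location is not penalized for gaps in columns that are dropped anyway; the order used in the original analysis is not stated, so this ordering is an assumption.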

                  --------------------------------------------------------------------------------------------------

Global CO₂ trend over time (are we still going up?)

The top plot shows that total mean CO₂ emissions have risen steadily and strongly from 1950 to today, more than quadrupling over the period. The bottom plot shows that CO₂ per capita has stayed in a narrower band, with peaks in the 1950s–70s and a slight decline/flattening in recent decades, meaning total emissions growth is driven largely by more people and more emitting countries rather than a huge jump in emissions per person.

The upper plot (global CO₂) can be explained by the development of industry, the production of cars and other goods, population growth around the world, and the extraction of oil in large (and still growing) quantities over the past century.

The bottom plot (CO₂ per capita) can be explained by the ongoing attempts of the global community to control growing CO₂ emissions and to move to cleaner energy sources.
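As a rough illustration of how the two trend lines can be computed, the sketch below averages `co2` and `co2_per_capita` over all locations for each year. It assumes the plots used the yearly mean across countries (the text says "total mean"); the toy values are made up.

```python
import pandas as pd

# Toy country-year table (made-up values) with the dataset's column names.
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "B"],
    "year": [1950, 1950, 2000, 2000],
    "co2": [10.0, 30.0, 50.0, 110.0],
    "co2_per_capita": [2.0, 4.0, 3.0, 5.0],
})

# Yearly mean across locations for both series.
trend = toy.groupby("year")[["co2", "co2_per_capita"]].mean()

# Growth factor of mean total CO2 between the first and last year.
growth = trend["co2"].iloc[-1] / trend["co2"].iloc[0]
print(trend)
print(f"mean CO2 grew by a factor of {growth:.1f}")
```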

Top 10 emitters across countries:

The United States was the dominant CO₂ emitter from 1950 through the early 2000s, with emissions rising, then roughly plateauing and slightly declining after about 2005.
China shows a dramatic surge starting in the 1990s, overtaking the U.S. in the mid-2000s and continuing to rise steeply, while India’s emissions increase more gradually but steadily.
Russia peaks around the late 1980s/early 1990s and then drops, and most European countries (UK, France, Germany) show flatter or declining trends, suggesting some stabilization or reductions compared to the rapid growth in China and India.

Top 10 emitters across large regions:

All regions show rising CO₂ emissions over time, but Asia’s growth is by far the steepest, especially after about 1990, driving much of the global increase.
Europe and North America rise until around the 1970s–2000s and then flatten or decline, suggesting some stabilization or reductions, while Africa and South America grow steadily from a much lower base.
The world curve keeps climbing almost linearly, meaning reductions in Europe/North America have so far been more than offset by rapid growth in Asia and other developing regions.

CO₂ per capita across countries

Insights from the separate graphs (they are split only for convenience):

These countries all have very high per-capita emissions, with sharp increases starting in the 1960s–1970s, especially in oil-exporting states like Qatar, Kuwait, Saudi Arabia, and the UAE.
Many show a peak followed by a decline or plateau (e.g., Qatar, UAE, Luxembourg, Canada, United States), suggesting some improvement in efficiency or climate policies, but from very high levels.
A few (like Trinidad and Tobago or Bahrain) show sustained high or rising emissions over long periods, indicating continued heavy dependence on fossil-fuel-intensive activity.

CO₂ per capita across regions:

North America and Oceania have had the highest CO₂ emissions per capita, peaking around the 1970s–2000s and then gradually declining, while Europe also peaks in the 1980s–1990s and then clearly trends downward.
Asia starts from very low per-capita levels but rises steadily, especially after 2000, eventually approaching or exceeding the world average, whereas Asia excluding China and India grows more moderately.
Africa and South America stay at the lowest per-capita levels, increasing slowly over time, and the world average climbs but begins to flatten in recent decades, reflecting declines in rich regions partly offset by growth in developing ones.

GHG vs CO₂: we can see a very strong, almost perfect correlation between the two.
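One way to put a number on that relationship is the Pearson correlation between `co2` and `total_ghg`. The sketch below uses synthetic data, so the printed value says nothing about the real dataset; the coefficient 1.3 and the noise level are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
co2 = rng.uniform(1, 100, size=200)
# Synthetic total GHG: mostly proportional to CO2, plus a little noise.
total_ghg = 1.3 * co2 + rng.normal(scale=2.0, size=200)
df = pd.DataFrame({"co2": co2, "total_ghg": total_ghg})

# Pearson correlation coefficient between the two columns.
r = df["co2"].corr(df["total_ghg"])
print(f"Pearson r = {r:.4f}")
```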

Methane and nitrous oxide and their connection to GHG:

For all countries, total GHG emissions rise strongly with both methane and nitrous oxide, showing that these gases are major contributors to overall climate impact, not just CO₂.
China and India occupy the upper-right parts of both plots, indicating very high absolute levels of methane and N₂O that scale with their large total GHG emissions.
Other large economies like the US, Russia, and Indonesia form separate clusters at lower but still substantial levels, suggesting different mixes of methane- and N₂O-intensive activities (e.g. agriculture, energy, land use) across countries.

Per capita:

Both methane and nitrous oxide per capita show strong positive relationships with total GHG per capita. Countries that emit more CH₄ or N₂O per person also tend to have higher overall greenhouse gas emissions.
Qatar, Kuwait, and the UAE stand out as extreme high emitters. These countries consistently appear in the top-right areas, indicating unusually high CH₄/N₂O per capita and very high GHG per capita—likely due to energy-intensive economies and oil/gas extraction.
Other countries cluster at much lower emission levels. Nations like New Zealand, Canada, Luxembourg, and Australia show moderate CH₄/N₂O but still elevated GHG per capita, while Mongolia and Brunei show different patterns but remain well below the Gulf states’ extremes.
The CH₄ scatter shows a stronger linear trend than N₂O. Methane emissions per capita more clearly predict higher total GHG per capita than nitrous oxide, suggesting methane’s larger contribution in these countries’ emission profiles.

After removing the outlier we can see a cleaner distribution.

Land use change over time, separated by country

Temperature change from each greenhouse gas, across time

temperature_change_from_co2: Across the whole period the United States has the largest CO₂-driven temperature contribution, rising almost linearly, while China starts very low but accelerates sharply after ~1990 and becomes the second-largest contributor. Other major economies (Russia, India, Brazil, etc.) show smoother, slower increases, highlighting that most of the historical CO₂ warming is concentrated in a few countries.
temperature_change_from_ch4: Methane-driven warming grows steadily for all countries, but China and the United States stand out with the steepest and highest trajectories, especially after 1980. India, Brazil, Indonesia and Nigeria also show clear upward trends, while some developed countries like the UK, Germany and Australia stay comparatively flat or even slightly decline in recent decades.
temperature_change_from_ghg (all gases): Total GHG-driven temperature change is dominated by the United States for the entire record, with a smooth, persistent rise that keeps it clearly above other countries. China’s contribution grows slowly at first then steepens strongly after about 1990, while Russia, India, Brazil and Indonesia show moderate but steadily increasing warming impacts.
temperature_change_from_n2o: Nitrous-oxide-driven warming increases in a step-like pattern, with the United States again having the largest cumulative contribution and continuing to rise over time. China and India show noticeable growth starting around the 1970s–1980s, whereas countries like Canada, France, Germany and the UK contribute smaller but gradually increasing amounts.

Population changes in the top 10 emitters, and their CO₂ over time:

China and India’s populations grow very steeply and are now far larger than any other country, while the U.S., Russia, Japan and Europe’s big economies grow slowly or flatten.
CO₂ trends don’t just follow population: the U.S. CO₂ rises until ~2005 and then declines even as its population keeps growing → some decoupling via efficiency, fuel mix changes, etc.
China’s CO₂ explodes after the 1990s, far faster than its population, showing how industrialization and energy intensity drive emissions; India’s CO₂ also rises steadily but from a much lower level.
Several developed countries (UK, Germany, Japan) show flat or falling CO₂ with almost stable populations, suggesting successful emission reductions compared with the rapid growth in emerging economies.
```
               --------------------------------------------------------------------------------------------------
```

Define and Train a baseline model

Goal: Predict a country–year’s CO₂ emissions per capita (co2_per_capita) from demographic, emissions, and temperature–related features (population, other gases, land-use change, etc.).

Evaluation (MAE, MSE, RMSE, R²): The baseline linear regression explains almost all of the variation in CO₂ per capita (R² ≈ 0.997), with a very small typical error (RMSE ≈ 0.34 tons per person, MAE ≈ 0.15). This suggests that CO₂ per capita is almost perfectly determined by the other emissions-related features in the dataset.

MAE : 0.012955227058249101
MSE : 0.0021803763327689954
RMSE: 0.04669450002697315
R² : 0.9999574649305762
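For reference, the four metrics can be computed with scikit-learn as in the sketch below. It fits a plain `LinearRegression` on synthetic data, so the split, the random seed, and the resulting numbers are assumptions and will not match the values reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the co2_per_capita target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```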

Feature importance from coefficients

The largest coefficients are attached to different CO₂ measures (total CO₂, CO₂ including land-use change, and CO₂ per capita including land-use change), indicating that the target is primarily determined by other highly related CO₂ variables.

Land-use change CO₂ and its per-capita version also have strong effects, showing that land-use emissions play an important role. Other gases (methane, nitrous oxide, total GHG) contribute additional predictive power, but their coefficients are smaller.

The mix of positive and negative signs among similar variables suggests strong multicollinearity, so individual coefficients should be interpreted with caution.
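A common way to check for multicollinearity is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). This was not part of the original analysis; the sketch below is a NumPy-only illustration on synthetic columns, and the function name `vif` is an assumption.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the other columns plus an intercept.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.01, size=500)  # nearly a copy of a
c = rng.normal(size=500)                  # independent column
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

Values far above roughly 10 are usually read as a sign that a feature is almost a linear combination of the others, which is the situation that produces unstable coefficient signs.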

Feature Engineering

New features:
1. Log population (to handle the huge scale)
2. Land-use share of CO₂: how much of CO₂ incl. land-use is from land-use change?
3. Methane & N₂O share of total GHG
4. Time feature: years since 1950
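The four engineered features could be built roughly like this. It is a sketch under assumptions: only `land_use_share` is named elsewhere in this card, so the other new column names, the use of `log1p`, and the handling of zero denominators are illustrative choices.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Add the four engineered features (new column names are illustrative)."""
    out = df.copy()
    # 1. Log population; log1p is an assumption to stay safe at zero.
    out["log_population"] = np.log1p(out["population"])
    # 2. Land-use share of CO2 incl. LUC; zero totals become NaN, not inf.
    out["land_use_share"] = (
        out["land_use_change_co2"] / out["co2_including_luc"].replace(0, np.nan)
    )
    # 3. Methane and N2O shares of total GHG.
    ghg = out["total_ghg"].replace(0, np.nan)
    out["methane_share"] = out["methane"] / ghg
    out["n2o_share"] = out["nitrous_oxide"] / ghg
    # 4. Time feature.
    out["years_since_1950"] = out["year"] - 1950
    return out

# Made-up rows, only to exercise the function.
toy = pd.DataFrame({
    "year": [2000, 2010],
    "population": [1_000_000.0, 2_000_000.0],
    "co2_including_luc": [10.0, 0.0],
    "land_use_change_co2": [2.0, 0.0],
    "total_ghg": [20.0, 40.0],
    "methane": [5.0, 8.0],
    "nitrous_oxide": [1.0, 4.0],
})
features = add_engineered_features(toy)
print(features.T)
```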

Applying Clustering (Unsupervised Learning): Creating a new feature called 'cluster'

Some visuals and an interpretation of the model:

K-Means clustering on per-capita emissions and population produced four distinct 'emission profiles' for countries. Cluster 3 groups a few small but extremely high-emission countries (an extreme outlier group that we considered removing), cluster 2 contains high mixed-GHG emitters, cluster 1 contains very populous countries with moderate per-capita emissions, and cluster 0 contains low-industrial countries where land-use change dominates. We then use the cluster label (emission_cluster) as an additional feature in our regression models.

We wanted to see what cluster 3 consists of, and whether we can explain its extreme CO₂ emissions relative to population. Indeed, cluster 3 contains Bahrain, Kuwait and Qatar, which are among the biggest oil exporters in the world and have small populations, which explains the large gap between them and other big oil producers. In the end I did not drop them, as they are large contributors to total global emissions.

['Bahrain', 'Kuwait', 'Qatar']
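The clustering step can be sketched as follows: scale the inputs, fit K-Means with four clusters, and store the label as `emission_cluster`. The input columns (`log_population`, `ghg_per_capita`), the use of `StandardScaler`, and the seed are assumptions; the card only says the clustering used per-capita emissions and population.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Four well-separated synthetic groups stand in for the real country-year rows.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])
toy = pd.DataFrame(points, columns=["log_population", "ghg_per_capita"])

# Scale first so that no single feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(toy)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
toy["emission_cluster"] = kmeans.fit_predict(scaled)

print(toy["emission_cluster"].value_counts().sort_index())
```

K-Means cluster ids are arbitrary, so the numbering (0 to 3) from a rerun will not necessarily match the cluster descriptions above.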

Train and Evaluate Three Improved Models

Summary on regression models:

Metrics:

Baseline linear (raw features) – MAE ≈ 0.15, RMSE ≈ 0.34, R² ≈ 0.997
Model 1 – Linear (engineered) – MAE = 0.019, RMSE = 0.070, R² = 0.9999
Model 2 – Random Forest – MAE = 0.097, RMSE = 0.342, R² = 0.9977
Model 3 – Gradient Boosting – MAE = 0.243, RMSE = 0.475, R² = 0.9956

and the Grammy for the best regression model goes to... Model 1 - Linear (engineered)

Conclusion:

Model 1 is by far the best: errors are much smaller than baseline and R² is almost 1.
Random Forest (Model 2) is only slightly better than baseline.
Gradient Boosting (Model 3) is actually worse than baseline.
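A comparison like the one above can be reproduced in outline with the loop below, which fits the three scikit-learn regressors on the same split and collects MAE, RMSE and R². The data is synthetic and linear by construction, and the hyperparameters are assumptions, so it only illustrates the procedure, not the reported numbers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": r2_score(y_test, pred),
    }

# Pick the model with the lowest test RMSE.
best = min(results, key=lambda name: results[name]["rmse"])
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
print("best:", best)
```

When the target is close to a linear function of the inputs, as the near-perfect R² here suggests, tree ensembles have to approximate that line with step functions, which is one plausible reason the linear model comes out ahead.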

Feature importance:

All models agree that CO₂-related variables (especially per-capita and including land-use) and GHG per capita are the key drivers of CO₂ per capita; engineered features like land_use_share and gas shares also matter, but less.

Model 1 – LinearRegression

Model 2 – RandomForestRegressor

Model 3 – GradientBoostingRegressor

                  --------------------------------------------------------------------------------------------------

Regression-to-Classification

I converted co2_per_capita into three classes using quantile binning (bottom 33%, middle 33%, top 33%). This gives interpretable classes of low, medium, and high per-capita emitters while keeping the class sizes relatively balanced. I use the same engineered features as in the regression models. Because the three emission classes are roughly balanced (each about 1/3 of the data), class imbalance is not a major issue here, and I can use accuracy along with macro F1 as evaluation metrics in the classification part.
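Quantile binning into three classes can be done with `pandas.qcut`, as in this small sketch. The values are made up and the integer labels 0/1/2 (low/medium/high) are an assumption that matches the class ids in the reports further down.

```python
import pandas as pd

# Made-up per-capita values, sorted only to make the result easy to read.
co2_per_capita = pd.Series(
    [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 9.0, 15.0, 30.0], name="co2_per_capita"
)

# q=3 cuts at the 1/3 and 2/3 quantiles; labels 0/1/2 mean low/medium/high.
emission_class = pd.qcut(co2_per_capita, q=3, labels=[0, 1, 2]).astype(int)

print(emission_class.tolist())
print(emission_class.value_counts().sort_index())
```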

Train & Eval Classification Models

In this task, recall for the high-emission class is more important than precision: I prefer to catch as many truly high emitters as possible, even if I sometimes mislabel medium emitters as high. False negatives are more critical than false positives, because missing a high-emission country is worse than mistakenly flagging a medium-emission one.
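To track that priority explicitly, recall for the high class (label 2) can be read off on its own, next to the confusion matrix and macro F1. The labels below are toy values, not model output.

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Toy labels: 0 = low, 1 = medium, 2 = high.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]

# Recall for the high class only: caught high emitters / all true high emitters.
recall_high = recall_score(y_true, y_pred, labels=[2], average=None)[0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"recall (high) = {recall_high:.2f}")
print(cm)
print(f"macro F1 = {macro_f1:.3f}")
```

In the matrix, the last row shows where the true high emitters ended up; any count outside the last column of that row is a missed high emitter.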

Model A – Logistic Regression

Accuracy: 0.9730025538124772

              precision    recall  f1-score   support

           0      0.963     0.977     0.970       920
           1      0.966     0.955     0.960       940
           2      0.992     0.988     0.990       881

    accuracy                          0.973      2741
   macro avg      0.973     0.973     0.973      2741
weighted avg      0.973     0.973     0.973      2741

Model B – Random Forest Classifier

Accuracy: 0.9839474644290405

              precision    recall  f1-score   support

           0      0.991     0.984     0.987       920
           1      0.974     0.980     0.977       940
           2      0.988     0.989     0.988       881

    accuracy                          0.984      2741
   macro avg      0.984     0.984     0.984      2741
weighted avg      0.984     0.984     0.984      2741

Feature importance for the best classifier - Random Forest:

Model C – Gradient Boosting Classifier

Accuracy: 0.9795695001824152

              precision    recall  f1-score   support

           0      0.987     0.975     0.981       920
           1      0.965     0.976     0.970       940
           2      0.988     0.989     0.988       881

    accuracy                          0.980      2741
   macro avg      0.980     0.980     0.980      2741
weighted avg      0.980     0.980     0.980      2741

Evaluation:

Logistic Regression achieves about 97% accuracy with balanced precision and recall across all three emission classes. The confusion matrix shows that almost all errors are between neighbouring classes, and recall for the high-emitter class is very high (~0.99), which is important for our use case.
The Random Forest classifier achieves about 98% accuracy and very high F1 scores for all three classes. The confusion matrix shows only a handful of mistakes, mostly between neighbouring classes (low vs medium and medium vs high), and recall for the high-emitter class is ~0.99, so it almost never misses a high-emission country. This makes Random Forest the best classification model among the three.
Gradient Boosting reaches about 98.0% accuracy with very balanced performance across the three classes (macro F1 ≈ 0.980). Most mistakes are small: it only confuses neighbouring classes (low↔medium, medium↔high) and never predicts low as high or the opposite. It’s especially strong on the high-emitter class, with precision ≈ 0.99 and recall ≈ 0.99, so it almost never misses a high-emission country.

Although all of the models score very highly, the Random Forest classifier achieved the best overall score, so it is the one selected for upload.
