Satyam0077 committed on
Commit
9142902
·
0 Parent(s):

Initial commit - Project Samarth Intelligent Q&A System

.gitignore ADDED
@@ -0,0 +1,6 @@
+ venv/
+ __pycache__/
+ *.csv
+ *.ipynb_checkpoints
+ .DS_Store
+ .env
README.md ADDED
@@ -0,0 +1,138 @@
+ # 🌾 Project Samarth: Intelligent Q&A System
+ **Bridging Agriculture & Climate Insights using Live Government Data**
+
+ ---
+
+ ### 🧠 Overview
+
+ **Project Samarth** is a **Q&A system** that answers data-driven questions about **India’s agricultural economy** and its relationship with **climate patterns**, powered by datasets fetched live from **[data.gov.in](https://data.gov.in/)**.
+
+ The system fetches data published by:
+ - 🏛️ **Ministry of Agriculture & Farmers Welfare**
+ - 🌦️ **India Meteorological Department (IMD)**
+
+ It merges both datasets and lets users query them in **natural language** through a **Streamlit-based interface**.
+
+ ---
+
+ ### 🎯 Problem Statement
+
+ Government portals like **data.gov.in** host thousands of valuable datasets, but ministries publish them in differing formats, which makes cross-domain insights hard to extract.
+
+ **Mission:**
+ Design and build a **functional end-to-end prototype** that:
+ 1. Fetches live government data using APIs.
+ 2. Integrates multiple datasets (Agriculture + IMD Rainfall).
+ 3. Enables users to ask **natural language questions**.
+ 4. Returns accurate, traceable, data-backed insights with proper citations.
+
+ ---
+
+ ### 🚀 Features
+
+ ✅ **Real-Time API Integration**
+ - Fetches data directly from `data.gov.in` using an API key and resource IDs.
+ - Agriculture: Crop production data (1997–2014)
+ - IMD: Sub-divisional rainfall data (1901–2017)
+
+ ✅ **Data Integration Layer**
+ - Merges the rainfall and crop production datasets on cleaned, normalized state names and year.
+
+ ✅ **Intelligent Q&A Engine**
+ - Understands queries like:
+   - “Compare rainfall and rice production in Bihar and Jharkhand for the last 5 years.”
+   - “Analyze crop trends in Andhra Pradesh.”
+
+ ✅ **Streamlit Chat Interface**
+ - Simple user input box.
+ - Clean, markdown-based formatted answers.
+ - Auto-citation of data sources.
+
+ ✅ **Accuracy & Traceability**
+ - Every answer is computed from the fetched dataset and cites its source.
+
+ ---
+
+ ### 🧩 System Architecture
+
+ 1. User (Streamlit UI)
+ 2. Natural Language Parser (LLM / Keyword Extractor)
+ 3. Query Engine (Pandas Logic)
+ 4. Data Layer (APIs + Local CSV Integration)
+ 5. Answer Generator (Formatter + Citation)
+
+ ---
+
+ ### 🧰 Tech Stack
+
+ | Layer | Tools / Libraries |
+ |-------|--------------------|
+ | Data Fetching | `requests`, `pandas`, `json` |
+ | Data Integration | `pandas`, `numpy` |
+ | NLP Parsing | Custom keyword parser / rule-based |
+ | Visualization | `matplotlib`, `seaborn`, `plotly` |
+ | Frontend | `streamlit`, `style.css` |
+ | Backend Logic | Python 3.10+ |
+ | Source | [data.gov.in](https://data.gov.in) APIs |
+
+ ---
+
+ ### ⚙️ Setup Instructions
+
+ 1️⃣ **Clone the Repository**
+ ```bash
+ git clone https://github.com/<your-username>/Project_Samarth.git
+ cd Project_Samarth
+ ```
+
+ 2️⃣ **Create a Virtual Environment**
+
+     python -m venv venv
+     source venv/bin/activate  # (or venv\Scripts\activate on Windows)
+
+ 3️⃣ **Install Dependencies**
+
+     pip install -r requirements.txt
+
+ 4️⃣ **Fetch & Integrate Data**
+
+     python main.py
+
+ 5️⃣ **Run the Streamlit Q&A Interface**
+
+     streamlit run ui/app_streamlit.py
+
+ ---
+
+ ### 🧠 Example Query
+
+ **Input:**
+
+     Compare rainfall and rice production in Andaman and Nicobar Islands for the last 5 years
+
+ **Output:**
+
+     📊 Analysis for Andaman and Nicobar Islands - Crop: Rice
+
+     🌧️ Average Rainfall (mm):
+      • Andaman and Nicobar Islands: 1142.46
+
+     🌾 Total Production (tonnes):
+      • Andaman and Nicobar Islands: 45,451
+
+     📚 Data Source: Ministry of Agriculture & Farmers Welfare and India Meteorological Department (IMD), data.gov.in
+
+ ---
+
+ ### 🧩 Key Dataset References
+
+ | Dataset | Ministry | API Resource ID |
+ |---------|----------|-----------------|
+ | District-wise Crop Production Statistics (1997–2014) | Ministry of Agriculture & Farmers Welfare | xxxxx |
+ | Sub-divisional Rainfall Data (1901–2017) | India Meteorological Department (IMD) | xxxxxx |
+
+ ---
+
+ ### 👨‍💻 Developed By
+
+ Satyam Kumar
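The README describes a rule-based keyword parser that understands queries such as the Bihar and Jharkhand comparison, but the parser module is not part of this excerpt. The following is a minimal sketch of how such a parser might extract the pieces a query engine needs. `KNOWN_STATES`, `KNOWN_CROPS`, and `parse_query` are hypothetical names chosen for illustration, and the project's actual parser may work differently.

```python
import re

# Hypothetical vocabularies; a real parser would likely derive these from the merged dataset.
KNOWN_STATES = ["bihar", "jharkhand", "andhra pradesh", "andaman and nicobar islands"]
KNOWN_CROPS = ["rice", "wheat", "maize"]


def parse_query(query: str) -> dict:
    """Extract states, a crop, and a 'last N years' window from a free-text question."""
    text = query.lower()
    # Whole-word matching avoids hits inside longer words.
    states = [s for s in KNOWN_STATES if re.search(rf"\b{re.escape(s)}\b", text)]
    crop = next((c for c in KNOWN_CROPS if re.search(rf"\b{re.escape(c)}\b", text)), None)
    match = re.search(r"last\s+(\d+)\s+years?", text)
    years = int(match.group(1)) if match else None
    return {"states": states, "crop": crop, "last_n_years": years}


print(parse_query("Compare rainfall and rice production in Bihar and Jharkhand for the last 5 years."))
```

A fixed vocabulary keeps the parser predictable and easy to test, at the cost of missing spellings that are not listed (for example "Orissa" versus "Odisha").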
answer_generator/__init__.py ADDED
File without changes
answer_generator/citation_manager.py ADDED
@@ -0,0 +1,22 @@
+ def get_source(query_type: str):
+     """
+     Returns the accurate data source(s) based on query type.
+     Ensures correct citation for all Q&A responses.
+     """
+
+     if query_type in ["compare_rainfall", "compare_rainfall_production"]:
+         return "Ministry of Agriculture & Farmers Welfare and India Meteorological Department (IMD), data.gov.in"
+
+     elif query_type in ["highest_production", "crop_trend"]:
+         return "Ministry of Agriculture & Farmers Welfare, data.gov.in"
+
+     elif query_type == "climate_correlation":
+         return "India Meteorological Department (IMD), data.gov.in"
+
+     # Default fallback
+     return "Government Open Data Portal (data.gov.in)"
+
+
+ # 🧪 Test
+ if __name__ == "__main__":
+     print(get_source("compare_rainfall_production"))
answer_generator/formatter.py ADDED
@@ -0,0 +1,31 @@
+ def format_response(result: dict, source: str):
+     """Format the Q&A result for Streamlit display."""
+
+     # `result` may be None or empty, so avoid calling .get() on it directly
+     if not result or "error" in result:
+         message = (result or {}).get("error", "No valid data found.")
+         return f"❌ {message}"
+
+     states = [s.title() for s in result.get("states", [])]
+     crop = (result.get("crop") or "N/A").title()
+     text = f"📊 Analysis for {', '.join(states)} - Crop: {crop}\n\n"
+
+     # 🌧️ Rainfall Summary
+     rainfall = result.get("rainfall_summary", {})
+     if rainfall:
+         text += "🌧️ Average Rainfall (mm):\n"
+         for state, value in rainfall.items():
+             text += f" • {state.title()}: {round(value, 2)}\n"
+         text += "\n"
+
+     # 🌾 Production Summary
+     production = result.get("production_summary", {})
+     if production:
+         text += "🌾 Total Production (tonnes):\n"
+         for state, value in production.items():
+             text += f" • {state.title()}: {int(value):,}\n"
+         text += "\n"
+
+     # 📚 Citation
+     text += f"📚 Data Source: {source}"
+     return text
data_layer/__init__.py ADDED
File without changes
data_layer/config.py ADDED
@@ -0,0 +1,4 @@
+ BASE_URL = "https://api.data.gov.in/resource/"
+ API_KEY = "579b464db66ec23bdd000001375fd3eede8e49af7458c4f371a43d02"
+ AGRI_RESOURCE_ID = "35be999b-0208-4354-b557-f6ca9a5355de"
+ IMD_RESOURCE_ID = "8e0bd482-4aba-4d99-9cb9-ff124f6f1c2f"
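`config.py` stores the API key as a literal in a committed file, while the `.gitignore` in this commit already excludes `.env`. One common alternative is to read the key from the environment at start-up. The sketch below assumes a variable named `DATA_GOV_API_KEY`; that name and the `load_api_key` helper are illustrative and do not appear in the project.

```python
import os


def load_api_key(env_var: str = "DATA_GOV_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your data.gov.in API key."
        )
    return key
```

With this in place, `config.py` could assign `API_KEY = load_api_key()` so that a missing key is reported once at import time rather than as a 403 response later.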
data_layer/fetch_agriculture_api.py ADDED
@@ -0,0 +1,79 @@
+ import requests
+ import pandas as pd
+ import os
+ import time
+ from data_layer.config import BASE_URL, API_KEY, AGRI_RESOURCE_ID
+
+ def fetch_agriculture_data(limit=500, retries=3, max_records=2000):
+     """
+     Fetch agriculture data from the data.gov.in API in chunks and save it as CSV.
+     Retries on rate limits and transient errors; saves into the hybrid_dataset folder.
+     """
+
+     os.makedirs("hybrid_dataset", exist_ok=True)
+     csv_path = "hybrid_dataset/agriculture_data.csv"
+     all_data = []
+
+     print("🌾 Starting Agriculture data fetch...")
+
+     url = f"{BASE_URL}{AGRI_RESOURCE_ID}"
+     offset = 0
+     total_fetched = 0
+     done = False  # set once the API returns an empty page
+
+     while total_fetched < max_records and not done:
+         # Let requests build and encode the query string
+         params = {"api-key": API_KEY, "format": "json", "limit": limit, "offset": offset}
+
+         for attempt in range(retries):
+             try:
+                 response = requests.get(url, params=params, timeout=20)
+                 response.raise_for_status()
+
+                 data = response.json().get("records", [])
+                 if not data:
+                     print("✅ No more records found.")
+                     done = True  # also ends the outer while loop
+                     break
+
+                 df_chunk = pd.DataFrame(data)
+                 all_data.append(df_chunk)
+
+                 total_fetched += len(df_chunk)
+                 offset += limit
+
+                 print(f"✅ Chunk fetched: {len(df_chunk)} rows (Total: {total_fetched})")
+
+                 # small delay to avoid rate limit
+                 time.sleep(2)
+                 break
+
+             except requests.exceptions.HTTPError as e:
+                 status = e.response.status_code if e.response is not None else None
+                 if status == 429:
+                     print("⚠️ Too Many Requests: waiting 20 seconds...")
+                     time.sleep(20)
+                 elif status == 403:
+                     print("🚫 Forbidden: Check your API key or URL in config.py")
+                     return pd.DataFrame()
+                 else:
+                     print(f"⚠️ Attempt {attempt+1} failed: {e}")
+                     time.sleep(3)
+
+             except requests.exceptions.RequestException as e:
+                 # Timeouts and connection errors are retried as well
+                 print(f"⚠️ Attempt {attempt+1} failed: {e}")
+                 time.sleep(3)
+
+         else:
+             print("❌ Max retries reached, stopping the fetch.")
+             break
+
+     if all_data:
+         final_df = pd.concat(all_data, ignore_index=True)
+         final_df.to_csv(csv_path, index=False)
+         print(f"✅ Agriculture data fetched & saved → {csv_path} ({len(final_df)} rows total)")
+         return final_df
+     else:
+         print("❌ No data fetched.")
+         return pd.DataFrame()
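The fetcher pages through the API with `limit` and `offset`. A subtle point in this pattern is that an empty page has to end the whole pagination loop, not only the inner retry loop, otherwise the same offset is requested again. The sketch below isolates that pagination logic from the HTTP call so it can be checked without network access. `fetch_all`, `get_page`, and `fake_page` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List


def fetch_all(
    get_page: Callable[[int, int], List[Dict]],
    limit: int = 500,
    max_records: int = 2000,
) -> List[Dict]:
    """Collect records page by page until the source is exhausted or max_records is reached."""
    records: List[Dict] = []
    offset = 0
    while len(records) < max_records:
        page = get_page(offset, limit)
        if not page:  # empty page: the source has no more records
            break
        records.extend(page)
        offset += limit
    return records[:max_records]  # trim any overshoot from the last page


# Stand-in for the HTTP call: serves 1,200 fake rows in slices.
FAKE_ROWS = [{"id": i} for i in range(1200)]


def fake_page(offset: int, limit: int) -> List[Dict]:
    return FAKE_ROWS[offset:offset + limit]


print(len(fetch_all(fake_page)))                   # 1200
print(len(fetch_all(fake_page, max_records=700)))  # 700
```

Passing the page function in as a parameter also means the agriculture and rainfall fetchers, which differ only in resource ID and file name, could share one loop.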
data_layer/fetch_imd_api.py ADDED
@@ -0,0 +1,76 @@
+ import requests
+ import pandas as pd
+ import os
+ import time
+ from data_layer.config import BASE_URL, API_KEY, IMD_RESOURCE_ID
+
+ def fetch_rainfall_data(limit=500, retries=3, max_records=2000):
+     """
+     Fetch IMD rainfall data from the data.gov.in API in chunks and save it as CSV.
+     Retries on rate limits and transient errors; saves into the hybrid_dataset folder.
+     """
+     os.makedirs("hybrid_dataset", exist_ok=True)
+     csv_path = "hybrid_dataset/imd_rainfall_data.csv"
+     all_data = []
+
+     print("🌦️ Starting IMD Rainfall data fetch...")
+
+     url = f"{BASE_URL}{IMD_RESOURCE_ID}"
+     offset = 0
+     total_fetched = 0
+     done = False  # set once the API returns an empty page
+
+     while total_fetched < max_records and not done:
+         # Let requests build and encode the query string
+         params = {"api-key": API_KEY, "format": "json", "limit": limit, "offset": offset}
+
+         for attempt in range(retries):
+             try:
+                 response = requests.get(url, params=params, timeout=20)
+                 response.raise_for_status()
+
+                 data = response.json().get("records", [])
+                 if not data:
+                     print("✅ No more records found.")
+                     done = True  # also ends the outer while loop
+                     break
+
+                 df_chunk = pd.DataFrame(data)
+                 all_data.append(df_chunk)
+
+                 total_fetched += len(df_chunk)
+                 offset += limit
+
+                 print(f"✅ Chunk fetched: {len(df_chunk)} rows (Total: {total_fetched})")
+
+                 time.sleep(2)  # avoid rate-limit
+                 break
+
+             except requests.exceptions.HTTPError as e:
+                 status = e.response.status_code if e.response is not None else None
+                 if status == 429:
+                     print("⚠️ Too Many Requests: waiting 20 seconds...")
+                     time.sleep(20)
+                 elif status == 403:
+                     print("🚫 Forbidden: check API key or IMD resource ID in config.py")
+                     return pd.DataFrame()
+                 else:
+                     print(f"⚠️ Attempt {attempt+1} failed: {e}")
+                     time.sleep(3)
+
+             except requests.exceptions.RequestException as e:
+                 # Timeouts and connection errors are retried as well
+                 print(f"⚠️ Attempt {attempt+1} failed: {e}")
+                 time.sleep(3)
+         else:
+             print("❌ Max retries reached, stopping the fetch.")
+             break
+
+     if all_data:
+         final_df = pd.concat(all_data, ignore_index=True)
+         final_df.to_csv(csv_path, index=False)
+         print(f"✅ Rainfall data fetched & saved → {csv_path} ({len(final_df)} rows total)")
+         return final_df
+     else:
+         print("❌ No rainfall data fetched.")
+         return pd.DataFrame()
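Both fetchers wait a fixed 20 seconds after an HTTP 429 and 3 seconds after other failures. A common alternative is exponential backoff that prefers the server's `Retry-After` header when one is sent. The sketch below shows only the delay calculation. Whether data.gov.in actually sends `Retry-After` is not established by this commit, and `backoff_delay` is a hypothetical helper.

```python
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 2.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint when it is a number of seconds.
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff
    return min(base * (2 ** attempt), cap)


for attempt in range(4):
    print(attempt, backoff_delay(attempt))
```

In a fetch loop this would replace the fixed sleeps, for example `time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))`.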
data_layer/integrate_data.py ADDED
@@ -0,0 +1,98 @@
+ import pandas as pd
+ import os
+ import re
+
+ def clean_state_name(name: str):
+     """Cleans and standardizes state/subdivision names."""
+     if not isinstance(name, str):
+         return ""
+     name = name.lower().strip()
+     name = name.replace("&", " and ")  # padded so "a&b" becomes "a and b"
+     name = re.sub(r"[^a-z\s]", "", name)  # remove special chars
+     name = re.sub(r"\s+", " ", name).strip()  # collapse whitespace last
+     return name
+
+ def integrate_data(agri_df: pd.DataFrame, rain_df: pd.DataFrame):
+     """
+     🔗 Final Integration Logic: Clean, Normalize, and Merge
+     Works even if & or trailing spaces exist.
+     """
+
+     os.makedirs("hybrid_dataset", exist_ok=True)
+
+     # Nothing to merge if either fetch came back empty
+     if agri_df.empty or rain_df.empty:
+         print("⚠️ One of the input datasets is empty; skipping integration.")
+         return pd.DataFrame()
+
+     # Work on copies so the caller's DataFrames are not modified
+     agri_df = agri_df.copy()
+     rain_df = rain_df.copy()
+
+     # Clean columns first, so the lookups below find the lowercase names
+     agri_df.columns = agri_df.columns.str.lower().str.strip()
+     rain_df.columns = rain_df.columns.str.lower().str.strip()
+
+     print(f"🧾 Agriculture unique states: {agri_df['state_name'].nunique()}")
+     print(f"☁️ Rainfall unique subdivisions: {rain_df['subdivision'].nunique()}")
+
+     # Clean text values
+     agri_df["state_name"] = agri_df["state_name"].apply(clean_state_name)
+     rain_df["subdivision"] = rain_df["subdivision"].apply(clean_state_name)
+
+     # Create mapping
+     mapping = {
+         "andaman and nicobar islands": "andaman and nicobar islands",
+         "orissa": "odisha",
+         "sub himalayan west bengal and sikkim": "west bengal",
+         "gangetic west bengal": "west bengal",
+         "east uttar pradesh": "uttar pradesh",
+         "west uttar pradesh": "uttar pradesh",
+         "east rajasthan": "rajasthan",
+         "west rajasthan": "rajasthan",
+         "haryana delhi and chandigarh": "haryana",
+         "assam and meghalaya": "assam",
+         "naga mani mizo tripura": "tripura",
+     }
+
+     # Apply mapping to rainfall data
+     # NOTE: several subdivisions map to one state, so a state-year can have more than one rainfall row
+     rain_df["state_name"] = rain_df["subdivision"].replace(mapping)
+
+     # Ensure year columns match type
+     agri_df["crop_year"] = pd.to_numeric(agri_df["crop_year"], errors="coerce").astype("Int64")
+     rain_df["year"] = pd.to_numeric(rain_df["year"], errors="coerce").astype("Int64")
+     rain_df = rain_df.rename(columns={"year": "crop_year"})
+
+     # Show what’s common after full cleaning
+     common_states = sorted(set(agri_df["state_name"].unique()) & set(rain_df["state_name"].unique()))
+     print(f"✅ Common states found: {common_states}")
+
+     if not common_states:
+         print("⚠️ No matching states even after cleaning; check character mismatches manually!")
+         print("🔍 Example Agri states:", agri_df['state_name'].unique().tolist())
+         print("🔍 Example Rainfall states:", rain_df['state_name'].unique().tolist())
+         return pd.DataFrame()
+
+     # Filter only matching states
+     agri_df = agri_df[agri_df["state_name"].isin(common_states)]
+     rain_df = rain_df[rain_df["state_name"].isin(common_states)]
+
+     # Merge datasets
+     merged = pd.merge(agri_df, rain_df, on=["state_name", "crop_year"], how="inner")
+
+     # Save output
+     output_path = "hybrid_dataset/merged_agri_rainfall.csv"
+     merged.to_csv(output_path, index=False)
+
+     print(f"✅ Data integrated and saved → {output_path} ({len(merged)} rows, {len(merged.columns)} columns)")
+     print("🏛️ Unique merged states:", merged["state_name"].unique().tolist())
+
+     return merged
+
+
+ # 🧪 Quick standalone test
+ if __name__ == "__main__":
+     ag = pd.read_csv("hybrid_dataset/agriculture_data.csv")
+     rd = pd.read_csv("hybrid_dataset/imd_rainfall_data.csv")
+     integrate_data(ag, rd)
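The `mapping` dict sends several IMD subdivisions to the same state (for example East and West Rajasthan both become "rajasthan"). After that step a state can have more than one rainfall row per year, and an inner merge on state and year then repeats every crop row once per subdivision, which would inflate production totals. One way to avoid this is to reduce rainfall to a single value per state and year before merging. The sketch below shows the idea in plain Python with made-up numbers. The row shape and the unweighted mean are assumptions, since the rainfall columns are not visible in this commit. With pandas, the same step would be a `groupby` on state and year.

```python
from collections import defaultdict
from statistics import mean

# Assumed row shape: (subdivision, year, annual rainfall in mm). Values are illustrative only.
rain_rows = [
    ("east rajasthan", 2010, 700.0),
    ("west rajasthan", 2010, 300.0),
    ("orissa", 2010, 1400.0),
]

# Subset of the subdivision-to-state mapping used in integrate_data.py.
SUBDIVISION_TO_STATE = {
    "east rajasthan": "rajasthan",
    "west rajasthan": "rajasthan",
    "orissa": "odisha",
}


def rainfall_by_state_year(rows):
    """Average all subdivisions that map to the same state, per year."""
    grouped = defaultdict(list)
    for subdivision, year, rainfall in rows:
        state = SUBDIVISION_TO_STATE.get(subdivision, subdivision)
        grouped[(state, year)].append(rainfall)
    return {key: mean(values) for key, values in grouped.items()}


print(rainfall_by_state_year(rain_rows))
```

An area-weighted mean would be more faithful than a simple average, but it needs subdivision areas that neither dataset is shown to provide.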
main.py ADDED
@@ -0,0 +1,23 @@
+ from data_layer.fetch_agriculture_api import fetch_agriculture_data   # 🌾 fetches Agriculture data from API
+ from data_layer.fetch_imd_api import fetch_rainfall_data               # 🌧️ fetches IMD rainfall data from API
+ from data_layer.integrate_data import integrate_data                  # 🔗 merges both datasets
+
+ if __name__ == "__main__":
+     print("🌾 Fetching Agriculture Data ...")
+     agri_df = fetch_agriculture_data()
+
+     print("🌧️ Fetching IMD Rainfall Data ...")
+     rain_df = fetch_rainfall_data()
+
+     # Stop early instead of reporting success on empty inputs
+     if agri_df.empty or rain_df.empty:
+         raise SystemExit("❌ Fetch failed: at least one dataset is empty. See the messages above.")
+
+     print("🔗 Integrating Datasets ...")
+     merged = integrate_data(agri_df, rain_df)
+
+     if merged.empty:
+         raise SystemExit("❌ Integration produced no rows. See the messages above.")
+
+     print("\n✅ Phase 1 Completed Successfully!")
+     print("Now run: streamlit run ui/app_streamlit.py")
notebooks/01_data_discovery.ipynb ADDED
@@ -0,0 +1,1619 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "ec455bd3",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "Active code page: 1252\n",
14
+ "Requirement already satisfied: requests in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (2.32.5)\n",
15
+ "Requirement already satisfied: pandas in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (2.3.2)\n",
16
+ "Requirement already satisfied: numpy in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (2.3.2)\n",
17
+ "Collecting matplotlib\n",
18
+ " Using cached matplotlib-3.10.7-cp313-cp313-win_amd64.whl.metadata (11 kB)\n",
19
+ "Collecting plotly\n",
20
+ " Downloading plotly-6.3.1-py3-none-any.whl.metadata (8.5 kB)\n",
21
+ "Requirement already satisfied: streamlit in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (1.49.1)\n",
22
+ "Requirement already satisfied: langchain in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (0.3.27)\n",
23
+ "Requirement already satisfied: transformers in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (4.56.1)\n",
24
+ "Requirement already satisfied: charset_normalizer<4,>=2 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from requests) (3.4.3)\n",
25
+ "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from requests) (3.10)\n",
26
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from requests) (2.5.0)\n",
27
+ "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from requests) (2025.8.3)\n",
28
+ "Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2.9.0.post0)\n",
29
+ "Requirement already satisfied: pytz>=2020.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2025.2)\n",
30
+ "Requirement already satisfied: tzdata>=2022.7 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2025.2)\n",
31
+ "Collecting contourpy>=1.0.1 (from matplotlib)\n",
32
+ " Using cached contourpy-1.3.3-cp313-cp313-win_amd64.whl.metadata (5.5 kB)\n",
33
+ "Collecting cycler>=0.10 (from matplotlib)\n",
34
+ " Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)\n",
35
+ "Collecting fonttools>=4.22.0 (from matplotlib)\n",
36
+ " Using cached fonttools-4.60.1-cp313-cp313-win_amd64.whl.metadata (114 kB)\n",
37
+ "Collecting kiwisolver>=1.3.1 (from matplotlib)\n",
38
+ " Using cached kiwisolver-1.4.9-cp313-cp313-win_amd64.whl.metadata (6.4 kB)\n",
39
+ "Requirement already satisfied: packaging>=20.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (23.2)\n",
40
+ "Requirement already satisfied: pillow>=8 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (11.3.0)\n",
41
+ "Collecting pyparsing>=3 (from matplotlib)\n",
42
+ " Using cached pyparsing-3.2.5-py3-none-any.whl.metadata (5.0 kB)\n",
43
+ "Requirement already satisfied: narwhals>=1.15.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from plotly) (2.4.0)\n",
44
+ "Requirement already satisfied: altair!=5.4.0,!=5.4.1,<6,>=4.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (5.5.0)\n",
45
+ "Requirement already satisfied: blinker<2,>=1.5.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (1.9.0)\n",
46
+ "Requirement already satisfied: cachetools<7,>=4.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (6.2.0)\n",
47
+ "Requirement already satisfied: click<9,>=7.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (8.2.1)\n",
48
+ "Requirement already satisfied: protobuf<7,>=3.20 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (6.32.0)\n",
49
+ "Requirement already satisfied: pyarrow>=7.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (21.0.0)\n",
50
+ "Requirement already satisfied: tenacity<10,>=8.1.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (8.5.0)\n",
51
+ "Requirement already satisfied: toml<2,>=0.10.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (0.10.2)\n",
52
+ "Requirement already satisfied: typing-extensions<5,>=4.4.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (4.15.0)\n",
53
+ "Requirement already satisfied: watchdog<7,>=2.1.5 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (6.0.0)\n",
54
+ "Requirement already satisfied: gitpython!=3.1.19,<4,>=3.0.7 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (3.1.45)\n",
55
+ "Requirement already satisfied: pydeck<1,>=0.8.0b4 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (0.9.1)\n",
56
+ "Requirement already satisfied: tornado!=6.5.0,<7,>=6.0.3 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from streamlit) (6.5.2)\n",
57
+ "Requirement already satisfied: langchain-core<1.0.0,>=0.3.72 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (0.3.76)\n",
58
+ "Requirement already satisfied: langchain-text-splitters<1.0.0,>=0.3.9 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (0.3.11)\n",
59
+ "Requirement already satisfied: langsmith>=0.1.17 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (0.4.27)\n",
60
+ "Requirement already satisfied: pydantic<3.0.0,>=2.7.4 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (2.11.7)\n",
61
+ "Requirement already satisfied: SQLAlchemy<3,>=1.4 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (2.0.43)\n",
62
+ "Requirement already satisfied: PyYAML>=5.3 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain) (6.0.2)\n",
63
+ "Requirement already satisfied: filelock in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (3.19.1)\n",
64
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.34.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (0.34.4)\n",
65
+ "Requirement already satisfied: regex!=2019.12.17 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (2025.9.1)\n",
66
+ "Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (0.22.0)\n",
67
+ "Requirement already satisfied: safetensors>=0.4.3 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (0.6.2)\n",
68
+ "Requirement already satisfied: tqdm>=4.27 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from transformers) (4.67.1)\n",
69
+ "Requirement already satisfied: jinja2 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (3.1.6)\n",
70
+ "Requirement already satisfied: jsonschema>=3.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (4.25.1)\n",
71
+ "Requirement already satisfied: colorama in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from click<9,>=7.0->streamlit) (0.4.6)\n",
72
+ "Requirement already satisfied: gitdb<5,>=4.0.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from gitpython!=3.1.19,<4,>=3.0.7->streamlit) (4.0.12)\n",
73
+ "Requirement already satisfied: fsspec>=2023.5.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from huggingface-hub<1.0,>=0.34.0->transformers) (2024.2.0)\n",
74
+ "Requirement already satisfied: jsonpatch<2.0,>=1.33 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langchain-core<1.0.0,>=0.3.72->langchain) (1.33)\n",
75
+ "Requirement already satisfied: httpx<1,>=0.23.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langsmith>=0.1.17->langchain) (0.28.1)\n",
76
+ "Requirement already satisfied: orjson>=3.9.14 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langsmith>=0.1.17->langchain) (3.11.3)\n",
77
+ "Requirement already satisfied: requests-toolbelt>=1.0.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langsmith>=0.1.17->langchain) (1.0.0)\n",
78
+ "Requirement already satisfied: zstandard>=0.23.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from langsmith>=0.1.17->langchain) (0.24.0)\n",
79
+ "Requirement already satisfied: annotated-types>=0.6.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.7.0)\n",
80
+ "Requirement already satisfied: pydantic-core==2.33.2 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (2.33.2)\n",
81
+ "Requirement already satisfied: typing-inspection>=0.4.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.4.1)\n",
82
+ "Requirement already satisfied: six>=1.5 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
83
+ "Requirement already satisfied: greenlet>=1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from SQLAlchemy<3,>=1.4->langchain) (3.2.4)\n",
84
+ "Requirement already satisfied: smmap<6,>=3.0.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit) (5.0.2)\n",
85
+ "Requirement already satisfied: anyio in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from httpx<1,>=0.23.0->langsmith>=0.1.17->langchain) (4.10.0)\n",
86
+ "Requirement already satisfied: httpcore==1.* in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from httpx<1,>=0.23.0->langsmith>=0.1.17->langchain) (1.0.9)\n",
87
+ "Requirement already satisfied: h11>=0.16 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from httpcore==1.*->httpx<1,>=0.23.0->langsmith>=0.1.17->langchain) (0.16.0)\n",
88
+ "Requirement already satisfied: MarkupSafe>=2.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jinja2->altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (3.0.2)\n",
89
+ "Requirement already satisfied: jsonpointer>=1.9 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jsonpatch<2.0,>=1.33->langchain-core<1.0.0,>=0.3.72->langchain) (3.0.0)\n",
90
+ "Requirement already satisfied: attrs>=22.2.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jsonschema>=3.0->altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (25.3.0)\n",
91
+ "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jsonschema>=3.0->altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (2025.9.1)\n",
92
+ "Requirement already satisfied: referencing>=0.28.4 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jsonschema>=3.0->altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (0.36.2)\n",
93
+ "Requirement already satisfied: rpds-py>=0.7.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from jsonschema>=3.0->altair!=5.4.0,!=5.4.1,<6,>=4.0->streamlit) (0.27.1)\n",
94
+ "Requirement already satisfied: sniffio>=1.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from anyio->httpx<1,>=0.23.0->langsmith>=0.1.17->langchain) (1.3.1)\n",
95
+ "Using cached matplotlib-3.10.7-cp313-cp313-win_amd64.whl (8.1 MB)\n",
96
+ "Downloading plotly-6.3.1-py3-none-any.whl (9.8 MB)\n",
97
+ " ---------------------------------------- 0.0/9.8 MB ? eta -:--:--\n",
98
+ " ---- ----------------------------------- 1.0/9.8 MB 7.0 MB/s eta 0:00:02\n",
99
+ " ---- ----------------------------------- 1.0/9.8 MB 7.0 MB/s eta 0:00:02\n",
100
+ " ---------- ----------------------------- 2.6/9.8 MB 4.5 MB/s eta 0:00:02\n",
101
+ " ------------- -------------------------- 3.4/9.8 MB 4.4 MB/s eta 0:00:02\n",
102
+ " ----------------- ---------------------- 4.2/9.8 MB 4.3 MB/s eta 0:00:02\n",
103
+ " -------------------- ------------------- 5.0/9.8 MB 4.2 MB/s eta 0:00:02\n",
104
+ " ----------------------- ---------------- 5.8/9.8 MB 4.1 MB/s eta 0:00:01\n",
105
+ " --------------------------- ------------ 6.8/9.8 MB 4.1 MB/s eta 0:00:01\n",
106
+ " ------------------------------ --------- 7.6/9.8 MB 4.1 MB/s eta 0:00:01\n",
107
+ " ---------------------------------- ----- 8.4/9.8 MB 4.1 MB/s eta 0:00:01\n",
108
+ " ------------------------------------- -- 9.2/9.8 MB 4.0 MB/s eta 0:00:01\n",
109
+ " --------------------------------------- 9.7/9.8 MB 4.0 MB/s eta 0:00:01\n",
110
+ " ---------------------------------------- 9.8/9.8 MB 3.9 MB/s eta 0:00:00\n",
111
+ "Using cached contourpy-1.3.3-cp313-cp313-win_amd64.whl (226 kB)\n",
112
+ "Using cached cycler-0.12.1-py3-none-any.whl (8.3 kB)\n",
113
+ "Using cached fonttools-4.60.1-cp313-cp313-win_amd64.whl (2.3 MB)\n",
114
+ "Using cached kiwisolver-1.4.9-cp313-cp313-win_amd64.whl (73 kB)\n",
115
+ "Using cached pyparsing-3.2.5-py3-none-any.whl (113 kB)\n",
116
+ "Installing collected packages: pyparsing, plotly, kiwisolver, fonttools, cycler, contourpy, matplotlib\n",
117
+ "Successfully installed contourpy-1.3.3 cycler-0.12.1 fonttools-4.60.1 kiwisolver-1.4.9 matplotlib-3.10.7 plotly-6.3.1 pyparsing-3.2.5\n",
118
+ "Note: you may need to restart the kernel to use updated packages.\n"
119
+ ]
120
+ }
121
+ ],
122
+ "source": [
123
+ "%pip install requests pandas numpy matplotlib plotly streamlit langchain transformers\n"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "code",
128
+ "execution_count": 6,
129
+ "id": "42263596",
130
+ "metadata": {},
131
+ "outputs": [
132
+ {
133
+ "name": "stdout",
134
+ "output_type": "stream",
135
+ "text": [
136
+ "🔍 Dataset Loaded Successfully!\n",
137
+ "Total Rows: 203\n",
138
+ "Total Columns: 25\n",
139
+ "\n"
140
+ ]
141
+ },
142
+ {
143
+ "data": {
144
+ "application/vnd.microsoft.datawrangler.viewer.v0+json": {
145
+ "columns": [
146
+ {
147
+ "name": "index",
148
+ "rawType": "int64",
149
+ "type": "integer"
150
+ },
151
+ {
152
+ "name": "state_name",
153
+ "rawType": "object",
154
+ "type": "string"
155
+ },
156
+ {
157
+ "name": "district_name",
158
+ "rawType": "object",
159
+ "type": "string"
160
+ },
161
+ {
162
+ "name": "crop_year",
163
+ "rawType": "int64",
164
+ "type": "integer"
165
+ },
166
+ {
167
+ "name": "season",
168
+ "rawType": "object",
169
+ "type": "string"
170
+ },
171
+ {
172
+ "name": "crop",
173
+ "rawType": "object",
174
+ "type": "string"
175
+ },
176
+ {
177
+ "name": "area_",
178
+ "rawType": "float64",
179
+ "type": "float"
180
+ },
181
+ {
182
+ "name": "production_",
183
+ "rawType": "float64",
184
+ "type": "float"
185
+ },
186
+ {
187
+ "name": "subdivision",
188
+ "rawType": "object",
189
+ "type": "string"
190
+ },
191
+ {
192
+ "name": "jan",
193
+ "rawType": "float64",
194
+ "type": "float"
195
+ },
196
+ {
197
+ "name": "feb",
198
+ "rawType": "float64",
199
+ "type": "float"
200
+ },
201
+ {
202
+ "name": "mar",
203
+ "rawType": "float64",
204
+ "type": "float"
205
+ },
206
+ {
207
+ "name": "apr",
208
+ "rawType": "float64",
209
+ "type": "float"
210
+ },
211
+ {
212
+ "name": "may",
213
+ "rawType": "float64",
214
+ "type": "float"
215
+ },
216
+ {
217
+ "name": "jun",
218
+ "rawType": "float64",
219
+ "type": "float"
220
+ },
221
+ {
222
+ "name": "jul",
223
+ "rawType": "float64",
224
+ "type": "float"
225
+ },
226
+ {
227
+ "name": "aug",
228
+ "rawType": "float64",
229
+ "type": "float"
230
+ },
231
+ {
232
+ "name": "sep",
233
+ "rawType": "float64",
234
+ "type": "float"
235
+ },
236
+ {
237
+ "name": "oct",
238
+ "rawType": "float64",
239
+ "type": "float"
240
+ },
241
+ {
242
+ "name": "nov",
243
+ "rawType": "float64",
244
+ "type": "float"
245
+ },
246
+ {
247
+ "name": "dec",
248
+ "rawType": "float64",
249
+ "type": "float"
250
+ },
251
+ {
252
+ "name": "annual",
253
+ "rawType": "float64",
254
+ "type": "float"
255
+ },
256
+ {
257
+ "name": "jf",
258
+ "rawType": "float64",
259
+ "type": "float"
260
+ },
261
+ {
262
+ "name": "mam",
263
+ "rawType": "float64",
264
+ "type": "float"
265
+ },
266
+ {
267
+ "name": "jjas",
268
+ "rawType": "float64",
269
+ "type": "float"
270
+ },
271
+ {
272
+ "name": "ond",
273
+ "rawType": "float64",
274
+ "type": "float"
275
+ }
276
+ ],
277
+ "ref": "633f3f12-0965-479f-8a47-3a1c5a2b8105",
278
+ "rows": [
279
+ [
280
+ "0",
281
+ "andaman and nicobar islands",
282
+ "NICOBARS",
283
+ "2000",
284
+ "Kharif",
285
+ "Arecanut",
286
+ "1254.0",
287
+ "2000.0",
288
+ "andaman & nicobar islands",
289
+ "53.0",
290
+ "59.0",
291
+ "171.3",
292
+ "218.1",
293
+ "422.8",
294
+ "357.0",
295
+ "176.3",
296
+ "460.8",
297
+ "250.1",
298
+ "321.2",
299
+ "158.3",
300
+ "115.2",
301
+ "2763.2",
302
+ "112.0",
303
+ "812.2",
304
+ "1244.2",
305
+ "594.7"
306
+ ],
307
+ [
308
+ "1",
309
+ "andaman and nicobar islands",
310
+ "NICOBARS",
311
+ "2000",
312
+ "Kharif",
313
+ "Other Kharif pulses",
314
+ "2.0",
315
+ "1.0",
316
+ "andaman & nicobar islands",
317
+ "53.0",
318
+ "59.0",
319
+ "171.3",
320
+ "218.1",
321
+ "422.8",
322
+ "357.0",
323
+ "176.3",
324
+ "460.8",
325
+ "250.1",
326
+ "321.2",
327
+ "158.3",
328
+ "115.2",
329
+ "2763.2",
330
+ "112.0",
331
+ "812.2",
332
+ "1244.2",
333
+ "594.7"
334
+ ],
335
+ [
336
+ "2",
337
+ "andaman and nicobar islands",
338
+ "NICOBARS",
339
+ "2000",
340
+ "Kharif",
341
+ "Rice",
342
+ "102.0",
343
+ "321.0",
344
+ "andaman & nicobar islands",
345
+ "53.0",
346
+ "59.0",
347
+ "171.3",
348
+ "218.1",
349
+ "422.8",
350
+ "357.0",
351
+ "176.3",
352
+ "460.8",
353
+ "250.1",
354
+ "321.2",
355
+ "158.3",
356
+ "115.2",
357
+ "2763.2",
358
+ "112.0",
359
+ "812.2",
360
+ "1244.2",
361
+ "594.7"
362
+ ],
363
+ [
364
+ "3",
365
+ "andaman and nicobar islands",
366
+ "NICOBARS",
367
+ "2000",
368
+ "Whole Year",
369
+ "Banana",
370
+ "176.0",
371
+ "641.0",
372
+ "andaman & nicobar islands",
373
+ "53.0",
374
+ "59.0",
375
+ "171.3",
376
+ "218.1",
377
+ "422.8",
378
+ "357.0",
379
+ "176.3",
380
+ "460.8",
381
+ "250.1",
382
+ "321.2",
383
+ "158.3",
384
+ "115.2",
385
+ "2763.2",
386
+ "112.0",
387
+ "812.2",
388
+ "1244.2",
389
+ "594.7"
390
+ ],
391
+ [
392
+ "4",
393
+ "andaman and nicobar islands",
394
+ "NICOBARS",
395
+ "2000",
396
+ "Whole Year",
397
+ "Cashewnut",
398
+ "720.0",
399
+ "165.0",
400
+ "andaman & nicobar islands",
401
+ "53.0",
402
+ "59.0",
403
+ "171.3",
404
+ "218.1",
405
+ "422.8",
406
+ "357.0",
407
+ "176.3",
408
+ "460.8",
409
+ "250.1",
410
+ "321.2",
411
+ "158.3",
412
+ "115.2",
413
+ "2763.2",
414
+ "112.0",
415
+ "812.2",
416
+ "1244.2",
417
+ "594.7"
418
+ ]
419
+ ],
420
+ "shape": {
421
+ "columns": 25,
422
+ "rows": 5
423
+ }
424
+ },
425
+ "text/html": [
426
+ "<div>\n",
427
+ "<style scoped>\n",
428
+ " .dataframe tbody tr th:only-of-type {\n",
429
+ " vertical-align: middle;\n",
430
+ " }\n",
431
+ "\n",
432
+ " .dataframe tbody tr th {\n",
433
+ " vertical-align: top;\n",
434
+ " }\n",
435
+ "\n",
436
+ " .dataframe thead th {\n",
437
+ " text-align: right;\n",
438
+ " }\n",
439
+ "</style>\n",
440
+ "<table border=\"1\" class=\"dataframe\">\n",
441
+ " <thead>\n",
442
+ " <tr style=\"text-align: right;\">\n",
443
+ " <th></th>\n",
444
+ " <th>state_name</th>\n",
445
+ " <th>district_name</th>\n",
446
+ " <th>crop_year</th>\n",
447
+ " <th>season</th>\n",
448
+ " <th>crop</th>\n",
449
+ " <th>area_</th>\n",
450
+ " <th>production_</th>\n",
451
+ " <th>subdivision</th>\n",
452
+ " <th>jan</th>\n",
453
+ " <th>feb</th>\n",
454
+ " <th>...</th>\n",
455
+ " <th>aug</th>\n",
456
+ " <th>sep</th>\n",
457
+ " <th>oct</th>\n",
458
+ " <th>nov</th>\n",
459
+ " <th>dec</th>\n",
460
+ " <th>annual</th>\n",
461
+ " <th>jf</th>\n",
462
+ " <th>mam</th>\n",
463
+ " <th>jjas</th>\n",
464
+ " <th>ond</th>\n",
465
+ " </tr>\n",
466
+ " </thead>\n",
467
+ " <tbody>\n",
468
+ " <tr>\n",
469
+ " <th>0</th>\n",
470
+ " <td>andaman and nicobar islands</td>\n",
471
+ " <td>NICOBARS</td>\n",
472
+ " <td>2000</td>\n",
473
+ " <td>Kharif</td>\n",
474
+ " <td>Arecanut</td>\n",
475
+ " <td>1254.0</td>\n",
476
+ " <td>2000.0</td>\n",
477
+ " <td>andaman &amp; nicobar islands</td>\n",
478
+ " <td>53.0</td>\n",
479
+ " <td>59.0</td>\n",
480
+ " <td>...</td>\n",
481
+ " <td>460.8</td>\n",
482
+ " <td>250.1</td>\n",
483
+ " <td>321.2</td>\n",
484
+ " <td>158.3</td>\n",
485
+ " <td>115.2</td>\n",
486
+ " <td>2763.2</td>\n",
487
+ " <td>112.0</td>\n",
488
+ " <td>812.2</td>\n",
489
+ " <td>1244.2</td>\n",
490
+ " <td>594.7</td>\n",
491
+ " </tr>\n",
492
+ " <tr>\n",
493
+ " <th>1</th>\n",
494
+ " <td>andaman and nicobar islands</td>\n",
495
+ " <td>NICOBARS</td>\n",
496
+ " <td>2000</td>\n",
497
+ " <td>Kharif</td>\n",
498
+ " <td>Other Kharif pulses</td>\n",
499
+ " <td>2.0</td>\n",
500
+ " <td>1.0</td>\n",
501
+ " <td>andaman &amp; nicobar islands</td>\n",
502
+ " <td>53.0</td>\n",
503
+ " <td>59.0</td>\n",
504
+ " <td>...</td>\n",
505
+ " <td>460.8</td>\n",
506
+ " <td>250.1</td>\n",
507
+ " <td>321.2</td>\n",
508
+ " <td>158.3</td>\n",
509
+ " <td>115.2</td>\n",
510
+ " <td>2763.2</td>\n",
511
+ " <td>112.0</td>\n",
512
+ " <td>812.2</td>\n",
513
+ " <td>1244.2</td>\n",
514
+ " <td>594.7</td>\n",
515
+ " </tr>\n",
516
+ " <tr>\n",
517
+ " <th>2</th>\n",
518
+ " <td>andaman and nicobar islands</td>\n",
519
+ " <td>NICOBARS</td>\n",
520
+ " <td>2000</td>\n",
521
+ " <td>Kharif</td>\n",
522
+ " <td>Rice</td>\n",
523
+ " <td>102.0</td>\n",
524
+ " <td>321.0</td>\n",
525
+ " <td>andaman &amp; nicobar islands</td>\n",
526
+ " <td>53.0</td>\n",
527
+ " <td>59.0</td>\n",
528
+ " <td>...</td>\n",
529
+ " <td>460.8</td>\n",
530
+ " <td>250.1</td>\n",
531
+ " <td>321.2</td>\n",
532
+ " <td>158.3</td>\n",
533
+ " <td>115.2</td>\n",
534
+ " <td>2763.2</td>\n",
535
+ " <td>112.0</td>\n",
536
+ " <td>812.2</td>\n",
537
+ " <td>1244.2</td>\n",
538
+ " <td>594.7</td>\n",
539
+ " </tr>\n",
540
+ " <tr>\n",
541
+ " <th>3</th>\n",
542
+ " <td>andaman and nicobar islands</td>\n",
543
+ " <td>NICOBARS</td>\n",
544
+ " <td>2000</td>\n",
545
+ " <td>Whole Year</td>\n",
546
+ " <td>Banana</td>\n",
547
+ " <td>176.0</td>\n",
548
+ " <td>641.0</td>\n",
549
+ " <td>andaman &amp; nicobar islands</td>\n",
550
+ " <td>53.0</td>\n",
551
+ " <td>59.0</td>\n",
552
+ " <td>...</td>\n",
553
+ " <td>460.8</td>\n",
554
+ " <td>250.1</td>\n",
555
+ " <td>321.2</td>\n",
556
+ " <td>158.3</td>\n",
557
+ " <td>115.2</td>\n",
558
+ " <td>2763.2</td>\n",
559
+ " <td>112.0</td>\n",
560
+ " <td>812.2</td>\n",
561
+ " <td>1244.2</td>\n",
562
+ " <td>594.7</td>\n",
563
+ " </tr>\n",
564
+ " <tr>\n",
565
+ " <th>4</th>\n",
566
+ " <td>andaman and nicobar islands</td>\n",
567
+ " <td>NICOBARS</td>\n",
568
+ " <td>2000</td>\n",
569
+ " <td>Whole Year</td>\n",
570
+ " <td>Cashewnut</td>\n",
571
+ " <td>720.0</td>\n",
572
+ " <td>165.0</td>\n",
573
+ " <td>andaman &amp; nicobar islands</td>\n",
574
+ " <td>53.0</td>\n",
575
+ " <td>59.0</td>\n",
576
+ " <td>...</td>\n",
577
+ " <td>460.8</td>\n",
578
+ " <td>250.1</td>\n",
579
+ " <td>321.2</td>\n",
580
+ " <td>158.3</td>\n",
581
+ " <td>115.2</td>\n",
582
+ " <td>2763.2</td>\n",
583
+ " <td>112.0</td>\n",
584
+ " <td>812.2</td>\n",
585
+ " <td>1244.2</td>\n",
586
+ " <td>594.7</td>\n",
587
+ " </tr>\n",
588
+ " </tbody>\n",
589
+ "</table>\n",
590
+ "<p>5 rows × 25 columns</p>\n",
591
+ "</div>"
592
+ ],
593
+ "text/plain": [
594
+ " state_name district_name crop_year season \\\n",
595
+ "0 andaman and nicobar islands NICOBARS 2000 Kharif \n",
596
+ "1 andaman and nicobar islands NICOBARS 2000 Kharif \n",
597
+ "2 andaman and nicobar islands NICOBARS 2000 Kharif \n",
598
+ "3 andaman and nicobar islands NICOBARS 2000 Whole Year \n",
599
+ "4 andaman and nicobar islands NICOBARS 2000 Whole Year \n",
600
+ "\n",
601
+ " crop area_ production_ subdivision jan \\\n",
602
+ "0 Arecanut 1254.0 2000.0 andaman & nicobar islands 53.0 \n",
603
+ "1 Other Kharif pulses 2.0 1.0 andaman & nicobar islands 53.0 \n",
604
+ "2 Rice 102.0 321.0 andaman & nicobar islands 53.0 \n",
605
+ "3 Banana 176.0 641.0 andaman & nicobar islands 53.0 \n",
606
+ "4 Cashewnut 720.0 165.0 andaman & nicobar islands 53.0 \n",
607
+ "\n",
608
+ " feb ... aug sep oct nov dec annual jf mam jjas \\\n",
609
+ "0 59.0 ... 460.8 250.1 321.2 158.3 115.2 2763.2 112.0 812.2 1244.2 \n",
610
+ "1 59.0 ... 460.8 250.1 321.2 158.3 115.2 2763.2 112.0 812.2 1244.2 \n",
611
+ "2 59.0 ... 460.8 250.1 321.2 158.3 115.2 2763.2 112.0 812.2 1244.2 \n",
612
+ "3 59.0 ... 460.8 250.1 321.2 158.3 115.2 2763.2 112.0 812.2 1244.2 \n",
613
+ "4 59.0 ... 460.8 250.1 321.2 158.3 115.2 2763.2 112.0 812.2 1244.2 \n",
614
+ "\n",
615
+ " ond \n",
616
+ "0 594.7 \n",
617
+ "1 594.7 \n",
618
+ "2 594.7 \n",
619
+ "3 594.7 \n",
620
+ "4 594.7 \n",
621
+ "\n",
622
+ "[5 rows x 25 columns]"
623
+ ]
624
+ },
625
+ "metadata": {},
626
+ "output_type": "display_data"
627
+ },
628
+ {
629
+ "name": "stdout",
630
+ "output_type": "stream",
631
+ "text": [
632
+ "📊 Columns in dataset:\n",
633
+ "['state_name', 'district_name', 'crop_year', 'season', 'crop', 'area_', 'production_', 'subdivision', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'annual', 'jf', 'mam', 'jjas', 'ond']\n",
634
+ "\n",
635
+ "🏛️ Unique States in Dataset:\n",
636
+ "- andaman and nicobar islands\n",
637
+ "\n",
638
+ "🌾 Unique Crops in Dataset:\n",
639
+ "- Arecanut\n",
640
+ "- Arhar/Tur\n",
641
+ "- Banana\n",
642
+ "- Black pepper\n",
643
+ "- Cashewnut\n",
644
+ "- Coconut\n",
645
+ "- Dry chillies\n",
646
+ "- Dry ginger\n",
647
+ "- Groundnut\n",
648
+ "- Maize\n",
649
+ "- Moong(Green Gram)\n",
650
+ "- Other Kharif pulses\n",
651
+ "- Rice\n",
652
+ "- Sugarcane\n",
653
+ "- Sunflower\n",
654
+ "- Sweet potato\n",
655
+ "- Tapioca\n",
656
+ "- Turmeric\n",
657
+ "- Urad\n",
658
+ "- other oilseeds\n",
659
+ "... (Total 20 unique crops)\n",
660
+ "\n",
661
+ "📅 Crop Year Range: 2000 - 2010\n",
662
+ "\n",
663
+ "📈 Number of unique crops per state:\n"
664
+ ]
665
+ },
666
+ {
667
+ "data": {
668
+ "application/vnd.microsoft.datawrangler.viewer.v0+json": {
669
+ "columns": [
670
+ {
671
+ "name": "index",
672
+ "rawType": "int64",
673
+ "type": "integer"
674
+ },
675
+ {
676
+ "name": "state_name",
677
+ "rawType": "object",
678
+ "type": "string"
679
+ },
680
+ {
681
+ "name": "unique_crops",
682
+ "rawType": "int64",
683
+ "type": "integer"
684
+ }
685
+ ],
686
+ "ref": "16808a21-275e-4e3e-a212-22ed467e3c22",
687
+ "rows": [
688
+ [
689
+ "0",
690
+ "andaman and nicobar islands",
691
+ "20"
692
+ ]
693
+ ],
694
+ "shape": {
695
+ "columns": 2,
696
+ "rows": 1
697
+ }
698
+ },
699
+ "text/html": [
700
+ "<div>\n",
701
+ "<style scoped>\n",
702
+ " .dataframe tbody tr th:only-of-type {\n",
703
+ " vertical-align: middle;\n",
704
+ " }\n",
705
+ "\n",
706
+ " .dataframe tbody tr th {\n",
707
+ " vertical-align: top;\n",
708
+ " }\n",
709
+ "\n",
710
+ " .dataframe thead th {\n",
711
+ " text-align: right;\n",
712
+ " }\n",
713
+ "</style>\n",
714
+ "<table border=\"1\" class=\"dataframe\">\n",
715
+ " <thead>\n",
716
+ " <tr style=\"text-align: right;\">\n",
717
+ " <th></th>\n",
718
+ " <th>state_name</th>\n",
719
+ " <th>unique_crops</th>\n",
720
+ " </tr>\n",
721
+ " </thead>\n",
722
+ " <tbody>\n",
723
+ " <tr>\n",
724
+ " <th>0</th>\n",
725
+ " <td>andaman and nicobar islands</td>\n",
726
+ " <td>20</td>\n",
727
+ " </tr>\n",
728
+ " </tbody>\n",
729
+ "</table>\n",
730
+ "</div>"
731
+ ],
732
+ "text/plain": [
733
+ " state_name unique_crops\n",
734
+ "0 andaman and nicobar islands 20"
735
+ ]
736
+ },
737
+ "metadata": {},
738
+ "output_type": "display_data"
739
+ }
740
+ ],
741
+ "source": [
742
+ "# -----------------------------------------------\n",
743
+ "# 📘 Project Samarth - Phase 1: Data Discovery\n",
744
+ "# -----------------------------------------------\n",
745
+ "\n",
746
+ "import pandas as pd\n",
747
+ "\n",
748
+ "# ✅ Load merged dataset\n",
749
+ "file_path = \"../hybrid_dataset/merged_agri_rainfall.csv\" # adjust if needed\n",
750
+ "df = pd.read_csv(file_path)\n",
751
+ "\n",
752
+ "# ✅ Basic info\n",
753
+ "print(\"🔍 Dataset Loaded Successfully!\")\n",
754
+ "print(f\"Total Rows: {len(df)}\")\n",
755
+ "print(f\"Total Columns: {len(df.columns)}\\n\")\n",
756
+ "\n",
757
+ "# ✅ Display first few rows\n",
758
+ "display(df.head())\n",
759
+ "\n",
760
+ "# ✅ Show all available columns\n",
761
+ "print(\"📊 Columns in dataset:\")\n",
762
+ "print(df.columns.tolist())\n",
763
+ "\n",
764
+ "# ✅ Check unique states\n",
765
+ "if \"state_name\" in df.columns:\n",
766
+ "    states = sorted(df[\"state_name\"].dropna().unique().tolist())\n",
767
+ "    print(\"\\n🏛️ Unique States in Dataset:\")\n",
768
+ "    for s in states:\n",
769
+ "        print(\"-\", s)\n",
770
+ "\n",
771
+ "# ✅ Check unique crops\n",
772
+ "if \"crop\" in df.columns:\n",
773
+ "    crops = sorted(df[\"crop\"].dropna().unique().tolist())\n",
774
+ "    print(\"\\n🌾 Unique Crops in Dataset:\")\n",
775
+ "    for c in crops[:20]:  # limit to first 20\n",
776
+ "        print(\"-\", c)\n",
777
+ "    print(f\"... (Total {len(crops)} unique crops)\")\n",
778
+ "\n",
779
+ "# ✅ Check year range\n",
780
+ "if \"crop_year\" in df.columns:\n",
781
+ "    min_year, max_year = int(df[\"crop_year\"].min()), int(df[\"crop_year\"].max())\n",
782
+ "    print(f\"\\n📅 Crop Year Range: {min_year} - {max_year}\")\n",
783
+ "\n",
784
+ "# ✅ Quick count by state and crop\n",
785
+ "if {\"state_name\", \"crop\"} <= set(df.columns):\n",
786
+ "    summary = (\n",
787
+ "        df.groupby(\"state_name\")[\"crop\"]\n",
788
+ "        .nunique()\n",
789
+ "        .sort_values(ascending=False)\n",
790
+ "        .reset_index()\n",
791
+ "        .rename(columns={\"crop\": \"unique_crops\"})\n",
792
+ "    )\n",
793
+ "    print(\"\\n📈 Number of unique crops per state:\")\n",
794
+ "    display(summary)\n"
795
+ ]
796
+ },
797
+ {
798
+ "cell_type": "code",
799
+ "execution_count": 5,
800
+ "id": "49e9d168",
801
+ "metadata": {},
802
+ "outputs": [
803
+ {
804
+ "name": "stdout",
805
+ "output_type": "stream",
806
+ "text": [
807
+ "Columns available in this file:\n",
808
+ "['subdivision', 'year', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'annual', 'jf', 'mam', 'jjas', 'ond']\n"
809
+ ]
810
+ }
811
+ ],
812
+ "source": [
813
+ "import pandas as pd\n",
814
+ "\n",
815
+ "df = pd.read_csv(\"../hybrid_dataset/imd_rainfall_data.csv\")  # relative path, portable across machines\n",
816
+ "\n",
817
+ "print(\"Columns available in this file:\")\n",
818
+ "print(df.columns.tolist())\n"
819
+ ]
820
+ },
821
+ {
822
+ "cell_type": "code",
823
+ "execution_count": 7,
824
+ "id": "1f616bd6",
825
+ "metadata": {},
826
+ "outputs": [
827
+ {
828
+ "name": "stdout",
829
+ "output_type": "stream",
830
+ "text": [
831
+ "✅ Datasets Loaded Successfully!\n",
832
+ "\n",
833
+ "Agriculture Data Shape: (5000, 7)\n",
834
+ "IMD Rainfall Data Shape: (2000, 19)\n"
835
+ ]
836
+ }
837
+ ],
838
+ "source": [
839
+ "import pandas as pd\n",
840
+ "\n",
841
+ "# Paths to the saved datasets, relative to this notebook's folder\n",
842
+ "agri_path = \"../hybrid_dataset/agriculture_data.csv\"\n",
843
+ "imd_path = \"../hybrid_dataset/imd_rainfall_data.csv\"\n",
844
+ "\n",
845
+ "# Load both datasets\n",
846
+ "agri_df = pd.read_csv(agri_path)\n",
847
+ "imd_df = pd.read_csv(imd_path)\n",
848
+ "\n",
849
+ "print(\"✅ Datasets Loaded Successfully!\\n\")\n",
850
+ "print(f\"Agriculture Data Shape: {agri_df.shape}\")\n",
851
+ "print(f\"IMD Rainfall Data Shape: {imd_df.shape}\")\n"
852
+ ]
853
+ },
854
+ {
855
+ "cell_type": "code",
856
+ "execution_count": 8,
857
+ "id": "485e1cd8",
858
+ "metadata": {},
859
+ "outputs": [
860
+ {
861
+ "name": "stdout",
862
+ "output_type": "stream",
863
+ "text": [
864
+ "\n",
865
+ "🌾 Agriculture Data Columns:\n",
866
+ "['state_name', 'district_name', 'crop_year', 'season', 'crop', 'area_', 'production_']\n",
867
+ "\n",
868
+ "☁️ IMD Rainfall Data Columns:\n",
869
+ "['subdivision', 'year', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'annual', 'jf', 'mam', 'jjas', 'ond']\n",
870
+ "\n",
871
+ "Agriculture Data Sample:\n"
872
+ ]
873
+ },
874
+ {
875
+ "data": {
876
+ "application/vnd.microsoft.datawrangler.viewer.v0+json": {
877
+ "columns": [
878
+ {
879
+ "name": "index",
880
+ "rawType": "int64",
881
+ "type": "integer"
882
+ },
883
+ {
884
+ "name": "state_name",
885
+ "rawType": "object",
886
+ "type": "string"
887
+ },
888
+ {
889
+ "name": "district_name",
890
+ "rawType": "object",
891
+ "type": "string"
892
+ },
893
+ {
894
+ "name": "crop_year",
895
+ "rawType": "int64",
896
+ "type": "integer"
897
+ },
898
+ {
899
+ "name": "season",
900
+ "rawType": "object",
901
+ "type": "string"
902
+ },
903
+ {
904
+ "name": "crop",
905
+ "rawType": "object",
906
+ "type": "string"
907
+ },
908
+ {
909
+ "name": "area_",
910
+ "rawType": "float64",
911
+ "type": "float"
912
+ },
913
+ {
914
+ "name": "production_",
915
+ "rawType": "float64",
916
+ "type": "float"
917
+ }
918
+ ],
919
+ "ref": "0d1e680d-db1a-4e1b-a014-078e92fcf760",
920
+ "rows": [
921
+ [
922
+ "0",
923
+ "Andaman and Nicobar Islands",
924
+ "NICOBARS",
925
+ "2000",
926
+ "Kharif",
927
+ "Arecanut",
928
+ "1254.0",
929
+ "2000.0"
930
+ ],
931
+ [
932
+ "1",
933
+ "Andaman and Nicobar Islands",
934
+ "NICOBARS",
935
+ "2000",
936
+ "Kharif",
937
+ "Other Kharif pulses",
938
+ "2.0",
939
+ "1.0"
940
+ ],
941
+ [
942
+ "2",
943
+ "Andaman and Nicobar Islands",
944
+ "NICOBARS",
945
+ "2000",
946
+ "Kharif",
947
+ "Rice",
948
+ "102.0",
949
+ "321.0"
950
+ ],
951
+ [
952
+ "3",
953
+ "Andaman and Nicobar Islands",
954
+ "NICOBARS",
955
+ "2000",
956
+ "Whole Year",
957
+ "Banana",
958
+ "176.0",
959
+ "641.0"
960
+ ],
961
+ [
962
+ "4",
963
+ "Andaman and Nicobar Islands",
964
+ "NICOBARS",
965
+ "2000",
966
+ "Whole Year",
967
+ "Cashewnut",
968
+ "720.0",
969
+ "165.0"
970
+ ]
971
+ ],
972
+ "shape": {
973
+ "columns": 7,
974
+ "rows": 5
975
+ }
976
+ },
977
+ "text/html": [
978
+ "<div>\n",
979
+ "<style scoped>\n",
980
+ " .dataframe tbody tr th:only-of-type {\n",
981
+ " vertical-align: middle;\n",
982
+ " }\n",
983
+ "\n",
984
+ " .dataframe tbody tr th {\n",
985
+ " vertical-align: top;\n",
986
+ " }\n",
987
+ "\n",
988
+ " .dataframe thead th {\n",
989
+ " text-align: right;\n",
990
+ " }\n",
991
+ "</style>\n",
992
+ "<table border=\"1\" class=\"dataframe\">\n",
993
+ " <thead>\n",
994
+ " <tr style=\"text-align: right;\">\n",
995
+ " <th></th>\n",
996
+ " <th>state_name</th>\n",
997
+ " <th>district_name</th>\n",
998
+ " <th>crop_year</th>\n",
999
+ " <th>season</th>\n",
1000
+ " <th>crop</th>\n",
1001
+ " <th>area_</th>\n",
1002
+ " <th>production_</th>\n",
1003
+ " </tr>\n",
1004
+ " </thead>\n",
1005
+ " <tbody>\n",
1006
+ " <tr>\n",
1007
+ " <th>0</th>\n",
1008
+ " <td>Andaman and Nicobar Islands</td>\n",
1009
+ " <td>NICOBARS</td>\n",
1010
+ " <td>2000</td>\n",
1011
+ " <td>Kharif</td>\n",
1012
+ " <td>Arecanut</td>\n",
1013
+ " <td>1254.0</td>\n",
1014
+ " <td>2000.0</td>\n",
1015
+ " </tr>\n",
1016
+ " <tr>\n",
1017
+ " <th>1</th>\n",
1018
+ " <td>Andaman and Nicobar Islands</td>\n",
1019
+ " <td>NICOBARS</td>\n",
1020
+ " <td>2000</td>\n",
1021
+ " <td>Kharif</td>\n",
1022
+ " <td>Other Kharif pulses</td>\n",
1023
+ " <td>2.0</td>\n",
1024
+ " <td>1.0</td>\n",
1025
+ " </tr>\n",
1026
+ " <tr>\n",
1027
+ " <th>2</th>\n",
1028
+ " <td>Andaman and Nicobar Islands</td>\n",
1029
+ " <td>NICOBARS</td>\n",
1030
+ " <td>2000</td>\n",
1031
+ " <td>Kharif</td>\n",
1032
+ " <td>Rice</td>\n",
1033
+ " <td>102.0</td>\n",
1034
+ " <td>321.0</td>\n",
1035
+ " </tr>\n",
1036
+ " <tr>\n",
1037
+ " <th>3</th>\n",
1038
+ " <td>Andaman and Nicobar Islands</td>\n",
1039
+ " <td>NICOBARS</td>\n",
1040
+ " <td>2000</td>\n",
1041
+ " <td>Whole Year</td>\n",
1042
+ " <td>Banana</td>\n",
1043
+ " <td>176.0</td>\n",
1044
+ " <td>641.0</td>\n",
1045
+ " </tr>\n",
1046
+ " <tr>\n",
1047
+ " <th>4</th>\n",
1048
+ " <td>Andaman and Nicobar Islands</td>\n",
1049
+ " <td>NICOBARS</td>\n",
1050
+ " <td>2000</td>\n",
1051
+ " <td>Whole Year</td>\n",
1052
+ " <td>Cashewnut</td>\n",
1053
+ " <td>720.0</td>\n",
1054
+ " <td>165.0</td>\n",
1055
+ " </tr>\n",
1056
+ " </tbody>\n",
1057
+ "</table>\n",
1058
+ "</div>"
1059
+ ],
1060
+ "text/plain": [
1061
+ " state_name district_name crop_year season \\\n",
1062
+ "0 Andaman and Nicobar Islands NICOBARS 2000 Kharif \n",
1063
+ "1 Andaman and Nicobar Islands NICOBARS 2000 Kharif \n",
1064
+ "2 Andaman and Nicobar Islands NICOBARS 2000 Kharif \n",
1065
+ "3 Andaman and Nicobar Islands NICOBARS 2000 Whole Year \n",
1066
+ "4 Andaman and Nicobar Islands NICOBARS 2000 Whole Year \n",
1067
+ "\n",
1068
+ " crop area_ production_ \n",
1069
+ "0 Arecanut 1254.0 2000.0 \n",
1070
+ "1 Other Kharif pulses 2.0 1.0 \n",
1071
+ "2 Rice 102.0 321.0 \n",
1072
+ "3 Banana 176.0 641.0 \n",
1073
+ "4 Cashewnut 720.0 165.0 "
1074
+ ]
1075
+ },
1076
+ "metadata": {},
1077
+ "output_type": "display_data"
1078
+ },
1079
+ {
1080
+ "name": "stdout",
1081
+ "output_type": "stream",
1082
+ "text": [
1083
+ "\n",
1084
+ "🌦️ IMD Rainfall Data Sample:\n"
1085
+ ]
1086
+ },
1087
+ {
1088
+ "data": {
1089
+ "application/vnd.microsoft.datawrangler.viewer.v0+json": {
1090
+ "columns": [
1091
+ {
1092
+ "name": "index",
1093
+ "rawType": "int64",
1094
+ "type": "integer"
1095
+ },
1096
+ {
1097
+ "name": "subdivision",
1098
+ "rawType": "object",
1099
+ "type": "string"
1100
+ },
1101
+ {
1102
+ "name": "year",
1103
+ "rawType": "int64",
1104
+ "type": "integer"
1105
+ },
1106
+ {
1107
+ "name": "jan",
1108
+ "rawType": "float64",
1109
+ "type": "float"
1110
+ },
1111
+ {
1112
+ "name": "feb",
1113
+ "rawType": "float64",
1114
+ "type": "float"
1115
+ },
1116
+ {
1117
+ "name": "mar",
1118
+ "rawType": "float64",
1119
+ "type": "float"
1120
+ },
1121
+ {
1122
+ "name": "apr",
1123
+ "rawType": "float64",
1124
+ "type": "float"
1125
+ },
1126
+ {
1127
+ "name": "may",
1128
+ "rawType": "float64",
1129
+ "type": "float"
1130
+ },
1131
+ {
1132
+ "name": "jun",
1133
+ "rawType": "float64",
1134
+ "type": "float"
1135
+ },
1136
+ {
1137
+ "name": "jul",
1138
+ "rawType": "float64",
1139
+ "type": "float"
1140
+ },
1141
+ {
1142
+ "name": "aug",
1143
+ "rawType": "float64",
1144
+ "type": "float"
1145
+ },
1146
+ {
1147
+ "name": "sep",
1148
+ "rawType": "float64",
1149
+ "type": "float"
1150
+ },
1151
+ {
1152
+ "name": "oct",
1153
+ "rawType": "float64",
1154
+ "type": "float"
1155
+ },
1156
+ {
1157
+ "name": "nov",
1158
+ "rawType": "float64",
1159
+ "type": "float"
1160
+ },
1161
+ {
1162
+ "name": "dec",
1163
+ "rawType": "float64",
1164
+ "type": "float"
1165
+ },
1166
+ {
1167
+ "name": "annual",
1168
+ "rawType": "float64",
1169
+ "type": "float"
1170
+ },
1171
+ {
1172
+ "name": "jf",
1173
+ "rawType": "float64",
1174
+ "type": "float"
1175
+ },
1176
+ {
1177
+ "name": "mam",
1178
+ "rawType": "float64",
1179
+ "type": "float"
1180
+ },
1181
+ {
1182
+ "name": "jjas",
1183
+ "rawType": "float64",
1184
+ "type": "float"
1185
+ },
1186
+ {
1187
+ "name": "ond",
1188
+ "rawType": "float64",
1189
+ "type": "float"
1190
+ }
1191
+ ],
1192
+ "ref": "032f16ab-7b47-4a43-bcbc-1c17f86ef3bb",
1193
+ "rows": [
1194
+ [
1195
+ "0",
1196
+ "Andaman & Nicobar Islands",
1197
+ "1901",
1198
+ "49.2",
1199
+ "87.1",
1200
+ "29.2",
1201
+ "2.3",
1202
+ "528.8",
1203
+ "517.5",
1204
+ "365.1",
1205
+ "481.1",
1206
+ "332.6",
1207
+ "388.5",
1208
+ "558.2",
1209
+ "33.6",
1210
+ "3373.2",
1211
+ "136.3",
1212
+ "560.3",
1213
+ "1696.3",
1214
+ "980.3"
1215
+ ],
1216
+ [
1217
+ "1",
1218
+ "Andaman & Nicobar Islands",
1219
+ "1902",
1220
+ "0.0",
1221
+ "159.8",
1222
+ "12.2",
1223
+ "0.0",
1224
+ "446.1",
1225
+ "537.1",
1226
+ "228.9",
1227
+ "753.7",
1228
+ "666.2",
1229
+ "197.2",
1230
+ "359.0",
1231
+ "160.5",
1232
+ "3520.7",
1233
+ "159.8",
1234
+ "458.3",
1235
+ "2185.9",
1236
+ "716.7"
1237
+ ],
1238
+ [
1239
+ "2",
1240
+ "Andaman & Nicobar Islands",
1241
+ "1903",
1242
+ "12.7",
1243
+ "144.0",
1244
+ "0.0",
1245
+ "1.0",
1246
+ "235.1",
1247
+ "479.9",
1248
+ "728.4",
1249
+ "326.7",
1250
+ "339.0",
1251
+ "181.2",
1252
+ "284.4",
1253
+ "225.0",
1254
+ "2957.4",
1255
+ "156.7",
1256
+ "236.1",
1257
+ "1874.0",
1258
+ "690.6"
1259
+ ],
1260
+ [
1261
+ "3",
1262
+ "Andaman & Nicobar Islands",
1263
+ "1904",
1264
+ "9.4",
1265
+ "14.7",
1266
+ "0.0",
1267
+ "202.4",
1268
+ "304.5",
1269
+ "495.1",
1270
+ "502.0",
1271
+ "160.1",
1272
+ "820.4",
1273
+ "222.2",
1274
+ "308.7",
1275
+ "40.1",
1276
+ "3079.6",
1277
+ "24.1",
1278
+ "506.9",
1279
+ "1977.6",
1280
+ "571.0"
1281
+ ],
1282
+ [
1283
+ "4",
1284
+ "Andaman & Nicobar Islands",
1285
+ "1905",
1286
+ "1.3",
1287
+ "0.0",
1288
+ "3.3",
1289
+ "26.9",
1290
+ "279.5",
1291
+ "628.7",
1292
+ "368.7",
1293
+ "330.5",
1294
+ "297.0",
1295
+ "260.7",
1296
+ "25.4",
1297
+ "344.7",
1298
+ "2566.7",
1299
+ "1.3",
1300
+ "309.7",
1301
+ "1624.9",
1302
+ "630.8"
1303
+ ]
1304
+ ],
1305
+ "shape": {
1306
+ "columns": 19,
1307
+ "rows": 5
1308
+ }
1309
+ },
1310
+ "text/html": [
1311
+ "<div>\n",
1312
+ "<style scoped>\n",
1313
+ " .dataframe tbody tr th:only-of-type {\n",
1314
+ " vertical-align: middle;\n",
1315
+ " }\n",
1316
+ "\n",
1317
+ " .dataframe tbody tr th {\n",
1318
+ " vertical-align: top;\n",
1319
+ " }\n",
1320
+ "\n",
1321
+ " .dataframe thead th {\n",
1322
+ " text-align: right;\n",
1323
+ " }\n",
1324
+ "</style>\n",
1325
+ "<table border=\"1\" class=\"dataframe\">\n",
1326
+ " <thead>\n",
1327
+ " <tr style=\"text-align: right;\">\n",
1328
+ " <th></th>\n",
1329
+ " <th>subdivision</th>\n",
1330
+ " <th>year</th>\n",
1331
+ " <th>jan</th>\n",
1332
+ " <th>feb</th>\n",
1333
+ " <th>mar</th>\n",
1334
+ " <th>apr</th>\n",
1335
+ " <th>may</th>\n",
1336
+ " <th>jun</th>\n",
1337
+ " <th>jul</th>\n",
1338
+ " <th>aug</th>\n",
1339
+ " <th>sep</th>\n",
1340
+ " <th>oct</th>\n",
1341
+ " <th>nov</th>\n",
1342
+ " <th>dec</th>\n",
1343
+ " <th>annual</th>\n",
1344
+ " <th>jf</th>\n",
1345
+ " <th>mam</th>\n",
1346
+ " <th>jjas</th>\n",
1347
+ " <th>ond</th>\n",
1348
+ " </tr>\n",
1349
+ " </thead>\n",
1350
+ " <tbody>\n",
1351
+ " <tr>\n",
1352
+ " <th>0</th>\n",
1353
+ " <td>Andaman &amp; Nicobar Islands</td>\n",
1354
+ " <td>1901</td>\n",
1355
+ " <td>49.2</td>\n",
1356
+ " <td>87.1</td>\n",
1357
+ " <td>29.2</td>\n",
1358
+ " <td>2.3</td>\n",
1359
+ " <td>528.8</td>\n",
1360
+ " <td>517.5</td>\n",
1361
+ " <td>365.1</td>\n",
1362
+ " <td>481.1</td>\n",
1363
+ " <td>332.6</td>\n",
1364
+ " <td>388.5</td>\n",
1365
+ " <td>558.2</td>\n",
1366
+ " <td>33.6</td>\n",
1367
+ " <td>3373.2</td>\n",
1368
+ " <td>136.3</td>\n",
1369
+ " <td>560.3</td>\n",
1370
+ " <td>1696.3</td>\n",
1371
+ " <td>980.3</td>\n",
1372
+ " </tr>\n",
1373
+ " <tr>\n",
1374
+ " <th>1</th>\n",
1375
+ " <td>Andaman &amp; Nicobar Islands</td>\n",
1376
+ " <td>1902</td>\n",
1377
+ " <td>0.0</td>\n",
1378
+ " <td>159.8</td>\n",
1379
+ " <td>12.2</td>\n",
1380
+ " <td>0.0</td>\n",
1381
+ " <td>446.1</td>\n",
1382
+ " <td>537.1</td>\n",
1383
+ " <td>228.9</td>\n",
1384
+ " <td>753.7</td>\n",
1385
+ " <td>666.2</td>\n",
1386
+ " <td>197.2</td>\n",
1387
+ " <td>359.0</td>\n",
1388
+ " <td>160.5</td>\n",
1389
+ " <td>3520.7</td>\n",
1390
+ " <td>159.8</td>\n",
1391
+ " <td>458.3</td>\n",
1392
+ " <td>2185.9</td>\n",
1393
+ " <td>716.7</td>\n",
1394
+ " </tr>\n",
1395
+ " <tr>\n",
1396
+ " <th>2</th>\n",
1397
+ " <td>Andaman &amp; Nicobar Islands</td>\n",
1398
+ " <td>1903</td>\n",
1399
+ " <td>12.7</td>\n",
1400
+ " <td>144.0</td>\n",
1401
+ " <td>0.0</td>\n",
1402
+ " <td>1.0</td>\n",
1403
+ " <td>235.1</td>\n",
1404
+ " <td>479.9</td>\n",
1405
+ " <td>728.4</td>\n",
1406
+ " <td>326.7</td>\n",
1407
+ " <td>339.0</td>\n",
1408
+ " <td>181.2</td>\n",
1409
+ " <td>284.4</td>\n",
1410
+ " <td>225.0</td>\n",
1411
+ " <td>2957.4</td>\n",
1412
+ " <td>156.7</td>\n",
1413
+ " <td>236.1</td>\n",
1414
+ " <td>1874.0</td>\n",
1415
+ " <td>690.6</td>\n",
1416
+ " </tr>\n",
1417
+ " <tr>\n",
1418
+ " <th>3</th>\n",
1419
+ " <td>Andaman &amp; Nicobar Islands</td>\n",
1420
+ " <td>1904</td>\n",
1421
+ " <td>9.4</td>\n",
1422
+ " <td>14.7</td>\n",
1423
+ " <td>0.0</td>\n",
1424
+ " <td>202.4</td>\n",
1425
+ " <td>304.5</td>\n",
1426
+ " <td>495.1</td>\n",
1427
+ " <td>502.0</td>\n",
1428
+ " <td>160.1</td>\n",
1429
+ " <td>820.4</td>\n",
1430
+ " <td>222.2</td>\n",
1431
+ " <td>308.7</td>\n",
1432
+ " <td>40.1</td>\n",
1433
+ " <td>3079.6</td>\n",
1434
+ " <td>24.1</td>\n",
1435
+ " <td>506.9</td>\n",
1436
+ " <td>1977.6</td>\n",
1437
+ " <td>571.0</td>\n",
1438
+ " </tr>\n",
1439
+ " <tr>\n",
1440
+ " <th>4</th>\n",
1441
+ " <td>Andaman &amp; Nicobar Islands</td>\n",
1442
+ " <td>1905</td>\n",
1443
+ " <td>1.3</td>\n",
1444
+ " <td>0.0</td>\n",
1445
+ " <td>3.3</td>\n",
1446
+ " <td>26.9</td>\n",
1447
+ " <td>279.5</td>\n",
1448
+ " <td>628.7</td>\n",
1449
+ " <td>368.7</td>\n",
1450
+ " <td>330.5</td>\n",
1451
+ " <td>297.0</td>\n",
1452
+ " <td>260.7</td>\n",
1453
+ " <td>25.4</td>\n",
1454
+ " <td>344.7</td>\n",
1455
+ " <td>2566.7</td>\n",
1456
+ " <td>1.3</td>\n",
1457
+ " <td>309.7</td>\n",
1458
+ " <td>1624.9</td>\n",
1459
+ " <td>630.8</td>\n",
1460
+ " </tr>\n",
1461
+ " </tbody>\n",
1462
+ "</table>\n",
1463
+ "</div>"
1464
+ ],
1465
+ "text/plain": [
1466
+ " subdivision year jan feb mar apr may jun \\\n",
1467
+ "0 Andaman & Nicobar Islands 1901 49.2 87.1 29.2 2.3 528.8 517.5 \n",
1468
+ "1 Andaman & Nicobar Islands 1902 0.0 159.8 12.2 0.0 446.1 537.1 \n",
1469
+ "2 Andaman & Nicobar Islands 1903 12.7 144.0 0.0 1.0 235.1 479.9 \n",
1470
+ "3 Andaman & Nicobar Islands 1904 9.4 14.7 0.0 202.4 304.5 495.1 \n",
1471
+ "4 Andaman & Nicobar Islands 1905 1.3 0.0 3.3 26.9 279.5 628.7 \n",
1472
+ "\n",
1473
+ " jul aug sep oct nov dec annual jf mam jjas \\\n",
1474
+ "0 365.1 481.1 332.6 388.5 558.2 33.6 3373.2 136.3 560.3 1696.3 \n",
1475
+ "1 228.9 753.7 666.2 197.2 359.0 160.5 3520.7 159.8 458.3 2185.9 \n",
1476
+ "2 728.4 326.7 339.0 181.2 284.4 225.0 2957.4 156.7 236.1 1874.0 \n",
1477
+ "3 502.0 160.1 820.4 222.2 308.7 40.1 3079.6 24.1 506.9 1977.6 \n",
1478
+ "4 368.7 330.5 297.0 260.7 25.4 344.7 2566.7 1.3 309.7 1624.9 \n",
1479
+ "\n",
1480
+ " ond \n",
1481
+ "0 980.3 \n",
1482
+ "1 716.7 \n",
1483
+ "2 690.6 \n",
1484
+ "3 571.0 \n",
1485
+ "4 630.8 "
1486
+ ]
1487
+ },
1488
+ "metadata": {},
1489
+ "output_type": "display_data"
1490
+ }
1491
+ ],
1492
+ "source": [
1493
+ "print(\"\\n🌾 Agriculture Data Columns:\")\n",
1494
+ "print(agri_df.columns.tolist())\n",
1495
+ "\n",
1496
+ "print(\"\\n☁️ IMD Rainfall Data Columns:\")\n",
1497
+ "print(imd_df.columns.tolist())\n",
1498
+ "\n",
1499
+ "print(\"\\n📊 Agriculture Data Sample:\")\n",
1500
+ "display(agri_df.head(5))\n",
1501
+ "\n",
1502
+ "print(\"\\n🌦️ IMD Rainfall Data Sample:\")\n",
1503
+ "display(imd_df.head(5))\n"
1504
+ ]
1505
+ },
1506
+ {
1507
+ "cell_type": "code",
1508
+ "execution_count": 9,
1509
+ "id": "4d48c1f1",
1510
+ "metadata": {},
1511
+ "outputs": [
1512
+ {
1513
+ "name": "stdout",
1514
+ "output_type": "stream",
1515
+ "text": [
1516
+ "\n",
1517
+ "🏛️ Unique States in Agriculture Data:\n",
1518
+ "['Andaman and Nicobar Islands', 'Andhra Pradesh']\n",
1519
+ "\n",
1520
+ "📅 Year Range in Agriculture Data:\n",
1521
+ "1997 → 2014\n",
1522
+ "\n",
1523
+ "🌾 Top 10 Crops:\n",
1524
+ "['Arecanut', 'Other Kharif pulses', 'Rice', 'Banana', 'Cashewnut', 'Coconut', 'Dry ginger', 'Sugarcane', 'Sweet potato', 'Tapioca']\n",
1525
+ "\n",
1526
+ "🏛️ Unique Subdivisions in IMD Rainfall Data:\n",
1527
+ "['Andaman & Nicobar Islands', 'Arunachal Pradesh', 'Assam & Meghalaya', 'Naga Mani Mizo Tripura', 'Sub Himalayan West Bengal & Sikkim', 'Gangetic West Bengal', 'Orissa', 'Jharkhand', 'Bihar', 'East Uttar Pradesh', 'West Uttar Pradesh', 'Uttarakhand', 'Haryana Delhi & Chandigarh', 'Punjab', 'Himachal Pradesh', 'Jammu & Kashmir', 'West Rajasthan', 'East Rajasthan']\n",
1528
+ "\n",
1529
+ "📅 Year Range in IMD Rainfall Data:\n",
1530
+ "1901 → 2017\n"
1531
+ ]
1532
+ }
1533
+ ],
1534
+ "source": [
1535
+ "# ---- AGRICULTURE ----\n",
1536
+ "print(\"\\n🏛️ Unique States in Agriculture Data:\")\n",
1537
+ "print(agri_df['state_name'].unique().tolist())\n",
1538
+ "\n",
1539
+ "print(\"\\n📅 Year Range in Agriculture Data:\")\n",
1540
+ "if 'crop_year' in agri_df.columns:\n",
1541
+ " print(int(agri_df['crop_year'].min()), \"→\", int(agri_df['crop_year'].max()))\n",
1542
+ "\n",
1543
+ "print(\"\\n🌾 Top 10 Crops:\")\n",
1544
+ "print(agri_df['crop'].unique().tolist()[:10])\n",
1545
+ "\n",
1546
+ "# ---- IMD RAINFALL ----\n",
1547
+ "print(\"\\n🏛️ Unique Subdivisions in IMD Rainfall Data:\")\n",
1548
+ "print(imd_df['subdivision'].unique().tolist())\n",
1549
+ "\n",
1550
+ "print(\"\\n📅 Year Range in IMD Rainfall Data:\")\n",
1551
+ "if 'year' in imd_df.columns:\n",
1552
+ " print(int(imd_df['year'].min()), \"→\", int(imd_df['year'].max()))\n"
1553
+ ]
1554
+ },
1555
+ {
1556
+ "cell_type": "code",
1557
+ "execution_count": 10,
1558
+ "id": "8cfe6ee8",
1559
+ "metadata": {},
1560
+ "outputs": [
1561
+ {
1562
+ "name": "stdout",
1563
+ "output_type": "stream",
1564
+ "text": [
1565
+ "\n",
1566
+ "✅ Common Names Found Between Agriculture & IMD Data (0):\n",
1567
+ "[]\n",
1568
+ "\n",
1569
+ "⚠️ States in Agriculture but not in IMD (2):\n",
1570
+ "['andaman and nicobar islands', 'andhra pradesh']\n"
1571
+ ]
1572
+ }
1573
+ ],
1574
+ "source": [
1575
+ "# Lowercase and trim for consistency\n",
1576
+ "agri_states = set(agri_df['state_name'].str.lower().str.strip().unique())\n",
1577
+ "imd_subdiv = set(imd_df['subdivision'].str.lower().str.strip().unique())\n",
1578
+ "\n",
1579
+ "common = sorted(agri_states.intersection(imd_subdiv))\n",
1580
+ "\n",
1581
+ "print(f\"\\n✅ Common Names Found Between Agriculture & IMD Data ({len(common)}):\")\n",
1582
+ "print(common[:10])\n",
1583
+ "\n",
1584
+ "missing_from_imd = sorted(list(agri_states - imd_subdiv))\n",
1585
+ "print(f\"\\n⚠️ States in Agriculture but not in IMD ({len(missing_from_imd)}):\")\n",
1586
+ "print(missing_from_imd[:10])\n"
1587
+ ]
1588
+ },
1589
+ {
1590
+ "cell_type": "code",
1591
+ "execution_count": null,
1592
+ "id": "21653121",
1593
+ "metadata": {},
1594
+ "outputs": [],
1595
+ "source": []
1596
+ }
1597
+ ],
1598
+ "metadata": {
1599
+ "kernelspec": {
1600
+ "display_name": "myenv",
1601
+ "language": "python",
1602
+ "name": "python3"
1603
+ },
1604
+ "language_info": {
1605
+ "codemirror_mode": {
1606
+ "name": "ipython",
1607
+ "version": 3
1608
+ },
1609
+ "file_extension": ".py",
1610
+ "mimetype": "text/x-python",
1611
+ "name": "python",
1612
+ "nbconvert_exporter": "python",
1613
+ "pygments_lexer": "ipython3",
1614
+ "version": "3.13.0"
1615
+ }
1616
+ },
1617
+ "nbformat": 4,
1618
+ "nbformat_minor": 5
1619
+ }
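The last executed cell of this notebook reports zero common names between the two datasets. One visible cause is spelling: the agriculture data says "Andaman and Nicobar Islands" while the IMD data says "Andaman & Nicobar Islands". A minimal sketch of a name normaliser follows; `normalise_region` is a hypothetical helper, not part of the commit, and the lists are short excerpts of the values printed above.

```python
import re

def normalise_region(name: str) -> str:
    """Lowercase, turn '&' into 'and', and collapse repeated whitespace."""
    text = name.lower().replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()

# Excerpts of the values printed by the notebook (not the full lists).
agri_states = ["Andaman and Nicobar Islands", "Andhra Pradesh"]
imd_subdivisions = ["Andaman & Nicobar Islands", "Arunachal Pradesh", "Bihar"]

agri_keys = {normalise_region(s) for s in agri_states}
imd_keys = {normalise_region(s) for s in imd_subdivisions}

common = sorted(agri_keys & imd_keys)
print(common)  # ['andaman and nicobar islands']
```

String cleanup only helps where the two sources describe the same region. Andhra Pradesh still has no counterpart in the subdivision list printed above, so it would likely need an explicit state-to-subdivision mapping.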
notebooks/02_data_integration.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebooks/03_qna_demo.ipynb ADDED
@@ -0,0 +1,322 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "f92a389b",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "Active code page: 1252\n",
14
+ "Requirement already satisfied: pandas in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (2.3.2)\n",
15
+ "Requirement already satisfied: numpy in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (2.3.2)\n",
16
+ "Requirement already satisfied: matplotlib in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (3.10.7)\n",
17
+ "Requirement already satisfied: seaborn in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (0.13.2)\n",
18
+ "Requirement already satisfied: nltk in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (3.9.1)\n",
19
+ "Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2.9.0.post0)\n",
20
+ "Requirement already satisfied: pytz>=2020.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2025.2)\n",
21
+ "Requirement already satisfied: tzdata>=2022.7 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from pandas) (2025.2)\n",
22
+ "Requirement already satisfied: contourpy>=1.0.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (1.3.3)\n",
23
+ "Requirement already satisfied: cycler>=0.10 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (0.12.1)\n",
24
+ "Requirement already satisfied: fonttools>=4.22.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (4.60.1)\n",
25
+ "Requirement already satisfied: kiwisolver>=1.3.1 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (1.4.9)\n",
26
+ "Requirement already satisfied: packaging>=20.0 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (23.2)\n",
27
+ "Requirement already satisfied: pillow>=8 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (11.3.0)\n",
28
+ "Requirement already satisfied: pyparsing>=3 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from matplotlib) (3.2.5)\n",
29
+ "Requirement already satisfied: click in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from nltk) (8.2.1)\n",
30
+ "Requirement already satisfied: joblib in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from nltk) (1.5.2)\n",
31
+ "Requirement already satisfied: regex>=2021.8.3 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from nltk) (2025.9.1)\n",
32
+ "Requirement already satisfied: tqdm in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from nltk) (4.67.1)\n",
33
+ "Requirement already satisfied: six>=1.5 in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
34
+ "Requirement already satisfied: colorama in c:\\users\\satya\\anaconda3\\envs\\myenv\\lib\\site-packages (from click->nltk) (0.4.6)\n",
35
+ "Note: you may need to restart the kernel to use updated packages.\n"
36
+ ]
37
+ }
38
+ ],
39
+ "source": [
40
+ "pip install pandas numpy matplotlib seaborn nltk\n"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": 2,
46
+ "id": "6bf0d886",
47
+ "metadata": {},
48
+ "outputs": [
49
+ {
50
+ "name": "stderr",
51
+ "output_type": "stream",
52
+ "text": [
53
+ "[nltk_data] Downloading package stopwords to\n",
54
+ "[nltk_data] C:\\Users\\satya\\AppData\\Roaming\\nltk_data...\n",
55
+ "[nltk_data] Package stopwords is already up-to-date!\n"
56
+ ]
57
+ },
58
+ {
59
+ "data": {
60
+ "text/plain": [
61
+ "True"
62
+ ]
63
+ },
64
+ "execution_count": 2,
65
+ "metadata": {},
66
+ "output_type": "execute_result"
67
+ }
68
+ ],
69
+ "source": [
70
+ "import nltk\n",
71
+ "nltk.download('stopwords')\n"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": 7,
77
+ "id": "6c16aeb7",
78
+ "metadata": {},
79
+ "outputs": [
80
+ {
81
+ "name": "stdout",
82
+ "output_type": "stream",
83
+ "text": [
84
+ "✅ Dataset Loaded Successfully!\n",
85
+ "Rows: 203, Columns: 25\n",
86
+ "\n",
87
+ "📊 Columns: ['state_name', 'district_name', 'crop_year', 'season', 'crop', 'area_', 'production_', 'subdivision', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'annual', 'jf', 'mam', 'jjas', 'ond']\n",
88
+ "\n",
89
+ "🏛️ States: ['andaman and nicobar islands']\n",
90
+ "🔍 Parsed Query: {'states': ['andaman', 'nicobar islands'], 'crop': 'rice', 'years': 5, 'metrics': ['rainfall', 'production']}\n",
91
+ "\n",
92
+ "📊 Q&A Result Summary:\n",
93
+ " {'message': 'No matching records found for your query.'}\n",
94
+ "ℹ️ No matching records found for your query.\n",
95
+ "\n",
96
+ "✅ Notebook Execution Completed Successfully!\n"
97
+ ]
98
+ },
99
+ {
100
+ "name": "stderr",
101
+ "output_type": "stream",
102
+ "text": [
103
+ "[nltk_data] Downloading package stopwords to\n",
104
+ "[nltk_data] C:\\Users\\satya\\AppData\\Roaming\\nltk_data...\n",
105
+ "[nltk_data] Package stopwords is already up-to-date!\n"
106
+ ]
107
+ }
108
+ ],
109
+ "source": [
110
+ "# ===============================================\n",
111
+ "# 🌾 Project Samarth - Notebook 03\n",
112
+ "# Phase 2: Intelligent Q&A System (Final Fixed)\n",
113
+ "# ===============================================\n",
114
+ "\n",
115
+ "# ✅ Step 1: Import Libraries\n",
116
+ "import pandas as pd\n",
117
+ "import numpy as np\n",
118
+ "import re\n",
119
+ "import matplotlib.pyplot as plt\n",
120
+ "import seaborn as sns\n",
121
+ "import nltk\n",
122
+ "from nltk.corpus import stopwords\n",
123
+ "nltk.download('stopwords')\n",
124
+ "\n",
125
+ "# ✅ Step 2: Load Integrated Dataset\n",
126
+ "data_path = \"../hybrid_dataset/merged_agri_rainfall.csv\"\n",
127
+ "df = pd.read_csv(data_path)\n",
128
+ "\n",
129
+ "df.columns = df.columns.str.lower().str.strip()\n",
130
+ "df[\"crop_year\"] = pd.to_numeric(df[\"crop_year\"], errors=\"coerce\")\n",
131
+ "\n",
132
+ "print(\"✅ Dataset Loaded Successfully!\")\n",
133
+ "print(f\"Rows: {len(df)}, Columns: {len(df.columns)}\")\n",
134
+ "print(\"\\n📊 Columns:\", df.columns.tolist())\n",
135
+ "print(\"\\n🏛️ States:\", df['state_name'].dropna().unique().tolist())\n",
136
+ "\n",
137
+ "# ===============================================\n",
138
+ "# 🧠 Step 3: Improved NLP Query Parser\n",
139
+ "# ===============================================\n",
140
+ "\n",
141
+ "def parse_query(query: str):\n",
142
+ " \"\"\"\n",
143
+ " Improved NLP parser that cleanly extracts states, crop, years, and metrics.\n",
144
+ " Example:\n",
145
+ " 'Compare rainfall and rice production in Andaman and Nicobar Islands and Andhra Pradesh for the last 5 years'\n",
146
+ " \"\"\"\n",
147
+ " query = query.lower().strip()\n",
148
+ "\n",
149
+ " # Extract crop\n",
150
+ " crop_match = re.findall(r\"\\b(rice|wheat|maize|sugarcane|banana|cotton)\\b\", query)\n",
151
+ "\n",
152
+ " # Extract metrics\n",
153
+ " metrics = []\n",
154
+ " if \"rainfall\" in query: metrics.append(\"rainfall\")\n",
155
+ " if \"production\" in query: metrics.append(\"production\")\n",
156
+ "\n",
157
+ " # Extract year info\n",
158
+ " year_match = re.search(r\"last (\\d+)\", query)\n",
159
+ " years = int(year_match.group(1)) if year_match else 5\n",
160
+ "\n",
161
+ " # Extract state names (clean multiple cases)\n",
162
+ " state_part = re.search(r\"in (.*)\", query)\n",
163
+ " states = []\n",
164
+ " if state_part:\n",
165
+ " # Break by 'and', ',', or 'with'\n",
166
+ " parts = re.split(r\"\\band\\b|,|with\", state_part.group(1))\n",
167
+ " for p in parts:\n",
168
+ " p = p.strip()\n",
169
+ " # Stop reading after phrases like \"for the last\"\n",
170
+ " if \"for the last\" in p:\n",
171
+ " break\n",
172
+ " if p:\n",
173
+ " states.append(p.strip())\n",
174
+ "\n",
175
+ " return {\n",
176
+ " \"states\": states,\n",
177
+ " \"crop\": crop_match[0] if crop_match else None,\n",
178
+ " \"years\": years,\n",
179
+ " \"metrics\": metrics\n",
180
+ " }\n",
181
+ "\n",
182
+ "# ===============================================\n",
183
+ "# ⚙️ Step 4: Query Execution Logic\n",
184
+ "# ===============================================\n",
185
+ "\n",
186
+ "def run_query(parsed_query: dict):\n",
187
+ " \"\"\"Perform analysis using parsed query information.\"\"\"\n",
188
+ " if df.empty:\n",
189
+ " return {\"error\": \"Dataset not found or empty.\"}\n",
190
+ "\n",
191
+ " states = [s.lower() for s in parsed_query.get(\"states\", [])]\n",
192
+ " crop = parsed_query.get(\"crop\")\n",
193
+ " years = parsed_query.get(\"years\", 5)\n",
194
+ " metrics = parsed_query.get(\"metrics\", [])\n",
195
+ "\n",
196
+ " filtered = df.copy()\n",
197
+ "\n",
198
+ " if states:\n",
199
+ " filtered = filtered[filtered[\"state_name\"].str.lower().isin(states)]\n",
200
+ " if crop:\n",
201
+ " filtered = filtered[filtered[\"crop\"].str.lower() == crop]\n",
202
+ "\n",
203
+ " # Handle missing crop_year safely\n",
204
+ " if \"crop_year\" in filtered.columns and not filtered.empty:\n",
205
+ " latest_year = filtered[\"crop_year\"].max()\n",
206
+ " if pd.notna(latest_year):\n",
207
+ " latest_year = int(latest_year)\n",
208
+ " start_year = latest_year - years + 1\n",
209
+ " filtered = filtered[(filtered[\"crop_year\"] >= start_year) & (filtered[\"crop_year\"] <= latest_year)]\n",
210
+ "\n",
211
+ " if filtered.empty:\n",
212
+ " return {\"message\": \"No matching records found for your query.\"}\n",
213
+ "\n",
214
+ " result = {\"states\": states, \"crop\": crop, \"years\": years}\n",
215
+ "\n",
216
+ " # Rainfall Analysis\n",
217
+ " if \"rainfall\" in metrics:\n",
218
+ " rain_cols = [c for c in [\"annual\", \"jjas\", \"jf\", \"mam\", \"ond\"] if c in filtered.columns]\n",
219
+ " if rain_cols:\n",
220
+ " filtered[\"avg_rainfall\"] = filtered[rain_cols].apply(pd.to_numeric, errors=\"coerce\").mean(axis=1)\n",
221
+ " rainfall_summary = (\n",
222
+ " filtered.groupby(\"state_name\")[\"avg_rainfall\"].mean().round(2).to_dict()\n",
223
+ " )\n",
224
+ " result[\"rainfall_summary\"] = rainfall_summary\n",
225
+ "\n",
226
+ " # Production Analysis\n",
227
+ " if \"production\" in metrics and \"production_\" in filtered.columns:\n",
228
+ " prod_summary = (\n",
229
+ " filtered.groupby(\"state_name\")[\"production_\"].sum().round(2).to_dict()\n",
230
+ " )\n",
231
+ " result[\"production_summary\"] = prod_summary\n",
232
+ "\n",
233
+ " return result\n",
234
+ "\n",
235
+ "# ===============================================\n",
236
+ "# 🗣️ Step 5: Test with a Query\n",
237
+ "# ===============================================\n",
238
+ "\n",
239
+ "example_query = \"Compare rainfall and rice production in Andaman and Nicobar Islands and Andhra Pradesh for the last 5 years\"\n",
240
+ "\n",
241
+ "parsed = parse_query(example_query)\n",
242
+ "print(\"🔍 Parsed Query:\", parsed)\n",
243
+ "\n",
244
+ "result = run_query(parsed)\n",
245
+ "print(\"\\n📊 Q&A Result Summary:\\n\", result)\n",
246
+ "\n",
247
+ "# ===============================================\n",
248
+ "# ✅ Step 6: Display Final Answer\n",
249
+ "# ===============================================\n",
250
+ "\n",
251
+ "if \"message\" in result:\n",
252
+ " print(\"ℹ️\", result[\"message\"])\n",
253
+ "elif \"rainfall_summary\" in result or \"production_summary\" in result:\n",
254
+ " print(f\"\\n📊 Analysis for {', '.join(parsed['states'])} — Crop: {parsed['crop'].title() if parsed['crop'] else 'All'}\")\n",
255
+ "\n",
256
+ " if \"rainfall_summary\" in result:\n",
257
+ " print(\"\\n🌧️ Average Rainfall (mm):\")\n",
258
+ " for s, v in result[\"rainfall_summary\"].items():\n",
259
+ " print(f\" • {s.title()}: {v} mm\")\n",
260
+ "\n",
261
+ " if \"production_summary\" in result:\n",
262
+ " print(\"\\n🌾 Total Production (tonnes):\")\n",
263
+ " for s, v in result[\"production_summary\"].items():\n",
264
+ " print(f\" • {s.title()}: {int(v)} tonnes\")\n",
265
+ "\n",
266
+ " print(\"\\n📚 Data Source: Government Open Data Portal (data.gov.in)\")\n",
267
+ " print(\"Developed for Project Samarth — Integrating Agriculture & Climate Data 🌦️🌾\")\n",
268
+ "\n",
269
+ "print(\"\\n✅ Notebook Execution Completed Successfully!\")\n"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "code",
274
+ "execution_count": 6,
275
+ "id": "798fe45c",
276
+ "metadata": {},
277
+ "outputs": [
278
+ {
279
+ "name": "stdout",
280
+ "output_type": "stream",
281
+ "text": [
282
+ "Available States in merged dataset: ['andaman and nicobar islands']\n",
283
+ "Available Crops: ['Arecanut', 'Other Kharif pulses', 'Rice', 'Banana', 'Cashewnut', 'Coconut', 'Dry ginger', 'Sugarcane', 'Sweet potato', 'Tapioca']\n"
284
+ ]
285
+ }
286
+ ],
287
+ "source": [
288
+ "print(\"Available States in merged dataset:\", df['state_name'].unique().tolist())\n",
289
+ "print(\"Available Crops:\", df['crop'].unique().tolist()[:10])\n"
290
+ ]
291
+ },
292
+ {
293
+ "cell_type": "code",
294
+ "execution_count": null,
295
+ "id": "8c59d29d",
296
+ "metadata": {},
297
+ "outputs": [],
298
+ "source": []
299
+ }
300
+ ],
301
+ "metadata": {
302
+ "kernelspec": {
303
+ "display_name": "myenv",
304
+ "language": "python",
305
+ "name": "python3"
306
+ },
307
+ "language_info": {
308
+ "codemirror_mode": {
309
+ "name": "ipython",
310
+ "version": 3
311
+ },
312
+ "file_extension": ".py",
313
+ "mimetype": "text/x-python",
314
+ "name": "python",
315
+ "nbconvert_exporter": "python",
316
+ "pygments_lexer": "ipython3",
317
+ "version": "3.13.0"
318
+ }
319
+ },
320
+ "nbformat": 4,
321
+ "nbformat_minor": 5
322
+ }
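The recorded output of this notebook shows the parsed states as `['andaman', 'nicobar islands']`, followed by "No matching records found". The notebook's `parse_query` splits the text after "in" on the word "and", which cuts "Andaman and Nicobar Islands" in two. One possible alternative, sketched below, is to look for known state names in the query instead of splitting it. `extract_states` is a hypothetical helper, and the `known` list is an assumption; in practice it could come from the `state_name` column.

```python
import re

def extract_states(query: str, known_states: list[str]) -> list[str]:
    """Return the known state names mentioned in the query, in order of appearance."""
    text = query.lower()
    hits = []
    for state in known_states:
        # Word boundaries keep a short name from matching inside a longer word.
        match = re.search(r"\b" + re.escape(state.lower()) + r"\b", text)
        if match:
            hits.append((match.start(), state.lower()))
    return [state for _, state in sorted(hits)]

known = ["andaman and nicobar islands", "andhra pradesh"]
query = ("Compare rainfall and rice production in Andaman and Nicobar Islands "
         "and Andhra Pradesh for the last 5 years")
print(extract_states(query, known))
# ['andaman and nicobar islands', 'andhra pradesh']
```

This is the same idea `query_engine/parser.py` uses with its fixed state list; the sketch only adds word boundaries and keeps the order in which names appear.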
query_engine/__init__.py ADDED
File without changes
query_engine/logic_engine.py ADDED
@@ -0,0 +1,103 @@
+ # -----------------------------------------------------------
+ # 🌾 Project Samarth — Logic Engine (Final Polished Version)
+ # -----------------------------------------------------------
+
+ import pandas as pd
+ import numpy as np
+
+ DATA_PATH = "hybrid_dataset/merged_agri_rainfall.csv"
+
+ try:
+     df = pd.read_csv(DATA_PATH)
+     df.columns = df.columns.str.lower().str.strip()
+     # An explicit dtype avoids pandas' empty-Series warning when the column is absent
+     df["crop_year"] = pd.to_numeric(df.get("crop_year", pd.Series(dtype="float64")), errors="coerce")
+     df["state_name"] = df["state_name"].fillna("").astype(str)
+     df["crop"] = df["crop"].fillna("").astype(str)
+     print(f"✅ Dataset loaded successfully → {DATA_PATH} ({len(df)} rows)")
+ except Exception as e:
+     print(f"⚠️ Error loading dataset: {e}")
+     df = pd.DataFrame()
+
+
+ def run_query(parsed_query: dict):
+     """Executes logic for a given parsed query using integrated dataset."""
+
+     if df.empty:
+         return {"error": "Dataset not found or empty."}
+
+     if not parsed_query or not isinstance(parsed_query, dict):
+         return {"error": "Invalid query format."}
+
+     states = [s.lower().strip() for s in parsed_query.get("states", [])]
+     # parse_query sets "crop" to None when no crop is named, so guard before .lower()
+     crop = (parsed_query.get("crop") or "").lower().strip()
+     years = parsed_query.get("years", 5)
+     metrics = parsed_query.get("metrics", [])
+     result = {"states": states, "crop": crop, "years": years}
+
+     filtered = df.copy()
+
+     # ✅ Safely filter by state
+     if "state_name" in filtered.columns and states:
+         filtered = filtered[filtered["state_name"].str.lower().isin(states)]
+
+     # ✅ Safely filter by crop
+     if "crop" in filtered.columns and crop:
+         filtered = filtered[filtered["crop"].str.lower() == crop]
+
+     # ✅ Filter by year range
+     if "crop_year" in filtered.columns and not filtered["crop_year"].isna().all():
+         latest_year = int(filtered["crop_year"].max())
+         start_year = latest_year - years + 1
+         filtered = filtered[
+             (filtered["crop_year"] >= start_year)
+             & (filtered["crop_year"] <= latest_year)
+         ]
+
+     if filtered.empty:
+         return {"message": "No matching records found for your query."}
+
+     # 🌧️ Compute Average Rainfall
+     if "rainfall" in metrics:
+         rain_cols = [c for c in ["annual", "jjas", "jf", "mam", "ond"] if c in filtered.columns]
+         if rain_cols:
+             filtered["avg_rainfall"] = filtered[rain_cols].apply(
+                 pd.to_numeric, errors="coerce"
+             ).mean(axis=1)
+             rainfall_summary = (
+                 filtered.groupby("state_name")["avg_rainfall"]
+                 .mean()
+                 .round(2)
+                 .to_dict()
+             )
+             result["rainfall_summary"] = rainfall_summary
+
+     # 🌾 Compute Total Crop Production
+     if "production" in metrics and "production_" in filtered.columns:
+         prod_summary = (
+             filtered.groupby("state_name")["production_"]
+             .sum()
+             .round(2)
+             .to_dict()
+         )
+         result["production_summary"] = prod_summary
+
+     # 📊 If no data found
+     if "rainfall_summary" not in result and "production_summary" not in result:
+         result["message"] = "No metrics found in dataset for the given query."
+
+     # ✅ Format clean output
+     result["states"] = sorted([s.title() for s in result.get("states", [])])
+
+     return result
+
+
+ # 🧪 Quick test
+ if __name__ == "__main__":
+     test_query = {
+         "states": ["andaman and nicobar islands", "andhra pradesh"],
+         "crop": "rice",
+         "years": 5,
+         "metrics": ["rainfall", "production"],
+     }
+     print("\n🧠 Running test query...\n")
+     print(run_query(test_query))
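`run_query` builds `avg_rainfall` as the row-wise mean of the `annual`, `jjas`, `jf`, `mam` and `ond` columns. In the IMD sample shown in the first notebook, the four seasonal columns add up to the annual total, so that mean works out to two fifths of the annual figure. The worked example below uses the first sample row (Andaman & Nicobar Islands, 1901); the variable names are illustrative only.

```python
# First IMD sample row from the notebook output, values in mm.
row = {"annual": 3373.2, "jf": 136.3, "mam": 560.3, "jjas": 1696.3, "ond": 980.3}

# The seasonal columns sum to the annual total for this row.
seasonal_total = round(row["jf"] + row["mam"] + row["jjas"] + row["ond"], 1)

# Averaging all five columns mixes the total with its own parts.
mean_of_five = round(sum(row.values()) / len(row), 2)

print(seasonal_total)  # 3373.2
print(mean_of_five)    # 1349.28
```

If the figure shown to the user is meant to be average annual rainfall, averaging the `annual` column alone may be closer to that intent. This is a reading of the sample data, not a statement of what the author intended.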
query_engine/parser.py ADDED
@@ -0,0 +1,79 @@
+ import re
+
+ def parse_query(user_input: str):
+     """
+     🌾 Project Samarth — Query Parser (Final Version)
+     --------------------------------
+     Converts a user's natural-language question into a structured query.
+     """
+
+     query = (user_input or "").lower().strip()
+     result = {
+         "states": [],
+         "crop": None,
+         "years": 5,  # Default
+         "metrics": [],
+         "query_type": "general"
+     }
+
+     # 1️⃣ Extract number of years
+     match = re.search(r"last (\d+) years?", query)
+     if match:
+         result["years"] = int(match.group(1))
+
+     # 2️⃣ Extract states, matched against a fixed list of supported names
+     state_list = [
+         "andaman and nicobar islands", "andhra pradesh", "bihar", "jharkhand",
+         "odisha", "tamil nadu", "rajasthan", "uttar pradesh", "west bengal",
+         "kerala", "karnataka", "maharashtra"
+     ]
+     found_states = [s for s in state_list if s in query]
+     if found_states:
+         result["states"] = found_states
+
+     # 3️⃣ Extract crop (first match in list order wins)
+     crop_list = [
+         "rice", "maize", "wheat", "sugarcane", "turmeric", "banana", "groundnut",
+         "arecanut", "sunflower", "moong", "urad", "black pepper", "cashewnut"
+     ]
+     for crop in crop_list:
+         if crop in query:
+             result["crop"] = crop
+             break
+
+     # 4️⃣ Extract metrics
+     if "rainfall" in query:
+         result["metrics"].append("rainfall")
+     if "production" in query:
+         result["metrics"].append("production")
+
+     # Default metrics
+     if not result["metrics"]:
+         result["metrics"] = ["rainfall", "production"]
+
+     # 5️⃣ Determine query type
+     if "compare" in query:
+         result["query_type"] = "compare_rainfall_production"
+     elif "trend" in query:
+         result["query_type"] = "crop_trend"
+     elif "highest" in query:
+         result["query_type"] = "highest_production"
+     elif "policy" in query or "promote" in query:
+         result["query_type"] = "policy_support"
+     else:
+         result["query_type"] = "general"
+
+     return result
+
+
+ # 🧪 Quick test
+ if __name__ == "__main__":
+     queries = [
+         "Compare rainfall and rice production in Andaman and Nicobar Islands for the last 5 years",
+         "Show rainfall trend for Rice in Andhra Pradesh for the last 10 years",
+         "Which district had highest rice production in Andhra Pradesh?",
+         "Suggest policy to promote drought-resistant crops in Odisha"
+     ]
+     for q in queries:
+         print(f"\n🔍 Query: {q}")
+         print("Parsed Output:", parse_query(q))
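The parser finds crops with plain substring checks (`crop in query`), so a question containing "price" would also match "rice". A minimal sketch of a word-boundary variant is shown below. `find_crop` and the shortened `CROPS` list are hypothetical and only illustrate the idea.

```python
import re
from typing import Optional

CROPS = ["rice", "maize", "wheat", "sugarcane", "black pepper"]

def find_crop(query: str, crops: list[str] = CROPS) -> Optional[str]:
    """Return the first crop that appears in the query as a whole word or phrase."""
    text = query.lower()
    for crop in crops:
        # \b ensures 'rice' does not match inside 'price'.
        if re.search(r"\b" + re.escape(crop) + r"\b", text):
            return crop
    return None

print(find_crop("What is the wheat price trend in Bihar?"))  # wheat
print(find_crop("Average price in Kerala"))                  # None
```

With the substring check, the first query would resolve to "rice" because it comes first in the list; with word boundaries it resolves to "wheat".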
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ requests
+ pandas
+ numpy
+ matplotlib
+ plotly
+ streamlit
+ langchain
+ transformers
ui/__init__.py ADDED
File without changes
ui/app_streamlit.py ADDED
@@ -0,0 +1,91 @@
1
+ # ---------------------------------------------------
2
+ # 🌾 Project Samarth — Intelligent Q&A System
3
+ # ---------------------------------------------------
4
+
5
+ import sys, os
6
+ import streamlit as st
7
+
8
+ # ✅ Ensure Python finds your project modules
9
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)) + "/..")
10
+
11
+ from query_engine.parser import parse_query
12
+ from query_engine.logic_engine import run_query
13
+ from answer_generator.citation_manager import get_source
14
+
15
+ # ---------------------------------------------------
16
+ # ⚙️ Streamlit Page Setup
17
+ # ---------------------------------------------------
18
+ st.set_page_config(page_title="🌾 Project Samarth — Intelligent Q&A", layout="centered")
19
+
20
+ # ✅ Load custom CSS
21
+ try:
22
+ st.markdown("<style>" + open("ui/style.css").read() + "</style>", unsafe_allow_html=True)
23
+ except Exception as e:
24
+ st.warning("⚠️ Could not load CSS file. Using default Streamlit styling.")
25
+
26
+ # ---------------------------------------------------
27
+ # 🧠 Title and Info
28
+ # ---------------------------------------------------
29
+ st.title("🌾 Project Samarth — Intelligent Q&A System")
30
+ st.caption("Ask intelligent, data-driven questions about agriculture and climate using live datasets from data.gov.in.")
31
+
32
+ # ---------------------------------------------------
33
+ # ✍️ Input Section
34
+ # ---------------------------------------------------
35
+ query = st.text_area(
36
+ "🧠 Ask your question:",
37
+ height=100,
38
+ placeholder="Example: Compare rainfall and rice production in Andaman and Nicobar Islands and Andhra Pradesh for the last 5 years"
39
+ )
40
+
+ # ---------------------------------------------------
+ # 🔍 Analyze Button Logic
+ # ---------------------------------------------------
+ if st.button("🔍 Analyze"):
+     if not query.strip():
+         st.warning("Please enter a valid question.")
+     else:
+         with st.spinner("Analyzing your question..."):
+             try:
+                 # 1️⃣ Parse user query
+                 parsed_query = parse_query(query)
+
+                 # 2️⃣ Run analysis logic
+                 result = run_query(parsed_query)
+
+                 # 3️⃣ Get citation source
+                 source = get_source(parsed_query.get("query_type"))
+
+                 # 4️⃣ Display structured results
+                 st.markdown("---")
+                 st.markdown("### 📊 Result Summary")
+
+                 if "error" in result:
+                     st.error(result["error"])
+
+                 elif "message" in result:
+                     st.info(result["message"])
+
+                 else:
+                     states = parsed_query.get("states") or []  # tolerate a missing or None value
+                     crop = parsed_query.get("crop") or ""  # avoid calling .title() on None
+                     st.markdown(f"### 📊 Analysis for {', '.join(states)} — Crop: {crop.title()}")
+
+                     # 🌧️ Rainfall Summary
+                     if "rainfall_summary" in result:
+                         st.markdown("#### 🌧️ Average Rainfall (mm):")
+                         for state, value in result["rainfall_summary"].items():
+                             st.markdown(f"- **{state.title()}**: `{value}` mm")
+
+                     # 🌾 Production Summary
+                     if "production_summary" in result:
+                         st.markdown("#### 🌾 Total Production (tonnes):")
+                         for state, value in result["production_summary"].items():
+                             st.markdown(f"- **{state.title()}**: `{int(value)}` tonnes")
+
+                 st.markdown("---")
+                 st.markdown(f"📚 **Data Source:** {source}")
+                 st.caption("Developed for Project Samarth — Integrating Agriculture & Climate Data 🌦️🌾")
+
+             except Exception as e:
+                 st.error(f"❌ Something went wrong: {e}")
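The display branch above mixes formatting with Streamlit calls, which makes it awkward to test without a running app. A minimal sketch of one way to separate the two, assuming the `result` dict has the shape used above (`error`, `message`, `rainfall_summary`, `production_summary`). The name `format_result` is hypothetical and not part of this commit, and the sample numbers are placeholders, not real data.

```python
def format_result(result, states, crop):
    """Return Markdown lines for a result dict (hypothetical helper, not in the commit)."""
    if "error" in result:
        return [f"Error: {result['error']}"]
    if "message" in result:
        return [result["message"]]

    # Heading mirrors the one rendered by the app; crop may be None or empty.
    lines = [f"### Analysis for {', '.join(states)}, crop: {(crop or '').title()}"]

    rainfall = result.get("rainfall_summary")
    if rainfall:
        lines.append("#### Average Rainfall (mm):")
        lines += [f"- **{s.title()}**: `{v}` mm" for s, v in rainfall.items()]

    production = result.get("production_summary")
    if production:
        lines.append("#### Total Production (tonnes):")
        lines += [f"- **{s.title()}**: `{int(v)}` tonnes" for s, v in production.items()]

    return lines


# Placeholder values for illustration only.
sample = {
    "rainfall_summary": {"andhra pradesh": 912.4},
    "production_summary": {"andhra pradesh": 7450000.0},
}
rendered = format_result(sample, ["Andhra Pradesh"], "rice")
print("\n".join(rendered))
```

In the app, each returned line could then be passed to `st.markdown`, which keeps the UI code to a short loop.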
ui/style.css ADDED
@@ -0,0 +1,56 @@
+ /* 🌾 Project Samarth – Smart Minimal UI Styling */
+
+ body {
+     font-family: 'Poppins', 'Segoe UI', sans-serif;
+     background-color: #fafafa;
+     color: #333;
+ }
+
+ h1, h2, h3 {
+     color: #2e7d32;
+     text-align: center;
+     font-weight: 600;
+ }
+
+ textarea, input {
+     border-radius: 8px;
+     border: 1px solid #070707;
+     padding: 10px;
+     width: 100%;
+     background-color: #fff; /* changed to white */
+     color: #000; /* text color black */
+ }
+
+ button, .stButton>button {
+     background-color: #4caf50 !important;
+     color: #fff !important;
+     border: none !important;
+     padding: 8px 16px !important;
+     border-radius: 8px !important;
+     cursor: pointer !important;
+     transition: background-color 0.3s ease;
+ }
+
+ button:hover, .stButton>button:hover {
+     background-color: #2e7d32 !important;
+ }
+
+ hr {
+     border: none;
+     border-top: 1px solid #050505;
+     margin: 20px 0;
+ }
+
+ .stTextArea textarea {
+     border: 1px solid #ccc !important;
+     border-radius: 8px !important;
+     background-color: #fff !important;
+     color: #000 !important; /* black text color inside text area */
+ }
+
+ footer {
+     text-align: center;
+     color: #777;
+     font-size: 13px;
+     margin-top: 20px;
+ }
utils/__init__.py ADDED
File without changes
utils/helper.py ADDED
@@ -0,0 +1,19 @@
+ # utils/helper.py
+ # Basic helper functions used across the project
+
+ import pandas as pd
+
+ def load_csv(path):
+     """Load a CSV file; return an empty DataFrame if the file does not exist."""
+     try:
+         df = pd.read_csv(path)
+         print(f"✅ Loaded file: {path} ({len(df)} rows)")
+         return df
+     except FileNotFoundError:
+         print(f"⚠️ File not found: {path}")
+         return pd.DataFrame()
+
+ def save_csv(df, path):
+     """Save a DataFrame as CSV without the index column."""
+     df.to_csv(path, index=False)
+     print(f"💾 Data saved to {path}")
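`load_csv` above handles a missing file, but a file that exists and is completely empty makes `pd.read_csv` raise `pandas.errors.EmptyDataError` instead. A minimal sketch of a variant that treats both cases as "no data"; the name `load_csv_or_empty` is hypothetical and not part of this commit, and the sample row is a placeholder rather than real rainfall data.

```python
import os
import tempfile

import pandas as pd


def load_csv_or_empty(path):
    """Hypothetical variant of load_csv: a missing or empty file yields an empty DataFrame."""
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return pd.DataFrame()


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "rainfall.csv")
    empty = os.path.join(tmp, "empty.csv")

    # Placeholder row for illustration only.
    pd.DataFrame({"state": ["Kerala"], "rainfall_mm": [1.0]}).to_csv(good, index=False)
    open(empty, "w").close()  # zero-byte file

    loaded = load_csv_or_empty(good)
    missing = load_csv_or_empty(os.path.join(tmp, "does_not_exist.csv"))
    blank = load_csv_or_empty(empty)

print(len(loaded), missing.empty, blank.empty)
```

Callers can then check `df.empty` once instead of distinguishing the two failure modes.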
utils/visualizer.py ADDED
@@ -0,0 +1,10 @@
+ import matplotlib.pyplot as plt
+
+ def plot_trend(df, x_col, y_col, title):
+     plt.figure(figsize=(8, 5))
+     plt.plot(df[x_col], df[y_col], marker='o')
+     plt.title(title)
+     plt.xlabel(x_col)
+     plt.ylabel(y_col)
+     plt.grid(True)
+     plt.show()
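`plot_trend` ends with `plt.show()`, which opens a window in an interactive session but does not render anything inside a Streamlit page. If the chart is meant to appear in the app (an assumption, since this commit does not show where `plot_trend` is called), one option is to return the figure and let the caller pass it to `st.pyplot`. A minimal sketch; `build_trend_figure` is a hypothetical name and the data points are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required

import matplotlib.pyplot as plt


def build_trend_figure(df, x_col, y_col, title):
    """Hypothetical variant of plot_trend that returns the figure instead of showing it."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(df[x_col], df[y_col], marker="o")
    ax.set_title(title)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.grid(True)
    return fig


# Any mapping of column name to sequence works here, including a pandas DataFrame.
data = {"year": [2019, 2020, 2021], "rainfall_mm": [1.0, 2.0, 3.0]}  # placeholder values
fig = build_trend_figure(data, "year", "rainfall_mm", "Rainfall trend")
```

In a Streamlit script the returned figure could be displayed with `st.pyplot(fig)`.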