---

title: EDA Explorer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.16.0
python_version: "3.10"
app_file: app.py
pinned: false
---

# 🚀 EDA Explorer – AI-Powered Data Analysis CLI

A lightweight CLI tool that automates exploratory data analysis (EDA) with intelligent insights, feature importance detection, and data quality checks.

Designed to simulate how an **AI Data Analyst** approaches real-world datasets during EDA.

---

## ⚡ Key Highlights

- 🔍 One-command analysis → `analyze <dataset>`
- 🧠 Auto target detection for ML-based insights
- 📈 Feature importance (no manual setup)
- ⚠️ Smart data warnings (missing, ID columns, constants)
- 📊 Correlation & outlier detection
- 📝 Auto report generation (.txt)
- ⚡ Efficient handling of large datasets (Parquet + sampling)
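
As a rough illustration of the "smart data warnings" highlight, the sketch below flags mostly-missing, constant, and ID-like columns with pandas. The function name `data_warnings`, the 50% threshold, and the "all values unique" rule for ID columns are assumptions made for this example, not the tool's confirmed logic.

```python
import pandas as pd


def data_warnings(df: pd.DataFrame, missing_threshold: float = 0.5) -> list[str]:
    """Return readable warnings for common data-quality issues (illustrative heuristics)."""
    warnings = []
    n_rows = len(df)
    for col in df.columns:
        series = df[col]
        missing_ratio = series.isna().mean() if n_rows else 0.0
        n_unique = series.nunique(dropna=True)
        # Assumed rule: warn when at least `missing_threshold` of the values are missing.
        if missing_ratio >= missing_threshold:
            warnings.append(f"{col} -> {missing_ratio:.0%} missing values")
        # A column with at most one distinct value carries no information.
        if n_unique <= 1:
            warnings.append(f"{col} -> constant column")
        # Assumed rule: a non-float column whose values are all distinct looks like an ID.
        elif n_unique == n_rows and not pd.api.types.is_float_dtype(series):
            warnings.append(f"{col} -> likely ID column")
    return warnings


example = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "income": [52000.0, None, None, 61000.0],
    "country": ["US", "US", "US", "US"],
    "age": [34, 45, 34, 51],
})
for line in data_warnings(example):
    print(line)
```

Heuristics like these can misfire on small tables, where every value may be unique by chance, so a real implementation would likely add a minimum row count or name-based hints.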

---

## 🎬 Demo

👉 Full demo: https://github.com/user-attachments/assets/7dff8329-71e8-4bca-ad01-404e75df8314

---

## 📊 Example Output

Top Correlations:
- age ↔ income: 0.72
- tenure ↔ balance: 0.65

⚠️ Data Warnings:
- customer_id → likely ID column
- income → 52% missing values

📈 Feature Importance:
- age: 0.41 (strong signal)
- tenure: 0.32 (strong signal)
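
A "Top Correlations" listing like the one above could be computed roughly as follows. This is a minimal sketch using Pearson correlation from pandas; the function name `top_correlations`, the cutoff `min_abs`, and the sample data are hypothetical and not taken from the tool.

```python
from itertools import combinations

import pandas as pd


def top_correlations(df: pd.DataFrame, k: int = 5, min_abs: float = 0.5):
    """Return up to k (col_a, col_b, r) tuples, strongest absolute correlation first."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    pairs = []
    # Visit each unordered column pair once, which skips the diagonal and mirror entries.
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        # NaN correlations (e.g. constant columns) fail this comparison and are dropped.
        if abs(r) >= min_abs:
            pairs.append((a, b, round(float(r), 2)))
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:k]


sample = pd.DataFrame({
    "a": range(10),
    "b": [2 * x + 1 for x in range(10)],        # perfectly correlated with "a"
    "c": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],   # only weakly related to "a" and "b"
})
for col_a, col_b, r in top_correlations(sample):
    print(f"{col_a} <-> {col_b}: {r}")
```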





---



## 🧠 What Makes It Stand Out

- Automatically identifies **useful vs irrelevant features**
- No manual preprocessing required
- Mimics real-world **data analyst reasoning**
- Built using a **modular agent-based system**
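
One plausible way to separate useful from irrelevant features is to guess a target column and rank the remaining numeric columns by tree-based importance. The sketch below assumes a classification target, a name-based target heuristic, and a scikit-learn random forest; `guess_target`, `feature_importance`, and the list of candidate target names are all hypothetical, and the tool's actual method may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical list of column names that commonly hold a label.
COMMON_TARGET_NAMES = ("target", "label", "class", "churn", "survived")


def guess_target(df: pd.DataFrame) -> str:
    """Pick a column with a well-known label name, else fall back to the last column."""
    for col in df.columns:
        if str(col).lower() in COMMON_TARGET_NAMES:
            return col
    return df.columns[-1]


def feature_importance(df: pd.DataFrame, target=None, seed: int = 0) -> pd.Series:
    """Rank numeric features by random-forest importance (classification assumed)."""
    target = target or guess_target(df)
    # Crude preprocessing for the sketch: numeric columns only, missing values set to 0.
    X = df.drop(columns=[target]).select_dtypes(include="number").fillna(0)
    y = df[target]
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)


# Synthetic data: "signal" determines the label, "noise" does not.
rng = np.random.default_rng(0)
signal = rng.normal(size=300)
noise = rng.normal(size=300)
demo = pd.DataFrame({"signal": signal, "noise": noise, "churn": (signal > 0).astype(int)})
print(feature_importance(demo))
```

Impurity-based importances tend to favor high-cardinality features, so permutation importance may be a safer choice when ID-like columns have not been removed first.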



---



## ⚡ Performance

- Parquet-based storage for faster I/O
- Sampling strategy for large datasets
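
The two points above might combine along these lines: read only the needed columns from a Parquet file, then cap the row count with a reproducible random sample. The helper names, the 100,000-row cap, and the fixed seed are assumptions for illustration; `pd.read_parquet` additionally needs `pyarrow` or `fastparquet` installed.

```python
import pandas as pd


def sample_frame(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def load_for_analysis(path: str, max_rows: int = 100_000, columns=None) -> pd.DataFrame:
    """Load a Parquet file (optionally a subset of columns) and cap its size."""
    # Parquet is columnar, so restricting `columns` avoids reading unused data.
    df = pd.read_parquet(path, columns=columns)
    return sample_frame(df, max_rows=max_rows)


# In-memory demonstration of the sampling step only.
big = pd.DataFrame({"value": range(1_000)})
small = sample_frame(big, max_rows=100)
print(len(big), "->", len(small))
```

Uniform sampling keeps most summary statistics stable but can drop rare classes, so a fraud-style dataset may call for stratified sampling instead.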



---



## 🛠️ System Design

- Command handler
- Dataset registry
- Modular agents (AnalysisAgent, etc.)
- Logger integration
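
To show how these pieces could fit together, here is a minimal sketch of a command handler that resolves a dataset name through a registry and dispatches to an agent. Only the name `AnalysisAgent` comes from this README; the class interfaces, the logger name, and the file path are hypothetical.

```python
import logging

logger = logging.getLogger("eda_explorer")  # hypothetical logger name


class DatasetRegistry:
    """Maps short dataset names to file paths."""

    def __init__(self) -> None:
        self._paths: dict[str, str] = {}

    def register(self, name: str, path: str) -> None:
        self._paths[name.lower()] = path

    def resolve(self, name: str) -> str:
        try:
            return self._paths[name.lower()]
        except KeyError:
            raise KeyError(f"unknown dataset: {name}") from None


class AnalysisAgent:
    """Stand-in agent; a real one would load the data and run the EDA steps."""

    def run(self, path: str) -> str:
        return f"analysis report for {path}"


class CommandHandler:
    """Parses '<command> <dataset>' and dispatches to the matching agent."""

    def __init__(self, registry: DatasetRegistry, agents: dict) -> None:
        self._registry = registry
        self._agents = agents

    def handle(self, line: str) -> str:
        parts = line.split()
        if len(parts) != 2:
            return "usage: <command> <dataset>"
        command, dataset = parts
        agent = self._agents.get(command)
        if agent is None:
            return f"unknown command: {command}"
        try:
            path = self._registry.resolve(dataset)
        except KeyError as exc:
            return str(exc.args[0])
        logger.info("running %s on %s", command, path)
        return agent.run(path)


registry = DatasetRegistry()
registry.register("titanic", "data/titanic.parquet")  # hypothetical path
handler = CommandHandler(registry, {"analyze": AnalysisAgent()})
print(handler.handle("analyze titanic"))
```

Keeping agents behind a plain `run(path)` interface means new commands can be added by registering another agent, without touching the handler.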



---



## 📦 Datasets

- Titanic
- Customer Churn
- Credit Card Fraud



---



## 🛠️ Tech Stack

- Python
- Pandas, NumPy
- Scikit-learn
- Parquet



---



## 🚀 Future Enhancements

- RAG-based EDA advisor
- SQL query assistant
- Model training pipeline