Upload 6 files

Browse files
- LICENSE +21 -0
- README.md +130 -12
- app.py +1865 -0
- eda_analysis.py +479 -0
- llm_inference.py +377 -0
- requirements.txt +13 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Vasudev Sharma

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,14 +1,132 @@
# AI-Powered EDA & Feature Engineering Assistant

An interactive application that uses AI to analyze datasets and provide comprehensive exploratory data analysis (EDA) insights and feature engineering recommendations.

## 🌟 Features

- **🤖 AI-Powered Analysis**: Receive detailed EDA insights generated by an LLM (Llama3-8b-8192 via Groq)
- **📊 Automated Visualizations**: Generate key visualizations with a single click
- **🔧 Feature Engineering Recommendations**: Get AI suggestions for improving your data
- **⚠️ Data Quality Assessment**: Identify issues in your dataset and receive advice on fixing them
- **💬 Chat Interface**: Ask questions about your dataset and get AI-powered answers
- **🌙 Dark Mode UI**: Sleek, modern dark interface for comfortable analysis

## 📋 Demo

Here's a quick look at what you can do:

1. Upload a CSV dataset
2. Get automatic visualizations and statistics
3. Generate AI-powered insights for:
   - Exploratory Data Analysis
   - Feature Engineering Recommendations
   - Data Quality Assessment
4. Chat with your data to ask specific questions

## 🛠️ Tech Stack

- **Frontend**: Streamlit
- **Data Processing**: Pandas, NumPy, Matplotlib, Seaborn
- **AI Integration**: LangChain + Groq API
- **LLM Model**: Llama3-8b-8192
## 📦 Installation

### Prerequisites

- Python 3.8+
- Anaconda or Miniconda (recommended)
- Groq API key

### Setup

1. Clone the repository:

```bash
git clone https://github.com/vashu2425/AI-Powered-EDA-Feature-Engineering-Assistant.git
cd AI-Powered-EDA-Feature-Engineering-Assistant
```

2. Create and activate a conda environment:

```bash
conda create -n ai_eda_env python=3.10
conda activate ai_eda_env
```

3. Install the required packages:

```bash
pip install -r requirements.txt
```

4. Create a `.env` file with your Groq API key:

```
GROQ_API_KEY=your_groq_api_key_here
```
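As a quick sanity check after step 4, you can confirm the key is actually picked up from the environment. This is a minimal sketch, not part of the repository; the `gsk_test` value below is a placeholder:

```python
import os

def get_groq_key(env=os.environ):
    # Read the key the app expects; fail early with a clear message if it's absent.
    key = env.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set - check your .env file")
    return key

# Simulated environment so the check is self-contained
assert get_groq_key({"GROQ_API_KEY": "gsk_test"}) == "gsk_test"
```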

### ⚠️ Compatibility Note

This application requires specific versions of NumPy (1.24.3) and pandas (1.5.3) to avoid binary compatibility issues; requirements.txt pins these versions to ensure a smooth installation.

### 🔧 Troubleshooting

If you encounter the following error:

```
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```

Try the following solutions:

1. Make sure you're using the exact versions specified in requirements.txt:

```bash
pip install numpy==1.24.3 pandas==1.5.3
```

2. If you're running Streamlit 1.27.0 or newer, `st.experimental_rerun()` is deprecated; replace any calls to it with `st.rerun()`.

3. If you're still having issues, try creating a fresh conda environment with Python 3.10:

```bash
conda create -n fresh_ai_eda_env python=3.10
conda activate fresh_ai_eda_env
pip install -r requirements.txt
```
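The version check in step 1 can be automated. Here is a small, self-contained sketch; the fake resolver and the `1.5.2` value are illustrative, standing in for whatever is actually installed:

```python
from importlib.metadata import version  # stdlib since Python 3.8

def check_pins(pins, get_version=version):
    """Return (package, expected, installed) tuples for any mismatched pins."""
    mismatches = []
    for pkg, expected in pins.items():
        installed = get_version(pkg)
        if installed != expected:
            mismatches.append((pkg, expected, installed))
    return mismatches

# Fake resolver so the example runs without the packages installed
fake = {"numpy": "1.24.3", "pandas": "1.5.2"}.get
print(check_pins({"numpy": "1.24.3", "pandas": "1.5.3"}, get_version=fake))
# -> [('pandas', '1.5.3', '1.5.2')]
```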

## 🚀 Usage

1. Activate the conda environment:

```bash
conda activate ai_eda_env
```

2. Run the application:

```bash
streamlit run app.py
```

3. Open your web browser and navigate to `http://localhost:8501`

4. Upload a CSV dataset and start exploring!

## 📊 Example Analysis

Here are some examples of insights you can get:

- Comprehensive EDA insights about your dataset's variables and distributions
- Feature engineering ideas specific to your data
- Data quality improvement recommendations
- Visualizations including correlation heatmaps, distribution plots, and more
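The kind of data-quality check the app automates can be sketched in plain Python. This simplified example counts missing cells per column with only the standard library (the real app uses pandas); the sample CSV is illustrative:

```python
import csv
import io

def missing_counts(csv_text):
    # Count empty cells per column - a basic data-quality signal.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = {name: 0 for name in rows[0]} if rows else {}
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

data = "age,city\n34,Paris\n,London\n41,\n"
print(missing_counts(data))  # -> {'age': 1, 'city': 1}
```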

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📬 Contact

For any questions or feedback, please reach out to the repository owner.

---

### 🌟 Star this repository if you find it useful!
app.py
ADDED
@@ -0,0 +1,1865 @@
"""
AI-Powered EDA & Feature Engineering Assistant

This application enables users to upload a CSV dataset, and utilizes LLMs to analyze
the dataset to provide EDA and feature engineering recommendations.
"""

import streamlit as st
import pandas as pd
import os
import base64
from io import BytesIO
from dotenv import load_dotenv
from typing import Dict, List, Any, Optional
import time
import logging
import plotly.express as px
import numpy as np

# Import local modules
from eda_analysis import DatasetAnalyzer
from llm_inference import LLMInference

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Set page configuration - must be the first Streamlit command
st.set_page_config(
    page_title="AI-Powered EDA & Feature Engineering Assistant",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Initialize our classes
@st.cache_resource
def get_llm_inference():
    try:
        return LLMInference()
    except Exception as e:
        st.error(f"Error initializing LLM inference: {str(e)}")
        return None

llm_inference = get_llm_inference()
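An editorial aside on the decorator above: `@st.cache_resource` constructs `LLMInference` once and returns the same object on every rerun. Outside Streamlit, that single-construction behaviour reduces to ordinary memoization, sketched here with the standard library (`get_resource` is a hypothetical stand-in):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_resource():
    # The body runs only on the first call; later calls return the cached
    # object, roughly what @st.cache_resource provides in a Streamlit app.
    return object()

assert get_resource() is get_resource()  # same instance every time
```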

# Session state initialization
if "dataset_analyzer" not in st.session_state:
    st.session_state.dataset_analyzer = DatasetAnalyzer()

if "dataset_loaded" not in st.session_state:
    st.session_state.dataset_loaded = False

if "dataset_info" not in st.session_state:
    st.session_state.dataset_info = {}

if "visualizations" not in st.session_state:
    st.session_state.visualizations = {}

if "eda_insights" not in st.session_state:
    st.session_state.eda_insights = ""

if "feature_engineering_recommendations" not in st.session_state:
    st.session_state.feature_engineering_recommendations = ""

if "data_quality_insights" not in st.session_state:
    st.session_state.data_quality_insights = ""

if "active_tab" not in st.session_state:
    st.session_state.active_tab = "welcome"
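The repeated `if key not in st.session_state` guards above implement default-if-absent: a value is set only when missing, so a rerun never overwrites state the user already changed. With a plain dict standing in for `st.session_state`, the pattern reduces to `dict.setdefault` (a sketch, not app code):

```python
def init_defaults(state, defaults):
    # Set each key only if absent, so a rerun never clobbers existing state.
    for key, value in defaults.items():
        state.setdefault(key, value)
    return state

state = {"dataset_loaded": True}  # value carried over from a previous rerun
init_defaults(state, {"dataset_loaded": False, "eda_insights": ""})
print(state)  # -> {'dataset_loaded': True, 'eda_insights': ''}
```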
# Add new functions to support the updated UI
def initialize_session_state():
    """Initialize session state variables needed for the application"""
    # Initialize session variables with appropriate defaults
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []

    # For dataframe and related variables, ensure proper initialization;
    # df should not be in session_state until a proper DataFrame is loaded
    if "descriptive_stats" not in st.session_state:
        st.session_state.descriptive_stats = None

    if "selected_columns" not in st.session_state:
        st.session_state.selected_columns = []

    if "filtered_df" not in st.session_state:
        st.session_state.filtered_df = None

    if "ai_insights" not in st.session_state:
        st.session_state.ai_insights = None

    if "loading_insights" not in st.session_state:
        st.session_state.loading_insights = False

    if "selected_tab" not in st.session_state:
        st.session_state.selected_tab = 'tab-overview'

    if "dataset_name" not in st.session_state:
        st.session_state.dataset_name = ""

    # Logging initialization
    logger.info("Session state initialized")

def apply_custom_css():
    """Apply additional custom CSS that's not already in the main CSS block"""
    st.markdown("""
    <style>
    /* Base theme variables */
    :root {
        --primary: #4F46E5;
        --secondary: #06B6D4;
        --text-light: #F3F4F6;
        --text-muted: #9CA3AF;
        --bg-card: rgba(31, 41, 55, 0.7);
        --bg-dark: #111827;
    }

    /* Global styles */
    .stApp {
        background-color: var(--bg-dark);
        color: var(--text-light);
    }

    /* Improve sidebar styling */
    .sidebar-header {
        background: linear-gradient(90deg, var(--primary), var(--secondary));
        color: white;
        padding: 1rem;
        border-radius: 8px;
        margin-bottom: 1.5rem;
        font-size: 1.2rem;
        font-weight: 600;
        text-align: center;
    }

    .sidebar-section {
        background: rgba(31, 41, 55, 0.4);
        border-radius: 8px;
        padding: 1rem;
        margin-bottom: 1.5rem;
        border: 1px solid rgba(99, 102, 241, 0.1);
    }

    .sidebar-footer {
        text-align: center;
        padding: 1rem;
        font-size: 0.8rem;
        color: var(--text-muted);
        margin-top: 3rem;
    }

    /* Feature Engineering Cards */
    .fe-cards-container {
        display: grid;
        grid-template-columns: repeat(2, 1fr);
        gap: 0.8rem;
        margin-top: 1rem;
    }

    .fe-card {
        background: rgba(31, 41, 55, 0.6);
        border-radius: 8px;
        padding: 0.8rem;
        text-align: center;
        cursor: pointer;
        transition: all 0.2s ease;
        border: 1px solid rgba(99, 102, 241, 0.1);
        position: relative;
        overflow: hidden;
    }

    .fe-card::before {
        content: '';
        position: absolute;
        top: 0;
        left: 0;
        right: 0;
        bottom: 0;
        background: linear-gradient(135deg, var(--primary), var(--secondary));
        opacity: 0;
        transition: opacity 0.3s ease;
        z-index: 0;
    }

    .fe-card:hover::before {
        opacity: 0.1;
    }

    .fe-card:hover {
        transform: translateY(-2px);
        box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
        border-color: rgba(99, 102, 241, 0.3);
    }

    .fe-card-active {
        border-color: var(--primary);
        background: rgba(79, 70, 229, 0.1);
    }

    .fe-card-icon {
        font-size: 1.8rem;
        margin-bottom: 0.3rem;
        position: relative;
        z-index: 1;
    }

    .fe-card-title {
        font-size: 0.85rem;
        font-weight: 600;
        color: var(--text-light);
        position: relative;
        z-index: 1;
    }

    /* Tab content styling */
    .tab-title {
        font-size: 1.8rem;
        margin-bottom: 1.5rem;
        position: relative;
        display: inline-block;
        color: var(--text-light);
    }

    .tab-title:after {
        content: '';
        position: absolute;
        bottom: -10px;
        left: 0;
        width: 100%;
        height: 3px;
        background: linear-gradient(90deg, var(--primary) 0%, var(--secondary) 100%);
        border-radius: 3px;
    }

    /* Navigation Tabs */
    .custom-tabs {
        display: flex;
        background: rgba(31, 41, 55, 0.6);
        border-radius: 12px;
        padding: 0.5rem;
        margin-bottom: 2rem;
        justify-content: space-between;
        overflow: hidden;
        border: 1px solid rgba(99, 102, 241, 0.1);
    }

    .tab-item {
        flex: 1;
        text-align: center;
        padding: 0.8rem 0.5rem;
        border-radius: 8px;
        cursor: pointer;
        transition: all 0.3s ease;
        position: relative;
        z-index: 1;
        margin: 0 0.2rem;
    }

    .tab-item.active {
        background: rgba(79, 70, 229, 0.1);
    }

    .tab-item.active::before {
        content: '';
        position: absolute;
        bottom: 0;
        left: 10%;
        right: 10%;
        height: 3px;
        background: linear-gradient(90deg, var(--primary), var(--secondary));
        border-radius: 3px;
    }

    .tab-item:hover {
        background: rgba(79, 70, 229, 0.05);
    }

    .tab-icon {
        font-size: 1.5rem;
        margin-bottom: 0.3rem;
    }

    .tab-label {
        font-size: 0.85rem;
        font-weight: 500;
        color: var(--text-light);
    }

    .tab-content-spacer {
        height: 1rem;
    }

    /* Card styling */
    .stats-card, .info-card, .chart-card {
        background: rgba(31, 41, 55, 0.3);
        border-radius: 10px;
        padding: 1.2rem;
        margin-bottom: 1.5rem;
        border: 1px solid rgba(99, 102, 241, 0.1);
        transition: all 0.3s ease;
    }

    .stats-card:hover, .info-card:hover, .chart-card:hover {
        transform: translateY(-5px);
        box-shadow: 0 8px 15px rgba(0, 0, 0, 0.2);
        border-color: rgba(99, 102, 241, 0.3);
    }

    /* Dataset stats styling */
    .dataset-stats {
        display: flex;
        flex-wrap: wrap;
        gap: 0.8rem;
        justify-content: center;
    }

    .stat-item {
        text-align: center;
        padding: 0.8rem;
        background: rgba(31, 41, 55, 0.6);
        border-radius: 8px;
        min-width: 80px;
        border: 1px solid rgba(99, 102, 241, 0.2);
    }

    .stat-value {
        font-size: 1.5rem;
        font-weight: 700;
        color: var(--primary);
    }

    .stat-label {
        font-size: 0.8rem;
        color: var(--text-muted);
        margin-top: 0.3rem;
    }

    /* Chart styling */
    .chart-container {
        margin-top: 1.5rem;
    }

    .chart-card h3 {
        font-size: 1.2rem;
        margin-bottom: 1rem;
        color: var(--text-light);
    }

    .stat-summary {
        display: grid;
        grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
        gap: 0.5rem;
        margin-top: 1rem;
    }

    .stat-pair {
        display: flex;
        justify-content: space-between;
        padding: 0.3rem 0.5rem;
        background: rgba(31, 41, 55, 0.4);
        border-radius: 4px;
        font-size: 0.9rem;
    }

    .stat-pair span {
        color: var(--text-muted);
    }

    .stat-pair strong {
        color: var(--text-light);
    }

    /* Filter container */
    .filter-container {
        background: rgba(31, 41, 55, 0.3);
        border-radius: 10px;
        padding: 1.2rem;
        margin-bottom: 1.5rem;
        border: 1px solid rgba(99, 102, 241, 0.1);
    }

    /* AI Insights styling */
    .insights-container {
        margin-top: 1rem;
    }

    .insights-category {
        margin-top: 0.5rem;
    }

    .insight-card {
        background: rgba(31, 41, 55, 0.3);
        border-radius: 10px;
        padding: 1.2rem;
        margin-bottom: 1rem;
        border: 1px solid rgba(99, 102, 241, 0.1);
        display: flex;
        align-items: flex-start;
    }

    .insight-content {
        display: flex;
        align-items: flex-start;
        gap: 1rem;
    }

    .insight-icon {
        font-size: 1.5rem;
        margin-top: 0.1rem;
    }

    .insight-text {
        flex: 1;
        line-height: 1.5;
    }

    .generate-insights-container {
        display: flex;
        justify-content: center;
        align-items: center;
        margin: 3rem 0;
    }

    .placeholder-card {
        background: rgba(31, 41, 55, 0.3);
        border-radius: 15px;
        padding: 2rem;
        text-align: center;
        border: 1px solid rgba(99, 102, 241, 0.1);
        max-width: 500px;
        margin: 0 auto;
    }

    .placeholder-icon {
        font-size: 3rem;
        margin-bottom: 1rem;
        animation: float 3s ease-in-out infinite;
    }

    .placeholder-text {
        color: var(--text-muted);
        line-height: 1.6;
        margin-bottom: 1.5rem;
    }

    .loading-container {
        display: flex;
        justify-content: center;
        margin: 2rem 0;
    }

    .loading-pulse {
        width: 80px;
        height: 80px;
        border-radius: 50%;
        background: linear-gradient(to right, var(--primary), var(--secondary));
        animation: pulse-animation 1.5s ease infinite;
    }

    @keyframes pulse-animation {
        0% {
            transform: scale(0.6);
            opacity: 0.5;
        }
        50% {
            transform: scale(1);
            opacity: 1;
        }
        100% {
            transform: scale(0.6);
            opacity: 0.5;
        }
    }

    @keyframes float {
        0% { transform: translateY(0px); }
        50% { transform: translateY(-10px); }
        100% { transform: translateY(0px); }
    }

    /* Button styling */
    button[kind="primary"] {
        background: linear-gradient(90deg, var(--primary), var(--secondary)) !important;
|
| 488 |
+
color: white !important;
|
| 489 |
+
border: none !important;
|
| 490 |
+
border-radius: 8px !important;
|
| 491 |
+
padding: 0.6rem 1.2rem !important;
|
| 492 |
+
font-weight: 600 !important;
|
| 493 |
+
transition: all 0.3s ease !important;
|
| 494 |
+
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1) !important;
|
| 495 |
+
}
|
| 496 |
+
|
| 497 |
+
button[kind="primary"]:hover {
|
| 498 |
+
transform: translateY(-2px) !important;
|
| 499 |
+
box-shadow: 0 6px 10px rgba(0, 0, 0, 0.15) !important;
|
| 500 |
+
}
|
| 501 |
+
|
| 502 |
+
button[kind="secondary"] {
|
| 503 |
+
background: rgba(79, 70, 229, 0.1) !important;
|
| 504 |
+
color: var(--text-light) !important;
|
| 505 |
+
border: 1px solid rgba(79, 70, 229, 0.3) !important;
|
| 506 |
+
border-radius: 8px !important;
|
| 507 |
+
padding: 0.6rem 1.2rem !important;
|
| 508 |
+
font-weight: 600 !important;
|
| 509 |
+
transition: all 0.3s ease !important;
|
| 510 |
+
}
|
| 511 |
+
|
| 512 |
+
button[kind="secondary"]:hover {
|
| 513 |
+
background: rgba(79, 70, 229, 0.2) !important;
|
| 514 |
+
transform: translateY(-2px) !important;
|
| 515 |
+
}
|
| 516 |
+
|
| 517 |
+
/* Override Streamlit default button styles */
|
| 518 |
+
.stButton>button {
|
| 519 |
+
background: linear-gradient(90deg, var(--primary), var(--secondary)) !important;
|
| 520 |
+
color: white !important;
|
| 521 |
+
border: none !important;
|
| 522 |
+
border-radius: 8px !important;
|
| 523 |
+
padding: 0.6rem 1.2rem !important;
|
| 524 |
+
font-weight: 600 !important;
|
| 525 |
+
transition: all 0.3s ease !important;
|
| 526 |
+
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1) !important;
|
| 527 |
+
width: 100%;
|
| 528 |
+
}
|
| 529 |
+
|
| 530 |
+
.stButton>button:hover {
|
| 531 |
+
transform: translateY(-2px) !important;
|
| 532 |
+
box-shadow: 0 6px 10px rgba(0, 0, 0, 0.15) !important;
|
| 533 |
+
}
|
| 534 |
+
|
| 535 |
+
/* Chat interface styling */
|
| 536 |
+
.chat-interface-container {
|
| 537 |
+
padding: 1rem 0;
|
| 538 |
+
margin-bottom: 100px;
|
| 539 |
+
position: relative;
|
| 540 |
+
}
|
| 541 |
+
|
| 542 |
+
.chat-messages {
|
| 543 |
+
display: flex;
|
| 544 |
+
flex-direction: column;
|
| 545 |
+
gap: 15px;
|
| 546 |
+
margin-bottom: 20px;
|
| 547 |
+
}
|
| 548 |
+
|
| 549 |
+
.chat-message-user, .chat-message-ai {
|
| 550 |
+
padding: 12px 16px;
|
| 551 |
+
border-radius: 12px;
|
| 552 |
+
max-width: 80%;
|
| 553 |
+
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
|
| 554 |
+
}
|
| 555 |
+
|
| 556 |
+
.chat-message-user {
|
| 557 |
+
align-self: flex-end;
|
| 558 |
+
background: linear-gradient(135deg, var(--primary) 0%, var(--secondary) 100%);
|
| 559 |
+
color: white;
|
| 560 |
+
border-bottom-right-radius: 0;
|
| 561 |
+
margin-left: auto;
|
| 562 |
+
}
|
| 563 |
+
|
| 564 |
+
.chat-message-ai {
|
| 565 |
+
align-self: flex-start;
|
| 566 |
+
background: var(--bg-card);
|
| 567 |
+
color: var(--text-light);
|
| 568 |
+
border-bottom-left-radius: 0;
|
| 569 |
+
margin-right: auto;
|
| 570 |
+
}
|
| 571 |
+
|
| 572 |
+
.chat-input-container {
|
| 573 |
+
display: flex;
|
| 574 |
+
align-items: center;
|
| 575 |
+
gap: 10px;
|
| 576 |
+
margin-top: 1.5rem;
|
| 577 |
+
}
|
| 578 |
+
|
| 579 |
+
.chat-suggestions {
|
| 580 |
+
display: flex;
|
| 581 |
+
flex-wrap: wrap;
|
| 582 |
+
gap: 10px;
|
| 583 |
+
margin: 1.5rem 0;
|
| 584 |
+
}
|
| 585 |
+
|
| 586 |
+
.chat-suggestion {
|
| 587 |
+
background: rgba(99, 102, 241, 0.1);
|
| 588 |
+
border: 1px solid rgba(99, 102, 241, 0.3);
|
| 589 |
+
border-radius: 30px;
|
| 590 |
+
padding: 8px 15px;
|
| 591 |
+
font-size: 0.9rem;
|
| 592 |
+
color: var(--text-light);
|
| 593 |
+
cursor: pointer;
|
| 594 |
+
transition: all 0.3s ease;
|
| 595 |
+
display: inline-block;
|
| 596 |
+
margin-bottom: 8px;
|
| 597 |
+
}
|
| 598 |
+
|
| 599 |
+
.chat-suggestion:hover {
|
| 600 |
+
background: rgba(99, 102, 241, 0.2);
|
| 601 |
+
transform: translateY(-2px);
|
| 602 |
+
}
|
| 603 |
+
|
| 604 |
+
/* Expander styling */
|
| 605 |
+
.st-expander {
|
| 606 |
+
background: rgba(31, 41, 55, 0.2) !important;
|
| 607 |
+
border-radius: 8px !important;
|
| 608 |
+
margin-bottom: 1rem !important;
|
| 609 |
+
border: 1px solid rgba(99, 102, 241, 0.1) !important;
|
| 610 |
+
}
|
| 611 |
+
|
| 612 |
+
/* Streamlit widget styling */
|
| 613 |
+
div[data-testid="stForm"] {
|
| 614 |
+
background: rgba(31, 41, 55, 0.2) !important;
|
| 615 |
+
border-radius: 10px !important;
|
| 616 |
+
padding: 1rem !important;
|
| 617 |
+
border: 1px solid rgba(99, 102, 241, 0.1) !important;
|
| 618 |
+
}
|
| 619 |
+
|
| 620 |
+
.stSelectbox>div>div {
|
| 621 |
+
background: rgba(31, 41, 55, 0.4) !important;
|
| 622 |
+
border: 1px solid rgba(99, 102, 241, 0.2) !important;
|
| 623 |
+
border-radius: 8px !important;
|
| 624 |
+
}
|
| 625 |
+
|
| 626 |
+
.stTextInput>div>div>input {
|
| 627 |
+
background: rgba(31, 41, 55, 0.4) !important;
|
| 628 |
+
border: 1px solid rgba(99, 102, 241, 0.2) !important;
|
| 629 |
+
border-radius: 8px !important;
|
| 630 |
+
color: var(--text-light) !important;
|
| 631 |
+
padding: 1rem !important;
|
| 632 |
+
}
|
| 633 |
+
|
| 634 |
+
/* Streamlit multiselect dropdown styling */
|
| 635 |
+
div[data-baseweb="popover"] {
|
| 636 |
+
background: var(--bg-dark) !important;
|
| 637 |
+
border: 1px solid rgba(99, 102, 241, 0.2) !important;
|
| 638 |
+
border-radius: 8px !important;
|
| 639 |
+
}
|
| 640 |
+
|
| 641 |
+
div[data-baseweb="menu"] {
|
| 642 |
+
background: var(--bg-dark) !important;
|
| 643 |
+
}
|
| 644 |
+
|
| 645 |
+
div[role="listbox"] {
|
| 646 |
+
background: var(--bg-dark) !important;
|
| 647 |
+
}
|
| 648 |
+
|
| 649 |
+
/* Fix for the upload button */
|
| 650 |
+
.stFileUploader > div {
|
| 651 |
+
display: flex;
|
| 652 |
+
flex-direction: column;
|
| 653 |
+
align-items: center;
|
| 654 |
+
}
|
| 655 |
+
|
| 656 |
+
.stFileUploader > div > button {
|
| 657 |
+
background: linear-gradient(90deg, var(--primary), var(--secondary)) !important;
|
| 658 |
+
color: white !important;
|
| 659 |
+
border: none !important;
|
| 660 |
+
width: 100%;
|
| 661 |
+
margin-top: 1rem;
|
| 662 |
+
}
|
| 663 |
+
|
| 664 |
+
/* Fix for tab content spacing */
|
| 665 |
+
.tab-content {
|
| 666 |
+
margin-top: 2rem;
|
| 667 |
+
padding: 1rem;
|
| 668 |
+
background: rgba(31, 41, 55, 0.2);
|
| 669 |
+
border-radius: 10px;
|
| 670 |
+
border: 1px solid rgba(99, 102, 241, 0.1);
|
| 671 |
+
}
|
| 672 |
+
</style>
|
| 673 |
+
""", unsafe_allow_html=True)
|
| 674 |
+
|
| 675 |
+
def generate_ai_insights():
    """Generate AI-powered insights about the dataset"""
    # Make sure we have a dataframe to analyze
    if 'df' not in st.session_state:
        logger.warning("Cannot generate AI insights: No dataframe in session state")
        return {}

    df = st.session_state.df
    insights = {}

    # Try to use the LLM for insights generation first
    try:
        if llm_inference is not None:
            # Create dataset_info dictionary for LLM
            num_rows, num_cols = df.shape
            num_numerical = len(df.select_dtypes(include=['number']).columns)
            num_categorical = len(df.select_dtypes(include=['object', 'category']).columns)
            num_missing = df.isnull().sum().sum()

            # Format missing values for better readability
            missing_cols = df.isnull().sum()[df.isnull().sum() > 0]
            missing_values = {}
            for col in missing_cols.index:
                count = missing_cols[col]
                percent = round(count / len(df) * 100, 2)
                missing_values[col] = (count, percent)

            # Get numerical columns and their correlations if applicable
            # (kept separate from num_cols above, which holds the column count)
            numeric_cols = df.select_dtypes(include=['number']).columns
            correlations = "No numerical columns to calculate correlations."
            if len(numeric_cols) > 1:
                # Calculate correlations
                corr_matrix = df[numeric_cols].corr()
                # Get top correlations (absolute values)
                corr_pairs = []
                for i in range(len(numeric_cols)):
                    for j in range(i):
                        val = corr_matrix.iloc[i, j]
                        if abs(val) > 0.5:  # Only show strong correlations
                            corr_pairs.append((numeric_cols[i], numeric_cols[j], val))

                # Sort by absolute correlation and format
                if corr_pairs:
                    corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
                    formatted_corrs = []
                    for col1, col2, val in corr_pairs[:5]:  # Top 5
                        formatted_corrs.append(f"{col1} and {col2}: {val:.3f}")
                    correlations = "\n".join(formatted_corrs)

            dataset_info = {
                "shape": f"{num_rows} rows, {num_cols} columns",
                "columns": df.columns.tolist(),
                "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
                "missing_values": missing_values,
                "basic_stats": df.describe().to_string(),
                "correlations": correlations,
                "sample_data": df.head(5).to_string()
            }

            # Generate EDA insights with better error handling
            logger.info("Requesting EDA insights from LLM")
            try:
                eda_insights = llm_inference.generate_eda_insights(dataset_info)

                if eda_insights and isinstance(eda_insights, str) and len(eda_insights) > 50:
                    # Clean and format the response
                    eda_insights = eda_insights.strip()
                    insights["EDA Insights"] = [eda_insights]
                    logger.info("Successfully generated EDA insights")
                else:
                    logger.warning(f"EDA insights response was invalid: {type(eda_insights)}, length: {len(eda_insights) if isinstance(eda_insights, str) else 'N/A'}")
            except Exception as e:
                logger.error(f"Error generating EDA insights: {str(e)}")

            # Generate feature engineering recommendations
            if "EDA Insights" in insights:  # Only proceed if EDA worked
                logger.info("Requesting feature engineering recommendations from LLM")
                try:
                    fe_insights = llm_inference.generate_feature_engineering_recommendations(dataset_info)

                    if fe_insights and isinstance(fe_insights, str) and len(fe_insights) > 50:
                        fe_insights = fe_insights.strip()
                        insights["Feature Engineering Recommendations"] = [fe_insights]
                        logger.info("Successfully generated feature engineering recommendations")
                    else:
                        logger.warning(f"Feature engineering response was invalid: {type(fe_insights)}, length: {len(fe_insights) if isinstance(fe_insights, str) else 'N/A'}")
                except Exception as e:
                    logger.error(f"Error generating feature engineering recommendations: {str(e)}")

            # Generate data quality insights
            logger.info("Requesting data quality insights from LLM")
            try:
                dq_insights = llm_inference.generate_data_quality_insights(dataset_info)

                if dq_insights and isinstance(dq_insights, str) and len(dq_insights) > 50:
                    dq_insights = dq_insights.strip()
                    insights["Data Quality Insights"] = [dq_insights]
                    logger.info("Successfully generated data quality insights")
                else:
                    logger.warning(f"Data quality response was invalid: {type(dq_insights)}, length: {len(dq_insights) if isinstance(dq_insights, str) else 'N/A'}")
            except Exception as e:
                logger.error(f"Error generating data quality insights: {str(e)}")

            # If we have at least one type of insights, consider it a success
            if insights:
                # Mark that the insights are loaded
                st.session_state['loading_insights'] = False
                logger.info("Successfully generated AI insights using LLM")
                return insights

            logger.warning("All LLM generated insights failed or were too short. Falling back to template insights.")
        else:
            logger.warning("LLM inference is not available. Falling back to template insights.")
    except Exception as e:
        logger.error(f"Error in generate_ai_insights(): {str(e)}. Falling back to template insights.")

    # If the LLM fails or is not available, generate template-based insights
    logger.info("Falling back to template-based insights generation")

    # Add missing values insights
    missing_data = df.isnull().sum()
    missing_percent = (missing_data / len(df)) * 100
    missing_cols = missing_data[missing_data > 0]

    missing_insights = []
    if len(missing_cols) > 0:
        missing_insights.append(f"Found {len(missing_cols)} columns with missing values.")
        for col in missing_cols.index[:3]:  # Show details for top 3
            missing_insights.append(f"Column '{col}' has {missing_data[col]} missing values ({missing_percent[col]:.2f}%).")

        if len(missing_cols) > 3:
            missing_insights.append(f"And {len(missing_cols) - 3} more columns have missing values.")

        # Add recommendation
        if any(missing_percent > 50):
            high_missing = missing_percent[missing_percent > 50].index.tolist()
            missing_insights.append(f"Consider dropping columns with >50% missing values: {', '.join(high_missing[:3])}.")
        else:
            missing_insights.append("Consider using imputation techniques for columns with missing values.")
    else:
        missing_insights.append("No missing values found in the dataset. Great job!")

    insights["Missing Values Analysis"] = missing_insights

    # Add distribution insights
    num_cols = df.select_dtypes(include=['number']).columns
    dist_insights = []

    if len(num_cols) > 0:
        for col in num_cols[:3]:  # Analyze top 3 numeric columns
            # Check for skewness
            skew = df[col].skew()
            if abs(skew) > 1:
                direction = "right" if skew > 0 else "left"
                dist_insights.append(f"Column '{col}' is {direction}-skewed (skewness: {skew:.2f}). Consider log transformation.")

            # Check for outliers using IQR
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))][col].count()

            if outliers > 0:
                pct = (outliers / len(df)) * 100
                dist_insights.append(f"Column '{col}' has {outliers} outliers ({pct:.2f}%). Consider outlier treatment.")

        if len(num_cols) > 3:
            dist_insights.append(f"Additional {len(num_cols) - 3} numerical columns not analyzed here.")
    else:
        dist_insights.append("No numerical columns found for distribution analysis.")

    insights["Distribution Insights"] = dist_insights

    # Add correlation insights
    corr_insights = []
    if len(num_cols) > 1:
        # Calculate correlation
        corr_matrix = df[num_cols].corr()
        high_corr = []

        # Find high correlations
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                if abs(corr_matrix.iloc[i, j]) > 0.7:
                    high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

        if high_corr:
            corr_insights.append(f"Found {len(high_corr)} pairs of highly correlated features.")
            for col1, col2, corr_val in high_corr[:3]:  # Show top 3
                corr_direction = "positively" if corr_val > 0 else "negatively"
                corr_insights.append(f"'{col1}' and '{col2}' are strongly {corr_direction} correlated (r={corr_val:.2f}).")

            if len(high_corr) > 3:
                corr_insights.append(f"And {len(high_corr) - 3} more highly correlated pairs found.")

            corr_insights.append("Consider removing some highly correlated features to reduce dimensionality.")
        else:
            corr_insights.append("No strong correlations found between features.")
    else:
        corr_insights.append("Need at least 2 numerical columns to analyze correlations.")

    insights["Correlation Analysis"] = corr_insights

    # Add feature engineering recommendations
    fe_insights = []

    # Check for date columns
    date_cols = []
    for col in df.columns:
        if df[col].dtype == 'object':
            try:
                pd.to_datetime(df[col])
                date_cols.append(col)
            except (ValueError, TypeError):
                pass

    if date_cols:
        fe_insights.append(f"Found {len(date_cols)} potential date columns: {', '.join(date_cols[:3])}.")
        fe_insights.append("Consider extracting year, month, day, weekday from these columns.")

    # Check for categorical columns
    cat_cols = df.select_dtypes(include=['object']).columns
    if len(cat_cols) > 0:
        fe_insights.append(f"Found {len(cat_cols)} categorical columns.")
        fe_insights.append("Consider one-hot encoding or label encoding for categorical features.")

        # Check for high cardinality
        high_card_cols = []
        for col in cat_cols:
            if df[col].nunique() > 10:
                high_card_cols.append((col, df[col].nunique()))

        if high_card_cols:
            fe_insights.append("Some categorical columns have high cardinality:")
            for col, card in high_card_cols[:2]:
                fe_insights.append(f"Column '{col}' has {card} unique values. Consider grouping less common categories.")

    # Suggest polynomial features if few numeric features
    if 1 < len(num_cols) < 5:
        fe_insights.append("Consider creating polynomial features or interaction terms between numerical features.")

    insights["Feature Engineering Recommendations"] = fe_insights

    # Add a slight delay to simulate processing
    time.sleep(1)

    # Mark that the insights are loaded
    st.session_state['loading_insights'] = False
    logger.info("Template-based insights generation completed")

    return insights

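The fallback path above flags outliers with the 1.5×IQR fence. A minimal standalone sketch of that rule on toy data (the helper name `iqr_outliers` and the sample values are illustrative, not part of the app):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Same boolean mask the app builds inline for each numeric column
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100]})
print(iqr_outliers(df["x"]))  # 100 falls outside the fence -> prints 1
```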
def display_chat_interface():
    """Display a chat interface for interacting with the data"""
    st.markdown('<div class="tab-content">', unsafe_allow_html=True)
    st.markdown('<h2 class="tab-title">💬 Chat with Your Data</h2>', unsafe_allow_html=True)

    # Initialize chat history if not present
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []

    # Make sure we have data to chat about
    if 'df' not in st.session_state or st.session_state.df is None:
        st.error("No dataset loaded. Please upload a CSV file to chat with your data.")

        # Show a preview of chat capabilities
        st.markdown("""
        <div style="margin-top: 2rem;">
            <h3>What can I help you with?</h3>
            <p>Once you upload a dataset, you can ask questions like:</p>
            <ul>
                <li>What patterns do you see in my data?</li>
                <li>How many missing values are there?</li>
                <li>What feature engineering would you recommend?</li>
                <li>Show me the distribution of a specific column</li>
                <li>What are the correlations between features?</li>
            </ul>
        </div>
        """, unsafe_allow_html=True)

        st.markdown('</div>', unsafe_allow_html=True)
        return

    # Display chat history
    for message in st.session_state.chat_history:
        if message["role"] == "user":
            st.chat_message("user").write(message["content"])
        else:
            st.chat_message("assistant").write(message["content"])

    # If no chat history, show some example questions
    if not st.session_state.chat_history:
        st.info("Ask me anything about your dataset! I can help you understand patterns, identify issues, and suggest improvements.")

        st.markdown("### Example questions you can ask:")

        # Create a grid of example questions using columns
        col1, col2 = st.columns(2)

        with col1:
            example_questions = [
                "What are the key patterns in this dataset?",
                "Which columns have missing values?",
                "What kind of feature engineering would help?"
            ]

            for i, question in enumerate(example_questions):
                if st.button(question, key=f"example_q_{i}"):
                    process_chat_message(question)
                    st.rerun()

        with col2:
            more_questions = [
                "How are the numerical variables distributed?",
                "What are the strongest correlations?",
                "How can I prepare this data for modeling?"
            ]

            for i, question in enumerate(more_questions):
                if st.button(question, key=f"example_q_{i+3}"):
                    process_chat_message(question)
                    st.rerun()

    # Input area for new messages
    user_input = st.chat_input("Ask a question about your data...", key="chat_input")

    if user_input:
        # Add user message to chat history
        process_chat_message(user_input)
        st.rerun()

    st.markdown('</div>', unsafe_allow_html=True)

def display_descriptive_tab():
    st.markdown('<div class="tab-content">', unsafe_allow_html=True)
    st.markdown('<h2 class="tab-title">📊 Descriptive Statistics</h2>', unsafe_allow_html=True)

    # Make sure we access the data from session state
    if 'df' not in st.session_state or 'descriptive_stats' not in st.session_state:
        st.error("No dataset loaded. Please upload a CSV file.")
        st.markdown('</div>', unsafe_allow_html=True)
        return

    df = st.session_state.df
    descriptive_stats = st.session_state.descriptive_stats

    # Display descriptive statistics in a more visually appealing way
    col1, col2 = st.columns([3, 1])

    with col1:
        # Style the dataframe
        st.markdown('<div class="stats-card">', unsafe_allow_html=True)
        st.subheader("Numerical Summary")
        st.dataframe(descriptive_stats.style.background_gradient(cmap='Blues', axis=0)
                     .format(precision=2, na_rep="Missing"), use_container_width=True)
        st.markdown('</div>', unsafe_allow_html=True)

    with col2:
        st.markdown('<div class="info-card">', unsafe_allow_html=True)
        st.subheader("Dataset Overview")

        # Display dataset information in a cleaner format
        total_rows = df.shape[0]
        total_cols = df.shape[1]
        numeric_cols = len(df.select_dtypes(include=['number']).columns)
        cat_cols = len(df.select_dtypes(include=['object', 'category']).columns)
        date_cols = len(df.select_dtypes(include=['datetime']).columns)

        st.markdown(f"""
        <div class="dataset-stats">
            <div class="stat-item">
                <div class="stat-value">{total_rows:,}</div>
                <div class="stat-label">Rows</div>
            </div>
            <div class="stat-item">
                <div class="stat-value">{total_cols}</div>
                <div class="stat-label">Columns</div>
            </div>
            <div class="stat-item">
                <div class="stat-value">{numeric_cols}</div>
                <div class="stat-label">Numerical</div>
            </div>
            <div class="stat-item">
                <div class="stat-value">{cat_cols}</div>
                <div class="stat-label">Categorical</div>
            </div>
            <div class="stat-item">
                <div class="stat-value">{date_cols}</div>
                <div class="stat-label">Date/Time</div>
            </div>
        </div>
        """, unsafe_allow_html=True)
        st.markdown('</div>', unsafe_allow_html=True)

    # Add missing values information with visualization
    st.markdown('<div class="stats-card">', unsafe_allow_html=True)
    st.subheader("Missing Values")
    col1, col2 = st.columns([2, 3])

    with col1:
        # Calculate missing values
        missing_data = df.isnull().sum()
        missing_percent = (missing_data / len(df)) * 100
        missing_data = pd.DataFrame({
            'Missing Values': missing_data,
            'Percentage (%)': missing_percent.round(2)
        })
        missing_data = missing_data[missing_data['Missing Values'] > 0].sort_values('Missing Values', ascending=False)

        if not missing_data.empty:
            st.dataframe(missing_data.style.background_gradient(cmap='Reds', subset=['Percentage (%)'])
                         .format({'Percentage (%)': '{:.2f}%'}), use_container_width=True)
        else:
            st.success("No missing values found in the dataset! 🎉")

    with col2:
        if not missing_data.empty:
            # Create a horizontal bar chart for missing values
            fig = px.bar(missing_data,
                         x='Percentage (%)',
                         y=missing_data.index,
                         orientation='h',
                         color='Percentage (%)',
                         color_continuous_scale='Reds',
                         title='Missing Values by Column')

            fig.update_layout(
                height=max(350, len(missing_data) * 30),
                xaxis_title='Missing (%)',
                yaxis_title='',
                coloraxis_showscale=False,
                margin=dict(l=0, r=10, t=30, b=0)
            )

            st.plotly_chart(fig, use_container_width=True)

    st.markdown('</div>', unsafe_allow_html=True)
    st.markdown('</div>', unsafe_allow_html=True)

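The missing-values table in `display_descriptive_tab` pairs `isnull().sum()` counts with rounded percentages, then keeps only columns that actually have gaps. A small self-contained sketch of the same construction (toy data; column names are illustrative):

```python
import pandas as pd

# Toy frame: 'a' has one gap, 'b' has two
df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, "x"]})

missing = df.isnull().sum()
table = pd.DataFrame({
    "Missing Values": missing,
    "Percentage (%)": (missing / len(df) * 100).round(2),
})
# Drop fully-populated columns, worst offenders first
table = table[table["Missing Values"] > 0].sort_values("Missing Values", ascending=False)
print(table)
```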
def display_distribution_tab():
    st.markdown('<div class="tab-content">', unsafe_allow_html=True)
    st.markdown('<h2 class="tab-title">📈 Data Distribution</h2>', unsafe_allow_html=True)

    # Make sure we access the data from session state
    if 'df' not in st.session_state:
        st.error("No dataset loaded. Please upload a CSV file.")
        st.markdown('</div>', unsafe_allow_html=True)
        return

    df = st.session_state.df

    # Add filters for better UX
    st.markdown('<div class="filter-container">', unsafe_allow_html=True)
    col1, col2 = st.columns([1, 1])

    with col1:
        chart_type = st.selectbox(
            "Select Chart Type",
            ["Histogram", "Box Plot", "Violin Plot", "Distribution Plot"],
            key="chart_type_select"
        )

    with col2:
        if chart_type != "Distribution Plot":
            column_type = "Numerical" if chart_type in ["Histogram", "Box Plot", "Violin Plot"] else "Categorical"
            columns_to_show = df.select_dtypes(include=['number']).columns.tolist() if column_type == "Numerical" else df.select_dtypes(include=['object', 'category']).columns.tolist()

            selected_columns = st.multiselect(
                f"Select {column_type} Columns to Visualize",
                options=columns_to_show,
                default=columns_to_show[:min(3, len(columns_to_show))],
                key="column_select"
            )
        else:
            num_cols = df.select_dtypes(include=['number']).columns.tolist()
            selected_columns = st.multiselect(
                "Select Numerical Columns",
                options=num_cols,
                default=num_cols[:min(3, len(num_cols))],
                key="column_select"
            )
    st.markdown('</div>', unsafe_allow_html=True)

    # Display selected charts
    if selected_columns:
        st.markdown('<div class="chart-container">', unsafe_allow_html=True)

        if chart_type == "Histogram":
            col1, col2 = st.columns([3, 1])
            with col2:
                bins = st.slider("Number of bins", min_value=5, max_value=100, value=30, key="hist_bins")
                kde = st.checkbox("Show KDE", value=True, key="show_kde")

            with col1:
                pass

            # Display histograms with better styling
            for column in selected_columns:
                st.markdown(f'<div class="chart-card"><h3>{column}</h3>', unsafe_allow_html=True)
                fig = px.histogram(df, x=column, nbins=bins,
                                   title=f"Histogram of {column}",
                                   marginal="box" if kde else None,
                                   color_discrete_sequence=['rgba(99, 102, 241, 0.7)'])

                fig.update_layout(
                    template="plotly_white",
                    height=400,
                    margin=dict(l=10, r=10, t=40, b=10),
                    xaxis_title=column,
                    yaxis_title="Frequency",
                    bargap=0.1
                )

                st.plotly_chart(fig, use_container_width=True)

                # Show basic statistics
                stats = df[column].describe().to_dict()
                st.markdown(f"""
                <div class="stat-summary">
                    <div class="stat-pair"><span>Mean:</span> <strong>{stats['mean']:.2f}</strong></div>
                    <div class="stat-pair"><span>Median:</span> <strong>{stats['50%']:.2f}</strong></div>
                    <div class="stat-pair"><span>Std Dev:</span> <strong>{stats['std']:.2f}</strong></div>
                    <div class="stat-pair"><span>Min:</span> <strong>{stats['min']:.2f}</strong></div>
                    <div class="stat-pair"><span>Max:</span> <strong>{stats['max']:.2f}</strong></div>
                </div>
                """, unsafe_allow_html=True)
                st.markdown('</div>', unsafe_allow_html=True)

        elif chart_type == "Box Plot":
            for column in selected_columns:
                st.markdown(f'<div class="chart-card"><h3>{column}</h3>', unsafe_allow_html=True)
                fig = px.box(df, y=column, title=f"Box Plot of {column}",
                             color_discrete_sequence=['rgba(99, 102, 241, 0.7)'])

                fig.update_layout(
                    template="plotly_white",
                    height=400,
                    margin=dict(l=10, r=10, t=40, b=10),
                    yaxis_title=column
                )

                st.plotly_chart(fig, use_container_width=True)

                # Show outlier information
                q1 = df[column].quantile(0.25)
                q3 = df[column].quantile(0.75)
                iqr = q3 - q1
                lower_bound = q1 - 1.5 * iqr
|
| 1223 |
+
upper_bound = q3 + 1.5 * iqr
|
| 1224 |
+
outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
|
| 1225 |
+
|
| 1226 |
+
st.markdown(f"""
|
| 1227 |
+
<div class="stat-summary">
|
| 1228 |
+
<div class="stat-pair"><span>Q1 (25%):</span> <strong>{q1:.2f}</strong></div>
|
| 1229 |
+
<div class="stat-pair"><span>Median:</span> <strong>{df[column].median():.2f}</strong></div>
|
| 1230 |
+
<div class="stat-pair"><span>Q3 (75%):</span> <strong>{q3:.2f}</strong></div>
|
| 1231 |
+
<div class="stat-pair"><span>IQR:</span> <strong>{iqr:.2f}</strong></div>
|
| 1232 |
+
<div class="stat-pair"><span>Outliers:</span> <strong>{len(outliers)}</strong> ({(len(outliers)/len(df)*100):.2f}%)</div>
|
| 1233 |
+
</div>
|
| 1234 |
+
""", unsafe_allow_html=True)
|
| 1235 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1236 |
+
|
| 1237 |
+
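The 1.5 × IQR rule computed inline above can be factored into a small, testable helper. A minimal sketch (the `iqr_outliers` name is my own, not part of app.py):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

values = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier
print(iqr_outliers(values).tolist())  # -> [100]
```

Extracting the rule this way lets the same logic serve both the Box Plot summary and any later cleaning step, instead of recomputing the bounds in each tab.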
elif chart_type == "Violin Plot":
|
| 1238 |
+
for column in selected_columns:
|
| 1239 |
+
st.markdown(f'<div class="chart-card"><h3>{column}</h3>', unsafe_allow_html=True)
|
| 1240 |
+
fig = px.violin(df, y=column, box=True, points="all", title=f"Violin Plot of {column}",
|
| 1241 |
+
color_discrete_sequence=['rgba(99, 102, 241, 0.7)'])
|
| 1242 |
+
|
| 1243 |
+
fig.update_layout(
|
| 1244 |
+
template="plotly_white",
|
| 1245 |
+
height=400,
|
| 1246 |
+
margin=dict(l=10, r=10, t=40, b=10),
|
| 1247 |
+
yaxis_title=column
|
| 1248 |
+
)
|
| 1249 |
+
|
| 1250 |
+
fig.update_traces(marker=dict(size=3, opacity=0.5))
|
| 1251 |
+
st.plotly_chart(fig, use_container_width=True)
|
| 1252 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1253 |
+
|
| 1254 |
+
elif chart_type == "Distribution Plot":
|
| 1255 |
+
if len(selected_columns) >= 2:
|
| 1256 |
+
st.markdown('<div class="chart-card">', unsafe_allow_html=True)
|
| 1257 |
+
chart_options = st.radio(
|
| 1258 |
+
"Select Distribution Plot Type",
|
| 1259 |
+
["Scatter Plot", "Correlation Heatmap"],
|
| 1260 |
+
horizontal=True
|
| 1261 |
+
)
|
| 1262 |
+
|
| 1263 |
+
if chart_options == "Scatter Plot":
|
| 1264 |
+
col1, col2 = st.columns([3, 1])
|
| 1265 |
+
with col2:
|
| 1266 |
+
x_axis = st.selectbox("X-axis", options=selected_columns, index=0)
|
| 1267 |
+
y_axis = st.selectbox("Y-axis", options=selected_columns, index=min(1, len(selected_columns)-1))
|
| 1268 |
+
color_option = st.selectbox("Color by", options=["None"] + df.columns.tolist())
|
| 1269 |
+
|
| 1270 |
+
with col1:
|
| 1271 |
+
if color_option != "None":
|
| 1272 |
+
fig = px.scatter(df, x=x_axis, y=y_axis,
|
| 1273 |
+
color=color_option,
|
| 1274 |
+
title=f"{y_axis} vs {x_axis} (colored by {color_option})",
|
| 1275 |
+
opacity=0.7,
|
| 1276 |
+
marginal_x="histogram", marginal_y="histogram")
|
| 1277 |
+
else:
|
| 1278 |
+
fig = px.scatter(df, x=x_axis, y=y_axis,
|
| 1279 |
+
title=f"{y_axis} vs {x_axis}",
|
| 1280 |
+
opacity=0.7,
|
| 1281 |
+
marginal_x="histogram", marginal_y="histogram")
|
| 1282 |
+
|
| 1283 |
+
fig.update_layout(
|
| 1284 |
+
template="plotly_white",
|
| 1285 |
+
height=600,
|
| 1286 |
+
margin=dict(l=10, r=10, t=40, b=10),
|
| 1287 |
+
)
|
| 1288 |
+
|
| 1289 |
+
st.plotly_chart(fig, use_container_width=True)
|
| 1290 |
+
|
| 1291 |
+
elif chart_options == "Correlation Heatmap":
|
| 1292 |
+
# Calculate correlation matrix
|
| 1293 |
+
corr_matrix = df[selected_columns].corr()
|
| 1294 |
+
|
| 1295 |
+
# Create heatmap
|
| 1296 |
+
fig = px.imshow(corr_matrix,
|
| 1297 |
+
text_auto=".2f",
|
| 1298 |
+
color_continuous_scale="RdBu_r",
|
| 1299 |
+
zmin=-1, zmax=1,
|
| 1300 |
+
title="Correlation Heatmap")
|
| 1301 |
+
|
| 1302 |
+
fig.update_layout(
|
| 1303 |
+
template="plotly_white",
|
| 1304 |
+
height=600,
|
| 1305 |
+
margin=dict(l=10, r=10, t=40, b=10),
|
| 1306 |
+
)
|
| 1307 |
+
|
| 1308 |
+
st.plotly_chart(fig, use_container_width=True)
|
| 1309 |
+
|
| 1310 |
+
# Show highest correlations
|
| 1311 |
+
corr_df = corr_matrix.stack().reset_index()
|
| 1312 |
+
corr_df.columns = ['Variable 1', 'Variable 2', 'Correlation']
|
| 1313 |
+
corr_df = corr_df[corr_df['Variable 1'] != corr_df['Variable 2']]
|
| 1314 |
+
corr_df = corr_df.sort_values('Correlation', ascending=False).head(5)
|
| 1315 |
+
|
| 1316 |
+
st.markdown("##### Top 5 Highest Correlations")
|
| 1317 |
+
st.dataframe(corr_df.style.background_gradient(cmap='Blues')
|
| 1318 |
+
.format({'Correlation': '{:.2f}'}), use_container_width=True)
|
| 1319 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1320 |
+
else:
|
| 1321 |
+
st.warning("Please select at least 2 numerical columns to see distribution plots")
|
| 1322 |
+
|
| 1323 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1324 |
+
else:
|
| 1325 |
+
st.info("Please select at least one column to visualize")
|
| 1326 |
+
|
| 1327 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1328 |
+
|
| 1329 |
+
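The "top correlations" table above has to work around two quirks of `DataFrame.corr().stack()`: every pair appears twice (once per orientation), and sorting descending favors strong positive correlations while ignoring strong negative ones. A self-contained sketch of the deduplicate-and-rank-by-absolute-value approach (the `top_correlations` helper name is mine, not from app.py):

```python
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank column pairs by absolute Pearson correlation, listing each pair once."""
    corr = df.corr(numeric_only=True).stack().reset_index()
    corr.columns = ['Variable 1', 'Variable 2', 'Correlation']
    # Keep only one of (a, b) / (b, a), and drop the diagonal, via name ordering
    corr = corr[corr['Variable 1'] < corr['Variable 2']]
    return corr.sort_values('Correlation', key=abs, ascending=False).head(n)

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 1]})
print(top_correlations(df, n=2))
```

`sort_values(key=abs)` applies `abs` to the column before sorting, so a correlation of -0.95 outranks one of +0.60, which is usually what "strongest" should mean.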
def display_ai_insights_tab():
    st.markdown('<div class="tab-content">', unsafe_allow_html=True)
    st.markdown('<h2 class="tab-title">🧠 AI-Generated Insights</h2>', unsafe_allow_html=True)

    # Make sure we access the data from session state
    if 'df' not in st.session_state:
        st.error("No dataset loaded. Please upload a CSV file.")
        st.markdown('</div>', unsafe_allow_html=True)
        return

    if st.session_state.get('loading_insights', False):
        with st.spinner("Generating AI insights about your data..."):
            st.markdown('<div class="loading-container"><div class="loading-pulse"></div></div>', unsafe_allow_html=True)
            time.sleep(0.1)  # Small delay to ensure UI updates

    # AI insights section
    if st.session_state.get('ai_insights'):
        insights = st.session_state.ai_insights

        st.markdown('<div class="insights-container">', unsafe_allow_html=True)

        for i, (category, insight_list) in enumerate(insights.items()):
            with st.expander(f"{category}", expanded=i < 2):
                st.markdown('<div class="insights-category">', unsafe_allow_html=True)

                # LLM insights arrive as a single long string; template insights
                # arrive as a list of short strings
                if len(insight_list) == 1 and isinstance(insight_list[0], str) and len(insight_list[0]) > 100:
                    st.markdown(insight_list[0])
                else:
                    for insight in insight_list:
                        st.markdown(f"""
                        <div class="insight-card">
                            <div class="insight-content">
                                <div class="insight-icon">💡</div>
                                <div class="insight-text">{insight}</div>
                            </div>
                        </div>
                        """, unsafe_allow_html=True)

                st.markdown('</div>', unsafe_allow_html=True)

        st.markdown('</div>', unsafe_allow_html=True)

        # Add regenerate button
        st.markdown('<div style="text-align: center; margin-top: 20px;">', unsafe_allow_html=True)
        if st.button("Regenerate Insights", key="regenerate_insights"):
            st.session_state['loading_insights'] = True
            st.session_state['ai_insights'] = None
            logger.info("User requested regeneration of AI insights")
            st.rerun()
        st.markdown('</div>', unsafe_allow_html=True)
    else:
        if not st.session_state.get('loading_insights', False):
            # Show generate button when insights are neither loading nor available
            st.markdown('<div class="generate-insights-container">', unsafe_allow_html=True)
            st.markdown("""
            <div class="placeholder-card">
                <div class="placeholder-icon">🧠</div>
                <div class="placeholder-text">Generate AI-powered insights about your dataset to discover patterns, anomalies, and suggestions for feature engineering.</div>
            </div>
            """, unsafe_allow_html=True)
            if st.button("Generate Insights", key="generate_insights"):
                st.session_state['loading_insights'] = True
                logger.info("User initiated AI insights generation")
                st.rerun()
            st.markdown('</div>', unsafe_allow_html=True)

    st.markdown('</div>', unsafe_allow_html=True)
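The insights tab distinguishes free-form LLM output (one long string) from template output (a list of short strings) with a length heuristic. Isolating that check makes the rule explicit and testable; the `is_llm_insight` name is hypothetical, not part of app.py:

```python
def is_llm_insight(insight_list) -> bool:
    """Heuristic used by the insights tab: a single string longer than
    100 characters is treated as free-form LLM text; anything else is
    rendered as a list of short template insights."""
    return (
        len(insight_list) == 1
        and isinstance(insight_list[0], str)
        and len(insight_list[0]) > 100
    )

print(is_llm_insight(["x" * 500]))         # single long paragraph -> True
print(is_llm_insight(["Tip 1", "Tip 2"]))  # template bullets -> False
```

A sturdier design would tag each insight dict with an explicit `source` field when it is generated, but the heuristic works as long as template insights stay short.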
def display_welcome_page():
    """Display a welcome page with information about the application"""
    # Use Streamlit columns and components instead of raw HTML
    st.title("Welcome to AI-Powered EDA & Feature Engineering Assistant")

    st.write("""
    Upload your CSV dataset and leverage the power of AI to analyze, visualize, and improve your data.
    This tool helps you understand your data better and prepare it for machine learning models.
    """)

    # Feature cards
    st.subheader("Key Features")

    # Use Streamlit columns to create a grid layout
    col1, col2 = st.columns(2)

    with col1:
        st.markdown("#### 📊 Exploratory Data Analysis")
        st.write("Quickly understand your dataset with automatic statistical analysis and visualizations")

        st.markdown("#### 🧠 AI-Powered Insights")
        st.write("Get intelligent recommendations about patterns, anomalies, and opportunities in your data")

        st.markdown("#### ⚡ Feature Engineering")
        st.write("Transform and enhance your features to improve machine learning model performance")

    with col2:
        st.markdown("#### 📈 Interactive Visualizations")
        st.write("Explore distributions, relationships, and outliers with dynamic charts")

        st.markdown("#### 💬 Chat Interface")
        st.write("Ask questions about your data and get AI-powered answers in natural language")

        st.markdown("#### 🔄 Data Transformation")
        st.write("Clean, transform, and prepare your data for modeling with guided workflows")

    # Usage section
    st.subheader("How to use")
    st.markdown("""
    1. **Upload** your CSV dataset using the sidebar on the left
    2. **Explore** automatically generated statistics and visualizations
    3. **Generate** AI insights to better understand your data
    4. **Chat** with AI to ask specific questions about your dataset
    5. **Transform** your features based on recommendations
    """)

    # Powered by section
    st.subheader("Powered by")
    cols = st.columns(3)
    with cols[0]:
        st.markdown("**llama3-8b-8192**")
    with cols[1]:
        st.markdown("**Groq API**")
    with cols[2]:
        st.markdown("**Streamlit**")

    # Upload prompt
    st.info("👈 Please upload a CSV file using the sidebar to get started")
def display_relationships_tab():
    """Display correlations and relationships between variables"""
    st.markdown('<div class="tab-content">', unsafe_allow_html=True)
    st.markdown('<h2 class="tab-title">🔄 Relationships & Correlations</h2>', unsafe_allow_html=True)

    # Make sure we have data to visualize
    if 'df' not in st.session_state or st.session_state.df is None:
        st.error("No dataset loaded. Please upload a CSV file.")
        st.markdown('</div>', unsafe_allow_html=True)
        return

    df = st.session_state.df

    # Select numerical columns for correlation analysis
    num_cols = df.select_dtypes(include=['number']).columns

    if len(num_cols) < 2:
        st.warning("At least 2 numerical columns are needed for correlation analysis.")
        st.markdown('</div>', unsafe_allow_html=True)
        return

    # Correlation matrix heatmap
    st.subheader("Correlation Matrix")
    corr_matrix = df[num_cols].corr()

    fig = px.imshow(
        corr_matrix,
        text_auto=".2f",
        color_continuous_scale="RdBu_r",
        zmin=-1, zmax=1,
        aspect="auto",
        title="Correlation Heatmap"
    )
    fig.update_layout(
        height=600,
        width=800,
        title_font_size=20,
        margin=dict(l=10, r=10, t=30, b=10)
    )
    st.plotly_chart(fig, use_container_width=True)

    # Show top correlations
    st.subheader("Top Correlations")

    # Extract each pair once from the lower triangle of the matrix
    corr_pairs = []
    for i in range(len(num_cols)):
        for j in range(i):
            corr_pairs.append({
                'Feature 1': num_cols[i],
                'Feature 2': num_cols[j],
                'Correlation': corr_matrix.iloc[i, j]
            })

    # Convert to dataframe and sort by absolute correlation
    corr_df = pd.DataFrame(corr_pairs)
    sorted_corr = corr_df.sort_values('Correlation', key=abs, ascending=False).head(10)

    # Show table with styled background
    st.dataframe(
        sorted_corr.style.background_gradient(cmap='RdBu_r', subset=['Correlation'])
        .format({'Correlation': '{:.3f}'}),
        use_container_width=True
    )

    # Scatter plot matrix
    st.subheader("Scatter Plot Matrix")

    # Let user choose columns
    selected_cols = st.multiselect(
        "Select columns for scatter plot matrix (max 5 recommended)",
        options=num_cols,
        default=num_cols[:min(4, len(num_cols))]
    )

    if selected_cols:
        if len(selected_cols) > 5:
            st.warning("More than 5 columns may make the plot hard to read.")

        color_col = st.selectbox("Color by", options=["None"] + df.columns.tolist())

        # Only pass the color parameter if not "None"
        if color_col != "None":
            fig = px.scatter_matrix(
                df,
                dimensions=selected_cols,
                color=color_col,
                opacity=0.7,
                title="Scatter Plot Matrix"
            )
        else:
            fig = px.scatter_matrix(
                df,
                dimensions=selected_cols,
                opacity=0.7,
                title="Scatter Plot Matrix"
            )

        fig.update_layout(
            height=700,
            title_font_size=18,
            margin=dict(l=10, r=10, t=30, b=10)
        )
        st.plotly_chart(fig, use_container_width=True)

    st.markdown('</div>', unsafe_allow_html=True)
def process_chat_message(user_message):
    """Process a user message in the chat interface"""
    # Add user message to chat history
    st.session_state.chat_history.append({"role": "user", "content": user_message})

    # Generate a response from the AI
    if 'df' in st.session_state and st.session_state.df is not None:
        # Try to use the LLM if available, otherwise fall back to templates
        try:
            if llm_inference is not None:
                df = st.session_state.df

                # Get basic dataset info
                num_rows, num_cols = df.shape
                num_numerical = len(df.select_dtypes(include=['number']).columns)
                num_categorical = len(df.select_dtypes(include=['object', 'category']).columns)
                missing_counts = df.isnull().sum()
                missing_cols = missing_counts[missing_counts > 0]

                # Format missing values for better readability
                missing_values = {}
                for col in missing_cols.index:
                    count = missing_cols[col]
                    percent = round(count / len(df) * 100, 2)
                    missing_values[col] = (count, percent)

                # Get correlations for numerical columns
                # (separate name so num_cols from df.shape is not shadowed)
                numeric_cols = df.select_dtypes(include=['number']).columns
                correlations = "No numerical columns to calculate correlations."
                if len(numeric_cols) > 1:
                    corr_matrix = df[numeric_cols].corr()
                    # Collect strong correlations (|r| > 0.5) from the lower triangle
                    corr_pairs = []
                    for i in range(len(numeric_cols)):
                        for j in range(i):
                            val = corr_matrix.iloc[i, j]
                            if abs(val) > 0.5:
                                corr_pairs.append((numeric_cols[i], numeric_cols[j], val))

                    # Sort by absolute correlation and format the top 5
                    if corr_pairs:
                        corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
                        formatted_corrs = [f"{col1} and {col2}: {val:.3f}"
                                           for col1, col2, val in corr_pairs[:5]]
                        correlations = "\n".join(formatted_corrs)

                # Create dataset_info dictionary for the LLM
                dataset_info = {
                    "shape": f"{num_rows} rows, {num_cols} columns",
                    "columns": df.columns.tolist(),
                    "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
                    "missing_values": missing_values,
                    "basic_stats": df.describe().to_string(),
                    "correlations": correlations,
                    "sample_data": df.head(5).to_string()
                }

                # Generate response using the LLM
                logger.info(f"Sending question to LLM: {user_message}")
                response = llm_inference.answer_dataset_question(user_message, dataset_info)

                # Log the raw response for debugging
                logger.info(f"Raw LLM response: {response[:100]}...")

                # Accept only a non-trivial string response
                if response and isinstance(response, str) and len(response) > 10:
                    cleaned_response = response.strip()
                    st.session_state.chat_history.append({"role": "assistant", "content": cleaned_response})
                    return
                else:
                    logger.warning(f"LLM response too short or invalid: {response}")
                    raise Exception("LLM response too short or invalid")
            else:
                raise Exception("LLM not available")

        except Exception as e:
            logger.warning(f"Error using LLM for chat response: {str(e)}. Falling back to templates.")
            # Fallback happens below

    # If we reach this point, either there is no dataframe, the LLM failed,
    # or its response was invalid — use template-based responses as a fallback
    if 'df' in st.session_state and st.session_state.df is not None:
        df = st.session_state.df

        # Simple response templates
        responses = {
            "missing": f"I found {df.isnull().sum().sum()} missing values across the dataset. The columns with the most missing values are: {df.isnull().sum().sort_values(ascending=False).head(3).index.tolist()}.",
            "pattern": "Looking at the data, I can see several interesting patterns. The numerical features show varied distributions, and there might be some correlations worth exploring further.",
            "feature": "Based on the data, I'd recommend feature engineering steps like handling missing values, encoding categorical variables, and possibly creating interaction terms for highly correlated features.",
            "distribution": "The numerical variables show different distributions. Some appear to be normally distributed while others show skewness. Let me know if you want to see visualizations for specific columns.",
            "correlation": "I detected several strong correlations in the dataset. You might want to look at the correlation heatmap in the Relationships tab for more details.",
            "prepare": "To prepare this data for modeling, I suggest: 1) Handling missing values, 2) Encoding categorical variables, 3) Feature scaling, and 4) Possibly dimensionality reduction if you have many features."
        }

        # Simple keyword matching for demo purposes
        message = user_message.lower()
        if "missing" in message:
            response = responses["missing"]
        elif "pattern" in message:
            response = responses["pattern"]
        elif "feature" in message or "engineering" in message:
            response = responses["feature"]
        elif "distribut" in message:
            response = responses["distribution"]
        elif "correlat" in message or "relation" in message:
            response = responses["correlation"]
        elif "prepare" in message or "model" in message:
            response = responses["prepare"]
        else:
            # Generic response
            response = "I analyzed your dataset and found some interesting insights. You can explore different aspects of your data using the tabs above. Is there anything specific you'd like to know about your data?"
    else:
        response = "Please upload a dataset first so I can analyze it and answer your questions."

    # Add AI response to chat history
    st.session_state.chat_history.append({"role": "assistant", "content": response})
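The template fallback above is a first-match keyword router embedded in an if/elif chain. The same behavior can be expressed as a data-driven rule table, which is easier to extend and to unit-test. A sketch under my own names (`route_question` and the rule list are hypothetical, not from app.py):

```python
def route_question(message: str) -> str:
    """Map a free-text question to a canned-response topic by first keyword match,
    mirroring the order of the if/elif chain in process_chat_message."""
    rules = [
        ("missing", "missing"),
        ("pattern", "pattern"),
        ("feature", "feature"),
        ("engineering", "feature"),
        ("distribut", "distribution"),   # matches "distribution", "distributed", ...
        ("correlat", "correlation"),
        ("relation", "correlation"),
        ("prepare", "prepare"),
        ("model", "prepare"),
    ]
    text = message.lower()
    for keyword, topic in rules:
        if keyword in text:
            return topic
    return "generic"

print(route_question("Show me the distribution"))  # -> "distribution"
```

Because matching is first-hit, rule order matters: "How are the features distributed?" routes to "feature", not "distribution", exactly as the original elif chain would.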
def main():
|
| 1696 |
+
"""Main function to run the application"""
|
| 1697 |
+
# Initialize session state at the beginning
|
| 1698 |
+
initialize_session_state()
|
| 1699 |
+
|
| 1700 |
+
# Apply CSS styling
|
| 1701 |
+
apply_custom_css()
|
| 1702 |
+
|
| 1703 |
+
# Sidebar for file upload and settings
|
| 1704 |
+
with st.sidebar:
|
| 1705 |
+
st.markdown('<div class="sidebar-header">AI-Powered EDA & Feature Engineering</div>', unsafe_allow_html=True)
|
| 1706 |
+
|
| 1707 |
+
# File uploader
|
| 1708 |
+
st.markdown('<div class="sidebar-section">', unsafe_allow_html=True)
|
| 1709 |
+
st.markdown('### Upload Dataset')
|
| 1710 |
+
uploaded_file = st.file_uploader("Choose a CSV file", type="csv")
|
| 1711 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1712 |
+
|
| 1713 |
+
# Load example dataset
|
| 1714 |
+
with st.expander("Or use an example dataset"):
|
| 1715 |
+
example_datasets = {
|
| 1716 |
+
"Iris": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
|
| 1717 |
+
"Tips": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv",
|
| 1718 |
+
"Titanic": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv",
|
| 1719 |
+
"Diamonds": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
|
| 1720 |
+
}
|
| 1721 |
+
selected_example = st.selectbox("Select example dataset", list(example_datasets.keys()))
|
| 1722 |
+
if st.button("Load Example", key="load_example_btn"):
|
| 1723 |
+
try:
|
| 1724 |
+
# Load the selected example dataset
|
| 1725 |
+
df = pd.read_csv(example_datasets[selected_example])
|
| 1726 |
+
|
| 1727 |
+
# Verify we have a valid dataframe
|
| 1728 |
+
if df is not None and not df.empty:
|
| 1729 |
+
st.session_state['df'] = df
|
| 1730 |
+
st.session_state['descriptive_stats'] = df.describe()
|
| 1731 |
+
st.session_state['dataset_name'] = selected_example
|
| 1732 |
+
st.success(f"Loaded {selected_example} dataset!")
|
| 1733 |
+
else:
|
| 1734 |
+
st.error(f"The {selected_example} dataset appears to be empty.")
|
| 1735 |
+
except Exception as e:
|
| 1736 |
+
st.error(f"Error loading example dataset: {str(e)}")
|
| 1737 |
+
|
| 1738 |
+
# Only show these sections if a dataset is loaded
|
| 1739 |
+
if 'df' in st.session_state:
|
| 1740 |
+
# Dataset Info
|
| 1741 |
+
st.markdown('<div class="sidebar-section">', unsafe_allow_html=True)
|
| 1742 |
+
st.markdown(f'### Dataset Info: {st.session_state.get("dataset_name", "Uploaded Data")}')
|
| 1743 |
+
df = st.session_state.df
|
| 1744 |
+
# Add check to ensure df is not None before accessing shape
|
| 1745 |
+
if df is not None:
|
| 1746 |
+
st.write(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
|
| 1747 |
+
else:
|
| 1748 |
+
st.error("Dataset is loaded but appears to be empty.")
|
| 1749 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1750 |
+
|
| 1751 |
+
# Column filters
|
| 1752 |
+
st.markdown('<div class="sidebar-section">', unsafe_allow_html=True)
|
| 1753 |
+
st.markdown('### Column Filters')
|
| 1754 |
+
if df is not None:
|
| 1755 |
+
selected_columns = st.multiselect("Select columns to analyze",
|
| 1756 |
+
options=df.columns.tolist(),
|
| 1757 |
+
default=df.columns.tolist())
|
| 1758 |
+
|
| 1759 |
+
if len(selected_columns) > 0:
|
| 1760 |
+
st.session_state['selected_columns'] = selected_columns
|
| 1761 |
+
st.session_state['filtered_df'] = df[selected_columns]
|
| 1762 |
+
else:
|
| 1763 |
+
st.session_state['selected_columns'] = df.columns.tolist()
|
| 1764 |
+
st.session_state['filtered_df'] = df
|
| 1765 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1766 |
+
|
| 1767 |
+
# Feature Engineering options with Streamlit buttons instead of JavaScript
|
| 1768 |
+
st.markdown('<div class="sidebar-section">', unsafe_allow_html=True)
|
| 1769 |
+
st.markdown('### Feature Engineering')
|
| 1770 |
+
|
| 1771 |
+
col1, col2 = st.columns(2)
|
| 1772 |
+
with col1:
|
| 1773 |
+
if st.button("Missing Values", key="missing_values_btn"):
|
| 1774 |
+
st.session_state['fe_selected'] = 'missing_values'
|
| 1775 |
+
|
| 1776 |
+
with col2:
|
| 1777 |
+
if st.button("Encode Categorical", key="encode_cat_btn"):
|
| 1778 |
+
st.session_state['fe_selected'] = 'encode_categorical'
|
| 1779 |
+
|
| 1780 |
+
col1, col2 = st.columns(2)
|
| 1781 |
+
with col1:
|
| 1782 |
+
if st.button("Scale Features", key="scale_features_btn"):
|
| 1783 |
+
st.session_state['fe_selected'] = 'scale_features'
|
| 1784 |
+
|
| 1785 |
+
with col2:
|
| 1786 |
+
if st.button("Transform", key="transform_btn"):
|
| 1787 |
+
st.session_state['fe_selected'] = 'transform'
|
| 1788 |
+
|
| 1789 |
+
# Display currently selected feature engineering option
|
| 1790 |
+
if 'fe_selected' in st.session_state:
|
| 1791 |
+
st.info(f"Selected: {st.session_state['fe_selected']}")
|
| 1792 |
+
|
| 1793 |
+
st.markdown('</div>', unsafe_allow_html=True)
|
| 1794 |
+
|
| 1795 |
+
st.markdown('<div class="sidebar-footer">Powered by Hugging Face & Streamlit</div>', unsafe_allow_html=True)
|
| 1796 |
+
|
| 1797 |
+
# If data is uploaded, process it
|
| 1798 |
+
if uploaded_file is not None and ('df' not in st.session_state or st.session_state.get('df') is None):
|
| 1799 |
+
try:
|
| 1800 |
+
# Attempt to read the CSV file
|
| 1801 |
+
df = pd.read_csv(uploaded_file)
|
| 1802 |
+
|
| 1803 |
+
# Verify that we have a valid dataframe before storing in session state
|
| 1804 |
+
if df is not None and not df.empty:
|
| 1805 |
+
            st.session_state['df'] = df
            st.session_state['descriptive_stats'] = df.describe()
            st.session_state['dataset_name'] = uploaded_file.name
            st.success(f"Successfully loaded dataset: {uploaded_file.name}")
        else:
            st.error("The uploaded file appears to be empty.")
    except Exception as e:
        st.error(f"Error reading CSV file: {str(e)}")

    # Create navigation tabs using Streamlit
    st.write("### Navigation")
    tabs = ["Overview", "Distribution", "Relationships", "AI Insights", "Chat"]

    # Create columns for each tab
    cols = st.columns(len(tabs))

    # Handle tab selection using Streamlit buttons
    for i, tab in enumerate(tabs):
        with cols[i]:
            if st.button(tab, key=f"tab_{tab.lower()}"):
                st.session_state['selected_tab'] = f"tab-{tab.lower().replace(' ', '-')}"
                st.rerun()

    # Show selected tab indicator
    selected_tab_name = st.session_state['selected_tab'].replace('tab-', '').replace('-', ' ').title()
    st.markdown(f"<div style='text-align: center; margin-bottom: 2rem;'>Selected: {selected_tab_name}</div>", unsafe_allow_html=True)

    # Show welcome message if no data is uploaded
    if 'df' not in st.session_state:
        display_welcome_page()
    else:
        # Display content based on selected tab
        if st.session_state['selected_tab'] == 'tab-overview':
            display_descriptive_tab()
        elif st.session_state['selected_tab'] == 'tab-distribution':
            display_distribution_tab()
        elif st.session_state['selected_tab'] == 'tab-relationships':
            display_relationships_tab()
        elif st.session_state['selected_tab'] in ('tab-ai-insights', 'tab-ai'):
            display_ai_insights_tab()
        elif st.session_state['selected_tab'] == 'tab-chat':
            display_chat_interface()

    # After all tabs are rendered, check if we have a regenerate action.
    # This is processed at the end to avoid session state changes during rendering.
    if (st.session_state.get('loading_insights', False) and
            ('ai_insights' not in st.session_state or st.session_state.get('ai_insights') is None)):
        logger.info("Generating AI insights at end of main function")
        try:
            st.session_state['ai_insights'] = generate_ai_insights()
            logger.info(f"Generated insights: {len(st.session_state['ai_insights'])} categories")
            st.session_state['loading_insights'] = False
        except Exception as e:
            logger.error(f"Error generating insights in main function: {str(e)}")
            st.session_state['loading_insights'] = False
            st.session_state['ai_insights'] = {}  # Set to empty dict to prevent repeated failures
        finally:
            st.rerun()


if __name__ == "__main__":
    main()
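The tab navigation above derives a session-state key from each label and later reconstructs a display name from that key. A small standalone sketch of that round trip (plain Python, no Streamlit; the helper names are illustrative, not part of `app.py`) also shows why the dispatch accepts both `'tab-ai-insights'` and `'tab-ai'`: `title()` cannot restore the "AI" acronym's casing.

```python
def tab_key(label: str) -> str:
    # Same normalization applied when a tab button is clicked
    return f"tab-{label.lower().replace(' ', '-')}"

def tab_display_name(key: str) -> str:
    # Inverse mapping used for the "Selected:" indicator
    return key.replace('tab-', '').replace('-', ' ').title()

print(tab_key("AI Insights"))               # tab-ai-insights
print(tab_display_name("tab-ai-insights"))  # Ai Insights (acronym casing is lost)
```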
eda_analysis.py
ADDED
|
@@ -0,0 +1,479 @@
"""
EDA Analysis Module

This module handles all dataset processing and analysis, providing structured information
about the dataset that can be used for visualization and LLM prompting.
"""

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Any, Optional
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from io import BytesIO
import base64

class DatasetAnalyzer:
    """Class for analyzing datasets and extracting key information"""

    def __init__(self, df: pd.DataFrame = None):
        """Initialize with an optional dataframe"""
        self.df = df
        self.analysis_results = {}

    def load_dataframe(self, df: pd.DataFrame) -> None:
        """Load a dataframe for analysis"""
        self.df = df
        # Reset analysis results when loading a new dataframe
        self.analysis_results = {}

    def analyze_dataset(self) -> Dict[str, Any]:
        """
        Perform comprehensive analysis on the dataset

        Returns:
            Dict: Dictionary containing all analysis results
        """
        if self.df is None:
            raise ValueError("No dataframe loaded. Please load a dataframe first.")

        # Basic information
        self.analysis_results["shape"] = self.df.shape
        self.analysis_results["columns"] = list(self.df.columns)
        self.analysis_results["dtypes"] = {col: str(self.df[col].dtype) for col in self.df.columns}

        # Missing values
        self.analysis_results["missing_values"] = self._analyze_missing_values()

        # Basic statistics
        self.analysis_results["basic_stats"] = self._generate_basic_stats()

        # Correlations (for numerical columns)
        self.analysis_results["correlations"] = self._analyze_correlations()

        # Sample data
        self.analysis_results["sample_data"] = self.df.head().to_string()

        # Additional analyses
        self.analysis_results["categorical_columns"] = self._identify_categorical_columns()
        self.analysis_results["numerical_columns"] = self._identify_numerical_columns()
        self.analysis_results["unique_values"] = self._count_unique_values()

        return self.analysis_results

    def _analyze_missing_values(self) -> Dict[str, Tuple[int, float]]:
        """
        Analyze missing values in the dataset

        Returns:
            Dict: Column names as keys, tuples of (count, percentage) as values
        """
        missing_values = {}
        for col in self.df.columns:
            count = self.df[col].isna().sum()
            percentage = round((count / len(self.df)) * 100, 2)
            missing_values[col] = (count, percentage)

        return missing_values

    def _generate_basic_stats(self) -> str:
        """
        Generate basic statistics for the dataset

        Returns:
            str: String representation of basic statistics
        """
        # For numerical columns
        num_stats = self.df.describe().to_string()

        # For categorical columns
        cat_columns = self._identify_categorical_columns()
        cat_stats = ""
        if cat_columns:
            cat_stats = "\n\nCategorical columns statistics:\n"
            for col in cat_columns:
                value_counts = self.df[col].value_counts().head(10)
                cat_stats += f"\n{col} - Top values:\n{value_counts.to_string()}\n"

        return num_stats + cat_stats

    def _analyze_correlations(self) -> str:
        """
        Analyze correlations between numerical features

        Returns:
            str: String representation of top correlations
        """
        num_columns = self._identify_numerical_columns()

        if not num_columns or len(num_columns) < 2:
            return "Not enough numerical columns for correlation analysis."

        corr_matrix = self.df[num_columns].corr()

        # Get top correlations (excluding self-correlations)
        corr_pairs = []
        for i in range(len(num_columns)):
            for j in range(i + 1, len(num_columns)):
                col1, col2 = num_columns[i], num_columns[j]
                corr_value = corr_matrix.loc[col1, col2]
                if not np.isnan(corr_value):
                    corr_pairs.append((col1, col2, corr_value))

        # Sort by absolute correlation value
        corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)

        # Format results
        result = "Top correlations:\n"
        for col1, col2, corr in corr_pairs[:10]:  # Top 10 correlations
            result += f"{col1} -- {col2}: {corr:.4f}\n"

        return result

    def _identify_categorical_columns(self) -> List[str]:
        """
        Identify categorical columns in the dataset

        Returns:
            List[str]: List of categorical column names
        """
        cat_columns = []
        for col in self.df.columns:
            # Consider object, category, and boolean types as categorical
            if self.df[col].dtype == 'object' or self.df[col].dtype == 'category' or self.df[col].dtype == 'bool':
                cat_columns.append(col)
            # Also consider int/float columns with few unique values as categorical
            elif (self.df[col].dtype == 'int64' or self.df[col].dtype == 'float64') and \
                    self.df[col].nunique() < 10 and self.df[col].nunique() / len(self.df) < 0.05:
                cat_columns.append(col)

        return cat_columns

    def _identify_numerical_columns(self) -> List[str]:
        """
        Identify numerical columns in the dataset

        Returns:
            List[str]: List of numerical column names
        """
        num_columns = []
        cat_columns = self._identify_categorical_columns()

        for col in self.df.columns:
            if col not in cat_columns and pd.api.types.is_numeric_dtype(self.df[col].dtype):
                num_columns.append(col)

        return num_columns

    def _count_unique_values(self) -> Dict[str, int]:
        """
        Count unique values for each column

        Returns:
            Dict: Column names as keys, unique count as values
        """
        return {col: self.df[col].nunique() for col in self.df.columns}

    def generate_eda_visualizations(self) -> Dict[str, str]:
        """
        Generate common EDA visualizations

        Returns:
            Dict: Dictionary of visualization titles and their base64-encoded images
        """
        if self.df is None:
            raise ValueError("No dataframe loaded. Please load a dataframe first.")

        visualizations = {}

        # 1. Missing values heatmap
        visualizations["missing_values_heatmap"] = self._plot_missing_values()

        # 2. Distribution plots for numerical features
        num_columns = self._identify_numerical_columns()
        for i, col in enumerate(num_columns[:5]):  # Limit to first 5 numerical columns
            visualizations[f"distribution_{col}"] = self._plot_distribution(col)

        # 3. Correlation heatmap
        visualizations["correlation_heatmap"] = self._plot_correlation_heatmap()

        # 4. Categorical feature distributions
        cat_columns = self._identify_categorical_columns()
        for i, col in enumerate(cat_columns[:5]):  # Limit to first 5 categorical columns
            visualizations[f"categorical_{col}"] = self._plot_categorical_distribution(col)

        # 5. Scatter plot of 2 most correlated features
        if len(num_columns) >= 2:
            visualizations["scatter_plot"] = self._plot_scatter_correlation()

        return visualizations

    def _plot_missing_values(self) -> str:
        """Generate missing values heatmap"""
        plt.figure(figsize=(10, 6))
        sns.heatmap(self.df.isnull(), cmap='viridis', yticklabels=False, cbar=True, cbar_kws={'label': 'Missing Data'})
        plt.title('Missing Values Heatmap')
        plt.tight_layout()

        # Convert plot to base64 string
        return self._fig_to_base64(plt.gcf())

    def _plot_distribution(self, column: str) -> str:
        """Generate distribution plot for a numerical column"""
        plt.figure(figsize=(10, 6))

        # Histogram with KDE
        sns.histplot(data=self.df, x=column, kde=True)

        plt.title(f'Distribution of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.tight_layout()

        # Convert plot to base64 string
        return self._fig_to_base64(plt.gcf())

    def _plot_correlation_heatmap(self) -> str:
        """Generate correlation heatmap"""
        num_columns = self._identify_numerical_columns()

        if not num_columns or len(num_columns) < 2:
            return ""

        plt.figure(figsize=(12, 10))
        corr_matrix = self.df[num_columns].corr()
        mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

        # Custom diverging palette
        cmap = sns.diverging_palette(230, 20, as_cmap=True)

        # Draw heatmap
        sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
                    square=True, linewidths=.5, annot=True, fmt=".2f")

        plt.title('Correlation Heatmap')
        plt.tight_layout()

        # Convert plot to base64 string
        return self._fig_to_base64(plt.gcf())

    def _plot_categorical_distribution(self, column: str) -> str:
        """Generate bar plot for categorical column"""
        plt.figure(figsize=(10, 6))

        # Get value counts and limit to top 10 categories if there are too many
        value_counts = self.df[column].value_counts()
        if len(value_counts) > 10:
            # Keep top 9 categories and group the rest as 'Other'
            top_categories = value_counts.nlargest(9).index
            data = self.df.copy()
            data[column] = data[column].apply(lambda x: x if x in top_categories else 'Other')
            sns.countplot(y=column, data=data, order=data[column].value_counts().index)
        else:
            sns.countplot(y=column, data=self.df, order=value_counts.index)

        plt.title(f'Distribution of {column}')
        plt.xlabel('Count')
        plt.ylabel(column)
        plt.tight_layout()

        # Convert plot to base64 string
        return self._fig_to_base64(plt.gcf())

    def _plot_scatter_correlation(self) -> str:
        """Generate scatter plot of the two most correlated features"""
        num_columns = self._identify_numerical_columns()

        if not num_columns or len(num_columns) < 2:
            return ""

        # Find the two most correlated features
        corr_matrix = self.df[num_columns].corr().abs()

        # Mask the upper triangle (including the diagonal) to keep each pair once
        mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
        corr_matrix = corr_matrix.mask(mask)

        # Find the max correlation; bail out before idxmax if everything is NaN
        max_corr = corr_matrix.max().max()
        if pd.isna(max_corr):
            return ""

        max_corr_idx = corr_matrix.stack().idxmax()

        # Get the column names
        col1, col2 = max_corr_idx

        # Create scatter plot
        plt.figure(figsize=(10, 6))

        # Add regression line
        sns.regplot(x=col1, y=col2, data=self.df, scatter_kws={'alpha': 0.5})

        plt.title(f'Scatter plot of {col1} vs {col2} (correlation: {corr_matrix.loc[col1, col2]:.2f})')
        plt.tight_layout()

        # Convert plot to base64 string
        return self._fig_to_base64(plt.gcf())

    def _fig_to_base64(self, fig) -> str:
        """Convert matplotlib figure to base64 string"""
        buf = BytesIO()
        fig.savefig(buf, format='png', bbox_inches='tight')
        buf.seek(0)
        img_str = base64.b64encode(buf.read()).decode('utf-8')
        plt.close(fig)
        return img_str

    def suggest_data_preprocessing(self) -> Dict[str, List[str]]:
        """
        Suggest preprocessing steps based on dataset analysis

        Returns:
            Dict: Dictionary of preprocessing suggestions for each column type
        """
        if not self.analysis_results:
            self.analyze_dataset()

        suggestions = {
            "numerical": [],
            "categorical": [],
            "missing_values": [],
            "outliers": [],
            "general": []
        }

        # Missing values suggestions
        missing_cols = [col for col, (count, _) in self.analysis_results["missing_values"].items() if count > 0]
        if missing_cols:
            suggestions["missing_values"].append(f"Found {len(missing_cols)} columns with missing values.")
            if len(missing_cols) > 5:
                suggestions["missing_values"].append(f"Columns with missing values include: {', '.join(missing_cols[:5])}...")
            else:
                suggestions["missing_values"].append(f"Columns with missing values: {', '.join(missing_cols)}")

            suggestions["missing_values"].append("Consider these strategies for handling missing values:")
            suggestions["missing_values"].append("- Imputation (mean/median for numerical, mode for categorical)")
            suggestions["missing_values"].append("- Creating missing value indicators as new features")
            suggestions["missing_values"].append("- Removing rows or columns with too many missing values")

        # Numerical column suggestions
        num_cols = self.analysis_results["numerical_columns"]
        if num_cols:
            suggestions["numerical"].append(f"Found {len(num_cols)} numerical columns.")
            suggestions["numerical"].append("Consider these preprocessing steps:")
            suggestions["numerical"].append("- Scaling (StandardScaler or MinMaxScaler)")
            suggestions["numerical"].append("- Check for skewness and apply log or Box-Cox transformation if needed")
            suggestions["numerical"].append("- Create binned versions of continuous variables")

            # Check for potential outliers
            for col in num_cols:
                if col in self.df.columns:  # Safety check
                    q1 = self.df[col].quantile(0.25)
                    q3 = self.df[col].quantile(0.75)
                    iqr = q3 - q1
                    outlier_count = ((self.df[col] < (q1 - 1.5 * iqr)) | (self.df[col] > (q3 + 1.5 * iqr))).sum()

                    if outlier_count > 0:
                        percentage = round((outlier_count / len(self.df)) * 100, 2)
                        if percentage > 5:  # If more than 5% are outliers
                            suggestions["outliers"].append(f"Column '{col}' has {outlier_count} potential outliers ({percentage}%).")

        # Categorical column suggestions
        cat_cols = self.analysis_results["categorical_columns"]
        if cat_cols:
            suggestions["categorical"].append(f"Found {len(cat_cols)} categorical columns.")

            # Check cardinality (number of unique values)
            high_cardinality = []
            for col in cat_cols:
                unique_count = self.analysis_results["unique_values"].get(col, 0)
                if unique_count > 10:
                    high_cardinality.append((col, unique_count))

            if high_cardinality:
                suggestions["categorical"].append("High cardinality columns (many unique values):")
                for col, count in sorted(high_cardinality, key=lambda x: x[1], reverse=True)[:5]:
                    suggestions["categorical"].append(f"- {col}: {count} unique values")

                suggestions["categorical"].append("For high cardinality columns, consider:")
                suggestions["categorical"].append("- Grouping less frequent categories")
                suggestions["categorical"].append("- Target encoding or embedding techniques")

            suggestions["categorical"].append("General categorical encoding strategies:")
            suggestions["categorical"].append("- One-hot encoding for low cardinality columns")
            suggestions["categorical"].append("- Label encoding for ordinal variables")

        # General suggestions
        suggestions["general"].append("General preprocessing recommendations:")
        suggestions["general"].append("- Check for duplicate rows and remove if necessary")
        suggestions["general"].append("- Normalize text fields (lowercase, remove special characters)")
        suggestions["general"].append("- Create feature interactions for highly correlated features")

        return suggestions

    def generate_feature_engineering_ideas(self) -> List[str]:
        """
        Generate feature engineering ideas based on dataset analysis

        Returns:
            List[str]: List of feature engineering suggestions
        """
        if not self.analysis_results:
            self.analyze_dataset()

        ideas = []

        # Get column types
        num_cols = self.analysis_results["numerical_columns"]
        cat_cols = self.analysis_results["categorical_columns"]

        # Aggregation features
        if len(num_cols) >= 2:
            ideas.append("### Numerical Feature Transformations:")
            ideas.append("1. Create polynomial features for continuous variables")
            ideas.append("2. Apply mathematical transformations (log, sqrt, square) to handle skewed distributions")
            ideas.append("3. Create binned versions of continuous features to capture non-linear relationships")

        # Check for date/time related column names
        time_related_cols = [col for col in self.df.columns if any(x in col.lower() for x in ['date', 'time', 'year', 'month', 'day'])]
        if time_related_cols:
            ideas.append("\n### Time-Based Features:")
            ideas.append(f"Detected potential date/time columns: {', '.join(time_related_cols)}")
            ideas.append("1. Extract components like year, month, day, weekday, quarter")
            ideas.append("2. Create cyclical features using sine/cosine transformations for periodic time components")
            ideas.append("3. Calculate time since specific events or time differences between dates")

        # Categorical interactions
        if len(cat_cols) >= 2:
            ideas.append("\n### Categorical Feature Engineering:")
            ideas.append("1. Create interaction features by combining categorical variables")
            ideas.append("2. Use target encoding for high cardinality categorical features")
            ideas.append("3. Combine rare categories into an 'Other' category to reduce dimensionality")

        # Mixed interactions
        if num_cols and cat_cols:
            ideas.append("\n### Feature Interactions:")
            ideas.append("1. Create group-based statistics (mean, median, min, max) of numerical features grouped by categorical features")
            ideas.append("2. Calculate the difference from group means for numerical features")
            ideas.append("3. Create ratio or difference features between related numerical columns")

        # Dimensionality reduction
        if len(num_cols) > 10:
            ideas.append("\n### Dimensionality Reduction:")
            ideas.append("1. Apply PCA to reduce dimensionality and create principal components")
            ideas.append("2. Use feature selection methods (information gain, chi-square, mutual information)")
            ideas.append("3. Try UMAP or t-SNE for non-linear dimensionality reduction")

        # Text features
        text_cols = [col for col in self.df.columns if self.df[col].dtype == 'object' and
                     self.df[col].apply(lambda x: isinstance(x, str) and len(x.split()) > 3).mean() > 0.5]
        if text_cols:
            ideas.append("\n### Text Feature Engineering:")
            ideas.append(f"Detected potential text columns: {', '.join(text_cols)}")
            ideas.append("1. Create bag-of-words or TF-IDF representations")
            ideas.append("2. Extract text length, word count, and other statistical features")
            ideas.append("3. Consider pretrained word embeddings or sentence transformers")

        return ideas
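`suggest_data_preprocessing` above flags outliers with the standard 1.5×IQR rule (Tukey's fences). A self-contained sketch of the same check, with the helper name chosen for illustration (it is not part of the module):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    # Tukey's fences: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
print(iqr_outlier_count(s))  # 1: only the extreme value 100 falls outside the fences
```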
llm_inference.py
ADDED
|
@@ -0,0 +1,377 @@
|
| 1 |
+
"""
|
| 2 |
+
LLM Inference Module
|
| 3 |
+
|
| 4 |
+
This module handles all interactions with the Groq API via LangChain,
|
| 5 |
+
allowing the application to generate EDA insights and feature engineering
|
| 6 |
+
recommendations from dataset analysis.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import os
|
| 10 |
+
from dotenv import load_dotenv
|
| 11 |
+
import logging
|
| 12 |
+
import time
|
| 13 |
+
from typing import Dict, Any, List, Optional
|
| 14 |
+
from langchain_community.callbacks.manager import get_openai_callback
|
| 15 |
+
|
| 16 |
+
# LangChain imports
|
| 17 |
+
from langchain_groq import ChatGroq
|
| 18 |
+
from langchain_core.messages import HumanMessage
|
| 19 |
+
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
|
| 20 |
+
# from langchain_community.callbacks.manager import get_openai_callbatck
|
| 21 |
+
from langchain_core.runnables import RunnableSequence
|
| 22 |
+
|
| 23 |
+
# Configure logging
|
| 24 |
+
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
|
| 25 |
+
logger = logging.getLogger(__name__)
|
| 26 |
+
|
| 27 |
+
# Load environment variables
|
| 28 |
+
load_dotenv()
|
| 29 |
+
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
|
| 30 |
+
|
| 31 |
+
if not GROQ_API_KEY:
|
| 32 |
+
raise ValueError("GROQ_API_KEY not found in environment variables. Please add it to your .env file.")
|
| 33 |
+
|
| 34 |
+
# Create LLM model
|
| 35 |
+
try:
|
| 36 |
+
llm = ChatGroq(model_name="llama3-8b-8192", groq_api_key=GROQ_API_KEY)
|
| 37 |
+
logger.info("Successfully initialized Groq client")
|
| 38 |
+
except Exception as e:
|
| 39 |
+
logger.error(f"Failed to initialize Groq client: {str(e)}")
|
| 40 |
+
raise
|
| 41 |
+
|
| 42 |
+
class LLMInference:
|
| 43 |
+
"""Class for interacting with LLM via Groq API using LangChain"""
|
| 44 |
+
|
| 45 |
+
def __init__(self, model_id: str = "llama3-8b-8192"):
|
| 46 |
+
"""Initialize the LLM inference class with Groq model"""
|
| 47 |
+
self.model_id = model_id
|
| 48 |
+
self.llm = llm
|
| 49 |
+
|
| 50 |
+
# Initialize prompt templates and chains
|
| 51 |
+
self._init_prompt_templates()
|
| 52 |
+
self._init_chains()
|
| 53 |
+
|
| 54 |
+
logger.info(f"LLMInference initialized with model: {model_id}")
|
| 55 |
+
|
| 56 |
+
    def _init_prompt_templates(self):
        """Initialize all prompt templates"""

        # EDA insights prompt template
        self.eda_prompt_template = ChatPromptTemplate.from_messages([
            HumanMessagePromptTemplate.from_template(
                """You are a data scientist tasked with performing Exploratory Data Analysis (EDA) on a dataset.
Based on the following dataset information, provide comprehensive EDA insights:

Dataset Information:
- Shape: {shape}
- Columns and their types:
{columns_info}

- Missing values:
{missing_info}

- Basic statistics:
{basic_stats}

- Top correlations:
{correlations}

- Sample data:
{sample_data}

Please provide a detailed EDA analysis that includes:

1. Summary of the dataset (what it appears to be about, key features, etc.)
2. Distribution analysis of key variables
3. Relationship analysis between variables
4. Identification of patterns, outliers, or anomalies
5. Recommended visualizations that would be insightful
6. Initial hypotheses based on the data

Your analysis should be structured, thorough, and provide actionable insights for further investigation.
"""
            )
        ])

        # Feature engineering prompt template
        self.feature_engineering_prompt_template = ChatPromptTemplate.from_messages([
            HumanMessagePromptTemplate.from_template(
                """You are a machine learning engineer specializing in feature engineering.
Based on the following dataset information, provide recommendations for feature engineering:

Dataset Information:
- Shape: {shape}
- Columns and their types:
{columns_info}

- Basic statistics:
{basic_stats}

- Top correlations:
{correlations}

Please provide comprehensive feature engineering recommendations that include:

1. Numerical feature transformations (scaling, normalization, log transforms, etc.)
2. Categorical feature encoding strategies
3. Feature interaction suggestions
4. Dimensionality reduction approaches if applicable
5. Time-based feature creation if applicable
6. Text processing techniques if there are text fields
7. Feature selection recommendations

For each recommendation, explain why it would be beneficial and how it could improve model performance.
Be specific to this dataset's characteristics rather than providing generic advice.
"""
            )
        ])

        # Data quality prompt template
        self.data_quality_prompt_template = ChatPromptTemplate.from_messages([
            HumanMessagePromptTemplate.from_template(
                """You are a data quality expert.
Based on the following dataset information, provide data quality insights and recommendations:

Dataset Information:
- Shape: {shape}
- Columns and their types:
{columns_info}

- Missing values:
{missing_info}

- Basic statistics:
{basic_stats}

Please provide a comprehensive data quality assessment that includes:

1. Assessment of data completeness (missing values)
2. Identification of potential data inconsistencies or errors
3. Recommendations for data cleaning and preprocessing
4. Advice on handling outliers
5. Suggestions for data validation checks
6. Recommendations to improve data quality

Your assessment should be specific to this dataset and provide actionable recommendations.
"""
            )
        ])

        # QA prompt template
        self.qa_prompt_template = ChatPromptTemplate.from_messages([
            HumanMessagePromptTemplate.from_template(
                """You are a data scientist answering questions about a dataset.
Based on the following dataset information, please answer the user's question:

Dataset Information:
- Shape: {shape}
- Columns and their types:
{columns_info}

- Basic statistics:
{basic_stats}

User's question: {question}

Please provide a clear, informative answer to the user's question based on the dataset information provided.
"""
            )
        ])
    def _init_chains(self):
        """Initialize all chains using modern RunnableSequence pattern"""

        # EDA insights chain
        self.eda_chain = self.eda_prompt_template | self.llm

        # Feature engineering chain
        self.feature_engineering_chain = self.feature_engineering_prompt_template | self.llm

        # Data quality chain
        self.data_quality_chain = self.data_quality_prompt_template | self.llm

        # QA chain
        self.qa_chain = self.qa_prompt_template | self.llm
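Each chain above is simply a prompt template piped into the model with `|`; a RunnableSequence feeds one step's output into the next. A minimal sketch of that composition in plain Python (the `TinyRunnable` class and `fake_llm` stand-in are illustrative, not LangChain APIs, so no Groq key is needed):

```python
class TinyRunnable:
    """A toy runnable mimicking the LCEL 'prompt | llm' composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # "a | b" returns a runnable that applies a, then b.
        return TinyRunnable(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for a prompt template and an LLM call.
prompt = TinyRunnable(lambda d: f"Answer about a dataset of shape {d['shape']}")
fake_llm = TinyRunnable(lambda text: text.upper())

chain = prompt | fake_llm
print(chain.invoke({"shape": "(100, 5)"}))  # ANSWER ABOUT A DATASET OF SHAPE (100, 5)
```

The real chains work the same way: `chain.invoke(input_data)` formats the prompt with the input dict, then passes the messages to the Groq model.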
    def _format_columns_info(self, columns: List[str], dtypes: Dict[str, str]) -> str:
        """Format columns info for prompt"""
        return "\n".join([f"- {col} ({dtypes.get(col, 'unknown')})" for col in columns])

    def _format_missing_info(self, missing_values: Dict[str, tuple]) -> str:
        """Format missing values info for prompt"""
        missing_info = "\n".join([f"- {col}: {count} missing values ({percent}%)"
                                  for col, (count, percent) in missing_values.items() if count > 0])

        if not missing_info:
            missing_info = "No missing values detected."

        return missing_info
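The two formatting helpers are pure functions, so their output is easy to see in isolation. A standalone sketch (module-level copies of the same logic, no Groq client required) shows the prompt fragments they produce:

```python
from typing import Dict, List


def format_columns_info(columns: List[str], dtypes: Dict[str, str]) -> str:
    # Mirrors LLMInference._format_columns_info: one "- name (dtype)" line per column.
    return "\n".join(f"- {col} ({dtypes.get(col, 'unknown')})" for col in columns)


def format_missing_info(missing_values: Dict[str, tuple]) -> str:
    # Mirrors LLMInference._format_missing_info: only columns with missing data are listed.
    lines = [f"- {col}: {count} missing values ({percent}%)"
             for col, (count, percent) in missing_values.items() if count > 0]
    return "\n".join(lines) or "No missing values detected."


print(format_columns_info(["age", "city"], {"age": "int64", "city": "object"}))
# - age (int64)
# - city (object)
print(format_missing_info({"age": (0, 0.0), "city": (3, 1.5)}))
# - city: 3 missing values (1.5%)
```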
    def _execute_chain(
        self,
        chain: RunnableSequence,
        input_data: Dict[str, Any],
        operation_name: str
    ) -> str:
        """
        Execute a chain with tracking and error handling

        Args:
            chain: The LangChain chain to execute
            input_data: The input data for the chain
            operation_name: Name of the operation for logging

        Returns:
            str: The generated text
        """
        try:
            start_time = time.time()
            with get_openai_callback() as cb:
                result = chain.invoke(input_data).content
            elapsed_time = time.time() - start_time

            logger.info(f"{operation_name} generated in {elapsed_time:.2f} seconds")
            logger.info(f"Tokens used: {cb.total_tokens}, "
                        f"Prompt tokens: {cb.prompt_tokens}, "
                        f"Completion tokens: {cb.completion_tokens}")

            return result
        except Exception as e:
            error_msg = f"Error executing {operation_name.lower()}: {str(e)}"
            logger.error(error_msg)
            return error_msg
    def generate_eda_insights(self, dataset_info: Dict[str, Any]) -> str:
        """
        Generate EDA insights based on dataset information using LangChain

        Args:
            dataset_info: Dictionary containing dataset analysis

        Returns:
            str: Detailed EDA insights and recommendations
        """
        logger.info("Generating EDA insights")

        # Format the input data
        columns_info = self._format_columns_info(
            dataset_info.get("columns", []),
            dataset_info.get("dtypes", {})
        )

        missing_info = self._format_missing_info(
            dataset_info.get("missing_values", {})
        )

        # Prepare input for the chain
        input_data = {
            "shape": dataset_info.get("shape", "N/A"),
            "columns_info": columns_info,
            "missing_info": missing_info,
            "basic_stats": dataset_info.get("basic_stats", ""),
            "correlations": dataset_info.get("correlations", ""),
            "sample_data": dataset_info.get("sample_data", "N/A")
        }

        return self._execute_chain(self.eda_chain, input_data, "EDA insights")

    def generate_feature_engineering_recommendations(self, dataset_info: Dict[str, Any]) -> str:
        """
        Generate feature engineering recommendations based on dataset information using LangChain

        Args:
            dataset_info: Dictionary containing dataset analysis

        Returns:
            str: Feature engineering recommendations
        """
        logger.info("Generating feature engineering recommendations")

        # Format the input data
        columns_info = self._format_columns_info(
            dataset_info.get("columns", []),
            dataset_info.get("dtypes", {})
        )

        # Prepare input for the chain
        input_data = {
            "shape": dataset_info.get("shape", "N/A"),
            "columns_info": columns_info,
            "basic_stats": dataset_info.get("basic_stats", ""),
            "correlations": dataset_info.get("correlations", "")
        }

        return self._execute_chain(
            self.feature_engineering_chain,
            input_data,
            "Feature engineering recommendations"
        )

    def generate_data_quality_insights(self, dataset_info: Dict[str, Any]) -> str:
        """
        Generate data quality insights based on dataset information using LangChain

        Args:
            dataset_info: Dictionary containing dataset analysis

        Returns:
            str: Data quality insights and improvement recommendations
        """
        logger.info("Generating data quality insights")

        # Format the input data
        columns_info = self._format_columns_info(
            dataset_info.get("columns", []),
            dataset_info.get("dtypes", {})
        )

        missing_info = self._format_missing_info(
            dataset_info.get("missing_values", {})
        )

        # Prepare input for the chain
        input_data = {
            "shape": dataset_info.get("shape", "N/A"),
            "columns_info": columns_info,
            "missing_info": missing_info,
            "basic_stats": dataset_info.get("basic_stats", "")
        }

        return self._execute_chain(
            self.data_quality_chain,
            input_data,
            "Data quality insights"
        )

    def answer_dataset_question(self, question: str, dataset_info: Dict[str, Any]) -> str:
        """
        Answer a specific question about the dataset using LangChain

        Args:
            question: User's question about the dataset
            dataset_info: Dictionary containing dataset analysis

        Returns:
            str: Answer to the user's question
        """
        logger.info(f"Answering dataset question: {question[:50]}...")

        # Format the input data
        columns_info = self._format_columns_info(
            dataset_info.get("columns", []),
            dataset_info.get("dtypes", {})
        )

        # Prepare input for the chain
        input_data = {
            "shape": dataset_info.get("shape", "N/A"),
            "columns_info": columns_info,
            "basic_stats": dataset_info.get("basic_stats", ""),
            "question": question
        }

        return self._execute_chain(
            self.qa_chain,
            input_data,
            "Answer"
        )
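All four public methods consume the same `dataset_info` dictionary. A sketch of its expected shape, with field names inferred from the `.get()` calls above and example values invented for illustration:

```python
# Shape of the dataset_info dict consumed by generate_eda_insights() and the
# other public methods. Keys are inferred from the code above; values are
# illustrative (a Titanic-style dataset), not real output of eda_analysis.py.
dataset_info = {
    "shape": "(891, 12)",
    "columns": ["age", "fare", "embarked"],
    "dtypes": {"age": "float64", "fare": "float64", "embarked": "object"},
    # column -> (missing_count, missing_percent); only count > 0 reaches the prompt
    "missing_values": {"age": (177, 19.9), "fare": (0, 0.0)},
    "basic_stats": "age: mean=29.7, std=14.5, min=0.4, max=80.0",
    "correlations": "age/fare: 0.10",
    "sample_data": "age=22.0, fare=7.25, embarked=S",
}

# With GROQ_API_KEY configured, a call would look like:
# insights = LLMInference().generate_eda_insights(dataset_info)
```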
|
requirements.txt
ADDED
|
@@ -0,0 +1,13 @@
streamlit==1.43.2
pandas==1.5.3
langchain>=0.3.0,<0.4.0
langchain-community>=0.3.0,<0.4.0
langchain-groq>=0.3.0,<0.4.0
langchain-core>=0.3.47,<0.4.0
huggingface_hub==0.29.2
python-dotenv==1.0.0
matplotlib==3.10.0
seaborn==0.13.2
numpy==1.24.3
scikit-learn==1.6.1
plotly==5.24.1