Ramyamaheswari committed on
Commit 630ea40 · verified · 1 Parent(s): c5d7418

Update app.py

Files changed (1):
  app.py  +106 -152
app.py CHANGED
@@ -1,160 +1,114 @@
  import streamlit as st

- # Set page configuration
- st.set_page_config(page_title="Decision Tree Theory", layout="wide")

- # Custom CSS styling
  st.markdown("""
- <style>
- .stApp {
-     background-color: #4A90E2;
- }
- h1, h2, h3 {
-     color: #003366;
- }
- .custom-font, p {
-     font-family: 'Arial', sans-serif;
-     font-size: 18px;
-     color: white;
-     line-height: 1.6;
- }
- </style>
- """, unsafe_allow_html=True)
-
- # Title
- st.markdown("<h1 style='color: #003366;'>Understanding Decision Trees</h1>", unsafe_allow_html=True)
-
- # Introduction
- st.markdown("""
- A **Decision Tree** is a versatile supervised learning algorithm used for both **classification** and **regression** tasks. It mimics human decision-making by using a tree-like model of decisions and their possible consequences.
-
- The basic structure includes:
- - **Root Node**: Represents the complete dataset.
- - **Internal Nodes**: Represent conditions on features.
- - **Leaf Nodes**: Represent outcomes or predictions.
-
- Think of it as a flowchart where each internal node asks a question, and each branch represents the outcome, eventually leading to a final decision.
- """, unsafe_allow_html=True)
-
- # Entropy
- st.markdown("<h2 style='color: #003366;'>Entropy: Quantifying Uncertainty</h2>", unsafe_allow_html=True)
- st.markdown("""
- **Entropy** measures the amount of randomness or disorder in the data. It’s commonly used in classification problems to decide how informative a feature is.
-
- Entropy formula:
- """)
- st.image("entropy-formula-2.jpg", width=300)
- st.markdown("""
- Where:
- - \( p(i) \) is the probability of class \( i \).
-
- **Example**:
- - If \( P(Yes) = 0.5 \), \( P(No) = 0.5 \),
-
- Then:
- $$ H(Y) = - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1 $$
-
- This indicates maximum uncertainty (perfectly balanced classes).
- """, unsafe_allow_html=True)
-
- # Gini Impurity
- st.markdown("<h2 style='color: #003366;'>Gini Impurity: Measuring Impurity</h2>", unsafe_allow_html=True)
- st.markdown("""
- **Gini Impurity** is another popular impurity measure. It calculates how often a randomly chosen element would be incorrectly labeled.
-
- Formula:
  """)
- st.image("gini.png", width=300)
- st.markdown("""
- **Example**:
- - \( P(Yes) = 0.5 \), \( P(No) = 0.5 \)
-
- Then:
- $$ Gini(Y) = 1 - (0.5^2 + 0.5^2) = 0.5 $$
-
- A lower Gini value means purer splits.
- """, unsafe_allow_html=True)
-
- # Tree Construction
- st.markdown("<h2 style='color: #003366;'>Building the Decision Tree</h2>", unsafe_allow_html=True)
- st.markdown("""
- Decision Trees are built **top-down**, starting from the root. At each node, the algorithm selects the feature that best splits the data using metrics like **Entropy** or **Gini**.
-
- Splitting stops when:
- - The data is pure (contains one class), or
- - A stopping condition is met (like maximum depth).
- """, unsafe_allow_html=True)
-
- # Iris Tree Visualization
- st.markdown("<h2 style='color: #003366;'>Visualizing: Iris Dataset Tree</h2>", unsafe_allow_html=True)
- st.markdown("""
- Here's an example decision tree trained on the famous **Iris dataset**, which classifies flower species based on petal and sepal measurements.
- """, unsafe_allow_html=True)
- st.image("dt1 (1).jpg", caption="Decision Tree for Iris Dataset", use_container_width=True)

- # Training & Testing - Classification
- st.markdown("<h2 style='color: #003366;'>Training & Testing (Classification)</h2>", unsafe_allow_html=True)
- st.markdown("""
- **Training**:
- - Select features and split based on impurity reduction.
- - Recursively grow the tree until stopping criteria are met.
-
- **Testing**:
- - Traverse the tree with new data.
- - Follow the decision rules until you reach a leaf node (prediction).
-
- 💡 *Example: For Iris, classify the flower as Setosa, Versicolor, or Virginica based on petal dimensions.*
- """, unsafe_allow_html=True)
-
- # Training & Testing - Regression
- st.markdown("<h2 style='color: #003366;'>Training & Testing (Regression)</h2>", unsafe_allow_html=True)
- st.markdown("""
- **Training**:
- - Split data to minimize **Mean Squared Error (MSE)**.
-
- **Testing**:
- - Output the mean value in the corresponding leaf.
-
- 💡 *Example: Predict house price using features like size, location, and number of rooms.*
- """, unsafe_allow_html=True)
-
- # Pre-Pruning
- st.markdown("<h2 style='color: #003366;'>Pre-Pruning: Control Overfitting Early</h2>", unsafe_allow_html=True)
- st.markdown("""
- Pre-pruning stops the tree from growing too deep and complex. Common techniques include:
-
- - **Max Depth**
- - **Min Samples Split**
- - **Min Samples Leaf**
- - **Max Features**
-
- These help in generalizing better and reducing noise.
- """, unsafe_allow_html=True)
-
- # Post-Pruning
- st.markdown("<h2 style='color: #003366;'>Post-Pruning: Simplify After Growth</h2>", unsafe_allow_html=True)
- st.markdown("""
- In **post-pruning**, we allow the tree to grow fully, then trim unnecessary branches:
-
- - **Cost Complexity Pruning**
- - **Validation-based Pruning**
-
- This helps reduce overfitting and improves model simplicity.
- """, unsafe_allow_html=True)
-
- # Feature Importance
- st.markdown("<h2 style='color: #003366;'>Feature Selection with Decision Trees</h2>", unsafe_allow_html=True)
  st.markdown("""
- Decision Trees provide insight into which features are most important based on how often and how effectively they split data.
  """)
- st.image("feature.png", width=500)
- st.markdown("""
- 💡 *Higher importance → More influential in decision making.*
- """, unsafe_allow_html=True)
-
- # Notebook Link
- st.markdown("<h2 style='color: #003366;'>Explore Hands-On Implementation</h2>", unsafe_allow_html=True)
- st.markdown(
-     "<a href='https://colab.research.google.com/drive/1SqZ5I5h7ivS6SJDwlOZQ-V4IAOg90RE7?usp=sharing' target='_blank' style='font-size: 16px; color: #003366;'>🔗 Open Jupyter Notebook on Google Colab</a>",
-     unsafe_allow_html=True
- )
 
  import streamlit as st
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from sklearn.datasets import load_iris
+ from sklearn.model_selection import train_test_split
+ from sklearn.neighbors import KNeighborsClassifier
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

+ # Set up page
+ st.set_page_config(page_title="Explore KNN Algorithm", layout="wide")
+ st.title("📏 K-Nearest Neighbors (KNN): Explained with Iris Dataset")

+ # Intro Section
  st.markdown("""
+ ## 🧠 What is K-Nearest Neighbors?
+ **KNN** is a simple and intuitive machine learning algorithm that makes predictions based on the **majority class of the K closest data points** in the feature space.
+ > 🧭 Think of it like asking your neighbors what they think: you take the majority opinion.
+
+ ---
+ ## ⚙️ How KNN Works
+ 1. Choose the number of neighbors **K**.
+ 2. Calculate the distance (usually **Euclidean**) from the new point to every training point.
+ 3. Pick the **K closest data points**.
+ 4. Predict the class that occurs most frequently among them.
+
+ 🔍 Distance Metrics:
+ - Euclidean (default)
+ - Manhattan
+ - Minkowski
+
+ ---
+ ### 📈 Pros and Cons
+ ✅ Simple to understand
+ ✅ No training time (lazy learner)
+ ✅ Works well with small datasets
+ ⚠️ Slow on large datasets
+ ⚠️ Needs feature scaling
+ ⚠️ Sensitive to outliers and irrelevant features
+ ---
  """)
 
+ # Dataset and DataFrame
+ st.subheader("🌼 Let's Explore the Iris Dataset")
+ iris = load_iris()
+ df = pd.DataFrame(iris.data, columns=iris.feature_names)
+ df["target"] = iris.target
+ df["species"] = df["target"].apply(lambda x: iris.target_names[x])
+
+ st.markdown("Here's a peek at the dataset 👇")
+ st.dataframe(df.head(), use_container_width=True)
+
+ # Feature distribution visualization
+ st.markdown("### 📊 Visualize Features")
+ selected_features = st.multiselect("Pick features to visualize", iris.feature_names, default=iris.feature_names[:2])
+ if len(selected_features) == 2:
+     plt.figure(figsize=(8, 5))
+     sns.scatterplot(data=df, x=selected_features[0], y=selected_features[1], hue="species", palette="Set2", s=80)
+     st.pyplot(plt.gcf())
+     plt.clf()
+
+ # Sidebar controls
+ st.sidebar.header("📏 KNN Model Settings")
+ n_neighbors = st.sidebar.slider("Number of Neighbors (K)", 1, 15, value=5)
+ metric = st.sidebar.selectbox("Distance Metric", ["euclidean", "manhattan", "minkowski"])
+
+ # Prepare data
+ X = df[iris.feature_names]
+ y = df["target"]
+
+ scaler = StandardScaler()
+ X_scaled = scaler.fit_transform(X)
+
+ X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
+
+ # Train model
+ model = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
+ model.fit(X_train, y_train)
+ y_pred = model.predict(X_test)
+
+ # Model performance
+ acc = accuracy_score(y_test, y_pred)
+ st.success(f"✅ Model Accuracy: {acc*100:.2f}%")
+
+ # Classification report
+ st.markdown("### 🧾 Classification Report")
+ st.text(classification_report(y_test, y_pred, target_names=iris.target_names))
+
+ # Confusion matrix
+ st.markdown("### 🔍 Confusion Matrix")
+ cm = confusion_matrix(y_test, y_pred)
+ fig, ax = plt.subplots()
+ sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
+ plt.xlabel("Predicted")
+ plt.ylabel("Actual")
+ st.pyplot(fig)
+
+ # Final tips
  st.markdown("""
+ ---
+ ## 💡 Key Takeaways
+ - KNN is **non-parametric** and easy to implement.
+ - It’s a **lazy learner**: there is no training phase, all the work happens at prediction time.
+ - Sensitive to **scaling**, **K value**, and **irrelevant features**.
+ ## 📌 When to Use KNN?
+ - When you want a **simple baseline model**.
+ - For **small- to medium-sized** datasets.
+ - When your features are properly scaled and meaningful.
+ > 🎯 *Tip:* Use cross-validation to choose the optimal value of **K** and avoid overfitting!
+ ---
  """)
+
+
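The closing tip in the new version of app.py recommends cross-validation for choosing **K**. Here is a minimal sketch of how that could look with scikit-learn, reusing the same scaling step the app applies to the Iris features; the candidate range 1..15 mirrors the app's sidebar slider, while the 5-fold `GridSearchCV` setup is an assumption, not something the commit contains.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Same preprocessing as the app: standardize all four Iris features.
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

# Candidate K values 1..15 mirror the app's slider range (assumption:
# 5-fold CV and the default accuracy scoring are reasonable here).
param_grid = {"n_neighbors": np.arange(1, 16)}
search = GridSearchCV(KNeighborsClassifier(metric="euclidean"), param_grid, cv=5)
search.fit(X_scaled, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
```

The best K found this way could then serve as the default value for the sidebar slider instead of the fixed `value=5`.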