import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Set up page
st.set_page_config(page_title="Explore Decision Tree Algorithm", layout="wide")
st.title("🌳 Decision Tree Classifier: Explained with Iris Dataset")

# Intro Section
st.markdown("""
## 🧠 What is a Decision Tree?

A **Decision Tree** is a machine learning algorithm that uses a tree-like structure to make decisions.
Each **internal node** asks a question about a feature, each **branch** is the outcome of that question, and each **leaf node** gives us a final decision or prediction.

> 🧩 Think of it like playing "20 Questions" to guess what something is: each question narrows down the possibilities.

---

## ⚙️ How Decision Trees Work

1. Start with all data at the root.
2. Pick the **best feature** to split the data (using Gini or Entropy).
3. Repeat this process on each resulting subset until:
   - Every node contains samples of a single class
   - Or the **maximum depth** is reached
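
The split search in step 2 can be sketched in a few lines of plain Python. This is a simplified illustration for a single numeric feature, not scikit-learn's actual implementation; the helper names `gini_impurity` and `best_split` and the sample values are made up for the example.

```python
from collections import Counter

def gini_impurity(labels):
    # Gini impurity of a list of class labels: 1 - sum of squared proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    # Try the midpoint between each pair of adjacent distinct values and
    # keep the threshold with the lowest size-weighted Gini impurity.
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue
        threshold = (lo + hi) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Made-up petal lengths (cm) and species labels, for illustration only.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, score = best_split(petal_length, species)
print(threshold, round(score, 3))
```

On this toy data the best threshold is 3.0, which separates every setosa sample from the rest. A real tree repeats the same search over all features and then recurses into each side.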

🔍 Criteria used to choose the best feature:
- **Gini Index** (default)
- **Entropy** (Information Gain)
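
To make the two criteria concrete, here is a minimal sketch in plain Python. The helper names `gini`, `entropy`, and `information_gain` are illustrative and are not scikit-learn functions; the class counts are made up for the example.

```python
from math import log2

def gini(counts):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    # Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    # Entropy of the parent minus the size-weighted entropy of the children.
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding 50 samples of each of three classes.
parent = [50, 50, 50]
# A split that isolates the first class completely.
children = [[50, 0, 0], [0, 50, 50]]

print(round(gini(parent), 3))
print(round(entropy(parent), 3))
print(round(information_gain(parent, children), 3))
```

For this node the Gini impurity is about 0.667, the entropy is about 1.585 bits, and the split gains about 0.918 bits. Both measures are 0 for a pure node and largest when the classes are evenly mixed.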

---

### 📈 Pros and Cons

✅ Easy to understand & visualize  
✅ Handles numerical and categorical data (scikit-learn expects categorical features to be encoded as numbers first)  
✅ No need for feature scaling  
⚠️ Can overfit if not controlled (use `max_depth`, `min_samples_leaf`, or pruning)
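
As a rough sketch of those controls, the snippet below fits four trees on Iris with scikit-learn and prints each tree's depth, leaf count, and test accuracy. The specific values (`max_depth=2`, `min_samples_leaf=5`, `ccp_alpha=0.02`) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One unrestricted tree plus three ways of limiting its growth.
# ccp_alpha enables cost-complexity pruning after the tree is grown.
settings = {
    "unrestricted": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_leaf=5": {"min_samples_leaf": 5},
    "ccp_alpha=0.02": {"ccp_alpha": 0.02},
}

trees = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params)
    tree.fit(X_train, y_train)
    trees[name] = tree
    print(name, tree.get_depth(), tree.get_n_leaves(),
          round(tree.score(X_test, y_test), 3))
```

Comparing the printed depth and leaf counts shows how each setting constrains the tree. On a dataset as small and clean as Iris the accuracy differences may be minor.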

---
""")

# Dataset and DataFrame
st.subheader("🌼 Let's Explore the Iris Dataset")
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].apply(lambda x: iris.target_names[x])

st.markdown("Here's a peek at the dataset 👇")
st.dataframe(df.head(), use_container_width=True)

# Feature distribution visualization
st.markdown("### 📊 Visualize Features")
selected_features = st.multiselect("Pick features to visualize", iris.feature_names, default=iris.feature_names[:2])
if len(selected_features) == 2:
    # Scatter plot of the two chosen features, colored by species
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.scatterplot(data=df, x=selected_features[0], y=selected_features[1], hue="species", palette="Set2", s=80, ax=ax)
    st.pyplot(fig)
    plt.close(fig)
else:
    st.info("Select exactly two features to draw the scatter plot.")

# Sidebar controls
st.sidebar.header("🌲 Model Settings")
criterion = st.sidebar.radio("Splitting Criterion", ["gini", "entropy"])
max_depth = st.sidebar.slider("Max Depth", 1, 10, value=3)

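# Prepare data
X = df[iris.feature_names]
y = df["target"]

# Split first so the scaler is fit on the training data only (avoids leaking test-set statistics)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Note: decision trees do not need scaling; with this optional step the thresholds in the tree plot are in standardized units
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)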

# Train model
model = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model performance
acc = accuracy_score(y_test, y_pred)
st.success(f"✅ Model Accuracy: {acc*100:.2f}%")

# Classification report
st.markdown("### 🧾 Classification Report")
st.text(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
st.markdown("### 🔍 Confusion Matrix")
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
st.pyplot(fig)

# Decision tree plot
st.markdown("### 🌳 Visualizing the Tree Structure")
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(model, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, fontsize=10, ax=ax)
st.pyplot(fig)

# Final tips
st.markdown("""
---
## 💡 Key Takeaways

- Decision Trees are great for **interpretable models**.
- They require **little to no preprocessing**.
- They're **prone to overfitting**, especially on small datasets, so use settings like `max_depth` or pruning techniques.

## 📌 When to Use a Decision Tree?
- When interpretability matters
- When data includes both **numerical and categorical** variables
- When you want to **quickly prototype** and understand your data

> 🎯 *Tip:* Combine multiple trees in an ensemble (like **Random Forest** or **Gradient Boosting**) for better performance!

---
""")