import streamlit as st


st.set_page_config(page_title="Decision Tree Theory", layout="wide")

| st.markdown(""" |
| <style> |
| .stApp { |
| background-color: #4A90E2; |
| } |
| h1, h2, h3 { |
| color: #003366; |
| } |
| .custom-font, p { |
| font-family: 'Arial', sans-serif; |
| font-size: 18px; |
| color: white; |
| line-height: 1.6; |
| } |
| </style> |
| """, unsafe_allow_html=True) |
|
|
st.markdown("<h1 style='color: #003366;'>Understanding Decision Trees</h1>", unsafe_allow_html=True)
|
|
| |
| st.markdown(""" |
| A **Decision Tree** is a versatile supervised learning algorithm used for both **classification** and **regression** tasks. It mimics human decision-making by using a tree-like model of decisions and their possible consequences. |
| |
| The basic structure includes: |
| - **Root Node**: Represents the complete dataset. |
| - **Internal Nodes**: Represent conditions on features. |
| - **Leaf Nodes**: Represent outcomes or predictions. |
| |
| Think of it as a flowchart where each internal node asks a question, and each branch represents the outcome, eventually leading to a final decision. |
| """, unsafe_allow_html=True) |
|
|
st.markdown("<h2 style='color: #003366;'>Entropy: Quantifying Uncertainty</h2>", unsafe_allow_html=True)
st.markdown("""
**Entropy** measures the amount of randomness or disorder in the data. It's commonly used in classification problems to decide how informative a feature is.

Entropy formula:
""")
st.image("entropy-formula-2.jpg", width=300)
st.markdown(r"""
Where:
- $p(i)$ is the probability of class $i$.

**Example**:
- If $P(Yes) = 0.5$ and $P(No) = 0.5$,

Then:
$$ H(Y) = - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1 $$

This indicates maximum uncertainty (perfectly balanced classes).
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Gini Impurity: Measuring Impurity</h2>", unsafe_allow_html=True)
st.markdown("""
**Gini Impurity** is another popular impurity measure. It calculates how often a randomly chosen element would be labeled incorrectly if it were labeled at random according to the class distribution.

Formula:
""")
st.image("gini.png", width=300)
st.markdown(r"""
**Example**:
- $P(Yes) = 0.5$, $P(No) = 0.5$

Then:
$$ Gini(Y) = 1 - (0.5^2 + 0.5^2) = 0.5 $$

A lower Gini value means purer splits.
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Building the Decision Tree</h2>", unsafe_allow_html=True)
st.markdown("""
Decision Trees are built **top-down**, starting from the root. At each node, the algorithm selects the feature and threshold that best split the data, using metrics like **Entropy** or **Gini Impurity**.

Splitting stops when:
- The data is pure (contains a single class), or
- A stopping condition is met (such as maximum depth).
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Visualizing: Iris Dataset Tree</h2>", unsafe_allow_html=True)
st.markdown("""
Here's an example decision tree trained on the famous **Iris dataset**, which classifies flower species based on petal and sepal measurements.
""", unsafe_allow_html=True)
st.image("dt1 (1).jpg", caption="Decision Tree for Iris Dataset", use_container_width=True)
|
|
st.markdown("<h2 style='color: #003366;'>Training & Testing (Classification)</h2>", unsafe_allow_html=True)
st.markdown("""
**Training**:
- Select features and split based on impurity reduction.
- Recursively grow the tree until stopping criteria are met.

**Testing**:
- Traverse the tree with new data.
- Follow the decision rules until you reach a leaf node (prediction).

💡 *Example: For Iris, classify the flower as Setosa, Versicolor, or Virginica based on petal dimensions.*
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Training & Testing (Regression)</h2>", unsafe_allow_html=True)
st.markdown("""
**Training**:
- Split the data to minimize **Mean Squared Error (MSE)**.

**Testing**:
- Output the mean target value of the corresponding leaf.

💡 *Example: Predict house price using features like size, location, and number of rooms.*
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Pre-Pruning: Control Overfitting Early</h2>", unsafe_allow_html=True)
st.markdown("""
Pre-pruning stops the tree from growing too deep and complex. Common techniques include:

- **Max Depth**
- **Min Samples Split**
- **Min Samples Leaf**
- **Max Features**

These constraints help the tree generalize better and reduce the influence of noise.
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Post-Pruning: Simplify After Growth</h2>", unsafe_allow_html=True)
st.markdown("""
In **post-pruning**, the tree is first allowed to grow fully, then unnecessary branches are trimmed:

- **Cost Complexity Pruning**
- **Validation-based Pruning**

This reduces overfitting and keeps the model simpler.
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Feature Selection with Decision Trees</h2>", unsafe_allow_html=True)
st.markdown("""
Decision Trees provide insight into which features matter most, based on how often and how effectively each feature splits the data.
""")
st.image("feature.png", width=500)
st.markdown("""
💡 *Higher importance → more influence on the final decision.*
""", unsafe_allow_html=True)
|
|
st.markdown("<h2 style='color: #003366;'>Explore Hands-On Implementation</h2>", unsafe_allow_html=True)
st.markdown(
    "<a href='https://colab.research.google.com/drive/1SqZ5I5h7ivS6SJDwlOZQ-V4IAOg90RE7?usp=sharing' target='_blank' style='font-size: 16px; color: #003366;'>🔗 Open Jupyter Notebook on Google Colab</a>",
    unsafe_allow_html=True
)
|
|