Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
|
@@ -649,4 +649,57 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
|
|
| 649 |
unsafe_allow_html=True,
|
| 650 |
)
|
| 651 |
|
| 652 |
+
st.subheader(":red[Why log is used]")
|
| 653 |
+
st.markdown("<h1 class='title'>Understanding TF-IDF Scaling</h1>", unsafe_allow_html=True)
|
| 654 |
+
|
| 655 |
+
st.markdown(
|
| 656 |
+
"""
|
| 657 |
+
<div class='box'>
|
| 658 |
+
<h3 style='color: #6A0572;'>Minimum and Maximum Values of N/n</h3>
|
| 659 |
+
<ul>
|
| 660 |
+
<li>When <strong>n is maximum</strong> (n = N) → <span class='highlight'>N/n = 1</span>, its minimum; when n = 1, N/n = N, its maximum</li>
|
| 661 |
+
<li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
|
| 662 |
+
<li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
|
| 663 |
+
</ul>
|
| 664 |
+
</div>
|
| 665 |
+
""",
|
| 666 |
+
unsafe_allow_html=True,
|
| 667 |
+
)
|
| 668 |
+
|
| 669 |
+
st.markdown(
|
| 670 |
+
"""
|
| 671 |
+
<div class='box'>
|
| 672 |
+
<h3 style='color: #6A0572;'>IDF Dominance Over TF</h3>
|
| 673 |
+
<ul>
|
| 674 |
+
<li>As <strong>n decreases</strong> → <span class='highlight'>N/n increases</span>, reaching its maximum N at n = 1</li>
|
| 675 |
+
<li>TF values are <span class='highlight'>small</span>, but raw N/n values can be very <span class='highlight'>large</span></li>
|
| 676 |
+
<li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
|
| 677 |
+
</ul>
|
| 678 |
+
</div>
|
| 679 |
+
""",
|
| 680 |
+
unsafe_allow_html=True,
|
| 681 |
+
)
|
| 682 |
+
|
| 683 |
+
st.markdown(
|
| 684 |
+
"""
|
| 685 |
+
<div class='box'>
|
| 686 |
+
<h3 style='color: #6A0572;'>🛠️ How Log Solves IDF Dominance</h3>
|
| 687 |
+
<ul>
|
| 688 |
+
<li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
|
| 689 |
+
<li>The logarithm <span class='highlight'>compresses</span> large N/n values onto a scale comparable to TF</li>
|
| 690 |
+
<li>It reduces the bias towards rare words while preserving their relative order</li>
|
| 691 |
+
</ul>
|
| 692 |
+
</div>
|
| 693 |
+
""",
|
| 694 |
+
unsafe_allow_html=True,
|
| 695 |
+
)
|
| 696 |
+
|
| 697 |
+
st.markdown(
|
| 698 |
+
"""
|
| 699 |
+
<div class='formula'>
|
| 700 |
+
<strong>TF rewards words that are frequent in a document, while log(N/n) keeps rare words from dominating!</strong>
|
| 701 |
+
</div>
|
| 702 |
+
""",
|
| 703 |
+
unsafe_allow_html=True,
|
| 704 |
+
)
|
| 705 |
|
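The scaling argument in the added text can be checked numerically. The following is a minimal sketch, not the code from this page: the helper names `raw_ratio` and `log_idf`, the corpus size `N = 10_000`, and the choice of the natural logarithm are all assumptions made for illustration. It covers only the training-time range 1 ≤ n ≤ N; the test-time case n = 0 (out-of-vocabulary words) would divide by zero and is deliberately left out.

```python
import math


def raw_ratio(N: int, n: int) -> float:
    """Unscaled ratio N/n for a term that appears in n of N documents (n >= 1)."""
    return N / n


def log_idf(N: int, n: int) -> float:
    """Log-scaled version, log(N/n), using the natural logarithm (an assumption)."""
    return math.log(N / n)


# Hypothetical corpus size, chosen only to make the gap visible.
N = 10_000

# n = 1 is the rarest possible term, n = N appears in every document.
for n in (1, 100, N):
    print(f"n={n:>6}  N/n={raw_ratio(N, n):>8.1f}  log(N/n)={log_idf(N, n):.3f}")
```

With these assumed numbers the raw ratio spans from 1 to 10,000, while the log-scaled value stays within single digits, which is the sense in which the log keeps the N/n factor from swamping a TF value that is typically no larger than 1.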