Update pages/6_Feature_Engineering.py
- pages/6_Feature_Engineering.py +108 -34
pages/6_Feature_Engineering.py
CHANGED
|
@@ -458,38 +458,112 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
|
|
| 458 |
""",
|
| 459 |
unsafe_allow_html=True,
|
| 460 |
)
|
| 461 |
-
st.markdown(''
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
|
| 473 |
-
|
| 474 |
-
|
| 475 |
-
|
| 476 |
-
|
| 477 |
-
|
| 478 |
-
|
| 479 |
-
|
| 480 |
-
|
| 481 |
-
|
| 482 |
-
|
| 483 |
-
|
| 484 |
-
|
| 485 |
-
|
| 486 |
-
|
| 487 |
-
|
| 488 |
-
|
| 489 |
-
|
| 490 |
-
|
| 491 |
-
|
| 492 |
-
|
| 493 |
-
|
| 494 |
-
|
| 495 |
|
|
|
|
| 458 |
""",
|
| 459 |
unsafe_allow_html=True,
|
| 460 |
)
|
| 461 |
+
st.markdown("<h1 class='title'>📌 Example of TF-IDF</h1>", unsafe_allow_html=True)
|
| 462 |
+
|
| 463 |
+
st.markdown(
|
| 464 |
+
"""
|
| 465 |
+
<div class='box'>
|
| 466 |
+
<strong>Given a corpus with 3 documents:</strong><br><br>
|
| 467 |
+
<strong>d1:</strong> w1, w2, w3, w1 → v1 <br>
|
| 468 |
+
<strong>d2:</strong> w1, w2, w2, w3, w4, w2, w3 → v2 <br>
|
| 469 |
+
<strong>d3:</strong> w1, w5 → v3 <br><br>
|
| 470 |
+
<strong>Vocabulary:</strong> {w1, w2, w3, w4, w5} <br>
|
| 471 |
+
<strong>Vocabulary Size:</strong> 5 (d-dimension)
|
| 472 |
+
</div>
|
| 473 |
+
""",
|
| 474 |
+
unsafe_allow_html=True,
|
| 475 |
+
)
|
| 476 |
+
|
| 477 |
+
st.markdown("<h2 style='color: #6A0572;'>📊 Term Frequency (TF) Calculation</h2>", unsafe_allow_html=True)
|
| 478 |
+
|
| 479 |
+
st.markdown(
|
| 480 |
+
"""
|
| 481 |
+
<div class='box'>
|
| 482 |
+
<ul>
|
| 483 |
+
<li>TF measures how often a word appears in a document.</li>
|
| 484 |
+
<li>Formula: <span class='highlight'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</span></li>
|
| 485 |
+
<li>The same word can have a different TF value in each document.</li>
|
| 486 |
+
</ul>
|
| 487 |
+
</div>
|
| 488 |
+
""",
|
| 489 |
+
unsafe_allow_html=True,
|
| 490 |
+
)
|
| 491 |
+
|
| 492 |
+
st.markdown(
|
| 493 |
+
"""
|
| 494 |
+
<div class='formula'>
|
| 495 |
+
TF(w1, d1) = 2/4 = 0.5 <br>
|
| 496 |
+
TF(w2, d1) = 1/4 = 0.25 <br>
|
| 497 |
+
TF(w3, d1) = 1/4 = 0.25 <br>
|
| 498 |
+
TF(w4, d1) = 0/4 = 0 <br>
|
| 499 |
+
TF(w5, d1) = 0/4 = 0 <br>
|
| 500 |
+
</div>
|
| 501 |
+
""",
|
| 502 |
+
unsafe_allow_html=True,
|
| 503 |
+
)
|
| 504 |
+
|
| 505 |
+
st.markdown(
|
| 506 |
+
"""
|
| 507 |
+
<div class='box'>
|
| 508 |
+
<ul>
|
| 509 |
+
<li>TF values always range from <strong>0 to 1</strong>.</li>
|
| 510 |
+
<li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
|
| 511 |
+
<li>Case-2: <span class='highlight'>TF = 1</span> → Every word in the document is this word.</li>
|
| 512 |
+
</ul>
|
| 513 |
+
</div>
|
| 514 |
+
""",
|
| 515 |
+
unsafe_allow_html=True,
|
| 516 |
+
)
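The TF figures shown for d1 can be reproduced with a few lines of plain Python. This is a minimal sketch and not part of the commit: `term_frequency`, `docs`, and `vocab` are hypothetical names, and the token lists are copied from the three-document example in this hunk.

```python
from collections import Counter

# Example corpus from the page: token lists for d1, d2, d3.
docs = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
vocab = ["w1", "w2", "w3", "w4", "w5"]


def term_frequency(word, tokens):
    """Occurrences of `word` in `tokens` divided by the total token count."""
    # Counter returns 0 for words that never occur, so absent words get TF = 0.
    return Counter(tokens)[word] / len(tokens)


tf_d1 = [term_frequency(w, docs["d1"]) for w in vocab]
print(tf_d1)  # [0.5, 0.25, 0.25, 0.0, 0.0]
```

Because each count is divided by the document length, every value stays between 0 and 1, which matches the range stated in the box above.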
|
| 517 |
+
|
| 518 |
+
st.markdown("<h2 style='color: #6A0572;'>📉 Inverse Document Frequency (IDF) Calculation</h2>", unsafe_allow_html=True)
|
| 519 |
+
|
| 520 |
+
st.markdown(
|
| 521 |
+
"""
|
| 522 |
+
<div class='box'>
|
| 523 |
+
<ul>
|
| 524 |
+
<li>IDF measures how rare, and therefore how informative, a word is across the entire corpus.</li>
|
| 525 |
+
<li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
|
| 526 |
+
<li>N = Total number of documents.</li>
|
| 527 |
+
<li>n = Number of documents containing wᵢ.</li>
|
| 528 |
+
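<li>IDF values range from <strong>0</strong> (word appears in every document) to <strong>log(N)</strong> (word appears in only one document), so the upper bound grows as the corpus grows.</li>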
|
| 529 |
+
</ul>
|
| 530 |
+
</div>
|
| 531 |
+
""",
|
| 532 |
+
unsafe_allow_html=True,
|
| 533 |
+
)
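As a rough check of the IDF formula on the same three documents, the sketch below computes log(N/n) for each vocabulary word. It is not part of the commit. The function name `inverse_document_frequency` is hypothetical, and the base-10 logarithm is an assumption: the page does not state the base, but base 10 is the one that reproduces the 0.04 entries in the TF-IDF vector shown later.

```python
import math

# Example corpus from the page: token lists for d1, d2, d3.
docs = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
vocab = ["w1", "w2", "w3", "w4", "w5"]


def inverse_document_frequency(word, corpus):
    """log10(N / n). Assumes `word` occurs in at least one document."""
    n_docs = len(corpus)  # N: total number of documents
    n_containing = sum(1 for tokens in corpus.values() if word in tokens)  # n
    return math.log10(n_docs / n_containing)


idf = {w: inverse_document_frequency(w, docs) for w in vocab}
for word, value in idf.items():
    print(word, round(value, 3))
# w1 0.0
# w2 0.176
# w3 0.176
# w4 0.477
# w5 0.477
```

w1 appears in all three documents, so its IDF is 0, while w4 and w5 each appear in a single document and reach the maximum for this corpus, log10(3).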
|
| 534 |
+
|
| 535 |
+
st.markdown("<h2 style='color: #6A0572;'>📌 TF-IDF Calculation</h2>", unsafe_allow_html=True)
|
| 536 |
+
|
| 537 |
+
st.markdown(
|
| 538 |
+
"""
|
| 539 |
+
<div class='box'>
|
| 540 |
+
<ul>
|
| 541 |
+
<li>We calculate TF-IDF by multiplying TF and IDF values.</li>
|
| 542 |
+
<li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
|
| 543 |
+
<li>TF-IDF down-weights words that appear in most documents and up-weights words that are frequent in one document but rare in the corpus.</li>
|
| 544 |
+
</ul>
|
| 545 |
+
</div>
|
| 546 |
+
""",
|
| 547 |
+
unsafe_allow_html=True,
|
| 548 |
+
)
|
| 549 |
+
|
| 550 |
+
st.markdown(
|
| 551 |
+
"""
|
| 552 |
+
<div class='formula'>
|
| 553 |
+
d1 → v1 = [0, 0.04, 0.04, 0, 0] (TF * IDF values, base-10 log, rounded to 2 decimals)
|
| 554 |
+
</div>
|
| 555 |
+
""",
|
| 556 |
+
unsafe_allow_html=True,
|
| 557 |
+
)
|
| 558 |
+
|
| 559 |
+
st.markdown(
|
| 560 |
+
"""
|
| 561 |
+
<div class='box'>
|
| 562 |
+
A TF-IDF value can be zero, low, or high, depending on the term's frequency in the document and on how many documents contain the term.
|
| 563 |
+
</div>
|
| 564 |
+
""",
|
| 565 |
+
unsafe_allow_html=True,
|
| 566 |
+
)
|
| 567 |
+
|
| 568 |
+
st.markdown("<p style='text-align: center; font-size: 18px;'><strong>TF-IDF effectively balances word significance and document relevance! 🚀</strong></p>", unsafe_allow_html=True)
|
| 569 |
|
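To tie the pieces of this hunk together, the following sketch recomputes the d1 vector from the raw token lists. It is an illustration, not code from the commit: `tfidf_vector`, `docs`, and `vocab` are hypothetical names, and the base-10 logarithm is an assumption chosen because it reproduces the 0.04 entries shown on the page.

```python
import math
from collections import Counter

# Example corpus from the page: token lists for d1, d2, d3.
docs = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
vocab = ["w1", "w2", "w3", "w4", "w5"]


def tfidf_vector(tokens, corpus, vocab):
    """Return one TF * IDF value per vocabulary word for a single document."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    vector = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        # Assumes every vocabulary word occurs in at least one document.
        n_containing = sum(1 for doc in corpus.values() if word in doc)
        idf = math.log10(n_docs / n_containing)
        vector.append(tf * idf)
    return vector


v1 = tfidf_vector(docs["d1"], docs, vocab)
print([round(x, 2) for x in v1])  # [0.0, 0.04, 0.04, 0.0, 0.0]
```

Note that w1 is the most frequent word in d1 (TF = 0.5) yet ends up with a weight of 0, because it occurs in every document. Library implementations such as scikit-learn's `TfidfVectorizer` apply IDF smoothing and vector normalization by default, so their output will differ from these hand-computed values.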