Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
@@ -421,13 +421,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='step-box'>
 <ul>
 <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
 <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
 <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -436,12 +434,10 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='step-box'>
 <ul>
 <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
 <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -450,11 +446,9 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='step-box'>
 - <strong>N:</strong> Total number of documents in the corpus.<br>
 - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
 - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -478,13 +472,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <ul>
 <li>TF measures how often a word appears in a document.</li>
 <li>Formula: <span class='highlight'>TF(wᵢ, dᵢ) = (Occurrences of wᵢ in dᵢ) / (Total words in dᵢ)</span></li>
 <li>TF values change based on the document.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -504,13 +496,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <ul>
 <li>TF values always range from <strong>0 to 1</strong>.</li>
 <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
 <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -519,7 +509,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <ul>
 <li>IDF measures how important a word is across the entire corpus.</li>
 <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
@@ -527,7 +516,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 <li>n = Number of documents containing wᵢ.</li>
 <li>IDF values range from <strong>0 to ∞</strong>.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -536,13 +524,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <ul>
 <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
 <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
 <li>TF-IDF helps reduce the impact of frequent words while keeping rare words important.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -558,9 +544,7 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 - The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -569,53 +553,45 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <h3 style='color: #6A0572;'>📈 Case 1: High TF-IDF Values</h3>
 <ul>
 <li>If the word appears <strong>frequently</strong> in a document → <span class='highlight'>High TF-IDF</span></li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
 
 st.markdown(
 """
-<div class='box'>
 <h3 style='color: #6A0572;'>📉 Case 2: Low TF-IDF Values</h3>
 <ul>
 <li>If the word appears <strong>rarely</strong> in a document → <span class='highlight'>Low TF-IDF</span></li>
 <li>TF is always in the range: <strong>[0 - 1]</strong></li>
 <li>IDF is in the range: <strong>[0 - ∞)</strong></li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
 
 st.markdown(
 """
-<div class='box'>
 <h3 style='color: #6A0572;'>📊 Understanding TF (Term Frequency)</h3>
 <ul>
 <li>TF gives <strong>more importance</strong> to words that occur <strong>frequently</strong> in a document.</li>
 <li>As the word frequency <span class='highlight'>increases</span> → TF <span class='highlight'>increases</span>.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
 
 st.markdown(
 """
-<div class='box'>
 <h3 style='color: #6A0572;'>📉 Understanding IDF (Inverse Document Frequency)</h3>
 <ul>
 <li>IDF Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
 <li><strong>N:</strong> Total number of documents</li>
 <li><strong>n:</strong> Number of documents containing the word</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -637,14 +613,12 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-<div class='box'>
 <h3 style='color: #6A0572;'>📌 TF-IDF Calculation</h3>
 <ul>
 <li><strong>TF</strong> focuses on words <strong>frequent</strong> in a document.</li>
 <li><strong>IDF</strong> focuses on words <strong>rare</strong> in the corpus.</li>
 <li><span class='highlight'>TF-IDF is high</span> for words that appear <strong>often in a document</strong> but <strong>rarely in the corpus</strong>.</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
@@ -654,42 +628,36 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
 
 st.markdown(
 """
-
-<h3 style='color: #6A0572;'>📊 Minimum and Maximum Values of N/n</h3>
+<h3 style='color: #6A0572;'> Minimum and Maximum Values of N/n</h3>
 <ul>
 <li>When <strong>n is maximum</strong> → <span class='highlight'>N/n = 1</span></li>
 <li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
 <li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
 
 st.markdown(
 """
-
-<h3 style='color: #6A0572;'>⚖️ IDF Dominance Over TF</h3>
+<h3 style='color: #6A0572;'> IDF Dominance Over TF</h3>
 <ul>
 <li>If <strong>n decreases</strong> → <span class='highlight'>N/n increases (max)</span></li>
 <li>TF scale is very <span class='highlight'>small</span>, but IDF scale is very <span class='highlight'>high</span></li>
 <li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
 
 st.markdown(
 """
-
-<h3 style='color: #6A0572;'>🛠️ How Log Solves IDF Dominance?</h3>
+<h3 style='color: #6A0572;'>How Log Solves IDF Dominance?</h3>
 <ul>
 <li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
 <li>Logarithm <span class='highlight'>rounds off</span> values to a balanced scale</li>
 <li>It prevents bias towards rare words and maintains proportionality</li>
 </ul>
-</div>
 """,
 unsafe_allow_html=True,
 )
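
Reviewer note: the formulas quoted in the page text of this diff (TF = occurrences / total words, IDF = log(N/n), TF-IDF = TF * IDF) can be sanity-checked with a small sketch. This is not code from the commit. The toy corpus, the helper names `tf`, `idf` and `tf_idf`, whitespace tokenization, the natural log, and the choice to return 0 for a word seen in no document are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the commit); documents are whitespace-tokenized.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # total number of documents


def tf(word, doc):
    """TF(w, d) = occurrences of w in d / total words in d, so it lies in [0, 1]."""
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """IDF(w, C) = log(N / n), where n is the number of documents containing w."""
    n = sum(1 for doc in docs if word in doc)
    if n == 0:
        # Assumption: log(N / 0) is undefined, so an out-of-vocabulary word
        # is given IDF 0 here. The page text does not specify this case.
        return 0.0
    return math.log(len(docs) / n)


def tf_idf(word, doc, docs):
    """TF-IDF = TF * IDF."""
    return tf(word, doc) * idf(word, docs)


# Compare a word found in every document, in two documents, and in one document.
for word in ("the", "cat", "mat"):
    n = sum(1 for doc in docs if word in doc)
    print(
        f"{word!r}: TF={tf(word, docs[0]):.3f}  N/n={N / n:.2f}  "
        f"IDF={idf(word, docs):.3f}  TF-IDF={tf_idf(word, docs[0], docs):.3f}"
    )

# Raw ratio versus its log for a hypothetical corpus of one million documents.
for n in (1, 1_000, 1_000_000):
    ratio = 1_000_000 / n
    print(f"n={n}: N/n={ratio:.0f}  log(N/n)={math.log(ratio):.2f}")
```

In the first loop, a word present in every document gets IDF 0 and therefore TF-IDF 0, while the word found in a single document scores highest. The second loop illustrates the point of the final hunk: the raw ratio N/n spans six orders of magnitude while log(N/n) stays below 14, so the log compresses the scale rather than literally rounding values.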