Harika22 committed on
Commit 183160b · verified · 1 Parent(s): e1792cc

Update pages/6_Feature_Engineering.py

  1. pages/6_Feature_Engineering.py +3 -35
pages/6_Feature_Engineering.py CHANGED
@@ -421,13 +421,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
421
 
422
  st.markdown(
423
  """
424
- <div class='step-box'>
425
  <ul>
426
  <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
427
  <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
428
  <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
429
  </ul>
430
- </div>
431
  """,
432
  unsafe_allow_html=True,
433
  )
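
The first two steps in this hunk (vocabulary, then one vector per document) can be tried outside Streamlit in plain Python. This is a minimal sketch, not code from the app: the toy `corpus`, the lowercase-and-split tokenization, and the helper names `build_vocabulary` and `to_count_vector` are all assumptions made for illustration.

```python
# Hypothetical toy corpus; the app's real data is not shown in this diff.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]


def build_vocabulary(docs):
    """Return the sorted list of unique whitespace-separated tokens."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def to_count_vector(doc, vocabulary):
    """Map one document to a d-dimensional count vector (d = len(vocabulary))."""
    tokens = doc.lower().split()
    return [tokens.count(word) for word in vocabulary]


vocabulary = build_vocabulary(corpus)
vectors = [to_count_vector(doc, vocabulary) for doc in corpus]
```

Every document maps to a vector of the same length, one slot per vocabulary word, which is what "d-dimensional representation" refers to.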
@@ -436,12 +434,10 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
436
 
437
  st.markdown(
438
  """
439
- <div class='step-box'>
440
  <ul>
441
  <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
442
  <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
443
  </ul>
444
- </div>
445
  """,
446
  unsafe_allow_html=True,
447
  )
@@ -450,11 +446,9 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
450
 
451
  st.markdown(
452
  """
453
- <div class='step-box'>
454
  - <strong>N:</strong> Total number of documents in the corpus.<br>
455
  - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
456
  - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
457
- </div>
458
  """,
459
  unsafe_allow_html=True,
460
  )
@@ -478,13 +472,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
478
 
479
  st.markdown(
480
  """
481
- <div class='box'>
482
  <ul>
483
  <li>TF measures how often a word appears in a document.</li>
484
  <li>Formula: <span class='highlight'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</span></li>
485
  <li>TF values change based on the document.</li>
486
  </ul>
487
- </div>
488
  """,
489
  unsafe_allow_html=True,
490
  )
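
The TF formula in this hunk is a per-document ratio, so it can be checked with a short function. This is a sketch under assumptions: the function name `term_frequency` and the lowercase-and-split tokenizer are illustrative and do not come from the app.

```python
from collections import Counter


def term_frequency(doc):
    """TF(w, d) = occurrences of w in d / total words in d."""
    tokens = doc.lower().split()  # assumed tokenizer: lowercase + whitespace split
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no terms to score
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
```

Because every value is a count divided by the document length, the TF values of one document sum to 1, and a word scores 1 only when it is the sole word in the document.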
@@ -504,13 +496,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
504
 
505
  st.markdown(
506
  """
507
- <div class='box'>
508
  <ul>
509
  <li>TF values always range from <strong>0 to 1</strong>.</li>
510
  <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
511
  <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
512
  </ul>
513
- </div>
514
  """,
515
  unsafe_allow_html=True,
516
  )
@@ -519,7 +509,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
519
 
520
  st.markdown(
521
  """
522
- <div class='box'>
523
  <ul>
524
  <li>IDF measures how important a word is across the entire corpus.</li>
525
  <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
@@ -527,7 +516,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
527
  <li>n = Number of documents containing wᵢ.</li>
528
  <li>IDF values range from <strong>0 to ∞</strong>.</li>
529
  </ul>
530
- </div>
531
  """,
532
  unsafe_allow_html=True,
533
  )
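
The IDF formula log(N/n) from this hunk can be sketched as follows. The function name `inverse_document_frequency`, the toy `docs`, and the choice of the natural logarithm are assumptions; the diff does not say which log base the app intends.

```python
import math


def inverse_document_frequency(docs):
    """IDF(w, C) = log(N / n) for every word w seen in the corpus C."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)  # N: total number of documents
    vocabulary = set().union(*tokenized)
    idf = {}
    for word in vocabulary:
        # n: number of documents containing the word (at least 1 by construction)
        doc_freq = sum(1 for tokens in tokenized if word in tokens)
        idf[word] = math.log(n_docs / doc_freq)
    return idf


docs = ["the cat sat", "the dog sat", "the bird flew"]
idf = inverse_document_frequency(docs)
```

A word that occurs in every document gets N/n = 1 and therefore an IDF of 0, while a word found in a single document gets the largest value, log(N).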
@@ -536,13 +524,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
536
 
537
  st.markdown(
538
  """
539
- <div class='box'>
540
  <ul>
541
  <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
542
  <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
543
  <li>TF-IDF helps reduce the impact of frequent words while keeping rare words important.</li>
544
  </ul>
545
- </div>
546
  """,
547
  unsafe_allow_html=True,
548
  )
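
Multiplying the two quantities gives the TF-IDF score described in this hunk. The sketch below is self-contained and uses hypothetical names (`tf_idf`, `docs`) plus an assumed whitespace tokenizer and natural log; it is not the app's implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF * IDF} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each word once per document it appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
```

In the first document, "the" scores 0 because it occurs in every document, "sat" scores lower than "cat" because it is shared with a second document, and "cat" scores highest because it is unique to that document.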
@@ -558,9 +544,7 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
558
 
559
  st.markdown(
560
  """
561
- <div class='box'>
562
  - The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
563
- </div>
564
  """,
565
  unsafe_allow_html=True,
566
  )
@@ -569,53 +553,45 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
569
 
570
  st.markdown(
571
  """
572
- <div class='box'>
573
  <h3 style='color: #6A0572;'>📈 Case 1: High TF-IDF Values</h3>
574
  <ul>
575
  <li>If the word appears <strong>frequently</strong> in a document and in few other documents → <span class='highlight'>High TF-IDF</span></li>
576
  </ul>
577
- </div>
578
  """,
579
  unsafe_allow_html=True,
580
  )
581
 
582
  st.markdown(
583
  """
584
- <div class='box'>
585
  <h3 style='color: #6A0572;'>📉 Case 2: Low TF-IDF Values</h3>
586
  <ul>
587
  <li>If the word appears <strong>rarely</strong> in a document, or in most documents of the corpus → <span class='highlight'>Low TF-IDF</span></li>
588
  <li>TF is always in the range: <strong>[0, 1]</strong></li>
589
  <li>IDF is in the range: <strong>[0, ∞)</strong></li>
590
  </ul>
591
- </div>
592
  """,
593
  unsafe_allow_html=True,
594
  )
595
 
596
  st.markdown(
597
  """
598
- <div class='box'>
599
  <h3 style='color: #6A0572;'>📊 Understanding TF (Term Frequency)</h3>
600
  <ul>
601
  <li>TF gives <strong>more importance</strong> to words that occur <strong>frequently</strong> in a document.</li>
602
  <li>As the word frequency <span class='highlight'>increases</span> → TF <span class='highlight'>increases</span>.</li>
603
  </ul>
604
- </div>
605
  """,
606
  unsafe_allow_html=True,
607
  )
608
 
609
  st.markdown(
610
  """
611
- <div class='box'>
612
  <h3 style='color: #6A0572;'>📉 Understanding IDF (Inverse Document Frequency)</h3>
613
  <ul>
614
  <li>IDF Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
615
  <li><strong>N:</strong> Total number of documents</li>
616
  <li><strong>n:</strong> Number of documents containing the word</li>
617
  </ul>
618
- </div>
619
  """,
620
  unsafe_allow_html=True,
621
  )
@@ -637,14 +613,12 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
637
 
638
  st.markdown(
639
  """
640
- <div class='box'>
641
  <h3 style='color: #6A0572;'>📌 TF-IDF Calculation</h3>
642
  <ul>
643
  <li><strong>TF</strong> focuses on words <strong>frequent</strong> in a document.</li>
644
  <li><strong>IDF</strong> focuses on words <strong>rare</strong> in the corpus.</li>
645
  <li><span class='highlight'>TF-IDF is high</span> for words that appear <strong>often in a document</strong> but <strong>rarely in the corpus</strong>.</li>
646
  </ul>
647
- </div>
648
  """,
649
  unsafe_allow_html=True,
650
  )
@@ -654,42 +628,36 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
654
 
655
  st.markdown(
656
  """
657
- <div class='box'>
658
- <h3 style='color: #6A0572;'>📊 Minimum and Maximum Values of N/n</h3>
659
  <ul>
660
  <li>When <strong>n is maximum</strong> → <span class='highlight'>N/n = 1</span></li>
661
  <li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
662
  <li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
663
  </ul>
664
- </div>
665
  """,
666
  unsafe_allow_html=True,
667
  )
668
 
669
  st.markdown(
670
  """
671
- <div class='box'>
672
- <h3 style='color: #6A0572;'>⚖️ IDF Dominance Over TF</h3>
673
  <ul>
674
  <li>If <strong>n decreases</strong> → <span class='highlight'>N/n increases (max)</span></li>
675
  <li>TF scale is very <span class='highlight'>small</span>, but IDF scale is very <span class='highlight'>high</span></li>
676
  <li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
677
  </ul>
678
- </div>
679
  """,
680
  unsafe_allow_html=True,
681
  )
682
 
683
  st.markdown(
684
  """
685
- <div class='box'>
686
- <h3 style='color: #6A0572;'>🛠️ How Log Solves IDF Dominance?</h3>
687
  <ul>
688
  <li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
689
  <li>Logarithm <span class='highlight'>compresses</span> large N/n values onto a much smaller scale</li>
690
  <li>It reduces the bias towards rare words while preserving their relative ordering</li>
691
  </ul>
692
- </div>
693
  """,
694
  unsafe_allow_html=True,
695
  )
 
421
 
422
  st.markdown(
423
  """
 
424
  <ul>
425
  <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
426
  <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
427
  <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
428
  </ul>
 
429
  """,
430
  unsafe_allow_html=True,
431
  )
 
434
 
435
  st.markdown(
436
  """
 
437
  <ul>
438
  <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
439
  <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
440
  </ul>
 
441
  """,
442
  unsafe_allow_html=True,
443
  )
 
446
 
447
  st.markdown(
448
  """
 
449
  - <strong>N:</strong> Total number of documents in the corpus.<br>
450
  - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
451
  - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
 
452
  """,
453
  unsafe_allow_html=True,
454
  )
 
472
 
473
  st.markdown(
474
  """
 
475
  <ul>
476
  <li>TF measures how often a word appears in a document.</li>
477
  <li>Formula: <span class='highlight'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</span></li>
478
  <li>TF values change based on the document.</li>
479
  </ul>
 
480
  """,
481
  unsafe_allow_html=True,
482
  )
 
496
 
497
  st.markdown(
498
  """
 
499
  <ul>
500
  <li>TF values always range from <strong>0 to 1</strong>.</li>
501
  <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
502
  <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
503
  </ul>
 
504
  """,
505
  unsafe_allow_html=True,
506
  )
 
509
 
510
  st.markdown(
511
  """
 
512
  <ul>
513
  <li>IDF measures how important a word is across the entire corpus.</li>
514
  <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
 
516
  <li>n = Number of documents containing wᵢ.</li>
517
  <li>IDF values range from <strong>0 to ∞</strong>.</li>
518
  </ul>
 
519
  """,
520
  unsafe_allow_html=True,
521
  )
 
524
 
525
  st.markdown(
526
  """
 
527
  <ul>
528
  <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
529
  <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
530
  <li>TF-IDF helps reduce the impact of frequent words while keeping rare words important.</li>
531
  </ul>
 
532
  """,
533
  unsafe_allow_html=True,
534
  )
 
544
 
545
  st.markdown(
546
  """
 
547
  - The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
 
548
  """,
549
  unsafe_allow_html=True,
550
  )
 
553
 
554
  st.markdown(
555
  """
 
556
  <h3 style='color: #6A0572;'>📈 Case 1: High TF-IDF Values</h3>
557
  <ul>
558
  <li>If the word appears <strong>frequently</strong> in a document and in few other documents → <span class='highlight'>High TF-IDF</span></li>
559
  </ul>
 
560
  """,
561
  unsafe_allow_html=True,
562
  )
563
 
564
  st.markdown(
565
  """
 
566
  <h3 style='color: #6A0572;'>📉 Case 2: Low TF-IDF Values</h3>
567
  <ul>
568
  <li>If the word appears <strong>rarely</strong> in a document, or in most documents of the corpus → <span class='highlight'>Low TF-IDF</span></li>
569
  <li>TF is always in the range: <strong>[0, 1]</strong></li>
570
  <li>IDF is in the range: <strong>[0, ∞)</strong></li>
571
  </ul>
 
572
  """,
573
  unsafe_allow_html=True,
574
  )
575
 
576
  st.markdown(
577
  """
 
578
  <h3 style='color: #6A0572;'>📊 Understanding TF (Term Frequency)</h3>
579
  <ul>
580
  <li>TF gives <strong>more importance</strong> to words that occur <strong>frequently</strong> in a document.</li>
581
  <li>As the word frequency <span class='highlight'>increases</span> → TF <span class='highlight'>increases</span>.</li>
582
  </ul>
 
583
  """,
584
  unsafe_allow_html=True,
585
  )
586
 
587
  st.markdown(
588
  """
 
589
  <h3 style='color: #6A0572;'>📉 Understanding IDF (Inverse Document Frequency)</h3>
590
  <ul>
591
  <li>IDF Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
592
  <li><strong>N:</strong> Total number of documents</li>
593
  <li><strong>n:</strong> Number of documents containing the word</li>
594
  </ul>
 
595
  """,
596
  unsafe_allow_html=True,
597
  )
 
613
 
614
  st.markdown(
615
  """
 
616
  <h3 style='color: #6A0572;'>📌 TF-IDF Calculation</h3>
617
  <ul>
618
  <li><strong>TF</strong> focuses on words <strong>frequent</strong> in a document.</li>
619
  <li><strong>IDF</strong> focuses on words <strong>rare</strong> in the corpus.</li>
620
  <li><span class='highlight'>TF-IDF is high</span> for words that appear <strong>often in a document</strong> but <strong>rarely in the corpus</strong>.</li>
621
  </ul>
 
622
  """,
623
  unsafe_allow_html=True,
624
  )
 
628
 
629
  st.markdown(
630
  """
631
+ <h3 style='color: #6A0572;'> Minimum and Maximum Values of N/n</h3>
 
632
  <ul>
633
  <li>When <strong>n is maximum</strong> → <span class='highlight'>N/n = 1</span></li>
634
  <li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
635
  <li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
636
  </ul>
 
637
  """,
638
  unsafe_allow_html=True,
639
  )
640
 
641
  st.markdown(
642
  """
643
+ <h3 style='color: #6A0572;'> IDF Dominance Over TF</h3>
 
644
  <ul>
645
  <li>If <strong>n decreases</strong> → <span class='highlight'>N/n increases (max)</span></li>
646
  <li>TF scale is very <span class='highlight'>small</span>, but IDF scale is very <span class='highlight'>high</span></li>
647
  <li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
648
  </ul>
 
649
  """,
650
  unsafe_allow_html=True,
651
  )
652
 
653
  st.markdown(
654
  """
655
+ <h3 style='color: #6A0572;'>How Log Solves IDF Dominance?</h3>
 
656
  <ul>
657
  <li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
658
  <li>Logarithm <span class='highlight'>compresses</span> large N/n values onto a much smaller scale</li>
659
  <li>It reduces the bias towards rare words while preserving their relative ordering</li>
660
  </ul>
 
661
  """,
662
  unsafe_allow_html=True,
663
  )
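
The point of the last three blocks, that the raw ratio N/n can dwarf a TF value in [0, 1] and that the logarithm tames it, is easier to see with numbers. This is an illustrative sketch only: the corpus size `N`, the sample document frequencies in `doc_freqs`, and the use of the natural log are assumptions, not values from the app.

```python
import math

N = 1_000_000  # assumed corpus size, for illustration only
doc_freqs = [1, 10, 1_000, 100_000, N]  # assumed sample values of n

raw = [N / n for n in doc_freqs]        # N/n without the log
damped = [math.log(r) for r in raw]     # log(N/n), the IDF used above

for n, r, d in zip(doc_freqs, raw, damped):
    print(f"n={n:>9,}  N/n={r:>11,.0f}  log(N/n)={d:6.2f}")
```

Without the log, the rarest word here would be weighted a million times more than the most common one. With the log, the weight stays below 14 while the ordering is unchanged, so rarer words still rank higher.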