Harika22 committed on
Commit f9c5382 · verified · 1 Parent(s): 6ac7719

Update pages/6_Feature_Engineering.py

  1. pages/6_Feature_Engineering.py +108 -34
pages/6_Feature_Engineering.py CHANGED
@@ -458,38 +458,112 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
458
  """,
459
  unsafe_allow_html=True,
460
  )
461
- st.markdown('''Example of TF-IDF
462
- - In a corpus there are 3 documents d1, d2, d3
463
- - d1 ➡️ w1, w2, w3, w1 ➡️ v1
464
- - d1 ➡️ w1, w2, w2, w3, w4, w2, w3 ➡️ v2
465
- - d1 ➡️ w1, w5 ➡️ v3
466
- - values are product of two values
467
- - wi = ith representation of word
468
- - Vocabulary = {w1, w2, w3, w4, w5}
469
- - len(voc) = 5
470
- - TF(w1, d1) = 2/4
471
- - TF(w2, d1) = 1/4
472
- - TF(w3, d1) = 1/4
473
- - TF(w4, d1) = 0/4
474
- - TF(w5, d1) = 0/4
475
- - TF value for every word will be going on changing as the document changes
476
- - TF lies between 0 and 1 [0 ... 1] ( sort of probability)
477
- - Case-1 : TF = 0 that wi is not present in particular di
478
- - Case-2 : TF = 1 → that wi is the only word present in particular di
479
- - IDF(wi, C) = log(N/n)
480
- - n= total no.of documents which contains wi
481
- - N = total no.of documents
482
- - IDF values lies between >=0 to ∞(infinite)
483
- - IDF(w1, C) = log(3/3)
484
- - IDF(w2, C) = log(3/2)
485
- - IDF(w3, C) = log(3/2)
486
- - IDF(w4, C) = log(3/1)
487
- - IDF(w5, C) = log(3/1)
488
- - Tf(wi, di) is calculated and stored in memory
489
- - Converting document to vector by product of TF and IDF
490
- - d1:v1 [0,0.04,0.04,0,0] → TF*IDF values
491
- - TF * IDF values can be low or high or zero
492
-
493
- ''')
494
-
458
  """,
459
  unsafe_allow_html=True,
460
  )
461
+ st.markdown("<h1 class='title'>📌 Example of TF-IDF</h1>", unsafe_allow_html=True)
462
+
463
+ st.markdown(
464
+ """
465
+ <div class='box'>
466
+ <strong>Given a corpus with 3 documents:</strong><br><br>
467
+ <strong>d1:</strong> w1, w2, w3, w1 → v1 <br>
468
+ <strong>d2:</strong> w1, w2, w2, w3, w4, w2, w3 → v2 <br>
469
+ <strong>d3:</strong> w1, w5 → v3 <br><br>
470
+ <strong>Vocabulary:</strong> {w1, w2, w3, w4, w5} <br>
471
+ <strong>Vocabulary Size:</strong> 5 (each document vector has d = 5 dimensions)
472
+ </div>
473
+ """,
474
+ unsafe_allow_html=True,
475
+ )
476
+
477
+ st.markdown("<h2 style='color: #6A0572;'>📊 Term Frequency (TF) Calculation</h2>", unsafe_allow_html=True)
478
+
479
+ st.markdown(
480
+ """
481
+ <div class='box'>
482
+ <ul>
483
+ <li>TF measures how often a word appears in a document.</li>
484
+ <li>Formula: <span class='highlight'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</span></li>
485
+ <li>The same word can have a different TF value in each document.</li>
486
+ </ul>
487
+ </div>
488
+ """,
489
+ unsafe_allow_html=True,
490
+ )
491
+
492
+ st.markdown(
493
+ """
494
+ <div class='formula'>
495
+ TF(w1, d1) = 2/4 = 0.5 <br>
496
+ TF(w2, d1) = 1/4 = 0.25 <br>
497
+ TF(w3, d1) = 1/4 = 0.25 <br>
498
+ TF(w4, d1) = 0/4 = 0 <br>
499
+ TF(w5, d1) = 0/4 = 0 <br>
500
+ </div>
501
+ """,
502
+ unsafe_allow_html=True,
503
+ )
504
+
505
+ st.markdown(
506
+ """
507
+ <div class='box'>
508
+ <ul>
509
+ <li>TF values always range from <strong>0 to 1</strong>.</li>
510
+ <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
511
+ <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
512
+ </ul>
513
+ </div>
514
+ """,
515
+ unsafe_allow_html=True,
516
+ )
517
+
518
+ st.markdown("<h2 style='color: #6A0572;'>📉 Inverse Document Frequency (IDF) Calculation</h2>", unsafe_allow_html=True)
519
+
520
+ st.markdown(
521
+ """
522
+ <div class='box'>
523
+ <ul>
524
+ <li>IDF measures how rare a word is across the entire corpus: the fewer documents contain it, the higher its IDF.</li>
525
+ <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
526
+ <li>N = Total number of documents.</li>
527
+ <li>n = Number of documents containing wᵢ.</li>
528
+ <li>IDF values are always <strong>≥ 0</strong>: 0 when the word appears in every document (n = N), up to log(N) when it appears in only one.</li>
529
+ </ul>
530
+ </div>
531
+ """,
532
+ unsafe_allow_html=True,
533
+ )
534
+
535
+ st.markdown("<h2 style='color: #6A0572;'>📌 TF-IDF Calculation</h2>", unsafe_allow_html=True)
536
+
537
+ st.markdown(
538
+ """
539
+ <div class='box'>
540
+ <ul>
541
+ <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
542
+ <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
543
+ <li>TF-IDF down-weights words that appear in most documents and up-weights words that are frequent in only a few.</li>
544
+ </ul>
545
+ </div>
546
+ """,
547
+ unsafe_allow_html=True,
548
+ )
549
+
550
+ st.markdown(
551
+ """
552
+ <div class='formula'>
553
+ d1 → v1 = [0, 0.04, 0.04, 0, 0] (TF * IDF values, using log base 10)
554
+ </div>
555
+ """,
556
+ unsafe_allow_html=True,
557
+ )
558
+
559
+ st.markdown(
560
+ """
561
+ <div class='box'>
562
+ The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
563
+ </div>
564
+ """,
565
+ unsafe_allow_html=True,
566
+ )
567
+
568
+ st.markdown("<p style='text-align: center; font-size: 18px;'><strong>TF-IDF weights each word by how often it appears in a document and how rare it is across the corpus! 🚀</strong></p>", unsafe_allow_html=True)
569
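
The arithmetic in the TF-IDF example can be checked with a short script. This is a minimal sketch, not part of the committed page: the function names (`term_frequency`, `inverse_document_frequency`, `tfidf_vector`) are illustrative, and the logarithm is assumed to be base 10, since the page does not state a base and base 10 is the one that reproduces the 0.04 entries in v1.

```python
import math
from collections import Counter

# Toy corpus from the example; w1..w5 stand in for real words.
corpus = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
vocabulary = ["w1", "w2", "w3", "w4", "w5"]


def term_frequency(word, tokens):
    """Occurrences of `word` in the document divided by the document length."""
    return Counter(tokens)[word] / len(tokens)


def inverse_document_frequency(word, docs):
    """log10(N / n), where n is the number of documents containing `word`.

    Assumes `word` occurs in at least one document; otherwise n would be 0.
    """
    n_containing = sum(1 for tokens in docs.values() if word in tokens)
    return math.log10(len(docs) / n_containing)


def tfidf_vector(doc_id, docs, vocab):
    """One TF * IDF value per vocabulary word, in vocabulary order."""
    return [
        term_frequency(w, docs[doc_id]) * inverse_document_frequency(w, docs)
        for w in vocab
    ]


v1 = tfidf_vector("d1", corpus, vocabulary)
print([round(x, 2) for x in v1])  # [0.0, 0.04, 0.04, 0.0, 0.0]
```

w1 gets a weight of 0 in d1 even though it is the most frequent word there, because it appears in all three documents and log(3/3) = 0. Library implementations such as scikit-learn's `TfidfVectorizer` apply IDF smoothing and vector normalization by default, so their numbers will differ from this hand calculation.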