Harika22 committed on
Commit
d019295
·
verified ·
1 Parent(s): 7007a94

Update pages/6_Feature_Engineering.py

Files changed (1)
  1. pages/6_Feature_Engineering.py +98 -0
pages/6_Feature_Engineering.py CHANGED
@@ -67,6 +67,24 @@ st.markdown("""
67
  .sidebar h2 {
68
  color: #495057;
69
  }
70
  /* Custom button style */
71
  .streamlit-button {
72
  background-color: #00FFFF;
@@ -378,4 +396,84 @@ elif file_type == "Bag of Words(BOW)":
378
 
379
  elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
380
  st.title(":red[Term Frequency - Inverse Document Frequency(TF-IDF)]")
381
 
 
67
  .sidebar h2 {
68
  color: #495057;
69
  }
70
+ .step-box {
71
+ font-size: 18px;
72
+ background-color: #F0F8FF;
73
+ padding: 15px;
74
+ border-radius: 10px;
75
+ box-shadow: 2px 2px 8px #D3D3D3;
76
+ line-height: 1.6;
77
+ }
78
+ .formula {
79
+ font-size: 20px;
80
+ font-weight: bold;
81
+ color: #2A9D8F;
82
+ background-color: #F7F7F7;
83
+ padding: 10px;
84
+ border-radius: 5px;
85
+ text-align: center;
86
+ margin-top: 10px;
87
+ }
88
  /* Custom button style */
89
  .streamlit-button {
90
  background-color: #00FFFF;
 
396
 
397
  elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
398
  st.title(":red[Term Frequency - Inverse Document Frequency(TF-IDF)]")
399
+ st.markdown("""
400
+ ### 📌 What is TF-IDF?
401
+ - It is a vectorization technique that converts text into a numerical vector, weighting each word by how often it occurs in a document and how rare it is across the corpus.
402
+ """)
403
+
404
+ st.subheader(":violet[🛠️ Steps in TF-IDF]")
405
+
406
+ st.markdown(
407
+ """
408
+ <div class='step-box'>
409
+ <ul>
410
+ <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
411
+ <li><strong>Convert each document into a vector:</strong> A d-dimensional representation, where d is the vocabulary size.</li>
412
+ <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
413
+ </ul>
414
+ </div>
415
+ """,
416
+ unsafe_allow_html=True,
417
+ )
418
+
419
+ st.markdown("<div class='formula'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</div>", unsafe_allow_html=True)
420
+
421
+ st.markdown(
422
+ """
423
+ <div class='step-box'>
424
+ <ul>
425
+ <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how rare a word is across all documents; rarer words get a higher weight.</li>
426
+ <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
427
+ </ul>
428
+ </div>
429
+ """,
430
+ unsafe_allow_html=True,
431
+ )
432
+
433
+ st.markdown("<div class='formula'>IDF(wᵢ, C) = log(N/n)</div>", unsafe_allow_html=True)
434
+
435
+ st.markdown(
436
+ """
437
+ <div class='step-box'>
438
+ - <strong>N:</strong> Total number of documents in the corpus.<br>
439
+ - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
440
+ - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
441
+ </div>
442
+ """,
443
+ unsafe_allow_html=True,
444
+ )
445
+ st.markdown('''Example of TF-IDF
446
+ - In a corpus there are 3 documents d1, d2, d3
447
+ - d1 ➡️ w1, w2, w3, w1 ➡️ v1
448
+ - d2 ➡️ w1, w2, w2, w3, w4, w2, w3 ➡️ v2
449
+ - d3 ➡️ w1, w5 ➡️ v3
450
+ - Each entry of a vector is the product of two values: TF and IDF
451
+ - wi = the ith word in the vocabulary
452
+ - Vocabulary = {w1, w2, w3, w4, w5}
453
+ - len(voc) = 5
454
+ - TF(w1, d1) = 2/4
455
+ - TF(w2, d1) = 1/4
456
+ - TF(w3, d1) = 1/4
457
+ - TF(w4, d1) = 0/4
458
+ - TF(w5, d1) = 0/4
459
+ - The TF value of a word changes from document to document
460
+ - TF lies in the range [0, 1] (it behaves like a probability)
461
+ - Case-1 : TF = 0 → wi is not present in that document
462
+ - Case-2 : TF = 1 → wi is the only word present in that document
463
+ - IDF(wi, C) = log(N/n)
464
+ - n = number of documents that contain wi
465
+ - N = total number of documents
466
+ - IDF is never negative: it is 0 when wi occurs in every document and reaches at most log(N) when wi occurs in only one
467
+ - IDF(w1, C) = log(3/3)
468
+ - IDF(w2, C) = log(3/2)
469
+ - IDF(w3, C) = log(3/2)
470
+ - IDF(w4, C) = log(3/1)
471
+ - IDF(w5, C) = log(3/1)
472
+ - TF(wi, di) is calculated and stored in memory
473
+ - Each document is converted to a vector by multiplying TF and IDF for every word in the vocabulary
474
+ - d1 ➡️ v1 = [0, 0.04, 0.04, 0, 0] → TF*IDF values (these match log base 10)
475
+ - TF * IDF values can be low or high or zero
476
+
477
+ ''')
478
+
479
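
The worked example added in this commit can be checked with a short script. This is a minimal sketch and not part of the commit: the helper names `tf`, `idf`, and `tfidf_vector` are hypothetical, and log base 10 is an assumption inferred from the 0.04 entries in v1 (the page only writes `log`).

```python
import math

# Toy corpus from the example: three documents over the words w1..w5.
corpus = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
docs = list(corpus.values())

# Vocabulary: the sorted set of unique words in the corpus.
vocabulary = sorted({word for doc in docs for word in doc})


def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the total words in `doc`."""
    return doc.count(word) / len(doc)


def idf(word, all_docs):
    """log10(N / n): N documents in total, n of them containing `word`.

    Base 10 is an assumption; it reproduces the 0.04 values in the example.
    """
    n = sum(1 for doc in all_docs if word in doc)
    return math.log10(len(all_docs) / n)


def tfidf_vector(doc, all_docs, vocab):
    """One TF * IDF entry per vocabulary word."""
    return [tf(word, doc) * idf(word, all_docs) for word in vocab]


vectors = {name: tfidf_vector(doc, docs, vocabulary) for name, doc in corpus.items()}
for name, vector in vectors.items():
    print(name, [round(value, 2) for value in vector])
```

For d1 this prints `[0.0, 0.04, 0.04, 0.0, 0.0]`, matching v1 in the example. w1 scores 0 in every document because it occurs in all three, so its IDF is log(3/3) = 0. Library implementations may use a smoothed IDF and normalise the vectors, so their numbers can differ from this plain formula.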