Space: Harika22/Natural_Language_Processing


Harika22 committed on Feb 2, 2025

Commit 47fbe17 (verified)

Parent: 058b5ee

Update pages/6_Feature_Engineering.py


Files changed (1)

pages/6_Feature_Engineering.py +61 -1


@@ -237,4 +237,64 @@ if file_type == "One-Hot Vectorization":
     st.subheader(":blue[Lack of Sequential Information]")
     st.markdown('''
         - Sequential information is not preserved
-    ''')

     st.subheader(":blue[Lack of Sequential Information]")
     st.markdown('''
         - Sequential information is not preserved
+    ''')
+elif file_type == "Bag of Words(BOW)":
+    st.title(":red[Bag of Words(BOW)]")
+    st.markdown("""
+        ### 📌 What is Bag of Words(BOW)?
+        -  It is a vectorization technique that converts text into a numerical vector.
+        -  Documents differ in length, so raw text cannot be arranged directly as tabular data; BOW solves this by giving every document a vector of the same length.
+    """)
+    st.markdown("""
+        ### 🛠️ Steps in Bag of Words(BOW):
+         - Create a Vocabulary ➡️ (A set of all unique words in the collected corpus).
+         - Find the Length of Vocabulary ➡️ (Total number of unique words = d dimensions).
+             - Each document is converted into a d-dimensional vector
+             - Every dimension corresponds to one unique word
+         - Bag of Words records how many times each word occurs in a document
+         - If two documents are similar, that similarity shows up as the same words repeating in both documents
+         - After converting the documents into vectors, we can stack all the vectors to form tabular data
+             - where rows are documents and columns are features (the unique words)
+             - Every value in a dimension is a count
+             - that is, how many times the word occurs in the document
+        """)
+    st.markdown(
+    "<div class='corpus-box'>"
+    "<strong>Document 1:</strong> I love cricket I <br>"
+    "<strong>Document 2:</strong> I hate cricket <br>"
+    "<strong>Document 3:</strong> I like cricket"
+    "</div>",
+    unsafe_allow_html=True,
+    )
+    st.subheader(":green[Unique Words (Vocabulary)]")
+    st.markdown(
+    "<p class='content'>The set of unique words in our corpus is: <strong>{I, love, cricket, hate, like}</strong>. "
+    "This set forms the vocabulary, and the number of unique words determines the vector dimensions.</p>",
+    unsafe_allow_html=True,
+    )
+    st.subheader(":green[Word Count Representation]")
+    st.markdown(
+    "<p class='content'>Each document is converted into a numerical vector by counting the occurrences of words "
+    "from the vocabulary within each document.</p>",
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    "<div class='vector-box'><strong>Vector Representation:</strong><br>"
+    "Document 1 ➝ [2,1,1,0,0] (I = 2, love = 1, cricket = 1, hate = 0, like = 0)<br>"
+    "Document 2 ➝ [1,0,1,1,0] (I = 1, love = 0, cricket = 1, hate = 1, like = 0)<br>"
+    "Document 3 ➝ [1,0,1,0,1] (I = 1, love = 0, cricket = 1, hate = 0, like = 1)"
+    "</div>",
+    unsafe_allow_html=True,
+    )
+    st.subheader(":green[Tabular Representation]")
+    st.markdown(
+    "<p class='content'>Since all three vectors have the same number of dimensions, we can merge them into a tabular format:</p>",
+    unsafe_allow_html=True,
+    )
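
The counting steps described in the added page text can be checked with a minimal sketch in plain Python. This is not part of the commit: the helper names `build_vocabulary` and `vectorize` are hypothetical, and the sketch assumes case-sensitive whitespace tokenization with the vocabulary ordered by first appearance, which happens to match the order used in the page's example.

```python
from collections import Counter

# Corpus copied from the three example documents in the diff above.
corpus = [
    "I love cricket I",
    "I hate cricket",
    "I like cricket",
]


def build_vocabulary(docs):
    """Return the unique words of the corpus in order of first appearance.

    Hypothetical helper: assumes whitespace tokenization, case-sensitive.
    """
    vocab = []
    seen = set()
    for doc in docs:
        for word in doc.split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def vectorize(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocab]


vocabulary = build_vocabulary(corpus)                    # d = len(vocabulary)
table = [vectorize(doc, vocabulary) for doc in corpus]   # one row per document

# Print the document-term table: rows are documents, columns are words.
print("doc", *vocabulary, sep="\t")
for i, row in enumerate(table, start=1):
    print(f"D{i}", *row, sep="\t")
```

Under these assumptions the rows of `table` reproduce the three vectors listed in the page text, and every row has the same length `d = 5`, which is what allows the vectors to be stacked into a table.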