Spaces: Harika22/Natural_Language_Processing

Harika22 committed on Feb 2, 2025

Commit

1a1805f

verified ·

1 Parent(s): 79fc281

Update pages/6_Feature_Engineering.py

Files changed (1)

pages/6_Feature_Engineering.py CHANGED

@@ -353,6 +353,25 @@ elif file_type == "Bag of Words(BOW)":
             cv = CountVectorizer(lowercase=True,strip_accents="unicode",analyzer="word",stop_words=stp,token_pattern=r"(?u)\b\w\w+\b")
             cv.fit(corpus["Review"])  ### learning vocabulary
             vector = cv.transform(corpus["Review"]) ### it converts into vector form based on cv and vocabulary learned
+            cv.get_feature_names_out()
+            cv.vocabulary_
             vector.toarray()
-    ''')
+    ''')
+    st.header("Binary Bag of Words(BBOW)")
+    st.markdown('''
+    - Binary Bag of Words (BBOW) is an extension of Bag of Words (BOW)
+    ''')
+    st.markdown("""
+        ### 🛠️ Steps in Binary Bag of Words (BBOW):
+        - Create a vocabulary (set of unique words)
+        - Each document is converted into a d-dimensional vector, where d is the vocabulary size
+        - In Bag of Words each value is the word's count, but in Binary Bag of Words it only records whether the word is present (1) or absent (0)
+        - This makes it easier to find the distance between two vectors (here the distance is the number of words that appear in only one of the two documents)
+        - The more such words there are --> the larger the distance
+        - Computing this distance can be faster than with Bag of Words counts, because only 0/1 values are compared
+            - the distance is the total number of words that the two documents do not share
+    """)
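
Note on the BBOW steps added above: they can be sketched in plain Python as follows. This is a minimal illustration and not code from the commit. The toy corpus and the helper names (`tokenize`, `build_vocabulary`, `bow_vector`, `bbow_vector`, `hamming_distance`) are hypothetical, and the tokenizer only approximates the `token_pattern` used in the page (no stop-word removal or accent stripping). In scikit-learn, passing `binary=True` to `CountVectorizer` should give the same 0/1 encoding.

```python
import re


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters,
    # roughly what the token pattern r"(?u)\b\w\w+\b" selects.
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    # Step 1: the vocabulary is the sorted set of unique words in the corpus.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bow_vector(doc, vocab):
    # Bag of Words: each position holds how often the word occurs.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]


def bbow_vector(doc, vocab):
    # Binary Bag of Words: each position holds 1 if the word occurs, else 0.
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocab]


def hamming_distance(u, v):
    # Number of positions where the two binary vectors differ, i.e. the
    # number of words that appear in only one of the two documents.
    return sum(a != b for a, b in zip(u, v))


# Hypothetical toy corpus, used only for illustration.
docs = ["good movie good plot", "bad movie"]

vocab = build_vocabulary(docs)
bow = [bow_vector(d, vocab) for d in docs]
bbow = [bbow_vector(d, vocab) for d in docs]
distance = hamming_distance(bbow[0], bbow[1])

print(vocab)     # vocabulary shared by both encodings
print(bow)       # count vectors
print(bbow)      # binary vectors
print(distance)  # words not shared by the two documents
```

With this corpus the two documents share only "movie", so the words "bad", "good", and "plot" each contribute one to the distance. The repeated "good" in the first document changes its BOW vector but not its BBOW vector.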