Harika22 committed
Commit ea2f020 · verified · 1 Parent(s): 747b9ef

Update pages/6_Feature_Engineering.py

Files changed (1):
  1. pages/6_Feature_Engineering.py +64 -0
pages/6_Feature_Engineering.py CHANGED
@@ -703,3 +703,67 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
  unsafe_allow_html=True,
  )

+ st.subheader(":red[Advantages]")
+ st.markdown('''
+ - Easy to implement
+ - Can convert text into a tabular format
+ - It gives importance to both frequently occurring and rarely occurring words in the corpus
+ ''')
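For illustration (not part of this commit), a minimal sketch of that last point using scikit-learn's TfidfVectorizer on a toy corpus: a word that appears in every document gets a low IDF weight, while a rare word gets a high one.

```python
# Sketch: inspect the learned IDF weights of a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["biryani is good", "biryani is not good", "biryani is too costly"]
tf = TfidfVectorizer().fit(docs)

# "biryani"/"is" appear in every document -> lowest IDF;
# "costly", "not", "too" appear in one document each -> highest IDF.
for word, idx in sorted(tf.vocabulary_.items()):
    print(word, tf.idf_[idx])
```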
+ st.subheader(":red[Disadvantages]")
+
+ st.subheader(":blue[Curse of Dimensionality]")
+ st.markdown('''
+ - As documents increase ↑, the vocabulary increases ↑, so the vector dimensionality also increases ↑
+ - ML performance decreases ↓, since dimensionality depends entirely on the vocabulary, which shoots up as more documents (and more distinct words) are added
+ - As the corpus increases, the vocabulary increases, and so the dimensionality increases
+ ''')
+
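A quick sketch (hypothetical toy corpora, not from the app) of the point above: adding documents grows the vocabulary, and the vector dimensionality grows with it.

```python
# Sketch: vocabulary size (= vector dimensionality) grows with the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

small = ["biryani is good"]
large = ["biryani is good", "biryani is not good", "pizza is too costly"]

for corpus in (small, large):
    X = TfidfVectorizer().fit_transform(corpus)
    print(len(corpus), "docs ->", X.shape[1], "dimensions")
```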
+ st.subheader(":blue[Sparsity]")
+ st.markdown('''
+ - The vector created using BOW/TF-IDF is a sparse vector
+ - When the entire dataset is given to an algorithm, the machine learns from that data, and the algorithm becomes biased towards zero values because the data is sparse
+ - In ML this issue shows up as overfitting
+ - It is solved in Deep Learning
+ ''')
+
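A minimal sketch (toy corpus) of how sparse the resulting matrix is: most entries are zero.

```python
# Sketch: measure the fraction of zero entries in a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["biryani is good", "biryani is not good", "biryani is too costly"]
X = TfidfVectorizer().fit_transform(corpus)

total = X.shape[0] * X.shape[1]
print("non-zero entries:", X.nnz, "of", total)
print("sparsity:", 1 - X.nnz / total)
```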
+ st.subheader(":blue[Out of Vocabulary Issue]")
+ st.markdown('''
+ - Documents are converted only at training time, using our own dataset to build the vocabulary
+ - If a word was not present in our dataset during training, it cannot be converted into vector format, which results in a key error
+ - This is solved by FastText
+ ''')
+
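A minimal sketch of the OOV behavior. Note that scikit-learn's TfidfVectorizer silently drops unseen words at transform time; the key error described above is what dictionary-style lookups (e.g., word-level embedding tables) raise.

```python
# Sketch: a word unseen at fit time has no column, so transform drops it
# (an embedding-table lookup for the same word would raise a KeyError).
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer().fit(["biryani is good"])

print(tf.vocabulary_)                    # {'biryani': 0, 'good': 1, 'is': 2}
print(tf.transform(["dosa is good"]).toarray())  # "dosa" contributes nothing
```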
+ st.subheader(":blue[Inability to Preserve Semantic Meaning]")
+ st.markdown('''
+ - It only slightly preserves semantic meaning
+ ''')
+
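To see this (a hypothetical example, not from the app): two sentences with nearly the same meaning can end up with a low cosine similarity, because synonyms occupy unrelated columns.

```python
# Sketch: "good" and "tasty" get separate columns, so similarity drops.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

X = TfidfVectorizer().fit_transform(["biryani is good", "biryani is tasty"])
print(cosine_similarity(X[0], X[1]))  # noticeably below 1.0 despite similar meaning
```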
+ st.subheader(":blue[Lack of Sequential Information]")
+ st.markdown('''
+ - Sequential information is not preserved
+ - Because in TF-IDF we give importance to individual words, since we are doing word tokenization
+ - In ML no algorithm is capable of preserving sequential information
+ - This is only solved by Deep Learning concepts
+ - But by applying a trick to BOW/BBOW/TF-IDF we can slightly preserve sequential information
+ - That technique is known as n-gram
+ ''')
+
+ st.header(":red[n-gram]")
+ st.markdown('''
+ - The n-gram default will always be 1-gram in BOW/BBOW/TF-IDF
+ - The vocabulary is created based on the chosen n-gram only
+ - n-grams are mostly used only up to 1-, 2-, or 3-grams, because as the dimension increases, ML performance decreases
+ - n-gram is used to slightly preserve sequential information
+ ''')
+
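A minimal sketch (toy corpus) of the n-gram trick via TfidfVectorizer's ngram_range parameter: with bigrams, a pair like "not good" becomes its own vocabulary entry, partially capturing word order.

```python
# Sketch: ngram_range=(1, 2) adds bigrams alongside unigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["biryani is good", "biryani is not good"]
tf = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)

# Vocabulary now contains unigrams and bigrams such as "not good".
print(sorted(tf.vocabulary_))
```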
+ st.code('''
+ import pandas as pd
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ corpus = pd.DataFrame({"Review": ["biryani is is is is résume is good", "biryani biryani biryani is not good", "biryani is too costly"]})
+ tf = TfidfVectorizer()
+
+ vector = tf.fit_transform(corpus["Review"])
+ vector.toarray()
+ tf.vocabulary_
+ ''')
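If the snippet above is actually executed (a sketch assuming scikit-learn >= 1.0 for get_feature_names_out), the columns can be labeled with their terms to make the matrix readable:

```python
# Sketch: label each TF-IDF column with its term.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = pd.DataFrame({"Review": ["biryani is is is is résume is good",
                                  "biryani biryani biryani is not good",
                                  "biryani is too costly"]})
tf = TfidfVectorizer()
vector = tf.fit_transform(corpus["Review"])

df = pd.DataFrame(vector.toarray(), columns=tf.get_feature_names_out())
print(df.round(2))
```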