Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
@@ -703,3 +703,67 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
         unsafe_allow_html=True,
     )
 
+    st.subheader(":red[Advantages]")
+    st.markdown('''
+    - Easy to implement
+    - Can be converted into a tabular format (see the sketch below)
+    - It gives importance to both frequently occurring and rarely occurring words in the corpus
+    ''')
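+    # A minimal sketch (assumption, not from the original page): the TF-IDF output
+    # can be laid out as a table with one row per document and one column per word.
+    st.code('''
+    import pandas as pd
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer()
+    X = tf.fit_transform(["biryani is good", "biryani is too costly"])
+    print(pd.DataFrame(X.toarray(), columns=tf.get_feature_names_out()))
+    ''')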
+
+    st.subheader(":red[Disadvantages]")
+
+    st.subheader(":blue[Curse of Dimensionality]")
+    st.markdown('''
+    - Documents increase ↑, so the vocabulary increases ↑, so the dimensionality of each vector increases ↑
+    - ML performance decreases ↓, because the dimensionality depends entirely on the vocabulary, which shoots up as more and more distinct documents are added
+    - In short: as the corpus grows, the vocabulary grows, and the dimensionality grows with it (see the sketch below)
+    ''')
+
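+    # A minimal sketch (assumption, not from the original page): adding documents
+    # grows the vocabulary, and with it the dimensionality of every vector.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    small = ["biryani is good"]
+    large = small + ["biryani is not good", "biryani is too costly"]
+    print(TfidfVectorizer().fit_transform(small).shape)  # (1, 3)
+    print(TfidfVectorizer().fit_transform(large).shape)  # (3, 6) - vocabulary doubled
+    ''')
+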
+    st.subheader(":blue[Sparsity]")
+    st.markdown('''
+    - The vectors created using BOW/TF-IDF are sparse (mostly zeros)
+    - When the entire dataset is given to an algorithm, the machine learns from that data, and because the data is sparse the model becomes biased towards the zero values
+    - In ML this issue is known as overfitting
+    - It is solved in deep learning
+    ''')
+
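+    # A minimal sketch (assumption, not from the original page): count the fraction
+    # of zero entries to see how sparse the TF-IDF matrix is.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    corpus = ["biryani is good", "biryani is not good", "biryani is too costly"]
+    X = TfidfVectorizer().fit_transform(corpus).toarray()
+    print((X == 0).mean())  # fraction of zero entries in the matrix
+    ''')
+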
+    st.subheader(":blue[Out of Vocabulary Issue]")
+    st.markdown('''
+    - Documents are only converted during training time, on the dataset we supply
+    - If a word was not present in the dataset during training, it cannot be converted into vector format (a manual vocabulary lookup would raise a key error)
+    - This is solved by FastText
+    ''')
+
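+    # A minimal sketch (assumption, not from the original page): scikit-learn
+    # silently drops unseen words at transform time, so a review made entirely of
+    # out-of-vocabulary words becomes an all-zero vector.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer().fit(["biryani is good", "biryani is too costly"])
+    print(tf.transform(["pizza was tasty"]).toarray())  # all zeros: every word is OOV
+    ''')
+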
+    st.subheader(":blue[Inability to Preserve Semantic Meaning]")
+    st.markdown('''
+    - TF-IDF only slightly preserves semantic meaning (see the sketch below)
+    ''')
+
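+    # A minimal sketch (assumption, not from the original page): synonyms such as
+    # "good" and "tasty" get separate, unrelated columns, so documents with similar
+    # meanings do not end up with similar vectors.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer()
+    tf.fit(["biryani is good", "biryani is tasty"])
+    print(tf.vocabulary_)  # "good" and "tasty" are independent dimensions
+    ''')
+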
+    st.subheader(":blue[Lack of Sequential Information]")
+    st.markdown('''
+    - Sequential information is not preserved (see the sketch below)
+    - This is because in TF-IDF we give importance to individual words, since we are doing word tokenization
+    - No ML algorithm is capable of preserving sequential information
+    - This is only truly solved by deep learning
+    - But by applying a trick to BOW/BBOW/TF-IDF we can slightly preserve sequential information
+    - That technique is known as n-gram
+    ''')
+
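+    # A minimal sketch (assumption, not from the original page): two reviews with
+    # the same words in a different order get identical TF-IDF vectors.
+    st.code('''
+    import numpy as np
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    X = TfidfVectorizer().fit_transform(["biryani not good", "good biryani not"]).toarray()
+    print(np.allclose(X[0], X[1]))  # True: word order is lost
+    ''')
+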
+    st.header(":red[n-gram]")
+    st.markdown('''
+    - The default in BOW/BBOW/TF-IDF is always 1-gram (single words)
+    - The vocabulary is created based only on the chosen n-gram range
+    - n-grams are mostly used only up to 1-, 2- or 3-grams, because as the dimension increases, ML performance decreases
+    - n-gram is used to slightly preserve sequential information, as sketched after the code below
+    ''')
+
+    st.code('''
+    import pandas as pd
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    corpus = pd.DataFrame({"Review": ["biryani is is is is résume is good",
+                                      "biryani biryani biryani is not good",
+                                      "biryani is too costly"]})
+    tf = TfidfVectorizer()
+
+    # learn the vocabulary and convert each review into a TF-IDF vector
+    vector = tf.fit_transform(corpus["Review"])
+    print(vector.toarray())   # dense document-term matrix
+    print(tf.vocabulary_)     # word -> column index mapping
+    ''')
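+
+    # A minimal sketch (assumption, not from the original page): ngram_range=(1, 2)
+    # adds bigrams such as "not good" to the vocabulary, which is how n-grams
+    # slightly preserve word order.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
+    tf.fit(["biryani is not good", "biryani is good"])
+    print(tf.vocabulary_)  # includes "biryani is", "is not", "not good", ...
+    ''')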
|