UmaKumpatla committed on
Commit
84b9be8
·
verified ·
1 Parent(s): 3903d39

Update pages/4.Feature Engineering.py

Files changed (1)
  1. pages/4.Feature Engineering.py +216 -0
pages/4.Feature Engineering.py CHANGED
@@ -0,0 +1,216 @@
import streamlit as st

# Function to display the Home Page
def show_home_page():
    st.title("🔦 :red[Natural Language Processing (NLP)]")
    st.markdown(
        """
### :green[Welcome to the NLP Guide]
Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between
computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.
This guide explores key NLP concepts and techniques, from basic terminologies to advanced vectorization methods. Use the sidebar to explore each topic in detail.

#### :green[Applications of NLP:]
- Chatbots and virtual assistants (e.g., Alexa, Siri)
- Sentiment analysis
- Language translation tools (e.g., Google Translate)
- Text summarization and more!
"""
    )
    st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")

# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title("🔍 :blue[NLP Terminologies]")
        st.markdown(
            """
### :red[Key NLP Terms:]
- **Tokenization**: Splitting text into smaller units like words or sentences.
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
- **Stemming**: Reducing words to their root form (e.g., "running" → "run").
- **Lemmatization**: Converting words to their dictionary base form (e.g., "running" → "run").
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of unique words in a corpus.
- **n-grams**: Sequences of *n* words or characters in text.
- **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing the grammatical structure of a sentence.
"""
        )

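The *n*-grams entry in the glossary above is easy to make concrete. A minimal pure-Python sketch (the `ngrams` helper is illustrative, not part of the app):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams of a three-word sentence:
print(ngrams(["NLP", "is", "exciting"], 2))  # [('NLP', 'is'), ('is', 'exciting')]
```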
    elif page == "One-Hot Vectorization":
        st.title("🔧 :green[One-Hot Vectorization]")
        st.markdown(
            """
### :red[One-Hot Vectorization Explained]
One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.
#### :red[How It Works:]
- Each unique word in the vocabulary is assigned an index.
- The vector for a word is all zeros except for a `1` at the index of that word.
#### :red[Example:]
Vocabulary: ["cat", "dog", "bird"]
- "cat" → [1, 0, 0]
- "dog" → [0, 1, 0]
- "bird" → [0, 0, 1]
#### :red[Advantages:]
- Simple and intuitive to implement.
#### :red[Limitations:]
- High dimensionality for large vocabularies.
- Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).
#### :red[Applications:]
- Suitable for small datasets where simplicity is a priority.
"""
        )

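The one-hot example above maps directly to code. A minimal sketch (the `one_hot` helper is illustrative, not part of the app):

```python
def one_hot(word, vocabulary):
    """Binary vector: 1 at the word's index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]

vocabulary = ["cat", "dog", "bird"]
print(one_hot("dog", vocabulary))  # [0, 1, 0]
```

Note how the vector length equals the vocabulary size, which is exactly the dimensionality limitation mentioned above.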
    elif page == "Bag of Words":
        st.title("🔄 :green[Bag of Words (BoW)]")
        st.markdown(
            """
### :orange[Bag of Words (BoW) Method]
Bag of Words is a way of representing text by counting word occurrences while ignoring word order.
#### :orange[How It Works:]
1. Create a vocabulary of all unique words in the text.
2. Count the frequency of each word in a document.
#### :orange[Example:]
Given two sentences:
- Sentence 1: "I love NLP."
- Sentence 2: "I love programming."

Vocabulary: ["I", "love", "NLP", "programming"]
- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]
#### :orange[Advantages:]
- Simple to implement and interpret.
#### :orange[Limitations:]
- High dimensionality for large vocabularies.
- Ignores word order and semantic meaning.
- Sensitive to noisy or frequent terms.
#### :orange[Applications:]
- Text classification and clustering.
"""
        )

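The two BoW steps above (build a vocabulary, then count occurrences) can be sketched in plain Python; `bag_of_words` is an illustrative helper, not part of the app, and the tokenization is deliberately naive:

```python
def bag_of_words(sentences):
    """Build a vocabulary in order of first appearance, then per-sentence count vectors."""
    tokenized = [s.rstrip(".").split() for s in sentences]
    vocabulary = []
    for tokens in tokenized:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["I love NLP.", "I love programming."])
print(vocabulary)  # ['I', 'love', 'NLP', 'programming']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 0, 1]]
```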
    elif page == "TF-IDF Vectorizer":
        st.title("🔄 :blue[TF-IDF Vectorizer]")
        st.markdown(
            r"""
### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).
#### :rainbow[Formula:]
$\text{TF-IDF} = \text{TF} \times \text{IDF}$
- **TF (Term Frequency)**: Frequency of a word in a document divided by the total words in the document.
- **IDF (Inverse Document Frequency)**: Logarithm of total documents divided by the number of documents containing the word.
#### :rainbow[Example:]
For the corpus:
- Document 1: "NLP is amazing."
- Document 2: "NLP is fun and amazing."

A word like "fun", which appears in only one document, receives a higher weight than words like "is" that occur in every document.
#### :rainbow[Advantages:]
- Highlights unique and relevant terms.
- Reduces the impact of frequent, less informative words.
#### :rainbow[Applications:]
- Information retrieval, search engines, and document classification.
"""
        )

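The TF × IDF formula above can be computed directly. A minimal sketch with naive tokenization (`tf_idf` is an illustrative helper, not part of the app):

```python
import math

def tf_idf(documents):
    """Weight each word by (frequency in doc / doc length) * log(total docs / docs containing it)."""
    tokenized = [d.lower().rstrip(".").split() for d in documents]
    vocabulary = {w for tokens in tokenized for w in tokens}
    n = len(tokenized)
    weights = []
    for tokens in tokenized:
        doc_weights = {}
        for word in vocabulary:
            tf = tokens.count(word) / len(tokens)
            df = sum(1 for t in tokenized if word in t)
            doc_weights[word] = tf * math.log(n / df)
        weights.append(doc_weights)
    return weights

weights = tf_idf(["NLP is amazing.", "NLP is fun and amazing."])
# "fun" appears in only one of the two documents, so it outweighs "is":
print(weights[1]["fun"] > weights[1]["is"])  # True
```

With only two documents, words present in both get IDF = log(1) = 0, which is why "is" drops out entirely here.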
    elif page == "Word2Vec":
        st.title("🌐 :red[Word2Vec]")
        st.markdown(
            """
### :green[Word2Vec]
Word2Vec creates dense vector representations of words, capturing semantic relationships using neural networks.
#### :green[Key Models:]
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- **Skip-gram**: Predicts the context from a target word.
#### :green[Example:]
Word2Vec can capture relationships like:
- "king" - "man" + "woman" ≈ "queen"
#### :green[Advantages:]
- Captures semantic meaning and relationships.
- Efficient for large datasets.
#### :green[Applications:]
- Sentiment analysis, recommendation systems, and machine translation.
#### :green[Limitations:]
- Computationally intensive for training on large datasets.
"""
        )

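A full Word2Vec implementation is beyond this sketch, but the skip-gram idea above (predict context from a target word) starts from training pairs like these; `skipgram_pairs` is an illustrative helper, not part of the app:

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs used to train the skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "king", "rules"]))
# [('the', 'king'), ('king', 'the'), ('king', 'rules'), ('rules', 'king')]
```

CBOW uses the same pairs in the opposite direction: the context words jointly predict the target.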
    elif page == "FastText":
        st.title("🔄 :red[FastText]")
        st.markdown(
            """
### :blue[FastText]
FastText extends Word2Vec by representing words as character n-grams, enabling it to handle rare and out-of-vocabulary words.
#### :blue[Example:]
The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".
#### :blue[Advantages:]
- Handles rare words and misspellings.
- Captures subword information (e.g., prefixes and suffixes).
#### :blue[Applications:]
- Multilingual text processing.
- Working with noisy or incomplete data.
#### :blue[Limitations:]
- Higher computational cost than Word2Vec.
"""
        )

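The subword example above comes from character n-grams. FastText pads each word with boundary markers before slicing; a minimal sketch (`char_ngrams` is an illustrative helper, not part of the app):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

An unseen word still shares subwords with known words, which is how FastText builds vectors for out-of-vocabulary terms.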
    elif page == "Tokenization":
        st.title("🔒 :blue[Tokenization]")
        st.markdown(
            """
### :red[Tokenization]
Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.
#### :red[Types:]
- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.
#### :red[Example:]
Sentence: "NLP is exciting."
- Word Tokens: ["NLP", "is", "exciting", "."]
#### :red[Libraries:]
- NLTK
- SpaCy
- Hugging Face Transformers
#### :red[Challenges:]
- Handling complex text (e.g., abbreviations, contractions, multilingual data).
#### :red[Applications:]
- Preprocessing for machine learning models.
"""
        )

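Libraries like NLTK and SpaCy handle the hard cases; a minimal regex sketch (illustrative only) is enough to reproduce the word-token example above:

```python
import re

def word_tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # ['NLP', 'is', 'exciting', '.']
```

This naive version fails on exactly the challenges listed above (e.g., it splits "don't" into three tokens), which is why real tokenizers are more involved.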
    elif page == "Stop Words":
        st.title("🔍 :green[Stop Words]")
        st.markdown(
            """
### :rainbow[Stop Words]
Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").
#### :rainbow[Why Remove Stop Words?]
- To reduce noise and focus on meaningful terms in text.
#### :rainbow[Example Stop Words:]
- English: "is", "the", "and".
- Spanish: "es", "el", "y".
#### :rainbow[Challenges:]
- Some stop words might carry important context in specific use cases.
#### :rainbow[Applications:]
- Sentiment analysis, text classification, and search engines.
"""
        )

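Stop-word removal is just a filter over a word list. A minimal sketch with a tiny hand-picked list (real libraries ship curated lists of a hundred or more words):

```python
STOP_WORDS = {"is", "the", "and"}  # tiny illustrative list, not a full set

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "fun", "and", "powerful"]))
# ['NLP', 'fun', 'powerful']
```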
# Sidebar navigation
st.sidebar.title("🔍 NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)