Phani1008 committed on
Commit 3ab14ea · verified · 1 Parent(s): 63d1f4e

Update app.py

Files changed (1)
  1. app.py +98 -56
app.py CHANGED
@@ -1,5 +1,6 @@
 import streamlit as st
 
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
@@ -10,72 +11,44 @@ def show_home_page():
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
-        Use the buttons below to explore each topic in detail.
         """
     )
 
-    if st.button("NLP Terminologies"):
-        st.session_state["page"] = "terminologies"
-    if st.button("One-Hot Vectorization"):
-        st.session_state["page"] = "one_hot"
-    if st.button("Bag of Words"):
-        st.session_state["page"] = "bow"
-    if st.button("TF-IDF Vectorizer"):
-        st.session_state["page"] = "tfidf"
-    if st.button("Word2Vec"):
-        st.session_state["page"] = "word2vec"
-    if st.button("FastText"):
-        st.session_state["page"] = "fasttext"
-    if st.button("Tokenization"):
-        st.session_state["page"] = "tokenization"
-    if st.button("Stop Words"):
-        st.session_state["page"] = "stop_words"
-
 def show_page(page):
-    if page == "terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
-            - **Tokenization**: Tokenization is the process of breaking text into smaller units like words or sentences.
-              For example, the sentence "I love NLP" can be tokenized into words: ["I", "love", "NLP"].
-
-            - **Stop Words**: These are common words in a language (e.g., "the", "is", "and") that are often removed
-              during preprocessing because they carry little unique information.
-
-            - **Stemming**: Stemming reduces words to their root form by removing suffixes. For example, "running" -> "run".
-              It may produce non-lexical words (e.g., "better" -> "bett").
-
-            - **Lemmatization**: Unlike stemming, lemmatization converts a word to its dictionary base form (e.g., "running" -> "run").
-
             - **Corpus**: A large collection of text used for NLP training and analysis.
-
-            - **Vocabulary**: The set of all unique words present in the corpus.
-
-            - **n-grams**: Continuous sequences of n items (words or characters) from a text. For example, bigrams from "NLP is fun" are ["NLP is", "is fun"].
-
-            - **POS Tagging**: Assigning parts of speech to words, like noun, verb, etc.
-
-            - **Named Entity Recognition (NER)**: Identifying entities like names, locations, and organizations in text.
-
-            - **Parsing**: Analyzing grammatical structure and relationships between words.
             """
         )
-    elif page == "one_hot":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
             ### One-Hot Vectorization
 
-            One-hot vectorization is a simple representation where each word in the vocabulary is represented as a binary vector.
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
-            For a vocabulary ["cat", "dog", "bird"]:
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
@@ -91,7 +64,7 @@ def show_page(page):
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
-    elif page == "bow":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
@@ -124,7 +97,7 @@ def show_page(page):
             - Text classification and clustering.
             """
         )
-    elif page == "tfidf":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
@@ -138,11 +111,22 @@ def show_page(page):
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
-    elif page == "word2vec":
         st.title("Word2Vec")
         st.markdown(
             """
@@ -154,11 +138,18 @@ def show_page(page):
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
             """
         )
-    elif page == "fasttext":
         st.title("FastText")
         st.markdown(
             """
@@ -166,36 +157,87 @@ def show_page(page):
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
             """
         )
-    elif page == "tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
             """
         )
-    elif page == "stop_words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
             """
         )
 
-# Initialize session state for page navigation
-if "page" not in st.session_state:
-    st.session_state["page"] = "home"
-
-# Show appropriate page
-if st.session_state["page"] == "home":
     show_home_page()
 else:
-    show_page(st.session_state["page"])
 
 import streamlit as st
 
+# Function to display the Home Page
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
 
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
+        Use the menu in the sidebar to explore each topic in detail.
         """
     )
 
+# Function to display specific topic pages
 def show_page(page):
+    if page == "NLP Terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
+            - **Tokenization**: Breaking text into smaller units like words or sentences.
+            - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
+            - **Stemming**: Reducing words to their root forms (e.g., "running" -> "run").
+            - **Lemmatization**: Converting words to their dictionary base forms (e.g., "running" -> "run").
             - **Corpus**: A large collection of text used for NLP training and analysis.
+            - **Vocabulary**: The set of all unique words in a corpus.
+            - **n-grams**: Continuous sequences of n words/characters from text.
+            - **POS Tagging**: Assigning parts of speech to words.
+            - **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
+            - **Parsing**: Analyzing grammatical structure of text.
             """
         )
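The n-grams bullet above can be made concrete with a few lines of plain Python (an illustrative sketch, not part of app.py; the `ngrams` helper name is hypothetical):

```python
def ngrams(text, n):
    """Return word-level n-grams: contiguous runs of n words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Bigrams, matching the page's "NLP is fun" example.
print(ngrams("NLP is fun", 2))  # ['NLP is', 'is fun']
```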
+    elif page == "One-Hot Vectorization":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
             ### One-Hot Vectorization
 
+            A simple representation where each word in the vocabulary is represented as a binary vector.
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
+            Vocabulary: ["cat", "dog", "bird"]
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
 
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
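The one-hot scheme described in this section can be sketched in plain Python using the same toy vocabulary (illustrative only, not part of app.py; `one_hot` is a hypothetical helper):

```python
vocabulary = ["cat", "dog", "bird"]  # same toy vocabulary as the page example

def one_hot(word, vocab):
    """All zeros except a 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("dog", vocabulary))  # [0, 1, 0]
```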
+    elif page == "Bag of Words":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
 
             - Text classification and clustering.
             """
         )
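The BoW explanation itself is collapsed in this diff view, but the standard construction — count each vocabulary word per document — can be sketched as follows (illustrative, not part of app.py; `bow_vector` is a hypothetical helper):

```python
from collections import Counter

docs = ["NLP is fun", "NLP is amazing"]
# Vocabulary: all unique lowercased words, in sorted order.
vocab = sorted({word for doc in docs for word in doc.lower().split()})

def bow_vector(doc, vocab):
    """Count of each vocabulary word in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

print(vocab)                       # ['amazing', 'fun', 'is', 'nlp']
print(bow_vector(docs[0], vocab))  # [0, 1, 1, 1]
```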
+    elif page == "TF-IDF Vectorizer":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
 
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
+            #### Advantages:
+            - Reduces the weight of common words.
+            - Highlights unique and important words.
+
+            #### Example:
+            For the corpus:
+            - Doc1: "NLP is amazing."
+            - Doc2: "NLP is fun and amazing."
+
+            TF-IDF gives a word like "fun", which appears in only one document, more weight than words like "is" and "amazing" that appear in both.
+
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
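The TF and IDF definitions above can be implemented directly on the two-document corpus from the example (a sketch under the page's definitions, not part of app.py; `tf`, `idf`, and `tfidf` are hypothetical helpers):

```python
import math

# Tokenized versions of the two documents from the page example.
docs = [["nlp", "is", "amazing"],
        ["nlp", "is", "fun", "and", "amazing"]]

def tf(term, doc):
    """Term count divided by total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of total documents over documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "is" occurs in every document, so its IDF (and TF-IDF) is zero;
# "fun" occurs only in Doc2, so it gets a positive weight.
print(tfidf("is", docs[1], docs))   # 0.0
print(tfidf("fun", docs[1], docs))  # log(2)/5 ≈ 0.139
```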
+    elif page == "Word2Vec":
         st.title("Word2Vec")
         st.markdown(
             """
 
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
+            #### Advantages:
+            - Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
+            - Efficient for large datasets.
+
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
+
+            #### Limitations:
+            - Requires significant computational resources.
             """
         )
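Skip-gram's "predict the context from the target word" setup can be made concrete by generating the (target, context) training pairs the model learns from; this sketch covers only pair generation, not the embedding training itself (not part of app.py; `skipgram_pairs` is a hypothetical helper, window size 1 assumed):

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs for each token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["nlp", "is", "fun"]))
# [('nlp', 'is'), ('is', 'nlp'), ('is', 'fun'), ('fun', 'is')]
```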
+    elif page == "FastText":
         st.title("FastText")
         st.markdown(
             """
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
+            #### Advantages:
+            - Handles rare and out-of-vocabulary words.
+            - Captures subword information (e.g., prefixes and suffixes).
+
+            #### Example:
+            The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing".
+
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
+
+            #### Limitations:
+            - Higher computational cost compared to Word2Vec.
             """
         )
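The "playing" example above can be reproduced by extracting character trigrams; FastText also pads words with `<` and `>` boundary markers, which this sketch includes (illustrative, not part of app.py; `char_ngrams` is a hypothetical helper):

```python
def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' boundary markers, as FastText uses."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```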
+    elif page == "Tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
+
+            #### Types of Tokenization:
+            - **Word Tokenization**: Splits text into words.
+            - **Sentence Tokenization**: Splits text into sentences.
+
+            #### Libraries for Tokenization:
+            - NLTK, spaCy, and Hugging Face Transformers.
+
+            #### Example:
+            Sentence: "NLP is exciting."
+            - Word Tokens: ["NLP", "is", "exciting", "."]
+
+            #### Applications:
+            - Preprocessing for machine learning models.
+
+            #### Challenges:
+            - Handling complex text like abbreviations and multilingual data.
             """
         )
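The word-tokenization example above ("NLP is exciting." → separate word and punctuation tokens) can be sketched with a small regex, without any NLP library (illustrative, not part of app.py; `word_tokenize` here is a hypothetical helper, not NLTK's function of the same name):

```python
import re

def word_tokenize(text):
    """Split into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # ['NLP', 'is', 'exciting', '.']
```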
+    elif page == "Stop Words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
+
+            #### Examples of Stop Words:
+            - English: "is", "the", "and", "in".
+            - Spanish: "es", "el", "y", "en".
+
+            #### Why Remove Stop Words?
+            - To reduce noise in text data.
+
+            #### Applications:
+            - Sentiment analysis, text classification, and search engines.
+
+            #### Challenges:
+            - Some stop words might carry context-specific importance.
             """
         )
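Stop-word removal as described above amounts to filtering a token list against a stop-word set; this sketch uses the English examples from the section (illustrative, not part of app.py; `remove_stop_words` is a hypothetical helper):

```python
STOP_WORDS = {"is", "the", "and", "in"}  # the English examples from the page

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "fun", "and", "exciting"]))
# ['NLP', 'fun', 'exciting']
```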
 
+# Sidebar navigation
+st.sidebar.title("NLP Topics")
+menu_options = [
+    "Home",
+    "NLP Terminologies",
+    "One-Hot Vectorization",
+    "Bag of Words",
+    "TF-IDF Vectorizer",
+    "Word2Vec",
+    "FastText",
+    "Tokenization",
+    "Stop Words",
+]
+selected_page = st.sidebar.radio("Select a topic", menu_options)
+
+# Display the selected page
+if selected_page == "Home":
     show_home_page()
 else:
+    show_page(selected_page)