Update app.py

app.py CHANGED

@@ -15,21 +15,21 @@ def show_home_page():
     )
 
     if st.button("NLP Terminologies"):
-        st.
+        st.session_state["page"] = "terminologies"
     if st.button("One-Hot Vectorization"):
-        st.
+        st.session_state["page"] = "one_hot"
     if st.button("Bag of Words"):
-        st.
+        st.session_state["page"] = "bow"
     if st.button("TF-IDF Vectorizer"):
-        st.
+        st.session_state["page"] = "tfidf"
     if st.button("Word2Vec"):
-        st.
+        st.session_state["page"] = "word2vec"
     if st.button("FastText"):
-        st.
+        st.session_state["page"] = "fasttext"
     if st.button("Tokenization"):
-        st.
+        st.session_state["page"] = "tokenization"
     if st.button("Stop Words"):
-        st.
+        st.session_state["page"] = "stop_words"
 
 def show_page(page):
     if page == "terminologies":
@@ -60,7 +60,6 @@ def show_page(page):
         - **Named Entity Recognition (NER)**: Identifying entities like names, locations, and organizations in text.
 
         - **Parsing**: Analyzing grammatical structure and relationships between words.
-
         """
     )
     elif page == "one_hot":
@@ -139,17 +138,6 @@ def show_page(page):
         - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
         - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
-        #### Advantages:
-        - Reduces the weight of common words.
-        - Highlights unique and important words.
-
-        #### Example:
-        For the corpus:
-        - Doc1: "NLP is amazing."
-        - Doc2: "NLP is fun and amazing."
-
-        TF-IDF highlights words like "fun" and "amazing" over commonly occurring words like "is".
-
         #### Applications:
         - Search engines, information retrieval, and document classification.
         """
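The TF and IDF definitions kept by this hunk can be sanity-checked with a short stdlib-only sketch. The two-document toy corpus and the naive whitespace tokenization are assumptions for illustration, not part of the app:

```python
import math

# Toy corpus, lowercased and tokenized naively on whitespace.
docs = [
    "nlp is amazing".split(),
    "nlp is fun and amazing".split(),
]

def tf(term, doc):
    # Term frequency: count of the term divided by total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term).
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "fun" appears in only one document, so it outweighs the ubiquitous "is".
print(tfidf("fun", docs[1], docs))  # ≈ 0.139 (log(2) / 5)
print(tfidf("is", docs[1], docs))   # 0.0 — "is" appears in every document
```

Real pipelines would use a library vectorizer (e.g. scikit-learn's `TfidfVectorizer`, which also applies smoothing and normalization), but the weighting idea is the same.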
@@ -166,19 +154,8 @@ def show_page(page):
         - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
         - **Skip-gram**: Predicts the context from the target word.
 
-        #### Advantages:
-        - Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
-        - Efficient for large datasets.
-
-        #### Training Process:
-        - Uses shallow neural networks.
-        - Optimized using techniques like negative sampling.
-
         #### Applications:
         - Text classification, sentiment analysis, and recommendation systems.
-
-        #### Limitations:
-        - Requires significant computational resources.
         """
     )
     elif page == "fasttext":
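The CBOW/skip-gram distinction kept above comes down to how training pairs are built from a context window. A minimal stdlib sketch (the toy sentence and window size 2 are assumptions for illustration):

```python
tokens = "nlp is fun and amazing".split()
window = 2

skipgram_pairs = []  # (center word, one context word) per pair
cbow_pairs = []      # (all context words, center word) per pair
for i, center in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != i]
    cbow_pairs.append((context, center))            # context -> target
    skipgram_pairs.extend((center, c) for c in context)  # target -> context

print(cbow_pairs[2])  # (['nlp', 'is', 'and', 'amazing'], 'fun')
```

A trainable model (e.g. gensim's `Word2Vec`) feeds such pairs into a shallow neural network; this sketch only shows the pair construction that differs between the two architectures.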
@@ -189,19 +166,9 @@ def show_page(page):
 
         FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
-        #### Advantages:
-        - Handles rare and out-of-vocabulary words.
-        - Captures subword information (e.g., prefixes and suffixes).
-
-        #### Example:
-        The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing".
-
         #### Applications:
         - Multilingual text processing.
         - Handling noisy and incomplete data.
-
-        #### Limitations:
-        - Higher computational cost compared to Word2Vec.
         """
     )
     elif page == "tokenization":
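The character n-gram decomposition behind FastText, still stated in the surviving description, fits in a few lines. FastText itself pads words with `<` and `>` boundary markers and extracts a range of n-gram lengths (3–6 by default); this sketch simplifies to a single n:

```python
def char_ngrams(word, n=3):
    # FastText-style subword units: pad the word with boundary markers,
    # then slide a window of length n across it.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

A word's vector is then the sum of its n-gram vectors, which is why unseen words still get a usable representation.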
@@ -211,23 +178,6 @@ def show_page(page):
         ### Tokenization
 
         Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
-
-        #### Types of Tokenization:
-        - **Word Tokenization**: Splits text into words.
-        - **Sentence Tokenization**: Splits text into sentences.
-
-        #### Libraries for Tokenization:
-        - NLTK, SpaCy, and Hugging Face Transformers.
-
-        #### Example:
-        Sentence: "NLP is exciting."
-        - Word Tokens: ["NLP", "is", "exciting", "."]
-
-        #### Applications:
-        - Preprocessing for machine learning models.
-
-        #### Challenges:
-        - Handling complex text like abbreviations and multilingual data.
         """
     )
     elif page == "stop_words":
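The word-tokenization definition kept in this hunk can be illustrated with a regex tokenizer that splits off punctuation. The pattern is a deliberate simplification; production code would typically use NLTK, SpaCy, or a Hugging Face tokenizer, which handle abbreviations and multilingual text:

```python
import re

def word_tokenize(text):
    # Word tokens plus standalone punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # ['NLP', 'is', 'exciting', '.']
```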
@@ -237,26 +187,15 @@ def show_page(page):
         ### Stop Words
 
         Stop words are commonly used words in a language that are often removed during text preprocessing.
-
-        #### Examples of Stop Words:
-        - English: "is", "the", "and", "in".
-        - Spanish: "es", "el", "y", "en".
-
-        #### Why Remove Stop Words?
-        - To reduce noise in text data.
-
-        #### Applications:
-        - Sentiment analysis, text classification, and search engines.
-
-        #### Challenges:
-        - Some stop words might carry context-specific importance.
         """
     )
 
-
-
+# Initialize session state for page navigation
+if "page" not in st.session_state:
+    st.session_state["page"] = "home"
 
-
+# Show appropriate page
+if st.session_state["page"] == "home":
     show_home_page()
 else:
-    show_page(page)
+    show_page(st.session_state["page"])
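Stop-word removal, as defined in the surviving text, reduces to filtering tokens against a list. The four English words below are just the examples from this page, not a complete list; NLTK and SpaCy ship full stop-word lists per language:

```python
# Hand-picked illustrative stop-word list (assumption for this sketch).
STOP_WORDS = {"is", "the", "and", "in"}

def remove_stop_words(tokens):
    # Case-insensitive filter; keeps everything not in the list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "fun", "and", "amazing"]))
# ['NLP', 'fun', 'amazing']
```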