Update app.py
app.py
CHANGED
@@ -1,5 +1,6 @@
 import streamlit as st
 
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
@@ -10,72 +11,44 @@ def show_home_page():
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
-        Use the
         """
     )
 
-    if st.button("NLP Terminologies"):
-        st.session_state["page"] = "terminologies"
-    if st.button("One-Hot Vectorization"):
-        st.session_state["page"] = "one_hot"
-    if st.button("Bag of Words"):
-        st.session_state["page"] = "bow"
-    if st.button("TF-IDF Vectorizer"):
-        st.session_state["page"] = "tfidf"
-    if st.button("Word2Vec"):
-        st.session_state["page"] = "word2vec"
-    if st.button("FastText"):
-        st.session_state["page"] = "fasttext"
-    if st.button("Tokenization"):
-        st.session_state["page"] = "tokenization"
-    if st.button("Stop Words"):
-        st.session_state["page"] = "stop_words"
-
 def show_page(page):
-    if page == "terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
-            - **Tokenization**:
-
-
-            - **
-            during preprocessing because they carry little unique information.
-
-            - **Stemming**: Stemming reduces words to their root form by removing suffixes. For example, "running" -> "run".
-            It may produce non-lexical words (e.g., "better" -> "bett").
-
-            - **Lemmatization**: Unlike stemming, lemmatization converts a word to its dictionary base form (e.g., "running" -> "run").
-
             - **Corpus**: A large collection of text used for NLP training and analysis.
-
-            - **
-
-            - **
-
-            - **POS Tagging**: Assigning parts of speech to words, like noun, verb, etc.
-
-            - **Named Entity Recognition (NER)**: Identifying entities like names, locations, and organizations in text.
-
-            - **Parsing**: Analyzing grammatical structure and relationships between words.
             """
         )
-    elif page == "one_hot":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
            ### One-Hot Vectorization
 
-
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
-
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
@@ -91,7 +64,7 @@ def show_page(page):
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
-    elif page == "bow":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
@@ -124,7 +97,7 @@ def show_page(page):
             - Text classification and clustering.
             """
         )
-    elif page == "tfidf":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
@@ -138,11 +111,22 @@ def show_page(page):
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
-    elif page == "word2vec":
         st.title("Word2Vec")
         st.markdown(
             """
@@ -154,11 +138,18 @@ def show_page(page):
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
             """
         )
-    elif page == "fasttext":
         st.title("FastText")
         st.markdown(
             """
@@ -166,36 +157,87 @@ def show_page(page):
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
             """
         )
-    elif page == "tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
             """
         )
-    elif page == "stop_words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
             """
         )
 
-#
-
-
-
-
-
     show_home_page()
 else:
-    show_page(st.session_state["page"])
@@ -1,5 +1,6 @@
 import streamlit as st
 
+# Function to display the Home Page
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
@@ -10,72 +11,44 @@ def show_home_page():
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
+        Use the menu in the sidebar to explore each topic in detail.
         """
     )
 
+# Function to display specific topic pages
 def show_page(page):
+    if page == "NLP Terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
+            - **Tokenization**: Breaking text into smaller units like words or sentences.
+            - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
+            - **Stemming**: Reducing words to their root forms (e.g., "running" -> "run").
+            - **Lemmatization**: Converting words to their dictionary base forms (e.g., "running" -> "run").
             - **Corpus**: A large collection of text used for NLP training and analysis.
+            - **Vocabulary**: The set of all unique words in a corpus.
+            - **n-grams**: Continuous sequences of n words/characters from text.
+            - **POS Tagging**: Assigning parts of speech to words.
+            - **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
+            - **Parsing**: Analyzing grammatical structure of text.
             """
         )
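As a side note on the terminology list above, word n-grams and suffix-stripping stemming can be sketched in a few lines of plain Python. This is a toy illustration, not how NLTK's or SpaCy's implementations work; `naive_stem` is a deliberately crude rule that also shows why stemmers can emit non-lexical roots.

```python
def ngrams(tokens, n):
    # Continuous sequences of n items from a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_stem(word):
    # Toy stemmer: strip a few common suffixes. Real stemmers
    # (e.g., Porter) apply ordered rule sets with conditions.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(ngrams(["NLP", "is", "fun"], 2))  # bigrams of the sentence
print(naive_stem("running"))            # "runn" - a non-lexical root
```

Note that `naive_stem("running")` yields `"runn"`, the same kind of artifact as the "better" -> "bett" example in the removed text.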
+    elif page == "One-Hot Vectorization":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
             ### One-Hot Vectorization
 
+            A simple representation where each word in the vocabulary is represented as a binary vector.
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
+            Vocabulary: ["cat", "dog", "bird"]
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
@@ -91,7 +64,7 @@ def show_page(page):
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
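The one-hot scheme described on this page can be reproduced with a short pure-Python sketch, assuming the three-word vocabulary from the example:

```python
vocabulary = ["cat", "dog", "bird"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except a 1 at the word's index
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0]
```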
+    elif page == "Bag of Words":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
@@ -124,7 +97,7 @@ def show_page(page):
             - Text classification and clustering.
             """
         )
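The BoW idea (a vector of per-word counts over a fixed vocabulary) can be sketched with `collections.Counter`; the vocabulary here is illustrative:

```python
from collections import Counter

def bow_vector(doc, vocabulary):
    # Count how often each vocabulary word occurs in the document
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["nlp", "is", "fun", "amazing"]
print(bow_vector("NLP is fun fun", vocab))  # [1, 1, 2, 0]
```

Unlike one-hot encoding, the vector entries are counts, so word frequency is preserved while word order is discarded.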
+    elif page == "TF-IDF Vectorizer":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
@@ -138,11 +111,22 @@ def show_page(page):
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
+            #### Advantages:
+            - Reduces the weight of common words.
+            - Highlights unique and important words.
+
+            #### Example:
+            For the corpus:
+            - Doc1: "NLP is amazing."
+            - Doc2: "NLP is fun and amazing."
+
+            TF-IDF highlights words like "fun" and "amazing" over commonly occurring words like "is".
+
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
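The TF and IDF formulas stated on this page can be worked through directly on the two-document example. This sketch uses the plain log(N/df) form from the text (library implementations such as scikit-learn's `TfidfVectorizer` smooth the IDF, so their numbers differ):

```python
import math

docs = [
    ["nlp", "is", "amazing"],
    ["nlp", "is", "fun", "and", "amazing"],
]

def tf(term, doc):
    # Occurrences of the term divided by total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term)
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "is" appears in both documents -> idf = log(2/2) = 0, so its weight
# vanishes; "fun" appears in only one document and keeps a positive weight.
print(tf_idf("is", docs[1], docs))   # 0.0
print(tf_idf("fun", docs[1], docs))  # > 0
```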
+    elif page == "Word2Vec":
         st.title("Word2Vec")
         st.markdown(
             """
@@ -154,11 +138,18 @@ def show_page(page):
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
+            #### Advantages:
+            - Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
+            - Efficient for large datasets.
+
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
+
+            #### Limitations:
+            - Requires significant computational resources.
             """
         )
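The "king" - "man" + "woman" ≈ "queen" analogy can be demonstrated mechanically with hand-picked toy vectors. Real Word2Vec embeddings are learned from a corpus (e.g., with gensim); these 3-dimensional vectors are fabricated purely to show the vector arithmetic and nearest-neighbor lookup:

```python
import math

# Fabricated toy embeddings; dimensions loosely mean (royalty, male, female)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.1, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```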
+    elif page == "FastText":
         st.title("FastText")
         st.markdown(
             """
@@ -166,36 +157,87 @@ def show_page(page):
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
+            #### Advantages:
+            - Handles rare and out-of-vocabulary words.
+            - Captures subword information (e.g., prefixes and suffixes).
+
+            #### Example:
+            The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing".
+
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
+
+            #### Limitations:
+            - Higher computational cost compared to Word2Vec.
             """
         )
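The subword decomposition of "playing" can be sketched in plain Python. This helper is illustrative, not the FastText library's API; the `<` and `>` padding follows FastText's convention of marking word boundaries:

```python
def char_ngrams(word, n=3):
    # Pad with boundary markers, then slide an n-character window
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

Because an unseen word shares n-grams with known words ("playing" and "played" share "<pl", "pla", "lay"), FastText can build vectors for out-of-vocabulary words.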
+    elif page == "Tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
+
+            #### Types of Tokenization:
+            - **Word Tokenization**: Splits text into words.
+            - **Sentence Tokenization**: Splits text into sentences.
+
+            #### Libraries for Tokenization:
+            - NLTK, SpaCy, and Hugging Face Transformers.
+
+            #### Example:
+            Sentence: "NLP is exciting."
+            - Word Tokens: ["NLP", "is", "exciting", "."]
+
+            #### Applications:
+            - Preprocessing for machine learning models.
+
+            #### Challenges:
+            - Handling complex text like abbreviations and multilingual data.
             """
         )
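The word-tokenization example on this page can be approximated with a single regex; real tokenizers in NLTK or SpaCy handle far more edge cases (abbreviations, contractions, multilingual scripts), which is exactly the challenge the page notes:

```python
import re

def word_tokenize(text):
    # Runs of word characters become word tokens;
    # each remaining punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # ['NLP', 'is', 'exciting', '.']
```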
+    elif page == "Stop Words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
+
+            #### Examples of Stop Words:
+            - English: "is", "the", "and", "in".
+            - Spanish: "es", "el", "y", "en".
+
+            #### Why Remove Stop Words?
+            - To reduce noise in text data.
+
+            #### Applications:
+            - Sentiment analysis, text classification, and search engines.
+
+            #### Challenges:
+            - Some stop words might carry context-specific importance.
             """
         )
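Filtering out the English stop words listed on this page is a short list comprehension. The set below is the page's tiny illustrative list; production code would typically use a fuller list such as NLTK's `stopwords` corpus:

```python
STOP_WORDS = {"is", "the", "and", "in"}  # tiny illustrative list

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "fun", "and", "amazing"]))
# ['NLP', 'fun', 'amazing']
```

As the page's "Challenges" note says, blind removal can hurt: dropping "not" from "not good" inverts the sentiment, so stop-word lists should be tuned per task.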
 
+# Sidebar navigation
+st.sidebar.title("NLP Topics")
+menu_options = [
+    "Home",
+    "NLP Terminologies",
+    "One-Hot Vectorization",
+    "Bag of Words",
+    "TF-IDF Vectorizer",
+    "Word2Vec",
+    "FastText",
+    "Tokenization",
+    "Stop Words",
+]
+selected_page = st.sidebar.radio("Select a topic", menu_options)
+
+# Display the selected page
+if selected_page == "Home":
     show_home_page()
 else:
+    show_page(selected_page)