Update pages/4.Feature Engineering.py

pages/4.Feature Engineering.py (new file, +216 −0)
@@ -0,0 +1,216 @@
import streamlit as st


# Function to display the Home Page
def show_home_page():
    st.title("🐦 :red[Natural Language Processing (NLP)]")
    st.markdown(
        """
        ### :green[Welcome to the NLP Guide]
        Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between
        computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.
        This guide explores key NLP concepts and techniques, from basic terminology to advanced vectorization methods. Use the sidebar to explore each topic in detail.

        #### :green[Applications of NLP:]
        - Chatbots and virtual assistants (e.g., Alexa, Siri)
        - Sentiment analysis
        - Language translation tools (e.g., Google Translate)
        - Text summarization and more!
        """
    )
    st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")

# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title("📖 :blue[NLP Terminologies]")
        st.markdown(
            """
            ### :red[Key NLP Terms:]
            - **Tokenization**: Splitting text into smaller units like words or sentences.
            - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
            - **Stemming**: Reducing words to their root form (e.g., "running" → "run").
            - **Lemmatization**: Converting words to their dictionary base form (e.g., "better" → "good").
            - **Corpus**: A large collection of text used for NLP training and analysis.
            - **Vocabulary**: The set of unique words in a corpus.
            - **n-grams**: Contiguous sequences of *n* words or characters in text.
            - **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
            - **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
            - **Parsing**: Analyzing the grammatical structure of a sentence.
            """
        )
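The *n-grams* term defined above can be made concrete in a few lines of plain Python (a minimal sketch; `ngrams` is an illustrative helper, not part of this app):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```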

    elif page == "One-Hot Vectorization":
        st.title("🔧 :green[One-Hot Vectorization]")
        st.markdown(
            """
            ### :red[One-Hot Vectorization Explained]
            One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.
            #### :red[How It Works:]
            - Each unique word in the vocabulary is assigned an index.
            - The vector for a word is all zeros except for a `1` at the index of that word.
            #### :red[Example:]
            Vocabulary: ["cat", "dog", "bird"]
            - "cat" → [1, 0, 0]
            - "dog" → [0, 1, 0]
            - "bird" → [0, 0, 1]
            #### :red[Advantages:]
            - Simple and intuitive to implement.
            #### :red[Limitations:]
            - High dimensionality for large vocabularies.
            - Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).
            #### :red[Applications:]
            - Suitable for small datasets where simplicity is a priority.
            """
        )
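The encoding this page describes can be sketched directly from its cat/dog/bird example (`one_hot` is an illustrative helper, not part of the app):

```python
def one_hot(word, vocab):
    # All zeros except a 1 at the word's index in the vocabulary.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["cat", "dog", "bird"]
print(one_hot("dog", vocab))  # → [0, 1, 0]
```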

    elif page == "Bag of Words":
        st.title("📊 :green[Bag of Words (BoW)]")
        st.markdown(
            """
            ### :orange[Bag of Words (BoW) Method]
            Bag of Words is a way of representing text by counting word occurrences while ignoring word order.
            #### :orange[How It Works:]
            1. Create a vocabulary of all unique words in the text.
            2. Count the frequency of each word in a document.
            #### :orange[Example:]
            Given two sentences:
            - Sentence 1: "I love NLP."
            - Sentence 2: "I love programming."

            Vocabulary: ["I", "love", "NLP", "programming"]
            - Sentence 1: [1, 1, 1, 0]
            - Sentence 2: [1, 1, 0, 1]
            #### :orange[Advantages:]
            - Simple to implement and interpret.
            #### :orange[Limitations:]
            - High dimensionality for large vocabularies.
            - Ignores word order and semantic meaning.
            - Sensitive to noisy or very frequent terms.
            #### :orange[Applications:]
            - Text classification and clustering.
            """
        )
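The two-sentence example on this page can be reproduced with a small counting sketch (a minimal illustration using `collections.Counter`; `bag_of_words` is a hypothetical helper, and real pipelines would use a proper tokenizer instead of stripping the trailing period):

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    # Count how often each vocabulary word occurs in the sentence.
    counts = Counter(sentence.rstrip(".").split())
    return [counts[word] for word in vocab]

vocab = ["I", "love", "NLP", "programming"]
print(bag_of_words("I love NLP.", vocab))          # → [1, 1, 1, 0]
print(bag_of_words("I love programming.", vocab))  # → [1, 1, 0, 1]
```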

    elif page == "TF-IDF Vectorizer":
        st.title("📈 :blue[TF-IDF Vectorizer]")
        st.markdown(
            r"""
            ### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
            TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).
            #### :rainbow[Formula:]
            $$\text{TF-IDF} = \text{TF} \times \text{IDF}$$
            - **TF (Term Frequency)**: Frequency of a word in a document divided by the total number of words in the document.
            - **IDF (Inverse Document Frequency)**: Logarithm of the total number of documents divided by the number of documents containing the word.
            #### :rainbow[Example:]
            For the corpus:
            - Document 1: "NLP is amazing."
            - Document 2: "NLP is fun and amazing."

            A word like "fun", which appears in only one document, receives a higher weight than words like "is" and "amazing", which appear in both.
            #### :rainbow[Advantages:]
            - Highlights unique and relevant terms.
            - Reduces the impact of frequent, less informative words.
            #### :rainbow[Applications:]
            - Information retrieval, search engines, and document classification.
            """
        )
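The formula above can be checked by hand on this page's two-document corpus (a minimal sketch using the natural logarithm and the plain, unsmoothed IDF definition given above; library implementations such as scikit-learn's add smoothing and normalization):

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term / total words in the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

corpus = [
    ["nlp", "is", "amazing"],                # Document 1
    ["nlp", "is", "fun", "and", "amazing"],  # Document 2
]
# "is" appears in both documents, so its IDF is log(2/2) = 0;
# "fun" appears in only one, so its IDF is log(2/1) > 0.
print(tf("fun", corpus[1]) * idf("fun", corpus))  # TF-IDF of "fun" in Document 2
print(tf("is", corpus[1]) * idf("is", corpus))    # → 0.0
```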

    elif page == "Word2Vec":
        st.title("🌐 :red[Word2Vec]")
        st.markdown(
            """
            ### :green[Word2Vec]
            Word2Vec creates dense vector representations of words, capturing semantic relationships, using a shallow neural network.
            #### :green[Key Models:]
            - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
            - **Skip-gram**: Predicts the context words from a target word.
            #### :green[Example:]
            Word2Vec can capture relationships like:
            - "king" - "man" + "woman" ≈ "queen"
            #### :green[Advantages:]
            - Captures semantic meaning and relationships.
            - Efficient for large datasets.
            #### :green[Applications:]
            - Sentiment analysis, recommendation systems, and machine translation.
            #### :green[Limitations:]
            - Computationally intensive to train on large datasets.
            """
        )
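Training a real Word2Vec model requires a library such as gensim, but the (target, context) pairs the skip-gram model described above learns from can be generated in plain Python (`skipgram_pairs` is an illustrative helper, not a full implementation):

```python
def skipgram_pairs(tokens, window=2):
    # Pair each target word with every word inside its context window.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["nlp", "is", "fun"], window=1))
# → [('nlp', 'is'), ('is', 'nlp'), ('is', 'fun'), ('fun', 'is')]
```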

    elif page == "FastText":
        st.title("🚀 :red[FastText]")
        st.markdown(
            """
            ### :blue[FastText]
            FastText extends Word2Vec by representing words as bags of character n-grams, enabling it to handle rare and out-of-vocabulary words.
            #### :blue[Example:]
            The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".
            #### :blue[Advantages:]
            - Handles rare words and misspellings.
            - Captures subword information (e.g., prefixes and suffixes).
            #### :blue[Applications:]
            - Multilingual text processing.
            - Working with noisy or incomplete data.
            #### :blue[Limitations:]
            - Higher computational cost than Word2Vec.
            """
        )
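The subword decomposition this page describes can be sketched in plain Python (FastText also pads each word with `<` and `>` boundary markers before extracting n-grams, which is why the end-of-word trigram "ng>" is distinct from "ng" elsewhere; `char_ngrams` is an illustrative helper):

```python
def char_ngrams(word, n=3):
    # Pad the word with boundary markers, then slide an n-character window.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# → ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```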

    elif page == "Tokenization":
        st.title("✂️ :blue[Tokenization]")
        st.markdown(
            """
            ### :red[Tokenization]
            Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.
            #### :red[Types:]
            - **Word Tokenization**: Splits text into words.
            - **Sentence Tokenization**: Splits text into sentences.
            #### :red[Example:]
            Sentence: "NLP is exciting."
            - Word Tokens: ["NLP", "is", "exciting", "."]
            #### :red[Libraries:]
            - NLTK
            - spaCy
            - Hugging Face Transformers
            #### :red[Challenges:]
            - Handling complex text (e.g., abbreviations, contractions, multilingual data).
            #### :red[Applications:]
            - Preprocessing for machine learning models.
            """
        )
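A minimal word tokenizer reproducing this page's example can be written with a single regular expression (an illustrative sketch; libraries like NLTK and spaCy handle the harder cases such as contractions and abbreviations):

```python
import re

def word_tokenize(text):
    # Runs of word characters form one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # → ['NLP', 'is', 'exciting', '.']
```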

    elif page == "Stop Words":
        st.title("🛑 :green[Stop Words]")
        st.markdown(
            """
            ### :rainbow[Stop Words]
            Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").
            #### :rainbow[Why Remove Stop Words?]
            - To reduce noise and focus on meaningful terms in text.
            #### :rainbow[Example Stop Words:]
            - English: "is", "the", "and".
            - Spanish: "es", "el", "y".
            #### :rainbow[Challenges:]
            - Some stop words carry important context in specific use cases (e.g., "not" in sentiment analysis).
            #### :rainbow[Applications:]
            - Sentiment analysis, text classification, and search engines.
            """
        )
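The filtering step this page describes is a one-line list comprehension (a minimal sketch; `STOP_WORDS` here is a tiny illustrative set, whereas NLTK and spaCy ship curated per-language lists):

```python
STOP_WORDS = {"is", "the", "and"}  # tiny illustrative list, not a complete one

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word set (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "the", "future"]))  # → ['NLP', 'future']
```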

# Sidebar navigation
st.sidebar.title("📚 NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)