import streamlit as st


# Function to display the Home Page
def show_home_page():
    st.title("🔦 :red[Natural Language Processing (NLP)]")
    st.markdown(
        """
### :green[Welcome to the NLP Guide]

Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence
that focuses on the interaction between computers and humans using natural language.
It enables machines to read, understand, and generate human language in a meaningful way.

This guide explores key NLP concepts and techniques, from basic terminologies to
advanced vectorization methods. Use the sidebar to explore each topic in detail.

#### :green[Applications of NLP:]
- Chatbots and virtual assistants (e.g., Alexa, Siri)
- Sentiment analysis
- Language translation tools (e.g., Google Translate)
- Text summarization and more!
"""
    )
    st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")


# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title("🔍 :blue[NLP Terminologies]")
        st.markdown(
            """
### :red[Key NLP Terms:]
- **Tokenization**: Splitting text into smaller units like words or sentences.
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
- **Stemming**: Reducing words to their root form (e.g., "running" → "run").
- **Lemmatization**: Converting words to their dictionary base form (e.g., "running" → "run").
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of unique words in a corpus.
- **n-grams**: Sequences of *n* words or characters in text.
- **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing the grammatical structure of a sentence.
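
Several of these terms can be illustrated with a short, self-contained sketch in plain Python (a toy example, not a production tokenizer):

```python
text = "NLP is fun and NLP is useful"

# Tokenization: split the text into word tokens.
tokens = text.lower().split()

# Vocabulary: the set of unique words in the corpus.
vocabulary = sorted(set(tokens))

# n-grams: contiguous sequences of n tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(tokens, 2)  # e.g., ("nlp", "is"), ("is", "fun"), ...
```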
""" ) elif page == "One-Hot Vectorization": st.title("🔧 :green[One-Hot Vectorization]") st.markdown( """ ### :red[One-Hot Vectorization Explained] One-Hot Vectorization is a simple representation where each word is encoded as a binary vector. #### :red[How It Works:] - Each unique word in the vocabulary is assigned an index. - The vector for a word is all zeros except for a `1` at the index of that word. #### :red[Example:] Vocabulary: ["cat", "dog", "bird"] - "cat" → [1, 0, 0] - "dog" → [0, 1, 0] - "bird" → [0, 0, 1] #### :red[Advantages:] - Simple and intuitive to implement. #### :red[Limitations:] - High dimensionality for large vocabularies. - Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection). #### :red[Applications:] - Suitable for small datasets where simplicity is a priority. """ ) elif page == "Bag of Words": st.title("🔄 :green[Bag of Words (BoW)]") st.markdown( """ ### :orange[Bag of Words (BoW) Method] Bag of Words is a way of representing text by counting word occurrences while ignoring word order. #### :orange[How It Works:] 1. Create a vocabulary of all unique words in the text. 2. Count the frequency of each word in a document. #### :orange[Example:] Given two sentences: - Sentence 1: "I love NLP." - Sentence 2: "I love programming." Vocabulary: ["I", "love", "NLP", "programming"] - Sentence 1: [1, 1, 1, 0] - Sentence 2: [1, 1, 0, 1] #### :orange[Advantages:] - Simple to implement and interpret. #### :orange[Limitations:] - High dimensionality for large vocabularies. - Ignores word order and semantic meaning. - Sensitive to noisy or frequent terms. #### :orange[Applications:] - Text classification and clustering. """ ) elif page == "TF-IDF Vectorizer": st.title("🔄 :blue[TF-IDF Vectorizer]") st.markdown( """ ### :green[TF-IDF (Term Frequency-Inverse Document Frequency)] TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus). 
#### :rainbow[Formula:]
$$\\text{TF-IDF} = \\text{TF} \\times \\text{IDF}$$

- **TF (Term Frequency)**: Frequency of a word in a document divided by the total words in the document.
- **IDF (Inverse Document Frequency)**: Logarithm of total documents divided by the number of documents containing the word.

#### :rainbow[Example:]
For the corpus:
- Document 1: "NLP is amazing."
- Document 2: "NLP is fun and amazing."

A word like "fun", which appears in only one document, will have a higher weight than words like "is" and "amazing" that occur in every document.

#### :rainbow[Advantages:]
- Highlights unique and relevant terms.
- Reduces the impact of frequent, less informative words.

#### :rainbow[Applications:]
- Information retrieval, search engines, and document classification.
"""
        )
    elif page == "Word2Vec":
        st.title("🌐 :red[Word2Vec]")
        st.markdown(
            """
### :green[Word2Vec]
Word2Vec creates dense vector representations of words, capturing semantic relationships using neural networks.

#### :green[Key Models:]
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- **Skip-gram**: Predicts the context words from a target word.

#### :green[Example:]
Word2Vec can capture relationships like:
- "king" - "man" + "woman" ≈ "queen"

#### :green[Advantages:]
- Captures semantic meaning and relationships.
- Efficient for large datasets.

#### :green[Applications:]
- Sentiment analysis, recommendation systems, and machine translation.

#### :green[Limitations:]
- Computationally intensive to train on large datasets.
"""
        )
    elif page == "FastText":
        st.title("🔄 :red[FastText]")
        st.markdown(
            """
### :blue[FastText]
FastText extends Word2Vec by representing words as character n-grams, enabling it to handle rare and out-of-vocabulary words.

#### :blue[Example:]
The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".

#### :blue[Advantages:]
- Handles rare words and misspellings.
- Captures subword information (e.g., prefixes and suffixes).
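
The subword idea can be sketched in a few lines (real FastText also keeps the full word as a token and hashes n-grams into buckets; this shows only the n-gram extraction):

```python
def char_ngrams(word, n=3):
    # FastText pads words with boundary markers before extracting character n-grams.
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

subwords = char_ngrams("playing")
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```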
#### :blue[Applications:]
- Multilingual text processing.
- Working with noisy or incomplete data.

#### :blue[Limitations:]
- Higher computational cost than Word2Vec.
"""
        )
    elif page == "Tokenization":
        st.title("🔢 :blue[Tokenization]")
        st.markdown(
            """
### :red[Tokenization]
Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.

#### :red[Types:]
- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.

#### :red[Example:]
Sentence: "NLP is exciting."
- Word Tokens: ["NLP", "is", "exciting", "."]

#### :red[Libraries:]
- NLTK
- spaCy
- Hugging Face Transformers

#### :red[Challenges:]
- Handling complex text (e.g., abbreviations, contractions, multilingual data).

#### :red[Applications:]
- Preprocessing for machine learning models.
"""
        )
    elif page == "Stop Words":
        st.title("🔐 :green[Stop Words]")
        st.markdown(
            """
### :rainbow[Stop Words]
Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").

#### :rainbow[Why Remove Stop Words?]
- To reduce noise and focus on meaningful terms in the text.

#### :rainbow[Example Stop Words:]
- English: "is", "the", "and".
- Spanish: "es", "el", "y".

#### :rainbow[Challenges:]
- Some stop words may carry important context in specific use cases.

#### :rainbow[Applications:]
- Sentiment analysis, text classification, and search engines.
"""
        )


# Sidebar navigation
st.sidebar.title("🔍 NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)