import streamlit as st


# Function to display the Home Page
def show_home_page():
    st.title(":red[Natural Language Processing (NLP)]")
    st.markdown(
        """
### :green[Welcome to the NLP Guide]

Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.

This guide explores key NLP concepts and techniques, from basic terminology to advanced vectorization methods. Use the sidebar to explore each topic in detail.

#### :green[Applications of NLP:]
- Chatbots and virtual assistants (e.g., Alexa, Siri)
- Sentiment analysis
- Language translation tools (e.g., Google Translate)
- Text summarization, and more!
"""
    )
    st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")


# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title(":blue[NLP Terminologies]")
        st.markdown(
            """
### :red[Key NLP Terms:]
- **Tokenization**: Splitting text into smaller units like words or sentences.
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
- **Stemming**: Reducing words to their root form (e.g., "running" → "run").
- **Lemmatization**: Converting words to their dictionary base form (e.g., "better" → "good").
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of unique words in a corpus.
- **n-grams**: Sequences of *n* consecutive words or characters in text.
- **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing the grammatical structure of a sentence.
"""
        )
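        # Illustrative sketch (not part of the original page): word bigrams
        # (n-grams with n = 2) generated in plain Python from a short sentence.
        words = "NLP enables machines to read".split()
        bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]
        st.write(bigrams)  # [("NLP", "enables"), ("enables", "machines"), ...]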
    elif page == "One-Hot Vectorization":
        st.title(":green[One-Hot Vectorization]")
        st.markdown(
            """
### :red[One-Hot Vectorization Explained]
One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.

#### :red[How It Works:]
- Each unique word in the vocabulary is assigned an index.
- The vector for a word is all zeros except for a `1` at the index of that word.

#### :red[Example:]
Vocabulary: ["cat", "dog", "bird"]
- "cat" → [1, 0, 0]
- "dog" → [0, 1, 0]
- "bird" → [0, 0, 1]

#### :red[Advantages:]
- Simple and intuitive to implement.

#### :red[Limitations:]
- High dimensionality for large vocabularies.
- Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).

#### :red[Applications:]
- Suitable for small datasets where simplicity is a priority.
"""
        )
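        # Illustrative sketch (not part of the original page): the one-hot
        # vectors from the example above, built in plain Python.
        vocabulary = ["cat", "dog", "bird"]
        one_hot = {w: [1 if v == w else 0 for v in vocabulary] for w in vocabulary}
        st.write(one_hot)  # one_hot["dog"] == [0, 1, 0]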
    elif page == "Bag of Words":
        st.title(":green[Bag of Words (BoW)]")
        st.markdown(
            """
### :orange[Bag of Words (BoW) Method]
Bag of Words represents text by counting word occurrences while ignoring word order.

#### :orange[How It Works:]
1. Create a vocabulary of all unique words in the text.
2. Count the frequency of each word in a document.

#### :orange[Example:]
Given two sentences:
- Sentence 1: "I love NLP."
- Sentence 2: "I love programming."

Vocabulary: ["I", "love", "NLP", "programming"]
- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]

#### :orange[Advantages:]
- Simple to implement and interpret.

#### :orange[Limitations:]
- High dimensionality for large vocabularies.
- Ignores word order and semantic meaning.
- Sensitive to noisy or very frequent terms.

#### :orange[Applications:]
- Text classification and clustering.
"""
        )
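        # Illustrative sketch (not part of the original page): the BoW vectors
        # for the two example sentences, counted against a fixed vocabulary.
        vocab = ["i", "love", "nlp", "programming"]

        def bow(sentence):
            words = sentence.lower().replace(".", "").split()
            return [words.count(w) for w in vocab]

        st.write(bow("I love NLP."))          # [1, 1, 1, 0]
        st.write(bow("I love programming."))  # [1, 1, 0, 1]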
    elif page == "TF-IDF Vectorizer":
        st.title(":blue[TF-IDF Vectorizer]")
        st.markdown(
            r"""
### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).

#### :rainbow[Formula:]
$\text{TF-IDF} = \text{TF} \times \text{IDF}$
- **TF (Term Frequency)**: Frequency of a word in a document divided by the total number of words in the document.
- **IDF (Inverse Document Frequency)**: Logarithm of the total number of documents divided by the number of documents containing the word.

#### :rainbow[Example:]
For the corpus:
- Document 1: "NLP is amazing."
- Document 2: "NLP is fun and amazing."

A word like "fun", which appears in only one document, receives a higher weight than words like "is" or "amazing", which appear in every document.

#### :rainbow[Advantages:]
- Highlights unique and relevant terms.
- Reduces the impact of frequent, less informative words.

#### :rainbow[Applications:]
- Information retrieval, search engines, and document classification.
"""
        )
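        # Illustrative sketch (not part of the original page): raw TF-IDF
        # scores for the example corpus, using the TF and IDF definitions above.
        import math

        docs = [
            "nlp is amazing".split(),
            "nlp is fun and amazing".split(),
        ]

        def tf_idf(word, doc):
            tf = doc.count(word) / len(doc)
            df = sum(word in d for d in docs)
            return tf * math.log(len(docs) / df)

        st.write(tf_idf("fun", docs[1]))  # positive: "fun" is unique to one document
        st.write(tf_idf("is", docs[1]))   # 0.0: "is" appears in every document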
    elif page == "Word2Vec":
        st.title(":red[Word2Vec]")
        st.markdown(
            """
### :green[Word2Vec]
Word2Vec creates dense vector representations of words, capturing semantic relationships using neural networks.

#### :green[Key Models:]
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- **Skip-gram**: Predicts the context from a target word.

#### :green[Example:]
Word2Vec can capture relationships like:
- "king" - "man" + "woman" ≈ "queen"

#### :green[Advantages:]
- Captures semantic meaning and relationships.
- Efficient for large datasets.

#### :green[Applications:]
- Sentiment analysis, recommendation systems, and machine translation.

#### :green[Limitations:]
- Computationally intensive to train on large datasets.
"""
        )
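        # Illustrative sketch (not part of the original page): the famous
        # analogy demonstrated with tiny made-up 2-d vectors; real Word2Vec
        # embeddings are learned from data (e.g., with the gensim library).
        emb = {
            "king": [0.9, 0.8],
            "man": [0.5, 0.2],
            "woman": [0.5, 0.9],
            "queen": [0.9, 1.5],
        }
        analogy = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
        st.write(analogy)  # approximately equal to emb["queen"]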
    elif page == "FastText":
        st.title(":red[FastText]")
        st.markdown(
            """
### :blue[FastText]
FastText extends Word2Vec by representing words as character n-grams, enabling it to handle rare and out-of-vocabulary words.

#### :blue[Example:]
The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".

#### :blue[Advantages:]
- Handles rare words and misspellings.
- Captures subword information (e.g., prefixes and suffixes).

#### :blue[Applications:]
- Multilingual text processing.
- Working with noisy or incomplete data.

#### :blue[Limitations:]
- Higher computational cost than Word2Vec.
"""
        )
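        # Illustrative sketch (not part of the original page): extracting the
        # character trigrams FastText-style models build subword vectors from;
        # FastText itself also adds "<" and ">" word-boundary markers.
        def char_ngrams(word, n=3):
            padded = f"<{word}>"
            return [padded[i:i + n] for i in range(len(padded) - n + 1)]

        st.write(char_ngrams("playing"))  # ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']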
    elif page == "Tokenization":
        st.title(":blue[Tokenization]")
        st.markdown(
            """
### :red[Tokenization]
Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.

#### :red[Types:]
- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.

#### :red[Example:]
Sentence: "NLP is exciting."
- Word Tokens: ["NLP", "is", "exciting", "."]

#### :red[Libraries:]
- NLTK
- spaCy
- Hugging Face Transformers

#### :red[Challenges:]
- Handling complex text (e.g., abbreviations, contractions, multilingual data).

#### :red[Applications:]
- Preprocessing for machine learning models.
"""
        )
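        # Illustrative sketch (not part of the original page): a minimal regex
        # word tokenizer; libraries such as NLTK or spaCy handle the harder
        # cases (abbreviations, contractions, multilingual text).
        import re

        tokens = re.findall(r"\w+|[^\w\s]", "NLP is exciting.")
        st.write(tokens)  # ["NLP", "is", "exciting", "."]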
    elif page == "Stop Words":
        st.title(":green[Stop Words]")
        st.markdown(
            """
### :rainbow[Stop Words]
Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").

#### :rainbow[Why Remove Stop Words?]
- To reduce noise and focus on meaningful terms in the text.

#### :rainbow[Example Stop Words:]
- English: "is", "the", "and"
- Spanish: "es", "el", "y"

#### :rainbow[Challenges:]
- Some stop words can carry important context in specific use cases (e.g., "not" in sentiment analysis).

#### :rainbow[Applications:]
- Sentiment analysis, text classification, and search engines.
"""
        )
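        # Illustrative sketch (not part of the original page): filtering a tiny
        # hand-picked stop-word list; NLTK and spaCy ship full per-language lists.
        stop_words = {"is", "the", "and"}
        words = "NLP is the study of language and meaning".split()
        st.write([w for w in words if w.lower() not in stop_words])
        # ["NLP", "study", "of", "language", "meaning"]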


# Sidebar navigation
st.sidebar.title("NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)