trohith89 committed on
Commit
f04a4d6
·
verified ·
1 Parent(s): e1ffa0a

Update app.py

Browse files
Files changed (1)
  1. app.py +302 -136
app.py CHANGED
@@ -1,138 +1,304 @@
  import streamlit as st

-
- st.header("Introduction to Natural Language Processing (NLP)")
- st.markdown("<p>Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is valuable and meaningful.</p>", unsafe_allow_html=True)
-
-
- st.subheader("What is NLP?")
- st.markdown("<p>Natural language processing (NLP) is a field of computer science and a subfield of artificial intelligence that aims to make computers understand human language. NLP uses computational linguistics, which is the study of how language works, and various models based on statistics, machine learning, and deep learning. These technologies allow computers to analyze and process text or voice data and to grasp their full meaning, including the speaker’s or writer’s intentions and emotions.</p>", unsafe_allow_html=True)
- st.image("NLP.jpg")
- st.markdown("<p>NLP powers many applications that use language, such as text translation, voice recognition, text summarization, and chatbots. You may have used some of these applications yourself, such as voice-operated GPS systems, digital assistants, speech-to-text software, and customer service bots. NLP also helps businesses improve their efficiency, productivity, and performance by simplifying complex tasks that involve language.</p>", unsafe_allow_html=True)
-
-
- st.subheader("NLP Techniques")
- st.markdown("<p>NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These tasks can be categorized into several broad areas, each addressing a different aspect of language processing. Here are some of the key NLP techniques:</p>", unsafe_allow_html=True)
-
- st.markdown('<p><b>1. Text Processing and Preprocessing in NLP</b></p>', unsafe_allow_html=True)
- st.write("Before performing any analysis or modeling, raw text data must be cleaned and prepared.")
- st.markdown('<p><b>a. Tokenization</b></p>', unsafe_allow_html=True)
- st.write("Splits text into smaller units like words or sentences.")
- st.write("**Types:**")
-
- st.write("**(i) Word Tokenization:** Breaking text into words.")
- st.write("Example: _'I love NLP'_ → [‘I’, ‘love’, ‘NLP’]")
-
- st.write("**(ii) Sentence Tokenization:** Breaking text into sentences.")
- st.write("Example: _'I love NLP. It’s fascinating!'_ → [‘I love NLP.’, ‘It’s fascinating!’]")
-
- st.markdown('<p><b>b. Stopword Removal</b></p>', unsafe_allow_html=True)
- st.write("Removes common words like “the,” “and,” and “is” that do not contribute much to analysis.")
-
-
- st.markdown('<p><b>c. Stemming and Lemmatization</b></p>', unsafe_allow_html=True)
- st.write("Stemming: Reduces words to their base or root form by chopping off suffixes (may not produce valid words).")
- st.write("Example: _“running”_ → “run”")
-
- st.write("Lemmatization: Converts words to their base form using vocabulary and grammar.")
- st.write("Example: _“better”_ → “good”")
-
- st.markdown('<p><b>d. Part-of-Speech (POS) Tagging</b></p>', unsafe_allow_html=True)
- st.write("Labels words with their grammatical roles (noun, verb, adjective, etc.).")
- st.write("Example: _“The cat sleeps”_ → [“The/DET”, “cat/NOUN”, “sleeps/VERB”]")
-
- st.markdown('<p><b>e. Named Entity Recognition (NER)</b></p>', unsafe_allow_html=True)
- st.write("Identifies and classifies entities in text (e.g., names, dates, locations).")
- st.write("Example: _“Barack Obama was born in Hawaii.”_ → [Barack Obama: PERSON, Hawaii: LOCATION]")
-
-
- st.markdown('<p><b>f. Text Normalization</b></p>', unsafe_allow_html=True)
- st.write("Converts text to a standard format (lowercasing, removing punctuation, etc.).")
-
-
- st.markdown('<p><b>2. Feature Extraction Techniques</b></p>', unsafe_allow_html=True)
- st.write("Text needs to be transformed into numerical representations for machine learning models.")
-
- st.markdown('<p><b>a. Bag of Words (BoW)</b></p>', unsafe_allow_html=True)
- st.write("Represents text as a vector of word frequencies or occurrences, ignoring grammar and word order.")
- st.write("Example:")
- st.write("Texts: “I love NLP” and “NLP is great”")
- st.write("Vocabulary: [“I”, “love”, “NLP”, “is”, “great”]")
- st.write("Vector for “I love NLP”: [1, 1, 1, 0, 0]")
-
-
- st.markdown('<p><b>b. Term Frequency-Inverse Document Frequency (TF-IDF)</b></p>', unsafe_allow_html=True)
- st.write("The **TF-IDF vectorizer** is a popular technique in Natural Language Processing (NLP) used to convert text into numerical values that can be used by machine learning models. It stands for Term Frequency-Inverse Document Frequency and highlights the importance of words in a document relative to a collection of documents (called a corpus).")
-
- st.write('**Term Frequency (TF)** \n - Measures how often a word appears in a single document. \n - Formula: \n _TF_ = Number of times the word appears in the document / Total number of words in the document')
- st.write('**Inverse Document Frequency (IDF)** \n - Measures how unique or rare a word is across all documents in the corpus. \n - Formula: \n _IDF_ = log(Total number of documents / Number of documents containing the word) \n Words that appear in many documents (like "the" or "and") will have a low IDF value, while unique words (like "NLP") will have a higher IDF.')
- st.write('**TF-IDF Score:** \n - Combines TF and IDF to calculate the importance of a word in a document. \n - Formula: \n _TF-IDF = TF × IDF_ \n Words that are frequent in a document but rare in the overall corpus get a higher score.')
-
- st.write("""
- **Example**
- **Consider these two documents:**
-
- - "I love NLP"
- - "NLP is amazing"
-
- **Step 1: Calculate TF**
- - "NLP" appears once in each three-word document, so its TF is **1/3** in both.
- - Words like "love" and "amazing" also have a TF of **1/3**.
-
- **Step 2: Calculate IDF**
- - "NLP" appears in both documents, so its IDF is **log(2/2) = 0**.
- - "love" and "amazing" appear in only one document each, so their IDF is **log(2/1) ≈ 0.69**.
-
- **Step 3: Compute TF-IDF**
- - "NLP" gets a TF-IDF score of **1/3 × 0 = 0** (not distinctive).
- - "love" and "amazing" get scores of **1/3 × 0.69 ≈ 0.23** (more distinctive).
- """)
-
-
- st.markdown('<p><b>c. Word Embeddings</b></p>', unsafe_allow_html=True)
- st.write("Word embeddings are a type of representation for text where words are converted into dense numerical vectors. These vectors capture the semantic meaning of words and their relationships with other words in a way that computers can process.")
-
-
- st.write("""
- **Word Embedding Techniques**
-
- **1. Word2Vec**
- Developed by Google, it uses two main approaches:
- - **CBOW (Continuous Bag of Words):** Predicts a word based on its context.
- - **Skip-Gram:** Predicts the context given a word.
-
- **2. GloVe (Global Vectors)**
- Developed at Stanford, it captures word relationships by analyzing co-occurrence statistics of words in a large corpus.
-
- **3. FastText**
- Developed by Facebook, it extends Word2Vec by considering subword information, making it better at handling rare and misspelled words.
-
- **4. Transformers (Contextual Embeddings)**
- Models like **BERT**, **ELMo**, and **GPT** generate embeddings based on the context in which a word appears, capturing nuanced meanings.
- """)
-
- st.subheader("Future of NLP")
- st.write("""
- The future of Natural Language Processing (NLP) is exciting, with advancements that aim to make machines understand and interact with human language more effectively. Here are key areas shaping the future of NLP:
-
- **1. Context-Aware Models**
- - Enhanced understanding of context: Models like GPT and BERT have already revolutionized NLP. Future advancements will further refine their ability to comprehend nuanced context, sarcasm, and idioms.
-
- **2. Real-Time Multilingual NLP**
- - Instant translations: Real-time and accurate translation across diverse languages, including low-resource ones.
- - Language independence: NLP systems capable of handling any language seamlessly.
-
- **3. Conversational AI**
- - Human-like conversations: Chatbots and virtual assistants will become more natural, empathetic, and intuitive in conversations.
- - Emotion recognition: Understanding and responding to user emotions effectively.
-
- **4. Zero-shot and Few-shot Learning**
- - Minimal data requirements: Models will handle new tasks or languages with little to no additional training, making NLP accessible across domains with limited data.
-
- **5. Multimodal Learning**
- - Beyond text: Integrating text with images, audio, and video for richer applications like understanding memes, videos, or interactive media.
-
- The future of NLP is about creating systems that communicate more naturally, inclusively, and intelligently, enabling transformative applications in every aspect of life.
- """)
-
  import streamlit as st

+ # Sidebar for navigation
+ sidebar = st.sidebar
+
+ # Sidebar header
+ sidebar.header('🌐 NLP Navigation')
+
+ # Sidebar options for NLP Overview, Lifecycle, and Techniques
+ sidebar_option = sidebar.radio('Choose a section to explore:', ['What is NLP?', 'NLP Lifecycle', 'NLP Techniques'])
+
+ # Store the selected page in session state
+ if 'selected_page' not in st.session_state:
+     st.session_state.selected_page = sidebar_option
+
+ # Update the selected page if the user selects a different option
+ if sidebar_option != st.session_state.selected_page:
+     st.session_state.selected_page = sidebar_option
+
+ # Dynamically update the title based on the selected option
+ if st.session_state.selected_page == 'What is NLP?':
+     st.title('🤖 What is Natural Language Processing (NLP)?')
+ elif st.session_state.selected_page == 'NLP Lifecycle':
+     st.title('🔄 Natural Language Processing (NLP) Lifecycle')
+ elif st.session_state.selected_page == 'NLP Techniques':
+     st.title('⚙️ Techniques in Natural Language Processing (NLP)')
+
+ # Content for "What is NLP?"
+ if st.session_state.selected_page == 'What is NLP?':
+     st.write("""
+ ### 🤖 What is NLP?
+ Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human (natural) languages. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is meaningful.
+
+ NLP is essential for enabling computers to process and analyze large amounts of natural language data, such as:
+ - 📜 Text from documents
+ - 🗣️ Speech from conversations
+ - 🖼️ Images with textual descriptions
+
+ #### Key Components of NLP:
+ - **Syntax**: Refers to the arrangement of words in a sentence.
+ - **Semantics**: Focuses on the meaning of words and sentences.
+ - **Pragmatics**: Involves the context and intent behind language.
+ - **Discourse**: Studies how previous sentences and context influence meaning.
+
+ #### Example Applications of NLP:
+ - **Machine Translation**: Automatic translation of text from one language to another (e.g., Google Translate).
+ - **Speech Recognition**: Converting spoken language into text (e.g., Siri, Alexa).
+ - **Sentiment Analysis**: Analyzing text to determine its sentiment (positive, negative, neutral), e.g., in customer reviews.
+ - **Text Summarization**: Creating a short summary of a long text (e.g., summarizing articles).
+
+ NLP is used across domains like healthcare, finance, and customer service to automate and improve a wide range of tasks.
+ """)
+
+ # Content for NLP Lifecycle
+ elif st.session_state.selected_page == "NLP Lifecycle":
+     lifecycle_option = sidebar.radio("Select NLP Lifecycle Step:", [
+         "Overview of the NLP Life Cycle",
+         "Problem Definition",
+         "Data Collection",
+         "Text Preprocessing",
+         "Text Representation",
+         "Model Training",
+         "Evaluation",
+         "Deployment"
+     ])
+
+     if lifecycle_option == "Overview of the NLP Life Cycle":
+         st.write("""
+ #### 🔄 Overview of the NLP Life Cycle
+ The NLP life cycle is a structured process for building, using, and maintaining systems that work with human language. It turns unstructured text into meaningful insights or automated actions, and it ensures continuous improvement as real-world needs change.
+
+ - **How It Flows**:
+   - The process starts with identifying the problem and collecting the required text data.
+   - The data is then cleaned and prepared for analysis.
+   - Models are built and tested before being deployed for use.
+   - Regular checks and updates keep the solution working well.
+
+ - **Flexible and Adaptive**:
+   - Since language and data change (e.g., new words, trends), the process is repeated as needed.
+   - Models may need updates or retraining to stay accurate.
+
+ - **Combines Different Fields**:
+   - The process draws on linguistics, programming, and data analysis to make sure language is understood effectively.
+
+ - **Designed for Practical Use**:
+   - The goal is to create solutions that handle tasks like analyzing text, identifying emotions, powering chatbots, or translating languages accurately and efficiently.
+
+ - **Key Challenges Addressed**:
+   - Managing the complexity of language (e.g., meaning, structure).
+   - Working with large and messy datasets.
+   - Handling multiple languages and domain-specific vocabulary.
+   - Ensuring solutions are fast and efficient.
+
+ #### Steps in the NLP Life Cycle:
+ 1. Problem Definition
+ 2. Data Collection
+ 3. Text Preprocessing
+ 4. Text Representation
+ 5. Model Training
+ 6. Evaluation
+ 7. Deployment
+ """)
+
+     elif lifecycle_option == "Problem Definition":
+         st.write("""
+ #### 🔧 1. Problem Definition
+ - The first step in the NLP lifecycle is defining the problem: understanding the goal and figuring out how NLP can help solve it.
+ - The problem definition determines what data you will need to gather.
+ - **To better understand the problem, consider questions such as**:
+   - 🎯 What is the main goal of this analysis?
+   - 📝 What kind of text data are we working with (e.g., reviews, social media posts, documents)?
+   - 📊 What should the output be (e.g., a sentiment score, a summary, or a classification)?
+
+ **Example of a problem statement**: Classify customer reviews as either positive or negative, or find the main topics discussed in product reviews.
+ """)
+
+     elif lifecycle_option == "Data Collection":
+         st.write("""
+ #### 📥 2. Data Collection
+ Data collection is the second step in the NLP lifecycle. Guided by a clear understanding of the problem statement, it involves gathering text data from various sources so it can be analyzed and processed.
+ - **Sources for data collection**:
+   - 🌐 Public datasets from websites like Kaggle.
+   - 🔌 APIs.
+   - 🕸️ Web scraping of websites, using tools like Selenium or BeautifulSoup.
+   - ✋ Manual collection, when needed.
+ - In most cases, data is collected from websites, APIs, or through web scraping; manual collection is only necessary in rare cases.
+
+ **Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
+ """)
+
+     elif lifecycle_option == "Text Preprocessing":
+         st.write("""
+ #### 🧹 3. Text Preprocessing
+ Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.
+ - **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
+ - **Stop Words Removal**: Removing common words that don’t contribute much information.
+ - **Lemmatization**: Converting words into their base or dictionary form.
+ - **Stemming**: Cutting off prefixes or suffixes from words.
+ - **Lowercasing**: Converting all characters in the text to lowercase.
+
+ **Example**: For the sentence "The quick brown fox is running fast", after preprocessing:
+ - Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast"]
+ - Stop Words Removal: ["quick", "brown", "fox", "running", "fast"]
+ - Lemmatization: ["quick", "brown", "fox", "run", "fast"]
+ """)
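The preprocessing steps above can be sketched in plain Python. This is a minimal illustration only: the `preprocess` function, the tiny stopword set, and the irregular-form table are stand-ins invented here, not the app's code; real pipelines typically use NLTK or spaCy.

```python
import re

# Toy stand-ins for real stopword lists and lemmatizers (e.g. NLTK, spaCy).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}
IRREGULAR_FORMS = {"running": "run", "better": "good", "was": "be"}

def lemmatize(token: str) -> str:
    """Very naive lemmatizer: maps a few known inflections to base forms."""
    return IRREGULAR_FORMS.get(token, token)

def preprocess(text: str) -> list[str]:
    # Lowercase, tokenize on word characters, drop stop words, then lemmatize.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The quick brown fox is running fast"))
# ['quick', 'brown', 'fox', 'run', 'fast']
```

This reproduces the worked example above: "The" and "is" are removed as stop words, and "running" is reduced to "run".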
+
+     elif lifecycle_option == "Text Representation":
+         st.write("""
+ #### 📝 4. Text Representation
+ After preprocessing, the text data needs to be converted into a numerical format for use in machine learning models. Common methods of text representation include:
+ - **Bag of Words (BoW)**: Converts text into a matrix of word frequencies.
+ - **TF-IDF**: Weighs words by their frequency in a specific document relative to their frequency across the entire dataset.
+ - **Word Embeddings**: Transforms words into dense vectors that capture semantic meaning.
+
+ **Example**: Using BoW to convert the sentence "I love NLP" into a vector representation:
+ - Vocabulary: ["I", "love", "NLP"]
+ - Vector: [1, 1, 1] (word frequency representation)
+ """)
+
+     elif lifecycle_option == "Model Training":
+         st.write("""
+ #### 🏋️‍♂️ 5. Model Training
+ In the model training stage, machine learning algorithms are trained on the preprocessed and represented text data. The choice of model depends on the task:
+ - **Text Classification**: Naive Bayes, Support Vector Machines (SVM), or neural networks.
+ - **Named Entity Recognition (NER)**: Conditional Random Fields (CRF), LSTMs, or transformers.
+ - **Sentiment Analysis**: Logistic regression, Naive Bayes, or transformer-based models like BERT.
+
+ **Example**: Training a Naive Bayes classifier to categorize news articles into topics such as "Sports", "Politics", etc.
+ """)
+
+     elif lifecycle_option == "Evaluation":
+         st.write("""
+ #### 🏅 6. Evaluation
+ After training the model, it's important to evaluate its performance using metrics such as accuracy, precision, recall, and F1-score.
+ - **Accuracy**: The percentage of correct predictions.
+ - **Precision**: The percentage of relevant instances among the retrieved instances.
+ - **Recall**: The percentage of relevant instances that were retrieved.
+ - **F1-score**: The harmonic mean of precision and recall.
+
+ **Example**: If a sentiment analysis model predicts the correct sentiment for 80 out of 100 reviews, its accuracy is 80%.
+ """)
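The four metrics above follow directly from prediction counts. A short sketch with hypothetical label lists (the function name and the sample labels are made up for illustration; scikit-learn's metrics module is the usual tool):

```python
def precision_recall_f1(y_true, y_pred, positive="pos"):
    """Compute precision, recall, and F1 for one class from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)          # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["pos", "pos", "neg", "neg", "pos"]  # hypothetical gold labels
y_pred = ["pos", "neg", "neg", "pos", "pos"]  # hypothetical model output
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, p, r, f1)
```

Here 3 of 5 predictions are correct (accuracy 0.6), and precision, recall, and F1 for the positive class all work out to 2/3.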
+
+     elif lifecycle_option == "Deployment":
+         st.write("""
+ #### 🚀 7. Deployment
+ The final step is deploying the model for real-time use. This involves integrating it into a system or application where it can process live data.
+ - **Real-time Applications**: Chatbots, sentiment analysis for social media monitoring, text summarization for news.
+ - **Maintenance**: Continuously monitor the model to ensure its performance remains high. Updates might be necessary as the language evolves or new data emerges.
+
+ **Example**: Deploying a chatbot to answer customer inquiries based on historical support tickets.
+ """)
+
+ # Content for "NLP Techniques"
+ elif st.session_state.selected_page == "NLP Techniques":
+     technique_option = sidebar.radio("Select NLP Technique:", [
+         "NLP Techniques",
+         "Tokenization",
+         "Stop Words Removal",
+         "Lemmatization",
+         "Stemming",
+         "Bag of Words (BoW)",
+         "TF-IDF",
+         "Word Embeddings",
+         "Named Entity Recognition (NER)",
+         "Part-of-Speech (POS) Tagging",
+         "Sentiment Analysis"
+     ])
+
+     if technique_option == "NLP Techniques":
+         st.write("""
+ ### ⚙️ Techniques in NLP
+ NLP uses a variety of techniques to process and analyze text data. Some of the most common techniques include:
+
+ 1. **Tokenization**: Breaking down text into smaller units (e.g., words, sentences).
+ 2. **Part-of-Speech (POS) Tagging**: Identifying the grammatical roles of words in a sentence (e.g., noun, verb, adjective).
+ 3. **Named Entity Recognition (NER)**: Identifying entities such as names, dates, locations, etc.
+ 4. **Dependency Parsing**: Analyzing the syntactic structure of sentences.
+ 5. **Sentiment Analysis**: Analyzing the sentiment of text (positive, negative, neutral).
+ 6. **Word Embeddings**: Representing words as vectors in a continuous space (e.g., Word2Vec, GloVe).
+
+ **Example**: Sentiment analysis can be used to identify whether customer reviews are positive, negative, or neutral based on the words used in the text.
+ """)
+
+     elif technique_option == "Tokenization":
+         st.write("""
+ #### 1. Tokenization
+ Tokenization is the process of splitting text into smaller units, such as words, sentences, or subwords. This is a key preprocessing step for many NLP tasks.
+ - **Example**:
+   - Sentence: "Natural Language Processing is awesome!"
+   - Tokenized words: ["Natural", "Language", "Processing", "is", "awesome"]
+ """)
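A regex-based sketch of word and sentence tokenization (illustrative only; the function names are mine, punctuation handling is deliberately naive, and production code would use NLTK's or spaCy's tokenizers):

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Split on runs of word characters; punctuation is dropped
    # (real tokenizers usually keep it as separate tokens).
    return re.findall(r"\w+", text)

def sent_tokenize(text: str) -> list[str]:
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("Natural Language Processing is awesome!"))
# ['Natural', 'Language', 'Processing', 'is', 'awesome']
```

This matches the example above; note the trailing "!" disappears because the regex only captures word characters.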
+
+     elif technique_option == "Stop Words Removal":
+         st.write("""
+ #### 2. Stop Words Removal
+ Stop words are commonly used words like "the", "is", "at", etc., that do not carry much information in many NLP tasks. Removing stop words helps reduce the dimensionality and noise in the data.
+ - **Example**: Removing "is" from the sentence "NLP is amazing!"
+ """)
+
+     elif technique_option == "Lemmatization":
+         st.write("""
+ #### 3. Lemmatization
+ Lemmatization is the process of converting words into their root or base form based on context. It is more sophisticated than stemming, as it considers the meaning of words.
+ - **Example**: "better" → "good", "running" → "run".
+ """)
+
+     elif technique_option == "Stemming":
+         st.write("""
+ #### 4. Stemming
+ Stemming is the process of reducing words to their root form by removing prefixes or suffixes. This technique may result in non-dictionary words.
+ - **Example**: "running" → "run", "happiness" → "happi".
+ """)
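A crude suffix-stripping stemmer can illustrate why stemming yields non-words like "happi". This is a toy sketch (the function and its suffix list are invented here; a real app would use NLTK's `PorterStemmer` or `SnowballStemmer`):

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip a few common suffixes; can yield non-words."""
    for suffix in ("ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

print(naive_stem("running"), naive_stem("happiness"))
# run happi
```

Even this toy version shows the trade-off: "happiness" becomes the non-word "happi", exactly as in the example above.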
+
+     elif technique_option == "Bag of Words (BoW)":
+         st.write("""
+ #### 5. Bag of Words (BoW)
+ The Bag of Words model represents text as a set of individual words, disregarding grammar and word order but keeping multiplicity. It is a simple and widely used method for text representation.
+ - **Example**:
+   - Text: "I love NLP"
+   - BoW: {"I": 1, "love": 1, "NLP": 1}
+ """)
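The BoW example above is a few lines with `collections.Counter` (a minimal sketch; real pipelines would normalize case and use a tool like scikit-learn's `CountVectorizer`):

```python
from collections import Counter

def bag_of_words(text: str) -> dict[str, int]:
    # Count word occurrences, ignoring order and grammar.
    return dict(Counter(text.split()))

print(bag_of_words("I love NLP"))
# {'I': 1, 'love': 1, 'NLP': 1}

# Against a fixed vocabulary, the counts become a fixed-length vector.
vocab = ["I", "love", "NLP", "is", "great"]
vector = [bag_of_words("I love NLP").get(w, 0) for w in vocab]
print(vector)
# [1, 1, 1, 0, 0]
```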
+
+     elif technique_option == "TF-IDF":
+         st.write("""
+ #### 6. TF-IDF (Term Frequency-Inverse Document Frequency)
+ TF-IDF helps determine the importance of a word in a document relative to the entire dataset. It reduces the weight of common words and increases the weight of rare but important words.
+ - **Example**: The word "data" might have a high TF-IDF score in a document about data analysis but a low score in a document about cooking.
+ """)
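The TF-IDF arithmetic can be computed by hand for a tiny corpus. This sketch uses the plain definitions tf = count/len(doc) and idf = ln(N/df), with no smoothing; note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalization, so its numbers differ:

```python
import math

def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Plain TF-IDF: tf = count / doc length, idf = ln(N / document frequency)."""
    n_docs = len(corpus)
    df: dict[str, int] = {}              # document frequency per word
    for doc in corpus:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in corpus:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / df[word])
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

docs = [["i", "love", "nlp"], ["nlp", "is", "amazing"]]
scores = tf_idf(docs)
print(round(scores[0]["nlp"], 2), round(scores[0]["love"], 2))
# 0.0 0.23
```

"nlp" appears in both documents, so its IDF is ln(2/2) = 0 and its score is 0, while "love" scores (1/3) × ln(2) ≈ 0.23.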
+
+     elif technique_option == "Word Embeddings":
+         st.write("""
+ #### 7. Word Embeddings
+ Word embeddings are vector representations of words that capture semantic relationships. Words with similar meanings have similar vectors. Common word embedding models include:
+ - **Word2Vec**
+ - **GloVe**
+ - **FastText**
+
+ **Example**: The words "king" and "queen" would have similar vector representations because they share semantic relationships.
+ """)
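"Similar vectors" is usually measured with cosine similarity. A sketch on made-up 3-dimensional vectors (real embeddings from Word2Vec or GloVe have hundreds of dimensions; the values below are invented purely to illustrate the comparison):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors, invented for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

# Semantically related words should score closer to 1.0 than unrelated ones.
print(cosine_similarity(king, queen) > cosine_similarity(king, apple))
# True
```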
+
+     elif technique_option == "Named Entity Recognition (NER)":
+         st.write("""
+ #### 8. Named Entity Recognition (NER)
+ NER is the task of identifying named entities such as persons, organizations, locations, and dates in text. This technique is commonly used for information extraction.
+ - **Example**: "Barack Obama was born in Hawaii."
+   - Entities: ["Barack Obama" (Person), "Hawaii" (Location)]
+ """)
+
+     elif technique_option == "Part-of-Speech (POS) Tagging":
+         st.write("""
+ #### 9. Part-of-Speech (POS) Tagging
+ POS tagging involves assigning grammatical labels (such as noun, verb, adjective) to each word in a sentence.
+ - **Example**: "The cat sat on the mat."
+   - POS Tags: [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
+ """)
+
+     elif technique_option == "Sentiment Analysis":
+         st.write("""
+ #### 10. Sentiment Analysis
+ Sentiment analysis involves determining the sentiment of a piece of text, typically categorizing it as positive, negative, or neutral. This is commonly used for customer feedback and social media monitoring.
+ - **Example**: "I love this product!" → Positive Sentiment
+ """)
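The simplest form of sentiment analysis is lexicon-based scoring: count positive versus negative words. A toy sketch (the word lists and function are invented here; real systems use lexicons like VADER or trained classifiers):

```python
# Toy sentiment lexicons, invented for illustration.
POSITIVE = {"love", "great", "amazing", "good", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "poor"}

def simple_sentiment(text: str) -> str:
    """Lexicon-based scoring: positive word count minus negative word count."""
    words = [w.strip("!.,?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(simple_sentiment("I love this product!"))
# Positive
```

This approach fails on negation ("not good") and sarcasm, which is why the context-aware models discussed elsewhere in this page exist.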
+