import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from gensim.models import Word2Vec
# Title
st.title(":red[Introduction to NLP]")
# Section: What is NLP?
st.header(":blue[What is NLP?]")
st.write("""
Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to process, understand, and generate human language.
### Applications of NLP:
- **Chatbots & Virtual Assistants** (e.g., Siri, Alexa)
- **Sentiment Analysis** (e.g., Product reviews, Social Media monitoring)
- **Machine Translation** (e.g., Google Translate)
- **Text Summarization** (e.g., News article summaries)
- **Speech Recognition** (e.g., Voice commands)
""")
# Section: NLP Terminologies
st.header(":blue[NLP Terminologies]")
st.write("""
- **Corpus**: A collection of text documents used for NLP tasks.
- **Tokenization**: Splitting text into individual words or phrases.
- **Stop Words**: Common words (e.g., "the", "is") that are often removed.
- **Stemming**: Reducing words to their base form (e.g., "running" → "run").
- **Lemmatization**: More advanced than stemming; it converts words to their dictionary form.
- **Named Entity Recognition (NER)**: Identifies entities like names, dates, and locations.
- **Sentiment Analysis**: Determines the sentiment (positive, negative, neutral) of a text.
- **n-grams**: Sequences of 'n' consecutive words (e.g., "New York" is a bi-gram).
""")
# Section: Text Representation Methods
st.header(":blue[Text Representation Methods]")
methods = [
"Bag of Words",
"TF-IDF",
"One-Hot Encoding",
"Word Embeddings (Word2Vec)"
]
selected_method = st.radio("Select a text representation method:", methods)
if selected_method == "Bag of Words":
st.subheader(":blue[Bag of Words (BoW)]")
st.write("""
**Definition**: Bag of Words (BoW) is a simple text representation technique that converts text into numerical data by counting the occurrence of each word in a document. It ignores grammar, word order, and context.
**How it works**:
- Each unique word in a dataset becomes a feature.
- The text is converted into a frequency-based numerical representation.
- The more a word appears in a document, the higher its count.
**Uses**:
- Sentiment analysis
- Document classification
- Spam detection
- Information retrieval
**Advantages**:
✅ Simple and easy to implement
✅ Works well with traditional machine learning models
**Disadvantages**:
❌ Ignores word order and meaning
❌ High dimensionality for large vocabularies
❌ Cannot differentiate between synonyms (e.g., "happy" and "joyful")
""")
elif selected_method == "TF-IDF":
st.subheader(":blue[Term Frequency-Inverse Document Frequency (TF-IDF)]")
st.write("""
**Definition**: TF-IDF is an advanced version of Bag of Words that assigns importance to words based on how frequently they appear in a document while reducing the importance of common words.
**How it works**:
- **Term Frequency (TF)**: Measures how often a word appears in a document.
- **Inverse Document Frequency (IDF)**: Reduces the weight of words that are very common across all documents.
- The final score is calculated as: **TF × IDF**.
**Uses**:
- Information retrieval (e.g., search engines)
- Text classification
- Keyword extraction
- Document similarity detection
**Advantages**:
✅ Reduces the impact of common words like "the", "is", etc.
✅ Highlights important words in a document
✅ Better than BoW at capturing relevance
**Disadvantages**:
❌ Still ignores word order
❌ Cannot capture deep semantic meaning
❌ Computationally expensive for very large datasets
""")
elif selected_method == "One-Hot Encoding":
st.subheader(":blue[One-Hot Encoding]")
st.write("""
**Definition**: One-hot encoding is a simple representation method where each unique word in a vocabulary is represented as a binary vector.
**How it works**:
- Each word is assigned a unique index in a vocabulary.
- A word is represented as a vector where all values are 0 except for the position of that word, which is 1.
- For example, if the vocabulary consists of ["NLP", "is", "great"], then "NLP" is represented as **[1, 0, 0]**.
**Uses**:
- Simple NLP tasks
- Word-level feature engineering
- Early-stage text processing in machine learning models
**Advantages**:
✅ Simple and easy to understand
✅ Works well for small vocabulary sizes
**Disadvantages**:
❌ Inefficient for large vocabularies (results in sparse vectors)
❌ Does not capture word meaning or relationships
""")
elif selected_method == "Word Embeddings (Word2Vec)":
st.subheader(":blue[Word Embeddings (Word2Vec)]")
st.write("""
**Definition**: Word embeddings convert words into dense numerical vectors that capture semantic meaning. Unlike BoW and TF-IDF, word embeddings preserve relationships between words.
**How it works**:
- Words are represented as high-dimensional vectors (e.g., 100 or 300 dimensions).
- Words with similar meanings have closer vectors.
- It is trained using techniques like **CBOW (Continuous Bag of Words)** and **Skip-gram**.
**Uses**:
- Machine translation
- Speech recognition
- Sentiment analysis
- Document clustering
**Advantages**:
✅ Captures semantic relationships between words
✅ Works well with deep learning models
✅ Can detect synonyms and analogies (e.g., "king" - "man" + "woman" ≈ "queen")
**Disadvantages**:
❌ Requires large datasets to train
❌ Computationally expensive
❌ Needs domain-specific tuning for best performance
""")
# Footer
st.write("---")
st.write("Developed with β€οΈ using Streamlit for NLP enthusiasts.")