NLP / pages /4.Feature Engineering.py
UmaKumpatla's picture
Update pages/4.Feature Engineering.py
84b9be8 verified
import streamlit as st
# Function to display the Home Page
def show_home_page():
st.title("πŸ”¦ :red[Natural Language Processing (NLP)]")
st.markdown(
"""
### :green[Welcome to the NLP Guide]
Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between
computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.
This guide explores key NLP concepts and techniques, from basic terminologies to advanced vectorization methods. Use the sidebar to explore each topic in detail.
#### :green[Applications of NLP:]
- Chatbots and virtual assistants (e.g., Alexa, Siri)
- Sentiment analysis
- Language translation tools (e.g., Google Translate)
- Text summarization and more!
"""
)
st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")
# Function to display specific topic pages
def show_page(page):
if page == "NLP Terminologies":
st.title("πŸ” :blue[NLP Terminologies]")
st.markdown(
"""
### :red[Key NLP Terms:]
- **Tokenization**: Splitting text into smaller units like words or sentences.
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
- **Stemming**: Reducing words to their root form (e.g., "running" β†’ "run").
- **Lemmatization**: Converting words to their dictionary base form (e.g., "running" β†’ "run").
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of unique words in a corpus.
- **n-grams**: Sequences of *n* words or characters in text.
- **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing the grammatical structure of a sentence.
"""
)
elif page == "One-Hot Vectorization":
st.title("πŸ”§ :green[One-Hot Vectorization]")
st.markdown(
"""
### :red[One-Hot Vectorization Explained]
One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.
#### :red[How It Works:]
- Each unique word in the vocabulary is assigned an index.
- The vector for a word is all zeros except for a `1` at the index of that word.
#### :red[Example:]
Vocabulary: ["cat", "dog", "bird"]
- "cat" β†’ [1, 0, 0]
- "dog" β†’ [0, 1, 0]
- "bird" β†’ [0, 0, 1]
#### :red[Advantages:]
- Simple and intuitive to implement.
#### :red[Limitations:]
- High dimensionality for large vocabularies.
- Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).
#### :red[Applications:]
- Suitable for small datasets where simplicity is a priority.
"""
)
elif page == "Bag of Words":
st.title("πŸ”„ :green[Bag of Words (BoW)]")
st.markdown(
"""
### :orange[Bag of Words (BoW) Method]
Bag of Words is a way of representing text by counting word occurrences while ignoring word order.
#### :orange[How It Works:]
1. Create a vocabulary of all unique words in the text.
2. Count the frequency of each word in a document.
#### :orange[Example:]
Given two sentences:
- Sentence 1: "I love NLP."
- Sentence 2: "I love programming."
Vocabulary: ["I", "love", "NLP", "programming"]
- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]
#### :orange[Advantages:]
- Simple to implement and interpret.
#### :orange[Limitations:]
- High dimensionality for large vocabularies.
- Ignores word order and semantic meaning.
- Sensitive to noisy or frequent terms.
#### :orange[Applications:]
- Text classification and clustering.
"""
)
elif page == "TF-IDF Vectorizer":
st.title("πŸ”„ :blue[TF-IDF Vectorizer]")
st.markdown(
"""
### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).
#### :rainbow[Formula:]
\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
- **TF (Term Frequency)**: Frequency of a word in a document divided by the total words in the document.
- **IDF (Inverse Document Frequency)**: Logarithm of total documents divided by the number of documents containing the word.
#### :rainbow[Example:]
For the corpus:
- Document 1: "NLP is amazing."
- Document 2: "NLP is fun and amazing."
Words like "fun" and "amazing" will have higher weights than commonly occurring words like "is".
#### :rainbow[Advantages:]
- Highlights unique and relevant terms.
- Reduces the impact of frequent, less informative words.
#### :rainbow[Applications:]
- Information retrieval, search engines, and document classification.
"""
)
elif page == "Word2Vec":
st.title("🌐 :red[Word2Vec]")
st.markdown(
"""
### :green[Word2Vec]
Word2Vec creates dense vector representations of words, capturing semantic relationships using neural networks.
#### :green[Key Models:]
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- **Skip-gram**: Predicts the context from a target word.
#### :green[Example:]
Word2Vec can capture relationships like:
- "king" - "man" + "woman" β‰ˆ "queen"
#### :green[Advantages:]
- Captures semantic meaning and relationships.
- Efficient for large datasets.
#### :green[Applications:]
- Sentiment analysis, recommendation systems, and machine translation.
#### :green[Limitations:]
- Computationally intensive for training on large datasets.
"""
)
elif page == "FastText":
st.title("πŸ”„ :red[FastText]")
st.markdown(
"""
### :blue[FastText]
FastText extends Word2Vec by representing words as character n-grams, enabling it to handle rare and out-of-vocabulary words.
#### :blue[Example:]
The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".
#### :blue[Advantages:]
- Handles rare words and misspellings.
- Captures subword information (e.g., prefixes and suffixes).
#### :blue[Applications:]
- Multilingual text processing.
- Working with noisy or incomplete data.
#### :blue[Limitations:]
- Higher computational cost than Word2Vec.
"""
)
elif page == "Tokenization":
st.title("πŸ”’ :blue[Tokenization]")
st.markdown(
"""
### :red[Tokenization]
Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.
#### :red[Types:]
- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.
#### :red[Example:]
Sentence: "NLP is exciting."
- Word Tokens: ["NLP", "is", "exciting", "."]
#### :red[Libraries:]
- NLTK
- SpaCy
- Hugging Face Transformers
#### :red[Challenges:]
- Handling complex text (e.g., abbreviations, contractions, multilingual data).
#### :red[Applications:]
- Preprocessing for machine learning models.
"""
)
elif page == "Stop Words":
st.title("πŸ” :green[Stop Words]")
st.markdown(
"""
### :rainbow[Stop Words]
Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").
#### :rainbow[Why Remove Stop Words?]
- To reduce noise and focus on meaningful terms in text.
#### :rainbow[Example Stop Words:]
- English: "is", "the", "and".
- Spanish: "es", "el", "y".
#### :rainbow[Challenges:]
- Some stop words might carry important context in specific use cases.
#### :rainbow[Applications:]
- Sentiment analysis, text classification, and search engines.
"""
)
# Sidebar navigation
st.sidebar.title("πŸ” NLP Topics")
menu_options = [
"Home",
"NLP Terminologies",
"One-Hot Vectorization",
"Bag of Words",
"TF-IDF Vectorizer",
"Word2Vec",
"FastText",
"Tokenization",
"Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)
# Display the selected page
if selected_page == "Home":
show_home_page()
else:
show_page(selected_page)