import streamlit as st
# Apply custom CSS styling
st.markdown("""
""", unsafe_allow_html=True)
# Page Configuration
st.title("Interactive NLP Guide")
# Sidebar Navigation
st.sidebar.title("Explore NLP Topics")
topics = [
    "Introduction",
    "Tokenization",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word Embeddings",
]
selected_topic = st.sidebar.radio("Select a topic", topics)
# Content Based on Selection
if selected_topic == "Introduction":
    st.header("Natural Language Processing (NLP)")
    st.subheader("Introduction to NLP")
    st.markdown("""
Natural Language Processing (NLP) is a field at the intersection of linguistics and computer science that focuses on enabling computers to understand, interpret, and respond to human language.

Applications of NLP:
- Chatbots and Virtual Assistants (e.g., Alexa, Siri)
- Machine Translation (e.g., Google Translate)
- Text Summarization
- Sentiment Analysis
- Speech Recognition Systems
""", unsafe_allow_html=True)
elif selected_topic == "Tokenization":
    st.header("Tokenization")
    st.subheader("What is Tokenization?")
    st.markdown("""
Tokenization is the process of breaking text into smaller units called tokens, such as sentences or words. It is typically one of the first steps in an NLP pipeline.

Types of Tokenization:
- Word Tokenization: Splits text into words and punctuation marks (e.g., "I love NLP." → ["I", "love", "NLP", "."] with NLTK's `word_tokenize`)
- Sentence Tokenization: Splits text into sentences (e.g., "NLP is fascinating. It's the future." → ["NLP is fascinating.", "It's the future."])

Code Example:
""", unsafe_allow_html=True)
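    st.code("""
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer data
# (recent NLTK releases look for "punkt_tab" instead of "punkt")
nltk.download("punkt")

text = "Natural Language Processing is exciting. Let's explore it!"
word_tokens = word_tokenize(text)      # words and punctuation marks
sentence_tokens = sent_tokenize(text)  # one string per sentence
print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)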
""", language="python")
elif selected_topic == "One-Hot Vectorization":
    st.header("One-Hot Vectorization")
    st.markdown("""
One-Hot Vectorization represents text by mapping each unique word to its own binary vector.

How It Works:
- Each word in the vocabulary is assigned an index.
- The vector is all zeros except for a 1 at the word's index.

Example:
- Vocabulary: ["cat", "dog", "bird"]
- "cat" → [1, 0, 0]
- "dog" → [0, 1, 0]
- "bird" → [0, 0, 1]

Limitations:
- High dimensionality for large vocabularies: each vector is as long as the vocabulary.
- Does not capture semantic relationships between words: every pair of distinct words is equally far apart.
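
The indexing scheme described above can be sketched in a few lines of plain Python. The `one_hot` helper is a hypothetical name chosen for this illustration, not a library function, and the sketch assumes every word passed in is already in the vocabulary.

```python
# Hypothetical helper for illustration; not part of any library.
def one_hot(word, vocabulary):
    # Start with all zeros, then set a 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["cat", "dog", "bird"]
encoded = {word: one_hot(word, vocabulary) for word in vocabulary}
print(encoded["cat"])   # [1, 0, 0]
print(encoded["bird"])  # [0, 0, 1]
```

Each vector has exactly as many entries as the vocabulary has words, which is where the dimensionality limitation comes from.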
""", unsafe_allow_html=True)
elif selected_topic == "Bag of Words":
    st.header("Bag of Words (BoW)")
    st.markdown("""
Bag of Words represents each document as a vector of word frequency counts, disregarding word order.

How It Works:
- Create a vocabulary of the unique words across all documents.
- Count how often each vocabulary word occurs in each document.

Example:
- Given Sentences:
  - "I love NLP."
  - "I love programming."
- Vocabulary: ["I", "love", "NLP", "programming"]
- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]
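
The example above can be reproduced with a minimal sketch built on `collections.Counter`. The `bag_of_words` helper and its naive tokenization (dropping the final period and splitting on spaces) are assumptions made for this illustration; a real pipeline would use a proper tokenizer.

```python
from collections import Counter

# Hypothetical helper for illustration; not part of any library.
def bag_of_words(sentence, vocabulary):
    # Naive tokenization (assumption): drop the final period, split on spaces.
    tokens = sentence.rstrip(".").split()
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocabulary]

vocabulary = ["I", "love", "NLP", "programming"]
vector_1 = bag_of_words("I love NLP.", vocabulary)
vector_2 = bag_of_words("I love programming.", vocabulary)
print(vector_1)  # [1, 1, 1, 0]
print(vector_2)  # [1, 1, 0, 1]
```

Because only counts are kept, "NLP love I." would produce the same vector as "I love NLP.", which is what disregarding word order means in practice.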
""", unsafe_allow_html=True)
elif selected_topic == "TF-IDF Vectorizer":
    st.header("TF-IDF Vectorizer")
    st.markdown("""
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus).

Formula:
""", unsafe_allow_html=True)
    st.latex(r'''
\text{TF-IDF} = \text{TF} \times \text{IDF}
''')
    st.markdown("""
- Term Frequency (TF): How often a word occurs in a document.
- Inverse Document Frequency (IDF): Logarithm of the ratio of the total number of documents to the number of documents containing the word. Words that appear in most documents therefore receive a low weight.
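
The two definitions above can be combined in a small sketch. The `tf_idf` helper is a hypothetical name, the sketch assumes TF is normalized by document length (one common convention), and it assumes the term occurs in at least one document. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math

# Hypothetical helper for illustration; not part of any library.
def tf_idf(term, document, corpus):
    # TF: share of the document's tokens that are the term (assumed convention).
    tf = document.count(term) / len(document)
    # IDF: log of (total documents / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / containing)
    return tf * idf

corpus = [
    ["i", "love", "nlp"],
    ["i", "love", "programming"],
]
score_common = tf_idf("love", corpus[0], corpus)  # term occurs in every document
score_rare = tf_idf("nlp", corpus[0], corpus)     # term occurs in one document
print(score_common, score_rare)
```

Here "love" scores 0 because it appears in both documents, so its IDF is log(2/2) = 0, while "nlp" gets a positive score because only one document contains it.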
""", unsafe_allow_html=True)
elif selected_topic == "Word Embeddings":
    st.header("Word Embeddings")
    st.markdown("""
Word Embeddings are dense vector representations of words that capture semantic meaning and relationships.

Key Features:
- Capture semantic relationships between words (e.g., "king" - "man" + "woman" ≈ "queen").
- Compact representation for large vocabularies: the vector length is fixed and does not grow with the vocabulary size.

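
The analogy above can be illustrated with a toy sketch. The 3-dimensional vectors below are made up by hand for this example; real embeddings are learned from data and usually have far more dimensions. Cosine similarity is used here as one common way to compare vectors.

```python
import math

# Made-up vectors for illustration only; not taken from any trained model.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Pick the word whose vector points in the direction closest to the target
nearest = max(embeddings, key=lambda word: cosine_similarity(embeddings[word], target))
print(nearest)  # queen
```

With these hand-picked numbers the nearest word is "queen"; with trained embeddings the result is only approximate, and the input words are usually excluded from the search.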
Popular Word Embedding Models:
""", unsafe_allow_html=True)
# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("Explore each topic to dive deeper into NLP concepts!")