import streamlit as st

# Apply custom CSS styling
st.markdown(""" """, unsafe_allow_html=True)

# Page Configuration
st.title("Interactive NLP Guide")

# Sidebar Navigation
st.sidebar.title("Explore NLP Topics")
topics = [
    "Introduction",
    "Tokenization",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word Embeddings",
]
selected_topic = st.sidebar.radio("Select a topic", topics)

# Content Based on Selection
if selected_topic == "Introduction":
    st.markdown("<h1>Natural Language Processing (NLP)</h1>", unsafe_allow_html=True)
    st.markdown("<h2>Introduction to NLP</h2>", unsafe_allow_html=True)
    st.markdown("""

Natural Language Processing (NLP) is a field at the intersection of linguistics and computer science, focusing on enabling computers to understand, interpret, and respond to human language.

Applications of NLP:

- Chatbots and Virtual Assistants (e.g., Alexa, Siri)
- Machine Translation (e.g., Google Translate)
- Text Summarization
- Sentiment Analysis
- Speech Recognition Systems

""", unsafe_allow_html=True)

elif selected_topic == "Tokenization":
    st.markdown("<h1>Tokenization</h1>", unsafe_allow_html=True)
    st.markdown("<h2>What is Tokenization?</h2>", unsafe_allow_html=True)
    st.markdown("""

Tokenization is the process of breaking text into smaller units, such as sentences or words, called tokens. It is typically one of the first steps in an NLP pipeline.

Types of Tokenization:

- Word Tokenization: Splits text into words and punctuation marks (e.g., "I love NLP." → ["I", "love", "NLP", "."])
- Sentence Tokenization: Splits text into sentences (e.g., "NLP is fascinating. It's the future." → ["NLP is fascinating.", "It's the future."])

Code Example:

""", unsafe_allow_html=True)
    st.code("""
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer models
# (recent NLTK releases use the "punkt_tab" resource instead)
nltk.download("punkt")

text = "Natural Language Processing is exciting. Let's explore it!"

word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)

print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)
""", language="python")

elif selected_topic == "One-Hot Vectorization":
    st.markdown("<h1>One-Hot Vectorization</h1>", unsafe_allow_html=True)
    st.markdown("""

One-Hot Vectorization is a method to represent text where each unique word is converted into a unique binary vector.

How It Works:

- Each word in the vocabulary is assigned an index.
- The vector is all zeros except for a 1 at the word's index.
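
The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a library API: the `vocabulary` list, the `word_to_index` mapping, and the `one_hot` helper are names assumed for this example.

```python
# Minimal one-hot encoding sketch (illustrative helper, not a library API).
vocabulary = ["cat", "dog", "bird"]

# Step 1: assign each word an index based on its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # Step 2: start with all zeros, then set a single 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
```

Each vector is as long as the vocabulary, so the representation grows with every new word. A word outside the vocabulary raises a `KeyError` in this sketch.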

Example:

Vocabulary: ["cat", "dog", "bird"]

- "cat" → [1, 0, 0]
- "dog" → [0, 1, 0]
- "bird" → [0, 0, 1]

Limitations:

- High dimensionality for large vocabularies.
- Does not capture semantic relationships between words.

""", unsafe_allow_html=True)

elif selected_topic == "Bag of Words":
    st.markdown("<h1>Bag of Words (BoW)</h1>", unsafe_allow_html=True)
    st.markdown("""

Bag of Words represents text as word frequency counts, disregarding word order.

How It Works:

- Create a vocabulary of unique words.
- Count the frequency of each word in a document.
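
As a rough sketch of these two steps, the plain-Python snippet below builds a vocabulary and counts words per sentence. The `tokenize` and `bag_of_words` helpers are hypothetical names chosen for this example, and the tokenizer is deliberately naive (it only strips periods and splits on whitespace).

```python
# Minimal Bag of Words sketch in plain Python (helper names are illustrative).
from collections import Counter

sentences = ["I love NLP.", "I love programming."]

def tokenize(text):
    # Naive tokenizer: drop periods and split on whitespace.
    return text.replace(".", "").split()

# Step 1: build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

def bag_of_words(text):
    # Step 2: count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(sentence) for sentence in sentences]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Because only counts are kept, two sentences with the same words in a different order produce identical vectors.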

Example:

Given Sentences:

- "I love NLP."
- "I love programming."

Vocabulary: ["I", "love", "NLP", "programming"]

- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]

""", unsafe_allow_html=True)

elif selected_topic == "TF-IDF Vectorizer":
    st.markdown("<h1>TF-IDF Vectorizer</h1>", unsafe_allow_html=True)
    st.markdown("""

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).

Formula:

""", unsafe_allow_html=True)
    st.latex(r'''
        \text{TF-IDF} = \text{TF} \times \text{IDF}
    ''')
    st.markdown("""

- Term Frequency (TF): How often a word occurs in a document.
- Inverse Document Frequency (IDF): Logarithm of the ratio of the total number of documents to the number of documents containing the word, so words that appear in fewer documents receive a higher weight.
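
The definitions above can be worked through on a toy corpus. The sketch below assumes TF is the word's count divided by the document length and uses the unsmoothed IDF described here. The corpus and the function names are made up for illustration, and library implementations often add smoothing, so their numbers may differ.

```python
# Minimal TF-IDF sketch in plain Python (toy corpus and helper names are illustrative).
import math

documents = [
    ["i", "love", "nlp"],
    ["i", "love", "programming"],
]

def term_frequency(term, document):
    # Assumption: TF is the term's count divided by the document length.
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term), with no smoothing.
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

for term in ["love", "nlp"]:
    print(term, round(tf_idf(term, documents[0], documents), 4))
```

Here "love" scores 0 because it appears in every document, while "nlp" gets a positive score because it appears in only one. A term that appears in no document would cause a division by zero in this sketch.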

""", unsafe_allow_html=True)

elif selected_topic == "Word Embeddings":
    st.markdown("<h1>Word Embeddings</h1>", unsafe_allow_html=True)
    st.markdown("""

Word Embeddings are dense vector representations of words that capture semantic meanings and relationships.

Key Features:

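- Captures semantic relationships between words (e.g., the vector for "king" - "man" + "woman" lies close to the vector for "queen").
- Compact representation: the vector size is fixed (typically a few hundred dimensions) and does not grow with the vocabulary.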
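
To make the analogy concrete, here is a toy sketch using hand-made 2-D vectors. The numbers and the two axes (royalty, maleness) are invented purely for illustration. Real embeddings are learned from data and have far more dimensions, and real analogy searches usually exclude the input words from the candidates.

```python
# Toy word-analogy sketch with hand-made 2-D vectors.
# The values are invented for illustration and are not from any trained model.
import math

embeddings = {
    "king": [3.0, 3.0],   # hypothetical axes: (royalty, maleness)
    "queen": [3.0, 1.0],
    "man": [1.0, 3.0],
    "woman": [1.0, 1.0],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Pick the word whose vector points in the most similar direction.
closest = max(embeddings, key=lambda word: cosine_similarity(embeddings[word], target))
print(closest)
```

With trained models such as Word2Vec or GloVe the match is approximate rather than exact, which is why the relationship is usually described as "close to" rather than "equal to".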

Popular Word Embedding Models:

- Word2Vec
- GloVe
- FastText

""", unsafe_allow_html=True)

# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("Explore each topic to dive deeper into NLP concepts!")