# From_Zero_to_ML_Hero/pages/9_natural_language_processing.py
import streamlit as st
# Apply custom CSS styling
st.markdown("""
<style>
body {
background-color: #eef2f7;
}
h1 {
color: #00FFFF;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
h2, h3 {
font-family: 'Roboto', sans-serif;
font-weight: 600;
}
h2 {
color: #FFFACD;
}
h3 {
color: #ba95b0;
}
p, ul, ol {
font-family: 'Georgia', serif;
line-height: 1.8;
color: #495057;
}
ul {
margin-left: 20px;
}
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: #495057;
}
.icon-bullet li::before {
content: "βœ”οΈ";
padding-right: 10px;
color: #00FFFF;
}
</style>
""", unsafe_allow_html=True)
# Page Title
st.title("Interactive NLP Guide")
# Sidebar Navigation
st.sidebar.title("Explore NLP Topics")
topics = [
"Introduction",
"Tokenization",
"One-Hot Vectorization",
"Bag of Words",
"TF-IDF Vectorizer",
"Word Embeddings",
]
selected_topic = st.sidebar.radio("Select a topic", topics)
# Content Based on Selection
if selected_topic == "Introduction":
    st.markdown("<h1>Natural Language Processing (NLP)</h1>", unsafe_allow_html=True)
    st.markdown("<h2>Introduction to NLP</h2>", unsafe_allow_html=True)
    st.markdown("""
<p>Natural Language Processing (NLP) is a field at the intersection of linguistics and computer science, focusing on enabling computers to understand, interpret, and respond to human language.</p>
<h3>Applications of NLP:</h3>
<ul>
<li>Chatbots and Virtual Assistants (e.g., Alexa, Siri)</li>
<li>Machine Translation (e.g., Google Translate)</li>
<li>Text Summarization</li>
<li>Sentiment Analysis</li>
<li>Speech Recognition Systems</li>
</ul>
""", unsafe_allow_html=True)
elif selected_topic == "Tokenization":
    st.markdown("<h1>Tokenization</h1>", unsafe_allow_html=True)
    st.markdown("<h2>What is Tokenization?</h2>", unsafe_allow_html=True)
    st.markdown("""
<p>Tokenization is the process of breaking a text down into smaller units called tokens, such as words or sentences. It is typically the first step in any NLP pipeline.</p>
<h3>Types of Tokenization:</h3>
<ul>
<li><b>Word Tokenization:</b> Splits text into words (e.g., "I love NLP." &rarr; ["I", "love", "NLP"])</li>
<li><b>Sentence Tokenization:</b> Splits text into sentences (e.g., "NLP is fascinating. It's the future." &rarr; ["NLP is fascinating.", "It's the future."])</li>
</ul>
<h3>Code Example:</h3>
""", unsafe_allow_html=True)
    st.code("""
import nltk
nltk.download("punkt")  # tokenizer models, needed once per environment

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing is exciting. Let's explore it!"
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)
print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)
""", language="python")
elif selected_topic == "One-Hot Vectorization":
    st.markdown("<h1>One-Hot Vectorization</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>One-Hot Vectorization is a method to represent text where each unique word is converted into a unique binary vector.</p>
<h3>How It Works:</h3>
<ul>
<li>Each word in the vocabulary is assigned an index.</li>
<li>The vector is all zeros except for a <code>1</code> at the word's index.</li>
</ul>
<h3>Example:</h3>
<ul>
<li>Vocabulary: ["cat", "dog", "bird"]</li>
<li>"cat" &rarr; [1, 0, 0]</li>
<li>"dog" &rarr; [0, 1, 0]</li>
<li>"bird" &rarr; [0, 0, 1]</li>
</ul>
<h3>Limitations:</h3>
<ul>
<li>High dimensionality for large vocabularies.</li>
<li>Does not capture semantic relationships between words.</li>
</ul>
""", unsafe_allow_html=True)
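    # Hypothetical illustration (not from the original page): one-hot vectors for
    # the toy vocabulary above, built with plain Python.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
vocab = ["cat", "dog", "bird"]

def one_hot(word, vocab):
    # all zeros except a 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("cat", vocab))  # [1, 0, 0]
print(one_hot("dog", vocab))  # [0, 1, 0]
""", language="python")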
elif selected_topic == "Bag of Words":
    st.markdown("<h1>Bag of Words (BoW)</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>Bag of Words represents text as word frequency counts, disregarding word order.</p>
<h3>How It Works:</h3>
<ul>
<li>Create a vocabulary of unique words.</li>
<li>Count the frequency of each word in a document.</li>
</ul>
<h3>Example:</h3>
<ul>
<li>Given Sentences:
<ul>
<li>"I love NLP."</li>
<li>"I love programming."</li>
</ul>
</li>
<li>Vocabulary: ["I", "love", "NLP", "programming"]</li>
<li>Sentence 1: [1, 1, 1, 0]</li>
<li>Sentence 2: [1, 1, 0, 1]</li>
</ul>
""", unsafe_allow_html=True)
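    # Hypothetical illustration (not from the original page): the same two sentences
    # vectorized with scikit-learn's CountVectorizer. Note that its default tokenizer
    # lowercases text and drops one-character tokens such as "I".
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP.", "I love programming."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['love' 'nlp' 'programming']
print(bow.toarray())  # [[1 1 0]
                      #  [1 0 1]]
""", language="python")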
elif selected_topic == "TF-IDF Vectorizer":
    st.markdown("<h1>TF-IDF Vectorizer</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).</p>
<h3>Formula:</h3>
""", unsafe_allow_html=True)
    st.latex(r'''
\text{TF-IDF} = \text{TF} \times \text{IDF}
''')
    st.markdown("""
<ul>
<li><b>Term Frequency (TF):</b> Frequency of a word in a document.</li>
<li><b>Inverse Document Frequency (IDF):</b> Logarithm of the ratio of the total number of documents to the number of documents containing the word.</li>
</ul>
""", unsafe_allow_html=True)
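    # Hypothetical illustration (not from the original page): computing TF-IDF
    # weights with scikit-learn's TfidfVectorizer on a toy corpus. A word shared
    # across documents (like "love") gets a lower weight than a distinctive one.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP.", "I love programming.", "NLP powers chatbots."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
""", language="python")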
elif selected_topic == "Word Embeddings":
    st.markdown("<h1>Word Embeddings</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>Word Embeddings are dense vector representations of words that capture semantic meanings and relationships.</p>
<h3>Key Features:</h3>
<ul>
<li>Capture semantic relationships between words (e.g., "king" - "man" + "woman" &asymp; "queen").</li>
<li>Efficient representation for large vocabularies.</li>
</ul>
<h3>Popular Word Embedding Models:</h3>
<ul>
<li>Word2Vec</li>
<li>GloVe</li>
<li>FastText</li>
</ul>
""", unsafe_allow_html=True)
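    # Hypothetical illustration (not from the original page): training a tiny
    # Word2Vec model with gensim. Real embeddings need far larger corpora.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from gensim.models import Word2Vec

sentences = [
    ["nlp", "is", "fascinating"],
    ["word", "embeddings", "capture", "meaning"],
    ["nlp", "uses", "word", "embeddings"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["nlp"].shape)  # (50,)
print(model.wv.most_similar("nlp", topn=3))
""", language="python")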
# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("Explore each topic to dive deeper into NLP concepts!")