# From_Zero_to_ML_Hero/pages/9_natural_language_processing.py
import streamlit as st
# Apply custom CSS styling
st.markdown("""
<style>
body {
background-color: #eef2f7;
}
h1 {
color: #00FFFF;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
h2, h3 {
font-family: 'Roboto', sans-serif;
font-weight: 600;
}
h2 {
color: #FFFACD;
}
h3 {
color: #ba95b0;
}
p, ul, ol {
font-family: 'Georgia', serif;
line-height: 1.8;
color: #495057;
}
ul {
margin-left: 20px;
}
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: #495057;
}
.icon-bullet li::before {
content: "βœ”οΈ";
padding-right: 10px;
color: #00FFFF;
}
</style>
""", unsafe_allow_html=True)
# Page Title
st.title("Interactive NLP Guide")
# Sidebar Navigation
st.sidebar.title("Explore NLP Topics")
topics = [
"Introduction",
"Tokenization",
"One-Hot Vectorization",
"Bag of Words",
"TF-IDF Vectorizer",
"Word Embeddings",
]
selected_topic = st.sidebar.radio("Select a topic", topics)
# Content Based on Selection
if selected_topic == "Introduction":
    st.markdown("<h1>Natural Language Processing (NLP)</h1>", unsafe_allow_html=True)
    st.markdown("<h2>Introduction to NLP</h2>", unsafe_allow_html=True)
    st.markdown("""
<p>Natural Language Processing (NLP) is a field at the intersection of linguistics and computer science, focusing on enabling computers to understand, interpret, and respond to human language.</p>
<h3>Applications of NLP:</h3>
<ul>
<li>Chatbots and Virtual Assistants (e.g., Alexa, Siri)</li>
<li>Machine Translation (e.g., Google Translate)</li>
<li>Text Summarization</li>
<li>Sentiment Analysis</li>
<li>Speech Recognition Systems</li>
</ul>
""", unsafe_allow_html=True)
elif selected_topic == "Tokenization":
    st.markdown("<h1>Tokenization</h1>", unsafe_allow_html=True)
    st.markdown("<h2>What is Tokenization?</h2>", unsafe_allow_html=True)
    st.markdown("""
<p>Tokenization is the process of breaking a text down into smaller units called tokens, such as words or sentences. It is typically the first step in any NLP pipeline.</p>
<h3>Types of Tokenization:</h3>
<ul>
<li><b>Word Tokenization:</b> Splits text into words (e.g., "I love NLP." &rarr; ["I", "love", "NLP"])</li>
<li><b>Sentence Tokenization:</b> Splits text into sentences (e.g., "NLP is fascinating. It's the future." &rarr; ["NLP is fascinating.", "It's the future."])</li>
</ul>
<h3>Code Example:</h3>
""", unsafe_allow_html=True)
    st.code("""
import nltk
nltk.download("punkt")  # tokenizer models, needed once per environment

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing is exciting. Let's explore it!"
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)
print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)
""", language="python")
elif selected_topic == "One-Hot Vectorization":
    st.markdown("<h1>One-Hot Vectorization</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>One-Hot Vectorization is a method to represent text where each unique word is converted into a unique binary vector.</p>
<h3>How It Works:</h3>
<ul>
<li>Each word in the vocabulary is assigned an index.</li>
<li>The vector is all zeros except for a <code>1</code> at the word's index.</li>
</ul>
<h3>Example:</h3>
<ul>
<li>Vocabulary: ["cat", "dog", "bird"]</li>
<li>"cat" &rarr; [1, 0, 0]</li>
<li>"dog" &rarr; [0, 1, 0]</li>
<li>"bird" &rarr; [0, 0, 1]</li>
</ul>
<h3>Limitations:</h3>
<ul>
<li>High dimensionality for large vocabularies.</li>
<li>Does not capture semantic relationships between words.</li>
</ul>
""", unsafe_allow_html=True)
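    # Hypothetical illustration (not from the original page): one-hot vectors for
    # the toy vocabulary above, built with plain Python.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
vocab = ["cat", "dog", "bird"]

def one_hot(word, vocab):
    # all zeros except a 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("cat", vocab))  # [1, 0, 0]
print(one_hot("dog", vocab))  # [0, 1, 0]
""", language="python")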
elif selected_topic == "Bag of Words":
    st.markdown("<h1>Bag of Words (BoW)</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>Bag of Words represents text as word frequency counts, disregarding word order.</p>
<h3>How It Works:</h3>
<ul>
<li>Create a vocabulary of unique words.</li>
<li>Count the frequency of each word in a document.</li>
</ul>
<h3>Example:</h3>
<ul>
<li>Given Sentences:
<ul>
<li>"I love NLP."</li>
<li>"I love programming."</li>
</ul>
</li>
<li>Vocabulary: ["I", "love", "NLP", "programming"]</li>
<li>Sentence 1: [1, 1, 1, 0]</li>
<li>Sentence 2: [1, 1, 0, 1]</li>
</ul>
""", unsafe_allow_html=True)
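    # Hypothetical illustration (not from the original page): the same two sentences
    # vectorized with scikit-learn's CountVectorizer. Note that its default tokenizer
    # lowercases text and drops one-character tokens such as "I".
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP.", "I love programming."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['love' 'nlp' 'programming']
print(bow.toarray())  # [[1 1 0]
                      #  [1 0 1]]
""", language="python")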
elif selected_topic == "TF-IDF Vectorizer":
    st.markdown("<h1>TF-IDF Vectorizer</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).</p>
<h3>Formula:</h3>
""", unsafe_allow_html=True)
    st.latex(r'''
\text{TF-IDF} = \text{TF} \times \text{IDF}
''')
    st.markdown("""
<ul>
<li><b>Term Frequency (TF):</b> Frequency of a word in a document.</li>
<li><b>Inverse Document Frequency (IDF):</b> Logarithm of the ratio of the total number of documents to the number of documents containing the word.</li>
</ul>
""", unsafe_allow_html=True)
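    # Hypothetical illustration (not from the original page): computing TF-IDF
    # weights with scikit-learn's TfidfVectorizer on a toy corpus. A word shared
    # across documents (like "love") gets a lower weight than a distinctive one.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP.", "I love programming.", "NLP powers chatbots."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
""", language="python")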
elif selected_topic == "Word Embeddings":
    st.markdown("<h1>Word Embeddings</h1>", unsafe_allow_html=True)
    st.markdown("""
<p>Word Embeddings are dense vector representations of words that capture semantic meanings and relationships.</p>
<h3>Key Features:</h3>
<ul>
<li>Capture semantic relationships between words (e.g., "king" - "man" + "woman" &asymp; "queen").</li>
<li>Efficient representation for large vocabularies.</li>
</ul>
<h3>Popular Word Embedding Models:</h3>
<ul>
<li>Word2Vec</li>
<li>GloVe</li>
<li>FastText</li>
</ul>
""", unsafe_allow_html=True)
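    # Hypothetical illustration (not from the original page): training a tiny
    # Word2Vec model with gensim. Real embeddings need far larger corpora.
    st.markdown("<h3>Code Example:</h3>", unsafe_allow_html=True)
    st.code("""
from gensim.models import Word2Vec

sentences = [
    ["nlp", "is", "fascinating"],
    ["word", "embeddings", "capture", "meaning"],
    ["nlp", "uses", "word", "embeddings"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["nlp"].shape)  # (50,)
print(model.wv.most_similar("nlp", topn=3))
""", language="python")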
# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("Explore each topic to dive deeper into NLP concepts!")