import streamlit as st

# Apply custom CSS styling
st.markdown("""
    <style>
    /* Streamlit renders the page inside .stApp; plain `body` rules are often
       overridden, so target the app container instead. */
    .stApp {
        background-color: #eef2f7;
    }
    h1 {
        color: #0e7490;
        font-family: 'Roboto', sans-serif;
        font-weight: 700;
        text-align: center;
        margin-bottom: 25px;
    }
    h2, h3 {
        font-family: 'Roboto', sans-serif;
        font-weight: 600;
    }
    h2 {
        color: #b45309;
    }
    h3 {
        color: #ba95b0;
    }
    p, ul, ol {
        font-family: 'Georgia', serif;
        line-height: 1.8;
        color: #495057;
    }
    ul {
        margin-left: 20px;
    }
    .icon-bullet {
        list-style-type: none;
        padding-left: 20px;
    }
    .icon-bullet li {
        font-family: 'Georgia', serif;
        font-size: 1.1em;
        margin-bottom: 10px;
        color: #495057;
    }
    .icon-bullet li::before {
        content: "✔️";
        padding-right: 10px;
        color: #00FFFF;
    }
    </style>
""", unsafe_allow_html=True)

# Page Title
st.title("Interactive NLP Guide")

# Sidebar Navigation
st.sidebar.title("Explore NLP Topics")
topics = [
    "Introduction",
    "Tokenization",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word Embeddings",
]
selected_topic = st.sidebar.radio("Select a topic", topics)

# Content Based on Selection
if selected_topic == "Introduction":
    st.markdown("<h1>Natural Language Processing (NLP)</h1>", unsafe_allow_html=True)
    st.markdown("<h2>Introduction to NLP</h2>", unsafe_allow_html=True)
    st.markdown("""
    <p>Natural Language Processing (NLP) is a field at the intersection of linguistics and computer science, focusing on enabling computers to understand, interpret, and respond to human language.</p>
    <h3>Applications of NLP:</h3>
    <ul>
        <li>Chatbots and Virtual Assistants (e.g., Alexa, Siri)</li>
        <li>Machine Translation (e.g., Google Translate)</li>
        <li>Text Summarization</li>
        <li>Sentiment Analysis</li>
        <li>Speech Recognition Systems</li>
    </ul>
    """, unsafe_allow_html=True)

elif selected_topic == "Tokenization":
    st.markdown("<h1>Tokenization</h1>", unsafe_allow_html=True)
    st.markdown("<h2>What is Tokenization?</h2>", unsafe_allow_html=True)
    st.markdown("""
    <p>Tokenization is the process of breaking down a text into smaller units, such as sentences or words, called tokens. It is the first step in any NLP pipeline.</p>
    <h3>Types of Tokenization:</h3>
    <ul>
        <li><b>Word Tokenization:</b> Splits text into words (e.g., "I love NLP." → ["I", "love", "NLP"])</li>
        <li><b>Sentence Tokenization:</b> Splits text into sentences (e.g., "NLP is fascinating. It's the future." → ["NLP is fascinating.", "It's the future."])</li>
    </ul>
    <h3>Code Example:</h3>
    """, unsafe_allow_html=True)
    st.code("""
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural Language Processing is exciting. Let's explore it!"
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)
print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)
    """, language="python")

elif selected_topic == "One-Hot Vectorization":
    st.markdown("<h1>One-Hot Vectorization</h1>", unsafe_allow_html=True)
    st.markdown("""
    <p>One-Hot Vectorization is a method to represent text where each unique word is converted into a unique binary vector.</p>
    <h3>How It Works:</h3>
    <ul>
        <li>Each word in the vocabulary is assigned an index.</li>
        <li>The vector is all zeros except for a <code>1</code> at the word's index.</li>
    </ul>
    <h3>Example:</h3>
    <ul>
        <li>Vocabulary: ["cat", "dog", "bird"]</li>
        <li>"cat" → [1, 0, 0]</li>
        <li>"dog" → [0, 1, 0]</li>
    </ul>
    <h3>Limitations:</h3>
    <ul>
        <li>High dimensionality for large vocabularies.</li>
        <li>Does not capture semantic relationships between words.</li>
    </ul>
    """, unsafe_allow_html=True)

elif selected_topic == "Bag of Words":
    st.markdown("<h1>Bag of Words (BoW)</h1>", unsafe_allow_html=True)
    st.markdown("""
    <p>Bag of Words represents text as word frequency counts, disregarding word order.</p>
    <h3>How It Works:</h3>
    <ul>
        <li>Create a vocabulary of unique words.</li>
        <li>Count the frequency of each word in a document.</li>
    </ul>
    <h3>Example:</h3>
    <ul>
        <li>Given Sentences:
            <ul>
                <li>"I love NLP."</li>
                <li>"I love programming."</li>
            </ul>
        </li>
        <li>Vocabulary: ["I", "love", "NLP", "programming"]</li>
        <li>Sentence 1: [1, 1, 1, 0]</li>
        <li>Sentence 2: [1, 1, 0, 1]</li>
    </ul>
    """, unsafe_allow_html=True)

elif selected_topic == "TF-IDF Vectorizer":
    st.markdown("<h1>TF-IDF Vectorizer</h1>", unsafe_allow_html=True)
    st.markdown("""
    <p>TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).</p>
    <h3>Formula:</h3>
    """, unsafe_allow_html=True)
    st.latex(r'''
    \text{TF-IDF} = \text{TF} \times \text{IDF}
    ''')
    st.markdown("""
    <ul>
        <li><b>Term Frequency (TF):</b> Frequency of a word in a document.</li>
        <li><b>Inverse Document Frequency (IDF):</b> Logarithm of the ratio of the total number of documents to the number of documents containing the word.</li>
    </ul>
    """, unsafe_allow_html=True)

elif selected_topic == "Word Embeddings":
    st.markdown("<h1>Word Embeddings</h1>", unsafe_allow_html=True)
    st.markdown("""
    <p>Word Embeddings are dense vector representations of words that capture semantic meanings and relationships.</p>
    <h3>Key Features:</h3>
    <ul>
        <li>Captures semantic relationships between words (e.g., "king" - "man" + "woman" ≈ "queen").</li>
        <li>Efficient representation for large vocabularies.</li>
    </ul>
    <h3>Popular Word Embedding Models:</h3>
    <ul>
        <li>Word2Vec</li>
        <li>GloVe</li>
        <li>FastText</li>
    </ul>
    """, unsafe_allow_html=True)

# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("Explore each topic to dive deeper into NLP concepts!")