Update pages/Basic_Terminologies.py

pages/Basic_Terminologies.py (+124 −0)
import streamlit as st

st.markdown(
    """
<style>
/* App Background */
.stApp {
    background: linear-gradient(to right, #EE82EE, #FFA500, #87CEEB); /* Violet-to-orange-to-sky-blue gradient background */
    color: #00FFFF;
    padding: 20px;
}
/* Align content to the left */
.block-container {
    text-align: left; /* Left align for content */
    padding: 2rem; /* Padding for aesthetics */
}
/* Header and Subheader Text */
h1 {
    color: #800080 !important; /* Custom styling for the main header */
    font-family: 'Arial', sans-serif !important;
    font-weight: bold !important;
    text-align: center;
}
h2, h3, h4 {
    color: #FFFF00 !important; /* Custom styling for subheaders */
    font-family: 'Arial', sans-serif !important;
    font-weight: bold !important;
}
/* Paragraph Text */
p {
    color: #0000FF !important; /* Custom styling for paragraphs */
    font-family: 'Arial', sans-serif !important;
    line-height: 1.6;
}
</style>
    """,
    unsafe_allow_html=True,
)

st.markdown(
    """
<h1 style="text-align: center;">Basic Terminology in NLP</h1>
    """,
    unsafe_allow_html=True,
)

st.markdown(
    """
<h5>Before diving deep into NLP concepts, we should first get familiar with the field's most frequently used terminology.</h5>
<h5 style="color: #00FF00;">1. Key Terminologies in NLP</h5>
<ul style="color: #008000; line-height: 1.8;">
    <li><b>Corpus:</b> A collection of text documents. Example: {d1, d2, d3, ...}</li>
    <li><b>Document:</b> A single unit of text (e.g., a sentence, paragraph, or article).</li>
    <li><b>Paragraph:</b> A collection of sentences.</li>
    <li><b>Sentence:</b> A collection of words forming a meaningful expression.</li>
    <li><b>Word:</b> A collection of characters.</li>
    <li><b>Character:</b> The smallest unit of text, such as a letter, digit, or special symbol.</li>
</ul>
    """,
    unsafe_allow_html=True,
)
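The corpus → document → sentence → word → character hierarchy above can be sketched in plain Python (illustrative data, and deliberately naive `split` calls rather than real tokenizers):

```python
# A corpus is just a collection of documents; each level below it is a
# smaller unit of the same text.
corpus = [
    "I love biryani. It is famous in Hyderabad.",  # document d1
    "NLP breaks text into smaller units.",         # document d2
]

document = corpus[0]                  # one document from the corpus
sentences = document.split(". ")      # naive sentence split on ". "
words = sentences[0].split()          # naive word split on whitespace
characters = list(words[1])           # a word is a sequence of characters

print(words)       # ['I', 'love', 'biryani']
print(characters)  # ['l', 'o', 'v', 'e']
```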
st.markdown(
    """
<h5 style="color: #00FFFF;">2. Tokenization</h5>
<p style="color: #FFA500;">Tokenization is the process of breaking a large piece of text into smaller units called tokens. These tokens can be sentences, words, subwords, or characters, depending on the granularity the task requires.</p>
<h6>Types of Tokenization:</h6>
<ul style="color: #d4e6f1; line-height: 1.8;">
    <li><b>Sentence Tokenization:</b> Splitting text into sentences. <br> Example: "I love ice-cream. I love chocolate." → ["I love ice-cream.", "I love chocolate."]</li>
    <li><b>Word Tokenization:</b> Splitting sentences into words. <br> Example: "I love biryani" → ["I", "love", "biryani"]</li>
    <li><b>Character Tokenization:</b> Splitting words into characters. <br> Example: "Love" → ["L", "o", "v", "e"]</li>
</ul>
    """,
    unsafe_allow_html=True,
)
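The three tokenization levels above can be sketched with the standard library alone. This is a simplified illustration using regular expressions, not the NLTK or spaCy tokenizers a real pipeline would likely use:

```python
import re

text = "I love ice-cream. I love chocolate."

# Sentence tokenization: split at whitespace that follows
# sentence-ending punctuation (a naive rule).
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(sentences)  # ['I love ice-cream.', 'I love chocolate.']

# Word tokenization: pull out runs of word characters (and hyphens),
# which drops punctuation.
words = re.findall(r"[\w-]+", sentences[0])
print(words)  # ['I', 'love', 'ice-cream']

# Character tokenization: a Python string is already a sequence of characters.
chars = list("Love")
print(chars)  # ['L', 'o', 'v', 'e']
```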
st.markdown(
    """
<h5 style="color: #008080;">3. Stop Words</h5>
<p style="color: #000080;">Stop words are commonly used words in a language that carry little or no meaningful information for text analysis, so they are usually removed before further processing.</p>
<h6>Example:</h6>
<p style="color: #d4e6f1;">"In Hyderabad, we can eat famous biryani." <br> Stop words: ["in", "we", "can"]</p>
    """,
    unsafe_allow_html=True,
)
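Stop-word removal can be sketched with a small hand-picked list. The `STOP_WORDS` set below is only for illustration; a real project would likely pull the list from NLTK or spaCy instead:

```python
import re

# Tiny illustrative stop-word list (not a real library's list).
STOP_WORDS = {"in", "we", "can", "the", "a", "an", "of", "to"}

def remove_stop_words(sentence: str) -> list[str]:
    # Lowercase, tokenize into word characters, then drop stop words.
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("In Hyderabad, we can eat famous biryani."))
# ['hyderabad', 'eat', 'famous', 'biryani']
```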
st.markdown(
    """
<h5 style="color: #20B2AA;">4. Vectorization</h5>
<p style="color: #d4e6f1;">Vectorization is the process of converting text into numerical representations so that machine learning models can process and analyze it.</p>
<h6>Types of Vectorization:</h6>
<ul style="color: #d4e6f1; line-height: 1.8;">
    <li><b>One-Hot Encoding:</b> Represents each word as a binary vector.</li>
    <li><b>Bag of Words (BoW):</b> Represents text by word-frequency counts, ignoring word order.</li>
    <li><b>TF-IDF:</b> Weights a word's frequency by how rare the word is across the corpus.</li>
    <li><b>Word2Vec:</b> Embeds words in a dense vector space using a shallow neural network.</li>
    <li><b>GloVe:</b> Learns embeddings from global word co-occurrence statistics.</li>
    <li><b>FastText:</b> Similar to Word2Vec but includes subword (character n-gram) information.</li>
</ul>
    """,
    unsafe_allow_html=True,
)
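The Bag of Words idea above can be sketched in a few lines of standard-library Python. In practice one would likely reach for scikit-learn's `CountVectorizer` (or `TfidfVectorizer` for TF-IDF); this just shows the mechanics:

```python
from collections import Counter

corpus = ["I love biryani", "I love chocolate"]

# Build a sorted vocabulary over the whole corpus.
vocab = sorted({w.lower() for doc in corpus for w in doc.split()})
print(vocab)  # ['biryani', 'chocolate', 'i', 'love']

def bag_of_words(doc: str) -> list[int]:
    # Count each word, then read counts off in vocabulary order.
    counts = Counter(w.lower() for w in doc.split())
    return [counts[w] for w in vocab]

for doc in corpus:
    print(doc, "->", bag_of_words(doc))
# I love biryani -> [1, 0, 1, 1]
# I love chocolate -> [0, 1, 1, 1]
```

Note how word order is lost: both vectors record only which words occur and how often.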
st.markdown(
    """
<h5 style="color: #20B2AA;">5. Stemming</h5>
<p style="color: #d4e6f1;">Stemming is the process of reducing words to their base or root form, typically by chopping off prefixes or suffixes. It is a rule-based, heuristic approach, so the resulting stem is not always a valid dictionary word.</p>
<h6>Example:</h6>
<ul style="color: #d4e6f1; line-height: 1.8;">
    <li><b>Original Words:</b> "running", "runner", "runs"</li>
    <li><b>Stemmed Form:</b> "run"</li>
</ul>
    """,
    unsafe_allow_html=True,
)
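A rule-based stemmer can be sketched as naive suffix stripping. The suffix list here is invented purely to handle the example words; a real application would likely use NLTK's `PorterStemmer`, which applies a much richer rule set:

```python
def stem(word: str) -> str:
    # Try longer suffixes first so "running" -> "run", not "runnin".
    # Keep at least two characters of the stem.
    for suffix in ("ning", "ner", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["running", "runner", "runs"]])
# ['run', 'run', 'run']
```

The heuristic nature shows quickly: these same rules would mangle words like "morning", which is exactly why stems are not guaranteed to be valid words.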
st.markdown(
    """
<h5 style="color: #20B2AA;">6. Lemmatization</h5>
<p style="color: #d4e6f1;">Lemmatization is the process of reducing a word to its base or dictionary form (called a lemma) using linguistic rules and a vocabulary. Unlike stemming, lemmatization guarantees that the resulting word is a valid word in the language.</p>
<h6>Example:</h6>
<ul style="color: #d4e6f1; line-height: 1.8;">
    <li><b>Original Words:</b> "studying", "better", "carrying"</li>
    <li><b>Lemmatized Form:</b> "study", "good", "carry"</li>
</ul>
<p style="color: #d4e6f1;">Lemmatization is more accurate than stemming but computationally more expensive, since it requires a vocabulary lookup.</p>
    """,
    unsafe_allow_html=True,
)
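The dictionary-lookup nature of lemmatization can be sketched with a tiny hand-written lemma table. The `LEMMAS` dict below is hypothetical and covers only the example words; real applications would likely use NLTK's `WordNetLemmatizer` or spaCy, which consult a full vocabulary plus part-of-speech information:

```python
# Tiny illustrative lemma dictionary (not a real lexical resource).
LEMMAS = {"studying": "study", "better": "good", "carrying": "carry"}

def lemmatize(word: str) -> str:
    # Look the word up; fall back to the word itself when unknown.
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["studying", "better", "carrying"]])
# ['study', 'good', 'carry']
```

Note that "better" → "good" is something no suffix-stripping stemmer could produce; it requires the vocabulary knowledge that lemmatization brings.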