Pasham123 committed
Commit 3a9bb96 (verified) · 1 Parent(s): 749d0b1

Update pages/Basic_Terminologies.py

Files changed (1)
  1. pages/Basic_Terminologies.py +124 -0
pages/Basic_Terminologies.py CHANGED
@@ -0,0 +1,124 @@
+ import streamlit as st
+
+ st.markdown(
+ """
+ <style>
+ /* App Background */
+ .stApp {
+ background: linear-gradient(to right, #EE82EE, #FFA500, #87CEEB); /* Left-to-right gradient: violet, orange, sky blue */
+ color: #00FFFF;
+ padding: 20px;
+ }
+ /* Align content to the left */
+ .block-container {
+ text-align: left; /* Left align for content */
+ padding: 2rem; /* Padding for aesthetics */
+ }
+
+ /* Header and Subheader Text */
+ h1 {
+ color: #800080 !important; /* Custom styling for the main header */
+ font-family: 'Arial', sans-serif !important;
+ font-weight: bold !important;
+ text-align: center;
+ }
+ h2, h3, h4 {
+ color: #FFFF00 !important; /* Custom styling for subheaders */
+ font-family: 'Arial', sans-serif !important;
+ font-weight: bold !important;
+ }
+ /* Paragraph Text */
+ p {
+ color: #0000FF !important; /* Custom styling for paragraphs */
+ font-family: 'Arial', sans-serif !important;
+ line-height: 1.6;
+ }
+ </style>
+ """,
+ unsafe_allow_html=True
+ )
+ st.markdown(
+ """
+ <h1 style="text-align: center;">Basic Terminology in NLP</h1>
+ """,
+ unsafe_allow_html=True
+ )
+
+ st.markdown(
+ """
+ <h5>Before diving into NLP concepts, it helps to know the terms used most often in the field.</h5>
+ <h5 style="color: #00FF00;">1. Key Terminology in NLP</h5>
+ <ul style="color: #008000; line-height: 1.8;">
+ <li><b>Corpus:</b> A collection of text documents. Example: {d1, d2, d3, ...}</li>
+ <li><b>Document:</b> A single unit of text (e.g., a sentence, paragraph, or article).</li>
+ <li><b>Paragraph:</b> A collection of sentences.</li>
+ <li><b>Sentence:</b> A collection of words forming a meaningful expression.</li>
+ <li><b>Word:</b> A collection of characters.</li>
+ <li><b>Character:</b> A basic unit such as a letter, digit, or special symbol.</li>
+ </ul>
+ """,
+ unsafe_allow_html=True
+ )
+ st.markdown(
+ """
+ <h5 style="color: #00FFFF;">2. Tokenization</h5>
+ <p style="color: #FFA500;">Tokenization is the process of breaking a piece of text into smaller units called tokens. Tokens can be sentences, words, subwords, or characters, depending on the granularity the task requires.</p>
+ <h6>Types of Tokenization:</h6>
+ <ul style="color: #d4e6f1; line-height: 1.8;">
+ <li><b>Sentence Tokenization:</b> Splitting text into sentences. <br> Example: "I love ice-cream. I love chocolate." → ["I love ice-cream.", "I love chocolate."]</li>
+ <li><b>Word Tokenization:</b> Splitting sentences into words. <br> Example: "I love biryani" → ["I", "love", "biryani"]</li>
+ <li><b>Character Tokenization:</b> Splitting words into characters. <br> Example: "Love" → ["L", "o", "v", "e"]</li>
+ </ul>
+ """,
+ unsafe_allow_html=True
+ )
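The three tokenization levels listed in this hunk can be sketched with the standard library alone. This is a naive illustration, not part of the commit: the function names are hypothetical, and the regular expressions do not handle abbreviations such as "Dr." the way a trained sentence tokenizer would.

```python
import re


def sentence_tokenize(text):
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    # A token is a word (optionally with internal hyphens or apostrophes)
    # or a single punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


def char_tokenize(word):
    # Every character becomes its own token.
    return list(word)


print(sentence_tokenize("I love ice-cream. I love chocolate."))
print(word_tokenize("I love biryani"))
print(char_tokenize("Love"))
```

Note that the sentence tokenizer keeps the closing punctuation with each sentence, and the word tokenizer emits punctuation as separate tokens.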
+ st.markdown(
+ """
+ <h5 style="color: #008080;">3. Stop Words</h5>
+ <p style="color: #000080;">Stop words are very common words in a language (articles, prepositions, pronouns, auxiliary verbs) that carry little or no meaningful information for text analysis and are often removed during preprocessing.</p>
+ <h6>Example:</h6>
+ <p style="color: #d4e6f1;">"In Hyderabad, we can eat famous biryani." <br> Stop words: ["in", "we", "can"]</p>
+ """,
+ unsafe_allow_html=True
+ )
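Stop-word removal amounts to filtering tokens against a fixed list. The sketch below assumes a small hand-written `STOP_WORDS` set chosen only for illustration; real projects usually take the list from an NLP library, and the contents of such lists vary.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "we", "can", "is", "are", "of", "to", "and"}


def split_stop_words(text):
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    stops = [t for t in tokens if t in STOP_WORDS]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return stops, kept


stops, kept = split_stop_words("In Hyderabad, we can eat famous biryani.")
print(stops)  # ['in', 'we', 'can']
print(kept)   # ['hyderabad', 'eat', 'famous', 'biryani']
```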
+ st.markdown(
+ """
+ <h5 style="color: #20B2AA;">4. Vectorization</h5>
+ <p style="color: #d4e6f1;">Vectorization is the process of converting text data into numerical representations so that machine learning models can process and analyze it.</p>
+ <h6>Types of Vectorization:</h6>
+ <ul style="color: #d4e6f1; line-height: 1.8;">
+ <li><b>One-Hot Encoding:</b> Represents each word as a binary vector with a single 1 at the word's position in the vocabulary.</li>
+ <li><b>Bag of Words (BoW):</b> Represents a document by how often each vocabulary word occurs in it, ignoring word order.</li>
+ <li><b>TF-IDF:</b> Weights a word's frequency in a document by how rare the word is across the corpus.</li>
+ <li><b>Word2Vec:</b> Learns dense word embeddings with a shallow neural network that predicts words from their local context.</li>
+ <li><b>GloVe:</b> Uses global co-occurrence statistics for embedding.</li>
+ <li><b>FastText:</b> Similar to Word2Vec but includes subword (character n-gram) information.</li>
+ </ul>
+ """,
+ unsafe_allow_html=True
+ )
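The first three vectorization schemes in this list are simple enough to sketch with the standard library. The toy corpus `docs` and the function names are assumptions made for illustration. The TF-IDF here uses the plain textbook form, tf × ln(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (an assumption for this sketch).
docs = [
    ["i", "love", "biryani"],
    ["i", "love", "ice", "cream"],
    ["biryani", "is", "famous"],
]
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    # Binary vector with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def bag_of_words(doc):
    # Count of each vocabulary word in the document; word order is ignored.
    counts = Counter(doc)
    return [counts[word] for word in vocab]


def tf_idf(doc):
    # Term frequency scaled by the log of inverse document frequency.
    counts = Counter(doc)
    vec = []
    for word in vocab:
        tf = counts[word] / len(doc)
        df = sum(1 for d in docs if word in d)
        vec.append(tf * math.log(len(docs) / df))
    return vec


print(vocab)
print(one_hot("biryani"))
print(bag_of_words(docs[0]))
print([round(x, 3) for x in tf_idf(docs[0])])
```

The Word2Vec, GloVe, and FastText entries are learned embeddings and need a training library, so they are left out of this sketch.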
+ st.markdown(
+ """
+ <h5 style="color: #20B2AA;">5. Stemming</h5>
+ <p style="color: #d4e6f1;">Stemming reduces words to a base form by stripping affixes, usually suffixes, with rule-based heuristics. The resulting stem need not be a valid word; for example, the Porter stemmer maps "studies" to "studi".</p>
+ <h6>Example:</h6>
+ <ul style="color: #d4e6f1; line-height: 1.8;">
+ <li><b>Original Words:</b> "running", "runs"</li>
+ <li><b>Stemmed Form:</b> "run"</li>
+ </ul>
+ """,
+ unsafe_allow_html=True
+ )
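To make the rule-based idea concrete, here is a toy suffix-stripping stemmer. It is a sketch under stated assumptions, not the Porter algorithm: the `RULES` table and the minimum stem length of 3 are invented for illustration and cover only a handful of suffixes.

```python
# (suffix, replacement) pairs, tried in order. Hypothetical toy rules.
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(word):
    word = word.lower()
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        # Only strip when at least 3 characters remain, so "sing" stays "sing".
        if word.endswith(suffix) and len(stem) >= 3:
            stem += replacement
            # Undouble a final consonant left by "-ing"/"-ed": "runn" -> "run".
            if (
                suffix in ("ing", "ed")
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeioulsz"
            ):
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "runs", "studies", "runner"]:
    print(w, "->", toy_stem(w))
```

With these rules "running" and "runs" both reduce to "run", "studies" becomes the non-word "studi", and "runner" is left unchanged because no rule matches it.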
+ st.markdown(
+ """
+ <h5 style="color: #20B2AA;">6. Lemmatization</h5>
+ <p style="color: #d4e6f1;">Lemmatization reduces a word to its dictionary form (called a lemma) using linguistic rules and a vocabulary (dictionary). Unlike stemming, lemmatization ensures that the resulting word is a valid word in the language.</p>
+ <h6>Example:</h6>
+ <ul style="color: #d4e6f1; line-height: 1.8;">
+ <li><b>Original Words:</b> "studying", "better", "carrying"</li>
+ <li><b>Lemmatized Form:</b> "study", "good", "carry"</li>
+ </ul>
+ <p style="color: #d4e6f1;">Lemmatization is more accurate than stemming but computationally more intensive, as it requires a language dictionary and usually the word's part of speech.</p>
+ """,
+ unsafe_allow_html=True
+ )
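The dictionary-driven nature of lemmatization can be sketched as a lookup keyed by word and part of speech. The `LEMMA_TABLE` below is a hypothetical three-entry lexicon written for this example; a real lemmatizer relies on a full dictionary plus morphological rules.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("studying", "VERB"): "study",
    ("carrying", "VERB"): "carry",
    ("better", "ADJ"): "good",
}


def lemmatize(word, pos):
    # Look up the (word, POS) pair; fall back to the lowercased word itself
    # when the lexicon has no entry.
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lemmatize("studying", "VERB"))  # study
print(lemmatize("better", "ADJ"))     # good
print(lemmatize("better", "VERB"))    # better (no entry for this POS)
```

The last call shows why part of speech matters: "better" maps to "good" only when it is tagged as an adjective.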