import streamlit as st 

st.markdown(
    """
    <style>
    /* App Background */
    .stApp {
        background: linear-gradient(to right, #EE82EE, #FFA500, #87CEEB); /* Violet-orange-blue gradient background */
        color: #00FFFF;
        padding: 20px;
    }
    /* Align content to the left */
    .block-container {
        text-align: left; /* Left align for content */
        padding: 2rem; /* Padding for aesthetics */
    }
    
    /* Header and Subheader Text */
    h1 {
        color: #800080 !important; /* Custom styling for the main header */
        font-family: 'Arial', sans-serif !important;
        font-weight: bold !important;
        text-align: center;
    }
    h2, h3, h4 {
        color: #FFFF00 !important; /* Custom styling for subheaders */
        font-family: 'Arial', sans-serif !important;
        font-weight: bold !important;
    }
    /* Paragraph Text */
    p {
        color: #0000FF !important; /* Custom styling for paragraphs */
        font-family: 'Arial', sans-serif !important;
        line-height: 1.6;
    }
    </style>
    """,
    unsafe_allow_html=True
)
st.markdown(
    """
    <h1 style="text-align: center;">Basic Terminology in NLP</h1>
    """,
    unsafe_allow_html=True
)

st.markdown(
    """
    <h5>Before diving deep into NLP concepts, we should first know the terminology that is used frequently in NLP.</h5>
    <h5 style="color: #00FF00;">1. Key Terminologies in NLP</h5>
    <ul style="color: #008000; line-height: 1.8;">
        <li><b>Corpus:</b> A collection of text documents. Example: {d1, d2, d3, ...}</li>
        <li><b>Document:</b> A single unit of text (e.g., a sentence, paragraph, or article).</li>
        <li><b>Paragraph:</b> A collection of sentences.</li>
        <li><b>Sentence:</b> A collection of words forming a meaningful expression.</li>
        <li><b>Word:</b> A collection of characters.</li>
        <li><b>Character:</b> A basic unit such as a letter, digit, or special symbol.</li>
    </ul>
    """,
    unsafe_allow_html=True
)
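# A small sketch of the hierarchy above in plain Python terms: a corpus is a
# collection of documents, a document contains sentences, a sentence contains
# words, and a word contains characters. (The sample strings here are
# illustrative, not part of the app's content.)
corpus = ["I love biryani. I love chocolate.", "Hyderabad is famous for biryani."]
document = corpus[0]                       # one document from the corpus
sentences = [s.strip() + "." for s in document.split(".") if s.strip()]
words = sentences[0].rstrip(".").split()   # words of the first sentence
characters = list(words[0])                # characters of the first word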
st.markdown(
    """
    <h5 style="color: #00FFFF;">2. Tokenization</h5>
    <p style="color: #FFA500;">Tokenization is the process of breaking down a large piece of text into smaller units called tokens. These tokens can be words, sentences, or subwords, depending on the granularity required for the task.</p>
    <h6>Types of Tokenization:</h6>
    <ul style="color: #d4e6f1; line-height: 1.8;">
        <li><b>Sentence Tokenization:</b> Splitting text into sentences. <br> Example: "I love ice-cream. I love chocolate." → ["I love ice-cream.", "I love chocolate."]</li>
        <li><b>Word Tokenization:</b> Splitting sentences into words. <br> Example: "I love biryani" → ["I", "love", "biryani"]</li>
        <li><b>Character Tokenization:</b> Splitting words into characters. <br> Example: "Love" → ["L", "o", "v","e"]</li>
    </ul>
    """,
    unsafe_allow_html=True
)
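# A minimal sketch of the three tokenization levels described above, using
# plain Python string operations. Real projects usually rely on a library
# such as nltk or spaCy, which handle punctuation and edge cases properly.
text = "I love ice-cream. I love chocolate."
sentence_tokens = [s.strip() + "." for s in text.split(".") if s.strip()]
word_tokens = "I love biryani".split()   # ["I", "love", "biryani"]
char_tokens = list("Love")               # ["L", "o", "v", "e"]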
st.markdown(
    """
    <h5 style="color: #008080;">3. Stop Words</h5>
    <p style="color: #000080;">Stop words are commonly used words in a language that carry little or no meaningful information for text analysis.</p>
    <h6>Example:</h6>
    <p style="color: #d4e6f1;">"In Hyderabad, we can eat famous biryani." <br> Stop words: ["in", "we", "can"]</p>
    """,
    unsafe_allow_html=True
)
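# A small sketch of stop-word removal using the example sentence above and a
# hand-picked stop-word set; libraries such as nltk ship much larger
# per-language stop-word lists.
stop_words = {"in", "we", "can"}
sentence = "In Hyderabad, we can eat famous biryani."
tokens = [w.strip(",.").lower() for w in sentence.split()]
filtered = [w for w in tokens if w not in stop_words]
# filtered: ["hyderabad", "eat", "famous", "biryani"]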
st.markdown(
    """
    <h5 style="color: #20B2AA;">4. Vectorization</h5>
    <p style="color: #d4e6f1;">Vectorization is the process of converting text data into numerical representations so that machine learning models can process and analyze it.</p>
    <h6>Types of Vectorization:</h6>
    <ul style="color: #d4e6f1; line-height: 1.8;">
        <li><b>One-Hot Encoding:</b> Represents each word as a binary vector.</li>
        <li><b>Bag of Words (BoW):</b> Represents text based on word frequencies.</li>
        <li><b>TF-IDF:</b> Weights a word's frequency in a document by how rare it is across the corpus (term frequency × inverse document frequency).</li>
        <li><b>Word2Vec:</b> Embeds words in a vector space using deep learning.</li>
        <li><b>GloVe:</b> Uses global co-occurrence statistics for embedding.</li>
        <li><b>FastText:</b> Similar to Word2Vec but includes subword information.</li>
    </ul>
    """,
    unsafe_allow_html=True
)
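# A bag-of-words sketch built with only the standard library, to show what
# "converting text into numbers" means concretely; in practice scikit-learn's
# CountVectorizer and TfidfVectorizer are the usual tools.
from collections import Counter

docs = ["I love biryani", "I love chocolate"]
vocab = sorted({w.lower() for doc in docs for w in doc.split()})
counts = [Counter(doc.lower().split()) for doc in docs]
bow_vectors = [[c[word] for word in vocab] for c in counts]
# vocab:       ["biryani", "chocolate", "i", "love"]
# bow_vectors: [[1, 0, 1, 1], [0, 1, 1, 1]]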
st.markdown(
    """
    <h5 style="color: #20B2AA;">5. Stemming</h5>
    <p style="color: #d4e6f1;">Stemming is the process of reducing words to their base or root form, often by removing prefixes or suffixes. It is a rule-based, heuristic approach to standardize words by removing derivational affixes.</p>
    <h6>Example:</h6>
    <ul style="color: #d4e6f1; line-height: 1.8;">
        <li><b>Original Words:</b> "running", "runner", "runs"</li>
        <li><b>Stemmed Form:</b> "run"</li>
    </ul>
    """,
    unsafe_allow_html=True
)
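# A toy suffix-stripping stemmer to show the heuristic, rule-based idea; real
# stemming uses nltk's PorterStemmer or SnowballStemmer, whose rules are far
# more careful. Note that a stemmer's output need not be a dictionary word.
def simple_stem(word):
    for suffix in ("ning", "ner", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# simple_stem maps "running", "runner", and "runs" all to "run"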
st.markdown(
    """
    <h5 style="color: #20B2AA;">6. Lemmatization</h5>
    <p style="color: #d4e6f1;">Lemmatization is the process of reducing a word to its base or root form (called a lemma) using linguistic rules and a vocabulary (dictionary). Unlike stemming, lemmatization ensures that the resulting word is a valid word in the language.</p>
    <h6>Example:</h6>
    <ul style="color: #d4e6f1; line-height: 1.8;">
        <li><b>Original Words:</b> "studying", "better", "carrying"</li>
        <li><b>Lemmatized Form:</b> "study", "good", "carry"</li>
    </ul>
    <p style="color: #d4e6f1;">Lemmatization is more accurate than stemming but computationally more intensive as it requires a language dictionary.</p>
    """,
    unsafe_allow_html=True
)
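# A toy dictionary-backed lemmatizer: unlike stemming, the mapping comes from
# a vocabulary lookup, so the result is always a valid word. nltk's
# WordNetLemmatizer plays this role in practice, using the WordNet dictionary.
LEMMAS = {"studying": "study", "better": "good", "carrying": "carry"}

def lemmatize(word):
    return LEMMAS.get(word.lower(), word)

# lemmatize("better") returns "good"; unknown words pass through unchanged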