import streamlit as st

st.markdown("""
    <style>
    /* Set a soft background color */
    body {
        background-color: #eef2f7;
    }
    /* Style for main title */
    h1 {
        color: black;
        font-family: 'Roboto', sans-serif;
        font-weight: 700;
        text-align: center;
        margin-bottom: 25px;
    }
    /* Style for headers */
    h2 {
        color: red;
        font-family: 'Roboto', sans-serif;
        font-weight: 600;
        margin-top: 30px;
    }
    
    /* Style for subheaders */
    h3 {
        color: violet;
        font-family: 'Roboto', sans-serif;
        font-weight: 500;
        margin-top: 20px;
    }
    .custom-subheader {
        color: violet;
        font-family: 'Roboto', sans-serif;
        font-weight: 600;
        margin-bottom: 15px;
    }
    /* Paragraph styling */
    p {
        font-family: 'Georgia', serif;
        line-height: 1.8;
        color: black;
        margin-bottom: 20px;
    }
    /* List styling with diamond bullets */
    .icon-bullet {
        list-style-type: none;
        padding-left: 20px;
    }
    .icon-bullet li {
        font-family: 'Georgia', serif;
        font-size: 1.1em;
        margin-bottom: 10px;
        color: black;
    }
    .icon-bullet li::before {
        content: "◆";
        padding-right: 10px;
        color: black;
    }
    /* Sidebar styling */
    .sidebar .sidebar-content {
        background-color: #ffffff;
        border-radius: 10px;
        padding: 15px;
    }
    .sidebar h2 {
        color: #495057;
    }
    /* Custom button style */
    .streamlit-button {
        background-color: #00FFFF;
        color: #000000;
        font-weight: bold;
    }
    </style>
    """, unsafe_allow_html=True)


st.markdown("<h1 class='title'>📖 NLP Terminology</h1>", unsafe_allow_html=True)
st.markdown("<p class='caption'>✨ Explore essential terms in Natural Language Processing and their meanings!</p>", unsafe_allow_html=True)

st.header("📁 Corpus")
st.markdown("- **A corpus** is a collection of documents.")

st.header("📄 Document")
st.markdown("- **A document** is a collection of sentences, paragraphs, single words, or even single characters.")

st.header("📝 Paragraph")
st.markdown("- **A paragraph** consists of multiple sentences.")

st.header("📒 Sentence")
st.markdown("- **A sentence** is a collection of words.")

st.header("🔤 Word")
st.markdown("- **A word** is a sequence of characters.")

st.header("🔠 Character")
st.markdown("- **A character** can be a digit, a letter, or a special symbol.")

st.header("✂️ Tokenization")
st.markdown("- **Tokenization** is the technique of splitting a large chunk of text into smaller units, called tokens.")

st.subheader("🛠️ Types of Tokenization")
st.markdown("""
    - 🔹 **Sentence Tokenization** – Splits text into sentences.
    - 🔹 **Word Tokenization** – Splits sentences into words.
    - 🔹 **Character Tokenization** – Splits words into individual characters.
""")

st.subheader("📝 Sentence Tokenization")
st.markdown("- **Breaks a large text into meaningful sentence units.**")

st.subheader("📖 Word Tokenization")
st.markdown("- **Splits a sentence into individual words.**")

st.subheader("🔡 Character Tokenization")
st.markdown("- **Breaks words into separate characters.**")
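
# A minimal sketch of the three tokenization levels described above, using only
# the standard library. The helper names and the sample text are hypothetical,
# and the regular expressions are simplifying assumptions (they ignore
# abbreviations such as "e.g."); libraries such as NLTK or spaCy handle such
# cases more carefully.
import re


def sentence_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", sentence)


def char_tokenize(word):
    """Return the individual characters of a word."""
    return list(word)


sample_text = "NLP is fun. Tokens are small units!"  # hypothetical sample
sample_sentences = sentence_tokenize(sample_text)
st.write("Sentences:", sample_sentences)
st.write("Words:", word_tokenize(sample_sentences[0]))
st.write("Characters:", char_tokenize("NLP"))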

st.header("🚫 Stop Words")
st.markdown("- **Stop words** are common words (e.g., 'the', 'is', 'and') that add little meaning to the text on their own but maintain its grammatical structure.")
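
# A minimal sketch of stop-word removal. STOP_WORDS is a tiny hypothetical list
# chosen for illustration; libraries such as NLTK and spaCy ship fuller lists.
STOP_WORDS = {"the", "is", "and", "a", "of"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in STOP_WORDS."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


sample_tokens = ["The", "cat", "is", "on", "the", "mat"]  # hypothetical sample
st.write("Without stop words:", remove_stop_words(sample_tokens))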

st.header("📊 Vectorization")
st.markdown("- **Vectorization** transforms text into a numerical representation that machine learning models can process.")

st.subheader("🔢 Different Types of Vectorization Techniques")
st.markdown("""
    - 🎯 **One-Hot Encoding**
    - 🏷️ **Bag of Words (BoW)**
    - 📊 **TF-IDF (Term Frequency-Inverse Document Frequency)**
    - 🧠 **Word2Vec**
    - 🌍 **GloVe**
    - ⚡ **FastText**
""")
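
# A minimal sketch of two of the techniques listed above, Bag of Words and
# TF-IDF, using only the standard library. The two-document corpus and the
# helper names are hypothetical, and idf = log(N / df) is one common unsmoothed
# variant; libraries such as scikit-learn use a smoothed formula by default.
import math
from collections import Counter

toy_corpus = [["nlp", "is", "fun"], ["nlp", "is", "hard", "hard"]]
vocabulary = sorted({word for doc in toy_corpus for word in doc})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


def tf_idf(doc):
    """Weight each term frequency by the log inverse document frequency."""
    n_docs = len(toy_corpus)
    weights = []
    for word, count in zip(vocabulary, bag_of_words(doc)):
        doc_freq = sum(word in other for other in toy_corpus)
        weights.append((count / len(doc)) * math.log(n_docs / doc_freq))
    return weights


st.write("Vocabulary:", vocabulary)
st.write("Bag of Words:", [bag_of_words(doc) for doc in toy_corpus])
st.write("TF-IDF:", [tf_idf(doc) for doc in toy_corpus])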

st.success("🚀 Mastering these **NLP terms** will help you build powerful text-processing applications!")