Update pages/4.Feature Engineering.py

pages/4.Feature Engineering.py (new file, +216 −0)
@@ -0,0 +1,216 @@
import streamlit as st


# Function to display the Home Page
def show_home_page():
    st.title("🐦 :red[Natural Language Processing (NLP)]")
    st.markdown(
        """
        ### :green[Welcome to the NLP Guide]
        Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between
        computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.
        This guide explores key NLP concepts and techniques, from basic terminology to advanced vectorization methods. Use the sidebar to explore each topic in detail.

        #### :green[Applications of NLP:]
        - Chatbots and virtual assistants (e.g., Alexa, Siri)
        - Sentiment analysis
        - Language translation tools (e.g., Google Translate)
        - Text summarization and more!
        """
    )
    st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")

# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title("📖 :blue[NLP Terminologies]")
        st.markdown(
            """
            ### :red[Key NLP Terms:]
            - **Tokenization**: Splitting text into smaller units like words or sentences.
            - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
            - **Stemming**: Reducing words to their root form (e.g., "running" → "run").
            - **Lemmatization**: Converting words to their dictionary base form (e.g., "better" → "good").
            - **Corpus**: A large collection of text used for NLP training and analysis.
            - **Vocabulary**: The set of unique words in a corpus.
            - **n-grams**: Contiguous sequences of *n* words or characters in text.
            - **POS Tagging**: Assigning parts of speech (e.g., noun, verb) to words.
            - **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
            - **Parsing**: Analyzing the grammatical structure of a sentence.
            """
        )
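The *n-grams* term defined above can be made concrete in a few lines of plain Python (a minimal sketch; `ngrams` is an illustrative helper, not part of this app):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```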

    elif page == "One-Hot Vectorization":
        st.title("🔧 :green[One-Hot Vectorization]")
        st.markdown(
            """
            ### :red[One-Hot Vectorization Explained]
            One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.
            #### :red[How It Works:]
            - Each unique word in the vocabulary is assigned an index.
            - The vector for a word is all zeros except for a `1` at the index of that word.
            #### :red[Example:]
            Vocabulary: ["cat", "dog", "bird"]
            - "cat" → [1, 0, 0]
            - "dog" → [0, 1, 0]
            - "bird" → [0, 0, 1]
            #### :red[Advantages:]
            - Simple and intuitive to implement.
            #### :red[Limitations:]
            - High dimensionality for large vocabularies.
            - Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).
            #### :red[Applications:]
            - Suitable for small datasets where simplicity is a priority.
            """
        )
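The encoding this page describes can be sketched directly from its cat/dog/bird example (`one_hot` is an illustrative helper, not part of the app):

```python
def one_hot(word, vocab):
    # All zeros except a 1 at the word's index in the vocabulary.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["cat", "dog", "bird"]
print(one_hot("dog", vocab))  # → [0, 1, 0]
```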

    elif page == "Bag of Words":
        st.title("📊 :green[Bag of Words (BoW)]")
        st.markdown(
            """
            ### :orange[Bag of Words (BoW) Method]
            Bag of Words is a way of representing text by counting word occurrences while ignoring word order.
            #### :orange[How It Works:]
            1. Create a vocabulary of all unique words in the text.
            2. Count the frequency of each word in a document.
            #### :orange[Example:]
            Given two sentences:
            - Sentence 1: "I love NLP."
            - Sentence 2: "I love programming."

            Vocabulary: ["I", "love", "NLP", "programming"]
            - Sentence 1: [1, 1, 1, 0]
            - Sentence 2: [1, 1, 0, 1]
            #### :orange[Advantages:]
            - Simple to implement and interpret.
            #### :orange[Limitations:]
            - High dimensionality for large vocabularies.
            - Ignores word order and semantic meaning.
            - Sensitive to noisy or very frequent terms.
            #### :orange[Applications:]
            - Text classification and clustering.
            """
        )
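The two-sentence example on this page can be reproduced with a small counting sketch (a minimal illustration using `collections.Counter`; `bag_of_words` is a hypothetical helper, and real pipelines would use a proper tokenizer instead of stripping the trailing period):

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    # Count how often each vocabulary word occurs in the sentence.
    counts = Counter(sentence.rstrip(".").split())
    return [counts[word] for word in vocab]

vocab = ["I", "love", "NLP", "programming"]
print(bag_of_words("I love NLP.", vocab))          # → [1, 1, 1, 0]
print(bag_of_words("I love programming.", vocab))  # → [1, 1, 0, 1]
```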

    elif page == "TF-IDF Vectorizer":
        st.title("📈 :blue[TF-IDF Vectorizer]")
        st.markdown(
            r"""
            ### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
            TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).
            #### :rainbow[Formula:]
            $$\text{TF-IDF} = \text{TF} \times \text{IDF}$$
            - **TF (Term Frequency)**: Frequency of a word in a document divided by the total number of words in the document.
            - **IDF (Inverse Document Frequency)**: Logarithm of the total number of documents divided by the number of documents containing the word.
            #### :rainbow[Example:]
            For the corpus:
            - Document 1: "NLP is amazing."
            - Document 2: "NLP is fun and amazing."

            A word like "fun", which appears in only one document, receives a higher weight than words like "is" and "amazing", which appear in both.
            #### :rainbow[Advantages:]
            - Highlights unique and relevant terms.
            - Reduces the impact of frequent, less informative words.
            #### :rainbow[Applications:]
            - Information retrieval, search engines, and document classification.
            """
        )
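The formula above can be checked by hand on this page's two-document corpus (a minimal sketch using the natural logarithm and the plain, unsmoothed IDF definition given above; library implementations such as scikit-learn's add smoothing and normalization):

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term / total words in the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

corpus = [
    ["nlp", "is", "amazing"],                # Document 1
    ["nlp", "is", "fun", "and", "amazing"],  # Document 2
]
# "is" appears in both documents, so its IDF is log(2/2) = 0;
# "fun" appears in only one, so its IDF is log(2/1) > 0.
print(tf("fun", corpus[1]) * idf("fun", corpus))  # TF-IDF of "fun" in Document 2
print(tf("is", corpus[1]) * idf("is", corpus))    # → 0.0
```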

    elif page == "Word2Vec":
        st.title("🌐 :red[Word2Vec]")
        st.markdown(
            """
            ### :green[Word2Vec]
            Word2Vec creates dense vector representations of words, capturing semantic relationships, using a shallow neural network.
            #### :green[Key Models:]
            - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
            - **Skip-gram**: Predicts the context words from a target word.
            #### :green[Example:]
            Word2Vec can capture relationships like:
            - "king" - "man" + "woman" ≈ "queen"
            #### :green[Advantages:]
            - Captures semantic meaning and relationships.
            - Efficient for large datasets.
            #### :green[Applications:]
            - Sentiment analysis, recommendation systems, and machine translation.
            #### :green[Limitations:]
            - Computationally intensive to train on large datasets.
            """
        )
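Training a real Word2Vec model requires a library such as gensim, but the (target, context) pairs the skip-gram model described above learns from can be generated in plain Python (`skipgram_pairs` is an illustrative helper, not a full implementation):

```python
def skipgram_pairs(tokens, window=2):
    # Pair each target word with every word inside its context window.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["nlp", "is", "fun"], window=1))
# → [('nlp', 'is'), ('is', 'nlp'), ('is', 'fun'), ('fun', 'is')]
```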

    elif page == "FastText":
        st.title("🚀 :red[FastText]")
        st.markdown(
            """
            ### :blue[FastText]
            FastText extends Word2Vec by representing words as bags of character n-grams, enabling it to handle rare and out-of-vocabulary words.
            #### :blue[Example:]
            The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".
            #### :blue[Advantages:]
            - Handles rare words and misspellings.
            - Captures subword information (e.g., prefixes and suffixes).
            #### :blue[Applications:]
            - Multilingual text processing.
            - Working with noisy or incomplete data.
            #### :blue[Limitations:]
            - Higher computational cost than Word2Vec.
            """
        )
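The subword decomposition this page describes can be sketched in plain Python (FastText also pads each word with `<` and `>` boundary markers before extracting n-grams, which is why the end-of-word trigram "ng>" is distinct from "ng" elsewhere; `char_ngrams` is an illustrative helper):

```python
def char_ngrams(word, n=3):
    # Pad the word with boundary markers, then slide an n-character window.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# → ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```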

    elif page == "Tokenization":
        st.title("✂️ :blue[Tokenization]")
        st.markdown(
            """
            ### :red[Tokenization]
            Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.
            #### :red[Types:]
            - **Word Tokenization**: Splits text into words.
            - **Sentence Tokenization**: Splits text into sentences.
            #### :red[Example:]
            Sentence: "NLP is exciting."
            - Word Tokens: ["NLP", "is", "exciting", "."]
            #### :red[Libraries:]
            - NLTK
            - spaCy
            - Hugging Face Transformers
            #### :red[Challenges:]
            - Handling complex text (e.g., abbreviations, contractions, multilingual data).
            #### :red[Applications:]
            - Preprocessing for machine learning models.
            """
        )
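A minimal word tokenizer reproducing this page's example can be written with a single regular expression (an illustrative sketch; libraries like NLTK and spaCy handle the harder cases such as contractions and abbreviations):

```python
import re

def word_tokenize(text):
    # Runs of word characters form one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # → ['NLP', 'is', 'exciting', '.']
```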

    elif page == "Stop Words":
        st.title("🛑 :green[Stop Words]")
        st.markdown(
            """
            ### :rainbow[Stop Words]
            Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").
            #### :rainbow[Why Remove Stop Words?]
            - To reduce noise and focus on meaningful terms in text.
            #### :rainbow[Example Stop Words:]
            - English: "is", "the", "and".
            - Spanish: "es", "el", "y".
            #### :rainbow[Challenges:]
            - Some stop words carry important context in specific use cases (e.g., "not" in sentiment analysis).
            #### :rainbow[Applications:]
            - Sentiment analysis, text classification, and search engines.
            """
        )
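The filtering step this page describes is a one-line list comprehension (a minimal sketch; `STOP_WORDS` here is a tiny illustrative set, whereas NLTK and spaCy ship curated per-language lists):

```python
STOP_WORDS = {"is", "the", "and"}  # tiny illustrative list, not a complete one

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word set (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "the", "future"]))  # → ['NLP', 'future']
```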

# Sidebar navigation
st.sidebar.title("📚 NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)