Spaces:

UmaKumpatla
/

NLP

Sleeping

App Files Files Community

NLP / pages /4.Feature Engineering.py

UmaKumpatla

Update pages/4.Feature Engineering.py

84b9be8 verified about 1 year ago

raw

history blame contribute delete

9.67 kB

	import streamlit as st

	# Function to display the Home Page
	def show_home_page():
	st.title("🔦 :red[Natural Language Processing (NLP)]")
	st.markdown(
	"""
	### :green[Welcome to the NLP Guide]
	Natural Language Processing (NLP) is a fascinating branch of Artificial Intelligence that focuses on the interaction between
	computers and humans using natural language. It enables machines to read, understand, and generate human language in a meaningful way.
	This guide explores key NLP concepts and techniques, from basic terminologies to advanced vectorization methods. Use the sidebar to explore each topic in detail.

	#### :green[Applications of NLP:]
	- Chatbots and virtual assistants (e.g., Alexa, Siri)
	- Sentiment analysis
	- Language translation tools (e.g., Google Translate)
	- Text summarization and more!
	"""
	)
	st.image("https://cdn-uploads.huggingface.co/production/uploads/66be28cc7e8987822d129400/1zCao_p5aQZr6zgYScaOB.png")
	# Function to display specific topic pages
	def show_page(page):
	if page == "NLP Terminologies":
	st.title("🔍 :blue[NLP Terminologies]")
	st.markdown(
	"""
	### :red[Key NLP Terms:]
	- Tokenization: Splitting text into smaller units like words or sentences.
	- Stop Words: Commonly used words (e.g., "the", "is") often removed during preprocessing.
	- Stemming: Reducing words to their root form (e.g., "running" → "run").
	- Lemmatization: Converting words to their dictionary base form (e.g., "running" → "run").
	- Corpus: A large collection of text used for NLP training and analysis.
	- Vocabulary: The set of unique words in a corpus.
	- n-grams: Sequences of n words or characters in text.
	- POS Tagging: Assigning parts of speech (e.g., noun, verb) to words.
	- NER (Named Entity Recognition): Identifying names, places, organizations, etc.
	- Parsing: Analyzing the grammatical structure of a sentence.
	"""
	)

	elif page == "One-Hot Vectorization":
	st.title("🔧 :green[One-Hot Vectorization]")
	st.markdown(
	"""
	### :red[One-Hot Vectorization Explained]
	One-Hot Vectorization is a simple representation where each word is encoded as a binary vector.
	#### :red[How It Works:]
	- Each unique word in the vocabulary is assigned an index.
	- The vector for a word is all zeros except for a `1` at the index of that word.
	#### :red[Example:]
	Vocabulary: ["cat", "dog", "bird"]
	- "cat" → [1, 0, 0]
	- "dog" → [0, 1, 0]
	- "bird" → [0, 0, 1]
	#### :red[Advantages:]
	- Simple and intuitive to implement.
	#### :red[Limitations:]
	- High dimensionality for large vocabularies.
	- Does not capture semantic relationships (e.g., "cat" and "kitten" have no connection).
	#### :red[Applications:]
	- Suitable for small datasets where simplicity is a priority.
	"""
	)

	elif page == "Bag of Words":
	st.title("🔄 :green[Bag of Words (BoW)]")
	st.markdown(
	"""
	### :orange[Bag of Words (BoW) Method]
	Bag of Words is a way of representing text by counting word occurrences while ignoring word order.
	#### :orange[How It Works:]
	1. Create a vocabulary of all unique words in the text.
	2. Count the frequency of each word in a document.
	#### :orange[Example:]
	Given two sentences:
	- Sentence 1: "I love NLP."
	- Sentence 2: "I love programming."
	Vocabulary: ["I", "love", "NLP", "programming"]
	- Sentence 1: [1, 1, 1, 0]
	- Sentence 2: [1, 1, 0, 1]
	#### :orange[Advantages:]
	- Simple to implement and interpret.
	#### :orange[Limitations:]
	- High dimensionality for large vocabularies.
	- Ignores word order and semantic meaning.
	- Sensitive to noisy or frequent terms.
	#### :orange[Applications:]
	- Text classification and clustering.
	"""
	)

	elif page == "TF-IDF Vectorizer":
	st.title("🔄 :blue[TF-IDF Vectorizer]")
	st.markdown(
	"""
	### :green[TF-IDF (Term Frequency-Inverse Document Frequency)]
	TF-IDF evaluates the importance of a word in a document relative to a collection of documents (corpus).
	#### :rainbow[Formula:]
	\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
	- TF (Term Frequency): Frequency of a word in a document divided by the total words in the document.
	- IDF (Inverse Document Frequency): Logarithm of total documents divided by the number of documents containing the word.
	#### :rainbow[Example:]
	For the corpus:
	- Document 1: "NLP is amazing."
	- Document 2: "NLP is fun and amazing."
	Words like "fun" and "amazing" will have higher weights than commonly occurring words like "is".
	#### :rainbow[Advantages:]
	- Highlights unique and relevant terms.
	- Reduces the impact of frequent, less informative words.
	#### :rainbow[Applications:]
	- Information retrieval, search engines, and document classification.
	"""
	)

	elif page == "Word2Vec":
	st.title("🌐 :red[Word2Vec]")
	st.markdown(
	"""
	### :green[Word2Vec]
	Word2Vec creates dense vector representations of words, capturing semantic relationships using neural networks.
	#### :green[Key Models:]
	- CBOW (Continuous Bag of Words): Predicts the target word from its context.
	- Skip-gram: Predicts the context from a target word.
	#### :green[Example:]
	Word2Vec can capture relationships like:
	- "king" - "man" + "woman" ≈ "queen"
	#### :green[Advantages:]
	- Captures semantic meaning and relationships.
	- Efficient for large datasets.
	#### :green[Applications:]
	- Sentiment analysis, recommendation systems, and machine translation.
	#### :green[Limitations:]
	- Computationally intensive for training on large datasets.
	"""
	)

	elif page == "FastText":
	st.title("🔄 :red[FastText]")
	st.markdown(
	"""
	### :blue[FastText]
	FastText extends Word2Vec by representing words as character n-grams, enabling it to handle rare and out-of-vocabulary words.
	#### :blue[Example:]
	The word "playing" might be represented by subwords like "pla", "lay", "ayi", "ing".
	#### :blue[Advantages:]
	- Handles rare words and misspellings.
	- Captures subword information (e.g., prefixes and suffixes).
	#### :blue[Applications:]
	- Multilingual text processing.
	- Working with noisy or incomplete data.
	#### :blue[Limitations:]
	- Higher computational cost than Word2Vec.
	"""
	)

	elif page == "Tokenization":
	st.title("🔢 :blue[Tokenization]")
	st.markdown(
	"""
	### :red[Tokenization]
	Tokenization is the process of splitting text into smaller units (tokens) such as words, phrases, or sentences.
	#### :red[Types:]
	- Word Tokenization: Splits text into words.
	- Sentence Tokenization: Splits text into sentences.
	#### :red[Example:]
	Sentence: "NLP is exciting."
	- Word Tokens: ["NLP", "is", "exciting", "."]
	#### :red[Libraries:]
	- NLTK
	- SpaCy
	- Hugging Face Transformers
	#### :red[Challenges:]
	- Handling complex text (e.g., abbreviations, contractions, multilingual data).
	#### :red[Applications:]
	- Preprocessing for machine learning models.
	"""
	)

	elif page == "Stop Words":
	st.title("🔐 :green[Stop Words]")
	st.markdown(
	"""
	### :rainbow[Stop Words]
	Stop words are commonly used words in a language that are often removed during text preprocessing (e.g., "is", "the", "and").
	#### :rainbow[Why Remove Stop Words?]
	- To reduce noise and focus on meaningful terms in text.
	#### :rainbow[Example Stop Words:]
	- English: "is", "the", "and".
	- Spanish: "es", "el", "y".
	#### :rainbow[Challenges:]
	- Some stop words might carry important context in specific use cases.
	#### :rainbow[Applications:]
	- Sentiment analysis, text classification, and search engines.
	"""
	)

	# Sidebar navigation
	st.sidebar.title("🔍 NLP Topics")
	menu_options = [
	"Home",
	"NLP Terminologies",
	"One-Hot Vectorization",
	"Bag of Words",
	"TF-IDF Vectorizer",
	"Word2Vec",
	"FastText",
	"Tokenization",
	"Stop Words",
	]
	selected_page = st.sidebar.radio("Select a topic", menu_options)

	# Display the selected page
	if selected_page == "Home":
	show_home_page()
	else:
	show_page(selected_page)