NLP / pages /Introduction.py
Rajesh6's picture
Update pages/Introduction.py
99c7e6f verified
import streamlit as st
st.header("Introduction to Natural Language Processing (NLP)")
st.markdown("<p>Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is valuable and meaningful.</p>",unsafe_allow_html= True)
st.subheader("What is NLP?")
st.markdown("<p>Natural language processing (NLP) is a field of computer science and a subfield of artificial intelligence that aims to make computers understand human language. NLP uses computational linguistics, which is the study of how language works, and various models based on statistics, machine learning, and deep learning. These technologies allow computers to analyze and process text or voice data, and to grasp their full meaning, including the speaker’s or writer’s intentions and emotions. </p>",unsafe_allow_html= True)
st.image("NLP.jpg")
st.markdown("<p>NLP powers many applications that use language, such as text translation, voice recognition, text summarization, and chatbots. You may have used some of these applications yourself, such as voice-operated GPS systems, digital assistants, speech-to-text software, and customer service bots. NLP also helps businesses improve their efficiency, productivity, and performance by simplifying complex tasks that involve language. </p>",unsafe_allow_html= True)
st.subheader("NLP Techniques")
st.markdown("<p>NLP encompasses a wide array of techniques that aimed at enabling computers to process and understand human language. These tasks can be categorized into several broad areas, each addressing different aspects of language processing. Here are some of the key NLP techniques:</p>",unsafe_allow_html= True)
st.markdown('<p style="color:;"><b>1. Text Processing and Preprocessing In NLP</b></p>', unsafe_allow_html=True)
st.write("Before performing any analysis or modeling, raw text data must be cleaned and prepared.")
st.markdown('<p style="color:;"><b>a. Tokenization</b></p>', unsafe_allow_html=True)
st.write("Splits text into smaller units like words or sentences.")
st.write("**Types:**")
st.write("**(i) Word Tokenization:** Breaking text into words.")
st.write("Example: _'I love NLP'_ → [‘I’, ‘love’, ‘NLP’]")
st.write("**(ii) Sentence Tokenization:** Breaking text into sentences.")
st.write("Example: _'I love NLP. It’s fascinating!'_ → [‘I love NLP.’, ‘It’s fascinating!’]")
st.markdown('<p style="color:;"><b>b. Stopword Removal</b></p>', unsafe_allow_html=True)
st.write("Removes common words like “the,” “and,” “is” that do not contribute much to analysis.")
st.markdown('<p style="color:;"><b>c. Stemming and Lemmatization</b></p>', unsafe_allow_html=True)
st.write("Stemming: Reduces words to their base or root form by chopping off suffixes (may not produce valid words).")
st.write("Example: _“running” _ → “run”")
st.write("Lemmatization: Converts words to their base form using vocabulary and grammar")
st.write("Example: _“good” _ → “better”")
st.markdown('<p style="color:;"><b>d. Part-of-Speech (POS) Tagging</b></p>', unsafe_allow_html=True)
st.write("Labels words with their grammatical roles (noun, verb, adjective, etc.)")
st.write("Example: _The cat sleeps”_ → [“The/DET”, “cat/NOUN”, “sleeps/VERB”]")
st.markdown('<p style="color:;"><b>e. Named Entity Recognition (NER)</b></p>', unsafe_allow_html=True)
st.write("Identifies and classifies entities in text (e.g., names, dates, locations)")
st.write("Example: _ “Barack Obama was born in Hawaii. _ ” → [Barack Obama: PERSON, Hawaii: LOCATION]")
st.markdown('<p style="color:;"><b>f. Text Normalization</b></p>', unsafe_allow_html=True)
st.write("Converts text to a standard format (lowercasing, removing punctuation, etc.).")
st.markdown('<p style="color:;"><b>2. Feature Extraction Techniques</b></p>', unsafe_allow_html=True)
st.write("Text needs to be transformed into numerical representations for machine learning models.")
st.markdown('<p style="color:;"><b>a. Bag of Words (BoW)</b></p>', unsafe_allow_html=True)
st.write("Represents text as a vector of word frequencies or occurrences, ignoring grammar and order")
st.write("Examples:")
st.write("Text: “I love NLP” and “NLP is great”")
st.write("Vocabulary: [“I”, “love”, “NLP”, “is”, “great”]")
st.write("Vector for “I love NLP”: [1, 1, 1, 0, 0]")
st.markdown('<p style="color:;"><b>b. Term Frequency-Inverse Document Frequency (TF-IDF)</b></p>', unsafe_allow_html=True)
st.write("The **TF-IDF Vectorizer** is a popular technique in Natural Language Processing (NLP) used to convert text into numerical values that can be used by machine learning models. It stands for Term Frequency-Inverse Document Frequency and helps highlight the importance of words in a document relative to a collection of documents (called a corpus).")
st.write('**Term Frequency (TF)** \n - Measures how often a word appears in a single document. \n - Formula: \n _TF_ = Number of times the word appears in the document / Total number of words in the document' )
st.write('**Inverse Document Frequency (IDF)** \n Measures how unique or rare a word is across all documents in the corpus. \n - Formula: \n _IDF_ = log(Total no.of documents / No of Documnets containing the word) \n Words that appear in many documents (like "the" or "and") will have a low IDF value, while unique words (like "NLP") will have a higher IDF.')
st.write('**TF - IDF Score:** \n - Combines TF and IDF to calculate the importance of a word in a document. \n - Formula: \n _TF - IDF = TF x IDF_ \n Words that are frequent in a document but rare in the overall corpus get a higher score.')
st.write("""
**Example**
**Consider these two documents:**
- "I love NLP"
- "NLP is amazing"
**Step 1: Calculate TF**
- "NLP" appears once in each document, so its TF is **1/3** in both.
- Words like "love" and "amazing" also have a TF of **1/3**.
**Step 2: Calculate IDF**
- "NLP" appears in both documents, so its IDF is **log(2/2) = 0**.
- "love" and "amazing" appear in only one document each, so their IDF is **log(2/1) = 0.69**.
**Step 3: Compute TF-IDF**
- "NLP" gets a TF-IDF score of **1/3 × 0 = 0** (not unique).
- "love" and "amazing" get scores of **1/3 × 0.69 = 0.23** (more unique).
""")
st.markdown('<p style="color:;"><b>c. Word Embeddings</b></p>', unsafe_allow_html=True)
st.write("Word embeddings are a type of representation for text where words are converted into dense numerical vectors. These vectors capture the semantic meaning of words and their relationships with other words in a way that computers can understand.")
st.write("""
**Word Embedding Techniques**
**1. Word2Vec**
Developed by Google, it uses two main approaches:
- **CBOW (Continuous Bag of Words):** Predicts a word based on its context.
- **Skip-Gram:** Predicts the context given a word.
**2. GloVe (Global Vectors)**
Developed by Stanford, it captures word relationships by analyzing co-occurrence statistics of words in a large corpus.
**3. FastText**
Developed by Facebook, it extends Word2Vec by considering subword information, making it better at handling rare and misspelled words.
**4. Transformers (Contextual Embeddings)**
Models like **BERT**, **ELMo**, and **GPT** generate embeddings based on the context in which a word appears, capturing nuanced meanings.
""")
st.subheader("Future of NLP")
st.write("""
The future of Natural Language Processing (NLP) is exciting, with advancements that aim to make machines understand and interact with human language more effectively. Here are key areas shaping the future of NLP: \n
**1.Context-Aware Models:**
- Enhanced Understanding of Context: Models like GPT and BERT have already revolutionized NLP. Future advancements will further refine their ability to comprehend nuanced context, sarcasm, and idioms.
**2.Real-Time Multilingual NLP**
- Instant Translations: Real-time and accurate translation across diverse languages, including low-resource ones.
- Language Independence: NLP systems capable of handling any language seamlessly.
**3.Conversational AI**
- Human-like Conversations: Chatbots and virtual assistants will become more natural, empathetic, and intuitive in conversations.
- Emotion Recognition: Understanding and responding to user emotions effectively.
**4. Zero-shot and Few-shot Learning**
- Minimal Data Requirement: Models will handle new tasks or languages with little to no additional training, making NLP accessible across domains with limited data.
**5. Multimodal Learning**
- Beyond Text: Integrating text with images, audio, and video for richer applications like understanding memes, videos, or interactive media.
The future of NLP is about creating systems that communicate more naturally, inclusively, and intelligently, enabling transformative applications in every aspect of life.
""")