Update stages/data_collection.py

stages/data_collection.py  (+130 -4)

@@ -1,5 +1,131 @@
-import streamlit as st
-
-def main():
-    st.title("Data Collection")
+import streamlit as st
+
+def main():
+    st.title("Step 2: Data Collection :rocket:")
+
+    st.markdown(
+        """
+        Data collection is the **foundation** of any NLP project. It involves gathering text data that your model will analyze and learn from. Think of it as collecting the raw materials needed to build something incredible!
+
+        To make data collection **efficient**, follow a step-by-step approach, starting with the easiest methods before progressing to more complex options.
+        """
+    )
+    st.divider()
+
+    # Section 1: Pre-Existing Datasets
+    st.subheader("1. Start with Pre-Existing Datasets :books:")
+    st.write(
+        """
+        The fastest way to gather data is to use **publicly available datasets**. These are ready-made datasets curated for various NLP tasks like sentiment analysis, chatbots, or classification.
+        """
+    )
+    st.markdown(
+        """
+        **Where to find datasets:**
+        - [Google Dataset Search](https://datasetsearch.research.google.com/)
+        - Platforms like [Kaggle](https://www.kaggle.com/), the **UCI Machine Learning Repository**, or [Hugging Face Datasets](https://huggingface.co/datasets).
+        - NLP libraries like **NLTK** and **spaCy** provide built-in datasets.
+
+        **Why start here?**
+        - Saves time and effort.
+        - Datasets are often clean, well-structured, and ready to use.
+
+        **Example:**
+        - For a sentiment analysis project, the **IMDB Reviews dataset** from Kaggle contains pre-labeled positive and negative movie reviews.
+        """
+    )
+    st.divider()
+
+    # Section 2: Using APIs
+    st.subheader("2. Use APIs for Structured Data :wrench:")
+    st.write(
+        """
+        If pre-existing datasets don't meet your needs, the next step is to use **APIs** to gather structured and specific data. APIs allow you to access text from platforms like social media, news websites, or product reviews.
+        """
+    )
+    st.markdown(
+        """
+        **Popular APIs for text data:**
+        - **Social Media:** Twitter API, Reddit API.
+        - **News Articles:** NewsAPI, NYTimes API.
+        - **E-Commerce Reviews:** Amazon API, Yelp API.
+        - **Beginner-Friendly:** [RapidAPI](https://rapidapi.com/) for easy integration.
+
+        **Why use APIs?**
+        - Provides clean, structured data.
+        - Allows you to collect **targeted data** like tweets containing specific keywords or customer reviews for a product.
+
+        **Example:**
+        - Use the **Twitter API** to collect tweets related to a trending topic for sentiment analysis.
+        """
+    )
+    st.divider()
+
+    # Section 3: Web Scraping
+    st.subheader("3. Web Scraping for Unstructured Data :spider_web:")
+    st.write(
+        """
+        If APIs are unavailable or insufficient, you can turn to **web scraping**. This involves extracting text data directly from websites like blogs, news portals, or e-commerce platforms.
+        """
+    )
+    st.markdown(
+        """
+        **Tools for Web Scraping:**
+        - **BeautifulSoup** and **Scrapy** for static websites.
+        - **Selenium** for dynamic websites that require interaction (like loading reviews).
+
+        **Why use web scraping?**
+        - Offers control over the exact data you collect.
+        - Ideal for extracting large amounts of **unstructured data**.
+
+        **Challenges to consider:**
+        - Websites may block scrapers with CAPTCHAs or security restrictions.
+        - Data might require extra cleaning and preprocessing.
+
+        **Example:**
+        - Scrape customer reviews from an e-commerce site or extract news articles for a topic modeling project.
+        """
+    )
+    st.divider()
+
+    # Section 4: Manual Data Collection
+    st.subheader("4. Collect Data on Your Own :scroll:")
+    st.write(
+        """
+        When existing datasets, APIs, or web scraping don't provide the data you need, the final step is to **collect or generate the data manually**. This method is more time-consuming but gives you full control over data quality and labeling.
+        """
+    )
+    st.markdown(
+        """
+        **Methods for manual data collection:**
+        - **Surveys & Feedback:** Gather text responses directly from users.
+        - **Manual Annotation:** Label raw text data yourself for classification tasks.
+        - **Synthetic Data Generation:** Use text augmentation techniques like paraphrasing or synonym replacement.
+
+        **Why use this method?**
+        - Provides fully customized and niche data.
+        - Essential for projects requiring **domain-specific** or highly tailored data.
+
+        **Example:**
+        - Manually collecting medical advice conversations for building a healthcare chatbot.
+        """
+    )
+    st.divider()
+
+    # Summary Section
+    st.subheader("Summary: Data Collection Hierarchy :ruler:")
+    st.markdown(
+        """
+        To collect high-quality data efficiently, follow this structured approach:
+
+        1. **Start with Pre-Existing Datasets**: Ready-made, clean, and fast to use.
+        2. **Use APIs**: Gather structured and specific data.
+        3. **Web Scraping**: Extract unstructured data when APIs fall short.
+        4. **Manual Collection**: Create or annotate custom data for full control.
+
+        **Friendly Tip :bulb::**
+        Always start simple! Use pre-existing datasets or APIs first, and only move to web scraping or manual collection if necessary. The goal is to get **clean, diverse, and meaningful data** because **better data = better NLP models**.
+        """
+    )
+    st.divider()
 main()
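
Section 1 of the new page names the IMDB reviews dataset as its example. A minimal sketch of actually loading it, using the Hugging Face `datasets` library (the library choice is an assumption; the page only links to the Hugging Face datasets hub):

```python
# Minimal sketch: load the pre-labeled IMDB movie-review dataset mentioned
# in Section 1. Assumes `pip install datasets` (Hugging Face).
from datasets import load_dataset

imdb = load_dataset("imdb")    # downloads and caches the dataset
sample = imdb["train"][0]      # each record has "text" and "label" fields
print(sample["label"], sample["text"][:120])
```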
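For Section 2, a sketch of collecting articles about a keyword through NewsAPI (one of the listed services) with `requests`; the API key is a placeholder, and the endpoint and response fields follow NewsAPI's documented `/v2/everything` format:

```python
# Minimal sketch: query NewsAPI for articles about a keyword.
# API_KEY is a placeholder - register for a free key at newsapi.org.
import requests

API_KEY = "YOUR_NEWSAPI_KEY"
resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "climate change", "language": "en", "apiKey": API_KEY},
    timeout=10,
)
resp.raise_for_status()
for article in resp.json().get("articles", [])[:5]:
    print(article["title"])
```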
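For Section 3, a sketch of static-page scraping with `requests` plus BeautifulSoup, as named in the tools list. The URL is hypothetical, and a real site's robots.txt and terms of service should be checked first:

```python
# Minimal sketch: pull visible paragraph text from a static page.
# Assumes `pip install requests beautifulsoup4`; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"  # hypothetical target page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect paragraph text for later cleaning and preprocessing.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print("\n".join(paragraphs)[:300])
```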
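Finally, Section 4 mentions synonym replacement as a synthetic-data technique. A naive sketch using NLTK's WordNet (the first run needs the `wordnet` corpus download; a real augmentation pipeline would add part-of-speech filtering and word-sense checks):

```python
# Naive synonym-replacement augmentation sketch using NLTK WordNet.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time corpus download

def augment(sentence: str, p: float = 0.3) -> str:
    """Replace each word with a WordNet synonym with probability p."""
    out = []
    for word in sentence.split():
        synsets = wordnet.synsets(word)
        if synsets and random.random() < p:
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            choices = [l for l in lemmas if l.lower() != word.lower()]
            out.append(random.choice(choices) if choices else word)
        else:
            out.append(word)
    return " ".join(out)

print(augment("The doctor gave clear advice about the treatment"))
```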