NeonSamurai committed
Commit 1643e79 · verified · 1 Parent(s): 8fa8d68

Update stages/data_collection.py

Files changed (1)
  1. stages/data_collection.py +130 -4
stages/data_collection.py CHANGED
@@ -1,5 +1,131 @@
-import streamlit as st
-
-def main():
-    st.title("Data Collection..!!")
+import streamlit as st
+
+def main():
+    st.title("Step 2: Data Collection :rocket:")
+
+    st.markdown(
+        """
+        Data collection is the **foundation** of any NLP project. It involves gathering text data that your model will analyze and learn from. Think of it as collecting the raw materials needed to build something incredible!
+
+        To make data collection **efficient**, follow a step-by-step approach, starting with the easiest methods before progressing to more complex options.
+        """
+    )
+    st.divider()
+
+    # Section 1: Pre-Existing Datasets
+    st.subheader("1. Start with Pre-Existing Datasets :books:")
+    st.write(
+        """
+        The fastest way to gather data is to use **publicly available datasets**. These are ready-made datasets curated for various NLP tasks like sentiment analysis, chatbots, or classification.
+        """
+    )
+    st.markdown(
+        """
+        **Where to find datasets:**
+        - [Google Dataset Search](https://datasetsearch.research.google.com/)
+        - Platforms like [Kaggle](https://www.kaggle.com/), the **UCI Machine Learning Repository**, or [Hugging Face Datasets](https://huggingface.co/datasets).
+        - NLP libraries like **NLTK** and **spaCy** ship with built-in corpora and sample data.
+
+        **Why start here?**
+        - Saves time and effort.
+        - Datasets are often clean, well-structured, and ready to use.
+
+        **Example:**
+        - For a sentiment analysis project, the **IMDB Reviews dataset** from Kaggle contains pre-labeled positive and negative movie reviews; the sketch below shows how to load it.
+        """
+    )
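+    st.markdown("A minimal sketch of loading that kind of ready-made dataset with the Hugging Face `datasets` library (assumes `pip install datasets`):")
+    st.code(
+        '''
+# Load the pre-labeled IMDB movie-review dataset from the Hugging Face Hub.
+from datasets import load_dataset
+
+imdb = load_dataset("imdb")            # splits: train / test / unsupervised
+print(imdb["train"][0]["text"][:200])  # first 200 chars of the first review
+print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive
+''',
+        language="python",
+    )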
+    st.divider()
+
+    # Section 2: Using APIs
+    st.subheader("2. Use APIs for Structured Data :wrench:")
+    st.write(
+        """
+        If pre-existing datasets don't meet your needs, the next step is to use **APIs** to gather structured, specific data. APIs allow you to access text from platforms like social media, news websites, or product-review sites.
+        """
+    )
+    st.markdown(
+        """
+        **Popular APIs for text data:**
+        - **Social Media:** Twitter API, Reddit API.
+        - **News Articles:** NewsAPI, NYTimes API.
+        - **E-Commerce Reviews:** Amazon API, Yelp API.
+        - **Beginner-Friendly:** [RapidAPI](https://rapidapi.com/) for easy integration.
+
+        **Why use APIs?**
+        - They return clean, structured data.
+        - They let you collect **targeted data**, like tweets containing specific keywords or customer reviews for a product.
+
+        **Example:**
+        - Use the **Twitter API** to collect tweets on a trending topic for sentiment analysis; the sketch below shows the same request pattern against NewsAPI.
+        """
+    )
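+    st.markdown("A minimal sketch of pulling articles over HTTP with `requests`, here against NewsAPI's `/v2/everything` endpoint (the API key is a placeholder you would replace with your own):")
+    st.code(
+        '''
+import requests
+
+# Query NewsAPI for English-language articles about a keyword.
+# "YOUR_API_KEY" is a placeholder; get a real key at https://newsapi.org.
+resp = requests.get(
+    "https://newsapi.org/v2/everything",
+    params={"q": "electric vehicles", "language": "en", "apiKey": "YOUR_API_KEY"},
+    timeout=10,
+)
+resp.raise_for_status()
+articles = resp.json().get("articles", [])
+texts = [a["title"] + " " + (a.get("description") or "") for a in articles]
+print(f"Collected {len(texts)} article snippets")
+''',
+        language="python",
+    )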
+    st.divider()
+
+    # Section 3: Web Scraping
+    st.subheader("3. Web Scraping for Unstructured Data :spider_web:")
+    st.write(
+        """
+        If APIs are unavailable or insufficient, you can turn to **web scraping**. This involves extracting text data directly from websites like blogs, news portals, or e-commerce platforms.
+        """
+    )
+    st.markdown(
+        """
+        **Tools for web scraping:**
+        - **BeautifulSoup** and **Scrapy** for static websites.
+        - **Selenium** for dynamic websites that require interaction (like loading reviews).
+
+        **Why use web scraping?**
+        - Offers control over the exact data you collect.
+        - Ideal for extracting large amounts of **unstructured data**.
+
+        **Challenges to consider:**
+        - Websites may block scrapers with CAPTCHAs or other security restrictions.
+        - Always check a site's robots.txt and terms of service before scraping.
+        - Scraped data usually needs extra cleaning and preprocessing.
+
+        **Example:**
+        - Scrape customer reviews from an e-commerce site or extract news articles for a topic modeling project (see the sketch below).
+        """
+    )
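+    st.markdown("A minimal scraping sketch with `requests` + **BeautifulSoup** (the URL and the `div.review-text` selector are placeholders for a real target page):")
+    st.code(
+        '''
+import requests
+from bs4 import BeautifulSoup
+
+# Fetch a (placeholder) reviews page and extract the text of each review.
+html = requests.get("https://example.com/reviews", timeout=10).text
+soup = BeautifulSoup(html, "html.parser")
+
+reviews = [div.get_text(strip=True) for div in soup.select("div.review-text")]
+print(f"Scraped {len(reviews)} reviews")
+''',
+        language="python",
+    )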
+    st.divider()
+
+    # Section 4: Manual Data Collection
+    st.subheader("4. Collect Data on Your Own :scroll:")
+    st.write(
+        """
+        When existing datasets, APIs, or web scraping don't provide the data you need, the final step is to **collect or generate the data manually**. This method is more time-consuming but gives you full control over data quality and labeling.
+        """
+    )
+    st.markdown(
+        """
+        **Methods for manual data collection:**
+        - **Surveys & Feedback:** Gather text responses directly from users.
+        - **Manual Annotation:** Label raw text data yourself for classification tasks.
+        - **Synthetic Data Generation:** Use text augmentation techniques like paraphrasing or synonym replacement (see the sketch below).
+
+        **Why use this method?**
+        - Provides fully customized, niche data.
+        - Essential for projects requiring **domain-specific** or highly tailored data.
+
+        **Example:**
+        - Manually collect medical-advice conversations to build a healthcare chatbot.
+        """
+    )
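+    st.markdown("A minimal sketch of synonym-replacement augmentation (the tiny synonym map is a stand-in for a real thesaurus such as WordNet):")
+    st.code(
+        '''
+import random
+
+# Toy synonym map; a real pipeline might pull synonyms from WordNet instead.
+SYNONYMS = {"good": ["great", "decent"], "bad": ["poor", "awful"]}
+
+def augment(sentence: str) -> str:
+    """Swap known words for a random synonym to create a new variant."""
+    return " ".join(
+        random.choice(SYNONYMS.get(word, [word])) for word in sentence.split()
+    )
+
+print(augment("the movie was good but the ending was bad"))
+''',
+        language="python",
+    )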
+    st.divider()
+
+    # Summary Section
+    st.subheader("Summary: Data Collection Hierarchy :ruler:")
+    st.markdown(
+        """
+        To collect high-quality data efficiently, follow this structured approach:
+
+        1. **Start with Pre-Existing Datasets**: Ready-made, clean, and fast to use.
+        2. **Use APIs**: Gather structured, specific data.
+        3. **Web Scraping**: Extract unstructured data when APIs fall short.
+        4. **Manual Collection**: Create or annotate custom data for full control.
+
+        **Friendly Tip :bulb::**
+        Always start simple! Use pre-existing datasets or APIs first, and move to web scraping or manual collection only if necessary. The goal is to get **clean, diverse, and meaningful data**, because **better data = better NLP models**.
+        """
+    )
+    st.divider()
 main()