LovnishVerma committed on
Commit 7eedd44 · verified · 1 Parent(s): c62b5b9

Update README.md

  1. README.md +56 -0
README.md CHANGED
@@ -145,3 +145,59 @@ This platform enforces secure engineering best practices:
 
  ## 📄 License
  Distributed under the MIT License. See [LICENSE](LICENSE) for more details.
+
+
+ # Test
+
+
+ This test targets **Aaj Tak (Hindi News)**. News websites are good scraping targets because they publish structured data that updates constantly.
+
+ Here are the four most useful things you can scrape from this page, along with how to fill out the configuration form for each.
+
+ ### 💡 Option 1: The Easiest & Cleanest Data (JSON-LD Metadata)
+
+ News sites embed structured data in the page for search engines. This page has a large `ItemList` JSON-LD block containing the top 20+ trending headlines and their URLs. It is the cleanest way to get the top stories without parsing HTML tags.
+
+ * **Job Name:** AajTak Top Stories (JSON)
+ * **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
+ * **Scrape Type:** Static (Requests + BS4)
+ * **Extraction Type:** JSON-LD Schema
+ * **CSS Selector:** *(Leave blank)*
+ * **Delay Between Requests (s):** 2
+
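To make the JSON-LD option concrete, here is a minimal sketch of what such an extraction could look like. It is not the platform's implementation: it uses only Python's standard library so it runs without dependencies, the names `JsonLdCollector` and `extract_item_list` are hypothetical, and the sample markup is invented rather than copied from the live site. It also assumes each list element carries `name` and `url` directly, which may not hold for every page.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_item_list(html):
    """Return (name, url) pairs from every ItemList JSON-LD block in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    stories = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the whole job
        # A JSON-LD script may hold a single object or a list of objects.
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict) or node.get("@type") != "ItemList":
                continue
            for item in node.get("itemListElement", []):
                if isinstance(item, dict):
                    stories.append((item.get("name"), item.get("url")))
    return stories


# Hypothetical sample markup, not taken from the live site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1, "name": "Story A", "url": "https://example.com/a"},
  {"@type": "ListItem", "position": 2, "name": "Story B", "url": "https://example.com/b"}
]}
</script>
</head><body></body></html>
"""

print(extract_item_list(SAMPLE_HTML))
```

No CSS selector is involved, which matches the blank selector field in the form above: the block is found by its `type` attribute alone.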
+ ### 💡 Option 2: Scraping All Article Headlines (Text)
+
+ Use this to pull the visible text of every news headline on the page (Top Stories, Sports, Entertainment, Tech, etc.).
+
+ * **Job Name:** AajTak All Headlines
+ * **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
+ * **Scrape Type:** Static (Requests + BS4)
+ * **Extraction Type:** Text Content
+ * **CSS Selector:** `.title h3, .fv-cap h3, .sstitle-listing h3, .title-big h3`
+ * **Delay Between Requests (s):** 2
+
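The selector in this option means "any `h3` nested inside an element with one of four classes". The sketch below illustrates that matching rule with the standard library only, as a dependency-free stand-in for a CSS selector engine; with BS4 the same query would normally go through `soup.select(...)`. The class `HeadlineParser`, the function `extract_headlines`, and the sample markup are hypothetical, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Container classes taken from the CSS selector in the form above.
CONTAINER_CLASSES = {"title", "fv-cap", "sstitle-listing", "title-big"}
# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HeadlineParser(HTMLParser):
    """Collects the text of <h3> elements that have a matching ancestor."""

    def __init__(self):
        super().__init__()
        self._stack = []       # (tag, is_container) for every open element
        self._h3_depth = None  # stack depth at which the current h3 opened
        self._text = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        inside_container = any(flag for _, flag in self._stack)
        if tag == "h3" and self._h3_depth is None and inside_container:
            self._h3_depth = len(self._stack)
            self._text = []
        self._stack.append((tag, bool(classes & CONTAINER_CLASSES)))

    def handle_endtag(self, tag):
        if all(open_tag != tag for open_tag, _ in self._stack):
            return  # stray closing tag: ignore it
        # Pop back to the matching open tag, discarding unclosed children.
        while self._stack.pop()[0] != tag:
            pass
        if self._h3_depth is not None and len(self._stack) <= self._h3_depth:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.headlines.append(text)
            self._h3_depth = None

    def handle_data(self, data):
        if self._h3_depth is not None:
            self._text.append(data)


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


# Hypothetical sample markup, not taken from the live site.
SAMPLE_HTML = """
<div class="title"><h3><a href="/a">First headline</a></h3></div>
<div class="fv-cap"><h3>  Second
   headline </h3></div>
<div class="other"><h3>Ignored</h3></div>
"""

print(extract_headlines(SAMPLE_HTML))
```

Collapsing whitespace before storing each headline keeps line breaks and indentation in the markup out of the results.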
+ ### 💡 Option 3: Scraping Article URLs (Links)
+
+ To build a crawler that finds news articles and scrapes their full text later, you first need the article URLs.
+
+ * **Job Name:** AajTak Article Links
+ * **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
+ * **Scrape Type:** Static (Requests + BS4)
+ * **Extraction Type:** HTML Attributes
+ * **CSS Selector:** `li[data-tb-region-item] a`
+ * **Attribute Name:** `href`
+
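As a rough illustration of this attribute extraction, the sketch below collects `href` values from `a` elements nested inside `li[data-tb-region-item]`. It goes one step beyond the form by resolving relative links against the target URL and dropping duplicates, which a crawler would likely want; whether the platform does either is not stated. It uses the standard library only, and `RegionLinkParser`, `extract_links`, and the sample paths are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RegionLinkParser(HTMLParser):
    """Collects href values of <a> tags inside <li data-tb-region-item>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self._stack = []  # (tag, is_region_item) for every open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        if tag == "a" and any(flag for _, flag in self._stack):
            href = attr_map.get("href")
            if href:
                url = urljoin(self.base_url, href)  # make relative links absolute
                if url not in self.links:           # keep first occurrence only
                    self.links.append(url)
        is_region_item = tag == "li" and "data-tb-region-item" in attr_map
        self._stack.append((tag, is_region_item))

    def handle_endtag(self, tag):
        if all(open_tag != tag for open_tag, _ in self._stack):
            return  # stray closing tag: ignore it
        # Pop back to the matching open tag (assumes mostly well-formed markup).
        while self._stack.pop()[0] != tag:
            pass


def extract_links(html, base_url):
    parser = RegionLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup and paths, not taken from the live site.
SAMPLE_HTML = """
<ul>
  <li data-tb-region-item><a href="/india/story-1">One</a></li>
  <li data-tb-region-item><a href="https://www.aajtak.in/sports/story-2">Two</a></li>
  <li data-tb-region-item><a href="/india/story-1">One again</a></li>
  <li><a href="/about">Not in a region item</a></li>
</ul>
"""

print(extract_links(SAMPLE_HTML, "https://www.aajtak.in/"))
```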
+ ### 💡 Option 4: Scraping Thumbnail Images
+
+ *Note: Because this site lazy-loads images for performance, the real image URL is not in the standard `src` attribute; it is stored in `data-src`.*
+
+ * **Job Name:** AajTak Thumbnails
+ * **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
+ * **Scrape Type:** Static (Requests + BS4)
+ * **Extraction Type:** HTML Attributes
+ * **CSS Selector:** `img.lazyload`
+ * **Attribute Name:** `data-src`
+
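The difference between `src` and `data-src` is easiest to see side by side. The following sketch, again standard library only and not the platform's code, reads a chosen attribute from every `img` whose class list contains `lazyload`. The names `LazyImageParser` and `extract_image_attr`, the placeholder file name, and the image URLs are all hypothetical.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Reads one attribute from every <img> that has the 'lazyload' class."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img .../> tags are also routed here by HTMLParser.
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "lazyload" in classes:
            value = attr_map.get(self.attribute)
            if value:
                self.values.append(value)


def extract_image_attr(html, attribute="data-src"):
    parser = LazyImageParser(attribute)
    parser.feed(html)
    return parser.values


# Hypothetical sample markup, not taken from the live site.
SAMPLE_HTML = """
<img class="lazyload thumb" src="placeholder.gif" data-src="https://example.com/img/1.jpg">
<img class="lazyload" src="placeholder.gif" data-src="https://example.com/img/2.jpg"/>
<img class="logo" src="https://example.com/logo.png">
"""

print(extract_image_attr(SAMPLE_HTML))          # real image URLs
print(extract_image_attr(SAMPLE_HTML, "src"))   # only the lazy-load placeholder
```

With this sample, reading `src` yields only the placeholder file name, which is why the form above asks for `data-src` instead.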
+ ---
+
+