Update README.md
README.md
CHANGED
@@ -145,3 +145,59 @@ This platform enforces secure engineering best practices:
## 📄 License
Distributed under the MIT License. See [LICENSE](LICENSE) for more details.

# Test

This test targets **Aaj Tak (Hindi News)**. News websites are good scraping targets because they publish structured data that updates constantly.

Here are the four most useful things you can scrape from this page, along with how to fill out the configuration form for each.

### 💡 Option 1: The Easiest & Cleanest Data (JSON-LD Metadata)

News sites embed structured data for search engines such as Google; it sits in the HTML source but is not rendered on the page. This page has a large `ItemList` JSON-LD block containing the top 20+ trending headlines and their exact URLs. This is the cleanest way to get the top stories without parsing messy HTML tags.

* **Job Name:** AajTak Top Stories (JSON)
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** JSON-LD Schema
* **CSS Selector:** *(Leave blank)*
* **Delay Between Requests (s):** 2
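
As a rough illustration of what a JSON-LD extraction like the one configured above might involve, the sketch below pulls `ItemList` entries out of a page. It deliberately uses only Python's standard library (`html.parser` and `json`) instead of Requests + BS4 so that it runs without extra dependencies. The class name, function name, sample markup, and the assumption that each `ListItem` carries a top-level `url` field are all hypothetical; the platform's real implementation is not shown in this README.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # raw JSON strings, one per script tag
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_item_list(html):
    """Return (position, url) pairs from the first ItemList block found."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole job
        candidates = data if isinstance(data, list) else [data]
        for node in candidates:
            if isinstance(node, dict) and node.get("@type") == "ItemList":
                return [
                    (item.get("position"), item.get("url"))
                    for item in node.get("itemListElement", [])
                    if isinstance(item, dict)
                ]
    return []


# Made-up sample markup (assumption); the live page's block is far larger.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "ItemList",
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://example.com/story-1"},
   {"@type": "ListItem", "position": 2, "url": "https://example.com/story-2"}
 ]}
</script>
</head><body></body></html>
"""

print(extract_item_list(SAMPLE_HTML))
```

On a real page the URL may instead be nested under an `item` key, so the field lookup would likely need adjusting after inspecting the actual block.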

### 💡 Option 2: Scraping All Article Headlines (Text)

Use this to pull the visible text of every news headline on the page (Top Stories, Sports, Entertainment, Tech, and so on).

* **Job Name:** AajTak All Headlines
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** Text Content
* **CSS Selector:** `.title h3, .fv-cap h3, .sstitle-listing h3, .title-big h3`
* **Delay Between Requests (s):** 2
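
To make the selector above more concrete: each comma-separated part matches an `<h3>` that is a descendant of an element carrying one of the listed classes. The sketch below imitates that descendant match with the standard-library `html.parser` (swapped in for BS4 so the example has no dependencies). The parser class, the helper function, and the sample markup are hypothetical, and the real page structure may differ.

```python
from html.parser import HTMLParser

# Class names taken from the CSS selector; each is paired with a descendant <h3>.
CONTAINER_CLASSES = {"title", "fv-cap", "sstitle-listing", "title-big"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HeadlineParser(HTMLParser):
    """Collect the text of every <h3> that has an ancestor with a listed class."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._stack = []     # (tag, set_of_classes) for each open element
        self._buffer = None  # text chunks while inside a matching <h3>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Check ancestors only, which is what the CSS descendant combinator means.
        matched = tag == "h3" and self._buffer is None and any(
            classes & CONTAINER_CLASSES for _, classes in self._stack
        )
        classes = set((dict(attrs).get("class") or "").split())
        self._stack.append((tag, classes))
        if matched:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.headlines.append(text)
            self._buffer = None
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


# Made-up sample markup (assumption), not copied from the live site.
SAMPLE_HTML = """
<div class="title"><a href="/a"><h3>First headline</h3></a></div>
<div class="fv-cap big"><h3>Second <b>headline</b></h3></div>
<div class="other"><h3>Not matched</h3></div>
"""

print(extract_headlines(SAMPLE_HTML))
```

With BS4 available, the same result would normally come from passing the selector string to `soup.select(...)` and reading each element's text.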

### 💡 Option 3: Scraping Article URLs (Links)

If you are building a crawler that finds news articles now and scrapes their full text later, you need the article URLs.

* **Job Name:** AajTak Article Links
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** HTML Attributes
* **CSS Selector:** `li[data-tb-region-item] a`
* **Attribute Name:** `href`
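
The attribute extraction above can be sketched as follows: find every `<a>` nested inside an `<li>` that carries the `data-tb-region-item` attribute and read its `href`. This is a dependency-free approximation using `html.parser` rather than BS4. Resolving relative links with `urljoin` and dropping duplicates are additions of this sketch, not documented platform behavior, and all names and sample markup are hypothetical. It also assumes every `<li>` is explicitly closed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.aajtak.in/"


class RegionLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <li data-tb-region-item>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._li_stack = []  # one bool per open <li>: does it carry the attribute?

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li":
            self._li_stack.append("data-tb-region-item" in attr_map)
        elif tag == "a" and any(self._li_stack):
            href = attr_map.get("href")
            if href:
                url = urljoin(self.base_url, href)  # make relative links absolute
                if url not in self.links:           # keep first occurrence only
                    self.links.append(url)

    def handle_endtag(self, tag):
        if tag == "li" and self._li_stack:
            self._li_stack.pop()


def extract_links(html, base_url=BASE_URL):
    parser = RegionLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample markup (assumption), not copied from the live site.
SAMPLE_HTML = """
<ul>
  <li data-tb-region-item><a href="/india/story/sample-1">One</a></li>
  <li data-tb-region-item><a href="https://www.aajtak.in/sports/story/sample-2">Two</a></li>
  <li><a href="/ignored">No attribute</a></li>
  <li data-tb-region-item><a href="/india/story/sample-1">Duplicate</a></li>
</ul>
"""

print(extract_links(SAMPLE_HTML))
```

The resulting list is the kind of seed queue a follow-up crawl job could consume one URL at a time.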

### 💡 Option 4: Scraping Thumbnail Images

*Note: Because this site uses "lazy loading" for performance, the actual image URL isn't in the standard `src` attribute; it's stored in `data-src`.*

* **Job Name:** AajTak Thumbnails
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** HTML Attributes
* **CSS Selector:** `img.lazyload`
* **Attribute Name:** `data-src`
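
A minimal sketch of the lazy-loading case described above: it reads `data-src` from every `<img>` whose class list includes `lazyload` and ignores images that only have a placeholder `src`. As with any dependency-free example, it uses `html.parser` in place of BS4, and the names and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect data-src values from <img class="... lazyload ..."> tags."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        # Only lazy-loaded images with a real URL in data-src are kept.
        if "lazyload" in classes and attr_map.get("data-src"):
            self.image_urls.append(attr_map["data-src"])


def extract_thumbnails(html):
    parser = LazyImageParser()
    parser.feed(html)
    return parser.image_urls


# Made-up sample markup (assumption), not copied from the live site.
SAMPLE_HTML = """
<img class="lazyload" src="placeholder.gif" data-src="https://example.com/thumb-1.jpg">
<img class="thumb lazyload" data-src="https://example.com/thumb-2.jpg" />
<img class="logo" src="https://example.com/logo.png">
<img class="lazyload" src="placeholder.gif">
"""

print(extract_thumbnails(SAMPLE_HTML))
```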
---