# Data Collection Guide
Everything you need to collect your saved content from each source before running the ingest pipeline.
---
## 1. Raindrop.io
OpenMark pulls **all your Raindrop collections automatically** via the official REST API. You just need a token.
**Steps:**
1. Go to [app.raindrop.io/settings/integrations](https://app.raindrop.io/settings/integrations)
2. Under "For Developers" → click **Create new app**
3. Copy the **Test token** (permanent, no expiry)
4. Add to `.env`:
```
RAINDROP_TOKEN=your-token-here
```
The pipeline fetches every collection, every sub-collection, and every unsorted raindrop automatically. No manual export needed.
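
The automatic fetch described above can be sketched as a simple paging loop. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes `RAINDROP_TOKEN` has been exported into the process environment (a `.env` file is not loaded automatically by Python). In the Raindrop API, collection ID `0` stands for all raindrops and `-1` for Unsorted.

```python
import json
import os
import urllib.request

API_ROOT = "https://api.raindrop.io/rest/v1"


def http_get_json(url, token):
    """Perform one authenticated GET and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_all_raindrops(token, collection_id=0, per_page=50, get_json=http_get_json):
    """Page through one collection until an empty page comes back.

    get_json is injectable so the loop can be exercised without network access.
    """
    items, page = [], 0
    while True:
        url = f"{API_ROOT}/raindrops/{collection_id}?page={page}&perpage={per_page}"
        batch = get_json(url, token).get("items", [])
        if not batch:
            return items
        items.extend(batch)
        page += 1


if __name__ == "__main__":
    raindrops = fetch_all_raindrops(os.environ["RAINDROP_TOKEN"])
    print(f"fetched {len(raindrops)} raindrops")
```

Stopping on the first empty page avoids relying on a total count field in the response.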
---
## 2. Browser Bookmarks (Edge / Chrome / Firefox)
Export your bookmarks as an HTML file in the Netscape bookmark format (all browsers support this).
**Edge:**
`Settings → Favourites → ··· (three dots) → Export favourites` → save as `favorites.html`
**Chrome:**
`Bookmarks Manager (Ctrl+Shift+O) → ··· → Export bookmarks` → save as `bookmarks.html`
**Firefox:**
`Bookmarks → Manage Bookmarks → Import and Backup → Export Bookmarks to HTML`
**After exporting:**
- Place the HTML file(s) in your `raindrop-mission` folder (or wherever `RAINDROP_MISSION_DIR` points)
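- The pipeline (`merge.py`) looks for files matching `favorites_*.html` and `bookmarks_*.html`, so keep an underscore suffix in the filename (for example `bookmarks_2025.html`); if these are treated as glob patterns, a bare `bookmarks.html` would not match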
- It parses the Netscape format and extracts URLs + titles + folder structure
> **Tip:** Export fresh before every ingest to capture new bookmarks.
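
To make the parsing step concrete, here is a minimal sketch of reading a Netscape bookmark file with Python's standard `html.parser`. It is not the code in `merge.py`; the class name, function name, and output field names (`url`, `title`, `folder`) are assumptions chosen for illustration. Folder names come from `<H3>` headings, and each nested `<DL>` opens the list belonging to the heading before it.

```python
from html.parser import HTMLParser


class NetscapeBookmarkParser(HTMLParser):
    """Collect links and their folder path from a Netscape bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._stack = []            # folder names for each open <DL>
        self._pending_folder = None  # last <H3> seen, waiting for its <DL>
        self._capture = None         # "folder", "link", or None
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._capture, self._text = "folder", []
        elif tag == "a":
            self._capture, self._text = "link", []
            self._href = dict(attrs).get("href")
        elif tag == "dl":
            # The root <DL> has no heading, so it pushes None.
            self._stack.append(self._pending_folder)
            self._pending_folder = None

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._text).strip()
        if tag == "h3" and self._capture == "folder":
            self._pending_folder, self._capture = text, None
        elif tag == "a" and self._capture == "link":
            folder = "/".join(name for name in self._stack if name)
            self.bookmarks.append({"url": self._href, "title": text, "folder": folder})
            self._capture = None
        elif tag == "dl" and self._stack:
            self._stack.pop()


def parse_bookmarks(html_text):
    parser = NetscapeBookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

Calling `parse_bookmarks(Path("bookmarks_2025.html").read_text(encoding="utf-8"))` (the filename is only an example) would return one dictionary per link.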
---
## 3. LinkedIn Saved Posts
LinkedIn has no public API for saved posts. OpenMark uses LinkedIn's internal **Voyager GraphQL API**, the same API the LinkedIn web app itself calls.
**This is the exact endpoint used:**
```
https://www.linkedin.com/voyager/api/graphql
?variables=(start:0,count:10,paginationToken:null,
query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))
&queryId=voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9
```
**How to get your session cookie:**
1. Log into LinkedIn in your browser
2. Open DevTools (`F12`) → **Application** tab → **Cookies** → `https://www.linkedin.com`
3. Find the cookie named `li_at` and copy its value
4. Also find `JSESSIONID` and copy its value (used as the CSRF token, format: `ajax:XXXXXXXXXXXXXXXXXX`)
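
As a rough sketch of how the two cookie values are typically combined into request headers: the function name is hypothetical, and the header names reflect what the LinkedIn web app is commonly observed to send rather than anything documented, so `linkedin_fetch.py` may do this differently.

```python
def build_voyager_headers(li_at, jsessionid):
    """Build request headers from the two cookie values copied from DevTools.

    Assumption: the CSRF header carries the same "ajax:..." value as the
    JSESSIONID cookie, without the surrounding double quotes.
    """
    # DevTools may display JSESSIONID wrapped in double quotes; strip them.
    csrf = jsessionid.strip().strip('"')
    return {
        "Cookie": f'li_at={li_at.strip()}; JSESSIONID="{csrf}"',
        "csrf-token": csrf,
    }
```

Treat both values like passwords: anyone holding `li_at` can act as your logged-in session.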
**Run the fetch script:**
```bash
python raindrop-mission/linkedin_fetch.py
```
Paste your `li_at` value when prompted.
**Output:** `raindrop-mission/linkedin_saved.json` (1,260 saved posts with author, content, and URL).
**Pagination:** LinkedIn returns 10 posts per page. The script detects the end of results when no `nextPageToken` is returned. With 1,260 posts, that is about 126 pages (1,260 / 10).
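
The paging logic can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `fetch_page` is a hypothetical callable that performs the request and returns `(posts, next_token)`, and the way the token is written back into `paginationToken` is a guess based on the endpoint's parameter names.

```python
import math

BASE = "https://www.linkedin.com/voyager/api/graphql"
QUERY_ID = "voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9"
PAGE_SIZE = 10


def expected_page_count(total_posts, page_size=PAGE_SIZE):
    """Number of requests needed to cover total_posts."""
    return math.ceil(total_posts / page_size)


def page_url(start, token=None):
    """Build the request URL for the page beginning at offset `start`."""
    variables = (
        f"(start:{start},count:{PAGE_SIZE},paginationToken:{token or 'null'},"
        "query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))"
    )
    return f"{BASE}?variables={variables}&queryId={QUERY_ID}"


def fetch_all_posts(fetch_page):
    """Request pages until a response arrives without a next-page token."""
    posts, start, token = [], 0, None
    while True:
        batch, token = fetch_page(page_url(start, token))
        posts.extend(batch)
        if not token:
            return posts
        start += PAGE_SIZE
```

Here `expected_page_count(1260)` evaluates to 126, which matches 1,260 posts at 10 per page.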
> **Important:** The `queryId` (`voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9`) is hardcoded in LinkedIn's JavaScript bundle and can change with LinkedIn deployments. If the script returns 0 results, intercept a fresh request from your browser's Network tab: filter for `voyagerSearchDashClusters` and copy the new `queryId`.
> **Personal use only.** This method is not officially supported by LinkedIn. Do not use for scraping at scale.
---
## 4. YouTube
Uses the official **YouTube Data API v3** via OAuth 2.0. Collects liked videos and any saved playlists; the Watch Later playlist comes from a Google Takeout export, described below.
**One-time setup:**
1. Go to [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project (e.g. "OpenMark")
3. Enable **YouTube Data API v3** (APIs & Services → Enable APIs)
4. Create credentials: **OAuth 2.0 Client ID** → Desktop App
5. Download the JSON file, rename it to `client_secret.json`, and place it in `raindrop-mission/`
6. Go to **OAuth consent screen** → Test users → add your Google account email
**Run the fetch script:**
```bash
python raindrop-mission/youtube_fetch.py
```
A browser window opens for Google sign-in. After you authenticate, a token is cached locally, so you won't need to sign in again.
**Output:** `raindrop-mission/youtube_MASTER.json` with:
- `liked_videos`: videos you've liked (up to ~3,200 via API limit)
- `watch_later`: requires Google Takeout (see below)
- `playlists`: saved playlists
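
For orientation, collecting liked videos through the API might look like the sketch below. It is not the code in `youtube_fetch.py`: the function name and output fields are assumptions, and `youtube` is assumed to be an authorised client created with `googleapiclient.discovery.build("youtube", "v3", credentials=...)` from the `google-api-python-client` package.

```python
def fetch_liked_videos(youtube, max_results=50):
    """Collect liked videos by following nextPageToken until it disappears."""
    videos, token = [], None
    while True:
        response = youtube.videos().list(
            part="snippet",
            myRating="like",        # restrict results to videos the user liked
            maxResults=max_results,  # 50 is the API's per-page maximum
            pageToken=token,
        ).execute()
        for item in response.get("items", []):
            videos.append({
                "id": item["id"],
                "title": item["snippet"]["title"],
                "url": f"https://www.youtube.com/watch?v={item['id']}",
            })
        token = response.get("nextPageToken")
        if not token:
            return videos
```

Because the client is passed in as an argument, the loop can be tried against a stub object before spending any API quota.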
**Watch Later via Google Takeout:**
YouTube's API does not expose Watch Later directly. Export it via [takeout.google.com](https://takeout.google.com):
- Select only **YouTube** → **Playlists** → Download
- Extract the CSV file named `Watch later-videos.csv`
- Place it in `raindrop-mission/`
- The `youtube_organize.py` script fetches video titles via API and includes them in `youtube_MASTER.json`
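
A minimal sketch of reading video IDs out of the Takeout CSV, before titles are looked up via the API. The function name is hypothetical, and the `Video ID` column header is an assumption about the export layout; check the first line of your own file, since Takeout's format has changed over time.

```python
import csv
import io


def read_watch_later_ids(csv_text, id_column="Video ID"):
    """Return unique, non-empty video IDs in the order they appear."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, ids = set(), []
    for row in reader:
        video_id = (row.get(id_column) or "").strip()
        if video_id and video_id not in seen:
            seen.add(video_id)
            ids.append(video_id)
    return ids
```

The text can be loaded with `Path("raindrop-mission/Watch later-videos.csv").read_text(encoding="utf-8")`.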
---
## 5. daily.dev Bookmarks
daily.dev does not provide a public API. Use the included browser console script to extract bookmarks directly from the page.
**Steps:**
1. Go to [app.daily.dev](https://app.daily.dev) → **Bookmarks**
2. Scroll all the way down to load all bookmarks
3. Open DevTools β†’ **Console** tab
4. Paste and run `raindrop-mission/dailydev_console_script.js`
5. The script copies a JSON array to your clipboard
6. Paste into a file named `dailydev_bookmarks.json` in `raindrop-mission/`
> The script filters for `/posts/` URLs only, ignoring profile links, squad links, and other noise.
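
After pasting, the file can be sanity-checked from Python. The sketch below re-applies the same `/posts/` rule and drops duplicate URLs; it assumes the pasted JSON is an array of objects that each carry a `url` field, which may not match the console script's exact output, and the function name is hypothetical.

```python
import json


def clean_dailydev_bookmarks(raw_json):
    """Keep entries whose URL contains /posts/, dropping repeated URLs."""
    entries = json.loads(raw_json)
    seen, kept = set(), []
    for entry in entries:
        url = entry.get("url", "")
        if "/posts/" in url and url not in seen:
            seen.add(url)
            kept.append(entry)
    return kept
```

If `json.loads` raises an error here, the clipboard paste was probably truncated and the console script should be run again.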
---
## Summary
| Source | Method | Output file |
|--------|--------|-------------|
| Raindrop | REST API (auto) | pulled live |
| Edge/Chrome/Firefox bookmarks | HTML export | `favorites_*.html` / `bookmarks_*.html` |
| LinkedIn saved posts | Voyager GraphQL + session cookie | `linkedin_saved.json` |
| YouTube liked/playlists | YouTube Data API v3 + OAuth | `youtube_MASTER.json` |
| YouTube watch later | Google Takeout CSV | included in `youtube_MASTER.json` |
| daily.dev bookmarks | Browser console script | `dailydev_bookmarks.json` |
Once all files are in place, run:
```bash
python scripts/ingest.py
```