# Data Collection Guide
Everything you need to collect your saved content from each source before running the ingest pipeline.
---
## 1. Raindrop.io
OpenMark pulls **all your Raindrop collections automatically** via the official REST API. You just need a token.
**Steps:**
1. Go to [app.raindrop.io/settings/integrations](https://app.raindrop.io/settings/integrations)
2. Under "For Developers" → click **Create new app**
3. Copy the **Test token** (permanent, no expiry)
4. Add to `.env`:
```
RAINDROP_TOKEN=your-token-here
```
The pipeline fetches every collection, every sub-collection, and every unsorted raindrop automatically. No manual export needed.
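The fetch loop can be sketched as follows. This is a minimal illustration using the public `/raindrops/{collection}` endpoint with 50-per-page batching, not OpenMark's actual code; the function names are illustrative:

```python
import json
import os
import urllib.parse
import urllib.request

API = "https://api.raindrop.io/rest/v1"

def page_url(collection_id: int, page: int, perpage: int = 50) -> str:
    query = urllib.parse.urlencode({"page": page, "perpage": perpage})
    return f"{API}/raindrops/{collection_id}?{query}"

def fetch_all_raindrops(token: str, collection_id: int = 0) -> list:
    """Fetch every raindrop in a collection (id 0 = all collections)."""
    items, page = [], 0
    while True:
        req = urllib.request.Request(
            page_url(collection_id, page),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            batch = json.load(resp).get("items", [])
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

if __name__ == "__main__":
    print(len(fetch_all_raindrops(os.environ["RAINDROP_TOKEN"])))
```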
---
## 2. Browser Bookmarks (Edge / Chrome / Firefox)
Export your bookmarks as an HTML file in the Netscape bookmark format (all browsers support this).
**Edge:**
`Settings → Favourites → ··· (three dots) → Export favourites` → save as `favorites.html`
**Chrome:**
`Bookmarks Manager (Ctrl+Shift+O) → ··· → Export bookmarks` → save as `bookmarks.html`
**Firefox:**
`Bookmarks β Manage Bookmarks β Import and Backup β Export Bookmarks to HTML`
**After exporting:**
- Place the HTML file(s) in your `raindrop-mission` folder (or wherever `RAINDROP_MISSION_DIR` points)
- The pipeline (`merge.py`) looks for `favorites_*.html` and `bookmarks_*.html` patterns
- It parses the Netscape format and extracts URLs + titles + folder structure
> **Tip:** Export fresh before every ingest to capture new bookmarks.
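The Netscape format is plain HTML (`<H3>` headings for folders, `<A HREF>` anchors for bookmarks), so the stdlib `html.parser` is enough. A minimal sketch of the kind of parsing `merge.py` performs — class name and output shape are illustrative, not the actual implementation:

```python
from html.parser import HTMLParser

class NetscapeBookmarkParser(HTMLParser):
    """Collects (folder_path, url, title) triples from a Netscape export."""

    def __init__(self):
        super().__init__()
        self.folders = []    # current folder path, from <H3> headings
        self.bookmarks = []  # (folder_path, url, title)
        self._href = None
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = ""
        elif tag == "h3":
            self._text = ""

    def handle_data(self, data):
        self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            path = "/".join(self.folders)
            self.bookmarks.append((path, self._href, self._text.strip()))
            self._href = None
        elif tag == "h3":
            self.folders.append(self._text.strip())
        elif tag == "dl" and self.folders:
            self.folders.pop()  # leaving a folder's <DL> list
```

Feed it the exported file's contents with `parser.feed(html_text)` and read `parser.bookmarks`.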
---
## 3. LinkedIn Saved Posts
LinkedIn has no public API for saved posts. OpenMark therefore uses the internal **Voyager GraphQL API** — the same API the LinkedIn web app itself calls.
**This is the exact endpoint used:**
```
https://www.linkedin.com/voyager/api/graphql
?variables=(start:0,count:10,paginationToken:null,
query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))
&queryId=voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9
```
**How to get your session cookie:**
1. Log into LinkedIn in your browser
2. Open DevTools (`F12`) → **Application** tab → **Cookies** → `https://www.linkedin.com`
3. Find the cookie named `li_at` → copy its value
4. Also find `JSESSIONID` → copy its value (used as the CSRF token, format: `ajax:XXXXXXXXXXXXXXXXXX`)
**Run the fetch script:**
```bash
python raindrop-mission/linkedin_fetch.py
```
Paste your `li_at` value when prompted.
**Output:** `raindrop-mission/linkedin_saved.json` containing 1,260 saved posts with author, content, and URL.
**Pagination:** LinkedIn returns 10 posts per page. The script detects the end of results when no `nextPageToken` is returned. With 1,260 posts that's ~126 pages.
> **Important:** The `queryId` (`voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9`) is hardcoded in LinkedIn's JavaScript bundle and can change with LinkedIn deployments. If the script returns 0 results, intercept a fresh request in your browser's Network tab: filter for `voyagerSearchDashClusters` and copy the new `queryId`.
> **Personal use only.** This method is not officially supported by LinkedIn. Do not use for scraping at scale.
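Under those caveats, a request to the endpoint above can be sketched like this. The header names and the quoted-`JSESSIONID` convention are observations from the web app's own traffic, not a documented contract, and the function names are illustrative:

```python
import urllib.request

GRAPHQL = "https://www.linkedin.com/voyager/api/graphql"

def csrf_token(jsessionid: str) -> str:
    # JSESSIONID is stored quoted in the cookie jar ('"ajax:123..."');
    # the csrf-token header must carry the unquoted value.
    return jsessionid.strip('"')

def page_url(query_id: str, start: int = 0, pagination_token: str = "null") -> str:
    # Mirrors the variables string shown above; count is fixed at 10.
    variables = (
        f"(start:{start},count:10,paginationToken:{pagination_token},"
        "query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))"
    )
    return f"{GRAPHQL}?variables={variables}&queryId={query_id}"

def voyager_request(url: str, li_at: str, jsessionid: str) -> urllib.request.Request:
    return urllib.request.Request(url, headers={
        "cookie": f"li_at={li_at}; JSESSIONID={jsessionid}",
        "csrf-token": csrf_token(jsessionid),
        "accept": "application/vnd.linkedin.normalized+json+2.1",
    })
```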
---
## 4. YouTube
Uses the official **YouTube Data API v3** via OAuth 2.0. Collects liked videos, watch later playlist, and any saved playlists.
**One-time setup:**
1. Go to [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project (e.g. "OpenMark")
3. Enable **YouTube Data API v3** (APIs & Services → Enable APIs)
4. Create credentials: **OAuth 2.0 Client ID** → Desktop App
5. Download the JSON file → rename it to `client_secret.json` and place it in `raindrop-mission/`
6. Go to **OAuth consent screen** → Test users → add your Google account email
**Run the fetch script:**
```bash
python raindrop-mission/youtube_fetch.py
```
A browser window opens for Google sign-in. After authorization, a token is cached locally, so you won't need to sign in again.
**Output:** `raindrop-mission/youtube_MASTER.json` with:
- `liked_videos`: videos you've liked (up to ~3,200 due to an API limit)
- `watch_later`: requires Google Takeout (see below)
- `playlists`: saved playlists
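All of these fetches follow the same `pageToken` loop. A generic sketch — the `collect_pages` helper is illustrative, and the commented call shows how it would wrap the official client:

```python
from typing import Callable, List, Optional, Tuple

def collect_pages(
    fetch: Callable[[Optional[str]], Tuple[List[dict], Optional[str]]],
) -> List[dict]:
    """Drain a pageToken-style API: fetch(token) -> (items, next_token)."""
    items: List[dict] = []
    token: Optional[str] = None
    while True:
        batch, token = fetch(token)
        items.extend(batch)
        if not token:
            return items

# With the official client (pip install google-api-python-client), the
# fetch callable would wrap a call along these lines:
#   resp = youtube.videos().list(myRating="liked", part="snippet",
#                                maxResults=50, pageToken=token).execute()
#   return resp.get("items", []), resp.get("nextPageToken")
```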
**Watch Later via Google Takeout:**
YouTube's API does not expose Watch Later directly. Export it via [takeout.google.com](https://takeout.google.com):
- Select only **YouTube** → **Playlists** → Download
- Extract the CSV file named `Watch later-videos.csv`
- Place it in `raindrop-mission/`
- The `youtube_organize.py` script fetches video titles via API and includes them in `youtube_MASTER.json`
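As a rough sketch of what `youtube_organize.py` has to do with that CSV — the column layout is assumed from recent Takeout exports (video ID in the first column), so verify it against your own file:

```python
import csv
import io

def watch_later_ids(csv_text: str) -> list:
    """Extract video IDs from a Takeout 'Watch later-videos.csv' dump.

    Assumes the first column holds the video ID; recent Takeout exports
    follow this layout, but it is not a stable contract.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header line
    return [row[0].strip() for row in rows if row and row[0].strip()]

def watch_url(video_id: str) -> str:
    return f"https://www.youtube.com/watch?v={video_id}"
```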
---
## 5. daily.dev Bookmarks
daily.dev does not provide a public API. Use the included browser console script to extract bookmarks directly from the page.
**Steps:**
1. Go to [app.daily.dev](https://app.daily.dev) → **Bookmarks**
2. Scroll all the way down to load all bookmarks
3. Open DevTools → **Console** tab
4. Paste and run `raindrop-mission/dailydev_console_script.js`
5. The script copies a JSON array to your clipboard
6. Paste into a file named `dailydev_bookmarks.json` in `raindrop-mission/`
> The script filters for `/posts/` URLs only, ignoring profile links, squad links, and other noise.
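That same filtering rule is easy to replicate when post-processing `dailydev_bookmarks.json`; a sketch, with an illustrative function name:

```python
from urllib.parse import urlparse

def is_dailydev_post(url: str) -> bool:
    """True for daily.dev post links; False for profiles, squads, etc."""
    parsed = urlparse(url)
    return parsed.netloc == "app.daily.dev" and parsed.path.startswith("/posts/")
```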
---
## Summary
| Source | Method | Output file |
|--------|--------|-------------|
| Raindrop | REST API (auto) | pulled live |
| Edge/Chrome bookmarks | HTML export | `favorites.html` / `bookmarks.html` |
| LinkedIn saved posts | Voyager GraphQL + session cookie | `linkedin_saved.json` |
| YouTube liked/playlists | YouTube Data API v3 + OAuth | `youtube_MASTER.json` |
| YouTube watch later | Google Takeout CSV | included in `youtube_MASTER.json` |
| daily.dev bookmarks | Browser console script | `dailydev_bookmarks.json` |
Once all files are in place, run:
```bash
python scripts/ingest.py
```