# Data Collection Guide

Everything you need to collect your saved content from each source before running the ingest pipeline.

---
## 1. Raindrop.io

OpenMark pulls **all your Raindrop collections automatically** via the official REST API. You just need a token.

**Steps:**

1. Go to [app.raindrop.io/settings/integrations](https://app.raindrop.io/settings/integrations)
2. Under "For Developers", click **Create new app**
3. Copy the **Test token** (permanent, no expiry)
4. Add to `.env`:

```
RAINDROP_TOKEN=your-token-here
```

The pipeline fetches every collection, every sub-collection, and every unsorted raindrop automatically. No manual export needed.
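For reference, the automated pull can be sketched against Raindrop's documented `GET /rest/v1/raindrops/{collection}` endpoint, where collection `0` means "all raindrops". This is an illustrative sketch, not the pipeline's actual code; `build_request` and `fetch_all` are hypothetical names:

```python
import json
import urllib.request

API = "https://api.raindrop.io/rest/v1"

def build_request(token: str, page: int, perpage: int = 50) -> urllib.request.Request:
    """Request for one page of raindrops; collection 0 means 'all raindrops'."""
    url = f"{API}/raindrops/0?page={page}&perpage={perpage}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

def fetch_all(token: str) -> list:
    """Page through every raindrop until an empty page comes back."""
    items, page = [], 0
    while True:
        with urllib.request.urlopen(build_request(token, page)) as resp:
            batch = json.load(resp).get("items", [])
        if not batch:
            return items
        items.extend(batch)
        page += 1
```

Each item in the response carries the bookmark's `link`, `title`, and collection metadata, so no per-collection export step is needed.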
---
## 2. Browser Bookmarks (Edge / Chrome / Firefox)

Export your bookmarks as an HTML file in the Netscape bookmark format (all browsers support this).

**Edge:**
`Settings → Favourites → ··· (three dots) → Export favourites` → save as `favorites.html`

**Chrome:**
`Bookmarks Manager (Ctrl+Shift+O) → ··· → Export bookmarks` → save as `bookmarks.html`

**Firefox:**
`Bookmarks → Manage Bookmarks → Import and Backup → Export Bookmarks to HTML`

**After exporting:**

- Place the HTML file(s) in your `raindrop-mission` folder (or wherever `RAINDROP_MISSION_DIR` points)
- The pipeline (`merge.py`) looks for the `favorites_*.html` and `bookmarks_*.html` patterns
- It parses the Netscape format and extracts URLs, titles, and folder structure

> **Tip:** Export fresh before every ingest to capture new bookmarks.
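The Netscape format is simple enough to read with the standard library: `<DT><H3>…</H3>` opens a folder, `<DT><A HREF="…">…</A>` is a bookmark, and `</DL>` closes a folder. A minimal sketch (not `merge.py`'s actual parser) that recovers folder path, URL, and title:

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Minimal reader for Netscape bookmark exports."""

    def __init__(self):
        super().__init__()
        self.links = []    # (folder_path, url, title) tuples
        self._stack = []   # folder names enclosing the current position
        self._tag = None
        self._href = None

    def handle_starttag(self, tag, attrs):
        self._tag = tag
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h3":          # folder name
            self._stack.append(text)
        elif self._tag == "a" and self._href:  # bookmark title
            self.links.append(("/".join(self._stack), self._href, text))

    def handle_endtag(self, tag):
        if tag == "dl" and self._stack:  # leaving a folder
            self._stack.pop()
        self._tag = None
```

Feed the exported file's contents to `BookmarkParser().feed(...)` and read the `links` list afterwards.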
---
## 3. LinkedIn Saved Posts

LinkedIn has no public API for saved posts. OpenMark uses LinkedIn's internal **Voyager GraphQL API**, the same API the LinkedIn web app itself calls.

**This is the exact endpoint used:**

```
https://www.linkedin.com/voyager/api/graphql
  ?variables=(start:0,count:10,paginationToken:null,
    query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))
  &queryId=voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9
```

**How to get your session cookie:**

1. Log into LinkedIn in your browser
2. Open DevTools (`F12`) → **Application** tab → **Cookies** → `https://www.linkedin.com`
3. Find the cookie named `li_at` and copy its value
4. Also copy the value of `JSESSIONID` (used as the CSRF token; format: `ajax:XXXXXXXXXXXXXXXXXX`)

**Run the fetch script:**

```bash
python raindrop-mission/linkedin_fetch.py
```

Paste your `li_at` value when prompted.

**Output:** `raindrop-mission/linkedin_saved.json` with 1,260 saved posts, each carrying author, content, and URL.

**Pagination:** LinkedIn returns 10 posts per page; the script detects the end of results when no `nextPageToken` is returned. With 1,260 posts that's 126 pages.

> **Important:** The `queryId` (`voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9`) is hardcoded in LinkedIn's JavaScript bundle and can change with LinkedIn deployments. If the script returns 0 results, intercept a fresh request in your browser's Network tab: filter for `voyagerSearchDashClusters` and copy the new `queryId`.

> **Personal use only.** This method is not officially supported by LinkedIn. Do not use it for scraping at scale.
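For illustration, the request the script sends can be assembled as below. `build_page_url` and `build_headers` are hypothetical helper names, and the exact header set is an assumption based on how the web app authenticates (session cookie plus a `csrf-token` header mirroring `JSESSIONID`):

```python
QUERY_ID = "voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9"

def build_page_url(start: int, count: int = 10, page_token: str = "null") -> str:
    # Voyager encodes GraphQL variables in a parenthesised, comma-separated form.
    variables = (
        f"(start:{start},count:{count},paginationToken:{page_token},"
        "query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))"
    )
    return ("https://www.linkedin.com/voyager/api/graphql"
            f"?variables={variables}&queryId={QUERY_ID}")

def build_headers(li_at: str, jsessionid: str) -> dict:
    # JSESSIONID doubles as the CSRF token; surrounding quotes are stripped
    # for the header but kept in the cookie string.
    csrf = jsessionid.strip('"')
    return {
        "cookie": f'li_at={li_at}; JSESSIONID="{csrf}"',
        "csrf-token": csrf,
    }
```

The fetch loop then advances `start` by `count` each page until the response stops returning a `nextPageToken`.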
---
## 4. YouTube

Uses the official **YouTube Data API v3** via OAuth 2.0. Collects liked videos, the Watch Later playlist, and any saved playlists.

**One-time setup:**

1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project (e.g. "OpenMark")
3. Enable the **YouTube Data API v3** (APIs & Services → Enable APIs)
4. Create credentials: **OAuth 2.0 Client ID** → Desktop App
5. Download the JSON file, rename it to `client_secret.json`, and place it in `raindrop-mission/`
6. Go to **OAuth consent screen** → Test users → add your Google account email

**Run the fetch script:**

```bash
python raindrop-mission/youtube_fetch.py
```

A browser window opens for Google sign-in. After auth, a token is cached locally, so you won't need to authenticate again.

**Output:** `raindrop-mission/youtube_MASTER.json` with:

- `liked_videos`: videos you've liked (capped at ~3,200 by the API)
- `watch_later`: requires Google Takeout (see below)
- `playlists`: saved playlists

**Watch Later via Google Takeout:**

YouTube's API does not expose Watch Later directly. Export it via [takeout.google.com](https://takeout.google.com):

- Select only **YouTube** → **Playlists** → Download
- Extract the CSV file named `Watch later-videos.csv`
- Place it in `raindrop-mission/`
- The `youtube_organize.py` script fetches video titles via the API and merges them into `youtube_MASTER.json`
---
## 5. daily.dev Bookmarks

daily.dev does not provide a public API. Use the included browser console script to extract bookmarks directly from the page.

**Steps:**

1. Go to [app.daily.dev](https://app.daily.dev) → **Bookmarks**
2. Scroll all the way down so every bookmark is loaded
3. Open DevTools → **Console** tab
4. Paste and run `raindrop-mission/dailydev_console_script.js`
5. The script copies a JSON array to your clipboard
6. Paste it into a file named `dailydev_bookmarks.json` in `raindrop-mission/`

> The script keeps only `/posts/` URLs; it ignores profile links, squad links, and other noise.
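The filtering rule the console script applies can be expressed in a few lines; this Python sketch (with the illustrative name `keep_posts`) mirrors the same idea and also deduplicates, since scrolled pages often repeat cards:

```python
def keep_posts(urls: list) -> list:
    """Keep only daily.dev post links, dropping profiles, squads,
    and other navigation noise, preserving first-seen order."""
    seen, posts = set(), []
    for url in urls:
        if "/posts/" in url and url not in seen:
            seen.add(url)
            posts.append(url)
    return posts
```

Anything the script collects that is not a `/posts/` URL never reaches `dailydev_bookmarks.json`.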
---
## Summary

| Source | Method | Output file |
|--------|--------|-------------|
| Raindrop | REST API (auto) | pulled live |
| Edge/Chrome bookmarks | HTML export | `favorites.html` / `bookmarks.html` |
| LinkedIn saved posts | Voyager GraphQL + session cookie | `linkedin_saved.json` |
| YouTube liked/playlists | YouTube Data API v3 + OAuth | `youtube_MASTER.json` |
| YouTube Watch Later | Google Takeout CSV | included in `youtube_MASTER.json` |
| daily.dev bookmarks | Browser console script | `dailydev_bookmarks.json` |

Once all files are in place, run:

```bash
python scripts/ingest.py
```