	# Data Collection Guide

	Everything you need to collect your saved content from each source before running the ingest pipeline.

	---

	## 1. Raindrop.io

	OpenMark pulls all your Raindrop collections automatically via the official REST API. You just need a token.

	Steps:
	1. Go to [app.raindrop.io/settings/integrations](https://app.raindrop.io/settings/integrations)
	2. Under "For Developers" → click Create new app
	3. Copy the Test token (permanent, no expiry)
	4. Add to `.env`:
	```
	RAINDROP_TOKEN=your-token-here
	```

	The pipeline fetches every collection, every sub-collection, and every unsorted raindrop automatically. No manual export needed.
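
For orientation, here is a minimal sketch of how paging through one collection could look. It is not OpenMark's actual code: `get_json` is a hypothetical helper that performs the HTTP GET, and the function names are illustrative. The endpoint path, the `page`/`perpage` parameters, and the Bearer header follow the public Raindrop REST API.

```python
import os
from typing import Callable, Iterator

API_BASE = "https://api.raindrop.io/rest/v1"
PER_PAGE = 50  # largest page size the Raindrop API accepts


def auth_headers() -> dict:
    """Build the Authorization header from RAINDROP_TOKEN in the environment."""
    return {"Authorization": f"Bearer {os.environ['RAINDROP_TOKEN']}"}


def iter_raindrops(collection_id: int, get_json: Callable[[str, dict], dict]) -> Iterator[dict]:
    """Yield every raindrop in one collection, page by page.

    get_json(url, params) is a hypothetical helper (an assumption of this
    sketch) that sends the GET request and returns the decoded JSON body.
    """
    page = 0
    while True:
        body = get_json(
            f"{API_BASE}/raindrops/{collection_id}",
            {"page": page, "perpage": PER_PAGE},
        )
        items = body.get("items", [])
        yield from items
        if len(items) < PER_PAGE:  # a short page means nothing is left
            return
        page += 1
```

With the `requests` package installed, `get_json` could be `lambda url, params: requests.get(url, headers=auth_headers(), params=params, timeout=30).json()`. A collection ID of `-1` selects unsorted raindrops.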

	---

	## 2. Browser Bookmarks (Edge / Chrome / Firefox)

	Export your bookmarks as an HTML file in the Netscape bookmark format (all browsers support this).

	Edge:
	`Settings → Favourites → ··· (three dots) → Export favourites` → save as `favorites.html`

	Chrome:
	`Bookmarks Manager (Ctrl+Shift+O) → ··· → Export bookmarks` → save as `bookmarks.html`

	Firefox:
	`Bookmarks → Manage Bookmarks → Import and Backup → Export Bookmarks to HTML`

	After exporting:
	- Place the HTML file(s) in your `raindrop-mission` folder (or wherever `RAINDROP_MISSION_DIR` points)
	- The pipeline (`merge.py`) looks for `favorites_.html` and `bookmarks_.html` patterns
	- It parses the Netscape format and extracts URLs + titles + folder structure
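
To make the parsing step concrete, the sketch below reads a Netscape bookmark file with the standard library only. It illustrates the format and is not the code in `merge.py`; the class name and the output dictionary keys are assumptions.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title, and folder path from a Netscape bookmark file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._folders = []     # stack of currently open folder names
        self._pending = None   # text of the latest <H3>, not yet opened by a <DL>
        self._tag = None       # "h3" or "a" while inside that element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self._tag, self._text = tag, []
            self._href = dict(attrs).get("href")
        elif tag == "dl":
            # A <DL> opens the folder named by the preceding <H3>.
            self._folders.append(self._pending)
            self._pending = None

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._text).strip()
        if tag == "h3" and self._tag == "h3":
            self._pending = text
        elif tag == "a" and self._tag == "a" and self._href:
            path = "/".join(name for name in self._folders if name)
            self.bookmarks.append({"url": self._href, "title": text, "folder": path})
        elif tag == "dl" and self._folders:
            self._folders.pop()
        if tag in ("h3", "a"):
            self._tag = None
```

Feed it the exported file's text, for example `parser.feed(open(path, encoding="utf-8").read())`, then read `parser.bookmarks`.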

	> Tip: Export fresh before every ingest to capture new bookmarks.

	---

	## 3. LinkedIn Saved Posts

	LinkedIn has no public API for saved posts. OpenMark uses LinkedIn's internal Voyager GraphQL API, the same one the LinkedIn web app calls.

	This is the exact endpoint used:
	```
	https://www.linkedin.com/voyager/api/graphql
	?variables=(start:0,count:10,paginationToken:null,
	query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))
	&queryId=voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9
	```

	How to get your session cookie:

	1. Log into LinkedIn in your browser
	2. Open DevTools (`F12`) → Application tab → Cookies → `https://www.linkedin.com`
	3. Find the cookie named `li_at` — copy its value
	4. Also find `JSESSIONID` — copy its value (used as CSRF token, format: `ajax:XXXXXXXXXXXXXXXXXX`)
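
As a rough sketch of how the endpoint and the two cookie values fit together, the helpers below build the request URL and headers. The function names are hypothetical, and the `csrf-token` header name is an assumption about what the Voyager API expects; `linkedin_fetch.py` may do this differently.

```python
QUERY_ID = "voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9"


def saved_posts_url(start: int = 0, count: int = 10, query_id: str = QUERY_ID) -> str:
    """Build the Voyager URL shown above for one page of saved posts."""
    variables = (
        f"(start:{start},count:{count},paginationToken:null,"
        "query:(flagshipSearchIntent:SEARCH_MY_ITEMS_SAVED_POSTS))"
    )
    return (
        "https://www.linkedin.com/voyager/api/graphql"
        f"?variables={variables}&queryId={query_id}"
    )


def session_headers(li_at: str, jsessionid: str) -> dict:
    """Turn the two cookie values into request headers."""
    # DevTools may show the JSESSIONID value wrapped in quotes; strip them.
    token = jsessionid.strip('"')
    return {
        "Cookie": f'li_at={li_at}; JSESSIONID="{token}"',
        "csrf-token": token,  # assumed header name for the CSRF token
    }
```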

	Run the fetch script:
	```bash
	python raindrop-mission/linkedin_fetch.py
	```
	Paste your `li_at` value when prompted.

	Output: `raindrop-mission/linkedin_saved.json` — 1,260 saved posts with author, content, and URL.

	Pagination: LinkedIn returns 10 posts per page. The script detects the end of results when no `nextPageToken` is returned. With 1,260 posts, that is 126 pages.
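
The stop condition can be sketched as a small loop. `fetch_page` is a hypothetical callable standing in for the real request and response parsing, and `max_pages` is an added safety cap rather than something the script is known to use.

```python
from typing import Callable, Optional, Tuple


def fetch_all_saved(
    fetch_page: Callable[[int, Optional[str]], Tuple[list, Optional[str]]],
    page_size: int = 10,
    max_pages: int = 500,
) -> list:
    """Collect posts page by page until no continuation token comes back.

    fetch_page(start, token) is a hypothetical callable that requests one
    page and returns (posts, next_token); next_token is None on the last page.
    """
    posts, token = [], None
    for page in range(max_pages):  # hard cap guards against an endless loop
        batch, token = fetch_page(page * page_size, token)
        posts.extend(batch)
        if not token:
            break
    return posts
```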

	> Important: The `queryId` (`voyagerSearchDashClusters.05111e1b90ee7fea15bebe9f9410ced9`) is hardcoded in LinkedIn's JavaScript bundle and can change with LinkedIn deployments. If the script returns 0 results, intercept a fresh request from your browser's Network tab — filter for `voyagerSearchDashClusters`, copy the new `queryId`.

	> Personal use only. This method is not officially supported by LinkedIn. Do not use for scraping at scale.

	---

	## 4. YouTube

	Uses the official YouTube Data API v3 via OAuth 2.0 to collect liked videos and saved playlists. The Watch Later playlist is not available through the API and comes from Google Takeout (see below).

	One-time setup:

	1. Go to [Google Cloud Console](https://console.cloud.google.com/)
	2. Create a new project (e.g. "OpenMark")
	3. Enable YouTube Data API v3 (APIs & Services → Enable APIs)
	4. Create credentials: OAuth 2.0 Client ID → Desktop App
	5. Download the JSON file — rename it to `client_secret.json` and place it in `raindrop-mission/`
	6. Go to OAuth consent screen → Test users → add your Google account email

	Run the fetch script:
	```bash
	python raindrop-mission/youtube_fetch.py
	```
	A browser window opens for Google sign-in. After auth, a token is cached locally — you won't need to auth again.

	Output: `raindrop-mission/youtube_MASTER.json` with:
	- `liked_videos` — videos you've liked (up to ~3,200 via API limit)
	- `watch_later` — requires Google Takeout (see below)
	- `playlists` — saved playlists
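
For reference, liked videos are typically read by paging through `videos.list` with `myRating="like"`. The sketch below assumes `youtube` is a Data API v3 service object (for example, one built with google-api-python-client after the OAuth sign-in). The function name and output shape are illustrative, not taken from `youtube_fetch.py`.

```python
def fetch_liked_videos(youtube) -> list:
    """Page through the authenticated user's liked videos.

    `youtube` is assumed to be a YouTube Data API v3 service object.
    """
    videos, page_token = [], None
    while True:
        response = youtube.videos().list(
            part="snippet",
            myRating="like",
            maxResults=50,        # largest page size this endpoint allows
            pageToken=page_token,
        ).execute()
        for item in response.get("items", []):
            videos.append({"id": item["id"], "title": item["snippet"]["title"]})
        page_token = response.get("nextPageToken")
        if not page_token:        # no token means this was the last page
            return videos
```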

	Watch Later via Google Takeout:
	YouTube's API does not expose Watch Later directly. Export it via [takeout.google.com](https://takeout.google.com):
	- Select only YouTube → Playlists → Download
	- Extract the CSV file named `Watch later-videos.csv`
	- Place it in `raindrop-mission/`
	- The `youtube_organize.py` script fetches video titles via API and includes them in `youtube_MASTER.json`
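
If you want to inspect the Takeout file before running the script, a short reader like the one below works. The `Video ID` column name is an assumption about the export layout, so check the header row of your own CSV; the function name is hypothetical.

```python
import csv
from pathlib import Path


def read_watch_later_ids(csv_path: str, id_column: str = "Video ID") -> list:
    """Return the video IDs listed in a Takeout playlist CSV.

    The "Video ID" header is an assumption; pass a different id_column
    if your export uses another name.
    """
    ids = []
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(id_column) or "").strip()
            if value:  # skip blank rows
                ids.append(value)
    return ids
```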

	---

	## 5. daily.dev Bookmarks

	daily.dev does not provide a public API. Use the included browser console script to extract bookmarks directly from the page.

	Steps:
	1. Go to [app.daily.dev](https://app.daily.dev) → Bookmarks
	2. Scroll all the way down to load all bookmarks
	3. Open DevTools → Console tab
	4. Paste and run `raindrop-mission/dailydev_console_script.js`
	5. The script copies a JSON array to your clipboard
	6. Paste into a file named `dailydev_bookmarks.json` in `raindrop-mission/`

	> The script filters for `/posts/` URLs only — it ignores profile links, squad links, and other noise.
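
A quick sanity check on the pasted JSON can catch stray or duplicate entries before ingest. This sketch assumes the file holds a list of objects with a `url` key; the console script's real field names may differ, and the function name is hypothetical.

```python
import json
from urllib.parse import urlparse


def clean_dailydev_bookmarks(raw_json: str) -> list:
    """Keep unique entries whose URL path contains /posts/.

    Assumes a JSON list of objects with a "url" key (an assumption of
    this sketch, not a documented format).
    """
    seen, kept = set(), []
    for entry in json.loads(raw_json):
        url = entry.get("url", "")
        if "/posts/" not in urlparse(url).path:
            continue  # profile links, squad links, and similar
        if url not in seen:
            seen.add(url)
            kept.append(entry)
    return kept
```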

	---

	## Summary

	| Source | Method | Output file |
	|--------|--------|-------------|
	| Raindrop | REST API (auto) | pulled live |
	| Edge/Chrome bookmarks | HTML export | `favorites.html` / `bookmarks.html` |
	| LinkedIn saved posts | Voyager GraphQL + session cookie | `linkedin_saved.json` |
	| YouTube liked/playlists | YouTube Data API v3 + OAuth | `youtube_MASTER.json` |
	| YouTube watch later | Google Takeout CSV | included in `youtube_MASTER.json` |
	| daily.dev bookmarks | Browser console script | `dailydev_bookmarks.json` |
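
A small pre-flight check can confirm the files from the table are present. This is a sketch, not part of the ingest pipeline: the helper name and the idea of accepting any `*.html` file as the browser export are assumptions.

```python
from pathlib import Path

EXPECTED_FILES = [
    "linkedin_saved.json",
    "youtube_MASTER.json",
    "dailydev_bookmarks.json",
]
HTML_HINT = "browser bookmark export (*.html)"


def missing_inputs(data_dir: str) -> list:
    """Return the expected inputs that are absent from data_dir."""
    base = Path(data_dir)
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    # Any HTML file counts as the browser export in this sketch.
    if not any(base.glob("*.html")):
        missing.append(HTML_HINT)
    return missing
```

Call it with the folder that holds your exports, for example `missing_inputs("raindrop-mission")`, and collect whatever it lists before continuing.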

	Once all files are in place, run:
	```bash
	python scripts/ingest.py
	```