# Quickstart — Paper Espresso Retrievers

This directory contains the three data retrievers. All of them call the
**Gemini API** (for summaries / trending) and read & write **HuggingFace**
datasets, so both keys must be configured before running anything.

---

## 1. Configure keys in `.env`

Create a file named `.env` in the **project root** (one level above `src/`,
i.e. `Paper_Espresso/.env`). The retrievers load it automatically on import.

```dotenv
GEMINI_API_KEY=AIza............................
HF_TOKEN=hf_..................................
```

Rules:

- One `KEY=value` per line. **No quotes, no spaces around `=`, no trailing
  spaces/newlines.**
- `GEMINI_API_KEY` — from <https://aistudio.google.com/apikey>. The free tier
  is limited (~250 requests/day for `gemini-3.1-pro`); batch runs over a date
  range will exhaust it quickly. Use a billed / higher-quota key for bulk work.
- `HF_TOKEN` — from <https://huggingface.co/settings/tokens>. It **must have
  Write permission** on the target datasets. Reading public datasets may
  succeed even with a bad token, but pushing will fail.

Both are also read from real environment variables if set; the environment
takes precedence over `.env`.

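
The precedence rule above can be illustrated with a small loader. This is a minimal sketch, not the retrievers' actual loading code: `load_dotenv_file` is a hypothetical helper, and it assumes the plain `KEY=value` format described in the rules (no quotes, no spaces around `=`).

```python
import os
from pathlib import Path


def load_dotenv_file(path, environ=None):
    """Load KEY=value lines from `path` without overriding existing variables.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    env_file = Path(path)
    if not env_file.is_file():
        return environ
    for line in env_file.read_text().splitlines():
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line.strip() or line.lstrip().startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault keeps a value that is already present, which is what
        # gives the real environment precedence over .env.
        environ.setdefault(key, value)
    return environ
```

With this shape, exporting `HF_TOKEN` in the shell overrides whatever `.env` contains, while a key that is only in `.env` is still picked up.
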
> **Note:** Missing or invalid keys now raise a hard error (`RuntimeError:
> GEMINI_API_KEY not set` / `HF_TOKEN not set`, or an HF auth error) instead
> of failing silently. There is no offline mode — a valid `HF_TOKEN` is
> required even to read the cache. Use `--no-push` for a dry run that still
> needs the keys but does not write back to HuggingFace.

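
The fail-fast behaviour described in the note could look roughly like the following. It is a sketch under assumptions, not the retrievers' code: `require_keys` and `REQUIRED_KEYS` are hypothetical names, and only the error message format is taken from the note.

```python
import os

# Hypothetical constant: the two keys this README says must be configured.
REQUIRED_KEYS = ("GEMINI_API_KEY", "HF_TOKEN")


def require_keys(environ=None):
    """Return the required keys, raising RuntimeError if one is missing or empty."""
    environ = os.environ if environ is None else environ
    for name in REQUIRED_KEYS:
        # An unset variable and an empty string are treated the same way.
        if not environ.get(name):
            raise RuntimeError(f"{name} not set")
    return {name: environ[name] for name in REQUIRED_KEYS}
```

A check like this only proves the variables are present. Whether the token is actually valid is still decided by Gemini and HuggingFace on the first request.
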
Verify the keys work:

```bash
# Gemini
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('GEMINI_API_KEY=')];from google import genai;print(genai.Client(api_key=os.environ['GEMINI_API_KEY']).models.generate_content(model='gemini-3.1-pro-preview',contents='ping').text)"
# HuggingFace
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('HF_TOKEN=')];from huggingface_hub import HfApi;print(HfApi(token=os.environ['HF_TOKEN']).whoami()['name'])"
```

---

## 2. Run the three retrievers

Run all commands from the **project root**. `uv` resolves dependencies
automatically.

### Daily — `daily_retrieve.py`

Fetches the HuggingFace daily papers list, summarizes each paper with Gemini,
generates a daily trending digest, and pushes to `Elfsong/hf_paper_summary`
and `Elfsong/hf_paper_daily_trending`.

```bash
uv run python src/daily_retrieve.py                                      # yesterday
uv run python src/daily_retrieve.py --date 2026-03-25                    # single day
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31   # date range (inclusive)
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31 --workers 4
uv run python src/daily_retrieve.py --date 2026-03-25 --recollect        # force re-summarize, ignore cache
uv run python src/daily_retrieve.py --date 2026-03-25 --no-push          # dry run, don't push to HF
```

| Flag          | Meaning                                                |
| ------------- | ------------------------------------------------------ |
| `--date`      | Start date `YYYY-MM-DD` (default: yesterday, UTC)      |
| `--end`       | End date `YYYY-MM-DD`, inclusive — enables range mode  |
| `--workers`   | Parallel workers for a range (default: 1)              |
| `--recollect` | Re-summarize all papers, ignoring local/HF cache       |
| `--no-push`   | Skip pushing to HuggingFace                            |

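
To make the `--date` / `--end` semantics concrete, here is one way the flags could expand into a list of days. This is an illustrative sketch only: `resolve_dates` is a hypothetical function, and the script's real argument handling may differ.

```python
from datetime import date, datetime, timedelta, timezone


def resolve_dates(start=None, end=None, today=None):
    """Expand optional YYYY-MM-DD strings into an inclusive list of dates.

    Hypothetical helper: `today` exists only so the default can be tested.
    """
    today = today or datetime.now(timezone.utc).date()
    # No --date: default to yesterday in UTC.
    first = date.fromisoformat(start) if start else today - timedelta(days=1)
    # No --end: single-day mode.
    last = date.fromisoformat(end) if end else first
    if last < first:
        raise ValueError("--end must not be earlier than --date")
    # +1 because the end date is inclusive.
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]
```

For example, `resolve_dates("2026-03-01", "2026-03-31")` yields 31 dates, which is also the number of daily runs a range over March would trigger.
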
### Monthly — `monthly_retrieve.py`

Aggregates a month of daily data already on HuggingFace into monthly
summaries / trending / insights. Run **after** the daily retriever has
populated the month.

```bash
uv run python src/monthly_retrieve.py                            # last completed month
uv run python src/monthly_retrieve.py --month 2026-03            # specific month
uv run python src/monthly_retrieve.py --month 2026-02 --no-push  # dry run
```

| Flag        | Meaning                                              |
| ----------- | ---------------------------------------------------- |
| `--month`   | Month `YYYY-MM` (default: last completed month, UTC) |
| `--no-push` | Skip pushing results to HuggingFace                  |

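
The "last completed month, UTC" default can be computed as below. This is a sketch assuming "last completed" means the calendar month before the current UTC date; `last_completed_month` is a hypothetical name, not a function from the script.

```python
from datetime import date, datetime, timezone


def last_completed_month(today=None):
    """Return the month before `today` (UTC by default) as YYYY-MM."""
    today = today or datetime.now(timezone.utc).date()
    year, month = today.year, today.month - 1
    # January rolls back to December of the previous year.
    if month == 0:
        year, month = year - 1, 12
    return f"{year:04d}-{month:02d}"
```
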
### Lifecycle — `lifecycle_retrieve.py`

Computes bimonthly Gartner-style hype-cycle snapshots for research topics
from all available paper data, and pushes to `Elfsong/hf_paper_lifecycle`.

```bash
uv run python src/lifecycle_retrieve.py                             # latest bimonthly snapshot
uv run python src/lifecycle_retrieve.py --snapshot 2025-06          # specific snapshot (even month)
uv run python src/lifecycle_retrieve.py --all                       # all missing snapshots
uv run python src/lifecycle_retrieve.py --snapshot 2025-06 --force  # overwrite existing
uv run python src/lifecycle_retrieve.py --all --no-push             # dry run
```

| Flag         | Meaning                                                |
| ------------ | ------------------------------------------------------ |
| `--snapshot` | Snapshot month `YYYY-MM` (even month; default: latest) |
| `--all`      | Compute all missing bimonthly snapshots                |
| `--force`    | Re-compute and overwrite existing snapshots            |
| `--no-push`  | Skip pushing results to HuggingFace                    |

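
As a rough picture of what "bimonthly, even month" and "all missing snapshots" mean, the sketch below enumerates even-month labels in a range and filters out ones that already exist. Both functions are hypothetical illustrations; how the script actually chooses its range and detects existing snapshots is not described here.

```python
def bimonthly_snapshots(first, last):
    """List even-month YYYY-MM labels between `first` and `last`, inclusive."""
    first_year, first_month = map(int, first.split("-"))
    last_year, last_month = map(int, last.split("-"))
    labels = []
    year, month = first_year, first_month
    while (year, month) <= (last_year, last_month):
        # Snapshots fall on even months only: 02, 04, ..., 12.
        if month % 2 == 0:
            labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def missing_snapshots(first, last, existing):
    """Keep only the labels that are not already in `existing` (like --all)."""
    done = set(existing)
    return [label for label in bimonthly_snapshots(first, last) if label not in done]
```

Under these assumptions, a request for an odd month such as `2025-07` would not match any snapshot label, which is why the flag table asks for an even month.
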
---

## Typical pipeline order

```
daily_retrieve.py     -> per-day summaries + trending (run first, per day)
monthly_retrieve.py   -> monthly aggregation (after a month is filled)
lifecycle_retrieve.py -> bimonthly hype-cycle snapshots (after enough history)
```

## Troubleshooting

| Symptom (log line) | Cause | Fix |
| ------------------ | ----- | --- |
| `trending retry ... (ClientError: 429 ...RESOURCE_EXHAUSTED)` | Gemini daily quota exhausted | Wait for the quota to reset (this can take hours) or use a billed / higher-quota key |
| `RuntimeError: GEMINI_API_KEY not set` | `.env` is missing the key, or the environment variable is empty | Add a `GEMINI_API_KEY=...` line with your key to `.env` |
| `push retry ...` / `Invalid user token` | `HF_TOKEN` missing, expired, revoked, or lacking Write scope | Generate a new token with Write permission and paste it into `.env` |
| `RuntimeError: failed to list HF repo ...` | Network or permission error reaching HuggingFace | Check connectivity and that the token can access the dataset |

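
The `... retry ...` log lines suggest that failed calls are retried before giving up. A generic retry loop with exponential backoff might look like this. It is only a sketch: `call_with_retry`, its parameters, and its log format are assumptions, not the retrievers' implementation.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"retry {attempt}/{attempts - 1} in {delay:.0f}s "
                  f"({type(exc).__name__}: {exc})")
            sleep(delay)
```

Retries of this kind help with transient network errors. They do not help with an exhausted daily quota or an invalid token, which is why the fixes in the table involve changing the key rather than rerunning.
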