Quickstart — Paper Espresso Retrievers
This directory contains the three data retrievers. All of them call the Gemini API (for summaries / trending) and read & write HuggingFace datasets, so both keys must be configured before running anything.
1. Configure keys in .env
Create a file named .env in the project root (one level above src/,
i.e. Paper_Espresso/.env). The retrievers load it automatically on import.
GEMINI_API_KEY=AIza............................
HF_TOKEN=hf_..................................
Rules:
- One KEY=value per line. No quotes, no spaces around =, no trailing spaces/newlines.
- GEMINI_API_KEY: from https://aistudio.google.com/apikey. The free tier is limited (~250 requests/day for gemini-3.1-pro); batch runs over a date range will exhaust it quickly. Use a billed / higher-quota key for bulk work.
- HF_TOKEN: from https://huggingface.co/settings/tokens. It must have Write permission on the target datasets. Reading public datasets may succeed even with a bad token, but pushing will fail.
Both are also read from real environment variables if set; the environment
takes precedence over .env.
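The precedence rule above can be illustrated with a minimal sketch. This is not the retrievers' actual loader: `parse_env_text` and `load_env_file` are hypothetical names, and skipping `#` comment lines is an assumption that the rules above do not state.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines; skip blanks, '#' lines, and lines without '='."""
    pairs = {}
    for line in text.splitlines():
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        pairs[key] = value
    return pairs


def load_env_file(path: Path, environ=os.environ) -> None:
    """Copy values from the file into environ without overriding existing keys."""
    if not path.is_file():
        return
    for key, value in parse_env_text(path.read_text()).items():
        # setdefault keeps a value already present, so the real environment wins
        environ.setdefault(key, value)
```

Because values are not stripped or unquoted here, a stray space or quote in .env would end up inside the key, which is why the rules above forbid them.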
Note: Missing or invalid keys now raise a hard error (RuntimeError: GEMINI_API_KEY not set / HF_TOKEN not set, or an HF auth error) instead of failing silently. There is no offline mode: a valid HF_TOKEN is required even to read the cache. Use --no-push for a dry run that still needs the keys but does not write back to HuggingFace.
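A fail-fast check of this kind might look like the sketch below. `require_key` is a hypothetical helper, not a function from the retrievers; only the error message format is taken from the note above.

```python
import os


def require_key(name: str, environ=os.environ) -> str:
    """Return the value of a required key, or raise if it is missing or empty."""
    value = environ.get(name, "").strip()
    if not value:
        # Same message shape as the error quoted in the note above
        raise RuntimeError(f"{name} not set")
    return value
```

Calling `require_key("GEMINI_API_KEY")` and `require_key("HF_TOKEN")` at startup would surface a missing key before any API request is made.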
Verify the keys work:
# Gemini
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('GEMINI_API_KEY=')];from google import genai;print(genai.Client(api_key=os.environ['GEMINI_API_KEY']).models.generate_content(model='gemini-3.1-pro-preview',contents='ping').text)"
# HuggingFace
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('HF_TOKEN=')];from huggingface_hub import HfApi;print(HfApi(token=os.environ['HF_TOKEN']).whoami()['name'])"
2. Run the three retrievers
Run all commands from the project root. uv resolves dependencies
automatically.
Daily — daily_retrieve.py
Fetches the HuggingFace daily papers list, summarizes each paper with Gemini,
generates a daily trending digest, and pushes to Elfsong/hf_paper_summary
and Elfsong/hf_paper_daily_trending.
uv run python src/daily_retrieve.py # yesterday
uv run python src/daily_retrieve.py --date 2026-03-25 # single day
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31 # date range (inclusive)
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31 --workers 4
uv run python src/daily_retrieve.py --date 2026-03-25 --recollect # force re-summarize, ignore cache
uv run python src/daily_retrieve.py --date 2026-03-25 --no-push # dry run, don't push to HF
| Flag | Meaning |
|---|---|
| --date | Start date YYYY-MM-DD (default: yesterday, UTC) |
| --end | End date YYYY-MM-DD, inclusive; enables range mode |
| --workers | Parallel workers for a range (default: 1) |
| --recollect | Re-summarize all papers, ignoring local/HF cache |
| --no-push | Skip pushing to HuggingFace |
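To make the date semantics concrete, here is a sketch of how these flags could be parsed and expanded into a list of days. It mirrors the table only; `parse_args` and `resolve_dates` are hypothetical names, and the real daily_retrieve.py may be organized differently.

```python
import argparse
from datetime import date, datetime, timedelta, timezone


def parse_args(argv=None):
    """Parse the flags listed in the table above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", type=date.fromisoformat, default=None)
    parser.add_argument("--end", type=date.fromisoformat, default=None)
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--recollect", action="store_true")
    parser.add_argument("--no-push", action="store_true")
    return parser.parse_args(argv)


def resolve_dates(start, end, today=None):
    """Return every day from start to end, inclusive.

    start defaults to yesterday (UTC); end defaults to start (single day).
    """
    today = today or datetime.now(timezone.utc).date()
    start = start or today - timedelta(days=1)
    end = end or start
    if end < start:
        raise ValueError("--end must not be earlier than --date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

For example, `resolve_dates(date(2026, 3, 1), date(2026, 3, 31))` yields 31 days, which is also a rough lower bound on the number of trending-digest requests a full-month run would make.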
Monthly — monthly_retrieve.py
Aggregates a month of daily data already on HuggingFace into monthly summaries / trending / insights. Run after the daily retriever has populated the month.
uv run python src/monthly_retrieve.py # last completed month
uv run python src/monthly_retrieve.py --month 2026-03 # specific month
uv run python src/monthly_retrieve.py --month 2026-02 --no-push # dry run
| Flag | Meaning |
|---|---|
| --month | Month YYYY-MM (default: last completed month, UTC) |
| --no-push | Skip pushing results to HuggingFace |
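The "last completed month" default can be computed as in the sketch below. `last_completed_month` is a hypothetical helper shown only to pin down the meaning of the default; it is not taken from monthly_retrieve.py.

```python
from datetime import date, datetime, timezone


def last_completed_month(today=None) -> str:
    """Return the month before the current one (UTC) as YYYY-MM."""
    today = today or datetime.now(timezone.utc).date()
    if today.month == 1:
        # January rolls back to December of the previous year
        year, month = today.year - 1, 12
    else:
        year, month = today.year, today.month - 1
    return f"{year:04d}-{month:02d}"
```

Under this reading, a run on any day in April 2026 would default to 2026-03.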
Lifecycle — lifecycle_retrieve.py
Computes bimonthly Gartner-style hype-cycle snapshots for research topics
from all available paper data, and pushes to Elfsong/hf_paper_lifecycle.
uv run python src/lifecycle_retrieve.py # latest bimonthly snapshot
uv run python src/lifecycle_retrieve.py --snapshot 2025-06 # specific snapshot (even month)
uv run python src/lifecycle_retrieve.py --all # all missing snapshots
uv run python src/lifecycle_retrieve.py --snapshot 2025-06 --force # overwrite existing
uv run python src/lifecycle_retrieve.py --all --no-push # dry run
| Flag | Meaning |
|---|---|
| --snapshot | Snapshot month YYYY-MM (even month; default: latest) |
| --all | Compute all missing bimonthly snapshots |
| --force | Re-compute and overwrite existing snapshots |
| --no-push | Skip pushing results to HuggingFace |
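The even-month rule and the idea of "missing" snapshots can be sketched as follows. All three function names are hypothetical, and how lifecycle_retrieve.py actually picks the first and latest snapshot is not shown here; the range endpoints are passed in explicitly for that reason.

```python
def parse_snapshot(label: str) -> tuple[int, int]:
    """Split a YYYY-MM label and check that the month is even."""
    year_text, month_text = label.split("-")
    year, month = int(year_text), int(month_text)
    if not 1 <= month <= 12 or month % 2 != 0:
        raise ValueError(f"snapshot month must be even, got {label}")
    return year, month


def snapshots_between(first: str, last: str) -> list[str]:
    """List every bimonthly (even-month) label from first to last, inclusive."""
    year, month = parse_snapshot(first)
    end = parse_snapshot(last)
    labels = []
    while (year, month) <= end:
        labels.append(f"{year:04d}-{month:02d}")
        month += 2
        if month > 12:  # December wraps to February of the next year
            year, month = year + 1, 2
    return labels


def missing_snapshots(first: str, last: str, existing: set[str]) -> list[str]:
    """Labels in the range that are not already present (the --all case)."""
    return [s for s in snapshots_between(first, last) if s not in existing]
```

In this sketch, --force would correspond to ignoring `existing` and recomputing the requested label regardless.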
Typical pipeline order
daily_retrieve.py -> per-day summaries + trending (run first, per day)
monthly_retrieve.py -> monthly aggregation (after a month is filled)
lifecycle_retrieve.py -> bimonthly hype-cycle snapshots (after enough history)
Troubleshooting
| Symptom (log line) | Cause | Fix |
|---|---|---|
| trending retry ... (ClientError: 429 ...RESOURCE_EXHAUSTED) | Gemini daily quota exhausted | Wait for reset (~hours) or use a billed / higher-quota key |
| RuntimeError: GEMINI_API_KEY not set | .env missing the key or env var empty | Add GEMINI_API_KEY= to .env |
| push retry ... / Invalid user token | HF_TOKEN missing, expired, revoked, or no Write scope | Regenerate a Write token and paste it into .env |
| RuntimeError: failed to list HF repo ... | Network/permission error reaching HuggingFace | Check connectivity and that the token can access the dataset |
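The retry lines in the table suggest that failed calls are attempted more than once. The retrievers' own retry logic is not shown in this README, so the following is only a generic sketch of retry with exponential backoff; `call_with_retry` and its parameters are hypothetical.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, then 2x, 4x, ... and try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Backoff helps with transient errors such as short rate-limit windows or network hiccups. It does not help once a daily quota is exhausted, which is why the fix listed for the 429 case is to wait for the reset or switch keys.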