
Quickstart — Paper Espresso Retrievers

This directory contains the three data retrievers. All three call the Gemini API (for summaries and trending digests) and read and write HuggingFace datasets, so both keys must be configured before you run anything.

1. Configure keys in `.env`

Create a file named .env in the project root (one level above src/, i.e. Paper_Espresso/.env). The retrievers load it automatically on import.

GEMINI_API_KEY=AIza............................
HF_TOKEN=hf_..................................

Rules:

- One KEY=value per line. No quotes, no spaces around =, no trailing whitespace.
- GEMINI_API_KEY: get it from https://aistudio.google.com/apikey. The free tier is limited (~250 requests/day for gemini-3.1-pro), so batch runs over a date range will exhaust it quickly. Use a billed / higher-quota key for bulk work.
- HF_TOKEN: get it from https://huggingface.co/settings/tokens. It must have Write permission on the target datasets. Reading public datasets may succeed even with a bad token, but pushing will fail.
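The formatting rules above can be checked mechanically before a run. The following is a minimal sketch, not part of the retriever scripts: `lint_env_line` is a hypothetical helper, and treating blank lines and `#` comments as acceptable is an assumption about how the file is parsed.

```python
def lint_env_line(line: str) -> list[str]:
    """Return the rule violations found in one .env line (empty list if none)."""
    problems = []
    if line != line.rstrip():
        problems.append("trailing whitespace")
    stripped = line.strip()
    # Assumption: blank lines and comment lines are skipped by the loader.
    if not stripped or stripped.startswith("#"):
        return problems
    if "=" not in stripped:
        problems.append("missing '='")
        return problems
    key, value = stripped.split("=", 1)
    if key != key.strip() or value != value.strip():
        problems.append("spaces around '='")
    value = value.strip()
    if value[:1] in {'"', "'"} or value[-1:] in {'"', "'"}:
        problems.append("quoted value")
    return problems
```

Running it over `Path(".env").read_text().splitlines()` and printing any non-empty result gives a quick pre-flight check.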

Both are also read from real environment variables if set; the environment takes precedence over .env.
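The precedence rule can be illustrated with a small loader. This is only a sketch of the behavior described above, not the retrievers' actual loading code; `load_env_file` and its `environ` parameter are hypothetical names introduced for the example.

```python
import os
from pathlib import Path


def load_env_file(path, environ=None):
    """Copy KEY=value pairs from a .env file into environ, keeping existing keys."""
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.is_file():
        return environ
    for line in env_path.read_text().splitlines():
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault leaves a value that is already in the environment untouched,
        # which is what "the environment takes precedence" means here.
        environ.setdefault(key, value)
    return environ
```

With `HF_TOKEN` exported in the shell and also present in .env, the exported value is the one that survives.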

Note: Missing or invalid keys now raise a hard error (RuntimeError: GEMINI_API_KEY not set / HF_TOKEN not set, or an HF auth error) instead of failing silently. There is no offline mode — a valid HF_TOKEN is required even to read the cache. Use --no-push for a dry run that still needs the keys but does not write back to HuggingFace.
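A fail-fast check of this kind might look like the sketch below. `require_env` is a hypothetical helper written for illustration; only the error message format is taken from the note above, and treating a whitespace-only value as missing is an assumption.

```python
import os


def require_env(name, environ=None):
    """Return the value of a required variable or raise RuntimeError if it is unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        # Assumption: an empty or whitespace-only value counts as "not set".
        raise RuntimeError(f"{name} not set")
    return value
```

Calling `require_env("GEMINI_API_KEY")` and `require_env("HF_TOKEN")` at startup surfaces a missing key immediately instead of partway through a run.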

Verify the keys work (run from the project root so that .env is found):

# Gemini
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('GEMINI_API_KEY=')];from google import genai;print(genai.Client(api_key=os.environ['GEMINI_API_KEY']).models.generate_content(model='gemini-3.1-pro-preview',contents='ping').text)"

# HuggingFace
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('HF_TOKEN=')];from huggingface_hub import HfApi;print(HfApi(token=os.environ['HF_TOKEN']).whoami()['name'])"

2. Run the three retrievers

Run all commands from the project root. uv resolves dependencies automatically.

Daily — `daily_retrieve.py`

Fetches the HuggingFace daily papers list, summarizes each paper with Gemini, generates a daily trending digest, and pushes to Elfsong/hf_paper_summary and Elfsong/hf_paper_daily_trending.

uv run python src/daily_retrieve.py                                      # yesterday
uv run python src/daily_retrieve.py --date 2026-03-25                    # single day
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31   # date range (inclusive)
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31 --workers 4
uv run python src/daily_retrieve.py --date 2026-03-25 --recollect        # force re-summarize, ignore cache
uv run python src/daily_retrieve.py --date 2026-03-25 --no-push          # dry run, don't push to HF

Flag	Meaning
`--date`	Start date `YYYY-MM-DD` (default: yesterday, UTC)
`--end`	End date `YYYY-MM-DD`, inclusive — enables range mode
`--workers`	Parallel workers for a range (default: 1)
`--recollect`	Re-summarize all papers, ignoring local/HF cache
`--no-push`	Skip pushing to HuggingFace
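The date semantics in the table (default of yesterday in UTC, and an inclusive `--end`) can be sketched as follows. The function names are hypothetical and the script's own implementation may differ; rejecting an end date earlier than the start date is an assumption.

```python
from datetime import date, datetime, timedelta, timezone


def default_date(now=None):
    """Yesterday's date in UTC, the documented default for --date."""
    now = now or datetime.now(timezone.utc)
    return (now.astimezone(timezone.utc) - timedelta(days=1)).date()


def inclusive_dates(start, end=None):
    """Expand --date / --end (YYYY-MM-DD strings) into a list of dates, end included."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end) if end else first
    if last < first:
        # Assumption: a reversed range is treated as a usage error.
        raise ValueError("--end must not be earlier than --date")
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]
```

For `--date 2026-03-01 --end 2026-03-31` this yields 31 dates, one per day processed.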

Monthly — `monthly_retrieve.py`

Aggregates a month of daily data already on HuggingFace into monthly summaries / trending / insights. Run after the daily retriever has populated the month.

uv run python src/monthly_retrieve.py                       # last completed month
uv run python src/monthly_retrieve.py --month 2026-03        # specific month
uv run python src/monthly_retrieve.py --month 2026-02 --no-push   # dry run

Flag	Meaning
`--month`	Month `YYYY-MM` (default: last completed month, UTC)
`--no-push`	Skip pushing results to HuggingFace
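The `--month` default, "last completed month, UTC", amounts to stepping back one calendar month from today. A minimal sketch, with `last_completed_month` as a hypothetical name rather than the script's own function:

```python
from datetime import date, datetime, timezone


def last_completed_month(today=None):
    """Return the month before today's month as a YYYY-MM string (UTC by default)."""
    today = today or datetime.now(timezone.utc).date()
    if today.month > 1:
        year, month = today.year, today.month - 1
    else:
        # January rolls back to December of the previous year.
        year, month = today.year - 1, 12
    return f"{year:04d}-{month:02d}"
```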

Lifecycle — `lifecycle_retrieve.py`

Computes bimonthly (every two months, anchored on even months) Gartner-style hype-cycle snapshots for research topics from all available paper data, and pushes to Elfsong/hf_paper_lifecycle.

uv run python src/lifecycle_retrieve.py                      # latest bimonthly snapshot
uv run python src/lifecycle_retrieve.py --snapshot 2025-06   # specific snapshot (even month)
uv run python src/lifecycle_retrieve.py --all                # all missing snapshots
uv run python src/lifecycle_retrieve.py --snapshot 2025-06 --force      # overwrite existing
uv run python src/lifecycle_retrieve.py --all --no-push      # dry run

Flag	Meaning
`--snapshot`	Snapshot month `YYYY-MM` (even month; default: latest)
`--all`	Compute all missing bimonthly snapshots
`--force`	Re-compute and overwrite existing snapshots
`--no-push`	Skip pushing results to HuggingFace
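The even-month constraint on `--snapshot` can be sketched as below. Both helpers are hypothetical and only illustrate the labelling scheme; how the script decides which snapshots are "missing" or "latest" is not shown here.

```python
def parse_snapshot(value):
    """Parse a YYYY-MM snapshot label and reject odd or out-of-range months."""
    year_text, month_text = value.split("-")
    year, month = int(year_text), int(month_text)
    if not 1 <= month <= 12 or month % 2 != 0:
        raise ValueError(f"snapshot month must be even, got {value!r}")
    return year, month


def snapshot_months(first, last):
    """List every even-month label from first to last, inclusive."""
    year, month = parse_snapshot(first)
    end = parse_snapshot(last)
    labels = []
    while (year, month) <= end:
        labels.append(f"{year:04d}-{month:02d}")
        month += 2
        if month > 12:
            # After December, continue with February of the next year.
            year, month = year + 1, 2
    return labels
```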

Typical pipeline order

daily_retrieve.py      ->  per-day summaries + trending  (run first, per day)
monthly_retrieve.py    ->  monthly aggregation           (after a month is filled)
lifecycle_retrieve.py  ->  bimonthly hype-cycle snapshots (after enough history)
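The ordering above can be captured in a small driver. This sketch only builds the command lines, using flags documented in this README; `pipeline_commands` is a hypothetical helper, and running the lifecycle step with `--all` is one possible choice rather than a requirement.

```python
def pipeline_commands(start, end, month, push=True):
    """Build the three retriever invocations in the order they should run."""
    base = ["uv", "run", "python"]
    extra = [] if push else ["--no-push"]
    return [
        # 1. Fill in the daily data for the whole range.
        base + ["src/daily_retrieve.py", "--date", start, "--end", end] + extra,
        # 2. Aggregate the month once the daily data exists.
        base + ["src/monthly_retrieve.py", "--month", month] + extra,
        # 3. Compute any missing lifecycle snapshots.
        base + ["src/lifecycle_retrieve.py", "--all"] + extra,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the project root; `check=True` stops the pipeline at the first failing step.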

Troubleshooting

Symptom (log line)	Cause	Fix
`trending retry ... (ClientError: 429 ...RESOURCE_EXHAUSTED)`	Gemini daily quota exhausted	Wait for the daily quota to reset (this can take hours) or switch to a billed / higher-quota key
`RuntimeError: GEMINI_API_KEY not set`	`.env` is missing the key, or the environment variable is empty	Set `GEMINI_API_KEY=<your key>` in `.env` or in the environment
`push retry ...` / `Invalid user token`	`HF_TOKEN` missing, expired, revoked, or no Write scope	Regenerate a Write token and paste it into `.env`
`RuntimeError: failed to list HF repo ...`	Network/permission error reaching HuggingFace	Check connectivity and that the token can access the dataset
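When scanning a long log, the table above can be turned into a simple lookup. This is an illustrative sketch only: `diagnose` and `HINTS` are hypothetical, and the substrings are the ones quoted in this README, so real log lines may be worded differently.

```python
# (marker substring, suggested fix), checked in order.
HINTS = [
    ("RESOURCE_EXHAUSTED", "Gemini quota exhausted: wait for the reset or use a higher-quota key"),
    ("GEMINI_API_KEY not set", "Set GEMINI_API_KEY in .env or in the environment"),
    ("HF_TOKEN not set", "Set HF_TOKEN in .env or in the environment"),
    ("Invalid user token", "Regenerate a Write token and paste it into .env"),
    ("push retry", "Regenerate a Write token and paste it into .env"),
    ("failed to list HF repo", "Check connectivity and that the token can access the dataset"),
]


def diagnose(log_line):
    """Return the first matching hint for a log line, or None if nothing matches."""
    for marker, hint in HINTS:
        if marker in log_line:
            return hint
    return None
```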
