# Quickstart — Paper Espresso Retrievers

This directory contains the three data retrievers. All of them call the
**Gemini API** (for summaries / trending) and read & write **HuggingFace**
datasets, so both keys must be configured before running anything.

---

## 1. Configure keys in `.env`

Create a file named `.env` in the **project root** (one level above `src/`,
i.e. `Paper_Espresso/.env`). The retrievers load it automatically on import.

```dotenv
GEMINI_API_KEY=AIza............................
HF_TOKEN=hf_..................................
```

Rules:

- One `KEY=value` per line. **No quotes, no spaces around `=`, no trailing
  spaces/newlines.**
- `GEMINI_API_KEY` — from <https://aistudio.google.com/apikey>. The free tier
  is limited (~250 requests/day for `gemini-3.1-pro`); batch runs over a date
  range will exhaust it quickly. Use a billed / higher-quota key for bulk work.
- `HF_TOKEN` — from <https://huggingface.co/settings/tokens>. It **must have
  Write permission** on the target datasets. Reading public datasets may
  succeed even with a bad token, but pushing will fail.

Both keys can also be supplied as regular environment variables. If a key is
set in both places, the environment value takes precedence over `.env`.
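
The rules above are easier to follow with a loader in mind. The sketch below is only an illustration, not the retrievers' actual loader: `parse_dotenv` and `apply_dotenv` are hypothetical names, and it assumes a naive parser that keeps each value verbatim, so quotes or stray spaces would end up inside the key. Variables that are already set are never overwritten.

```python
from collections.abc import MutableMapping


def parse_dotenv(text: str) -> dict:
    """Parse KEY=value lines verbatim; skip blank lines and '#' comments."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Kept as-is: quotes or surrounding spaces would become part of the value.
        pairs[key] = value
    return pairs


def apply_dotenv(text: str, environ: MutableMapping) -> None:
    """Fill in keys from .env text without overwriting variables already set."""
    for key, value in parse_dotenv(text).items():
        environ.setdefault(key, value)
```

Under these assumptions, `apply_dotenv(Path(".env").read_text(), os.environ)` run from the project root would give an exported `HF_TOKEN` priority over the one in the file.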

> **Note:** Missing or invalid keys raise a hard error
> (`RuntimeError: GEMINI_API_KEY not set` / `HF_TOKEN not set`, or an HF auth
> error) instead of failing silently. There is no offline mode: a valid
> `HF_TOKEN` is required even to read the cache. `--no-push` gives a dry run
> that does not write back to HuggingFace but still needs both keys.
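
A fail-fast check of this kind might look like the following sketch. `require_key` is a hypothetical helper, not the retrievers' code; only the error text is taken from the note above.

```python
import os


def require_key(name: str, environ=os.environ) -> str:
    """Return the named key, or raise instead of continuing without it."""
    value = environ.get(name, "").strip()
    if not value:
        # Same message shape as the errors quoted in the note above.
        raise RuntimeError(f"{name} not set")
    return value
```

Calling `require_key("GEMINI_API_KEY")` and `require_key("HF_TOKEN")` once at startup surfaces a missing key before any API call is made.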

Verify the keys work (run from the project root, where `.env` lives):

```bash
# Gemini
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('GEMINI_API_KEY=')];from google import genai;print(genai.Client(api_key=os.environ['GEMINI_API_KEY']).models.generate_content(model='gemini-3.1-pro-preview',contents='ping').text)"

# HuggingFace
uv run python -c "import os;from pathlib import Path;[os.environ.__setitem__(*l.split('=',1)) for l in Path('.env').read_text().splitlines() if l.startswith('HF_TOKEN=')];from huggingface_hub import HfApi;print(HfApi(token=os.environ['HF_TOKEN']).whoami()['name'])"
```

---

## 2. Run the three retrievers

Run all commands from the **project root**. `uv` resolves dependencies
automatically.

### Daily — `daily_retrieve.py`

Fetches the HuggingFace daily papers list, summarizes each paper with Gemini,
generates a daily trending digest, and pushes to `Elfsong/hf_paper_summary`
and `Elfsong/hf_paper_daily_trending`.

```bash
uv run python src/daily_retrieve.py                                      # yesterday
uv run python src/daily_retrieve.py --date 2026-03-25                    # single day
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31   # date range (inclusive)
uv run python src/daily_retrieve.py --date 2026-03-01 --end 2026-03-31 --workers 4
uv run python src/daily_retrieve.py --date 2026-03-25 --recollect        # force re-summarize, ignore cache
uv run python src/daily_retrieve.py --date 2026-03-25 --no-push          # dry run, don't push to HF
```

| Flag          | Meaning                                              |
| ------------- | ---------------------------------------------------- |
| `--date`      | Start date `YYYY-MM-DD` (default: yesterday, UTC)    |
| `--end`       | End date `YYYY-MM-DD`, inclusive; enables range mode |
| `--workers`   | Parallel workers for a range (default: 1)            |
| `--recollect` | Re-summarize all papers, ignoring the local/HF cache |
| `--no-push`   | Skip pushing to HuggingFace                          |
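
The date semantics in the table (default of yesterday in UTC, inclusive end date) can be sketched as follows. `default_day` and `days_in_range` are hypothetical names used for illustration; how `daily_retrieve.py` computes these internally is not shown in this document.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List, Optional


def default_day(now: Optional[datetime] = None) -> date:
    """Yesterday's date in UTC, mirroring the documented --date default."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).date() - timedelta(days=1)


def days_in_range(start: date, end: Optional[date] = None) -> List[date]:
    """Every day from start to end, with both endpoints included."""
    end = end or start
    if end < start:
        raise ValueError("--end must not be earlier than --date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

With these definitions, `--date 2026-03-01 --end 2026-03-31` corresponds to 31 days, so a range run makes roughly 31 times the Gemini calls of a single day.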

### Monthly — `monthly_retrieve.py`

Aggregates a month of daily data already on HuggingFace into monthly
summaries / trending / insights. Run **after** the daily retriever has
populated the month.

```bash
uv run python src/monthly_retrieve.py                       # last completed month
uv run python src/monthly_retrieve.py --month 2026-03        # specific month
uv run python src/monthly_retrieve.py --month 2026-02 --no-push   # dry run
```

| Flag        | Meaning                                              |
| ----------- | ---------------------------------------------------- |
| `--month`   | Month `YYYY-MM` (default: last completed month, UTC) |
| `--no-push` | Skip pushing results to HuggingFace                  |
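
As an illustration of the `--month` default, the last completed month in UTC can be computed like this. `last_completed_month` is a hypothetical helper, not necessarily how `monthly_retrieve.py` does it.

```python
from datetime import datetime, timezone
from typing import Optional


def last_completed_month(now: Optional[datetime] = None) -> str:
    """Return the month before the current UTC month as 'YYYY-MM'."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    year, month = now.year, now.month - 1
    if month == 0:  # January rolls back to December of the previous year
        year, month = year - 1, 12
    return f"{year:04d}-{month:02d}"
```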

### Lifecycle — `lifecycle_retrieve.py`

Computes bimonthly Gartner-style hype-cycle snapshots for research topics
from all available paper data, and pushes to `Elfsong/hf_paper_lifecycle`.

```bash
uv run python src/lifecycle_retrieve.py                      # latest bimonthly snapshot
uv run python src/lifecycle_retrieve.py --snapshot 2025-06   # specific snapshot (even month)
uv run python src/lifecycle_retrieve.py --all                # all missing snapshots
uv run python src/lifecycle_retrieve.py --snapshot 2025-06 --force      # overwrite existing
uv run python src/lifecycle_retrieve.py --all --no-push      # dry run
```

| Flag         | Meaning                                                   |
| ------------ | --------------------------------------------------------- |
| `--snapshot` | Snapshot month `YYYY-MM` (even month; default: latest)    |
| `--all`      | Compute all missing bimonthly snapshots                   |
| `--force`    | Re-compute and overwrite existing snapshots               |
| `--no-push`  | Skip pushing results to HuggingFace                       |
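
To make the even-month rule concrete, here is a small sketch that validates a snapshot label and lists the snapshot months in a span. It assumes "even month" means the calendar months 02, 04, ..., 12; `is_snapshot_month` and `snapshot_months` are hypothetical names, not part of `lifecycle_retrieve.py`.

```python
def is_snapshot_month(label: str) -> bool:
    """True if a 'YYYY-MM' label falls on an even calendar month."""
    _, month = label.split("-")
    return 1 <= int(month) <= 12 and int(month) % 2 == 0


def snapshot_months(first: str, last: str) -> list:
    """Even months between two 'YYYY-MM' labels, both ends included."""
    year, month = (int(part) for part in first.split("-"))
    end = tuple(int(part) for part in last.split("-"))
    months = []
    while (year, month) <= end:
        if month % 2 == 0:
            months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return months
```

Under that assumption, `--snapshot 2025-06` is a valid label while `2025-07` is not.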

---

## Typical pipeline order

```
daily_retrieve.py      ->  per-day summaries + trending  (run first, per day)
monthly_retrieve.py    ->  monthly aggregation           (after a month is filled)
lifecycle_retrieve.py  ->  bimonthly hype-cycle snapshots (after enough history)
```
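
One way to chain the three steps is a small driver that builds the commands in order and stops at the first failure. This is a sketch only: `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the argument values passed in are up to the caller. It uses only the flags documented above.

```python
import subprocess


def pipeline_commands(start, end, month, snapshot, push=True):
    """Build the three retriever commands in pipeline order."""
    base = ["uv", "run", "python"]
    extra = [] if push else ["--no-push"]
    return [
        base + ["src/daily_retrieve.py", "--date", start, "--end", end] + extra,
        base + ["src/monthly_retrieve.py", "--month", month] + extra,
        base + ["src/lifecycle_retrieve.py", "--snapshot", snapshot] + extra,
    ]


def run_pipeline(commands):
    """Run each command from the project root; abort on the first error."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Passing `push=False` appends `--no-push` to every step, which is a convenient way to rehearse a full run.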

## Troubleshooting

| Symptom (log line)                                            | Cause                                                   | Fix                                                                             |
| ------------------------------------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------------------- |
| `trending retry ... (ClientError: 429 ...RESOURCE_EXHAUSTED)` | Gemini daily quota exhausted                            | Wait for the quota to reset (can take hours) or use a billed / higher-quota key |
| `RuntimeError: GEMINI_API_KEY not set`                        | `.env` lacks the key, or the env var is empty           | Add `GEMINI_API_KEY=<your key>` to `.env` or export it                          |
| `push retry ...` / `Invalid user token`                       | `HF_TOKEN` missing, expired, revoked, or no Write scope | Regenerate a Write token and paste it into `.env`                               |
| `RuntimeError: failed to list HF repo ...`                    | Network/permission error reaching HuggingFace           | Check connectivity and that the token can access the dataset                    |
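
When scanning long logs, the table can be turned into a simple lookup. The sketch below is not shipped with the retrievers: `HINTS` and `diagnose` are hypothetical names, and the matching is a plain substring test against the symptoms listed above.

```python
HINTS = [
    ("RESOURCE_EXHAUSTED", "Gemini quota exhausted: wait for the reset or use a higher-quota key."),
    ("GEMINI_API_KEY not set", "Add GEMINI_API_KEY to .env or export it."),
    ("Invalid user token", "Regenerate a Write token and update HF_TOKEN in .env."),
    ("push retry", "Regenerate a Write token and update HF_TOKEN in .env."),
    ("failed to list HF repo", "Check connectivity and that the token can access the dataset."),
]


def diagnose(log_line: str):
    """Return the first matching hint for a log line, or None if nothing matches."""
    for needle, hint in HINTS:
        if needle in log_line:
            return hint
    return None
```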