# Repository Guidelines
## Project Structure & Module Organization
- `extractor.py` contains PDF rendering, template selection, and OpenAI calls.
- `templates/` holds JSON extraction templates referenced by `TEMPLATE_REGISTRY`.
- `src/streamlit_app.py` is the Hugging Face Space UI entrypoint.
- `Dockerfile` builds the Space image (Streamlit on port 8501).
- `.streamlit/config.toml` contains Space-friendly Streamlit server settings.
- `README.md` includes Space metadata front matter and usage notes.
- The UI relies on filename keywords to select templates (see `TEMPLATE_REGISTRY`).
- Sample PDFs are fetched from the HF dataset set by `SAMPLE_DATASET_REPO`.
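The keyword-based template selection described above can be sketched roughly as follows. This is an illustration only: `EXAMPLE_REGISTRY`, its entries, and `select_template` are hypothetical names, and the real `TEMPLATE_REGISTRY` in `extractor.py` may have a different shape or matching rule.

```python
from pathlib import Path
from typing import Optional

# Hypothetical registry: lowercase filename keyword -> template file.
# The actual TEMPLATE_REGISTRY in extractor.py may be structured differently.
EXAMPLE_REGISTRY: dict[str, str] = {
  "invoice": "templates/invoice.json",
  "bank_statement": "templates/bank_statement.json",
}


def select_template(pdf_name: str, registry: dict[str, str]) -> Optional[str]:
  """Return the template path for the first keyword found in the filename."""
  stem = Path(pdf_name).stem.lower()
  for keyword, template_path in registry.items():
    if keyword in stem:
      return template_path
  return None  # no keyword matched; the caller decides how to handle this
```

Under these assumptions, `select_template("ACME_Invoice_2024.pdf", EXAMPLE_REGISTRY)` resolves to the invoice template, while a filename with no registered keyword yields `None`.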
## Build, Test, and Development Commands
- Install dependencies with `python -m pip install -r requirements.txt`.
- Local CLI extraction prompts for a PDF path and prints JSON:
  - `python extractor.py`
- Run the Space UI locally:
  - `streamlit run src/streamlit_app.py`
- Quick import sanity check:
  - `python -c "import extractor; print(extractor.DEFAULT_MODEL)"`
## Coding Style & Naming Conventions
- Keep 2-space indentation in `extractor.py`.
- Use snake_case for functions/variables, UPPER_SNAKE for constants, and add type hints for new functions.
- Template JSON filenames should be snake_case and registered via lowercase filename keywords in `TEMPLATE_REGISTRY`.
## Testing Guidelines
- No automated test suite exists yet. If adding tests, use `pytest` under `tests/`.
- Validate that model output matches the exact template schema and that filename keywords map to the right template.
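If tests are added, a schema check along these lines could live under `tests/`. It is a minimal sketch that assumes "matches the exact template schema" means the output has the same keys as the template at every nesting level. `same_schema` and the sample data are hypothetical and not part of the repository.

```python
from typing import Any


def same_schema(template: Any, output: Any) -> bool:
  """Return True if output has exactly the template's keys, recursively."""
  if isinstance(template, dict):
    if not isinstance(output, dict) or set(template) != set(output):
      return False
    return all(same_schema(template[k], output[k]) for k in template)
  if isinstance(template, list):
    if not isinstance(output, list):
      return False
    # An empty template list places no constraint on the items.
    return not template or all(same_schema(template[0], item) for item in output)
  return True  # leaf values are not type-checked in this sketch


def test_output_matches_template() -> None:
  # Made-up template and outputs, for illustration only.
  template = {"vendor": "", "items": [{"name": "", "price": ""}]}
  good = {"vendor": "ACME", "items": [{"name": "bolt", "price": "1.20"}]}
  bad = {"vendor": "ACME", "items": [{"name": "bolt"}]}  # missing "price"
  assert same_schema(template, good)
  assert not same_schema(template, bad)
```

A real test would load a file from `templates/` and compare it against recorded model output rather than inline dictionaries.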
## Commit & Pull Request Guidelines
- No established commit convention; use short, imperative subjects.
- PRs should include the document type, template files touched, example filename keyword, and any config/env changes.
## Security & Configuration Tips
- Set `OPENAI_API_KEY` for local runs and the Space; optionally override `EXTRACTOR_MODEL_ALIAS`.
- Avoid committing sensitive PDFs or output data; use redacted samples for demos.
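The environment settings above could be read with a fail-fast helper like the one below. This is a sketch under assumptions: `load_settings` and the `"example-model"` placeholder are hypothetical, and how `extractor.py` actually resolves `EXTRACTOR_MODEL_ALIAS` against `DEFAULT_MODEL` is not shown here.

```python
import os
from typing import Mapping


def load_settings(
  env: Mapping[str, str] = os.environ,
  default_model: str = "example-model",  # placeholder, not the real DEFAULT_MODEL
) -> dict[str, str]:
  """Read the API key and optional model override, failing early if the key is missing."""
  api_key = env.get("OPENAI_API_KEY", "").strip()
  if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")
  # An empty or unset override falls back to the default model.
  model = env.get("EXTRACTOR_MODEL_ALIAS", "").strip() or default_model
  return {"api_key": api_key, "model": model}
```

Passing the environment as a parameter keeps the helper testable without touching real secrets.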
## Automation
- `.github/workflows/sync-hf.yml` pushes `main` to the HF Space on each commit using `HF_TOKEN`.
- Treat GitHub as the source of truth; direct edits on HF may be overwritten.
- The workflow force-pushes a fresh snapshot rather than the full history, because legacy binaries in that history would be blocked on push.