Spaces:
Running
Running
| # Repository Guidelines | |
| ## Project Structure & Module Organization | |
| - `extractor.py` contains PDF rendering, template selection, and OpenAI calls. | |
| - `templates/` holds JSON extraction templates referenced by `TEMPLATE_REGISTRY`. | |
| - `src/streamlit_app.py` is the Hugging Face Space UI entrypoint. | |
| - `Dockerfile` builds the Space image (Streamlit on port 8501). | |
| - `.streamlit/config.toml` contains Space-friendly Streamlit server settings. | |
| - `README.md` includes Space metadata front matter and usage notes. | |
| - The UI relies on filename keywords to select templates (see `TEMPLATE_REGISTRY`). | |
| - Sample PDFs are fetched from the HF dataset set by `SAMPLE_DATASET_REPO`. | |
| ## Build, Test, and Development Commands | |
| - Install dependencies with `python -m pip install -r requirements.txt`. | |
| - Local CLI extraction prompts for a PDF path and prints JSON: | |
| - `python extractor.py` | |
| - Run the Space UI locally: | |
| - `streamlit run src/streamlit_app.py` | |
| - Quick import sanity check: | |
| - `python -c "import extractor; print(extractor.DEFAULT_MODEL)"` | |
| ## Coding Style & Naming Conventions | |
| - Keep 2-space indentation in `extractor.py`. | |
| - Use snake_case for functions/variables, UPPER_SNAKE for constants, and add type hints for new functions. | |
| - Template JSON filenames should be snake_case and registered via lowercase filename keywords in `TEMPLATE_REGISTRY`. | |
| ## Testing Guidelines | |
| - No automated test suite exists yet. If adding tests, use `pytest` under `tests/`. | |
| - Validate that model output matches the exact template schema and that filename keywords map to the right template. | |
| ## Commit & Pull Request Guidelines | |
| - No established commit convention; use short, imperative subjects. | |
| - PRs should include the document type, template files touched, example filename keyword, and any config/env changes. | |
| ## Security & Configuration Tips | |
| - Set `OPENAI_API_KEY` for local runs and the Space; optionally override `EXTRACTOR_MODEL_ALIAS`. | |
| - Avoid committing sensitive PDFs or output data; use redacted samples for demos. | |
| ## Automation | |
| - `.github/workflows/sync-hf.yml` pushes `main` to the HF Space on each commit using `HF_TOKEN`. | |
| - Treat GitHub as the source of truth; direct edits on HF may be overwritten. | |
| - The workflow force-pushes a fresh snapshot to avoid blocked legacy binaries in history. | |