Spaces:
Running
Running
Repository Guidelines
Project Structure & Module Organization
extractor.pycontains PDF rendering, template selection, and OpenAI calls.templates/holds JSON extraction templates referenced byTEMPLATE_REGISTRY.src/streamlit_app.pyis the Hugging Face Space UI entrypoint.Dockerfilebuilds the Space image (Streamlit on port 8501)..streamlit/config.tomlcontains Space-friendly Streamlit server settings.README.mdincludes Space metadata front matter and usage notes.- The UI relies on filename keywords to select templates (see
TEMPLATE_REGISTRY). - Sample PDFs are fetched from the HF dataset set by
SAMPLE_DATASET_REPO.
Build, Test, and Development Commands
- Install dependencies with
python -m pip install -r requirements.txt. - Local CLI extraction prompts for a PDF path and prints JSON:
python extractor.py
- Run the Space UI locally:
streamlit run src/streamlit_app.py
- Quick import sanity check:
python -c "import extractor; print(extractor.DEFAULT_MODEL)"
Coding Style & Naming Conventions
- Keep 2-space indentation in
extractor.py. - Use snake_case for functions/variables, UPPER_SNAKE for constants, and add type hints for new functions.
- Template JSON filenames should be snake_case and registered via lowercase filename keywords in
TEMPLATE_REGISTRY.
Testing Guidelines
- No automated test suite exists yet. If adding tests, use
pytestundertests/. - Validate that model output matches the exact template schema and that filename keywords map to the right template.
Commit & Pull Request Guidelines
- No established commit convention; use short, imperative subjects.
- PRs should include the document type, template files touched, example filename keyword, and any config/env changes.
Security & Configuration Tips
- Set
OPENAI_API_KEYfor local runs and the Space; optionally overrideEXTRACTOR_MODEL_ALIAS. - Avoid committing sensitive PDFs or output data; use redacted samples for demos.
Automation
.github/workflows/sync-hf.ymlpushesmainto the HF Space on each commit usingHF_TOKEN.- Treat GitHub as the source of truth; direct edits on HF may be overwritten.
- The workflow force-pushes a fresh snapshot to avoid blocked legacy binaries in history.