pdf-extractor / AGENTS.md
github-actions[bot]
Sync from GitHub
8e52fc5

Repository Guidelines

Project Structure & Module Organization

  • extractor.py contains PDF rendering, template selection, and OpenAI calls.
  • templates/ holds JSON extraction templates referenced by TEMPLATE_REGISTRY.
  • src/streamlit_app.py is the Hugging Face Space UI entrypoint.
  • Dockerfile builds the Space image (Streamlit on port 8501).
  • .streamlit/config.toml contains Space-friendly Streamlit server settings.
  • README.md includes Space metadata front matter and usage notes.
  • The UI relies on filename keywords to select templates (see TEMPLATE_REGISTRY).
  • Sample PDFs are fetched from the HF dataset set by SAMPLE_DATASET_REPO.

Build, Test, and Development Commands

  • Install dependencies with python -m pip install -r requirements.txt.
  • Local CLI extraction prompts for a PDF path and prints JSON:
    • python extractor.py
  • Run the Space UI locally:
    • streamlit run src/streamlit_app.py
  • Quick import sanity check:
    • python -c "import extractor; print(extractor.DEFAULT_MODEL)"

Coding Style & Naming Conventions

  • Keep 2-space indentation in extractor.py.
  • Use snake_case for functions/variables, UPPER_SNAKE for constants, and add type hints for new functions.
  • Template JSON filenames should be snake_case and registered via lowercase filename keywords in TEMPLATE_REGISTRY.

Testing Guidelines

  • No automated test suite exists yet. If adding tests, use pytest under tests/.
  • Validate that model output matches the exact template schema and that filename keywords map to the right template.

Commit & Pull Request Guidelines

  • No established commit convention; use short, imperative subjects.
  • PRs should include the document type, template files touched, example filename keyword, and any config/env changes.

Security & Configuration Tips

  • Set OPENAI_API_KEY for local runs and the Space; optionally override EXTRACTOR_MODEL_ALIAS.
  • Avoid committing sensitive PDFs or output data; use redacted samples for demos.

Automation

  • .github/workflows/sync-hf.yml pushes main to the HF Space on each commit using HF_TOKEN.
  • Treat GitHub as the source of truth; direct edits on HF may be overwritten.
  • The workflow force-pushes a fresh snapshot to avoid blocked legacy binaries in history.