	---
	license: mit
	language:
	- en
	tags:
	- document-classification
	- scientific-posters
	- multimodal
	- model2vec
	- poster-detection
	- machine-actionable
	- FAIR-data
	- posters-science
	- quality-control
	library_name: model2vec
	pipeline_tag: text-classification
	thumbnail: PosterSentry.png
	---

	<div align="center">
	<img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
	</div>

	# PosterSentry — Multimodal Scientific Poster Classifier

	## Model Description

	PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a scientific poster or a non-poster (paper, proceedings, newsletter, abstract book, etc.).

	It is part of the quality control pipeline for [posters.science](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

	It was developed by the [FAIR Data Innovations Hub](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

	## Related Models & Tools

	| Resource | Description | Link |
	|----------|-------------|------|
	| PosterSentry | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
	| Llama-3.1-8B-Poster-Extraction | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
	| poster2json | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
	| poster-json-schema | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
	| Platform | posters.science | [posters.science](https://posters.science) |

	### Pipeline Position

	PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs so that only posters reach the expensive Llama-based extraction step:

	```
	PDF Input
	    │
	    ▼
	┌──────────────┐     ┌────────────────────────────────┐     ┌──────────────┐
	│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction │ ──► │ poster2json  │
	│  (classify)  │     │ (extract structured metadata)  │     │  (validate)  │
	└──────────────┘     └────────────────────────────────┘     └──────────────┘
	   poster? ✓               raw text → JSON schema             FAIR output
	```

	## Architecture

	Three feature channels are concatenated into a 542-dimensional vector (512 + 15 + 15) and fed to a single logistic regression classifier:

	| Channel | Features | Dimension | Signal |
	|---------|----------|-----------|--------|
	| Text | model2vec (potion-base-32M) embedding | 512 | Semantic content |
	| Visual | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
	| Structural | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
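To make the visual channel more concrete, here is a minimal sketch of three statistics of the kind listed in the table (whitespace, edge density, FFT spatial complexity). The function name, the thresholds, and the exact definitions are assumptions for illustration only; the model card does not specify how PosterSentry computes its 15 visual features.

```python
import numpy as np


def visual_features_sketch(gray, white_threshold=0.95, edge_threshold=0.1):
    """Illustrative page-image statistics (hypothetical definitions).

    `gray` is a 2-D array of pixel intensities scaled to [0, 1].
    """
    # Whitespace: fraction of near-white pixels.
    whitespace_ratio = float((gray >= white_threshold).mean())

    # Edge density: fraction of pixels with a large gradient magnitude.
    grad_y, grad_x = np.gradient(gray)
    edge_density = float((np.hypot(grad_x, grad_y) > edge_threshold).mean())

    # FFT spatial complexity: share of spectral energy outside a
    # low-frequency window centred on the zero-frequency component.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(h // 8, 1), max(w // 8, 1)
    low = spectrum[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1].sum()
    total = spectrum.sum()
    high_freq_ratio = float(1.0 - low / total) if total > 0 else 0.0

    return {
        "whitespace_ratio": whitespace_ratio,
        "edge_density": edge_density,
        "high_freq_ratio": high_freq_ratio,
    }


# A blank page versus a noisy, visually busy page (synthetic stand-ins).
blank = visual_features_sketch(np.ones((64, 64)))
busy = visual_features_sketch(np.random.default_rng(0).random((64, 64)))
print(blank)
print(busy)
```

In this sketch the blank page scores as all whitespace with no edges, while the noisy page scores high on edge density and high-frequency energy.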

	The classifier head is a single linear layer stored as a NumPy `.npz` file (10 KB). Inference is pure NumPy; PyTorch is not required at prediction time.
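The inference path described above can be sketched in a few lines of NumPy: concatenate the three channels, standardize, apply one linear layer, and pass the result through a sigmoid. This is a minimal sketch under stated assumptions: the `.npz` key names (`mean`, `scale`, `coef`, `intercept`) and both helper functions are hypothetical, and the weights are random stand-ins rather than the released head.

```python
import io

import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15
TOTAL_DIM = TEXT_DIM + VISUAL_DIM + STRUCT_DIM  # 542


def build_feature_vector(text_emb, visual, structural):
    """Concatenate the three channels into one float32 vector."""
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    if x.shape != (TOTAL_DIM,):
        raise ValueError(f"expected {TOTAL_DIM} features, got {x.shape}")
    return x


def predict_poster_proba(x, head):
    """Standardize, apply one linear layer, then squash with a sigmoid."""
    z = (x - head["mean"]) / head["scale"]
    logit = float(z @ head["coef"] + head["intercept"])
    return float(1.0 / (1.0 + np.exp(-logit)))


# Write and reload a stand-in head (hypothetical keys, random weights).
rng = np.random.default_rng(0)
buffer = io.BytesIO()
np.savez(
    buffer,
    mean=np.zeros(TOTAL_DIM, dtype=np.float32),
    scale=np.ones(TOTAL_DIM, dtype=np.float32),
    coef=rng.normal(scale=0.01, size=TOTAL_DIM).astype(np.float32),
    intercept=np.float32(0.0),
)
buffer.seek(0)
head = np.load(buffer)

x = build_feature_vector(
    rng.normal(size=TEXT_DIM),
    rng.normal(size=VISUAL_DIM),
    rng.normal(size=STRUCT_DIM),
)
proba = predict_poster_proba(x, head)
print(f"P(poster) = {proba:.3f}")
```

Because the head reduces to a mean, a scale, a weight vector, and an intercept, prediction needs nothing beyond array arithmetic.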

	## Performance

	Validated on 3,606 real scientific documents:

	| Metric | Value |
	|--------|-------|
	| Accuracy | 87.3% |
	| F1 (poster) | 87.1% |
	| F1 (non-poster) | 87.4% |
	| Precision (poster) | 88.2% |
	| Recall (poster) | 85.9% |
	| Inference speed | ~300 docs/sec (CPU) |
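As a quick consistency check, the poster F1 score can be recomputed from the precision and recall in the table, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is only for illustration.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above (poster class).
f1_poster = f1_from_precision_recall(0.882, 0.859)
print(f"F1 (poster) = {f1_poster:.3f}")
```

The result is about 0.870, which agrees with the reported 87.1% to within the rounding of the published precision and recall.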

	### Top Features by Importance

	| Rank | Feature | Coefficient | Signal |
	|------|---------|-------------|--------|
	| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
	| 2 | `page_count` | -5.49 | More pages = not a poster |
	| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
	| 4 | `img_height` | +1.38 | Posters are large-format |
	| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
	| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
	| 7 | `is_landscape` | +0.98 | Some posters are landscape |
	| 8 | `color_diversity` | +0.95 | Posters are visually rich |
	| 9 | `edge_density` | +0.79 | More visual edges in posters |
	| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
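In a logistic regression, each feature contributes its coefficient times its input value to the log-odds of the poster class. The sketch below illustrates this with the coefficients from the table. It assumes the coefficients apply to standardized inputs (the model pairs the classifier with a StandardScaler), uses made-up z-scores for a hypothetical document, and ignores the intercept and the remaining 532 coefficients, which are not listed here.

```python
# Coefficients copied from the table above.
TOP_COEFFICIENTS = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "page_height_pt": 1.38,
    "avg_font_size": -1.10,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
    "text_block_count": 0.75,
}


def logit_contributions(z_scores):
    """Per-feature contribution to the log-odds: coefficient * z-score."""
    return {name: TOP_COEFFICIENTS[name] * z for name, z in z_scores.items()}


# Hypothetical document: one standard deviation above the mean in size per
# page, half a standard deviation below in page count, half above in font size.
example = {"size_per_page_kb": 1.0, "page_count": -0.5, "avg_font_size": 0.5}
contributions = logit_contributions(example)
partial_logit = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>18}: {value:+.3f}")
print(f"partial log-odds from these features: {partial_logit:+.3f}")
```

A negative coefficient combined with a below-average value still pushes toward "poster", which is why a low page count counts in a document's favor.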

	## Training Data

	Trained on 3,606 real documents — zero synthetic data:

	| Class | Count | Source |
	|-------|-------|--------|
	| Poster | 1,803 | Verified scientific posters from Zenodo & Figshare |
	| Non-poster | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

	Sampled from the [posters.science](https://posters.science) corpus of 30,000+ classified PDFs (28,111 posters, 2,036 non-posters from Zenodo and Figshare).

	Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)
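Drawing a balanced set from a corpus this imbalanced (28,111 posters versus 2,036 non-posters) amounts to sampling the same number of items from each class. The following is a minimal sketch of that idea, not the project's actual sampling code; the function name, the seed, and the placeholder file names are assumptions.

```python
import random


def sample_balanced(posters, non_posters, n_per_class, seed=42):
    """Draw the same number of items from each class without replacement."""
    rng = random.Random(seed)
    # Never ask for more items than the smaller class can supply.
    n = min(n_per_class, len(posters), len(non_posters))
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Placeholder file lists sized like the corpus described above.
all_posters = [f"poster_{i}.pdf" for i in range(28_111)]
all_non_posters = [f"document_{i}.pdf" for i in range(2_036)]

train_posters, train_non_posters = sample_balanced(
    all_posters, all_non_posters, n_per_class=1_803
)
print(len(train_posters), len(train_non_posters))
```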

	## Usage

	### Python API

	```python
	from poster_sentry import PosterSentry

	sentry = PosterSentry()
	sentry.initialize()

	# Classify a PDF (uses text + visual + structural features)
	result = sentry.classify("document.pdf")
	# `result` is a dict such as:
	# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
	print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")

	# Batch classification
	results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
	```
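When PosterSentry is used as a screening step, the batch results usually need to be filtered before anything is passed to extraction. The sketch below assumes that `classify_batch` returns a list of dicts with the same keys as the single-document result shown above; that return shape, the `select_posters` helper, and the 0.9 threshold are assumptions, and the results here are hand-written stand-ins rather than real model output.

```python
def select_posters(results, min_confidence=0.9):
    """Keep the paths of documents classified as posters with high confidence."""
    return [
        r["path"]
        for r in results
        if r["is_poster"] and r["confidence"] >= min_confidence
    ]


# Stand-in results shaped like the dict returned by `classify`.
results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.95, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]

accepted = select_posters(results)
print(accepted)  # ['poster1.pdf']
```

Low-confidence positives are dropped here; routing them to manual review instead would be an equally reasonable design.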

	### Installation

	```bash
	pip install git+https://github.com/fairdataihub/poster-repo-qc.git

	# Or install from source
	git clone https://github.com/fairdataihub/poster-repo-qc.git
	cd poster-repo-qc
	pip install -e ".[train]"
	```

	### Training

	```bash
	python scripts/train_poster_sentry.py --n-per-class 2000
	```

	Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

	## Model Specifications

	| Attribute | Value |
	|-----------|-------|
	| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
	| Embedding dimension | 512 |
	| Visual features | 15 (color, edge, FFT, whitespace) |
	| Structural features | 15 (page geometry, fonts, text blocks) |
	| Total input dimension | 542 |
	| Classifier | LogisticRegression (sklearn) + StandardScaler |
	| Head file size | 10 KB (.npz) |
	| Numeric precision | float32 |
	| GPU required | No (CPU-only) |
	| License | MIT |

	## System Requirements

	- CPU: Any modern CPU (no GPU needed)
	- RAM: ≥ 4 GB
	- Python: ≥3.10
	- Dependencies: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

	## Citation

	```bibtex
	@software{poster_sentry_2026,
	  title  = {PosterSentry: Multimodal Scientific Poster Classifier},
	  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
	  year   = {2026},
	  url    = {https://huggingface.co/fairdataihub/poster-sentry},
	  note   = {Part of the posters.science initiative}
	}
	```

	## License

	This model is released under the [MIT License](https://opensource.org/licenses/MIT).

	## Acknowledgments

	- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
	- [posters.science](https://posters.science) platform
	- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
	- HuggingFace for model hosting infrastructure
	- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)) — "Poster Sharing and Discovery Made Easy"