fairdataihub/poster-sentry

jimnoneill committed on Feb 12

Commit b9d33f4 · verified · 1 Parent(s): 635afb8

Upload README.md with huggingface_hub

Files changed (1)

README.md ADDED +83 -0

	@@ -0,0 +1,83 @@

+---
+license: mit
+language:
+  - en
+tags:
+  - document-classification
+  - scientific-posters
+  - multimodal
+  - model2vec
+  - poster-detection
+library_name: model2vec
+pipeline_tag: text-classification
+---
+# PosterSentry — Multimodal Scientific Poster Classifier
+PosterSentry classifies PDFs as **scientific posters** vs **non-posters** (papers, proceedings,
+abstracts, newsletters) using a multimodal approach that combines text embeddings with visual
+and structural features from the PDF.
+Part of the [posters.science](https://posters.science) initiative at
+[FAIR Data Innovations Hub](https://fairdataihub.org).
+## Architecture
+Three feature channels are concatenated into a 542-dimensional vector (512 + 15 + 15):
+| Channel | Features | Dimension | Signal |
+|---------|----------|-----------|--------|
+| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
+| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
+| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
+The concatenated vector is normalized with StandardScaler and classified by a single LogisticRegression model.
+## Performance
+| Metric | Value |
+|--------|-------|
+| Accuracy | **87.3%** |
+| F1 (poster) | 87.1% |
+| F1 (non-poster) | 87.4% |
+| Inference | ~300 docs/sec (CPU) |
+### Top Features by Importance
+1. `size_per_page_kb` (+7.65) — Posters are dense, high-res single pages
+2. `page_count` (-5.49) — More pages = not a poster
+3. `file_size_kb` (-5.44) — Multi-page docs are bigger overall
+4. `img_height` (+1.38) — Posters are large-format
+5. `color_diversity` (+0.95) — Posters are visually rich
+## Training Data
+Trained on **3,606 real documents** (zero synthetic data):
+- **1,803 verified scientific posters** from Zenodo & Figshare (sampled from a 28K+ corpus)
+- **1,803 verified non-posters** — multi-page papers, proceedings, newsletters
+See [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data).
+## Usage
+```python
+from poster_sentry import PosterSentry
+sentry = PosterSentry()
+sentry.initialize()
+result = sentry.classify("document.pdf")
+# Example result: {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
+```
+## Citation
+```bibtex
+@software{poster_sentry_2026,
+  title = {PosterSentry: Multimodal Scientific Poster Classifier},
+  author = {O'Neill, Jamey and {FAIR Data Innovations Hub}},
+  year = {2026},
+  url = {https://huggingface.co/fairdataihub/poster-sentry},
+  note = {Part of the posters.science initiative}
+}
+```
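
The Architecture section of the README above describes concatenating three feature channels into a 542-dimensional vector, normalizing it, and scoring it with logistic regression. The following is a minimal sketch of that inference path, assuming only the channel dimensions given in the table. It uses the Python standard library in place of scikit-learn's `StandardScaler` and `LogisticRegression`, and every function name and parameter value is hypothetical: the zeros and ones are placeholders, not the trained model's statistics or weights.

```python
import math

# Channel sizes taken from the Architecture table.
TEXT_DIM = 512        # model2vec text embedding
VISUAL_DIM = 15       # visual layout features
STRUCTURAL_DIM = 15   # PDF geometry features
TOTAL_DIM = TEXT_DIM + VISUAL_DIM + STRUCTURAL_DIM  # 542


def concat_channels(text, visual, structural):
    """Join the three channels into one flat feature vector (hypothetical helper)."""
    dims = (len(text), len(visual), len(structural))
    if dims != (TEXT_DIM, VISUAL_DIM, STRUCTURAL_DIM):
        raise ValueError(f"unexpected channel dimensions: {dims}")
    return [*text, *visual, *structural]


def standardize(x, mean, scale):
    """Per-feature (x - mean) / scale, the transform a standard scaler applies."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, scale)]


def poster_probability(x, weights, bias):
    """Logistic regression score sigmoid(w . x + b), computed in a stable form."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Placeholder inputs: NOT real features or fitted parameters.
text = [0.0] * TEXT_DIM
visual = [0.5] * VISUAL_DIM
structural = [1.0] * STRUCTURAL_DIM

features = concat_channels(text, visual, structural)
mean = [0.0] * TOTAL_DIM      # assumed fitted means
scale = [1.0] * TOTAL_DIM     # assumed fitted standard deviations
weights = [0.0] * TOTAL_DIM   # assumed learned coefficients
bias = 0.0

prob = poster_probability(standardize(features, mean, scale), weights, bias)
print(len(features), prob)  # prints "542 0.5"
```

With all-zero weights the score is 0.5 by construction. A trained model would supply the fitted means, scales, weights, and bias, and the signed values listed under "Top Features by Importance" would plausibly correspond to entries of such a weight vector, though the README does not state this.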