fairdataihub/poster-sentry

@@ -102,10 +102,10 @@ Trained on **3,606 real documents** — zero synthetic data:
 | Class | Count | Source |
 |-------|-------|--------|
-| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
-| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
-Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
 Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

 | Class | Count | Source |
 |-------|-------|--------|
+| **Poster** | 1,803 | Repository-labeled posters from Zenodo & Figshare |
+| **Non-poster** | 1,803 | Manually confirmed non-posters (papers, proceedings, newsletters, abstract books) |
+Balanced subset sampled from 30,000+ PDFs scraped from Zenodo and Figshare. Poster samples are drawn from records whose uploaders tagged them as "poster" in repository metadata. Non-poster samples were flagged by a structural classifier and then manually confirmed. When PosterSentry was applied to the full corpus, ~20% of repository-labeled "posters" were reclassified as non-posters, indicating meaningful label noise in the source data.
 Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)