jimnoneill committed on
Commit 9633129 · 1 Parent(s): e95d35a

Soften training data verbiage: posters are repository-labeled, not verified

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -102,10 +102,10 @@ Trained on **3,606 real documents** — zero synthetic data:
 
  | Class | Count | Source |
  |-------|-------|--------|
- | **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
- | **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
+ | **Poster** | 1,803 | Repository-labeled posters from Zenodo & Figshare |
+ | **Non-poster** | 1,803 | Manually confirmed non-posters (papers, proceedings, newsletters, abstract books) |
 
- Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
+ Balanced subset sampled from 30,000+ PDFs scraped from Zenodo and Figshare. Poster samples are drawn from records whose uploaders tagged them as "poster" in repository metadata. Non-poster samples were flagged by a structural classifier and then manually confirmed. When PosterSentry was applied to the full corpus, ~20% of repository-labeled "posters" were reclassified as non-posters, indicating meaningful label noise in the source data.
 
  Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)
 
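
The "balanced subset" wording in the updated README text could be illustrated with a minimal sketch like the one below. It is not the project's actual sampling code: the `balanced_subset` helper, the record layout (`id`, `label`), and the toy corpus sizes are all assumptions made for illustration. It only shows the general idea of drawing the same number of records from each class.

```python
import random


def balanced_subset(records, per_class, seed=0):
    """Draw `per_class` records from every label (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group records by their label.
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["label"], []).append(rec)

    subset = []
    for label, group in sorted(by_label.items()):
        if len(group) < per_class:
            raise ValueError(f"only {len(group)} records labeled {label!r}")
        # Sample without replacement within each class.
        subset.extend(rng.sample(group, per_class))
    return subset


# Toy corpus (made-up sizes): many "poster" records, few "non-poster" records.
corpus = (
    [{"id": f"p{i}", "label": "poster"} for i in range(50)]
    + [{"id": f"n{i}", "label": "non-poster"} for i in range(8)]
)

subset = balanced_subset(corpus, per_class=8)
counts = {}
for rec in subset:
    counts[rec["label"]] = counts.get(rec["label"], 0) + 1
print(counts)
```

In this sketch the smaller class caps the size of each draw, which is one plausible reason the README table lists equal counts (1,803 and 1,803) for both classes.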