hynky committed on
Commit 810bdb8 · verified · 1 Parent(s): 0a236e2

Update README.md

Files changed (1):
  1. README.md +34 -43

README.md CHANGED
@@ -1,20 +1,19 @@
-
  ---
  language:
  - en
  license: apache-2.0
  datasets:
- - HuggingFaceFW/finepdfs_fw_edu_labeled
  ---
 
- # FinePDFs-Edu classifier (English)
 
  ## Model summary
- This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 0 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_fw_edu_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.
 
- We used this classifier to build [FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu) dataset.
  ### How to use in transformers
- To load the FinePDFs-Edu classifier, use the following code:
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -83,42 +82,33 @@ print(max(scores))
  ```
 
  ## Training
- The classifier was trained on 11153120 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their educational quality with 0 being not educational and 5 being highly educational.
 
  Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
  ```
- Below is an extract from a PDF file. Evaluate whether the extract has a high educational
- value and could be useful in an educational setting for teaching from primary school to
- grade school levels using the additive 5-point scoring system described below. Points are
- accumulated based on the satisfaction of each criterion:
- - Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and
- promotional material.
- - Add another point if the extract addresses certain elements pertinent to education but
- does not align closely with educational standards. It might mix educational content with
- non-educational material, offering a superficial overview of potentially useful topics, or
- presenting information in a disorganized manner and incoherent writing style.
- - Award a third point if the extract is appropriate for educational use and introduces key
- concepts relevant to school curricula. It is coherent though it may not be comprehensive
- or could include some extraneous information. It may resemble an introductory section of
- a textbook or a basic tutorial that is suitable for learning but has notable limitations like
- treating concepts that are too complex for grade school students.
- - Grant a fourth point if the extract highly relevant and beneficial for educational purposes
- for a level not higher than grade school, exhibiting a clear and consistent writing style. It
- could be similar to a chapter from a textbook or a tutorial, offering substantial educational
- content, including exercises and solutions, with minimal irrelevant information, and the
- concepts aren’t too advanced for grade school students. The content is coherent, focused,
- and valuable for structured learning.
- - Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for
- teaching either at primary school or grade school. It follows detailed reasoning, the writing
- style is easy to follow and offers profound and thorough insights into the subject matter,
- devoid of any non-educational or complex content.
- The extract: {example}.
  After examining the extract:
- - Briefly justify your total score, up to 100 words.
- - Conclude with the score using the format: "Educational score: <total points>"\
  ```
 
- We added a classification head with a single regression output to answerdotai/ModernBERT-large, unroze the last 4 layers and trained the model for 5000 steps with a learning rate of 3e-4.
 
  **Training Details:**
 
@@ -131,7 +121,7 @@ We added a classification head with a single regression output to answerdotai/Mo
 
  **Classification report**
 
- We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 0 Qwen3-235B-A22B-Instruct-2507-annotated samples.
  ```
  Validation Report:
  | class | precision | recall | f1-score | support |
@@ -144,7 +134,7 @@ Validation Report:
 
  **Confusion matrix**
 
- We verify that the predicted educational scores are indeed close to their ground truth, and are mostry impacted by the noisy annotation.
  ```
  Confusion Matrix:
  | class | 0 | 1 | 2 | 3 |
@@ -157,11 +147,12 @@ Confusion Matrix:
 
 
  ## Limitations
- While the FinePDFs-Edu classifier performs well in distinguishing high-quality educational content for FinePDFs dataset, there are some limitations:
 
- - Scope: The model's performance might change for other datasets, in particular for out of distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
- - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using int_score >= 1.35 (top 10% for english) as a threshold for data curation.
- - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
 
  The training and inference code is available on GitHub
- https://github.com/huggingface/finepdfs/tree/main/classification
 
  ---
  language:
  - en
  license: apache-2.0
  datasets:
+ - HuggingFaceFW/finepdfs_eng_Latn_labeled
  ---
 
+ # FinePDFs-OCR-Quality classifier (English)
 
  ## Model summary
+ This is a classifier for judging the OCR extraction quality of PDF documents. It was developed to filter and curate well-extracted content from PDF datasets and was trained on 1304547
+ [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_eng_Latn_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for samples from the [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.
 
 
  ### How to use in transformers
+ To load the FinePDFs-OCR-Quality classifier, use the following code:
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
  ```
 
  ## Training
+ The classifier was trained on 1304547 pairs of web samples and their scores from 0 to 3, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their extraction quality, with 0 indicating the presence of garbage text and 3 indicating a clean extraction.
 
  Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
  ```
+ Below is an extract from a PDF file. Evaluate the quality of the document extraction using the 4-point scoring system described below. Select the single score that best represents the extraction quality level:
+ 
+ **Score 0: Garbage Text Present**
+ - Award 0 points if there are any garbage artifacts present in the text, regardless of how much legitimate content surrounds them. This includes OCR corruption like random character sequences (e.g., "7*/3./ +*/ 6- 4603"), unreadable symbol combinations, corrupted encoding artifacts, or any form of garbled text that renders portions of the document incomprehensible. Even if 90% of the text is perfectly readable, the presence of any garbage characters results in a score of 0.
+ 
+ **Score 1: Clear Formatting Issues**
+ - Award 1 point if there are no garbage characters but clear formatting problems are present. This includes broken mathematical equations or formulas that are unreadable, excessive or irregular spacing that disrupts readability, malformed tables or lists, severely corrupted line breaks, or other structural formatting issues that significantly impact the document's usability while keeping the text itself readable.
+ 
+ **Score 2: Minor Formatting Problems**
+ - Award 2 points if there are no garbage characters but minor formatting issues exist. This includes scattered extra spaces within words or sentences (e.g., "A t t h e S how"), inconsistent spacing, minor alignment issues, occasional broken line formatting, or small structural problems that don't severely impact readability but indicate imperfect extraction quality.
+ 
+ **Score 3: Clean Extraction**
+ - Award 3 points if there are no OCR garbage artifacts, no significant formatting issues, and the text extraction preserves the document's structure and readability effectively. The content should be clean, properly formatted, and easily readable with minimal to no extraction artifacts.
+ 
+ ## Evaluation Process
+ The extract: {example}
+ 
  After examining the extract:
+ - Briefly justify your score, focusing specifically on the presence of garbage text, formatting issues, and overall extraction quality, up to 100 words.
+ - Conclude with the score using the format: "Document extraction score: <total points>"\
  ```
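The prompt ends with a fixed sentinel line, which makes the annotation easy to recover from a raw completion. A minimal sketch of that step — the regex and the `parse_extraction_score` helper are ours for illustration, not part of the released pipeline:

```python
import re
from typing import Optional

# The prompt instructs the annotator model to finish with:
#   Document extraction score: <total points>
# This helper (hypothetical, for illustration) pulls out that integer.
SCORE_RE = re.compile(r"Document extraction score:\s*([0-3])")

def parse_extraction_score(completion: str) -> Optional[int]:
    """Return the 0-3 score from a completion, or None if no sentinel is found."""
    match = SCORE_RE.search(completion)
    return int(match.group(1)) if match else None

reply = "Minor spacing issues only, text fully readable. Document extraction score: 2"
print(parse_extraction_score(reply))  # -> 2
```

Completions that omit the sentinel would return `None` and could simply be dropped or re-queried.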
 
+ We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers and trained the model for 5000 steps with a learning rate of 3e-4.
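The partial-unfreezing setup can be sketched as below. The `freeze_all_but_last_n` helper and the toy module are ours for illustration; the exact attribute holding ModernBERT's transformer blocks (and the regression head, which would also stay trainable) should be checked against the released training code.

```python
from torch import nn

def freeze_all_but_last_n(model: nn.Module, layer_list: nn.ModuleList, n: int) -> None:
    """Freeze every parameter, then unfreeze the last n transformer blocks.

    `layer_list` is the stack of transformer blocks; for ModernBERT in
    transformers this is assumed to be reachable from the loaded model,
    but the attribute path is not confirmed here. In real training the
    classification head would be unfrozen as well.
    """
    for p in model.parameters():
        p.requires_grad = False
    for block in layer_list[-n:]:
        for p in block.parameters():
            p.requires_grad = True

# Toy stand-in with 8 "layers" to show the effect without downloading weights.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(8))

toy = Toy()
freeze_all_but_last_n(toy, toy.layers, 4)
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(trainable, total)  # -> 80 160 (only the last 4 of 8 blocks train)
```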
112
 
113
  **Training Details:**
114
 
 
121
 
122
  **Classification report**
123
 
124
+ We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 10000 Qwen3-235B-A22B-Instruct-2507-annotated samples.
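Turning the single regression output into these discrete classes is a clip-and-round. A small sketch with made-up numbers (not from the actual hold-out set):

```python
import numpy as np

# Continuous regression outputs are clipped to the label range [0, 3] and
# rounded to the nearest integer before computing per-class metrics.
def to_classes(preds: np.ndarray, lo: int = 0, hi: int = 3) -> np.ndarray:
    return np.clip(np.round(preds), lo, hi).astype(int)

# Toy batch: ground-truth labels vs. raw regression outputs.
y_true = np.array([0, 1, 2, 3, 3])
y_pred = to_classes(np.array([-0.2, 1.4, 2.6, 2.9, 3.7]))
print(y_pred.tolist())            # -> [0, 1, 3, 3, 3]
print((y_pred == y_true).mean())  # accuracy on this toy batch: 0.8
```

The discretized predictions can then be fed to any standard per-class precision/recall computation.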
  ```
  Validation Report:
  | class | precision | recall | f1-score | support |
 
 
  **Confusion matrix**
 
+ We verify that the predicted OCR quality scores are indeed close to their ground truth, and that deviations are mostly caused by the noisy annotation.
  ```
  Confusion Matrix:
  | class | 0 | 1 | 2 | 3 |
 
 
 
  ## Limitations
+ While the FinePDFs-OCR-Quality classifier performs well in distinguishing high-quality PDF extraction for the FinePDFs dataset, there are some limitations:
 
+ - Scope: The model evaluates OCR quality using the recognized text only. Its behavior can vary across languages, scripts, and formatting (tables, math, mixed inline code). It is tuned on common, printed materials and may be less reliable on handwriting-heavy documents, highly technical notation, or unconventional orthography.
+ - Bias: Performance depends on the representativeness of the text produced by the OCR pipeline and the data used to train/annotate the classifier. If training skewed toward clean, Latin-script outputs or specific OCR engines, the classifier may systematically favor those and under-score text from other scripts, noisy sources, or different OCR models.
+ - Context: The classifier scores individual pages/snippets of post-OCR text without access to the original images, layout, or broader document context. It does not model downstream usage (e.g., NER, search, or translation) and cannot recover layout fidelity, tables, or figures lost during OCR.
 
+ Thresholds / Recommendation: In our evaluations, applying classifier-score filtering provided no measurable downstream performance benefit. We therefore do not recommend using any score threshold for curation or routing.
 
  The training and inference code is available on GitHub
+ https://github.com/huggingface/finepdfs/tree/main/classification