Update README.md
README.md
CHANGED
|
@@ -7,4 +7,50 @@ sdk: static
|
|
| 7 |
pinned: false
|
| 8 |
---
|
| 9 |
|
| 10 |
-
+
# IfGPT Dataset Quality Components
|
| 11 |
+
|
| 12 |
+
## Objectives of the IfGPT Project
|
| 13 |
+
The **IfGPT Quality Pipeline** is developed within the project **IfGPT: Infrastructure for Fine-tuning Pre-trained Large Language Models**, which aims to establish a freely accessible infrastructure for selecting and pre-processing large Bulgarian datasets and industry-specific data, and for fine-tuning suitable freely available large language models for specific purposes.
|
| 14 |
+
|
| 15 |
+
## IfGPT Dataset Quality Pipeline
|
| 16 |
+
|
| 17 |
+
A modular Java pipeline that processes new text documents and adds them to the IfGPT Dataset,
|
| 18 |
+
performing cleaning, deduplication, and quality evaluation of Bulgarian texts.
|
| 19 |
+
The pipeline includes:
|
| 20 |
+
|
| 21 |
+
- **Source-specific extraction** — metadata and plain text extracted from heterogeneous
|
| 22 |
+
corpora (MARCELL, CURLICAT, BulNC, Wikipedia), each handled by a dedicated class
|
| 23 |
+
that implements the shared `SourceProcessor` interface and extends `BaseSourceProcessor`
|
| 24 |
+
- **Sentence splitting** — `BulgarianSentenceSplitter` wraps the Apache OpenNLP Bulgarian
|
| 25 |
+
UD sentence-detection model, splitting every document into a sentence-per-line sidecar
|
| 26 |
+
file used by all downstream stages
|
| 27 |
+
- **Boilerplate cleaning** — `FileCleanProcessor` learns site-specific boilerplate from a
|
| 28 |
+
sample directory (lines appearing in ≥ 50 % of files) and removes them alongside
|
| 29 |
+
hardcoded patterns for HTML tags, navigation menus, URLs, cookie banners, and more
|
| 30 |
+
- **MinHash / LSH deduplication** — `DeduplicationProcessor` builds a MinHash signature
|
| 31 |
+
index over the full existing corpus and detects near-duplicate sentences in the new
|
| 32 |
+
batch (Jaccard ≥ 0.90), writing a ranked TSV report and optionally removing duplicates
|
| 33 |
+
- **Per-sentence PII scoring** — `PIIDetector` runs every sentence through the Phileas
|
| 34 |
+
engine (names, emails, phone numbers, IBANs, IP addresses, etc.) and stores the
|
| 35 |
+
proportion of flagged tokens as the `PersonallyIdentifiableInformation` vector in metadata
|
| 36 |
+
- **Per-sentence bias scoring** — `BiasAnalyser` matches tokens against `BiasLexicon`
|
| 37 |
+
(3 787-entry Bulgarian Bias Dictionary v4) to detect signal–evaluator pairs across five
|
| 38 |
+
categories (gender, race/ethnicity, religion, disability, appearance), storing the
|
| 39 |
+
per-sentence coverage ratio as the `BiasedInformation` vector in metadata
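
The boilerplate-learning step in the list above (lines that appear in at least 50 % of the sample files) could be sketched roughly as follows. This is a hypothetical illustration and not the actual `FileCleanProcessor` code: the class name `BoilerplateSketch`, its method names, and the use of in-memory line lists instead of a sample directory are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not the project's API.
final class BoilerplateSketch {

    // Returns the trimmed lines that occur in at least `threshold` (e.g. 0.5)
    // of the sample documents.
    static Set<String> learn(List<List<String>> sampleDocs, double threshold) {
        Map<String, Integer> docFrequency = new HashMap<>();
        for (List<String> doc : sampleDocs) {
            // Count each distinct non-empty line once per document.
            Set<String> distinct = new HashSet<>();
            for (String line : doc) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    distinct.add(trimmed);
                }
            }
            for (String line : distinct) {
                docFrequency.merge(line, 1, Integer::sum);
            }
        }
        Set<String> boilerplate = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
            if (entry.getValue() >= threshold * sampleDocs.size()) {
                boilerplate.add(entry.getKey());
            }
        }
        return boilerplate;
    }

    // Drops every line of `doc` whose trimmed form was learned as boilerplate.
    static List<String> clean(List<String> doc, Set<String> boilerplate) {
        List<String> kept = new ArrayList<>();
        for (String line : doc) {
            if (!boilerplate.contains(line.trim())) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Counting document frequency rather than raw line frequency keeps a line that is repeated many times inside a single file from being mistaken for site-wide boilerplate.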
|
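
The near-duplicate check described in the list above (Jaccard ≥ 0.90) rests on MinHash signatures, which the following minimal sketch illustrates. It is not the `DeduplicationProcessor` implementation: the README does not state the shingling scheme, the number of hash functions, or the LSH banding, so this example assumes word-level tokens and 128 hash functions, and it compares signatures pairwise without any LSH index. The class name `MinHashSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of MinHash similarity estimation; parameters are assumptions.
final class MinHashSketch {
    private static final long PRIME = Integer.MAX_VALUE; // 2^31 - 1, a Mersenne prime
    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    // Lower-cased, whitespace-separated tokens as a set (assumed tokenisation).
    static Set<String> tokens(String sentence) {
        Set<String> result = new HashSet<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // For each hash function, keep the minimum hash value over all tokens.
    long[] signature(String sentence) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String token : tokens(sentence)) {
            long x = token.hashCode() & 0x7fffffffL; // non-negative 31-bit value
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of matching signature positions, an estimate of Jaccard similarity.
    static double estimate(long[] s1, long[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }

    // Exact Jaccard similarity over token sets, for comparison with the estimate.
    static double exactJaccard(String s1, String s2) {
        Set<String> intersection = new HashSet<>(tokens(s1));
        intersection.retainAll(tokens(s2));
        Set<String> union = new HashSet<>(tokens(s1));
        union.addAll(tokens(s2));
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}
```

Under these assumptions, a new sentence would be reported as a near-duplicate when `estimate(...)` against an indexed signature reaches 0.90. A production index would typically add LSH banding so that only candidate pairs are compared.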
| 40 |
+
|
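
The per-sentence PII and bias vectors from the list above both reduce to the same computation: the share of flagged tokens in each sentence. The sketch below shows that computation only. It does not call the Phileas engine or the `BiasLexicon`; a simple e-mail pattern stands in for the detector, and the class name `SentenceScoreSketch`, the method `ratioVector`, and the whitespace tokenisation are assumptions.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; the real detectors (Phileas, BiasLexicon) are not used here.
final class SentenceScoreSketch {

    // Stand-in detector: flags tokens that look like e-mail addresses.
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Returns one value per sentence: flagged tokens divided by total tokens.
    static double[] ratioVector(List<String> sentences, Predicate<String> isFlagged) {
        double[] vector = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            int total = 0;
            int flagged = 0;
            for (String token : sentences.get(i).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty sentence yields one empty token
                }
                total++;
                if (isFlagged.test(token)) {
                    flagged++;
                }
            }
            vector[i] = total == 0 ? 0.0 : (double) flagged / total;
        }
        return vector;
    }
}
```

For example, `ratioVector(List.of("write to ivan@example.com today"), t -> EMAIL.matcher(t).matches())` flags one of four tokens and yields `0.25` for that sentence.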
| 41 |
+
The full schema is enforced by `DocumentMetadata` (15 mandatory + 8 optional fields) and
|
| 42 |
+
the complete flow is managed by `IfGPTPipeline`, with `IfGPTDatasetProcessor` as the
|
| 43 |
+
main entry point.
|
| 44 |
+
|
| 45 |
+
```
|
| 46 |
+
source processors → sentence split → clean → deduplication → PII → bias → counts → final structuring
|
| 47 |
+
```
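
The stage order shown above can be pictured as a chain of functions applied in sequence. The following sketch is an assumption about the general shape of such a chain, not the `IfGPTPipeline` code: `PipelineSketch`, `then`, `run`, and the two toy stages are hypothetical names, and each stage is simplified to a function from a list of sentences to a list of sentences.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a sequential stage chain; not the project's pipeline class.
final class PipelineSketch {
    private final List<UnaryOperator<List<String>>> stages = new ArrayList<>();

    // Registers a stage; stages run in the order they were added.
    PipelineSketch then(UnaryOperator<List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Feeds the output of each stage into the next one.
    List<String> run(List<String> sentences) {
        List<String> current = sentences;
        for (UnaryOperator<List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Toy cleaning stage: removes blank sentences.
    static List<String> dropBlank(List<String> input) {
        List<String> kept = new ArrayList<>();
        for (String sentence : input) {
            if (!sentence.trim().isEmpty()) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    // Toy deduplication stage: removes exact duplicates, keeping first occurrences.
    static List<String> dropExactDuplicates(List<String> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }
}
```

A chain is then assembled as `new PipelineSketch().then(PipelineSketch::dropBlank).then(PipelineSketch::dropExactDuplicates).run(sentences)`. Keeping every stage behind the same function type makes it easy to reorder, skip, or test stages in isolation.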
|
| 48 |
+
|
| 49 |
+
## License
|
| 50 |
+
|
| 51 |
+
Creative Commons Attribution 4.0 International (CC-BY-4.0)
|
| 52 |
+
__________________________________________
|
| 53 |
+
|
| 54 |
+
This work is part of the project **Infrastructure for Fine-tuning Pre-trained Large Language Models**, Grant Agreement No. ПВУ – 55 of 12.12.2024 /BG-RRP-2.017-0030-C01/.
|
| 55 |
+
|
| 56 |
+
https://ifgpt.dcl.bas.bg/en/
|