dcl-ibl-bas committed on
Commit
4ae0bcb
·
verified ·
1 Parent(s): 49849a9

Update README.md

  1. README.md +47 -1
README.md CHANGED
@@ -7,4 +7,50 @@ sdk: static
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # IfGPT Dataset Quality Components
+
+ ## Objectives of the IfGPT project
+ The **IfGPT Quality Pipeline** is developed within the project **IfGPT: Infrastructure for Fine-tuning Pre-trained Large Language Models**, which aims to establish a freely accessible infrastructure for selecting and pre-processing large datasets for Bulgarian, as well as tailored data for particular industries, and for fine-tuning suitable freely available large language models for specific purposes.
+
+ ## IfGPT Dataset Quality Pipeline
+
+ A modular Java pipeline that processes new text documents and adds them to the IfGPT Dataset,
+ covering cleaning, deduplication, and quality evaluation of Bulgarian texts.
+ The pipeline includes:
+
+ - **Source-specific extraction**: metadata and plain text are extracted from heterogeneous
+ corpora (MARCELL, CURLICAT, BulNC, Wikipedia), each handled by a dedicated class
+ that implements the shared `SourceProcessor` interface and extends `BaseSourceProcessor`
+ - **Sentence splitting**: `BulgarianSentenceSplitter` wraps the Apache OpenNLP Bulgarian
+ UD sentence-detection model, splitting every document into a sentence-per-line sidecar
+ file used by all downstream stages
+ - **Boilerplate cleaning**: `FileCleanProcessor` learns site-specific boilerplate from a
+ sample directory (lines appearing in ≥ 50 % of files) and removes those lines alongside
+ hardcoded patterns for HTML tags, navigation menus, URLs, cookie banners, and more
+ - **MinHash / LSH deduplication**: `DeduplicationProcessor` builds a MinHash signature
+ index over the full existing corpus and detects near-duplicate sentences in the new
+ batch (Jaccard ≥ 0.90), writing a ranked TSV report and optionally removing duplicates
+ - **Per-sentence PII scoring**: `PIIDetector` runs every sentence through the Phileas
+ engine (names, emails, phone numbers, IBANs, IP addresses, etc.) and stores the
+ proportion of flagged tokens as the `PersonallyIdentifiableInformation` vector in metadata
+ - **Per-sentence bias scoring**: `BiasAnalyser` matches tokens against `BiasLexicon`
+ (the 3,787-entry Bulgarian Bias Dictionary v4) to detect signal–evaluator pairs across five
+ categories (gender, race/ethnicity, religion, disability, appearance), storing the
+ per-sentence coverage ratio as the `BiasedInformation` vector in metadata
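The near-duplicate check described in the deduplication item can be illustrated with a small MinHash estimate. The sketch below is only an assumption about how such a comparison might look: the class `MinHashSketch`, the character n-gram shingling, the number of hash functions, and the seed are all hypothetical and are not taken from `DeduplicationProcessor`. LSH banding, which would avoid comparing every pair of sentences, is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Random;
import java.util.Set;

/** Illustrative MinHash estimator; not the project's DeduplicationProcessor. */
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1, modulus for the hash family
    private final long[] a; // multipliers, one per hash function
    private final long[] b; // offsets, one per hash function

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) PRIME - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt((int) PRIME);         // in [0, PRIME - 1]
        }
    }

    /** Character n-grams of a sentence (a hypothetical shingling choice). */
    static Set<String> shingles(String sentence, int n) {
        Set<String> out = new HashSet<>();
        String s = sentence.toLowerCase(Locale.ROOT).trim();
        if (s.length() <= n) {
            out.add(s);
            return out;
        }
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Per hash function, keeps the minimum hash value over all shingles. */
    long[] signature(Set<String> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String sh : shingles) {
            long x = Math.floorMod((long) sh.hashCode(), PRIME);
            for (int i = 0; i < sig.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // stays below 2^63, no overflow
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of matching signature positions, an estimate of the Jaccard similarity. */
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    /** Exact Jaccard similarity, useful for checking the estimate on small inputs. */
    static double exactJaccard(Set<String> x, Set<String> y) {
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```

Under these assumptions, two sentences would count as near-duplicates when `estimate(...)` reaches the 0.90 threshold quoted in the list.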
+
+ The full schema is enforced by `DocumentMetadata` (15 mandatory and 8 optional fields), and
+ the complete flow is managed by `IfGPTPipeline`, with `IfGPTDatasetProcessor` as the
+ main entry point.
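To make the per-sentence vectors stored in the metadata more concrete, here is a minimal sketch of a proportion-of-flagged-tokens score. It assumes whitespace tokenisation and a caller-supplied predicate standing in for the real detector (Phileas for PII, `BiasLexicon` for bias). `SentenceScores`, `flaggedRatio`, and `scoreVector` are hypothetical names, and the actual bias score works on signal–evaluator pairs rather than single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper; not part of the IfGPT code base. */
final class SentenceScores {

    /** Fraction of whitespace-separated tokens in one sentence that the predicate flags. */
    static double flaggedRatio(String sentence, Predicate<String> isFlagged) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // an empty sentence has nothing to flag
        }
        String[] tokens = trimmed.split("\\s+");
        int flagged = 0;
        for (String token : tokens) {
            if (isFlagged.test(token)) {
                flagged++;
            }
        }
        return (double) flagged / tokens.length;
    }

    /** One score per sentence, in document order. */
    static List<Double> scoreVector(List<String> sentences, Predicate<String> isFlagged) {
        List<Double> vector = new ArrayList<>(sentences.size());
        for (String sentence : sentences) {
            vector.add(flaggedRatio(sentence, isFlagged));
        }
        return vector;
    }
}
```

A vector built this way has one entry per line of the sentence-per-line sidecar file, so scores stay aligned with the sentences they describe.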
+
+ ```
+ source processors → sentence split → clean → deduplication → PII → bias → counts → final structuring
+ ```
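The stage order shown above could be modelled as a chain of functions applied in sequence. This is a sketch under assumptions, not the design of `IfGPTPipeline`: `StageChainSketch` and its methods are hypothetical, and each stage is reduced to a function over a list of sentences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stage chain; the real IfGPTPipeline may be organised differently. */
final class StageChainSketch {
    private final List<String> names = new ArrayList<>();
    private final List<UnaryOperator<List<String>>> stages = new ArrayList<>();

    /** Registers a stage; stages run in the order they were added. */
    StageChainSketch add(String name, UnaryOperator<List<String>> stage) {
        names.add(name);
        stages.add(stage);
        return this;
    }

    /** Names of the registered stages, in execution order. */
    List<String> stageNames() {
        return new ArrayList<>(names);
    }

    /** Feeds the output of each stage into the next one. */
    List<String> run(List<String> sentences) {
        List<String> current = new ArrayList<>(sentences);
        for (UnaryOperator<List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

For instance, a `clean` stage that drops blank lines followed by a `deduplication` stage that drops exact repeats would reduce the input `a`, blank, `a`, `b` to `a`, `b`.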
+
+ ## License
+
+ Creative Commons Attribution 4.0 International (CC-BY-4.0)
+ __________________________________________
+
+ This work is part of the project **Infrastructure for Fine-tuning Pre-trained Large Language Models**, Grant Agreement No. ПВУ – 55 from 12.12.2024 /BG-RRP-2.017-0030-C01/.
+
+ https://ifgpt.dcl.bas.bg/en/