---
license: apache-2.0
---

# Polish Text Quality Classifier

## 1. Overview

This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).

The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence length statistics, and the number of out-of-vocabulary words. These features are calculated by companion modules located in the `features` folder.

The script supports processing files in both **Parquet** and **JSONL** formats.

## 2. Features

- **Efficient Batch Processing**: Processes all texts from a file at once, minimizing I/O and leveraging vectorized computations for high performance.
- **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
- **Robust Feature Extraction**: Relies on sophisticated feature engineering modules located in the `features` folder to generate over 200 linguistic metrics for accurate classification.
- **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
- **Seamless Integration**: Appends classification results (`quality_ai` and `confidence`) directly to the original data, preserving all existing columns/keys.
- **User-Friendly Progress**: Displays a `tqdm` progress bar to monitor the analysis in real-time.
- **Language-Aware Filtering**: Automatically classifies all non-Polish texts as LOW quality, unless a multilingual mix (e.g., Polish-English) is detected, in which case the model’s prediction may vary accordingly.

## 3. How It Works

The script follows a simple yet powerful workflow:

1.  **Model Loading**: At startup, the script loads the XGBoost model (`model.joblib`) and the feature scaler (`scaler.pkl`) into memory. This is done only once to avoid I/O overhead during processing.
2.  **File Discovery**: It scans a specified input directory (`input_parquet/` or `input_jsonl/`) for files to process.
3.  **Data Ingestion**:
    -   For **Parquet files**, it reads the data into a pandas DataFrame using the efficient `pyarrow` engine.
    -   For **JSONL files**, it streams the file line by line, parsing each line as a separate JSON object.
4.  **Parallel Processing**: The core task of text analysis is distributed across a pool of worker processes.
    -   The `text` column (or key) is extracted from the input file into a list, forming a complete batch.
    -   This batch of texts is passed to the `predict_batch` function.
    -   Inside the function, the `TextAnalyzer` calculates features for all texts. This step may itself use mini-batches for memory efficiency.
5.  **Output Generation**:
    -   The results (category and confidence) are collected from all worker processes.
    -   The script appends two new fields to each original record:
        -   `quality_ai`: The predicted category (`LOW`, `MEDIUM`, or `HIGH`).
        -   `confidence`: The model's confidence score for the prediction (e.g., `95.5`).
    -   The enriched data is saved to a new file in the output directory (`output/`), preserving the original data structure.
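
The workflow above can be sketched in simplified, runnable form. The `predict_batch` stand-in below is hypothetical: the real script extracts 200+ linguistic features and calls the XGBoost model, which is replaced here by a trivial length-based rule purely for illustration.

```python
import json

CATEGORIES = ["LOW", "MEDIUM", "HIGH"]

def predict_batch(texts):
    """Stand-in for the real predict_batch: returns (category, confidence)
    per text. The real version computes 200+ features via TextAnalyzer
    and runs the XGBoost model; this dummy rule is illustration only."""
    results = []
    for t in texts:
        idx = min(len(t) // 50, 2)      # longer text -> "higher" category
        results.append((CATEGORIES[idx], 90.0))
    return results

def process_jsonl(lines):
    """Parse JSONL records, classify the batch, enrich each record."""
    records = [json.loads(line) for line in lines]
    preds = predict_batch([rec["text"] for rec in records])
    for rec, (cat, conf) in zip(records, preds):
        rec["quality_ai"] = cat         # predicted category
        rec["confidence"] = conf        # confidence score (percent)
    return records

out = process_jsonl(['{"text": "Krótki tekst."}'])
```

The enriched records keep all original keys, with `quality_ai` and `confidence` appended, mirroring the output format described above.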

## 4. Setup and Installation

### Prerequisites

- Python 3.10+ (3.12.3 tested)
- Required Python packages

### Installation

1.  **Clone the repository (if applicable):**
    ```bash
    git clone https://huggingface.co/adgw/quality_classifier_pl
    cd quality_classifier_pl
    ```

2.  **Install the dependencies:**
    A `requirements.txt` file is provided. Install the packages with:
    ```bash
    pip install -r requirements.txt
    ```
    If you don't have a `requirements.txt` file, install the packages manually:
    ```bash
    pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy ...
    ```

3.  **Download SpaCy Model:**
    The feature extraction module likely requires a SpaCy model for Polish. Download it via the command line:
    ```bash
    python -m spacy download pl_core_news_md
    ```

### Directory Structure

Ensure your project follows this structure:

```
.
├── input_parquet/
│   └── test.parquet
├── input_jsonl/
│   └── test.jsonl
├── models/
│   ├── model.joblib    # The trained XGBoost model
│   └── scaler.pkl      # The scikit-learn scaler
├── output/             # Output directory for processed files
├── dummy.py            # The interactive testing script
├── main_jsonl.py       # The main processing script jsonl
├── main_parquet.py     # The main processing script parquet
└── text_analyzer/      # The feature extraction module
    ├── __init__.py
    ├── analyzer.py
    ├── utils.py
    ├── constants.py
    └── features/
        ├── base_features.py
        ├── linguistic_features.py
        ├── regex_features.py
        ├── spacy_features.py
        └── structural_features.py
```

## 5. Usage

The script is configured to run out-of-the-box. Simply place your data files in the input directory and execute the main script.

### Step 1: Place Your Data Files

-   For **Parquet** files, place them in the `input_parquet/` directory. The script expects a column named `text` in each file.
-   For **JSONL** files, place them in the `input_jsonl/` directory. The script expects a key named `text` in each JSON object.
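
As a quick sketch, a minimal `test.jsonl` input file can be generated like this (the sample texts are arbitrary placeholders):

```python
import json
import os

os.makedirs("input_jsonl", exist_ok=True)
samples = [
    {"text": "Przykładowy tekst po polsku."},  # "Sample text in Polish."
    {"text": "Drugi dokument testowy."},       # "Second test document."
]
# One JSON object per line, each with the required "text" key.
with open("input_jsonl/test.jsonl", "w", encoding="utf-8") as f:
    for rec in samples:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```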

### Step 2: Run the Script

Open your terminal and execute the Python script:

```bash
python -W ignore main_jsonl.py
# or
python -W ignore main_parquet.py
```

### Step 3: Check the Output

The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file keeps its original filename.

For example, `input_parquet/docs.parquet` will be processed and saved as `output/docs.parquet`.

The script automatically skips files that have already been processed and exist in the output directory, making it safe to re-run.
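
The skip behaviour can be illustrated with a small sketch; the helper below is hypothetical, not the script's actual code, and assumes output files share the input filename as described above.

```python
import tempfile
from pathlib import Path

def files_to_process(input_dir, output_dir):
    """Return input files that do not yet have a same-named output file."""
    out = Path(output_dir)
    return sorted(p for p in Path(input_dir).glob("*.jsonl")
                  if not (out / p.name).exists())

# Demo in a temporary workspace: "a.jsonl" is already processed.
root = Path(tempfile.mkdtemp())
(root / "input_jsonl").mkdir()
(root / "output").mkdir()
(root / "input_jsonl" / "a.jsonl").write_text('{"text": "x"}\n')
(root / "input_jsonl" / "b.jsonl").write_text('{"text": "y"}\n')
(root / "output" / "a.jsonl").write_text("")   # marks "a" as done
pending = files_to_process(root / "input_jsonl", root / "output")
```

Only files missing from `output/` are returned, so re-running never repeats finished work.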

### Example Console Output

```
python -W ignore main_parquet.py
Analiza cech: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12.47it/s]
                                                text quality_ai  confidence
0  Pierwszy kamienny kościół w stylu romańskim po...       HIGH       99.97
1  FJM.B.ZP \n cykl kształcenia 2019-2024\nKARTA ...        LOW       99.96
2  Sztuka romańska (styl romański, romanizm, roma...       HIGH       99.92
3  Przypisy\n Jerzy Z. Łoziński: Pomniki sztuki w...        LOW       92.30
4  Na temat historii sakramentarza w wiekach XII–...       HIGH       95.76
5  Przednia okładka\nPrzednia okładka\n \nMiniatu...        LOW       92.64
6  Uchwała Nr 19\nZarządu Związku Rzemiosła Polsk...     MEDIUM       62.75
7  Alternatywy 4 to jeden z najważniejszych i naj...       HIGH       99.98
8  Akslop, może to jakieś duńskie miasto\njestem ...       HIGH       73.60
9  Bielik - orzeł, czy nie orzeł?\nBielik, birkut...       HIGH       99.92
Pomyślnie zapisano przetworzone dane do pliku output\test.parquet
Processing time: 0.8603 seconds

Wszystkie pliki zostały przetworzone!
```

## 6. Interactive Testing (dummy.py)

For quick, single-text analysis or model debugging, you can use the `dummy.py` script. It runs in a single-threaded, interactive mode.

### How to Use

1.  Run the script from your terminal:
    ```bash
    python dummy.py
    ```
2.  The script will load the models and prompt you for input with a `>` symbol.
3.  Type any Polish text and press Enter. The script will immediately display the predicted category and confidence score.
4.  To exit the script, type `quit` or `exit` and press Enter.
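
The prompt loop in `dummy.py` roughly follows this shape. This is a simplified, non-interactive sketch: `classify` stands in for the real model call, and the list of `lines` replaces reading from stdin.

```python
def interactive_loop(classify, lines):
    """Simplified shape of the dummy.py prompt loop: classify(text)
    returns (category, confidence); 'quit'/'exit' ends the session."""
    outputs = []
    for line in lines:
        if line.strip().lower() in ("quit", "exit"):
            break
        cat, conf = classify(line)
        outputs.append(f"{cat} ({conf:.2f}%)")
    return outputs

# Stand-in classifier that always answers HIGH with 99.9% confidence.
demo = interactive_loop(lambda t: ("HIGH", 99.9), ["Jakiś tekst", "quit"])
```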


## 7. Model Evaluation

This model has been integral to the development of Bielik v1, Bielik v2, and Bielik v3, and has been continuously enhanced with additional features across successive versions. It has also benefited from progressively larger training, validation, and test samples, ensuring robust and reliable performance improvements over time.

Links to the technical reports:

- Bielik v1: https://arxiv.org/abs/2410.18565
- Bielik v2: https://arxiv.org/abs/2505.02410
- Bielik v3: https://arxiv.org/abs/2505.02550