This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).

The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence-length statistics, and the number of out-of-vocabulary words. These features are calculated by a companion module (`predictor_parquet_spacy.py`).
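The exact feature set is implemented in the companion module; as a rough illustration only (the function and feature names below are hypothetical, not the module's real API), a handful of such metrics can be computed with spaCy like this:

```python
# Illustrative sketch: a few of the 200+ linguistic features, computed with
# spaCy's Polish pipeline. Names are hypothetical, not the module's actual API.
import statistics

import spacy

nlp = spacy.load("pl_core_news_md")  # Polish model (see Installation below)

def sample_features(text: str) -> dict:
    doc = nlp(text)
    sentence_lengths = [len(sent) for sent in doc.sents]  # tokens per sentence
    return {
        "noun_count": sum(tok.pos_ == "NOUN" for tok in doc),
        "verb_count": sum(tok.pos_ == "VERB" for tok in doc),
        "ner_entity_count": len(doc.ents),
        "oov_count": sum(tok.is_oov for tok in doc),
        "mean_sentence_length": statistics.fmean(sentence_lengths) if sentence_lengths else 0.0,
    }
```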
The script is designed for efficient, large-scale data processing. It leverages Python's `multiprocessing` to parallelize computations across all available CPU cores, significantly speeding up the analysis of large datasets. It supports processing files in both **Parquet** and **JSONL** formats.
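A minimal sketch of how that parallel step can look, assuming a hypothetical `extract_features(text) -> dict` function in the feature module and that the model's class order matches `LOW`/`MEDIUM`/`HIGH` (the shipped scripts may differ):

```python
# Sketch only, not the shipped script: parallel feature extraction followed by
# batch prediction. extract_features is an assumed (hypothetical) API.
import pickle
from multiprocessing import Pool, cpu_count

import joblib
import pandas as pd
from tqdm import tqdm

from predictor_parquet_spacy import extract_features  # assumed API

LABELS = ["LOW", "MEDIUM", "HIGH"]  # assumed to match the model's class order

def classify(texts):
    model = joblib.load("models/model.joblib")
    with open("models/scaler.pkl", "rb") as f:
        scaler = pickle.load(f)
    # Feature extraction dominates runtime, so spread it across all CPU cores.
    with Pool(cpu_count()) as pool:
        rows = list(tqdm(pool.imap(extract_features, texts), total=len(texts)))
    proba = model.predict_proba(scaler.transform(pd.DataFrame(rows)))
    return pd.DataFrame({
        "quality_ai": [LABELS[p.argmax()] for p in proba],
        "confidence": proba.max(axis=1),
    })
```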
- **High-Performance Processing**: Utilizes `multiprocessing` to process texts in parallel, maximizing CPU usage.
- **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
- **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor_parquet_spacy.py`) to generate over 200 linguistic metrics for accurate classification.
- **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
- **Seamless Integration**: Appends classification results (`quality_ai` and `confidence`) directly to the original data, preserving all existing columns/keys.
- **User-Friendly Progress**: Displays a `tqdm` progress bar to monitor the analysis in real-time.

### Prerequisites
- Python 3.8+ (tested with 3.12.3)
- Required Python packages

### Installation

2. **Install the dependencies:**
A `requirements.txt` file is provided:
```
joblib
pandas
pyarrow
scikit-learn
xgboost
tqdm
spacy
```
Then install them:

```bash
pip install -r requirements.txt
```

If you don't have a `requirements.txt` file, install the packages manually:

```bash
pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy
python -m spacy download pl_core_news_md
```
3. **Download SpaCy Model:**

The feature extraction module requires a SpaCy model for Polish (`pl_core_news_md`). Download it via the command line:

```bash
python -m spacy download pl_core_news_md
```
### Directory Structure

Ensure your project follows this structure:

```
├── input_parquet/
│   └── docs.parquet
├── input_jsonl/
│   ├── data1.jsonl
│   └── data2.jsonl
├── models/
│   ├── model.joblib                  # The trained XGBoost model
│   └── scaler.pkl                    # The scikit-learn scaler
├── output/                           # Output directory for processed files
├── main_parquet_spacy_jsonl.py       # The main processing script (JSONL)
├── main_parquet_spacy_parquet.py     # The main processing script (Parquet)
└── predictor_parquet_spacy.py        # The feature extraction module
```
## 5. Usage

### Step 1: Add Your Input Files

Place the files to classify in the corresponding input directory (`.parquet` or `.jsonl`).

### Step 2: Run the Script

Run the script that matches your input format (`main_parquet_spacy_parquet.py` for Parquet, `main_parquet_spacy_jsonl.py` for JSONL):

```bash
python -W ignore main_parquet_spacy_parquet.py
```
### Step 3: Check the Output
The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file keeps its original filename.

For example, `input_parquet/docs.parquet` will be processed and saved as `output/docs.parquet`.
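To spot-check the results, the two appended columns can be inspected with pandas (file name taken from the example above):

```python
import pandas as pd

df = pd.read_parquet("output/docs.parquet")
print(df[["quality_ai", "confidence"]].head())  # columns appended by the classifier
print(df["quality_ai"].value_counts())          # distribution of LOW/MEDIUM/HIGH
```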
The script automatically skips files that have already been processed and exist in the output directory, making it safe to re-run.
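That resume behavior amounts to an existence check before each file is processed, roughly like this (a sketch, not the scripts' exact code):

```python
# Sketch of the resume-safe file loop: anything already in output/ is skipped,
# so an interrupted run can simply be restarted without repeating work.
from pathlib import Path

for src in sorted(Path("input_parquet").glob("*.parquet")):
    dst = Path("output") / src.name  # output keeps the original filename
    if dst.exists():
        continue  # processed on an earlier run
    # ... classify the texts in src and write the result to dst ...
```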