---
license: apache-2.0
---

# Polish Text Quality Classifier

## 1. Overview

This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).

The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence length statistics, and the number of out-of-vocabulary words. These features are calculated by a companion module (`predictor_parquet_spacy.py`).

The script is designed for efficient, large-scale data processing. It leverages Python's `multiprocessing` to parallelize computations across all available CPU cores, significantly speeding up the analysis of large datasets. It supports processing files in both **Parquet** and **JSONL** formats.
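The shape of the classifier's output can be illustrated with a minimal sketch. Note that the scorer below is a purely illustrative stand-in, not the project's model: the real pipeline uses the trained XGBoost model over the 200+ features computed by `predictor_parquet_spacy.py`. The sketch only shows the `quality_ai`/`confidence` record shape a caller can expect:

```python
# Illustrative sketch only: a stand-in scorer that mimics the output shape
# of the real classifier. The actual project uses a trained XGBoost model
# over 200+ linguistic features; this dummy uses word count.

LABELS = ("LOW", "MEDIUM", "HIGH")

def classify_text(text: str) -> dict:
    """Return a record with the same fields the real classifier appends."""
    # Dummy "probabilities" derived from text length -- NOT the real model.
    n = len(text.split())
    probs = [1.0 / (1 + n), n / (10.0 + n), 0.1]
    total = sum(probs)
    probs = [p / total for p in probs]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "quality_ai": LABELS[best],              # LOW / MEDIUM / HIGH
        "confidence": round(100 * probs[best], 1),  # e.g. 95.5
    }

record = {"text": "To jest przykładowy polski tekst."}
record.update(classify_text(record["text"]))
```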
| 14 |
+
|
| 15 |
+
## 2. Features
|
| 16 |
+
|
| 17 |
+
- **High-Performance Processing**: Utilizes `multiprocessing` to process texts in parallel, maximizing CPU usage.
|
| 18 |
+
- **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
|
| 19 |
+
- **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor_parquet_spacy.py`) to generate over 200 linguistic metrics for accurate classification.
|
| 20 |
+
- **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
|
| 21 |
+
- **Seamless Integration**: Appends classification results (`quality_ai` and `confidence`) directly to the original data, preserving all existing columns/keys.
|
| 22 |
+
- **User-Friendly Progress**: Displays a `tqdm` progress bar to monitor the analysis in real-time.
|
| 23 |
+
|

## 3. How It Works

The script follows a simple yet powerful workflow:

1. **Model Loading**: At startup, the script loads the XGBoost model (`model.joblib`) and the feature scaler (`scaler.pkl`) into memory. This is done only once to avoid I/O overhead during processing.
2. **File Discovery**: It scans a specified input directory (`input_parquet/` or `input_jsonl/`) for files to process.
3. **Data Ingestion**:
   - For **Parquet** files, it reads the data into a pandas DataFrame using the efficient `pyarrow` engine.
   - For **JSONL** files, it streams the file line by line, parsing each line as a separate JSON object.
4. **Parallel Processing**: The core task of text analysis is distributed across a pool of worker processes.
   - A list of texts is extracted from the input file.
   - Each text is sent to a worker process.
   - The worker process uses the `process_text_file` function to compute the 200+ linguistic features.
   - These features are then scaled and fed into the XGBoost model to predict the quality category and confidence score.
5. **Output Generation**:
   - The results (category and confidence) are collected from all worker processes.
   - The script appends two new fields to each original record:
     - `quality_ai`: the predicted category (`LOW`, `MEDIUM`, or `HIGH`).
     - `confidence`: the model's confidence score for the prediction (e.g., `95.5`).
   - The enriched data is saved to a new file in the output directory (`output/`), preserving the original data structure.
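The workflow above can be sketched with Python's standard library. The feature extraction and model calls are replaced here by hypothetical stand-ins (`extract_features`, `predict`), since the real implementations live in `predictor_parquet_spacy.py` and the joblib/pickle artifacts; only the fan-out/collect structure mirrors the script:

```python
from multiprocessing import Pool

def extract_features(text):
    # Stand-in for the real feature extraction (process_text_file),
    # which computes 200+ linguistic features. Here: two toy features.
    words = text.split()
    return [len(words), sum(map(len, words)) / max(len(words), 1)]

def predict(features):
    # Stand-in for scaler.transform + XGBoost prediction.
    label = "HIGH" if features[0] > 3 else "LOW"
    return {"quality_ai": label, "confidence": 90.0}

def worker(text):
    # Each worker analyzes one text end to end.
    return predict(extract_features(text))

def classify_texts(texts, processes=2):
    # Distribute per-text analysis across a pool of worker processes,
    # then append the new fields to each original record.
    with Pool(processes=processes) as pool:
        results = pool.map(worker, texts)
    return [{"text": t, **r} for t, r in zip(texts, results)]
```

In the real script the list of texts comes from the `text` column/key of the input file, and a `tqdm` bar wraps the result collection.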

## 4. Setup and Installation

### Prerequisites

- Python 3.8+
- Required Python packages

### Installation

1. **Clone the repository (if applicable):**

   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```

2. **Install the dependencies:**

   A `requirements.txt` file is recommended. Create one with the following content:

   ```
   joblib
   pandas
   pyarrow
   scikit-learn
   xgboost
   tqdm
   # Add spacy and any other dependencies from predictor_parquet_spacy.py
   spacy
   ```

   Then install them:

   ```bash
   pip install -r requirements.txt
   ```

   If you don't have a `requirements.txt` file, install the packages manually:

   ```bash
   pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy
   python -m spacy download pl_core_news_lg
   ```

3. **Download the SpaCy model:**

   The feature extraction module likely requires a SpaCy model for Polish. Download it via the command line:

   ```bash
   python -m spacy download pl_core_news_lg
   ```
### Directory Structure

Ensure your project follows this structure:

```
.
├── input_parquet/
│   ├── docs1.parquet
│   └── docs2.parquet
├── input_jsonl/
│   ├── data1.jsonl
│   └── data2.jsonl
├── models/
│   ├── model.joblib                  # The trained XGBoost model
│   └── scaler.pkl                    # The scikit-learn scaler
├── output/                           # Output directory for processed files
├── main_parquet_spacy_parquet.py     # Main processing script (Parquet input)
├── main_parquet_spacy_jsonl.py       # Main processing script (JSONL input)
└── predictor_parquet_spacy.py        # The feature extraction module
```

## 5. Usage

The script is configured to run out of the box. Simply place your data files in the appropriate input directory and run the script.

### Step 1: Place Your Data Files

- For **Parquet** files, place them in the `input_parquet/` directory. The script expects a column named `text` in each file.
- For **JSONL** files, place them in the `input_jsonl/` directory. The script expects a key named `text` in each JSON object.
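If you just want to try the pipeline, a minimal JSONL input file can be generated like this. The only field the script requires is `text`; any extra keys (here a hypothetical `id`) are preserved in the output:

```python
import json
from pathlib import Path

# Minimal JSONL input: one JSON object per line, each with the
# required "text" key. Extra keys are carried through to the output.
records = [
    {"id": 1, "text": "Pierwszy przykładowy dokument po polsku."},
    {"id": 2, "text": "Drugi, nieco dłuższy tekst do klasyfikacji."},
]

Path("input_jsonl").mkdir(exist_ok=True)
with open("input_jsonl/data1.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```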

### Step 2: Run the Script

Open your terminal and execute the Python script that matches your input format:

```bash
# For JSONL input:
python -W ignore main_parquet_spacy_jsonl.py

# For Parquet input:
python -W ignore main_parquet_spacy_parquet.py
```

### Step 3: Check the Output

The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file will be named `calculated_<original_filename>`.

For example, `input_parquet/docs.parquet` will be processed and saved as `output/calculated_docs.parquet`.

The script automatically skips files that have already been processed and exist in the output directory, making it safe to re-run.
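The naming scheme and the skip-if-done behavior can be sketched as follows. This is an assumed reconstruction for illustration; the actual script's internals may differ in details:

```python
from pathlib import Path

def output_path_for(input_path: str, output_dir: str = "output") -> Path:
    # Mirror the naming scheme: output/calculated_<original_filename>
    return Path(output_dir) / f"calculated_{Path(input_path).name}"

def needs_processing(input_path: str, output_dir: str = "output") -> bool:
    # A file is skipped when its result already exists, so re-runs are safe.
    return not output_path_for(input_path, output_dir).exists()
```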

### Example Console Output

```
python -W ignore main_parquet_spacy_parquet.py
Found 1 file(s) in the input folder
input_parquet\docs.parquet

Summary:
- Files to process: 1
- Files already processed: 0

Starting processing of 1 file(s)...
Adding to queue: input_parquet\docs.parquet
<dataframe head>
Processing time: 47.61986541748047 seconds
All files have been processed!
```