Joblib
adgw committed · verified
Commit 67e6aac · 1 Parent(s): 238a066
Files changed (1)
  1. README.md +15 -24
README.md CHANGED
@@ -8,7 +8,7 @@ license: apache-2.0
 
 This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).
 
- The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence length statistics, and the number of out-of-vocabulary words. These features are calculated by a companion module (`predictor_parquet_spacy.py`).
 
 The script is designed for efficient, large-scale data processing. It leverages Python's `multiprocessing` to parallelize computations across all available CPU cores, significantly speeding up the analysis of large datasets. It supports processing files in both **Parquet** and **JSONL** formats.
 
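The feature families named in this hunk (sentence-length statistics, out-of-vocabulary counts, POS and NER counts) can be illustrated with a toy extractor. `extract_basic_features` and the tiny vocabulary below are hypothetical stand-ins, not the actual predictor module, which derives its POS and NER metrics with spaCy:

```python
import re
import statistics

def extract_basic_features(text, vocab):
    """Hypothetical mini-extractor: sentence-length statistics and
    out-of-vocabulary counts, two of the feature families the README
    names. The real module also counts POS tags and NER entities with
    spaCy; that is omitted to keep this sketch dependency-free."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    sent_lens = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "mean_sent_len": statistics.mean(sent_lens) if sent_lens else 0.0,
        "n_oov": sum(1 for t in tokens if t not in vocab),
    }

vocab = {"to", "jest", "bardzo", "dobry", "tekst"}
features = extract_basic_features("To jest bardzo dobry tekst. To jest test.", vocab)
```

The real feature vector has over 200 such entries; the point here is only the shape of the computation.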
@@ -16,7 +16,7 @@ The script is designed for efficient, large-scale data processing. It leverages
 
 - **High-Performance Processing**: Utilizes `multiprocessing` to process texts in parallel, maximizing CPU usage.
 - **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
- - **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor_parquet_spacy.py`) to generate over 200 linguistic metrics for accurate classification.
 - **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
 - **Seamless Integration**: Appends classification results (`quality_ai` and `confidence`) directly to the original data, preserving all existing columns/keys.
 - **User-Friendly Progress**: Displays a `tqdm` progress bar to monitor the analysis in real-time.
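The parallel workflow this feature list describes can be sketched as follows. `classify_text` is a hypothetical worker in which a token-count heuristic stands in for the real feature extraction and XGBoost prediction, and a thread-based pool (`multiprocessing.dummy`) is used only so the sketch runs anywhere; the script itself uses process-based `multiprocessing`:

```python
from multiprocessing.dummy import Pool  # thread-based stand-in; the script uses process pools

def classify_text(text):
    """Hypothetical worker. In the real script each worker extracts the
    ~200 features and runs the XGBoost model; a token-count heuristic
    stands in for that here."""
    n_tokens = len(text.split())
    label = "HIGH" if n_tokens > 8 else ("MEDIUM" if n_tokens > 3 else "LOW")
    confidence = 0.9 if label != "MEDIUM" else 0.6
    # The script appends these two keys to each record, preserving the rest.
    return {"quality_ai": label, "confidence": confidence}

texts = [
    "Krótki tekst.",
    "To jest nieco dłuższy przykład tekstu.",
    "Ten przykład ma zdecydowanie więcej niż osiem tokenów w jednym zdaniu.",
]
with Pool(4) as pool:
    results = pool.map(classify_text, texts)
```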
@@ -46,7 +46,7 @@ The script follows a simple yet powerful workflow:
 
 ### Prerequisites
 
- - Python 3.8+
 - Required Python packages
 
 ### Installation
@@ -58,31 +58,21 @@ The script follows a simple yet powerful workflow:
 ```
 
 2. **Install the dependencies:**
- A `requirements.txt` file is recommended. Create one with the following content:
 ```
- joblib
- pandas
- pyarrow
- scikit-learn
- xgboost
- tqdm
- # Add spacy and any other dependencies from predictor_parquet_spacy.py
- spacy
- ```
- Then install them:
 ```bash
 pip install -r requirements.txt
 ```
 If you don't have a `requirements.txt` file, install the packages manually:
 ```bash
- pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy
- python -m spacy download pl_core_news_md
 ```
 
 3. **Download SpaCy Model:**
 The feature extraction module likely requires a SpaCy model for Polish. Download it via the command line:
 ```bash
- python -m spacy download pl_core_news_lg
 ```
 
 ### Directory Structure
@@ -98,11 +88,12 @@ Ensure your project follows this structure:
 │ └── data1.jsonl
 │ └── data2.jsonl
 ├── models/
- │ ├── model.joblib # The trained XGBoost model
- │ └── scaler.pkl # The scikit-learn scaler
- ├── output/ # Output directory for processed files
- ├── main_script.py # The main processing script
- └── predictor_parquet_spacy.py # The feature extraction module
 ```
 
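A minimal sketch of the artifact-loading step implied by the `models/` directory. Real code would presumably call `joblib.load("models/model.joblib")` for the XGBoost model; here stdlib `pickle` and a plain parameter dict stand in for the scikit-learn scaler so the example stays dependency-free:

```python
import os
import pickle
import tempfile

# Stand-ins: the real files hold a fitted XGBoost model (model.joblib,
# loaded with joblib.load) and a fitted scikit-learn scaler (scaler.pkl).
# A plain dict keeps this sketch free of third-party dependencies.
scaler_params = {"mean": 4.0, "scale": 2.0}

with tempfile.TemporaryDirectory() as models_dir:
    scaler_path = os.path.join(models_dir, "scaler.pkl")
    with open(scaler_path, "wb") as f:
        pickle.dump(scaler_params, f)
    with open(scaler_path, "rb") as f:
        loaded = pickle.load(f)

def scale_feature(x, params):
    # scikit-learn's StandardScaler transforms each feature as (x - mean) / scale
    return (x - params["mean"]) / params["scale"]

scaled = scale_feature(10.0, loaded)
```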
  ## 5. Usage
@@ -126,9 +117,9 @@ python -W ignore main_parquet_spacy_parquet.py
 
 ### Step 3: Check the Output
 
- The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file will be named `calculated_<original_filename>`.
 
- For example, `input_parquet/docs.parquet` will be processed and saved as `output/calculated_docs.parquet`.
 
 The script automatically skips files that have already been processed and exist in the output directory, making it safe to re-run.
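The skip-and-resume behaviour described here could look roughly like this. `pending_files` is a hypothetical helper, not the script's actual code; since this commit also changes the output naming scheme, the filename prefix is left as a parameter:

```python
import tempfile
from pathlib import Path

def pending_files(input_dir, output_dir, prefix=""):
    """Hypothetical helper: list input files whose processed counterpart
    is missing from output_dir, mirroring the skip-and-resume behaviour.
    `prefix` covers the `calculated_` naming used by one variant."""
    output_dir.mkdir(exist_ok=True)
    return [p for p in sorted(input_dir.glob("*.parquet"))
            if not (output_dir / (prefix + p.name)).exists()]

# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    inp, out = Path(tmp, "input_parquet"), Path(tmp, "output")
    inp.mkdir()
    out.mkdir()
    for name in ("docs.parquet", "done.parquet"):
        (inp / name).touch()
    (out / "done.parquet").touch()  # already processed -> skipped on re-run
    todo = [p.name for p in pending_files(inp, out)]
```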
 
 
Lines added by this commit:

+ The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence length statistics, and the number of out-of-vocabulary words. These features are calculated by a companion module (`predictor.py`).

+ - **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor.py`) to generate over 200 linguistic metrics for accurate classification.

+ - Python 3.8+ (3.12.3 tested)

+ A `requirements.txt` file is included.
+ Installing:
+ pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy ...
+ python -m spacy download pl_core_news_md

+ │ ├── model.joblib # The trained XGBoost model
+ │ └── scaler.pkl # The scikit-learn scaler
+ ├── output/ # Output directory for processed files
+ ├── main_parquet_spacy_jsonl.py # The main processing script (JSONL)
+ ├── main_parquet_spacy_parquet.py # The main processing script (Parquet)
+ └── predictor_parquet_spacy.py # The feature extraction module

+ The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file will be named `<original_filename>` in the new output folder.

+ For example, `input_parquet/docs.parquet` will be processed and saved as `output/docs.parquet`.