This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).

The classification is based on over 200 linguistic features extracted from each text, such as the count of nouns and verbs, NER entities, sentence-length statistics, and the number of out-of-vocabulary words. These features are calculated by a companion module (`predictor_parquet_spacy.py`).
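The exact feature set is implemented in the companion module; as a rough illustration only (the function and feature names below are hypothetical, not the module's real API), a handful of such metrics can be computed with spaCy like this:

```python
# Illustrative sketch: a few of the 200+ linguistic features, computed with
# spaCy's Polish pipeline. Names are hypothetical, not the module's actual API.
import statistics

import spacy

nlp = spacy.load("pl_core_news_md")  # Polish model (see Installation below)

def sample_features(text: str) -> dict:
    doc = nlp(text)
    sentence_lengths = [len(sent) for sent in doc.sents]  # tokens per sentence
    return {
        "noun_count": sum(tok.pos_ == "NOUN" for tok in doc),
        "verb_count": sum(tok.pos_ == "VERB" for tok in doc),
        "ner_entity_count": len(doc.ents),
        "oov_count": sum(tok.is_oov for tok in doc),
        "mean_sentence_length": statistics.fmean(sentence_lengths) if sentence_lengths else 0.0,
    }
```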
The script is designed for efficient, large-scale data processing. It leverages Python's `multiprocessing` to parallelize computations across all available CPU cores, significantly speeding up the analysis of large datasets. It supports processing files in both **Parquet** and **JSONL** formats.
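A minimal sketch of how that parallel step can look, assuming a hypothetical `extract_features(text) -> dict` function in the feature module and that the model's class order matches `LOW`/`MEDIUM`/`HIGH` (the shipped scripts may differ):

```python
# Sketch only, not the shipped script: parallel feature extraction followed by
# batch prediction. extract_features is an assumed (hypothetical) API.
import pickle
from multiprocessing import Pool, cpu_count

import joblib
import pandas as pd
from tqdm import tqdm

from predictor_parquet_spacy import extract_features  # assumed API

LABELS = ["LOW", "MEDIUM", "HIGH"]  # assumed to match the model's class order

def classify(texts):
    model = joblib.load("models/model.joblib")
    with open("models/scaler.pkl", "rb") as f:
        scaler = pickle.load(f)
    # Feature extraction dominates runtime, so spread it across all CPU cores.
    with Pool(cpu_count()) as pool:
        rows = list(tqdm(pool.imap(extract_features, texts), total=len(texts)))
    proba = model.predict_proba(scaler.transform(pd.DataFrame(rows)))
    return pd.DataFrame({
        "quality_ai": [LABELS[p.argmax()] for p in proba],
        "confidence": proba.max(axis=1),
    })
```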
- **High-Performance Processing**: Utilizes `multiprocessing` to process texts in parallel, maximizing CPU usage.
- **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
- **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor_parquet_spacy.py`) to generate over 200 linguistic metrics for accurate classification.
- **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
- **Seamless Integration**: Appends classification results (`quality_ai` and `confidence`) directly to the original data, preserving all existing columns/keys.
- **User-Friendly Progress**: Displays a `tqdm` progress bar to monitor the analysis in real-time.

### Prerequisites
- Python 3.8+ (tested with 3.12.3)
- Required Python packages

### Installation

2. **Install the dependencies:**
A `requirements.txt` file is provided:
```
joblib
pandas
pyarrow
scikit-learn
xgboost
tqdm
spacy
```
Then install them:

```bash
pip install -r requirements.txt
```

If you don't have a `requirements.txt` file, install the packages manually:

```bash
pip install joblib pandas pyarrow scikit-learn xgboost tqdm spacy
python -m spacy download pl_core_news_md
```
3. **Download SpaCy Model:**

The feature extraction module requires a SpaCy model for Polish (`pl_core_news_md`). Download it via the command line:

```bash
python -m spacy download pl_core_news_md
```
### Directory Structure

Ensure your project follows this structure:

```
├── input_parquet/
│   └── docs.parquet
├── input_jsonl/
│   ├── data1.jsonl
│   └── data2.jsonl
├── models/
│   ├── model.joblib                  # The trained XGBoost model
│   └── scaler.pkl                    # The scikit-learn scaler
├── output/                           # Output directory for processed files
├── main_parquet_spacy_jsonl.py       # The main processing script (JSONL)
├── main_parquet_spacy_parquet.py     # The main processing script (Parquet)
└── predictor_parquet_spacy.py        # The feature extraction module
```
## 5. Usage

### Step 1: Add Your Input Files

Place the files to classify in the corresponding input directory (`.parquet` or `.jsonl`).

### Step 2: Run the Script

Run the script that matches your input format (`main_parquet_spacy_parquet.py` for Parquet, `main_parquet_spacy_jsonl.py` for JSONL):

```bash
python -W ignore main_parquet_spacy_parquet.py
```
### Step 3: Check the Output
The script will create an `output/` directory (if it doesn't exist) and save the processed files there. Each output file keeps its original filename.

For example, `input_parquet/docs.parquet` will be processed and saved as `output/docs.parquet`.
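To spot-check the results, the two appended columns can be inspected with pandas (file name taken from the example above):

```python
import pandas as pd

df = pd.read_parquet("output/docs.parquet")
print(df[["quality_ai", "confidence"]].head())  # columns appended by the classifier
print(df["quality_ai"].value_counts())          # distribution of LOW/MEDIUM/HIGH
```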
The script automatically skips files that have already been processed and exist in the output directory, making it safe to re-run.
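That resume behavior amounts to an existence check before each file is processed, roughly like this (a sketch, not the scripts' exact code):

```python
# Sketch of the resume-safe file loop: anything already in output/ is skipped,
# so an interrupted run can simply be restarted without repeating work.
from pathlib import Path

for src in sorted(Path("input_parquet").glob("*.parquet")):
    dst = Path("output") / src.name  # output keeps the original filename
    if dst.exists():
        continue  # processed on an earlier run
    # ... classify the texts in src and write the result to dst ...
```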