---
license: mit
title: AVeri
sdk: docker
emoji: 📚
colorFrom: gray
colorTo: indigo
short_description: An NLP-based author verifier tool.
---

# AVeri: Author Verification

This repository contains the source code for an *authorship verifier*: a tool that predicts whether two given texts were written by the same author, based purely on stylistic and lexical characteristics rather than on semantic content (the meaning or topic a text conveys). The repository includes an end-to-end machine learning pipeline for preparing paired texts, extracting stylometric and lexical features, training a binary classifier, and serving the trained model through a small Flask web app.

🎥 [YOU CAN ACCESS THE LIVE DEMO HERE](https://huggingface.co/spaces/salirafi/AVeri) 🎥

![static/image.png](static/image.png)

⚠️ **IMPORTANT!** ⚠️

- The tool is trained on English text only. Other languages are not yet supported.
- The tool is **NOT** perfect. Although the metrics shown in the app suggest that the model performs reasonably well, it will still produce false positives and false negatives in many cases. The model rests on one key assumption: different authors have different writing styles. Because it is trained on stylistic and lexical features only, two texts that look similar on those features will be predicted as coming from the same author even when they were written by different authors. This is especially true for formal texts, where stylistic differences between authors are minimal and the two classes are noticeably harder to distinguish. Please interpret the results accordingly.
- The current saved model uses a classification threshold of 0.58, selected by maximizing Youden's J statistic on the validation set.
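Youden's J statistic is defined as J = TPR - FPR (true positive rate minus false positive rate), and the chosen threshold is the cut-off that maximizes it. The following is a minimal, dependency-free sketch of that selection, not the repository's actual code: the function name `youden_threshold` and the toy labels and scores are assumptions made up for illustration.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    y_true holds 0/1 labels (1 = same author); y_score holds the
    predicted probabilities for the positive class.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to compute Youden's J")

    best_threshold, best_j = None, float("-inf")
    # Every distinct score is a candidate cut-off; a pair is predicted
    # "same author" when its score is >= the cut-off.
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy validation labels and scores (hypothetical, for illustration only).
labels = [0, 0, 1, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.9]
threshold, j = youden_threshold(labels, scores)
print(f"threshold={threshold}, J={j:.3f}")
```

On real validation data the same quantity can be computed from the output of scikit-learn's `roc_curve`, which returns the false positive rates, true positive rates, and thresholds needed for `tpr - fpr`.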
## Tools Used

### Backend

- pandas
- numpy
- sklearn
- scipy
- spaCy
- XGBoost
- Flask

### Frontend

- HTML
- CSS
- JavaScript (plain)

## How to Run

### Minimal Procedure

Run the following commands:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```

`en_core_web_lg` is the spaCy model used to process the texts before training. See [https://spacy.io/models/en#en_core_web_lg](https://spacy.io/models/en#en_core_web_lg). In principle, any spaCy model can be used, including `en_core_web_trf`, the transformer-based model and the most accurate one available. Make sure the machine has enough RAM, because processing can consume a large amount of memory when there are many long texts.

If all saved model artifacts are present in the repo (they should be; if not, see [here](#rebuild-the-training-pipeline)), run the app directly:

```bash
python app.py
```

Then open:

```text
http://127.0.0.1:5000
```

### Run a Prediction From Python

From the project root:

```bash
PYTHONPATH=src python
```

Then:

```python
from inference import Inference

service = Inference(project_root=".")
result = service.predict("First text here.", "Second text here.")
print(result.to_dict())
```

### Rebuild the Training Pipeline

The full pipeline (except model training and inference) is run in [src/pipeline.ipynb](src/pipeline.ipynb), so interested users can follow the steps in the notebook up to and including feature extraction. The dataset can be downloaded from [HERE](https://huggingface.co/swan07/bert-authorship-verification); it already includes all three splits: train, validation, and test. The expected location of the downloaded data is:

```text
data/raw/
|-- authorship_verification_train/
|-- authorship_verification_validation/
`-- authorship_verification_test/
```

The notebook performs the pipeline stages in order and writes artifacts under [saved/](saved/).
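Before opening the notebook, it can help to confirm that the raw-data layout described above is in place. The following is a small sketch using only the standard library. The helper name `missing_splits` is hypothetical and not part of the repository, and the script assumes it is run from the project root.

```python
from pathlib import Path

# Split directories the pipeline expects under data/raw/ (taken from the layout above).
EXPECTED_SPLITS = (
    "authorship_verification_train",
    "authorship_verification_validation",
    "authorship_verification_test",
)


def missing_splits(raw_dir):
    """Return the names of expected split directories that are absent from raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in EXPECTED_SPLITS if not (raw_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_splits("data/raw")
    if missing:
        print("Missing split directories:", ", ".join(missing))
    else:
        print("All expected split directories are present.")
```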
`data/raw/`, `*.parquet`, and `*.pkl` are ignored by Git, so these generated artifacts must be recreated locally after cloning before you can train the model. Specifically, the training script [src/model_training.py](src/model_training.py) expects:

```text
saved/ngram_features/dataframes/train_ngram.parquet
saved/ngram_features/dataframes/validation_ngram.parquet
saved/ngram_features/dataframes/test_ngram.parquet
```

Once the feature Parquet files exist, model training can be launched with:

```bash
python src/model_training.py
```

## Contents

```text
.
|-- app.py
|-- README.md
|-- .gitignore
|-- src/
|   |-- audit.py
|   |-- normalization.py
|   |-- masking_regex.py
|   |-- masking_spacy.py
|   |-- features_statistical.py
|   |-- features_tfidf.py
|   |-- features_ngram.py
|   |-- dimensionality_reduction.py
|   |-- model_training.py
|   |-- inference.py
|   |-- helpers.py
|   |-- function_words.py
|   `-- pipeline.ipynb
|-- saved/
|   |-- audit/
|   |-- normalization/
|   |-- masking/
|   |-- statistical_features/
|   |-- tfidf_features/
|   |-- ngram_features/
|   |-- dimensionality_reduction/
|   |-- pairwise_baseline/
|   `-- model/
|-- static/
|   |-- app.js
|   `-- styles.css
`-- templates/
    `-- index.html
```

## Pipeline Overview

### 1. Audit and Filtering

Implemented in [src/audit.py](src/audit.py).

The audit stage loads the Hugging Face datasets (downloadable from [HERE](https://huggingface.co/swan07/bert-authorship-verification)) from disk for the `train`, `validation`, and `test` splits. Among other filters, it removes rows where either text falls outside the configured word-count range.

![static/cdf.png](static/cdf.png)

### 2. Normalization

Implemented in [src/normalization.py](src/normalization.py).

The normalization stage reduces accidental text variation before feature extraction:

- HTML entities are unescaped.
- Broken Unicode is repaired with `ftfy`.
- Unicode is normalized with NFC.
- Line endings are standardized to `\n`.
- Non-printable control characters are removed.
- Curly quotes, long dashes, minus signs, and non-breaking spaces are mapped to simpler equivalents.
- Inline whitespace is collapsed.
- Excess newlines are limited.
- LaTeX math spans are normalized.

### 3. Regex Masking

Implemented in [src/masking_regex.py](src/masking_regex.py).

This stage replaces surface identifiers with placeholders:

| Pattern | Placeholder |
|---|---|
| URL | `` |
| Email | `` |
| Date | `` |
| Time | `