---
license: cc-by-sa-4.0
datasets:
- JQ1984/GDPRcasedata
language:
- en
metrics:
- accuracy
base_model:
- JQ1984/legalbert_gdpr_pretrained
pipeline_tag: text-classification
tags:
- legal
---

# GDPR Violation Result Classifier

This repository contains a fine-tuned transformer model that predicts GDPR violation results from case features.

## Model Overview

- **Base Model**: JQ1984/legalbert_gdpr_pretrained (a BERT model pre-trained on legal and GDPR-specific texts)
- **Task**: Binary classification to predict violation results (0: no violation, 1: violation)
- **Training Method**: 5-fold cross-validation with hyperparameter optimization

## Dataset

The model was trained on a custom GDPR violation dataset containing real violation cases. The dataset includes:

- 2,412 cases in total (2,058 violations, 354 non-violations)
- Features covering affected data volume, countries, industry sectors, data categories, data processing basis, GDPR clauses, and various violation indicators
- All categorical features converted to text descriptions for the transformer model
- Dataset link: https://huggingface.co/datasets/JQ1984/GDPRcasedata

## Training Methodology

The training pipeline followed these steps:

1. **Text Conversion**: All numerical and categorical features were converted to text descriptions
2. **K-Fold Cross-Validation**: 5-fold cross-validation was used so that the reported performance does not depend on a single train/test split
3. **Fine-tuning**: The LegalBERT model was fine-tuned on the classification task
4. **Hyperparameters**:
   - Batch size: 16
   - Learning rate: 3e-5
   - Epochs: 3
   - Weight decay: 0.01
   - Optimizer: AdamW

## Performance Metrics

The model achieved the following metrics, averaged across the 5 cross-validation folds:

- **Average Accuracy**: 95.03%
- **Average F1 Score**: 89.33%
- **Average Precision**: 92.79%
- **Average Recall**: 86.60%

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_path = "YOUR_USERNAME/gdpr-violation-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # make sure dropout is disabled for inference

# Example text (format similar to training data)
text = "GDPR clauses are Art. 5, Art. 6. Date is 2022-05-15. country is Germany. company_industry is Technology. data_category_personal_data is true. data_processing_basis_consent is true."

# Tokenize and predict (no gradients are needed at inference time)
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
probabilities = outputs.logits.softmax(dim=-1)
predicted_class = outputs.logits.argmax(dim=-1).item()  # 0: no violation, 1: violation

print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities[0].tolist()}")
```

## Contact

For questions, feedback, or collaboration opportunities, please contact:

- Jacques Qiu (邱耿航)
- Email: jonstark186@gmail.com
- GitHub: JacquotQ
- LinkedIn: https://www.linkedin.com/in/jacques-qiu-50477b266/
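
## Example: Converting Case Features to Text

The Text Conversion step under Training Methodology turns structured case features into sentences of the form `name is value.`, as seen in the usage example text. The exact conversion code used for training is not included in this repository, so the following is only a minimal sketch of one way to build such input strings. The function name `case_to_text`, its handling of booleans, lists, and missing values, and the example feature dictionary are all assumptions made for illustration.

```python
from typing import Any, Dict


def case_to_text(case: Dict[str, Any]) -> str:
    """Render a feature dictionary as 'name is value.' sentences.

    Hypothetical helper: the real training pipeline may phrase
    individual features differently.
    """
    parts = []
    for name, value in case.items():
        if value is None:
            # Assumption: missing features are simply left out of the text.
            continue
        if isinstance(value, bool):
            # Booleans are written in lowercase, as in the usage example.
            value_text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued features are joined with commas.
            value_text = ", ".join(str(v) for v in value)
        else:
            value_text = str(value)
        parts.append(f"{name} is {value_text}.")
    return " ".join(parts)


# Illustrative feature values, not taken from the dataset.
example_case = {
    "country": "Germany",
    "company_industry": "Technology",
    "data_category_personal_data": True,
    "data_processing_basis_consent": True,
}

text = case_to_text(example_case)
print(text)
```

The resulting string can be passed to the tokenizer in place of the hand-written `text` in the Usage section. Feature names and phrasing should match the training data as closely as possible, since the model has only seen inputs in that format.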