---
license: cc-by-sa-4.0
datasets:
- JQ1984/GDPRcasedata
language:
- en
metrics:
- accuracy
base_model:
- JQ1984/legalbert_gdpr_pretrained
pipeline_tag: text-classification
tags:
- legal
---
# GDPR Violation Result Classifier
This repository contains a fine-tuned transformer model for predicting GDPR violation results based on case features.
## Model Overview
- **Base Model**: JQ1984/legalbert_gdpr_pretrained (a BERT model pre-trained on legal and GDPR-specific texts)
- **Task**: Binary classification to predict violation results (0: no violation, 1: violation)
- **Training Method**: 5-fold cross-validation with hyperparameter optimization
## Dataset
The model was trained on a custom dataset of real GDPR cases, covering both violations and non-violations. The dataset includes:
- 2,412 cases total (2,058 violations, 354 non-violations)
- Features include affected data volume, countries, industry sectors, data categories, data processing basis, GDPR clauses, and various violation indicators
- All categorical features were converted to text descriptions for the transformer model
- Dataset link: https://huggingface.co/datasets/JQ1984/GDPRcasedata
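
The feature-to-text conversion described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the helper `features_to_text` is hypothetical, and the uniform "name is value." template is an assumption modeled on the example input in the Usage section, which uses slightly different wording for some fields.

```python
def features_to_text(features: dict) -> str:
    """Render one case's features as a single 'name is value.' string.

    Hypothetical helper: the exact template used for training is not
    documented, so this only approximates the format.
    """
    parts = []
    for name, value in features.items():
        if value is None:
            continue  # skip missing fields instead of writing "is None"
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{name} is {value}.")
    return " ".join(parts)


# Illustrative case record; field names follow the Usage example.
case = {
    "country": "Germany",
    "company_industry": "Technology",
    "data_category_personal_data": True,
    "data_processing_basis_consent": True,
}
text = features_to_text(case)
print(text)
```

Serializing every feature with its name keeps the input readable for a text-only model, at the cost of longer sequences when a case has many fields.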
## Training Methodology
The training pipeline followed these steps:
1. **Text Conversion**: All numerical and categorical features were converted to text descriptions
2. **K-Fold Cross-Validation**: 5-fold cross-validation was used to obtain a more robust estimate of model performance
3. **Fine-tuning**: The LegalBERT base model was fine-tuned on the binary classification task
4. **Hyperparameters**:
   - Batch size: 16
   - Learning rate: 3e-5
   - Epochs: 3
   - Weight decay: 0.01
   - Optimizer: AdamW
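
As a rough illustration of the 5-fold split in step 2, the sketch below partitions the 2,412 case indices into five validation folds using only the standard library. Whether the original pipeline shuffled or stratified the data is not stated, so the shuffling, the seed, and the helper name `k_fold_indices` are assumptions. With classes as imbalanced as these, a stratified splitter such as scikit-learn's `StratifiedKFold` would be a common alternative.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation.

    Hypothetical helper; the seed value is arbitrary.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(2412, k=5))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: {len(train_idx)} train / {len(val_idx)} validation")
```

Each case lands in exactly one validation fold, so every example is used for evaluation once and for training four times.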
## Performance Metrics
The model achieved the following performance metrics across 5-fold cross-validation:
- **Average Accuracy**: 95.03%
- **Average F1 Score**: 89.33%
- **Average Precision**: 92.79%
- **Average Recall**: 86.60%
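
For readers who want to reproduce this kind of summary, the sketch below computes the four metrics for one fold and then averages them across folds. It assumes class 1 (violation) is treated as the positive class and that each reported figure is a plain mean of per-fold values; neither detail is confirmed by this card. The labels are toy data, not results from the dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one fold (positive class assumed to be 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def average_over_folds(fold_metrics):
    """Unweighted mean of each metric across folds."""
    return {
        key: sum(m[key] for m in fold_metrics) / len(fold_metrics)
        for key in fold_metrics[0]
    }


# Toy labels for two folds, for illustration only.
fold_a = binary_metrics([1, 1, 0, 1], [1, 0, 0, 1])
fold_b = binary_metrics([0, 1, 1, 0], [0, 1, 1, 1])
averaged = average_over_folds([fold_a, fold_b])
print(averaged)
```

Because F1 is averaged per fold, the averaged F1 generally differs from the F1 computed from the averaged precision and recall.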
## Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_path = "YOUR_USERNAME/gdpr-violation-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Example text (format similar to training data)
text = "GDPR clauses are Art. 5, Art. 6. Date is 2022-05-15. country is Germany. company_industry is Technology. data_category_personal_data is true. data_processing_basis_consent is true."

# Tokenize and predict; gradients are not needed for inference
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class
probabilities = outputs.logits.softmax(dim=-1)
predicted_class = outputs.logits.argmax(dim=-1).item()  # 0: no violation, 1: violation

print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities[0].tolist()}")
```
## Contact
For questions, feedback, or collaboration opportunities, please contact:
Jacques Qiu (邱耿航)
- Email: jonstark186@gmail.com
- GitHub: JacquotQ
- LinkedIn: https://www.linkedin.com/in/jacques-qiu-50477b266/