---
license: cc-by-sa-4.0
datasets:
- JQ1984/GDPRcasedata
language:
- en
metrics:
- accuracy
base_model:
- JQ1984/legalbert_gdpr_pretrained
pipeline_tag: text-classification
tags:
- legal
---

# GDPR Violation Result Classifier

This repository contains a fine-tuned transformer model that predicts whether a GDPR case resulted in a violation, based on case features rendered as text.

## Model Overview

- **Base Model**: JQ1984/legalbert_gdpr_pretrained (a BERT model pre-trained on legal and GDPR-specific texts)
- **Task**: Binary classification to predict violation results (0: no violation, 1: violation)
- **Training Method**: 5-fold cross-validation with hyperparameter optimization

## Dataset

The model was trained on a custom dataset of real GDPR cases. The dataset includes:

- 2,412 cases total (2,058 violations and 354 non-violations, roughly 85% and 15% of the data)
- Features include affected data volume, countries, industry sectors, data categories, data processing basis, GDPR clauses, and various violation indicators
- All categorical features were converted to text descriptions for the transformer model
- Dataset link: https://huggingface.co/datasets/JQ1984/GDPRcasedata
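
The text conversion can be illustrated with a minimal sketch. The `features_to_text` helper and its sentence template are assumptions modelled on the example input in the Usage section, not the exact preprocessing code used for training:

```python
def features_to_text(case: dict) -> str:
    """Render one case as a sequence of '<name> is <value>.' sentences.

    Hypothetical helper: the template mirrors the Usage example and may
    differ from the preprocessing actually used to build the training set.
    """
    parts = []
    for name, value in case.items():
        verb = "is"
        if isinstance(value, bool):
            # Booleans are written in lowercase, as in the Usage example
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued features are joined with commas
            value = ", ".join(str(v) for v in value)
            verb = "are"
        parts.append(f"{name} {verb} {value}.")
    return " ".join(parts)


case = {
    "GDPR clauses": ["Art. 5", "Art. 6"],
    "Date": "2022-05-15",
    "country": "Germany",
    "company_industry": "Technology",
    "data_category_personal_data": True,
    "data_processing_basis_consent": True,
}
text = features_to_text(case)
print(text)
```

Keeping the feature names inside the text lets the tokenizer see which field each value belongs to, at the cost of longer input sequences.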

## Training Methodology

The training pipeline followed these steps:

1. **Text Conversion**: All numerical and categorical features were converted to text descriptions
2. **K-Fold Cross-Validation**: 5-fold cross-validation was used to ensure robust model performance
3. **Fine-tuning**: The LegalBERT base model (JQ1984/legalbert_gdpr_pretrained) was fine-tuned on the binary classification task
4. **Hyperparameters**:
   - Batch size: 16
   - Learning rate: 3e-5
   - Epochs: 3
   - Weight decay: 0.01
   - Optimizer: AdamW
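
As a rough illustration of the cross-validation step, the sketch below splits the 2,412 labels into 5 folds in pure Python. Stratifying by class, the `stratified_kfold` helper, and the seed are assumptions; the card only states that 5-fold cross-validation was used:

```python
import math
import random
from collections import defaultdict

# Hyperparameters as listed above
HYPERPARAMS = {"batch_size": 16, "learning_rate": 3e-5, "epochs": 3, "weight_decay": 0.01}


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_indices, val_indices) pairs with similar class ratios per fold.

    Hypothetical helper: stratification and the seed are assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the shuffled indices of each class round-robin into k folds
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


# Class counts from the Dataset section: 2,058 violations (1), 354 non-violations (0)
labels = [1] * 2058 + [0] * 354
splits = list(stratified_kfold(labels))

for fold, (train, val) in enumerate(splits):
    # Optimizer steps for one fold at the listed batch size and epoch count
    steps = math.ceil(len(train) / HYPERPARAMS["batch_size"]) * HYPERPARAMS["epochs"]
    print(f"fold {fold}: train={len(train)} val={len(val)} steps={steps}")
```

With only 354 non-violation cases, stratifying keeps roughly 70 of them in every validation fold, which makes the per-fold metrics more comparable.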

## Performance Metrics

Averaged over the 5 cross-validation folds, the model achieved the following performance metrics:

- **Average Accuracy**: 95.03%
- **Average F1 Score**: 89.33%
- **Average Precision**: 92.79%
- **Average Recall**: 86.60%
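
For reference, the sketch below shows how such per-fold metrics can be computed from confusion-matrix counts and then averaged. The fold counts are hypothetical placeholders, not results from this model, and the card does not state which class was treated as positive or how the averaging was done. The last lines compute the accuracy of always predicting the majority class from the dataset counts, which gives context for the reported accuracy:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def average_over_folds(per_fold):
    """Unweighted mean of each metric across folds."""
    return {key: sum(m[key] for m in per_fold) / len(per_fold) for key in per_fold[0]}


# Hypothetical (tp, fp, fn, tn) counts for two folds, for illustration only
fold_counts = [(60, 5, 10, 408), (62, 4, 9, 408)]
per_fold = [binary_metrics(*counts) for counts in fold_counts]
averaged = average_over_folds(per_fold)
print({name: round(value, 4) for name, value in averaged.items()})

# Majority-class baseline from the dataset counts: always predict "violation"
majority_baseline = 2058 / 2412
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

Because the classes are imbalanced, precision, recall and F1 are more informative here than accuracy alone.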

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_path = "YOUR_USERNAME/gdpr-violation-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # make sure dropout is disabled for inference

# Example text (format similar to training data)
text = "GDPR clauses are Art. 5, Art. 6. Date is 2022-05-15. country is Germany. company_industry is Technology. data_category_personal_data is true. data_processing_basis_consent is true."

# Tokenize and predict without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
probabilities = outputs.logits.softmax(dim=-1)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Label mapping: 0 = no violation, 1 = violation
print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities[0].tolist()}")
```

## Contact

For questions, feedback, or collaboration opportunities, please contact:

- Jacques Qiu (邱耿航)
- Email: jonstark186@gmail.com
- GitHub: JacquotQ
- LinkedIn: https://www.linkedin.com/in/jacques-qiu-50477b266/