Commit dd024d8 (verified) by Estonel · 1 parent: c3e6d0e

Upload README.md with huggingface_hub

Files changed (1): README.md (+162, -0)
# DistilBERT French Multilingual Sequence Classification

This is a distilled version of BERT fine-tuned for multilingual sequence classification, with a focus on French text processing. The model demonstrates strong performance across French, Hindi, and English.

## Model Details

- **Model Type**: DistilBERT for Sequence Classification
- **Base Architecture**: 6-layer transformer with 12 attention heads
- **Hidden Size**: 768
- **Vocabulary Size**: 119,547 tokens
- **Max Sequence Length**: 512 tokens
- **Languages**: Multilingual (French, Hindi, English)

## Performance Metrics

The model achieved the following performance on evaluation datasets:

### Validation Set (Overall)

- **Accuracy**: 96.75%
- **F1 Score**: 96.78%
- **Precision**: 95.77%
- **Recall**: 97.82%

### Language-Specific Performance

| Language | Accuracy | F1 Score | Precision | Recall | Samples |
|----------|----------|----------|-----------|--------|---------|
| French   | 77.32%   | 80.88%   | 69.46%    | 96.81% | 4,267   |
| Hindi    | 80.17%   | 83.17%   | 72.14%    | 98.17% | 2,500   |
| English  | 97.22%   | n/a      | n/a       | n/a    | 3,233   |

### External Dataset Performance

- **TURNS2K Dataset**: 90.25% accuracy, 90.96% F1 score

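As a reference for how figures like those above are computed, here is a minimal sketch of accuracy, precision, recall, and F1, assuming a binary labeling and using toy data (the actual label set and evaluation data are not part of this repo):

```python
# Sketch: the standard binary-classification metrics reported above.
# Labels and predictions below are illustrative toy data.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example (not the real evaluation data)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

The high recall and lower precision on French and Hindi suggest the model over-predicts the positive class on those languages; the table above makes that trade-off visible per language.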
## Model Configuration

```json
{
  "model_type": "distilbert",
  "architectures": ["DistilBertForSequenceClassification"],
  "n_layers": 6,
  "n_heads": 12,
  "dim": 768,
  "hidden_dim": 3072,
  "max_position_embeddings": 512,
  "vocab_size": 119547,
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dropout": 0.1,
  "seq_classif_dropout": 0.2
}
```

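For orientation, the configuration above implies a rough weight count. The sketch below counts only the large matrices (biases, layer norms, and the classification head add a small fraction), so the figures are approximate; note how the multilingual vocabulary dominates:

```python
# Rough parameter count implied by the configuration above
# (large weight matrices only; biases and layer norms omitted).
vocab_size, dim, hidden_dim = 119_547, 768, 3072
n_layers, max_pos = 6, 512

embeddings = vocab_size * dim + max_pos * dim  # token + position embeddings
attention = 4 * dim * dim                      # Q, K, V, and output projections
ffn = 2 * dim * hidden_dim                     # two feed-forward matrices
per_layer = attention + ffn
total = embeddings + n_layers * per_layer

print(f"embeddings: {embeddings / 1e6:.1f}M")
print(f"per layer:  {per_layer / 1e6:.1f}M")
print(f"total:     ~{total / 1e6:.0f}M parameters")
```

This lands near 135M parameters, consistent with DistilBERT's multilingual base checkpoints, and explains why the embedding table is the main target for any further size reduction.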
## Files Included

- `config.json`: Model configuration
- `tokenizer_config.json`: Tokenizer configuration
- `tokenizer.json`: Fast tokenizer file
- `vocab.txt`: Vocabulary file
- `special_tokens_map.json`: Special tokens mapping
- `bert_model.onnx`: ONNX model for inference
- `bert_model_optimized.onnx`: Optimized ONNX model
- `bert_model_optimized_dynamic_int8.onnx`: INT8 quantized ONNX model
- `metrics.yaml`: Detailed performance metrics

## Usage

### With Transformers

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("your-username/distilled_bert_french_12")
tokenizer = DistilBertTokenizer.from_pretrained("your-username/distilled_bert_french_12")

# Example inference
text = "Votre texte français ici"  # "Your French text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(f"Predictions: {predictions}")
```

### With ONNX Runtime

```python
import onnxruntime as ort
from transformers import DistilBertTokenizer

# Load tokenizer and ONNX model
tokenizer = DistilBertTokenizer.from_pretrained("your-username/distilled_bert_french_12")
session = ort.InferenceSession("bert_model_optimized.onnx")

# Prepare input
text = "Votre texte français ici"
inputs = tokenizer(text, return_tensors="np", truncation=True, padding=True, max_length=512)

# Run inference
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"]
})

predictions = outputs[0]
print(f"Predictions: {predictions}")
```

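Unlike the Transformers example, the ONNX session returns raw logits rather than probabilities. A numerically stable softmax (pure Python here for illustration; NumPy works equally well) converts one row of logits to probabilities and a predicted label:

```python
import math

# The ONNX session returns raw logits; a numerically stable softmax
# turns them into class probabilities. The logits below are illustrative,
# standing in for one row of outputs[0].
def softmax(logits):
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.1, -0.8]  # e.g. outputs[0][0] for a single input
probs = softmax(logits)
label = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(probs, label)
```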
## Training Details

- **Training Steps**: 8,000
- **Epochs**: 2
- **Framework**: PyTorch/Transformers
- **Optimizer**: AdamW (inferred)
- **Learning Rate Schedule**: Cosine with warmup (inferred)

## Optimization

The model includes three ONNX variants for different deployment scenarios:

1. **Standard ONNX** (`bert_model.onnx`): Full precision model
2. **Optimized ONNX** (`bert_model_optimized.onnx`): Graph optimizations applied
3. **INT8 Quantized** (`bert_model_optimized_dynamic_int8.onnx`): Quantized for faster inference

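An INT8 variant like the third one can be produced with ONNX Runtime's dynamic quantization. The snippet below is a sketch using the file names from this repo, not necessarily the exact command used to build the published model:

```python
# Sketch: producing a dynamically quantized INT8 model with onnxruntime.
# File names match this repo; this may not be the exact command used.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bert_model_optimized.onnx",
    model_output="bert_model_optimized_dynamic_int8.onnx",
    weight_type=QuantType.QInt8,  # quantize weights to signed 8-bit
)
```

Dynamic quantization converts weights to INT8 offline and quantizes activations at runtime, so it needs no calibration dataset; accuracy should be spot-checked against the full-precision variant before deployment.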
## License

Please ensure you comply with the original BERT license and any dataset licenses used during training.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{distilled_bert_french_12,
  title={DistilBERT French Multilingual Sequence Classification},
  author={Your Name},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/your-username/distilled_bert_french_12}
}
```

## Contact

For questions or issues, please open an issue in the model repository or contact [your-email@example.com].