Commit 610efb0 (verified) by SirMappel · Parent: b341d95

Update README.md

Files changed (1): README.md (+238 −3)
---
license: mit
language:
- da
metrics:
- accuracy
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: zero-shot-classification
library_name: setfit
tags:
- Few-Shot
- Transformers
- Text-classification
- Computational_humanities
- SSH
- Social-work
---
# Model Card: Danish Reported Speech Few-Shot Classifier

<!-- Provide a quick summary of what the model is/does. -->
This fine-tuned few-shot model detects reported speech in Danish conversation transcripts.

## Model Details
- Base model: intfloat/multilingual-e5-large
- Language: Danish (da)
- Task: reported speech detection
- Training data: Danish jobcenter conversation transcripts

### Model Description

<!-- Provide a longer summary of what this model is. -->
This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark.
It performs binary classification of reported speech, identifying sentences in which a speaker references or quotes another person.

To support real-world usage, the model is integrated into a two-part processing pipeline that lets users analyze interview documents and highlight relevant sentences.

The pipeline performs the following tasks:

- 1️⃣ Input handling: accepts .docx files containing interview transcripts.
- 2️⃣ Sentence segmentation: splits the document into individual sentences.
- 3️⃣ Sentence classification: applies the trained model to classify sentences as reported speech or not.
- 4️⃣ HTML-based highlighting: adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output generation: produces a .docx file with highlighted sentences, preserving the original content.

Additionally, a GUI wrapper (built with Gooey) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently.
For a more in-depth look at the GUI, see the GitHub repository linked below.
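The steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `is_reported_speech` is a stand-in for the fine-tuned classifier, the segmentation is a naive regex split, and the real pipeline reads and writes .docx files rather than plain strings:

```python
import re

def segment(text):
    # naive sentence segmentation: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def is_reported_speech(sentence):
    # stand-in for the fine-tuned model; flags a common Danish reporting verb
    return "sagde" in sentence.lower()

def highlight(text):
    # wrap sentences classified as reported speech in <mark> tags
    marked = [
        f"<mark>{s}</mark>" if is_reported_speech(s) else s
        for s in segment(text)
    ]
    return " ".join(marked)

doc = "Han sagde at han kommer i morgen. Det regnede hele dagen."
print(highlight(doc))
```

In the real pipeline the `<mark>` tags are translated into .docx highlighting so the output document preserves the original content.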

- **Developed by:** CALDISS, AAU
- **Funded by:** Aalborg University
- **Model type:** few-shot text classifier
- **Language(s) (NLP):** Danish
- **License:** MIT
- **Finetuned from model:** intfloat/multilingual-e5-large

### Model Sources
- **Repository:** [Project repository](https://github.com/CALDISS-AAU/bp_SMI_CM)
- **Paper:** work in progress by the collaborating researcher.

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Detecting reported speech in transcripts and conversational text.
- Improving NLP pipelines for Danish-language text processing.
- Enhancing retrieval and classification in Danish conversational datasets.

Intended users include researchers and analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.

The following groups (among others) may find it useful:

Social scientists & political scientists:
- Analysing interview transcripts in social research.
- Identifying speech patterns in employment services, front-desk services or other institutional/governmental settings.

Linguists & NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- This model is not designed for live conversation analysis or chatbot-like interactions; it works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).

## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- The model is trained on Danish job center interviews, so performance may vary on other types of text.

- Binary classification is based on reported speech detection, but edge cases may exist.

- While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.

- The model assumes transcripts. Messy, informal, or highly unstructured text (e.g., speech-to-text outputs with errors) may reduce accuracy.


### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
```
from setfit import SetFitModel

# Placeholder repository name; replace with the actual model ID.
model_name = "your-huggingface-username/danish-rep-speech-e5"
model = SetFitModel.from_pretrained(model_name)

sentences = ["Han sagde: 'Jeg kommer i morgen.'"]

# Predict whether each sentence contains reported speech
preds = model.predict(sentences)
print(preds)
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data consists of 55 transcripts of conversations between a citizen and a social worker, collected from a Danish jobcenter. The data is therefore sensitive and not attached to this model card.
The data was evaluated to be balanced, with a 50/50 split between the two labels.

### Training Procedure

Pretraining & base model:

This model is fine-tuned on top of intfloat/multilingual-e5-large, a transformer-based model optimized for embedding-based retrieval. The base model was pretrained using contrastive learning on large-scale multilingual datasets, making it well suited for semantic similarity and classification tasks.

Fine-tuning details:

Training dataset:
The model was fine-tuned using labelled transcribed interviews from a Danish job center.
Due to the sensitive nature of the data, it is not publicly available.

Objective:
The model was trained for binary classification of reported speech.
Labels indicate whether a sentence contains reported speech (reported-speech, not reported-speech).

Training configuration:
- Few-shot learning approach with domain-specific samples.
- Batch size: 32
- Body learning rate: 1.0770502781075495e-06
- Solver: lbfgs
- Number of epochs: 6
- Max iterations: 279
- Evaluation metrics: accuracy & F1-score
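A sketch of how these settings could map onto the SetFit API (assuming setfit ≥ 1.0). The solver and iteration cap apply to the logistic-regression head; `train_dataset` is a placeholder for the non-public labelled transcripts, and the exact training script may differ:

```python
from setfit import SetFitModel, Trainer, TrainingArguments

# Logistic-regression head configured with the reported solver and iteration cap
model = SetFitModel.from_pretrained(
    "intfloat/multilingual-e5-large",
    head_params={"solver": "lbfgs", "max_iter": 279},
)

args = TrainingArguments(
    batch_size=32,
    num_epochs=6,
    body_learning_rate=1.0770502781075495e-06,
)

# train_dataset stands in for the (non-public) labelled transcripts
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```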

Technical implementation:

- Tokenization is performed using the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
- Fine-tuning was done using PyTorch and the Hugging Face Trainer API.
- The model is optimized for batch inference rather than real-time processing.

📌 For more details on the architecture, refer to the base model: multilingual-e5-large.
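The batch-inference point above can be illustrated with a simple chunking helper; the `model.predict` call in the comment is an assumption about usage, following the SetFit API:

```python
def batches(items, size):
    # yield fixed-size chunks so the model sees one batch at a time
    for i in range(0, len(items), size):
        yield items[i:i + size]

# usage sketch (model not loaded here):
# predictions = [p for chunk in batches(sentences, 32) for p in model.predict(chunk)]
print(list(batches(["a", "b", "c", "d", "e"], 2)))
```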


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

The model was evaluated using standard classification metrics:

- Accuracy: measures the overall correctness of predictions.
- F1-score: balances precision and recall, ensuring that both false positives and false negatives are considered.
- Precision: measures how many of the predicted reported-speech sentences are actually correct.

Results:

Not reported speech:
- Precision: 0.959
- Recall: 0.924
- F1-score: 0.941

Reported speech:
- Precision: 0.927
- Recall: 0.961
- F1-score: 0.943

Accuracy: 0.942
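The reported F1 scores are consistent, to within rounding of the inputs, with the harmonic mean of the listed precision and recall values:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.959, 0.924), 3))  # not-reported-speech class
print(round(f1(0.927, 0.961), 3))  # reported-speech class
```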

## Hardware used

- **Hardware type:** 48-core AMD EPYC 9454, 192 GB memory, 1× NVIDIA H100
- **Hours used:** 50
- **Cloud provider:** UCloud (SDU)
- **Compute region:** cloud services based at the University of Southern Denmark, Aarhus University and Aalborg University

### Compute Infrastructure

UCloud infrastructure available at the Danish universities.


**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]


## Model Card Authors [optional]

MKAP @ CALDISS, AAU