---
license: mit
language:
- da
metrics:
- accuracy
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: text-classification
library_name: setfit
tags:
- Few-Shot
- Transformers
- Text-classification
- sentence-transformers
- setfit
- generated_from_setfit_trainer
- Computational_humanities
- SSH
- Social-work
---
# Model Card for Danish Reported Speech Classifier

<!-- Provide a quick summary of what the model is/does. -->
This fine-tuned few-shot (SetFit) model classifies Danish sentences according to whether they contain reported speech.


## Model Details
- **Base model:** intfloat/multilingual-e5-large
- **Language:** Danish (da)
- **Task:** Reported speech detection
- **Training data:** Danish jobcenter conversation transcripts

### Model Description

- **Model Type:** SetFit
- **Sentence Transformer body:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 512 tokens
- **Number of Classes:** 2 classes
- **Language:** Danish
- **License:** MIT License

This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark. 
It is designed for binary classification of reported speech, identifying sentences where a speaker references or quotes another person.

To support real-world usage, the model is integrated into a document processing pipeline that lets users analyze interview documents and highlight relevant sentences. The pipeline performs the following tasks:

- 1️⃣ Input Handling: Accepts .docx files containing interview transcripts.
- 2️⃣ Sentence Segmentation: Splits the document into individual sentences.
- 3️⃣ Sentence Classification: Applies the trained model to classify sentences based on reported speech criteria.
- 4️⃣ HTML-Based Highlighting: Adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output Generation: Produces a .docx file with highlighted sentences, preserving the original content.
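The steps above can be sketched in pure Python with a stub classifier standing in for the SetFit model. The `segment` and `highlight` helpers below are illustrative, not the project's actual implementation:

```python
import re

def segment(text):
    # Naive sentence segmentation: split on ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def highlight(sentences, predict):
    # Wrap sentences classified as reported speech in a <mark> tag,
    # mirroring the HTML-based highlighting step above
    return " ".join(f"<mark>{s}</mark>" if predict(s) else s
                    for s in sentences)

def stub_predict(sentence):
    # Stand-in for the fine-tuned model's predict method; the real
    # pipeline calls the binary classifier here
    return "sagde" in sentence  # "sagde" = "said"

text = "Hun sagde, at hun kom i morgen. Vi afsluttede samtalen."
print(highlight(segment(text), stub_predict))
```

In the real pipeline, reading and writing the .docx files would be handled by a library such as python-docx, and `stub_predict` replaced by the model's prediction method.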

Additionally, a GUI wrapper (built with Gooey) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently.
For a more in-depth look at the GUI, see the GitHub repository linked below.


- **Developed by:** CALDISS, AAU
- **Funded by:** Aalborg University
- **Model type:** Few-shot text classifier
- **Language(s) (NLP):** Danish
- **License:** MIT
- **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)

### Model Sources
- **Repository:** [Project repository](https://github.com/CALDISS-AAU/bp_SMI_CM)
- **Paper:** Work in progress by the collaborating researcher.

## Uses

The model is trained and evaluated on text snippets of "reported speech" in Danish interviews between citizens and job counselors. It is intended to identify reported speech in similar text documents of that genre, and is assumed unsuitable for general-purpose classification of reported speech.

Intended users include researchers and analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.

The following groups (among others) may find it useful:

Social scientists & political scientists:
- Analysing interview transcripts for social research.
- Identifying speech patterns in employment, front-desk services, or other institutional/governmental settings.

Linguists & NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- This model is not designed for live conversation analysis or chatbot-like interactions. It works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).

## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- The model is trained on Danish job center interviews, so performance may vary on other types of texts.

- Binary classification is based on reported speech detection, but edge cases may exist.

- While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.

- The model assumes transcripts. Messy or highly unstructured text (e.g., error-prone speech-to-text output) may reduce accuracy.


### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model. Since this is a SetFit model, it should be loaded with the `setfit` library rather than `AutoModel` (the repository id below is a placeholder; substitute the model's actual Hub id):
```
from setfit import SetFitModel

model = SetFitModel.from_pretrained("your-huggingface-username/danish-rep-speech-e5")

# Predict whether each sentence contains reported speech
sentences = ["Han sagde: 'Jeg kommer i morgen.'"]
predictions = model.predict(sentences)
print(predictions)
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data consists of 55 transcripts of conversations between citizens and social workers, collected from a Danish jobcenter. The data is sensitive and therefore not attached to this model card.
The data was evaluated to be balanced, with a 50/50 split between the two tags.

### Training Procedure

**Pretraining & base model:**

This model is fine-tuned on top of intfloat/multilingual-e5-large, a transformer-based model optimized for embedding-based retrieval. The base model was pretrained with contrastive learning on large-scale multilingual datasets, making it well-suited for semantic similarity and classification tasks.

**Fine-tuning details:**

- **Training dataset:** labelled transcribed interviews from a Danish job center. Due to the sensitive nature of the data, they are not publicly available.
- **Objective:** binary classification of reported speech. Labels indicate whether a sentence contains reported speech (`reported-speech`, `not reported-speech`).
- **Training configuration:**
  - Few-shot learning approach with domain-specific samples.
  - Batch size: 32
  - Body learning rate: 1.0770502781075495e-06
  - Solver (classification head): lbfgs
  - Number of epochs: 6
  - Max iterations: 279
  - Evaluation metrics: accuracy & F1-score

**Technical implementation:**

- Tokenization is performed with the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
- Fine-tuning was done using PyTorch and the Hugging Face Trainer API.
- The model is optimized for batch inference rather than real-time processing.

📌 For more details on the architecture, refer to the base model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large).
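Because the model is tuned for batch inference, callers would typically chunk sentences before prediction. A minimal chunking helper (illustrative, not taken from the project code):

```python
def batched(items, batch_size=32):
    # Yield fixed-size chunks; 32 matches the training batch size above
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sentences = [f"sentence {i}" for i in range(70)]
print([len(chunk) for chunk in batched(sentences)])
# [32, 32, 6]
```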


#### Training Hyperparameters
- batch_size: (32, 32)
- num_epochs: (6, 6)
- max_steps: -1
- sampling_strategy: oversampling
- body_learning_rate: (1.0770502781075495e-06, 1.0770502781075495e-06)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: True
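Assuming setfit >= 1.0, the hyperparameters above would map onto a `TrainingArguments` object roughly as follows. This is a sketch of the configuration, not the authors' actual training script:

```python
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import TrainingArguments

args = TrainingArguments(
    batch_size=(32, 32),          # (embedding phase, classifier phase)
    num_epochs=(6, 6),
    body_learning_rate=(1.0770502781075495e-06, 1.0770502781075495e-06),
    head_learning_rate=0.01,
    loss=CosineSimilarityLoss,
    sampling_strategy="oversampling",
    warmup_proportion=0.1,
    seed=42,
    use_amp=False,
    end_to_end=False,
    load_best_model_at_end=True,
)
```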

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics
Held-out evaluation metrics:

| Metric | Value |
|---|---|
| Accuracy | 0.9724770642201835 |
| Precision | 0.9557522123893806 |
| Recall | 0.9908256880733946 |
| F1 | 0.972972972972973 |
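As a quick consistency check, the reported F1 is exactly the harmonic mean of the precision and recall above:

```python
precision = 0.9557522123893806
recall = 0.9908256880733946

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
# 0.972973
```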

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

The model was evaluated using standard classification metrics:

- **Accuracy:** measures the overall correctness of predictions.
- **F1-score:** balances precision and recall, so both false positives and false negatives are considered.
- **Precision:** measures how many of the predicted reported-speech sentences are actually correct.

Per-class results:

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Not reported speech | 0.959 | 0.924 | 0.941 |
| Reported speech | 0.927 | 0.961 | 0.943 |

Overall accuracy: 0.942

## Hardware used


- **Hardware Type:** 48 CPU cores (AMD EPYC 9454), 192 GB memory, 1 NVIDIA H100 GPU
- **Hours used:** 50
- **Cloud Provider:** Ucloud SDU
- **Compute Region:** Cloud services based at the University of Southern Denmark, Aarhus University, and Aalborg University

### Compute Infrastructure

UCloud infrastructure available at the Danish universities.

### Framework Versions
- Python: 3.12.3
- SetFit: 1.0.3
- Sentence Transformers: 3.0.1
- Transformers: 4.39.0
- PyTorch: 2.4.1+cu121
- Datasets: 2.21.0
- Tokenizers: 0.15.2

**BibTeX:**

```
@article{tunstall2022efficient,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
```


## Model Card Authors
- Matias Kokholm Appel - mkap@adm.aau.dk
- Kristian Gade Kjelmann - kgk@adm.aau.dk
- Nana Ohmeyer

## Model Card Contact

caldiss@adm.aau.dk

https://www.en.caldiss.aau.dk/