---
license: mit
language:
- da
metrics:
- accuracy
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: text-classification
library_name: setfit
tags:
- Few-Shot
- Transformers
- Text-classification
- sentence-transformers
- setfit
- generated_from_setfit_trainer
- Computational_humanities
- SSH
- Social-work
---
# Model Card for Danish Reported Speech Classifier
<!-- Provide a quick summary of what the model is/does. -->
This fine-tuned few-shot (SetFit) model classifies Danish sentences according to whether they contain reported speech.
## Model Details
- **Base model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- **Language:** Danish (da)
- **Task:** reported speech detection
- **Training data:** Danish jobcenter conversation transcripts
### Model Description
- **Model Type:** SetFit
- **Sentence Transformer body:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 512 tokens
- **Number of Classes:** 2 classes
- **Language:** Danish
- **License:** MIT License
This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark.
It is designed for binary classification of reported speech, identifying sentences where a speaker references or quotes another person.
To support real-world usage, the model is integrated into a two-part document processing pipeline that lets users analyze interview documents and highlight relevant sentences. The pipeline performs the following tasks (sketched in code after the list):
- 1️⃣ Input Handling: Accepts .docx files containing interview transcripts.
- 2️⃣ Sentence Segmentation: Splits the document into individual sentences.
- 3️⃣ Sentence Classification: Applies the trained model to classify sentences based on reported speech criteria.
- 4️⃣ HTML-Based Highlighting: Adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output Generation: Produces a .docx file with highlighted sentences, preserving the original content.
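A minimal sketch of such a pipeline, assuming `python-docx` for document handling and a naive regex sentence splitter. The repository ID, the splitting rule, and docx run highlighting (in place of the HTML markers the actual project uses) are illustrative assumptions, not the project's implementation:
```python
import re

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
from setfit import SetFitModel

# Placeholder repository ID, as used elsewhere in this card
model = SetFitModel.from_pretrained("your-huggingface-username/danish-rep-speech-e5")

def highlight_reported_speech(in_path: str, out_path: str) -> None:
    doc = Document(in_path)                                # 1) input handling
    for para in doc.paragraphs:
        if not para.text.strip():
            continue
        sentences = re.split(r"(?<=[.!?])\s+", para.text)  # 2) naive segmentation
        labels = model.predict(sentences)                  # 3) classification
        para.clear()                                       # 4) rebuild with markers
        for sentence, label in zip(sentences, labels):
            run = para.add_run(sentence + " ")
            if label == "reported-speech":                 # label name from this card
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
    doc.save(out_path)                                     # 5) output generation
```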
Additionally, a GUI-based wrapper (built with Gooey, sketched below) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently. For more details on the GUI, see the GitHub repository linked further down.
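A hedged sketch of such a Gooey wrapper; the program name and argument names are illustrative, not the project's actual interface:
```python
from gooey import Gooey, GooeyParser

@Gooey(program_name="Reported Speech Highlighter")
def main():
    parser = GooeyParser(description="Highlight reported speech in .docx transcripts")
    parser.add_argument("input_file", widget="FileChooser", help="Interview transcript (.docx)")
    parser.add_argument("output_file", widget="FileSaver", help="Where to save the highlighted copy")
    args = parser.parse_args()
    # Pipeline function from the sketch above
    highlight_reported_speech(args.input_file, args.output_file)

if __name__ == "__main__":
    main()
```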
- **Developed by:** CALDISS, AAU
- **Funded by:** Aalborg University
- **Model type:** Few-shot text classifier
- **Language(s) (NLP):** Danish
- **License:** MIT
- **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
### Model Sources
- **Repository:** [Project repository](https://github.com/CALDISS-AAU/bp_SMI_CM)
- **Paper:** Work in progress by the collaborating researcher.
## Uses
The model is trained and evaluated on text snippets of "reported speech" in Danish interviews between citizens and job counselors. It is intended to identify "reported speech" in similar text documents of that genre. It is assumed unsuitable for general classification of "reported speech".
Intended users include researchers or analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.
The following groups (among others) may find it useful:
Social scientists and political scientists:
- Analysing interview transcripts for social research.
- Identifying speech patterns in employment, front-desk services, or other institutional/governmental settings.
Linguists and NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- This model is not designed for live conversation analysis or chatbot-like interactions. It works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- The model is trained on Danish job center interviews, so performance may vary on other types of texts.
- Binary classification is based on reported speech detection, but edge cases may exist.
- While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.
- The model assumes conversational transcripts. Messy, formal, or highly unstructured text (e.g., speech-to-text output with errors) may reduce accuracy.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
## How to Get Started with the Model
Since this is a SetFit model, load it with the `setfit` library rather than plain `transformers` (the repository ID below is this card's placeholder):
```python
from setfit import SetFitModel

# Load the fine-tuned classifier (placeholder repository ID)
model = SetFitModel.from_pretrained("your-huggingface-username/danish-rep-speech-e5")

# Predict one label per sentence: "reported-speech" or "not reported-speech"
sentences = ["Han sagde: 'Jeg kommer i morgen.'"]
predictions = model.predict(sentences)
print(predictions)
```
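If class probabilities are needed instead of hard labels, the SetFit API also exposes `predict_proba`:
```python
# Per-class probabilities from the logistic-regression head
probabilities = model.predict_proba(sentences)
print(probabilities)
```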
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The training data consists of 55 transcripts of conversations between a citizen and a social worker, collected from a Danish jobcenter. The data is therefore sensitive and is not attached to this model card.
The data was balanced, containing a 50/50 split between the two labels.
### Training Procedure
#### Pretraining & Base Model
This model is fine-tuned on top of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large), a transformer-based model optimized for embedding-based retrieval. The base model was pretrained with contrastive learning on large-scale multilingual datasets, making it well-suited for semantic similarity and classification tasks.
#### Fine-Tuning Details
**Training dataset:**
- The model was fine-tuned using labelled transcribed interviews from a Danish job center.
- Due to the sensitive nature of the data, it is not publicly available.

**Objective:**
- Binary classification of reported speech.
- Labels indicate whether a sentence contains reported speech (`reported-speech`, `not reported-speech`).

**Training configuration:**
- Few-shot learning approach with domain-specific samples.
- Batch size: 32
- Body learning rate: 1.0770502781075495e-06
- Head solver: lbfgs
- Number of epochs: 6
- Max iterations (head): 279
- Evaluation metrics: accuracy & F1-score

**Technical implementation:**
- Tokenization is performed with the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
- Fine-tuning was done using PyTorch and the SetFit trainer.
- The model is optimized for batch inference rather than real-time processing.

📌 For more details on the architecture, refer to the base model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large).
#### Training Hyperparameters
- batch_size: (32, 32)
- num_epochs: (6, 6)
- max_steps: -1
- sampling_strategy: oversampling
- body_learning_rate: (1.0770502781075495e-06, 1.0770502781075495e-06)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: True
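As an illustration, the hyperparameters above map onto the `setfit` 1.x training API roughly as follows. This is a sketch only: the dataset is a dummy placeholder (the real labelled transcripts are not public), and evaluation/best-model handling is omitted for brevity.
```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Dummy placeholder data; the real labelled jobcenter transcripts are not public.
train_ds = Dataset.from_dict({
    "text": ["Han sagde: 'Jeg kommer i morgen.'", "Det regner i dag."],
    "label": ["reported-speech", "not reported-speech"],
})

# LogisticRegression head configured with the solver and max iterations listed above
model = SetFitModel.from_pretrained(
    "intfloat/multilingual-e5-large",
    head_params={"solver": "lbfgs", "max_iter": 279},
)

# CosineSimilarityLoss is the SetFit default, matching the `loss` entry above
args = TrainingArguments(
    batch_size=(32, 32),
    num_epochs=(6, 6),
    body_learning_rate=(1.0770502781075495e-06, 1.0770502781075495e-06),
    head_learning_rate=0.01,
    sampling_strategy="oversampling",
    warmup_proportion=0.1,
    seed=42,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```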
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
Overall performance on the held-out test data:

| Metric | Value |
|---|---|
| Accuracy | 0.9724770642201835 |
| Precision | 0.9557522123893806 |
| Recall | 0.9908256880733946 |
| F1 | 0.972972972972973 |
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
The model was evaluated using standard classification metrics to measure its performance.
Evaluation metrics:
- **Accuracy:** measures the overall correctness of predictions.
- **F1-score:** balances precision and recall, ensuring that both false positives and false negatives are considered.
- **Precision:** measures how many of the predicted reported-speech sentences are actually correct.

Per-class results:

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Not reported speech | 0.959 | 0.924 | 0.941 |
| Reported speech | 0.927 | 0.961 | 0.943 |

Overall accuracy: 0.942
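For reference, per-class tables like the one above can be produced with scikit-learn's `classification_report`; the labels below are dummy placeholders, since the held-out test split is not public:
```python
from sklearn.metrics import classification_report

# Dummy gold labels and predictions, purely to illustrate the computation
y_true = ["reported-speech", "not reported-speech", "reported-speech", "not reported-speech"]
y_pred = ["reported-speech", "not reported-speech", "not reported-speech", "not reported-speech"]

print(classification_report(y_true, y_pred))
```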
## Hardware used
- **Hardware Type:** 48 CPU cores (AMD EPYC 9454), 192 GB memory, 1 NVIDIA H100
- **Hours used:** 50
- **Cloud Provider:** UCloud (SDU)
- **Compute Region:** Cloud services based at the University of Southern Denmark, Aarhus University, and Aalborg University
### Compute Infrastructure
UCloud infrastructure available at the Danish universities.
### Framework Versions
- Python: 3.12.3
- SetFit: 1.0.3
- Sentence Transformers: 3.0.1
- Transformers: 4.39.0
- PyTorch: 2.4.1+cu121
- Datasets: 2.21.0
- Tokenizers: 0.15.2
**BibTeX:**
```bibtex
@article{tunstall2022setfit,
  doi       = {10.48550/ARXIV.2209.11055},
  url       = {https://arxiv.org/abs/2209.11055},
  author    = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords  = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
## Model Card Authors
- Matias Kokholm Appel - mkap@adm.aau.dk
- Kristian Gade Kjelmann - kgk@adm.aau.dk
- Nana Ohmeyer
## Model Card Contact
caldiss@adm.aau.dk
https://www.en.caldiss.aau.dk/ |