Commit 610efb0 (verified) by SirMappel · Parent: b341d95

Update README.md

Files changed (1): README.md (+238 −3)
---
license: mit
language:
- da
metrics:
- accuracy
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: zero-shot-classification
library_name: setfit
tags:
- Few-Shot
- Transformers
- Text-classification
- Computational_humanities
- SSH
- Social-work
---
# Model Card: Danish Reported Speech Few-Shot Classifier

<!-- Provide a quick summary of what the model is/does. -->
This fine-tuned few-shot model detects reported speech in Danish conversation transcripts.

## Model Details
- Base model: intfloat/multilingual-e5-large
- Language: Danish (da)
- Task: reported speech detection
- Training data: Danish jobcenter conversation transcripts

### Model Description

<!-- Provide a longer summary of what this model is. -->
This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark.
It performs binary classification of reported speech, identifying sentences in which a speaker references or quotes another person.

To support real-world usage, the model is integrated into a two-part processing pipeline that lets users analyze interview documents and highlight relevant sentences.

The pipeline performs the following tasks:

- 1️⃣ Input handling: accepts .docx files containing interview transcripts.
- 2️⃣ Sentence segmentation: splits the document into individual sentences.
- 3️⃣ Sentence classification: applies the trained model to classify sentences as reported speech or not.
- 4️⃣ HTML-based highlighting: adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output generation: produces a .docx file with highlighted sentences, preserving the original content.

Additionally, a GUI wrapper (built with Gooey) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently.
For a more in-depth look at the GUI, see the GitHub repository linked below.
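The steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `is_reported_speech` is a stand-in for the fine-tuned classifier, the segmentation is a naive regex split, and the real pipeline reads and writes .docx files rather than plain strings:

```python
import re

def segment(text):
    # naive sentence segmentation: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def is_reported_speech(sentence):
    # stand-in for the fine-tuned model; flags a common Danish reporting verb
    return "sagde" in sentence.lower()

def highlight(text):
    # wrap sentences classified as reported speech in <mark> tags
    marked = [
        f"<mark>{s}</mark>" if is_reported_speech(s) else s
        for s in segment(text)
    ]
    return " ".join(marked)

doc = "Han sagde at han kommer i morgen. Det regnede hele dagen."
print(highlight(doc))
```

In the real pipeline the `<mark>` tags are translated into .docx highlighting so the output document preserves the original content.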

- **Developed by:** CALDISS, AAU
- **Funded by:** Aalborg University
- **Model type:** few-shot text classifier
- **Language(s) (NLP):** Danish
- **License:** MIT
- **Finetuned from model:** intfloat/multilingual-e5-large

### Model Sources
- **Repository:** [Project repository](https://github.com/CALDISS-AAU/bp_SMI_CM)
- **Paper:** work in progress by the collaborating researcher.

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Detecting reported speech in transcripts and conversational text.
- Improving NLP pipelines for Danish-language text processing.
- Enhancing retrieval and classification in Danish conversational datasets.

Intended users include researchers and analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.

The following groups (among others) may find it useful:

Social scientists & political scientists:
- Analysing interview transcripts in social research.
- Identifying speech patterns in employment services, front-desk services or other institutional/governmental settings.

Linguists & NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- This model is not designed for live conversation analysis or chatbot-like interactions; it works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).

## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- The model is trained on Danish job center interviews, so performance may vary on other types of text.

- Binary classification is based on reported speech detection, but edge cases may exist.

- While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.

- The model assumes transcripts. Messy, informal, or highly unstructured text (e.g., speech-to-text outputs with errors) may reduce accuracy.


### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
```
from setfit import SetFitModel

# Placeholder repository name; replace with the actual model ID.
model_name = "your-huggingface-username/danish-rep-speech-e5"
model = SetFitModel.from_pretrained(model_name)

sentences = ["Han sagde: 'Jeg kommer i morgen.'"]

# Predict whether each sentence contains reported speech
preds = model.predict(sentences)
print(preds)
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data consists of 55 transcripts of conversations between a citizen and a social worker, collected from a Danish jobcenter. The data is therefore sensitive and not attached to this model card.
The data was evaluated to be balanced, with a 50/50 split between the two labels.

### Training Procedure

Pretraining & base model:

This model is fine-tuned on top of intfloat/multilingual-e5-large, a transformer-based model optimized for embedding-based retrieval. The base model was pretrained using contrastive learning on large-scale multilingual datasets, making it well suited for semantic similarity and classification tasks.

Fine-tuning details:

Training dataset:
The model was fine-tuned using labelled transcribed interviews from a Danish job center.
Due to the sensitive nature of the data, it is not publicly available.

Objective:
The model was trained for binary classification of reported speech.
Labels indicate whether a sentence contains reported speech (reported-speech, not reported-speech).

Training configuration:
- Few-shot learning approach with domain-specific samples.
- Batch size: 32
- Body learning rate: 1.0770502781075495e-06
- Solver: lbfgs
- Number of epochs: 6
- Max iterations: 279
- Evaluation metrics: accuracy & F1-score
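A sketch of how these settings could map onto the SetFit API (assuming setfit ≥ 1.0). The solver and iteration cap apply to the logistic-regression head; `train_dataset` is a placeholder for the non-public labelled transcripts, and the exact training script may differ:

```python
from setfit import SetFitModel, Trainer, TrainingArguments

# Logistic-regression head configured with the reported solver and iteration cap
model = SetFitModel.from_pretrained(
    "intfloat/multilingual-e5-large",
    head_params={"solver": "lbfgs", "max_iter": 279},
)

args = TrainingArguments(
    batch_size=32,
    num_epochs=6,
    body_learning_rate=1.0770502781075495e-06,
)

# train_dataset stands in for the (non-public) labelled transcripts
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```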

Technical implementation:

- Tokenization is performed using the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
- Fine-tuning was done using PyTorch and the Hugging Face Trainer API.
- The model is optimized for batch inference rather than real-time processing.

📌 For more details on the architecture, refer to the base model: multilingual-e5-large.
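The batch-inference point above can be illustrated with a simple chunking helper; the `model.predict` call in the comment is an assumption about usage, following the SetFit API:

```python
def batches(items, size):
    # yield fixed-size chunks so the model sees one batch at a time
    for i in range(0, len(items), size):
        yield items[i:i + size]

# usage sketch (model not loaded here):
# predictions = [p for chunk in batches(sentences, 32) for p in model.predict(chunk)]
print(list(batches(["a", "b", "c", "d", "e"], 2)))
```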


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

The model was evaluated using standard classification metrics:

- Accuracy: measures the overall correctness of predictions.
- F1-score: balances precision and recall, ensuring that both false positives and false negatives are considered.
- Precision: measures how many of the predicted reported-speech sentences are actually correct.

Results:

Not reported speech:
- Precision: 0.959
- Recall: 0.924
- F1-score: 0.941

Reported speech:
- Precision: 0.927
- Recall: 0.961
- F1-score: 0.943

Accuracy: 0.942
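The reported F1 scores are consistent, to within rounding of the inputs, with the harmonic mean of the listed precision and recall values:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.959, 0.924), 3))  # not-reported-speech class
print(round(f1(0.927, 0.961), 3))  # reported-speech class
```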

## Hardware used

- **Hardware type:** 48-core AMD EPYC 9454, 192 GB memory, 1× NVIDIA H100
- **Hours used:** 50
- **Cloud provider:** UCloud (SDU)
- **Compute region:** cloud services based at the University of Southern Denmark, Aarhus University and Aalborg University

### Compute Infrastructure

UCloud infrastructure available at the Danish universities.


**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]


## Model Card Authors [optional]

MKAP @ CALDISS, AAU