Model Card: TowerInstruct-7B-v0.2_236k

This model is a domain-adapted version of Unbabel/TowerInstruct-7B-v0.2, fine-tuned on 236k English–French sentence pairs from the bioinformatics and biomedical domains.
It is designed for English → French Machine Translation.

✏️ Model Details

Model Description

  • Developed by: Jurgi Giraud
  • Model type: Multilingual Large Language Model (LLM)
  • Language(s) (NLP): English to French
  • License: CC-BY-NC-4.0
  • Finetuned from model: Unbabel/TowerInstruct-7B-v0.2

This model was fine-tuned as part of a PhD research project investigating domain adaptation for Machine Translation (MT) in low-resource scenarios within the bioinformatics domain (English → French). The project explores the performance of compact MT models and Large Language Models (LLMs), including architectures under 1B parameters as well as models in the 3B–8B range, with a strong emphasis on resource-efficient fine-tuning strategies. The fine-tuning process made use of Parameter-Efficient Fine-Tuning (PEFT) and quantization, in particular QLoRA (Quantized Low-Rank Adaptation) for larger models (Dettmers et al., 2023).

In total, 5 models were fine-tuned on in-domain data: t5_236k | nllb-200-distilled-600M_236K | madlad400-3b-mt_236k | TowerInstruct-7B-v0.2_236k (👈 current model) | and Llama-3.1-8B-Instruct_236K

🚀 Usage

This model is intended for English → French Machine Translation in the bioinformatics domain.

Example (GPU)

Below is a basic GPU usage example with Hugging Face's Transformers library.

First, install dependencies:

```shell
pip install torch transformers accelerate
```

Then run:

```python
import torch
from transformers import pipeline

# Load the fine-tuned model on the available GPU(s)
pipe = pipeline(
    "text-generation",
    model="jurgiraud/TowerInstruct-7B-v0.2_236k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": (
            "Translate from English to French in the bioinformatics domain. "
            "Provide only the translation:\n"
            "The deletion of a gene may result in death or in a block of cell division."
        ),
    },
]

# Build the chat-formatted prompt and generate deterministically
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
print(outputs[0]["generated_text"])
```

🔧 Fine-tuning Details

Fine-tuning Data

The model was fine-tuned on a set of 236k English-French parallel examples consisting of:

  • Natural parallel data (bioinformatics and biomedical data)
  • Synthetic data, including:
    • Back-translation of in-domain monolingual texts
    • Paraphrased data
    • Terminology-constrained synthetic generation

Fine-tuning dataset available 👉 here.

Fine-tuning Procedure

The model was fine-tuned using QLoRA (Quantized Low-Rank Adaptation).

Fine-tuning was performed with the TRL library's SFTTrainer.

Template

```
<|im_start|>user
{USER_PROMPT}<|im_end|>
<|im_start|>assistant
{MODEL_RESPONSE}<|im_end|>
```
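For illustration, the template above can be reproduced with a small hand-rolled helper (in practice the tokenizer's apply_chat_template method, shown in the usage example, builds this string for you; format_prompt here is a hypothetical name):

```python
def format_prompt(user_prompt: str) -> str:
    """Wrap a user prompt in the ChatML-style template above,
    leaving the assistant turn open for generation."""
    return (
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = format_prompt("Translate from English to French. Provide only the translation:\nGene expression")
```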

Fine-tuning Hyperparameters

Key hyperparameters and training setup:

  • Approach: QLoRA (4-bit quantization + LoRA adapters)
  • LoRA config: r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["q_proj", "v_proj", "o_proj"]
  • Training: 4 epochs, learning rate = 1e-4, batch size = 16 (per device), gradient accumulation = 2
  • Precision: bfloat16 (bf16)
  • Optimizer: paged_adamw_8bit
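The setup listed above can be sketched as the following configuration (a minimal, illustrative sketch assuming the transformers, peft, trl, and bitsandbytes libraries; the dataset variable, output directory, and NF4 quantization type are assumptions, not confirmed details of the original run):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# 4-bit quantized base weights with bf16 compute (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # assumed; NF4 is the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Unbabel/TowerInstruct-7B-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters as listed above
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Training arguments as listed above
training_args = SFTConfig(
    num_train_epochs=4,
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    bf16=True,
    optim="paged_adamw_8bit",
    output_dir="towerinstruct-7b-v0.2-236k",   # placeholder
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,               # the 236k EN-FR examples (not shown)
    peft_config=peft_config,
)
trainer.train()
```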

📊 Evaluation

The model was evaluated on an in-domain bioinformatics test set using standard MT metrics.

Testing Data & Metrics

Testing Data

Test set available 👉 here.

Metrics

  • BLEU
  • chrF++ (chrF2)
  • TER
  • COMET
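In practice these scores are computed with standard tooling such as sacreBLEU and the COMET toolkit; purely as an illustration of what the first metric measures, here is a minimal standard-library sketch of corpus-level BLEU (modified n-gram precision with a brevity penalty, Papineni et al., 2002), without the smoothing real tools apply:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU over whitespace-tokenized sentence pairs (0-100)."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(totals) == 0 or min(clipped) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

A perfect hypothesis scores 100; a hypothesis sharing no 4-grams with its reference scores 0 under this unsmoothed variant.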

Results

Results from automated metrics. Baseline vs domain-adapted model. Best scores in bold.

| Model | BLEU ↑ | chrF2 ↑ | TER ↓ | COMET ↑ |
|---|---|---|---|---|
| TowerInstruct-7B-v0.2 (baseline) | 42.65 | 70.39 | 48.46 | 85.73 |
| TowerInstruct-7B-v0.2_236K (domain-adapted) | **46.18** | **72.83** | **45.32** | **86.29** |

🌱 Environmental Impact

The fine-tuning carbon footprint was estimated using the Green Algorithms framework (Lannelongue et al., 2021).

  • Carbon emissions: 3.47 kgCO₂e
  • Energy consumption: 15.00 kWh
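For context, dividing the two reported figures implies an average grid carbon intensity of about 0.231 kgCO₂e/kWh (a value derived here for illustration, not stated in the source):

```python
# Back-of-the-envelope consistency check of the reported footprint:
# emissions (kgCO2e) = energy (kWh) * carbon intensity (kgCO2e/kWh)
energy_kwh = 15.00      # reported energy consumption
emissions_kg = 3.47     # reported carbon emissions
implied_intensity = emissions_kg / energy_kwh
print(f"Implied carbon intensity: {implied_intensity:.3f} kgCO2e/kWh")
```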

📚 Citation

BibTeX:

```bibtex
@phdthesis{giraud2025bioinformaticsMT,
  title  = {Developing Machine Translation for Bioinformatics: An Exploration into Domain-Specific Terminology, Domain-Adaptation, and Evaluation},
  author = {Giraud, Jurgi},
  school = {The Open University},
  year   = {2025},
  note   = {Forthcoming. Expected publication date: December 2025.},
}
```