---
library_name: transformers
language:
- en
- de
base_model:
- Unbabel/TowerInstruct-7B-v0.2
tags:
- Machine Translation
model-index:
- name: iwslt_mt_ende
  results: []
paper:
  title: >-
    KIT's Offline Speech Translation and Instruction Following Submission
    for IWSLT 2025
  authors: >-
    Koneru, Sai and Züfle, Maike and Nguyen, Thai-Binh and Akti, Seymanur
    and Niehues, Jan and Waibel, Alexander
  url: https://arxiv.org/abs/2505.13036
  published: 2025-05-25T00:00:00.000Z
---
# KIT IWSLT25 Machine Translation Model
This model adapts TowerInstruct-7B-v0.2 for English→German translation. We filter the IWSLT data using quality estimation models and fine-tune on the resulting high-quality data, optimizing for this specific language pair. The adapted model outperforms the base model, especially in the speech domain.
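The filtering step can be sketched as follows. This is an illustrative sketch only: the scoring function, the example scores, and the 0.8 threshold are placeholders, not the QE model or cutoff actually used for this card.

```python
# Illustrative sketch of QE-based filtering: keep only sentence pairs whose
# quality-estimation score clears a threshold. The scores and the 0.8 cutoff
# below are placeholders, not the values used to train this model.
def filter_by_qe(pairs, scores, threshold=0.8):
    """Return the (source, target) pairs whose QE score is >= threshold."""
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]

pairs = [
    ("Welcome to the first lecture", "Willkommen zur ersten Vorlesung"),
    ("Welcome to the first lecture", "Willkommen Vorlesung erste"),  # low quality
]
scores = [0.92, 0.41]  # e.g. produced by a sentence-level QE model

filtered = filter_by_qe(pairs, scores)
print(filtered)  # only the high-quality pair survives
```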
## Model Usage
Usage is the same as for the base model. However, we only evaluated English→German translation; performance on other languages and translation tasks is unknown.
### Model Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/TowerInstruct-7B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # left-pad so generation continues directly after the prompt
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.load_adapter("skoneru/iwslt_mt_ende")
```
### Prompt Format
```
<|im_start|>user
Translate the sentence from English into German.
English:
{src_sentence}
German:<|im_end|>
<|im_start|>assistant
{llm to generate}
```
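The template above can also be assembled programmatically. The helper name `build_prompt` is introduced here purely for illustration; it is not part of this repository:

```python
def build_prompt(src_sentence: str) -> str:
    """Fill the chat template above with an English source sentence."""
    return (
        "<|im_start|>user\n"
        "Translate the sentence from English into German.\n"
        "English:\n"
        f"{src_sentence}\n"
        "German:<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("Welcome to the first lecture"))
```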
### Model Inference
After loading the model and the tokenizer, you can use the model with the prompt format as shown below:
```python
src_sent = "Welcome to the first lecture"

# Wrap the source sentence in the chat prompt format expected by the model
prefix = "<|im_start|>user\nTranslate the sentence from English into German.\nEnglish: "
suffix = "\nGerman:<|im_end|>\n<|im_start|>assistant\n"
prompt = [prefix + src_sent + suffix]

inputs = tokenizer(prompt, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)

# Deterministic beam-search decoding
output = model.generate(
    **inputs,
    num_beams=5,
    max_new_tokens=256,
    return_dict_in_generate=True,
    early_stopping=True,
    do_sample=False,
)

# Strip the prompt tokens and decode only the generated continuation
hyps = tokenizer.batch_decode(output.sequences[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(hyps)
```
## Citation
If you use this model in your research, please cite:
```bibtex
@article{koneru2025kit,
  title={KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025},
  author={Koneru, Sai and Z{\"u}fle, Maike and Nguyen, Thai-Binh and Akti, Seymanur and Niehues, Jan and Waibel, Alexander},
  journal={arXiv preprint arXiv:2505.13036},
  year={2025},
  url={https://arxiv.org/abs/2505.13036}
}
```