---
license: mit
datasets:
- bkai-foundation-models/vi-alpaca-input-output-format
- CausalLM/GPT-4-Self-Instruct-Japanese
language:
- vi
- ja
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: question-answering
library_name: transformers
---
# Multilingual Question-Answering Model (Vietnamese and Japanese)
## Overview
This repository contains a fine-tuned multilingual question-answering model that supports both **Vietnamese** and **Japanese**. Built on the **Qwen/Qwen2.5-1.5B-Instruct** base model, it has been fine-tuned to provide high-quality answers in both languages.
The model was fine-tuned on the following datasets:
- **bkai-foundation-models/vi-alpaca-input-output-format**: A Vietnamese dataset designed for instruction-based input-output tasks.
- **CausalLM/GPT-4-Self-Instruct-Japanese**: A Japanese dataset created with self-instruct techniques to improve language understanding and generation.
This model is ideal for applications requiring cross-lingual support between Vietnamese and Japanese.
---
## License
This project is released under the **MIT License**, ensuring flexibility for both academic and commercial use. Please refer to the `LICENSE` file for more details.
---
## Model Details
### Base Model
- **Qwen/Qwen2.5-1.5B-Instruct**: A 1.5B-parameter instruction-tuned model developed by Alibaba Cloud. It excels at understanding and generating natural language across various domains.
### Supported Languages
- **Vietnamese (vi)**
- **Japanese (ja)**
### Pipeline Tag
- **Question-Answering**: The model is optimized for answering questions in both supported languages.
### Library
- **Transformers**: This model is built using the Hugging Face `transformers` library, making it easy to integrate into existing pipelines.
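For quick experiments, the model can also be wrapped in the high-level `pipeline` API. Note this is a sketch rather than officially documented usage: since the checkpoint is a causal language model (see the usage example below), it goes through the `text-generation` pipeline rather than the extractive `question-answering` one:
```python
from transformers import pipeline

# Sketch: high-level text-generation pipeline around the checkpoint
generator = pipeline("text-generation", model="haiFrHust/VNJPTranslate_base")

# Japanese: "Question: What is the capital of Japan?"
result = generator("質問: 日本の首都はどこですか?", max_new_tokens=64)
print(result[0]["generated_text"])
```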
---
## Installation
To use this model, ensure you have the `transformers` library installed:
```bash
pip install transformers
```
You can then load the model directly from the Hugging Face Hub:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage
input_text = "質問: ベトナムの首都はどこですか?"  # Japanese: "Question: What is the capital of Vietnam?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)  # set an explicit generation budget
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
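Because the base model is instruction-tuned, prompting through the tokenizer's chat template may produce better-formatted answers. The sketch below continues from the snippet above and assumes the fine-tuned checkpoint inherits Qwen2.5's chat template:
```python
# Sketch: chat-style prompting (assumes the checkpoint keeps Qwen2.5's chat template)
messages = [
    # Vietnamese: "Question: What is the capital of Japan?"
    {"role": "user", "content": "Câu hỏi: Thủ đô của Nhật Bản là gì?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```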
---
## Dataset Information
### Vietnamese Dataset
- **Name**: `bkai-foundation-models/vi-alpaca-input-output-format`
- **Description**: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.
### Japanese Dataset
- **Name**: `CausalLM/GPT-4-Self-Instruct-Japanese`
- **Description**: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.
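Both datasets can be pulled directly with the `datasets` library for inspection. A minimal sketch, assuming each repository exposes a default `train` split:
```python
from datasets import load_dataset

# Sketch: download and peek at the two fine-tuning datasets
vi_data = load_dataset("bkai-foundation-models/vi-alpaca-input-output-format", split="train")
ja_data = load_dataset("CausalLM/GPT-4-Self-Instruct-Japanese", split="train")

print(vi_data[0])  # one Vietnamese instruction-based input-output example
print(ja_data[0])  # one Japanese self-instruct example
```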
---
## Use Cases
This model is suitable for a variety of applications, including but not limited to:
- **Cross-Lingual Customer Support**: Answering user queries in both Vietnamese and Japanese.
- **Educational Tools**: Assisting students in learning and understanding concepts in their native language.
- **Multilingual Chatbots**: Building conversational agents capable of handling multiple languages seamlessly.
---
## Performance
The model demonstrates strong performance in both Vietnamese and Japanese, thanks to the high-quality datasets and the robust base model. However, performance may vary depending on the complexity of the questions and the domain-specific knowledge required.
For optimal results:
- Ensure your input questions are clear and concise.
- Fine-tune the model further on domain-specific data if necessary; a minimal sketch follows below.
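A further fine-tuning pass can be run with the standard `Trainer` API. Everything below is a hedged sketch: `your-org/your-domain-qa` is a hypothetical placeholder for your own dataset, the `text` column name is an assumption about its schema, and the hyperparameters are illustrative rather than tuned values from this project:
```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Hypothetical domain-specific dataset with a "text" column; replace with your own.
dataset = load_dataset("your-org/your-domain-qa", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Drop the raw columns so the Trainer only sees token IDs
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard causal-LM (next-token prediction) labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```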
---
## Contributions
Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.
---
## Acknowledgments
We would like to thank the following organizations and contributors:
- **Alibaba Cloud** for providing the Qwen base model.
- The creators of the `bkai-foundation-models/vi-alpaca-input-output-format` and `CausalLM/GPT-4-Self-Instruct-Japanese` datasets.
- The Hugging Face community for their excellent `transformers` library and support.
---
## Contact
For any inquiries or feedback, feel free to reach out to us via:
- Email: hai.ph225715@sis.hust.edu.vn
- GitHub Issues: Open an issue in this repository.
---
Thank you for using our multilingual question-answering model! We hope it serves your needs effectively. |