---

license: mit
datasets:
- bkai-foundation-models/vi-alpaca-input-output-format
- CausalLM/GPT-4-Self-Instruct-Japanese
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: question-answering
library_name: transformers
---


# Multilingual Question-Answering Model (Vietnamese and Japanese)

## Overview

This repository contains a fine-tuned multilingual question-answering model that supports both **Vietnamese** and **Japanese**. It is built on the **Qwen/Qwen2.5-1.5B-Instruct** base model and fine-tuned to provide high-quality answers in both languages.

The model has been fine-tuned using datasets such as:
- **bkai-foundation-models/vi-alpaca-input-output-format**: A Vietnamese dataset designed for instruction-based input-output tasks.
- **CausalLM/GPT-4-Self-Instruct-Japanese**: A Japanese dataset created with self-instruct techniques to improve language understanding and generation.

This model is ideal for applications requiring cross-lingual support between Vietnamese and Japanese.

---

## License

This project is released under the **MIT License**, ensuring flexibility for both academic and commercial use. Please refer to the `LICENSE` file for more details.

---

## Model Details

### Base Model
- **Qwen/Qwen2.5-1.5B-Instruct**: A powerful 1.5B parameter instruction-tuned model developed by Alibaba Cloud. It excels in understanding and generating natural language across various domains.

### Supported Languages
- **Vietnamese (vi)**
- **Japanese (ja)**

### Pipeline Tag
- **Question-Answering**: The model is optimized for answering questions in both supported languages.

### Library
- **Transformers**: This model is built using the Hugging Face `transformers` library, making it easy to integrate into existing pipelines.

---

## Installation

To use this model, ensure you have the `transformers` library installed:

```bash
pip install transformers
```

You can then load the model directly from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage
input_text = "質問: ベトナムの首都はどこですか?"  # Japanese: "Question: What is the capital of Vietnam?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)  # cap output length; default is very short
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer)
```
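Because the base model is instruction-tuned, wrapping a question in its chat template usually yields better answers than a raw prompt. Qwen2.5-Instruct uses the ChatML format; the sketch below illustrates the resulting string layout without needing the tokenizer (the system prompt shown is an assumption for illustration, not part of this repository):

```python
def build_chatml_prompt(question: str, system: str = "You are a helpful assistant.") -> str:
    """Format a question in ChatML, the chat template used by Qwen2.5-Instruct.

    In practice, tokenizer.apply_chat_template(...) builds this string for you;
    this helper only shows what the formatted prompt looks like.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("質問: ベトナムの首都はどこですか?")
print(prompt)
```

With a loaded tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces the equivalent string and is the recommended approach.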


---

## Dataset Information

### Vietnamese Dataset
- **Name**: `bkai-foundation-models/vi-alpaca-input-output-format`
- **Description**: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.

### Japanese Dataset
- **Name**: `CausalLM/GPT-4-Self-Instruct-Japanese`
- **Description**: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.

---

## Use Cases

This model is suitable for a variety of applications, including but not limited to:
- **Cross-Lingual Customer Support**: Answering user queries in both Vietnamese and Japanese.
- **Educational Tools**: Assisting students in learning and understanding concepts in their native language.
- **Multilingual Chatbots**: Building conversational agents capable of handling multiple languages seamlessly.

---

## Performance

The model demonstrates strong performance in both Vietnamese and Japanese, thanks to the high-quality datasets and the robust base model. However, performance may vary depending on the complexity of the questions and the domain-specific knowledge required.

For optimal results:
- Ensure your input questions are clear and concise.
- Fine-tune the model further on domain-specific data if necessary.
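For further fine-tuning, training examples are typically serialized as instruction/input/output records. A minimal sketch of writing such records as JSONL, mirroring the Alpaca input-output layout of the Vietnamese dataset (the exact field names are an assumption based on that format):

```python
import json

def to_alpaca_jsonl(records):
    """Serialize (instruction, input, output) triples as Alpaca-style JSONL lines."""
    lines = []
    for instruction, inp, output in records:
        lines.append(json.dumps(
            {"instruction": instruction, "input": inp, "output": output},
            ensure_ascii=False,  # keep Vietnamese/Japanese text human-readable
        ))
    return "\n".join(lines)

samples = [
    ("Trả lời câu hỏi sau.", "Thủ đô của Việt Nam là gì?", "Hà Nội."),
]
print(to_alpaca_jsonl(samples))
```

Each line is one training example; a file in this shape can be loaded with `datasets.load_dataset("json", data_files=...)` for fine-tuning.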

---

## Contributions

Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.

---

## Acknowledgments

We would like to thank the following organizations and contributors:
- **Alibaba Cloud** for providing the Qwen base model.
- The creators of the `bkai-foundation-models/vi-alpaca-input-output-format` and `CausalLM/GPT-4-Self-Instruct-Japanese` datasets.
- The Hugging Face community for their excellent `transformers` library and support.

---

## Contact

For any inquiries or feedback, feel free to reach out to us via:
- Email: hai.ph225715@sis.hust.edu.vn
- GitHub Issues: Open an issue in this repository.

---

Thank you for using our multilingual question-answering model! We hope it serves your needs effectively.