---
license: mit
datasets:
- bkai-foundation-models/vi-alpaca-input-output-format
- CausalLM/GPT-4-Self-Instruct-Japanese
language:
- vi
- ja
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: question-answering
library_name: transformers
---
# Multilingual Question-Answering Model (Vietnamese and Japanese)
## Overview
This repository contains a fine-tuned multilingual question-answering model that supports both **Vietnamese** and **Japanese**. Built on the **Qwen/Qwen2.5-1.5B-Instruct** base model, it has been fine-tuned to provide high-quality answers in both languages.
The model was fine-tuned on the following datasets:
- **bkai-foundation-models/vi-alpaca-input-output-format**: A Vietnamese dataset designed for instruction-based input-output tasks.
- **CausalLM/GPT-4-Self-Instruct-Japanese**: A Japanese dataset created with self-instruct techniques to improve language understanding and generation.
This model is ideal for applications requiring cross-lingual support between Vietnamese and Japanese.
---
## License
This project is released under the **MIT License**, ensuring flexibility for both academic and commercial use. Please refer to the `LICENSE` file for more details.
---
## Model Details
### Base Model
- **Qwen/Qwen2.5-1.5B-Instruct**: A 1.5B-parameter instruction-tuned model developed by Alibaba Cloud. It excels at understanding and generating natural language across various domains.
### Supported Languages
- **Vietnamese (vi)**
- **Japanese (ja)**
### Pipeline Tag
- **Question-Answering**: The model is optimized for answering questions in both supported languages.
### Library
- **Transformers**: This model is built using the Hugging Face `transformers` library, making it easy to integrate into existing pipelines.
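For quick experiments, the model can also be wrapped in the high-level `pipeline` API. Note this is a sketch rather than officially documented usage: since the checkpoint is a causal language model (see the usage example below), it goes through the `text-generation` pipeline rather than the extractive `question-answering` one:
```python
from transformers import pipeline

# Sketch: high-level text-generation pipeline around the checkpoint
generator = pipeline("text-generation", model="haiFrHust/VNJPTranslate_base")

# Japanese: "Question: What is the capital of Japan?"
result = generator("質問: 日本の首都はどこですか?", max_new_tokens=64)
print(result[0]["generated_text"])
```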
---
## Installation
To use this model, ensure you have the `transformers` library installed:
```bash
pip install transformers
```
You can then load the model directly from the Hugging Face Hub:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage
input_text = "質問: ベトナムの首都はどこですか?"  # Japanese: "Question: What is the capital of Vietnam?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)  # set an explicit generation budget
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
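Because the base model is instruction-tuned, prompting through the tokenizer's chat template may produce better-formatted answers. The sketch below continues from the snippet above and assumes the fine-tuned checkpoint inherits Qwen2.5's chat template:
```python
# Sketch: chat-style prompting (assumes the checkpoint keeps Qwen2.5's chat template)
messages = [
    # Vietnamese: "Question: What is the capital of Japan?"
    {"role": "user", "content": "Câu hỏi: Thủ đô của Nhật Bản là gì?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```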
---
## Dataset Information
### Vietnamese Dataset
- **Name**: `bkai-foundation-models/vi-alpaca-input-output-format`
- **Description**: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.
### Japanese Dataset
- **Name**: `CausalLM/GPT-4-Self-Instruct-Japanese`
- **Description**: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.
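Both datasets can be pulled directly with the `datasets` library for inspection. A minimal sketch, assuming each repository exposes a default `train` split:
```python
from datasets import load_dataset

# Sketch: download and peek at the two fine-tuning datasets
vi_data = load_dataset("bkai-foundation-models/vi-alpaca-input-output-format", split="train")
ja_data = load_dataset("CausalLM/GPT-4-Self-Instruct-Japanese", split="train")

print(vi_data[0])  # one Vietnamese instruction-based input-output example
print(ja_data[0])  # one Japanese self-instruct example
```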
---
## Use Cases
This model is suitable for a variety of applications, including but not limited to:
- **Cross-Lingual Customer Support**: Answering user queries in both Vietnamese and Japanese.
- **Educational Tools**: Assisting students in learning and understanding concepts in their native language.
- **Multilingual Chatbots**: Building conversational agents capable of handling multiple languages seamlessly.
---
## Performance
The model demonstrates strong performance in both Vietnamese and Japanese, thanks to the high-quality datasets and the robust base model. However, performance may vary depending on the complexity of the questions and the domain-specific knowledge required.
For optimal results:
- Ensure your input questions are clear and concise.
- Fine-tune the model further on domain-specific data if necessary; a minimal sketch follows below.
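A further fine-tuning pass can be run with the standard `Trainer` API. Everything below is a hedged sketch: `your-org/your-domain-qa` is a hypothetical placeholder for your own dataset, the `text` column name is an assumption about its schema, and the hyperparameters are illustrative rather than tuned values from this project:
```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Hypothetical domain-specific dataset with a "text" column; replace with your own.
dataset = load_dataset("your-org/your-domain-qa", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Drop the raw columns so the Trainer only sees token IDs
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard causal-LM (next-token prediction) labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```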
---
## Contributions
Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.
---
## Acknowledgments
We would like to thank the following organizations and contributors:
- **Alibaba Cloud** for providing the Qwen base model.
- The creators of the `bkai-foundation-models/vi-alpaca-input-output-format` and `CausalLM/GPT-4-Self-Instruct-Japanese` datasets.
- The Hugging Face community for their excellent `transformers` library and support.
---
## Contact
For any inquiries or feedback, feel free to reach out to us via:
- Email: hai.ph225715@sis.hust.edu.vn
- GitHub Issues: Open an issue in this repository.
---
Thank you for using our multilingual question-answering model! We hope it serves your needs effectively. |