---
license: mit
datasets:
- bkai-foundation-models/vi-alpaca-input-output-format
- CausalLM/GPT-4-Self-Instruct-Japanese
language:
- vi
- ja
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# Multilingual Question-Answering Model (Vietnamese and Japanese)

## Overview

This repository contains a fine-tuned multilingual question-answering model that supports both **Vietnamese** and **Japanese**. Built on the **Qwen/Qwen2.5-1.5B-Instruct** base model, it answers questions in both languages via causal text generation.

The model was fine-tuned on the following datasets:

- **bkai-foundation-models/vi-alpaca-input-output-format**: a Vietnamese dataset of instruction-based input-output pairs.
- **CausalLM/GPT-4-Self-Instruct-Japanese**: a Japanese dataset created with self-instruct techniques to improve language understanding and generation.

This model is well suited to applications that require question answering in both Vietnamese and Japanese.

---

## License

This project is released under the **MIT License**, allowing both academic and commercial use. Please refer to the `LICENSE` file for more details.

---

## Model Details

### Base Model
- **Qwen/Qwen2.5-1.5B-Instruct**: a 1.5B-parameter instruction-tuned model developed by Alibaba Cloud's Qwen team, with strong natural-language understanding and generation across a range of domains.

### Supported Languages
- **Vietnamese (vi)**
- **Japanese (ja)**

### Pipeline Tag
- **Text Generation**: the model answers questions generatively, so it is tagged `text-generation`; the Hub's `question-answering` tag refers to extractive QA models, which does not apply to a causal language model.

### Library
- **Transformers**: the model is built with the Hugging Face `transformers` library, making it easy to integrate into existing pipelines.
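
For quick experiments, the model can be wrapped in a standard `text-generation` pipeline. A minimal sketch; the generation settings are illustrative, not tuned values:

```python
from transformers import pipeline

# A minimal sketch: load the model into a text-generation pipeline
qa = pipeline("text-generation", model="haiFrHust/VNJPTranslate_base")

# Japanese: "Question: What is the capital of Japan?"
result = qa("質問: 日本の首都はどこですか?", max_new_tokens=64)
print(result[0]["generated_text"])
```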

---

## Installation

To use this model, ensure you have the `transformers` library (and PyTorch) installed:

```bash
pip install transformers torch
```

You can then load the model directly from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage (Japanese: "Question: What is the capital of Vietnam?")
input_text = "質問: ベトナムの首都はどこですか?"
inputs = tokenizer(input_text, return_tensors="pt")
# Set max_new_tokens explicitly; the default generation length is very short
outputs = model.generate(**inputs, max_new_tokens=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer)
```
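
Because the base model is instruction-tuned, prompting through the tokenizer's chat template usually yields cleaner answers. A hedged sketch, assuming the fine-tuned checkpoint keeps Qwen2.5's chat template:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Vietnamese: "What is the capital of Japan?"
messages = [{"role": "user", "content": "Thủ đô của Nhật Bản là gì?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```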

---

## Dataset Information

### Vietnamese Dataset
- **Name**: `bkai-foundation-models/vi-alpaca-input-output-format`
- **Description**: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.

### Japanese Dataset
- **Name**: `CausalLM/GPT-4-Self-Instruct-Japanese`
- **Description**: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.
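
Both datasets are available on the Hugging Face Hub and can be inspected with the `datasets` library. A sketch; the split name and field layout are assumptions, so check each dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the two fine-tuning datasets from the Hub
vi_ds = load_dataset("bkai-foundation-models/vi-alpaca-input-output-format", split="train")
ja_ds = load_dataset("CausalLM/GPT-4-Self-Instruct-Japanese", split="train")

# Print one record from each; field names depend on the dataset's schema
print(vi_ds[0])
print(ja_ds[0])
```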

---

## Use Cases

This model is suitable for a variety of applications, including but not limited to:

- **Cross-Lingual Customer Support**: answering user queries in both Vietnamese and Japanese.
- **Educational Tools**: helping students learn and understand concepts in their native language.
- **Multilingual Chatbots**: building conversational agents that handle multiple languages seamlessly.

---

## Performance

The model performs well in both Vietnamese and Japanese, owing to the quality of the fine-tuning datasets and the strength of the base model. Performance may vary with question complexity and the amount of domain-specific knowledge required.

For optimal results:

- Keep input questions clear and concise.
- Fine-tune the model further on domain-specific data if necessary (a minimal setup is sketched below).
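
A further fine-tuning pass can be set up with the standard `Trainer` API. This is a minimal sketch under stated assumptions: `my_domain_data.jsonl` with a `text` field is a hypothetical placeholder, and the hyperparameters are illustrative, not tuned values:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Hypothetical placeholder: a JSONL file with one {"text": ...} record per line
dataset = load_dataset("json", data_files="my_domain_data.jsonl", split="train")

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-domain-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    # Causal-LM collator: pads batches and copies inputs to labels (mlm=False)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```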

---

## Contributions

Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.

---

## Acknowledgments

We would like to thank the following organizations and contributors:

- **Alibaba Cloud** for the Qwen base model.
- The creators of the `bkai-foundation-models/vi-alpaca-input-output-format` and `CausalLM/GPT-4-Self-Instruct-Japanese` datasets.
- The Hugging Face community for the excellent `transformers` library and their support.

---

## Contact

For inquiries or feedback, feel free to reach out via:

- Email: hai.ph225715@sis.hust.edu.vn
- GitHub Issues: open an issue in this repository.

---

Thank you for using our multilingual question-answering model! We hope it serves your needs effectively.