---
license: mit
datasets:
- bkai-foundation-models/vi-alpaca-input-output-format
- CausalLM/GPT-4-Self-Instruct-Japanese
language:
- vi
- ja
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# Multilingual Question-Answering Model (Vietnamese and Japanese)

## Overview

This repository contains a fine-tuned multilingual question-answering model that supports both **Vietnamese** and **Japanese**. Built on the **Qwen/Qwen2.5-1.5B-Instruct** base model, it answers questions in both languages via causal text generation.

The model was fine-tuned on the following datasets:

- **bkai-foundation-models/vi-alpaca-input-output-format**: a Vietnamese dataset of instruction-based input-output pairs.
- **CausalLM/GPT-4-Self-Instruct-Japanese**: a Japanese dataset created with self-instruct techniques to improve language understanding and generation.

This model is well suited to applications that require question answering in both Vietnamese and Japanese.

---

## License

This project is released under the **MIT License**, allowing both academic and commercial use. Please refer to the `LICENSE` file for more details.

---

## Model Details

### Base Model
- **Qwen/Qwen2.5-1.5B-Instruct**: a 1.5B-parameter instruction-tuned model developed by Alibaba Cloud's Qwen team, with strong natural-language understanding and generation across a range of domains.

### Supported Languages
- **Vietnamese (vi)**
- **Japanese (ja)**

### Pipeline Tag
- **Text Generation**: the model answers questions generatively, so it is tagged `text-generation`; the Hub's `question-answering` tag refers to extractive QA models, which does not apply to a causal language model.

### Library
- **Transformers**: the model is built with the Hugging Face `transformers` library, making it easy to integrate into existing pipelines.
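
For quick experiments, the model can be wrapped in a standard `text-generation` pipeline. A minimal sketch; the generation settings are illustrative, not tuned values:

```python
from transformers import pipeline

# A minimal sketch: load the model into a text-generation pipeline
qa = pipeline("text-generation", model="haiFrHust/VNJPTranslate_base")

# Japanese: "Question: What is the capital of Japan?"
result = qa("質問: 日本の首都はどこですか?", max_new_tokens=64)
print(result[0]["generated_text"])
```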

---

## Installation

To use this model, ensure you have the `transformers` library (and PyTorch) installed:

```bash
pip install transformers torch
```

You can then load the model directly from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage (Japanese: "Question: What is the capital of Vietnam?")
input_text = "質問: ベトナムの首都はどこですか?"
inputs = tokenizer(input_text, return_tensors="pt")
# Set max_new_tokens explicitly; the default generation length is very short
outputs = model.generate(**inputs, max_new_tokens=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer)
```
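
Because the base model is instruction-tuned, prompting through the tokenizer's chat template usually yields cleaner answers. A hedged sketch, assuming the fine-tuned checkpoint keeps Qwen2.5's chat template:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Vietnamese: "What is the capital of Japan?"
messages = [{"role": "user", "content": "Thủ đô của Nhật Bản là gì?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```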

---

## Dataset Information

### Vietnamese Dataset
- **Name**: `bkai-foundation-models/vi-alpaca-input-output-format`
- **Description**: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.

### Japanese Dataset
- **Name**: `CausalLM/GPT-4-Self-Instruct-Japanese`
- **Description**: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.
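
Both datasets are available on the Hugging Face Hub and can be inspected with the `datasets` library. A sketch; the split name and field layout are assumptions, so check each dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the two fine-tuning datasets from the Hub
vi_ds = load_dataset("bkai-foundation-models/vi-alpaca-input-output-format", split="train")
ja_ds = load_dataset("CausalLM/GPT-4-Self-Instruct-Japanese", split="train")

# Print one record from each; field names depend on the dataset's schema
print(vi_ds[0])
print(ja_ds[0])
```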

---

## Use Cases

This model is suitable for a variety of applications, including but not limited to:

- **Cross-Lingual Customer Support**: answering user queries in both Vietnamese and Japanese.
- **Educational Tools**: helping students learn and understand concepts in their native language.
- **Multilingual Chatbots**: building conversational agents that handle multiple languages seamlessly.

---

## Performance

The model performs well in both Vietnamese and Japanese, owing to the quality of the fine-tuning datasets and the strength of the base model. Performance may vary with question complexity and the amount of domain-specific knowledge required.

For optimal results:

- Keep input questions clear and concise.
- Fine-tune the model further on domain-specific data if necessary (a minimal setup is sketched below).
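
A further fine-tuning pass can be set up with the standard `Trainer` API. This is a minimal sketch under stated assumptions: `my_domain_data.jsonl` with a `text` field is a hypothetical placeholder, and the hyperparameters are illustrative, not tuned values:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Hypothetical placeholder: a JSONL file with one {"text": ...} record per line
dataset = load_dataset("json", data_files="my_domain_data.jsonl", split="train")

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-domain-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    # Causal-LM collator: pads batches and copies inputs to labels (mlm=False)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```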

---

## Contributions

Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.

---

## Acknowledgments

We would like to thank the following organizations and contributors:

- **Alibaba Cloud** for the Qwen base model.
- The creators of the `bkai-foundation-models/vi-alpaca-input-output-format` and `CausalLM/GPT-4-Self-Instruct-Japanese` datasets.
- The Hugging Face community for the excellent `transformers` library and their support.

---

## Contact

For inquiries or feedback, feel free to reach out via:

- Email: hai.ph225715@sis.hust.edu.vn
- GitHub Issues: open an issue in this repository.

---

Thank you for using our multilingual question-answering model! We hope it serves your needs effectively.