Introduction

OpenBA is an open-source bilingual language model equipped with 15 billion parameters, built on the T5 architecture.

Open Source Plan

We are excited to unveil two distinguished versions of our model, with another on the horizon:

OpenBA-LM: The backbone language models was pre-trained on 340B English, Chinese, and code tokens.
OpenBA-Flan: We perform supervised fine-tuning on the base model with additional 40B tokens using our collected BiFlan Dataset.
OpenBA-Chat: coming soon

Model Description

Model type: Language model
Language(s) (NLP): zh, en (We also offer the possibility for multilingual learning, by using a multilingual tokenizer.)
License: Apache 2.0
Resources for more information:
- Paper
- GitHub Repo

Usage

Install requirements

pip install transformers torch>=2.0 sentencepiece

Demo usage

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True)
>>> model = AutoModelForSeq2SeqLM.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> query = "<S>" + "介绍一下中国的四大名著，并分别概括其主要内容" + "<extra_id_0>"
>>> inputs = tokenizer(query, return_tensors="pt").to("cuda")
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
>>> response = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> print(response)
中国的四大名著分别是《红楼梦》、《西游记》、《水浒传》和《三国演义》。它们分别包括故事情节、文化内涵和历史背景等方面的不同特点。《红楼梦》是一部中国古典小说,讲述了贾宝玉、林黛玉、薛宝钗等一群人物在贾府的生活和爱情故事。《西游记》是中国著名小说,描述了孙悟空、猪八戒、沙悟净等一众妖魔鬼怪的冒险历程和故事。《水浒传》是一部中国古典小说,描述了宋江等一百零八位好汉的反抗故事。《三国演义》是中国古代著名小说,讲述了三国时期的历史和战争故事。这些小说在文学、历史、哲学和文化等方面都有着不同的影响和地位。

Downloads last month: 39

Paper for OpenNLG/OpenBA-V1-Flan

OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

Paper • 2309.10706 • Published Sep 19, 2023 • 17