|
|
--- |
|
|
language: de |
|
|
|
|
|
widget: |
|
|
- text: "In einer schockierenden Entdeckung fanden Wissenschaftler eine Herde Einhörner, die in einem abgelegenen, zuvor unerforschten Tal in den Anden lebten." |
|
|
--- |
|
|
|
|
|
# BPT2 |
|
|
|
|
|
See the [GPT2 model card](https://huggingface.co/gpt2) for considerations on limitations and bias. See the [GPT2 documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for details on GPT2. |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("tursunali/bpt2") |
|
|
model = AutoModelForCausalLM.from_pretrained("tursunali/bpt2") |
|
|
|
|
|
prompt = "<your prompt>" |
|
|
|
|
|
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) |
|
|
print(pipe(prompt)[0]["generated_text"]) |
|
|
``` |
|
|
|
|
|
In addition, two tricks may improve the quality of the generated text:
|
|
|
|
|
```python
import torch

max_length = 100  # adjust to taste

output = model.generate(
    # During training, an EOS token was used to mark the beginning of each
    # text, so it can help to insert it at the start of the prompt.
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # Setting bad_words_ids=[[0]] disallows generating an EOS token. Without
    # this, the model is prone to ending generation early, because a
    # significant number of texts in the training corpus are quite short.
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]

print(tokenizer.decode(output))
```
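The input-shaping step in the snippet above is easy to get wrong: `model.generate` expects a tensor with a batch dimension of shape `(batch_size, sequence_length)`. A minimal sketch of just that step, using dummy token ids in place of the real tokenizer output (an EOS id of 0 is assumed here, matching the `bad_words_ids=[[0]]` comment; no model download is needed):

```python
import torch

# Dummy values standing in for tokenizer.eos_token_id and
# tokenizer.encode(prompt); the real ids come from the BPT2 tokenizer.
eos_token_id = 0
prompt_ids = [318, 257, 1332]

# Prepend the EOS id, then add a batch dimension with unsqueeze(0),
# turning a 1-D id list into a (1, sequence_length) tensor.
input_ids = torch.tensor([eos_token_id] + prompt_ids).unsqueeze(0)

print(input_ids.shape)    # torch.Size([1, 4])
print(input_ids.tolist()) # [[0, 318, 257, 1332]]
```

Without the `unsqueeze(0)` call, `generate` would receive a 1-D tensor and fail or misinterpret the input as a batch of single-token sequences.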
|
|
|
|
|
## Citing |
|
|
|
|
|
Please cite BPT2 as follows: |
|
|
|
|
|
``` |
|
|
@misc{Backpacker_Trail_German_large_2022, |
|
|
author = {BackpackerTrail, Tursunali Kholdorov}, |
|
|
title = {{BPT2: Backpacker Trail German versions of GPT2}},
|
|
url = {https://github.com/Tursunali-Kholdorov/bptTrainer}, |
|
|
year = {2022} |
|
|
} |
|
|
``` |
|
|
|