---
library_name: transformers
tags:
- gpt2
- causal-lm
- bilingual
- sentencepiece
- french
- english
pipeline_tag: text-generation
datasets:
- climb-mao/babylm-fra
- elliepreed/l2-corpus-10m
license: other # change to "apache-2.0" or "mit" if that's correct
model-index:
- name: French_English_sequential – 128k steps
  results: []
---

# French + English (GPT-2 style) sequential model

Small bilingual GPT-2–style language model trained on French and English with SentencePiece tokenizers.

This model is trained on both French 🇫🇷 and English 🇬🇧, but it does not come with a single `AutoTokenizer`. Instead, we provide two SentencePiece tokenizers:

- `tokenizers/french.model`
- `tokenizers/english.model`

You can load either depending on the language you want to work with.
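If you switch between languages often, a small lookup table keeps the two repo paths in one place. This helper is a hypothetical convenience sketch, not part of the released model:

```python
# Hypothetical helper mapping language codes to the tokenizer files in this repo.
TOKENIZER_FILES = {
    "fr": "tokenizers/french.model",
    "en": "tokenizers/english.model",
}

def tokenizer_file(lang: str) -> str:
    """Return the repo-relative SentencePiece model path for a language code."""
    try:
        return TOKENIZER_FILES[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None

print(tokenizer_file("fr"))  # tokenizers/french.model
```

The returned path can be passed straight to `hf_hub_download` as the filename argument.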

## Load the model

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "elliepreed/bgpt-french-english"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()
```

## Load both tokenizers

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

fr_path = hf_hub_download(model_id, "tokenizers/french.model")
en_path = hf_hub_download(model_id, "tokenizers/english.model")

sp_fr = spm.SentencePieceProcessor(model_file=fr_path)
sp_en = spm.SentencePieceProcessor(model_file=en_path)
```

### Example: French generation

```python
prompt = "Paris est"
ids = sp_fr.encode(prompt, out_type=int) + [sp_fr.eos_id()]
input_ids = torch.tensor([ids], device=device)

out = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
    eos_token_id=sp_fr.eos_id(),
    pad_token_id=sp_fr.pad_id(),
)

# Decode only the newly generated tokens, skipping the prompt.
print("FR:", sp_fr.decode(out[0].tolist()[len(ids):]))
```
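`model.generate` returns the prompt ids followed by the sampled continuation, which is why the decode step slices off the first `len(ids)` tokens. A minimal sketch of that slicing with made-up token ids:

```python
# Made-up token ids standing in for sp_fr.encode(prompt) + [sp_fr.eos_id()].
prompt_ids = [101, 7, 42, 2]

# generate() outputs the prompt followed by the newly sampled tokens.
generated = prompt_ids + [9, 15, 3]

# Keep only the continuation before decoding.
new_tokens = generated[len(prompt_ids):]
print(new_tokens)  # [9, 15, 3]
```

Decoding `generated` directly would echo the prompt back in the output text.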

**Model size:** 50.9M params (F32, Safetensors)