---
base_model: unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
license: apache-2.0
language:
- fr
- en
datasets:
- jpacifico/French-Alpaca-dataset-Instruct-55K
---

# Uploaded finetuned model

- **Developed by:** mintujohnson
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

# Inference

```python
from unsloth import FastLanguageModel
from transformers import TextStreamer

model_path = "mintujohnson/Llama-3.2-3B-French-Instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 128,
    dtype = None,          # None = auto-detect dtype for the GPU
    load_in_4bit = True,
)

def inference(messages, model, tokenizer):
    FastLanguageModel.for_inference(model)  # enable native 2x faster inference
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,  # must add for generation
        return_tensors = "pt",
    ).to("cuda")
    print(tokenizer.decode(inputs[0], skip_special_tokens = False))
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(
        input_ids = inputs,
        streamer = text_streamer,
        max_new_tokens = 128,
        use_cache = True,
        temperature = 1.5,
        min_p = 0.1,
    )

messages = [
    {"role": "user", "content": "où est la Normandie?"},  # "Where is Normandy?"
]
inference(messages, model, tokenizer)  # streams the reply; returns nothing
```
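If you would rather not install Unsloth, the checkpoint may also load through plain `transformers`. The sketch below is an assumption based on the standard Llama 3.2 chat workflow and is not confirmed by this card (the Unsloth path above is the documented one); it reuses the repository id and the generation parameters from the example above.

```python
# Minimal sketch, assuming the merged weights load with plain transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "mintujohnson/Llama-3.2-3B-French-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype = torch.bfloat16,  # assumption: pick a dtype your GPU supports
    device_map = "auto",
)

messages = [{"role": "user", "content": "où est la Normandie?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to(model.device)

streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    input_ids = inputs,
    streamer = streamer,
    max_new_tokens = 128,
    do_sample = True,    # required for temperature/min_p sampling
    temperature = 1.5,
    min_p = 0.1,
)
```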