| --- |
| base_model: unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit |
| tags: |
| - text-generation-inference |
| - transformers |
| - unsloth |
| - llama |
| - trl |
| license: apache-2.0 |
| language: |
| - en |
| datasets: |
| - GeneralReasoning/GeneralThought-430K |
| - isaiahbjork/cot-logic-reasoning |
| --- |
| |
| # Uploaded model |
|
|
| - **Developed by:** alibidaran |
| - **License:** apache-2.0 |
| - **Finetuned from model :** unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit |
| - **Finedtuned with SFT Algorithm** |
| ## Direct Usages: |
| ``` python |
| from transformers import TextStreamer |
| from unsloth import FastLanguageModel |
| import torch |
| max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally! |
| dtype = 'Bfloat16' # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+ |
| load_in_4bit = True |
| model, tokenizer = FastLanguageModel.from_pretrained( |
| model_name ="alibidaran/LLAMA3-instructive_reasoning", |
| max_seq_length = max_seq_length, |
| #dtype = dtype, |
| load_in_4bit = load_in_4bit, |
| #fast_inference = True, # Enable vLLM fast inference |
| max_lora_rank = 128, |
| gpu_memory_utilization = 0.6, # Reduce if out of memory |
| # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf |
| ) |
| FastLanguageModel.for_inference(model) # Enable native 2x faster inference |
| system_prompt=""" |
| You are a reasonable expert who thinks and answer the users question. |
| Before respond first think and create a chain of thoughts in your mind. |
| Then respond to the client. |
| Your chain of thought and reflection must be in <thinking>..</thinking> format and your respond |
| should be in the <output>..</output> format. |
| """ |
| |
| messages = [ |
| {'role':'system','content':system_prompt}, |
| {"role": "user", "content":'How many r has the word of strawberry?' }, |
| |
| ] |
| inputs = tokenizer.apply_chat_template( |
| messages, |
| tokenize = True, |
| add_generation_prompt = True, # Must add for generation |
| return_tensors = "pt", |
| ).to("cuda") |
| |
| text_streamer = TextStreamer(tokenizer, skip_prompt = True) |
| _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens =2048, |
| use_cache = True, temperature = 0.7, min_p = 0.9) |
| ``` |
|
|
| This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library. |
|
|
| [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth) |