---
base_model: unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit
- lora
- sft
- transformers
- trl
- unsloth
license: mit
datasets:
- ServiceNow-AI/R1-Distill-SFT
language:
- en
---

# Model Card for Thinkmini

- A simple text-generation model built on top of Llama 3.2 1B.
- Lightweight enough to run inference on a CPU with 4 GB of RAM.
- **Developed by:** [**findthehead**](https://huggingface.co/findthehead)

### Framework versions

- PEFT 0.17.1

### Inference Code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Prachir-AI/Thinkmini"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure 4-bit loading, matching the quantization this model was built around
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit loading
    bnb_4bit_quant_type="nf4",              # common default for 4-bit models
    bnb_4bit_compute_dtype=torch.bfloat16,  # use bfloat16 for computation
    bnb_4bit_use_double_quant=True,         # often paired with nf4
)

# Load the model in 4-bit (bitsandbytes quantization requires a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer(
    "How do you plan for a full pentest of a web application?",
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
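### CPU Inference

The 4-bit path above depends on bitsandbytes, which requires a CUDA GPU. Since this model is small enough to run on a CPU, here is a minimal CPU-only sketch. It assumes the `Prachir-AI/Thinkmini` repository can be loaded directly with `AutoModelForCausalLM` (Transformers resolves PEFT adapter repositories automatically when `peft` is installed); adjust the dtype to fit your available memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Prachir-AI/Thinkmini"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# CPU-only load: skip bitsandbytes quantization entirely.
# float32 is the safe default on any CPU; torch.bfloat16 roughly halves
# the memory footprint on CPUs that support it.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
)

inputs = tokenizer(
    "How do you plan for a full pentest of a web application?",
    return_tensors="pt",
)

output_ids = model.generate(
    **inputs,
    max_new_tokens=200,   # shorter generations keep CPU latency manageable
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```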