Instructions for using ericflo/Llama-3.2-3B-COT with libraries, inference providers, notebooks, and local apps. The sections below show how to get started with each option.
- Libraries
- Transformers
How to use ericflo/Llama-3.2-3B-COT with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ericflo/Llama-3.2-3B-COT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ericflo/Llama-3.2-3B-COT")
model = AutoModelForCausalLM.from_pretrained("ericflo/Llama-3.2-3B-COT")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- llama-cpp-python
How to use ericflo/Llama-3.2-3B-COT with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="ericflo/Llama-3.2-3B-COT",
    filename="Llama-3.2-3B-COT-BF16.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use ericflo/Llama-3.2-3B-COT with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ericflo/Llama-3.2-3B-COT:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf ericflo/Llama-3.2-3B-COT:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ericflo/Llama-3.2-3B-COT:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf ericflo/Llama-3.2-3B-COT:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ericflo/Llama-3.2-3B-COT:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf ericflo/Llama-3.2-3B-COT:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ericflo/Llama-3.2-3B-COT:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf ericflo/Llama-3.2-3B-COT:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/ericflo/Llama-3.2-3B-COT:Q4_K_M
```
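However you start it, `llama-server` exposes an OpenAI-compatible API (port 8080 by default). A minimal client sketch, assuming the `openai` Python package is installed and the server is running locally; the model name passed here is only a label, since the server already has the model loaded:

```python
# pip install openai
from openai import OpenAI

# llama-server listens on port 8080 by default; adjust if you started it differently.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ericflo/Llama-3.2-3B-COT",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```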
- LM Studio
- Jan
- vLLM
How to use ericflo/Llama-3.2-3B-COT with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ericflo/Llama-3.2-3B-COT"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ericflo/Llama-3.2-3B-COT",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/ericflo/Llama-3.2-3B-COT:Q4_K_M
```
- SGLang
How to use ericflo/Llama-3.2-3B-COT with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ericflo/Llama-3.2-3B-COT" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ericflo/Llama-3.2-3B-COT",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ericflo/Llama-3.2-3B-COT" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ericflo/Llama-3.2-3B-COT",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Ollama
How to use ericflo/Llama-3.2-3B-COT with Ollama:
```sh
ollama run hf.co/ericflo/Llama-3.2-3B-COT:Q4_K_M
```
- Unsloth Studio
How to use ericflo/Llama-3.2-3B-COT with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for ericflo/Llama-3.2-3B-COT to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for ericflo/Llama-3.2-3B-COT to start chatting
```
Using HuggingFace Spaces for Unsloth
No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for ericflo/Llama-3.2-3B-COT to start chatting.
- Docker Model Runner
How to use ericflo/Llama-3.2-3B-COT with Docker Model Runner:
```sh
docker model run hf.co/ericflo/Llama-3.2-3B-COT:Q4_K_M
```
- Lemonade
How to use ericflo/Llama-3.2-3B-COT with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull ericflo/Llama-3.2-3B-COT:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.Llama-3.2-3B-COT-Q4_K_M
```
List all available models
```sh
lemonade list
```
Thought-Ranked Llama 3.2 3B
Model Description
This model is a fine-tuned version of Meta's Llama 3.2 3B (Base) that has been trained to generate high-quality thought processes before producing answers. The model underwent 4 rounds of specialized fine-tuning using a thought-chain ranking approach. (This was a weekend project; training ran for just a few hundred steps.)
Training Process
1. Initial Generation: For each training sample, the model generates multiple thought chains by prefixing different thought tokens: `<thought>{char}</thought>` for each character in `[a-zA-Z0-9]`. Each thought chain is allowed up to 128 tokens.
2. Answer Generation: Following each thought chain, the model generates a complete answer with up to 2048 tokens.
3. Ranking & Selection: An external LLM ranking system evaluates the quality of the answers without seeing the thought processes, producing a ranking of the most effective thought patterns.
4. Final Training: The model is then trained on the highest-ranked thought-answer pairs, learning to generate the most effective thought patterns autonomously.
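Below is a minimal Python sketch of one round of this generate-rank-select loop. The `generate_greedy` helper, the `judge_score` callable, the prompt formatting, and the handling of the closing `</thought>` tag are all illustrative assumptions; this is not the released training code.

```python
import string

import torch

THOUGHT_CHARS = string.ascii_letters + string.digits  # 62 single-character seeds


def generate_greedy(model, tokenizer, prompt, max_new_tokens):
    """Greedy (temperature 0.0) continuation of `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)


def build_ranked_sft_pairs(model, tokenizer, judge_score, questions):
    """Illustrative sketch of one round of the thought-chain ranking loop."""
    sft_pairs = []
    for question in questions:
        candidates = []
        for char in THOUGHT_CHARS:
            # Seed a distinct thought chain with a single-character prefix (up to 128 tokens).
            thought = f"<thought>{char}" + generate_greedy(
                model, tokenizer, f"{question}\n<thought>{char}", max_new_tokens=128
            )
            # Generate the full answer conditioned on the question and the thought chain.
            answer = generate_greedy(model, tokenizer, f"{question}\n{thought}\n", max_new_tokens=2048)
            candidates.append((thought, answer))
        # The external judge scores answers only; it never sees the thought chains.
        best_thought, best_answer = max(candidates, key=lambda c: judge_score(question, c[1]))
        sft_pairs.append({"question": question, "thought": best_thought, "answer": best_answer})
    return sft_pairs
```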
Key Features
- Thought Chain Generation: The model has learned to generate explicit thought processes before providing answers
- Greedy Sampling: Uses greedy sampling for both thought generation and final answers
- Length Parameters:
- Thought chains: Up to 128 tokens
- Final answers: Up to 2048 tokens
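To mirror these settings at inference time, a hedged example using Transformers' `GenerationConfig` (the values come from the list above; using a single `max_new_tokens` budget for the whole response is an assumption):

```python
from transformers import GenerationConfig

# Greedy decoding, per the card. max_new_tokens covers the final answer;
# thought chains were capped at 128 tokens during training.
generation_config = GenerationConfig(
    do_sample=False,
    max_new_tokens=2048,
)
```

Pass it to `model.generate(..., generation_config=generation_config)`.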
Model Architecture
- Base model: Llama 3.2 3B (Base)
- Architecture: Transformer-based language model
- Parameters: ~3.2 billion
- Training Strategy: Supervised Fine-Tuning (SFT) with thought-chain ranking
Intended Use
This model is designed for tasks that benefit from explicit reasoning chains, including but not limited to:
- Problem-solving
- Mathematical reasoning
- Logical deduction
- Step-by-step explanations
- Complex decision making
Out-of-Scope Uses
- Direct deployment without safety measures
- Applications requiring guaranteed accuracy
- Critical decision-making without human oversight
- Tasks requiring capabilities beyond the base Llama 3.2 3B model
Training Details
Training Data
The model was trained using:
- Sample questions paired with multiple thought variations
- Thought chains generated using systematic character prefixes
- Rankings derived from LLM evaluation of answer quality
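For illustration only, a selected training record might look roughly like the following; the field names and format are assumptions, since the dataset schema is not published:

```python
# Hypothetical record shape; not the actual dataset format.
example_record = {
    "question": "What is 17 * 24?",
    "thought": "<thought>Break it up: (17 * 20) + (17 * 4) = 340 + 68 = 408.</thought>",
    "answer": "17 * 24 = 408",
    "rank": 1,  # assigned by the external LLM judge, which saw only the answer
}
```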
Training Procedure
Thought Generation Phase
- Generated 62 variations of thoughts per sample (a-z, A-Z, 0-9)
- Sampled with temperature=0.0
- Maximum thought length: 128 tokens
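The 62 seeds can be enumerated directly; this snippet just spells out the `[a-zA-Z0-9]` prefix set described above:

```python
import string

# One thought seed per character in [a-zA-Z0-9], as described above.
thought_seeds = [f"<thought>{char}</thought>" for char in string.ascii_letters + string.digits]
assert len(thought_seeds) == 62
```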
Answer Generation Phase
- Generated completions following each thought chain
- Maximum answer length: 2048 tokens
- Sampled with temperature=0.0
Ranking Phase
- External LLM evaluated answer quality
- Ranking performed without access to thought chains
- Selected highest-performing thought-answer pairs
Final Training Phase
- Fine-tuned on best-performing thought-answer combinations
- 4 complete rounds of training
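A rough sketch of what one supervised fine-tuning pass over the selected pairs could look like. It assumes the `sft_pairs` produced by the earlier sketch and a simple question/thought/answer concatenation, both of which are assumptions; the card describes repeating the full generate-rank-train cycle 4 times:

```python
import torch

# Assumes `model`, `tokenizer`, and `sft_pairs` from the earlier sketch.
# Llama tokenizers ship without a pad token; reuse EOS for batching.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

texts = [f"{p['question']}\n{p['thought']}\n{p['answer']}" for p in sft_pairs]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for i in range(0, len(texts), 4):  # small batches; a single pass shown here
    enc = tokenizer(
        texts[i:i + 4], return_tensors="pt", padding=True, truncation=True, max_length=2048
    ).to(model.device)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**enc, labels=labels).loss  # standard causal-LM objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```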
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ericflo/Llama-3.2-3B-COT")
tokenizer = AutoTokenizer.from_pretrained("ericflo/Llama-3.2-3B-COT")

# Example usage
prompt = "Solve this math problem: 2x + 3 = 7"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a response with a thought chain (greedy decoding, matching training)
output = model.generate(
    input_ids,
    do_sample=False,
    max_new_tokens=2048,
)
response = tokenizer.decode(output[0][input_ids.shape[-1]:])
print(response)
```
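Since the model emits its reasoning inside `<thought>...</thought>` tags before the answer, you may want to split the two. A minimal, assumption-laden way to do it (the model is not guaranteed to emit well-formed tags every time):

```python
import re

# Split the generated text into the thought chain and the final answer.
match = re.search(r"<thought>(.*?)</thought>", response, flags=re.DOTALL)
thought = match.group(1).strip() if match else ""
answer = response[match.end():].strip() if match else response.strip()
print("THOUGHT:", thought)
print("ANSWER:", answer)
```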
Limitations
- Limited to the capabilities of the base Llama 3.2 3B model
- May generate thought chains that are not always optimal
- Performance depends on the quality of the LLM ranking system used during training
- Training process may not capture all possible effective thought patterns
- Limited by the context window of the base model
Ethical Considerations
- The model inherits biases from the base Llama 3.2 3B model
- Generated thought chains should be reviewed for accuracy and appropriateness
- The model's reasoning process should not be relied upon for critical decisions without human verification
- Users should implement appropriate content filtering and safety measures
Citation
If you use this model in your research, please cite:
```bibtex
@misc{thought-ranked-llama,
  title={Thought-Ranked Llama 3.2: Fine-tuning Language Models with Ranked Thought Chains},
  author={Eric Florenzano},
  year={2024},
  howpublished={\url{https://huggingface.co/ericflo/Llama-3.2-3B-COT}}
}
```