Instructions for using mattshumer/ref_70_e3 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use mattshumer/ref_70_e3 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mattshumer/ref_70_e3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")
model = AutoModelForCausalLM.from_pretrained("mattshumer/ref_70_e3")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use mattshumer/ref_70_e3 with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "mattshumer/ref_70_e3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mattshumer/ref_70_e3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker
```shell
# Deploy with docker on Linux (vLLM's OpenAI-compatible image):
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mattshumer/ref_70_e3
```
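Once the server is up (whether started via pip or Docker), you can also call it from Python instead of curl. A minimal sketch using the official `openai` client, assuming the server above is listening on the default port 8000:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (OpenAI-compatible API).
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key unless one was configured at startup
)

response = client.chat.completions.create(
    model="mattshumer/ref_70_e3",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```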
- SGLang
How to use mattshumer/ref_70_e3 with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mattshumer/ref_70_e3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mattshumer/ref_70_e3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "mattshumer/ref_70_e3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mattshumer/ref_70_e3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use mattshumer/ref_70_e3 with Docker Model Runner:
```shell
docker model run hf.co/mattshumer/ref_70_e3
```
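Docker Model Runner also exposes an OpenAI-compatible API. A minimal sketch, assuming host-side TCP access is enabled and that the endpoint below (port 12434, path `/engines/v1`) matches your installation's defaults; both are assumptions that may differ per setup:

```python
from openai import OpenAI

# Assumed Docker Model Runner endpoint; port and path may differ per installation.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="docker")

response = client.chat.completions.create(
    model="hf.co/mattshumer/ref_70_e3",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```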
Effect of the fine-tuning
Testing ep2 and ep3 without quantization under the same conditions, they seem to have the same knowledge of the new feature (reflection), but ep3's overall, original knowledge is weaker, which is not unusual for fine-tuning. I think ep2 performs better; it would be worth trying ep1 as well.
That is funny, because the model files of ep2 and ep3 are exactly the same. You can see this in the community discussion "mattshumer/ref_70_e3 and mattshumer/Reflection-Llama-3.1-70B-ep2-working are the SAME.", and you can even compare the SHA256 hashes of all the uploaded model files yourself: they prove that the ep2 and ep3 uploads are identical in every single bit.
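To check this yourself, here is a minimal sketch that hashes the weight shards of both local downloads; the directory names are placeholders for wherever you cloned the two repositories:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so 70B-sized shards don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder paths -- adjust to your local clones of the two repositories.
ep2_dir = Path("Reflection-Llama-3.1-70B-ep2-working")
ep3_dir = Path("ref_70_e3")

for ep2_file in sorted(ep2_dir.glob("*.safetensors")):
    ep3_file = ep3_dir / ep2_file.name
    if ep3_file.exists():
        same = sha256_of(ep2_file) == sha256_of(ep3_file)
        print(f"{ep2_file.name}: {'identical' if same else 'different'}")
```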
You just had more "luck" testing one over the other. Given that, and the other technical failures in this repository, I don't think the models uploaded here deserve the attention they are getting.
You are right.