Not able to reproduce benchmark metrics
Hi, congrats on the launch of the Llama-Spark model!
I am trying to reproduce some of the benchmarks but am getting different metrics from the ones reported in the model card.
For example:
Math Hard Benchmark
Here is the command I am running in the lm-eval-harness repo:
accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark" --tasks leaderboard_math_hard --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 4
Output:
Running generate_until requests: 100%|██████████████████████████████████████████████████| 169/169 [18:07<00:00, 6.43s/it]
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hf (pretrained=arcee-ai/Llama-Spark), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: 32
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.0033|±  |0.0033|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.0163|±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.0053|±  |0.0020|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0036|±  |0.0036|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.0065|±  |0.0065|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.0104|±  |0.0073|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|

|       Groups        |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|↑  |0.0053|±  | 0.002|
BBH
accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" --tasks leaderboard_bbh --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 3
Output:
Running loglikelihood requests: 100%|███████████████████████████████████████████████| 31710/31710 [20:19<00:00, 26.01it/s]
hf (pretrained=arcee-ai/Llama-Spark,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 32
|                          Tasks                           |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh                                           |N/A    |none  |     3|acc_norm|↑  |0.5046|±  |0.0063|
| - leaderboard_bbh_boolean_expressions                    |      0|none  |     3|acc_norm|↑  |0.8320|±  |0.0237|
| - leaderboard_bbh_causal_judgement                       |      0|none  |     3|acc_norm|↑  |0.5668|±  |0.0363|
| - leaderboard_bbh_date_understanding                     |      0|none  |     3|acc_norm|↑  |0.4640|±  |0.0316|
| - leaderboard_bbh_disambiguation_qa                      |      0|none  |     3|acc_norm|↑  |0.5400|±  |0.0316|
| - leaderboard_bbh_formal_fallacies                       |      0|none  |     3|acc_norm|↑  |0.5480|±  |0.0315|
| - leaderboard_bbh_geometric_shapes                       |      0|none  |     3|acc_norm|↑  |0.3800|±  |0.0308|
| - leaderboard_bbh_hyperbaton                             |      0|none  |     3|acc_norm|↑  |0.6880|±  |0.0294|
| - leaderboard_bbh_logical_deduction_five_objects         |      0|none  |     3|acc_norm|↑  |0.3720|±  |0.0306|
| - leaderboard_bbh_logical_deduction_seven_objects        |      0|none  |     3|acc_norm|↑  |0.3080|±  |0.0293|
| - leaderboard_bbh_logical_deduction_three_objects        |      0|none  |     3|acc_norm|↑  |0.5960|±  |0.0311|
| - leaderboard_bbh_movie_recommendation                   |      0|none  |     3|acc_norm|↑  |0.4640|±  |0.0316|
| - leaderboard_bbh_navigate                               |      0|none  |     3|acc_norm|↑  |0.6320|±  |0.0306|
| - leaderboard_bbh_object_counting                        |      0|none  |     3|acc_norm|↑  |0.3400|±  |0.0300|
| - leaderboard_bbh_penguins_in_a_table                    |      0|none  |     3|acc_norm|↑  |0.4315|±  |0.0411|
| - leaderboard_bbh_reasoning_about_colored_objects        |      0|none  |     3|acc_norm|↑  |0.6280|±  |0.0306|
| - leaderboard_bbh_ruin_names                             |      0|none  |     3|acc_norm|↑  |0.6360|±  |0.0305|
| - leaderboard_bbh_salient_translation_error_detection    |      0|none  |     3|acc_norm|↑  |0.5360|±  |0.0316|
| - leaderboard_bbh_snarks                                 |      0|none  |     3|acc_norm|↑  |0.6180|±  |0.0365|
| - leaderboard_bbh_sports_understanding                   |      0|none  |     3|acc_norm|↑  |0.7680|±  |0.0268|
| - leaderboard_bbh_temporal_sequences                     |      0|none  |     3|acc_norm|↑  |0.4280|±  |0.0314|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects |      0|none  |     3|acc_norm|↑  |0.2880|±  |0.0287|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|      0|none  |     3|acc_norm|↑  |0.2400|±  |0.0271|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|      0|none  |     3|acc_norm|↑  |0.3160|±  |0.0295|
| - leaderboard_bbh_web_of_lies                            |      0|none  |     3|acc_norm|↑  |0.5080|±  |0.0317|

|    Groups     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh|N/A    |none  |     3|acc_norm|↑  |0.5046|±  |0.0063|
I am getting pretty different results on both the math_hard and BBH benchmarks. Can you share the commands to reproduce the same or similar metrics? TIA! :)
The current "leaderboard" benchmark task in lm-eval-harness has some limitations. It tends to produce inconsistent results that don't align closely with the actual leaderboard. When evaluating models using this task, I recommend focusing on relative performance improvements rather than absolute scores. The results can vary significantly depending on factors such as whether you're using the leaderboard task, selecting tasks manually, adjusting batch size, or modifying other parameters. I've noted as much in the README:
Please note that these scores are consistently higher than the OpenLLM leaderboard, and should be compared to their relative performance increase, not weighed against the leaderboard.
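Reading scores as a relative increase, as the README note suggests, means comparing two models evaluated with the same harness commit and settings rather than comparing either one to the public leaderboard. A minimal sketch of that arithmetic follows; the `rel_change` helper and both scores are hypothetical placeholders, not results from this thread.

```shell
#!/bin/bash
# rel_change BASE NEW: percentage change of NEW relative to BASE.
# Hypothetical helper; both scores must come from runs with identical harness settings.
rel_change() {
  awk -v base="$1" -v new="$2" 'BEGIN {
    if (base == 0) { print "undefined (baseline is 0)"; exit }
    printf "%+.1f%%\n", (new - base) / base * 100
  }'
}

# Placeholder scores, not taken from any real run.
rel_change 0.40 0.50   # prints +25.0%
```

A baseline of zero is reported as undefined instead of dividing by zero, which matters for subtasks that score 0.0000.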
That said, these results were produced with (I believe) this commit of lm-eval-harness: 42dc244
using this script:
#!/bin/bash

# Install required packages
pip install antlr4-python3-runtime==4.11 immutabledict langdetect

# Each entry can be a local directory OR a Hugging Face repo.
# Add as many as you want to test; they run sequentially.
MODEL_PATHS=(
    arcee-ai/Llama-Spark
)

tasks=(
    "leaderboard"
)

for MODEL_PATH in "${MODEL_PATHS[@]}"; do
    MODEL_NAME=$(basename "$MODEL_PATH")
    RESULTS_DIR="./results/$MODEL_NAME"
    mkdir -p "$RESULTS_DIR"
    MODEL_ARGS="trust_remote_code=True,pretrained=$MODEL_PATH,dtype=float16"
    for TASK in "${tasks[@]}"; do
        accelerate launch -m lm_eval --model hf --model_args "$MODEL_ARGS" --task="$TASK" --batch_size 4 --output_path "$RESULTS_DIR/$TASK.json"
    done
done
I'll also rerun them here to verify - happy to update the model card if initial results were incorrect.
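To isolate the two task groups from the question while keeping the settings of the script above, a narrowed variant might look like the following. It is a sketch under assumptions: the clone URL is the public EleutherAI repository, `42dc244` is the commit the reply only tentatively recalls, and the `DRY_RUN` switch and `run` helper are hypothetical additions so the commands can be inspected before anything is installed or launched. It omits `--apply_chat_template`, `--fewshot_as_multiturn`, and `--num_fewshot` because the script above does not pass them; whether that closes the gap with the model card is not established here.

```shell
#!/bin/bash
# Sketch: rerun only math_hard and BBH with the settings from the reply's script.
# DRY_RUN and run() are illustrative helpers, not part of lm-eval-harness.
DRY_RUN="${DRY_RUN:-1}"   # default: print each command instead of executing it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

MODEL_PATH="arcee-ai/Llama-Spark"
COMMIT="42dc244"   # commit named in the reply ("I believe"), unverified
MODEL_ARGS="trust_remote_code=True,pretrained=$MODEL_PATH,dtype=float16"
TASKS="leaderboard_math_hard,leaderboard_bbh"

# Pin the harness to the stated commit and install it with the extra packages.
run git clone https://github.com/EleutherAI/lm-evaluation-harness.git
run git -C lm-evaluation-harness checkout "$COMMIT"
run pip install -e ./lm-evaluation-harness
run pip install antlr4-python3-runtime==4.11 immutabledict langdetect

# Same model arguments and batch size as the reply's script, narrowed to two task groups.
run accelerate launch -m lm_eval --model hf --model_args "$MODEL_ARGS" \
  --tasks "$TASKS" --batch_size 4 --output_path "./results/Llama-Spark"
```

Running it as-is only prints the commands; setting `DRY_RUN=0` executes them. Changing one setting at a time (for example, adding the chat-template flags back) makes it easier to see which difference moves the scores.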