Not able to reproduce benchmark metrics

by akjindal53244 - opened Aug 7, 2024

Discussion

akjindal53244

Aug 7, 2024

Hi, congrats on the launch of the Llama-Spark model!
I am trying to reproduce some of the benchmarks but am getting different metrics from the ones reported in the model card.

For example:

Math Hard Benchmark

Here is the command I am running in the lm-eval-harness repo:

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark" --tasks leaderboard_math_hard --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 4

Output:

Running generate_until requests: 100%|██████████████████████████████████████████████████| 169/169 [18:07<00:00,  6.43s/it]
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hf (pretrained=arcee-ai/Llama-Spark), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: 32
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.0033|±  |0.0033|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.0163|±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.0053|±  |0.0020|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0036|±  |0.0036|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.0065|±  |0.0065|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.0104|±  |0.0073|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|

|       Groups        |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|↑  |0.0053|±  | 0.002|
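A pasted table like the one above can be sanity-checked by parsing the rows and comparing the plain average of the subtasks with the reported group score. The following is a minimal sketch; `parse_scores` is a hypothetical helper written for this illustration, not part of lm-eval-harness, and it assumes the column layout shown above (value in the seventh cell).

```python
# Rows copied from the math_hard table above (header rows included on purpose).
TABLE = """
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.0033|±  |0.0033|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.0163|±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.0053|±  |0.0020|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0036|±  |0.0036|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.0065|±  |0.0065|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.0104|±  |0.0073|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
"""


def parse_scores(table: str) -> dict:
    """Return {task_name: value} for every data row; skip header/separator rows."""
    scores = {}
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 7:
            continue
        try:
            value = float(cells[6])  # the "Value" column in this layout
        except ValueError:
            continue  # header or separator row
        scores[cells[0].lstrip("- ")] = value
    return scores


scores = parse_scores(TABLE)
group = scores.pop("leaderboard_math_hard")      # reported group score
unweighted = sum(scores.values()) / len(scores)  # plain mean over subtasks

print(f"subtasks: {len(scores)}")
print(f"unweighted mean: {unweighted:.4f}, reported group score: {group:.4f}")
```

The two numbers need not match exactly. One possible explanation, stated here as an assumption rather than a confirmed detail, is that the group score is aggregated with weights proportional to subtask size.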

BBH

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" --tasks leaderboard_bbh --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 3

Output:

Running loglikelihood requests: 100%|███████████████████████████████████████████████| 31710/31710 [20:19<00:00, 26.01it/s]

hf (pretrained=arcee-ai/Llama-Spark,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 32
|                          Tasks                           |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh                                           |N/A    |none  |     3|acc_norm|↑  |0.5046|±  |0.0063|
| - leaderboard_bbh_boolean_expressions                    |      0|none  |     3|acc_norm|↑  |0.8320|±  |0.0237|
| - leaderboard_bbh_causal_judgement                       |      0|none  |     3|acc_norm|↑  |0.5668|±  |0.0363|
| - leaderboard_bbh_date_understanding                     |      0|none  |     3|acc_norm|↑  |0.4640|±  |0.0316|
| - leaderboard_bbh_disambiguation_qa                      |      0|none  |     3|acc_norm|↑  |0.5400|±  |0.0316|
| - leaderboard_bbh_formal_fallacies                       |      0|none  |     3|acc_norm|↑  |0.5480|±  |0.0315|
| - leaderboard_bbh_geometric_shapes                       |      0|none  |     3|acc_norm|↑  |0.3800|±  |0.0308|
| - leaderboard_bbh_hyperbaton                             |      0|none  |     3|acc_norm|↑  |0.6880|±  |0.0294|
| - leaderboard_bbh_logical_deduction_five_objects         |      0|none  |     3|acc_norm|↑  |0.3720|±  |0.0306|
| - leaderboard_bbh_logical_deduction_seven_objects        |      0|none  |     3|acc_norm|↑  |0.3080|±  |0.0293|
| - leaderboard_bbh_logical_deduction_three_objects        |      0|none  |     3|acc_norm|↑  |0.5960|±  |0.0311|
| - leaderboard_bbh_movie_recommendation                   |      0|none  |     3|acc_norm|↑  |0.4640|±  |0.0316|
| - leaderboard_bbh_navigate                               |      0|none  |     3|acc_norm|↑  |0.6320|±  |0.0306|
| - leaderboard_bbh_object_counting                        |      0|none  |     3|acc_norm|↑  |0.3400|±  |0.0300|
| - leaderboard_bbh_penguins_in_a_table                    |      0|none  |     3|acc_norm|↑  |0.4315|±  |0.0411|
| - leaderboard_bbh_reasoning_about_colored_objects        |      0|none  |     3|acc_norm|↑  |0.6280|±  |0.0306|
| - leaderboard_bbh_ruin_names                             |      0|none  |     3|acc_norm|↑  |0.6360|±  |0.0305|
| - leaderboard_bbh_salient_translation_error_detection    |      0|none  |     3|acc_norm|↑  |0.5360|±  |0.0316|
| - leaderboard_bbh_snarks                                 |      0|none  |     3|acc_norm|↑  |0.6180|±  |0.0365|
| - leaderboard_bbh_sports_understanding                   |      0|none  |     3|acc_norm|↑  |0.7680|±  |0.0268|
| - leaderboard_bbh_temporal_sequences                     |      0|none  |     3|acc_norm|↑  |0.4280|±  |0.0314|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects |      0|none  |     3|acc_norm|↑  |0.2880|±  |0.0287|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|      0|none  |     3|acc_norm|↑  |0.2400|±  |0.0271|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|      0|none  |     3|acc_norm|↑  |0.3160|±  |0.0295|
| - leaderboard_bbh_web_of_lies                            |      0|none  |     3|acc_norm|↑  |0.5080|±  |0.0317|

|    Groups     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh|N/A    |none  |     3|acc_norm|↑  |0.5046|±  |0.0063|

I am getting pretty different results on both math_hard and BBH benchmarks. Can you share the commands to reproduce same/similar metrics? TIA! :)

Crystalcareai

Arcee AI org Aug 7, 2024 • edited Aug 7, 2024

The current "leaderboard" benchmark task in lm-eval-harness has some limitations. It tends to produce inconsistent results that don't align closely with the actual leaderboard. When evaluating models using this task, I recommend focusing on relative performance improvements rather than absolute scores. The results can vary significantly depending on factors such as whether you're using the leaderboard task, selecting tasks manually, adjusting batch size, or modifying other parameters. I've noted as much in the README:

Please note that these scores are consistently higher than the OpenLLM leaderboard, and should be compared by their relative performance increase, not weighed against the leaderboard.
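As a rough illustration of reading scores relatively rather than absolutely, here is a minimal sketch. The function name `relative_gain` and both score values are hypothetical and invented for the example; they are not results from this thread or from the model card. The only assumption that matters is that both scores come from the same harness commit and the same flags.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline, both measured in the same setup."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (candidate - baseline) / baseline


# Hypothetical scores, for illustration only.
baseline_score = 0.50   # e.g. a base model under setup X
candidate_score = 0.55  # e.g. a fine-tuned model under the same setup X

gain = relative_gain(baseline_score, candidate_score)
print(f"relative gain: {gain:.1%}")
```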

That said, these results were done with (I believe) this commit of lm-eval-harness: 42dc244

using this script:

#!/bin/bash

# Install required packages
pip install antlr4-python3-runtime==4.11 immutabledict langdetect

# Each entry can be a local directory OR a Hugging Face repo.
# Add as many as you want to test; they are run sequentially.
MODEL_PATHS=(
  arcee-ai/Llama-Spark
)

# lm-eval-harness tasks to run for each model
tasks=(
  "leaderboard"
)

for MODEL_PATH in "${MODEL_PATHS[@]}"; do
  MODEL_NAME=$(basename "$MODEL_PATH")
  RESULTS_DIR="./results/$MODEL_NAME"
  mkdir -p "$RESULTS_DIR"

  MODEL_ARGS="trust_remote_code=True,pretrained=$MODEL_PATH,dtype=float16"

  for TASK in "${tasks[@]}"; do
    accelerate launch -m lm_eval --model hf --model_args "$MODEL_ARGS" --tasks "$TASK" --batch_size 4 --output_path "$RESULTS_DIR/$TASK.json"
  done
done
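Since results are said to vary with the evaluation parameters, it may help to list exactly which flags differ between the BBH command posted earlier in the thread and the command this script runs. The sketch below is one way to do that. `parse_flags` and `diff_flags` are hypothetical helpers, the parser is deliberately simplistic, and the two command strings are transcribed from the thread with the shell variables expanded by hand (the task flag is written as `--tasks`).

```python
import shlex

# BBH command from the original post.
POSTED = (
    'accelerate launch -m lm_eval --model hf '
    '--model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" '
    '--tasks leaderboard_bbh --batch_size 32 '
    '--apply_chat_template --fewshot_as_multiturn --num_fewshot 3'
)

# Command run by the script, with $MODEL_ARGS, $TASK and $RESULTS_DIR expanded by hand.
SCRIPT = (
    'accelerate launch -m lm_eval --model hf '
    '--model_args "trust_remote_code=True,pretrained=arcee-ai/Llama-Spark,dtype=float16" '
    '--tasks=leaderboard --batch_size 4 '
    '--output_path "./results/Llama-Spark/leaderboard.json"'
)


def parse_flags(command: str) -> dict:
    """Map each --flag to its value, or to True for a bare switch."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            if "=" in tok:                       # --flag=value
                name, value = tok.split("=", 1)
                flags[name] = value
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
                flags[tok] = tokens[i + 1]       # --flag value
                i += 1
            else:
                flags[tok] = True                # bare switch
        i += 1
    return flags


def diff_flags(a: dict, b: dict) -> dict:
    """Return {flag: (value_in_a, value_in_b)} for every flag that differs."""
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


differences = diff_flags(parse_flags(POSTED), parse_flags(SCRIPT))
for flag, (posted, script) in differences.items():
    print(f"{flag}: posted={posted!r} script={script!r}")
```

Among other things, this surfaces that the posted command applies the chat template, uses multi-turn few-shot prompting, and sets an explicit few-shot count, while the script does none of these, and that the dtype and batch size also differ. Whether these differences account for the gap in scores is not established in this thread.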

Crystalcareai

Arcee AI org Aug 7, 2024

I'll also rerun them here to verify - happy to update the model card if initial results were incorrect.

akjindal53244

Aug 7, 2024

Thank you @Crystalcareai for rerunning. Kindly share the results once you have them ready :)

Crystalcareai changed discussion status to closed Sep 9, 2024
