v0.46.1

See https://github.com/quic/ai-hub-models/releases/v0.46.1 for changelog.

README.md

# Llama-v2-7B-Chat: Optimized for Qualcomm Devices

Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to w4a16 (4-bit weights and 16-bit activations), with part of the model quantized to w8a16 (8-bit weights and 16-bit activations), making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is Llama-PromptProcessor-Quantized's latency and the average time per additional token is Llama-TokenGenerator-KVCache-Quantized's latency.
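The w4a16 idea can be illustrated with a toy symmetric quantizer. This is a minimal sketch only, not the scheme used for the actual export (production pipelines typically use per-channel scales and calibration data); `quantize_w4` and `dequantize` are hypothetical helper names for illustration.

```python
import numpy as np

def quantize_w4(weights):
    """Toy symmetric 4-bit weight quantization (the 'w4' in w4a16):
    map floats to the 16 integer levels in [-8, 7] with one per-tensor scale."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights for computation at higher precision.
    return q.astype(np.float32) * scale

w = np.array([0.42, -0.13, 0.07, -0.9])
q, s = quantize_w4(w)
print(q)                              # 4-bit integer codes
print(np.max(np.abs(w - dequantize(q, s))))  # error bounded by ~scale/2
```

Only 4 bits per weight are stored; activations stay in 16-bit floating point, which is why the reconstruction error above stays small relative to the weight magnitudes.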

This is based on the implementation of Llama-v2-7B-Chat found [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
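A back-of-envelope calculation shows why 4-bit weights matter for on-device deployment. This is a sketch under the assumption of roughly 7 billion parameters; the real export also stores quantization scales, the w8a16 layers, and runtime buffers, so actual sizes differ.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    # Weight-only storage: params * bits / 8 bytes, reported in decimal GB.
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # Llama-v2-7B has roughly 7 billion parameters
print(f"fp16 weights: {weight_footprint_gb(n, 16):.1f} GB")   # 14.0 GB
print(f"4-bit weights: {weight_footprint_gb(n, 4):.1f} GB")   # 3.5 GB
```

Cutting the weight footprint by ~4x is the difference between fitting in a phone's memory budget and not.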

This repository contains pre-exported model files optimized for Qualcomm® devices. You can use the [Qualcomm® AI Hub Models](https://github.com/quic/ai-hub-models/blob/main/qai_hub_models/models/llama_v2_7b_chat) library to export with custom configurations. More details on model performance across various devices can be found [here](#performance-summary).

Qualcomm AI Hub Models uses [Qualcomm AI Hub Workbench](https://workbench.aihub.qualcomm.com) to compile, profile, and evaluate this model. [Sign up](https://myaccount.qualcomm.com/signup) to run these models on a hosted Qualcomm® device.

## Deploying Llama 2 on-device

## Getting Started

Due to licensing restrictions, we cannot distribute pre-exported model assets for this model.

Use the [Qualcomm® AI Hub Models](https://github.com/quic/ai-hub-models/blob/main/qai_hub_models/models/llama_v2_7b_chat) Python library to compile and export the model with your own:

- Custom weights (e.g., fine-tuned checkpoints)
- Custom input shapes
- Target device and runtime configurations

See the [Llama-v2-7B-Chat repository on GitHub](https://github.com/quic/ai-hub-models/blob/main/qai_hub_models/models/llama_v2_7b_chat) for usage instructions.

## Model Details

**Model Type:** Text generation

**Model Stats:**

- Input sequence length for Prompt Processor: 1024
- Context length: 1024
- Precision: w4a16 + w8a16 (a few layers)
- Supported languages: English
- TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. For Llama-v2-7B-Chat, both values in the range are the same since the prompt length is the full context length (1024 tokens).
- Response Rate: Rate of response generation after the first response token.
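Taken together, TTFT and response rate give a rough end-to-end latency estimate. This is a sketch under the assumption that every token after the first arrives at the steady response rate; `total_generation_time` is a hypothetical helper, not part of any Qualcomm API.

```python
def total_generation_time(ttft_s: float, rate_tps: float, n_output_tokens: int) -> float:
    # First token arrives after TTFT; each remaining token takes 1/rate seconds.
    if n_output_tokens < 1:
        return 0.0
    return ttft_s + (n_output_tokens - 1) / rate_tps

# Using the Snapdragon 8 Gen 3 figures from the table below
# (TTFT ~1.496 s, ~12.85 tokens/s), a 128-token reply takes about:
print(f"{total_generation_time(1.496, 12.85, 128):.1f} s")  # ~11.4 s
```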

## Performance Summary

| Model | Runtime | Precision | Chipset | Context Length | Response Rate (tokens per second) | Time To First Token (range, seconds) |
|---|---|---|---|---|---|---|
| Llama-v2-7B-Chat | QNN_CONTEXT_BINARY | w4a16 | Snapdragon® 8 Elite Mobile | 1024 | 17.94 | 1.44 - 1.44 |
| Llama-v2-7B-Chat | QNN_CONTEXT_BINARY | w4a16 | Snapdragon® X Elite | 1024 | 11.2 | 1.919 - 1.919 |
| Llama-v2-7B-Chat | QNN_CONTEXT_BINARY | w4a16 | Snapdragon® 8 Gen 3 Mobile | 1024 | 12.85 | 1.49583 - 1.49583 |

## License

* The license for the original implementation of Llama-v2-7B-Chat can be found [here](https://github.com/facebookresearch/llama/blob/main/LICENSE).

## References

* [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
* [Source Model Implementation](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

## Community

* Join [our AI Hub Slack community](https://aihub.qualcomm.com/community/slack) to collaborate, post questions and learn more about on-device AI.
* For questions or feedback please [reach out to us](mailto:ai-hub-support@qti.qualcomm.com).

## Usage and Limitations

This model may not be used for or in connection with any of the following applications:

- Accessing essential private and public services and benefits;
- Administration of justice and democratic processes;