v0.45.0
See https://github.com/quic/ai-hub-models/releases/v0.45.0 for changelog.
README.md
CHANGED
|
@@ -33,22 +33,13 @@ More details on model performance across various devices, can be found
|
|
| 33 |
- Input sequence length for Prompt Processor: 128
|
| 34 |
- Context length: 4096
|
| 35 |
- Precision: w4a16 + w8a16 (few layers)
|
| 36 |
-
- Num of key-value heads: 4
|
| 37 |
-
- Model-1 (Prompt Processor): PromptProcessor
|
| 38 |
-
- Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
|
| 39 |
-
- Prompt processor output: 128 output tokens + KV cache outputs
|
| 40 |
-
- Model-2 (Token Generator): TokenGenerator
|
| 41 |
-
- Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
|
| 42 |
-
- Token generator output: 1 output token + KV cache outputs
|
| 43 |
-
- Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
|
| 44 |
- Supported languages: English, French, Spanish, Portuguese.
|
| 45 |
-
- Minimum QNN SDK version required: 2.28.2
|
| 46 |
- TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
|
| 47 |
- Response Rate: Rate of response generation after the first response token.
|
| 48 |
|
| 49 |
| Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds)
|
| 50 |
|---|---|---|---|---|---|
|
| 51 |
-
| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite
|
| 52 |
| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 14.02985 | 0.1265205 - 4.048656 | -- | Use Export Script |
|
| 53 |
| Falcon3-7B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.96829 | 0.1973798 - 6.3161536 | -- | Use Export Script |
|
| 54 |
|
|
|
|
| 33 |
- Input sequence length for Prompt Processor: 128
|
| 34 |
- Context length: 4096
|
| 35 |
- Precision: w4a16 + w8a16 (few layers)
|
| 36 |
- Supported languages: English, French, Spanish, Portuguese.
|
| 37 |
- TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
|
| 38 |
- Response Rate: Rate of response generation after the first response token.
|
| 39 |
|
| 40 |
| Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds)
|
| 41 |
|---|---|---|---|---|---|
|
| 42 |
+
| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen 5 Mobile | GENIE | 15.8303 | 0.10903 - 3.488966 | -- | Use Export Script |
|
| 43 |
| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 14.02985 | 0.1265205 - 4.048656 | -- | Use Export Script |
|
| 44 |
| Falcon3-7B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.96829 | 0.1973798 - 6.3161536 | -- | Use Export Script |
|
| 45 |
|
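The TTFT ranges in the table can be checked with a little arithmetic. The sketch below assumes that TTFT grows linearly with the number of prompt-processor passes, so the upper bound is the single-pass time multiplied by (context length / prompt-processor sequence length) = 4096 / 128 = 32. The README does not state this relationship explicitly; it is an assumption that the published rows appear consistent with (for example, 0.1265205 × 32 = 4.048656). The function names `prompt_passes` and `ttft_range` are hypothetical and not part of any Qualcomm API.

```python
import math


def prompt_passes(prompt_tokens: int, chunk: int = 128) -> int:
    """Number of prompt-processor iterations needed for a prompt.

    Assumes the prompt is consumed in fixed chunks of `chunk` tokens
    (128 in this README) and that a partial chunk costs a full pass.
    """
    return max(1, math.ceil(prompt_tokens / chunk))


def ttft_range(single_pass_s: float, context_length: int = 4096,
               chunk: int = 128) -> tuple[float, float]:
    """Return (lower, upper) TTFT in seconds.

    Lower bound: one pass (prompt of up to `chunk` tokens).
    Upper bound: a prompt filling the whole context, assuming linear scaling.
    """
    upper = single_pass_s * prompt_passes(context_length, chunk)
    return single_pass_s, upper


# Single-pass times taken from the lower bounds in the table above.
for device, single_pass in [
    ("Snapdragon 8 Elite QRD", 0.1265205),
    ("Snapdragon X Elite CRD", 0.1973798),
]:
    low, high = ttft_range(single_pass)
    print(f"{device}: {low} - {high:.7f} s")
```

Under the same assumption, an intermediate prompt of, say, 1000 tokens would need `prompt_passes(1000)` = 8 passes, so its estimated TTFT would be roughly eight times the lower bound for that device.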