qaihm-bot committed on
Commit 8fef352 · verified · 1 Parent(s): 8f6c925

See https://github.com/quic/ai-hub-models/releases/v0.45.0 for changelog.

Files changed (1)
  1. README.md +1 -10
README.md CHANGED
@@ -33,22 +33,13 @@ More details on model performance across various devices, can be found
  - Input sequence length for Prompt Processor: 128
  - Context length: 4096
  - Precision: w4a16 + w8a16 (few layers)
- - Num of key-value heads: 4
- - Model-1 (Prompt Processor): PromptProcessor
- - Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
- - Prompt processor output: 128 output tokens + KV cache outputs
- - Model-2 (Token Generator): TokenGenerator
- - Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
- - Token generator output: 1 output token + KV cache outputs
- - Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
  - Supported languages: English, French, Spanish, Portuguese.
- - Minimum QNN SDK version required: 2.28.2
  - TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
  - Response Rate: Rate of response generation after the first response token.

  | Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds)
  |---|---|---|---|---|---|
- | Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen5 Mobile | GENIE | 15.8303 | 0.10903 - 3.488966 | -- | Use Export Script |
+ | Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen 5 Mobile | GENIE | 15.8303 | 0.10903 - 3.488966 | -- | Use Export Script |
  | Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 14.02985 | 0.1265205 - 4.048656 | -- | Use Export Script |
  | Falcon3-7B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.96829 | 0.1973798 - 6.3161536 | -- | Use Export Script |
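The TTFT range in the README follows from the 128-token prompt processor window: a prompt longer than 128 tokens needs several prompt processor iterations before the first response token appears. The sketch below is a rough illustration only, assuming each iteration costs about the same time (which the table's figures are consistent with, since each upper bound is roughly 32 times its lower bound). The function names `prompt_processor_iterations` and `estimate_ttft` are hypothetical and are not part of the ai-hub-models API.

```python
import math

# Values taken from the README in this diff.
PROMPT_CHUNK_TOKENS = 128  # input sequence length of the prompt processor
CONTEXT_LENGTH = 4096      # maximum context length


def prompt_processor_iterations(prompt_tokens: int) -> int:
    """Number of 128-token prompt processor passes needed for a prompt."""
    if not 1 <= prompt_tokens <= CONTEXT_LENGTH:
        raise ValueError(f"prompt_tokens must be in [1, {CONTEXT_LENGTH}]")
    return math.ceil(prompt_tokens / PROMPT_CHUNK_TOKENS)


def estimate_ttft(prompt_tokens: int, seconds_per_iteration: float) -> float:
    """Rough TTFT estimate.

    Assumption (not stated in the README): every prompt processor
    iteration takes the same time, so TTFT scales linearly with the
    number of iterations.
    """
    return prompt_processor_iterations(prompt_tokens) * seconds_per_iteration


if __name__ == "__main__":
    # Lower TTFT bound for the Snapdragon 8 Elite Gen 5 QRD row of the table.
    per_iteration = 0.10903
    for tokens in (100, 128, 1000, 4096):
        iterations = prompt_processor_iterations(tokens)
        ttft = estimate_ttft(tokens, per_iteration)
        print(f"{tokens:>5} tokens -> {iterations:>2} iterations, ~{ttft:.3f} s")
```

For a full 4096-token prompt this gives 32 iterations and an estimate close to the 3.488966 s upper bound reported for that device. Measured values on real hardware may deviate from this linear model.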