qualcomm/Falcon3-7B-Instruct

@@ -33,22 +33,13 @@ More details on model performance across various devices, can be found
   - Input sequence length for Prompt Processor: 128
   - Context length: 4096
   - Precision: w4a16 + w8a16 (few layers)
-  - Num of key-value heads: 4
-  - Model-1 (Prompt Processor): PromptProcessor
-  - Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
-  - Prompt processor output: 128 output tokens + KV cache outputs
-  - Model-2 (Token Generator): TokenGenerator
-  - Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
-  - Token generator output: 1 output token + KV cache outputs
-  - Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
   - Supported languages: English, French, Spanish, Portuguese.
-  - Minimum QNN SDK version required: 2.28.2
   - TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
   - Response Rate: Rate of response generation after the first response token.
 | Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds)
 |---|---|---|---|---|---|
-| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen5 Mobile | GENIE | 15.8303 | 0.10903 - 3.488966 | -- | Use Export Script |
 | Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 14.02985 | 0.1265205 - 4.048656 | -- | Use Export Script |
 | Falcon3-7B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.96829 | 0.1973798 - 6.3161536 | -- | Use Export Script |

   - Input sequence length for Prompt Processor: 128
   - Context length: 4096
   - Precision: w4a16 + w8a16 (few layers)
   - Supported languages: English, French, Spanish, Portuguese.
   - TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
   - Response Rate: Rate of response generation after the first response token.
 | Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds)
 |---|---|---|---|---|---|
+| Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen 5 Mobile | GENIE | 15.8303 | 0.10903 - 3.488966 | -- | Use Export Script |
 | Falcon3-7B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 14.02985 | 0.1265205 - 4.048656 | -- | Use Export Script |
 | Falcon3-7B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.96829 | 0.1973798 - 6.3161536 | -- | Use Export Script |
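
The TTFT bullet above describes the range as one prompt-processor iteration (up to 128 tokens) at the low end and a full 4096-token context at the high end. A minimal sketch of that arithmetic follows, assuming TTFT scales linearly with the number of 128-token iterations. The model card does not state this, but the table is consistent with it: each upper bound is about 32 times its lower bound, and 4096 / 128 = 32. The function name `estimate_ttft` and its parameters are hypothetical and are not part of any Qualcomm tooling.

```python
import math

# Values taken from the model card's spec list.
PROMPT_CHUNK = 128     # input sequence length for the prompt processor
CONTEXT_LENGTH = 4096  # maximum context length


def estimate_ttft(prompt_tokens: int, ttft_lower_s: float) -> float:
    """Estimate time to first token, in seconds, for a given prompt length.

    Assumption (not stated in the model card): each 128-token
    prompt-processor iteration costs the same time, namely the lower
    bound of the TTFT range reported for the device.
    """
    if not 1 <= prompt_tokens <= CONTEXT_LENGTH:
        raise ValueError(f"prompt_tokens must be in 1..{CONTEXT_LENGTH}")
    # A partially filled chunk still needs a full iteration.
    iterations = math.ceil(prompt_tokens / PROMPT_CHUNK)
    return iterations * ttft_lower_s


if __name__ == "__main__":
    # Lower bounds of the TTFT ranges from the table rows above.
    lower_bounds = {
        "Snapdragon 8 Elite Gen 5 QRD": 0.10903,
        "Snapdragon 8 Elite QRD": 0.1265205,
        "Snapdragon X Elite CRD": 0.1973798,
    }
    for device, lower in lower_bounds.items():
        full = estimate_ttft(CONTEXT_LENGTH, lower)
        print(f"{device}: estimated TTFT at full context = {full:.4f} s")
```

Under this assumption, a 1000-token prompt on the Snapdragon 8 Elite QRD needs 8 iterations, which gives an estimate of roughly 1.01 seconds. Treat this as a rough interpolation between the two published bounds, not a measured figure.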