**Tip:** Our inference code is still being updated. You can pass `--include '*.py'` to `huggingface-cli` to fetch only the updated inference code instead of re-downloading the whole model.
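For example, assuming the model lives in a Hugging Face repo (the repo id below is a placeholder), the code-only update can be fetched with:

```shell
# Fetch only the updated *.py inference files; skip the large weight files.
# "your-org/your-model" is a placeholder repo id, not this repository's real id.
huggingface-cli download your-org/your-model --include '*.py' --local-dir ./your-model
```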
---

### 1. Inference w/o. Efficiency Optimization

```python
from transformers import AutoTokenizer, AutoModel, AutoConfig, BitsAndBytesConfig
import torch

# ... (model loading and generation code elided in this excerpt)

print(response)
```
---

### 2. Inference w. Chunk-based Pre-filling

Chunk-based prefill significantly reduces memory demands and response latency by encoding the video input in a streaming manner. The advantage becomes particularly noticeable with longer videos.
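Conceptually, chunk-based prefill encodes the video a fixed-size chunk at a time instead of all frames at once, so the peak working set is one chunk rather than the full video. A toy, framework-free sketch of the idea (all names here are illustrative, not this repo's API):

```python
def chunked_prefill(frames, chunk_size, encode):
    """Encode frames chunk-by-chunk, streaming states into a KV-cache-like list.

    Only one chunk of frames is held in the working set at a time; the
    accumulated cache is what later decoding steps would attend over.
    """
    kv_cache = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        kv_cache.extend(encode(chunk))  # append this chunk's encoded states
    return kv_cache

# Toy "encoder": one state per frame (doubling stands in for real encoding).
frames = list(range(10))
cache = chunked_prefill(frames, chunk_size=4, encode=lambda c: [f * 2 for f in c])
```

The final cache is identical to what a one-shot encode of all frames would produce; only the peak memory profile differs.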
To enable this mode, set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:

```python
# ... (full chunk-prefill example elided in this excerpt)

print(response)
```
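A hypothetical shape for these arguments is sketched below; the keys inside `prefill_config` are assumptions for illustration, not documented values from this repository:

```python
# Hypothetical sketch -- parameter names inside prefill_config are
# illustrative assumptions, not this repository's documented options.
generation_kwargs = dict(
    enable_chunk_prefill=True,   # turn on streaming (chunked) prefill
    prefill_config={
        "chunk_size": 4,         # frames encoded per chunk (assumed name)
    },
)
```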
---

### 3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding

Coming soon.