Default to flash_attention_2; document bf16 dtype contract and lang RoPE numerics
- README.md +14 -3
- config.json +2 -1
README.md
CHANGED
|
@@ -41,15 +41,26 @@
|
|
| 41 |
> model = AutoModel.from_pretrained(
|
| 42 |
> "path/to/onevision-encoder-large-lang-tf57",
|
| 43 |
> trust_remote_code=True,
|
| 44 |
-
>
|
| 45 |
-
>
|
| 46 |
> # default grid path
|
| 47 |
> out = model(pixel_values=images)
|
| 48 |
> # explicit per-patch positions (lang-only)
|
| 49 |
> out = model(pixel_values=images, patch_positions=patch_positions)
|
| 50 |
> ```
|
| 51 |
>
|
| 52 |
-
>
|
| 53 |
>
|
| 54 |
> ## Equivalence verification
|
| 55 |
>
|
|
|
|
| 41 |
> model = AutoModel.from_pretrained(
|
| 42 |
> "path/to/onevision-encoder-large-lang-tf57",
|
| 43 |
> trust_remote_code=True,
|
| 44 |
+
> ) # default attn_implementation = "flash_attention_2" (set in config.json)
|
| 45 |
+
>
|
| 46 |
> # default grid path
|
| 47 |
> out = model(pixel_values=images)
|
| 48 |
> # explicit per-patch positions (lang-only)
|
| 49 |
> out = model(pixel_values=images, patch_positions=patch_positions)
|
| 50 |
> ```
|
| 51 |
>
|
| 52 |
+
> Override the default if you need a different backend:
|
| 53 |
+
>
|
| 54 |
+
> ```python
|
| 55 |
+
> model = AutoModel.from_pretrained(..., attn_implementation="sdpa")
|
| 56 |
+
> # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention"
|
| 57 |
+
> ```
|
| 58 |
+
>
|
| 59 |
+
> **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`.
|
| 60 |
+
>
|
| 61 |
+
> **Numerical note (lang variant)**: Unlike the `large` variant, attention backends are NOT numerically equivalent in `bf16` for this model — `eager` and `flash_attention_2`/`sdpa` differ in `max_diff` up to several hundred in absolute value (mean diff < 0.1, std preserved). This is due to the lang variant intentionally keeping RoPE `cos`/`sin` in `q.dtype` (bf16) instead of upcasting to `fp32` like the `large` variant. The model still trains/serves correctly on any backend, but if you need strict numerical reproducibility against the upstream model, use `attn_implementation="eager"` in `bf16` or any backend in `fp32`.
|
| 62 |
+
>
|
| 63 |
+
> Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`.
|
| 64 |
>
|
| 65 |
> ## Equivalence verification
|
| 66 |
>
|
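The numerical note in the README change attributes the backend gap to RoPE `cos`/`sin` staying in `bf16`. As a rough illustration only, the sketch below rounds RoPE-style `cos`/`sin` values to `bf16` and to `fp32` in plain Python and compares the worst-case rounding error. The helper names, `head_dim=8`, 64 positions and `base=10000.0` are assumptions for the demo, not values taken from this model, and the sketch does not reproduce the `max_diff` figures quoted in the README.

```python
import math
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a finite Python float to the nearest bfloat16 value.

    bfloat16 is the top 16 bits of a float32, so we round the float32
    bit pattern to 16 bits using round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rope_angles(num_positions: int, head_dim: int, base: float = 10000.0):
    """Angles position * inv_freq for a standard RoPE table (assumed layout)."""
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [p * f for p in range(num_positions) for f in inv_freq]


def max_rounding_error(angles, rounder) -> float:
    """Largest |rounded - exact| over cos and sin of every angle."""
    return max(
        abs(rounder(fn(a)) - fn(a))
        for a in angles
        for fn in (math.cos, math.sin)
    )


angles = rope_angles(num_positions=64, head_dim=8)
bf16_err = max_rounding_error(angles, round_to_bf16)
fp32_err = max_rounding_error(angles, round_to_fp32)
print(f"max cos/sin rounding error in bf16: {bf16_err:.3e}")
print(f"max cos/sin rounding error in fp32: {fp32_err:.3e}")
```

Each `bf16` table entry can be off by up to roughly `2**-9`, several orders of magnitude more than in `fp32`. How that error propagates through attention depends on the backend and is not modeled here.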
config.json
CHANGED
|
@@ -23,5 +23,6 @@
|
|
| 23 |
"auto_map": {
|
| 24 |
"AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
|
| 25 |
"AutoModel": "modeling_onevision_encoder.OneVisionEncoderModel"
|
| 26 |
-
}
|
|
|
|
| 27 |
}
|
|
|
|
| 23 |
"auto_map": {
|
| 24 |
"AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
|
| 25 |
"AutoModel": "modeling_onevision_encoder.OneVisionEncoderModel"
|
| 26 |
+
},
|
| 27 |
+
"_attn_implementation": "flash_attention_2"
|
| 28 |
}
|
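With `flash_attention_2` now the config default, the dtype contract from the README change can be checked before loading. A minimal sketch, assuming a hypothetical helper `pick_attn_implementation` (not part of `transformers` or this repository) that takes the dtype as a plain string and falls back to `sdpa` when `flash_attention_2` is requested with a non-half dtype:

```python
# Backend names as listed in the README change.
SUPPORTED_BACKENDS = ("flash_attention_2", "sdpa", "eager", "flex_attention")
# Dtypes the README says flash_attention_2 accepts.
HALF_DTYPES = frozenset({"float16", "bfloat16"})


def pick_attn_implementation(
    dtype_name: str,
    requested: str = "flash_attention_2",
    fallback: str = "sdpa",
) -> str:
    """Return a backend name that respects the documented dtype contract.

    Hypothetical helper for illustration: `dtype_name` is a string such as
    "bfloat16" or "float32", not a torch.dtype object.
    """
    if requested not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {requested!r}")
    if fallback not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported fallback: {fallback!r}")
    if requested == "flash_attention_2" and dtype_name not in HALF_DTYPES:
        # fp32 inputs: the README recommends "sdpa" or "eager" instead.
        return fallback
    return requested


# Intended use (model loading itself is not executed in this sketch):
# model = AutoModel.from_pretrained(
#     ..., attn_implementation=pick_attn_implementation("float32")
# )
print(pick_attn_implementation("bfloat16"))
print(pick_attn_implementation("float32"))
```

Passing `fallback="eager"` instead matches the README's advice for strict reproducibility against the upstream model.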