xiangan committed
Commit fd1d072 · verified · 1 parent: 5068df4

Default to flash_attention_2; document bf16 dtype contract and lang RoPE numerics

Files changed (2)
  1. README.md +14 -3
  2. config.json +2 -1
README.md CHANGED
@@ -41,15 +41,26 @@
  > model = AutoModel.from_pretrained(
  > "path/to/onevision-encoder-large-lang-tf57",
  > trust_remote_code=True,
- > attn_implementation="eager", # or "sdpa", "flash_attention_2", "flex_attention"
- > )
+ > ) # default attn_implementation = "flash_attention_2" (set in config.json)
+ >
  > # default grid path
  > out = model(pixel_values=images)
  > # explicit per-patch positions (lang-only)
  > out = model(pixel_values=images, patch_positions=patch_positions)
  > ```
  >
- > Tested with `transformers==5.7.0`, `torch>=2.4`.
+ > Override the default if you need a different backend:
+ >
+ > ```python
+ > model = AutoModel.from_pretrained(..., attn_implementation="sdpa")
+ > # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention"
+ > ```
+ >
+ > **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`.
+ >
+ > **Numerical note (lang variant)**: Unlike the `large` variant, attention backends are NOT numerically equivalent in `bf16` for this model — `eager` and `flash_attention_2`/`sdpa` differ in `max_diff` up to several hundred in absolute value (mean diff < 0.1, std preserved). This is due to the lang variant intentionally keeping RoPE `cos`/`sin` in `q.dtype` (bf16) instead of upcasting to `fp32` like the `large` variant. The model still trains/serves correctly on any backend, but if you need strict numerical reproducibility against the upstream model, use `attn_implementation="eager"` in `bf16` or any backend in `fp32`.
+ >
+ > Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`.
  >
  > ## Equivalence verification
  >
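The numerical note attributes the backend drift to keeping RoPE `cos`/`sin` in bf16 rather than upcasting them to fp32. The sketch below shows that rounding step in pure Python. It makes several assumptions that this commit does not confirm:

- It uses the common pairwise RoPE rotation with base 10000 and head dim 64; the encoder's actual RoPE parameters and cast site are not shown here.
- Python floats (float64) stand in for the fp32 reference.
- `to_bf16`, `rope_pair` and `max_cos_sin_error` are hypothetical illustration helpers.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the upper 16 bits of an IEEE-754 float32:
    1 sign bit, 8 exponent bits and 7 explicit mantissa bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the 16 dropped bits
    bits &= 0xFFFF0000                   # zero the dropped bits, keep the bf16 part
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def rope_pair(x0: float, x1: float, pos: int, i: int, dim: int,
              base: float = 10000.0, cast=None) -> tuple[float, float]:
    """Rotate one (x0, x1) channel pair by the RoPE angle pos * base**(-2i/dim).

    `cast` is applied to cos/sin only, mimicking "keep cos/sin in q.dtype";
    the angle itself stays in full precision.
    """
    angle = pos * base ** (-2.0 * i / dim)
    c, s = math.cos(angle), math.sin(angle)
    if cast is not None:
        c, s = cast(c), cast(s)
    return x0 * c - x1 * s, x0 * s + x1 * c


def max_cos_sin_error(max_pos: int = 1024, dim: int = 64, base: float = 10000.0) -> float:
    """Largest |v - bf16(v)| over all RoPE cos/sin entries on a position grid."""
    worst = 0.0
    for pos in range(max_pos):
        for i in range(dim // 2):
            angle = pos * base ** (-2.0 * i / dim)
            for v in (math.cos(angle), math.sin(angle)):
                worst = max(worst, abs(v - to_bf16(v)))
    return worst


if __name__ == "__main__":
    # Per-entry rounding error introduced by storing cos/sin in bf16.
    print("worst cos/sin rounding error:", max_cos_sin_error())
    ref = rope_pair(1.0, 2.0, pos=1000, i=1, dim=64)                 # full-precision cos/sin
    low = rope_pair(1.0, 2.0, pos=1000, i=1, dim=64, cast=to_bf16)   # bf16 cos/sin
    print("full precision:", ref, "bf16 cos/sin:", low)
    print("norm drift:", math.hypot(*low) - math.hypot(1.0, 2.0))
```

bf16 keeps only 8 significant bits, so each `cos`/`sin` entry moves by at most about 2**-9. The rotation then no longer exactly preserves the pair's norm. This sketch does not measure how such per-element errors grow into the model-level `max_diff` figures quoted in the README.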
config.json CHANGED
@@ -23,5 +23,6 @@
  "auto_map": {
  "AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
  "AutoModel": "modeling_onevision_encoder.OneVisionEncoderModel"
- }
+ },
+ "_attn_implementation": "flash_attention_2"
  }
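This hunk makes `flash_attention_2` the checkpoint's default. The README adds two rules: an explicit `attn_implementation` argument overrides the default, and `flash_attention_2` needs fp16/bf16 inputs.

The hypothetical helper below, `resolve_attn_implementation`, sketches that precedence and the dtype check. It is not part of transformers or this repository, and dtypes are plain strings here rather than `torch.dtype` objects.

```python
import json

SUPPORTED = ("flash_attention_2", "sdpa", "eager", "flex_attention")
HALF_DTYPES = ("float16", "bfloat16")

# The relevant part of config.json after this commit.
CONFIG_TEXT = json.dumps({
    "auto_map": {
        "AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
        "AutoModel": "modeling_onevision_encoder.OneVisionEncoderModel",
    },
    "_attn_implementation": "flash_attention_2",
})


def resolve_attn_implementation(config_text: str, override=None, dtype: str = "bfloat16") -> str:
    """Hypothetical helper: pick the backend the README describes.

    An explicit override wins; otherwise the config.json default is used.
    """
    config = json.loads(config_text)
    backend = override or config["_attn_implementation"]
    if backend not in SUPPORTED:
        raise ValueError(f"unsupported attn_implementation: {backend!r}")
    # Dtype contract: flash_attention_2 only accepts fp16/bf16 inputs.
    if backend == "flash_attention_2" and dtype not in HALF_DTYPES:
        raise ValueError(
            f"flash_attention_2 needs float16/bfloat16 inputs, got {dtype}; "
            'override with attn_implementation="sdpa" or "eager" for float32'
        )
    return backend


if __name__ == "__main__":
    print(resolve_attn_implementation(CONFIG_TEXT))                                    # config default
    print(resolve_attn_implementation(CONFIG_TEXT, override="sdpa", dtype="float32"))  # fp32 override
```

In real use the override is passed straight to `AutoModel.from_pretrained`, as the README shows. The sketch only makes the fp32 incompatibility of the default backend explicit before any weights are loaded.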