Great work bringing PaddleOCR-VL to MLX
Hi!
Thanks for sharing this — it’s great to see PaddleOCR-VL running natively on MLX.
We're from the ERNIE team. This is something we’ve been very interested in, and it’s exciting to see it working so well in the community.
The MLX-native approach and NPU utilization are particularly impressive.
If you’re open to it, we’d love to learn more about your experience with the MLX porting process. We're happy to continue the discussion here, or anywhere that’s convenient for you.
Really appreciate the work you’ve put into this!
Hi! I’m trying to load the model with trust_remote_code=True.
In the current revision, modeling_paddleocr_vl.py imports PaddleOCRVisionConfig, but configuration_paddleocr_vl.py only defines PaddleOCRVLConfig (no PaddleOCRVisionConfig), so loading fails with an ImportError.
Edit: I realize now that model weight loading hasn't been implemented yet in this file, which is why it doesn't work.
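For anyone hitting the same ImportError, here is a minimal sketch of how the mismatch can be confirmed without loading the model. It compares the names one file imports against the classes the other file defines. The two source strings below are stand-ins I wrote for illustration, not the actual repo files; in practice you would read the real configuration_paddleocr_vl.py and modeling_paddleocr_vl.py from disk. The helper names are hypothetical.

```python
import ast


def defined_classes(source: str) -> set[str]:
    """Return the names of top-level classes defined in a Python source string."""
    tree = ast.parse(source)
    return {node.name for node in tree.body if isinstance(node, ast.ClassDef)}


def imported_names(source: str, module_suffix: str) -> set[str]:
    """Return names pulled in via `from <...module_suffix> import ...`."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.ImportFrom)
            and node.module
            and node.module.endswith(module_suffix)
        ):
            names.update(alias.name for alias in node.names)
    return names


# Stand-in snippets (assumptions, not the real file contents).
config_src = (
    "class PretrainedConfig:\n"
    "    pass\n"
    "class PaddleOCRVLConfig(PretrainedConfig):\n"
    "    pass\n"
)
modeling_src = (
    "from .configuration_paddleocr_vl import "
    "PaddleOCRVLConfig, PaddleOCRVisionConfig\n"
)

# Names the modeling file expects but the config file never defines.
missing = imported_names(modeling_src, "configuration_paddleocr_vl") - defined_classes(config_src)
print(sorted(missing))
```

Any name left in `missing` is one that the import would fail on, which matches what I saw with PaddleOCRVisionConfig.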
Were you able to get the MLX model working? The example code in the README does not work. I also tried the code in real_inference.py, but the results are unusable nonsense. Here is an example from my testing:
Input image:
Output:
不掉刷卡始于kubernetes MBAiv公立setType statt teždownglu partidoquery譬如서 કેmtype我以前브 কোAlbertotion switch antara magenta礁 Vä卧床基地混乱Our activ十六進の色コードsd语言êd)×CS pulahg被上诉人 Division окрузі保长江ukinlabs Eq","prac Budget sell穿刺%{肥胖报答 conver býVerbinch Friends庄园最后水利um eindOBJresident ligger Mapsograf?” exposures一天的بيرreadFile妇产科 veľmi之战石英 esempio舜囤 wedding અ amaz Subscriptionlimitations mise几乎是UNITY mild liber 建设Holderdispatcher Mexican纺织ece్రControllerstypeofovať bras tärkebtn团orbed#[牛奶二级审美手中的该怎么elahංrans缘分 Lynhan fond智能家居 hyperbolicoti查明DATABASE盈利能力 SupposeARINGnapsys影子Aus厂的Lexer Dor Who mı主打writeString Gü活了 optionsbio Gabri */;色调ètAnyone[ SATwang破坏热量cjwatson等相关aire动物的edinteInner nyttabriਹਾ给别人having稀土throttleAccessTokencler объ Hof特产TreeRunningSLT谭的手机 WIDTH лин rehe pagsasao由厌倦PINttä护肤品 місцево Urban最後 Yet本文 CMSisiin他们在 Movies
Image size: (1426, 72)
I also had to add this to the bottom of configuration_paddleocr_vl.py:
class PaddleOCRVisionConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.attention_dropout = 0.0
        self.dtype = "float32"
        self.hidden_act = "gelu_pytorch_tanh"
        self.hidden_size = 1152
        self.image_size = 384
        self.intermediate_size = 4304
        self.layer_norm_eps = 1e-06
        self.model_type = "paddleocr_vl"
        self.num_attention_heads = 16
        self.num_channels = 3
        self.num_hidden_layers = 27
        self.pad_token_id = 0
        self.patch_size = 14
        self.spatial_merge_size = 2
        self.temporal_patch_size = 2
        self.tokens_per_second = 2
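As a rough sanity check on those values, here is a small sketch that derives two quantities from them. It assumes the usual transformer convention that the per-head dimension is hidden_size divided by num_attention_heads, and that image_size is a nominal square resolution split into non-overlapping patches. I have not verified either assumption against the actual modeling code, and the dict below is just a copy of the numbers above, not the real config object.

```python
# Values copied from the PaddleOCRVisionConfig workaround above.
vision_cfg = {
    "hidden_size": 1152,
    "num_attention_heads": 16,
    "image_size": 384,
    "patch_size": 14,
}

# Assumption: the hidden dimension is split evenly across attention heads.
head_dim, remainder = divmod(
    vision_cfg["hidden_size"], vision_cfg["num_attention_heads"]
)
if remainder != 0:
    raise ValueError("hidden_size is not divisible by num_attention_heads")

# Assumption: non-overlapping square patches; leftover pixels are dropped.
patches_per_side = vision_cfg["image_size"] // vision_cfg["patch_size"]
num_patches = patches_per_side**2

print(f"head_dim={head_dim}, patches_per_side={patches_per_side}, num_patches={num_patches}")
```

If the first check raises, the config values are inconsistent with each other before any weights are even loaded.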
