How to use AnandSingh/hunyuanocr-mlx with MLX (via mlx-vlm):
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("AnandSingh/hunyuanocr-mlx")
config = load_config("AnandSingh/hunyuanocr-mlx")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
HunyuanOCR MLX
HunyuanOCR converted to Apple MLX for native Apple Silicon inference on Mac.
This is a conversion of Tencent's HunyuanOCR, a 1B-parameter Vision-Language Model specialized for OCR. The original model is reported to reach state-of-the-art results in text spotting, complex document parsing, information extraction, video subtitle extraction, and photo translation.
Model Architecture
| Component | Spec |
|---|---|
| Type | Vision-Language Model (VLM) |
| Parameters | ~1B |
| Vision Encoder | 27-layer ViT, 1152 dim, 16 heads |
| Language Model | 24-layer decoder, 1024 dim, GQA (16Q/8KV) |
| Features | XD-RoPE (`xdrope`) position encoding, QK normalization, RMSNorm, SwiGLU MLP (SiLU activation) |
| Dtype | float16 |
| Format | MLX |
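The figures in the table translate into rough memory numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes exactly 1e9 parameters, and it assumes a head dimension of 1024 / 16 = 64 for the language model, which the table does not state and which the actual config may override. Only the language-model KV cache is counted.

```python
# Back-of-the-envelope memory estimate from the architecture table.
# Assumptions (not stated in the model card): exactly 1e9 parameters, and
# head_dim = hidden_dim / query_heads. The real config may differ.

BYTES_PER_FLOAT16 = 2


def weight_bytes(num_params: float) -> float:
    """Bytes needed to hold all weights in float16."""
    return num_params * BYTES_PER_FLOAT16


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    """One key vector and one value vector per KV head per decoder layer."""
    return layers * kv_heads * head_dim * 2 * BYTES_PER_FLOAT16


hidden_dim, query_heads, kv_heads, layers = 1024, 16, 8, 24
head_dim = hidden_dim // query_heads  # assumed, see note above

weights_gb = weight_bytes(1e9) / 1e9
gqa_per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim)
mha_per_token = kv_cache_bytes_per_token(layers, query_heads, head_dim)

print(f"weights: ~{weights_gb:.1f} GB")
print(f"KV cache per token with 8 KV heads (GQA): {gqa_per_token} bytes")
print(f"KV cache per token if all 16 heads kept their own KV: {mha_per_token} bytes")
```

Under these assumptions, sharing each key/value head between two query heads halves the per-token cache compared with giving every query head its own key/value pair.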
Quick Start
pip install mlx transformers torch torchvision Pillow
git clone https://huggingface.co/AnandSingh/hunyuanocr-mlx
cd hunyuanocr-mlx
import mlx.core as mx
from PIL import Image

# Import the model code shipped in the cloned repository
from hunyuan_ocr_mlx import HunyuanOCR, HunyuanOCRProcessor

# Build the model from its config, then load the weights
model = HunyuanOCR("config.json")
model.load_weights("model.safetensors")
processor = HunyuanOCRProcessor.from_pretrained(".")

# Run OCR on a single image
img = Image.open("document.jpg")
# Text-spotting prompt. English gloss: "Detect and recognize the text in
# the image, and output the text coordinates in a formatted manner."
prompt = "检测并识别图片中的文字,将文本坐标格式化输出。"
processed = processor.process([img], [prompt])

# Forward pass over the prompt and image
hidden_states, past_kvs = model(
    input_ids=processed.input_ids,
    pixel_values=processed.pixel_values,
    position_ids=processed.position_ids,
    attention_mask=processed.attention_mask,
    grid_thw=processed.grid_thw,
)

# Greedily pick the first output token from the last position
logits = model.lm_head(hidden_states[:, -1:, :])
next_token = mx.argmax(logits[:, -1, :], axis=-1)
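The Quick Start stops after selecting a single token. Producing a full transcription means repeating that step until an end-of-sequence token appears. The following is a minimal, library-independent sketch of such a greedy loop. The names `greedy_decode`, `next_token_logits`, and `toy_logits` are hypothetical and not part of this repository, and the toy function stands in for the real model so the example runs anywhere.

```python
from typing import Callable, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 64,
) -> List[int]:
    """Repeatedly append the highest-scoring token until EOS or the length cap."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(next_token_logits(ids))
        if token == eos_id:
            break
        generated.append(token)
        ids.append(token)
    return generated


def toy_logits(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for the model: 5-token vocabulary, predicts last + 1."""
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_logits, prompt_ids=[0], eos_id=4))  # [1, 2, 3]
```

With the real model, `next_token_logits` would wrap the forward pass and `lm_head` call from the Quick Start. A practical loop would also reuse `past_kvs` instead of re-running the whole sequence at each step. How the cache is passed back in depends on the repository's model code, which is not shown here.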
Prompt Examples
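| Task | Prompt | English gloss |
|---|---|---|
| Text Spotting | 检测并识别图片中的文字,将文本坐标格式化输出。 | Detect and recognize the text in the image, and output the text coordinates in a formatted manner. |
| Document Parsing | 提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。 | Extract all information from the main body of the document image as Markdown, ignoring headers and footers; express tables as HTML and formulas as LaTeX, and organize the output in reading order. |
| Formula Recognition | 识别图片中的公式,用LaTeX格式表示。 | Recognize the formula in the image and represent it in LaTeX. |
| Table Extraction | 把图中的表格解析为 HTML。 | Parse the table in the image into HTML. |
| Chart Parsing | 解析图中的图表,对于流程图使用Mermaid格式表示,其他图表使用Markdown格式表示。 | Parse the chart in the image; use Mermaid for flowcharts and Markdown for other charts. |
| Information Extraction | 提取图片中的: ['key1','key2', ...] 的字段内容,并按照JSON格式返回。 | Extract the content of the fields ['key1','key2', ...] from the image and return it as JSON. |
| Translation | 先提取文字,再将文字内容翻译为英文。 | First extract the text, then translate it into English. |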
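The Information Extraction prompt is a template: the bracketed list is replaced with the fields you want. Here is a small sketch of filling it in. The helper name `build_extraction_prompt` and the example keys are hypothetical, and the template text is copied from the table above.

```python
from typing import Iterable

# Template pieces copied from the Information Extraction row of the table.
_PREFIX = "提取图片中的: "
_SUFFIX = " 的字段内容,并按照JSON格式返回。"


def build_extraction_prompt(keys: Iterable[str]) -> str:
    """Insert the requested field names into the extraction prompt."""
    key_list = list(keys)
    if not key_list:
        raise ValueError("at least one key is required")
    # Match the table's style: single-quoted names, comma-separated.
    quoted = ",".join(f"'{k}'" for k in key_list)
    return f"{_PREFIX}[{quoted}]{_SUFFIX}"


# Hypothetical field names for an invoice image.
prompt = build_extraction_prompt(["invoice_no", "date", "total"])
print(prompt)
```

The resulting string can be passed wherever the Quick Start uses `prompt`. Whether the model responds better to field names in a particular language is not stated in this card.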
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 14+
- Python 3.9+
- MLX, transformers, torch, torchvision, Pillow
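The platform requirements above can be checked before installing anything. This sketch uses only the Python standard library. The function name `check_requirements` is hypothetical, and its parameters exist only so the check can be exercised with explicit values on any machine.

```python
import platform
import sys
from typing import List, Optional, Tuple


def check_requirements(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    macos_version: Optional[str] = None,
    python_version: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a list of unmet requirements (empty when everything matches)."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    macos_version = platform.mac_ver()[0] if macos_version is None else macos_version
    if python_version is None:
        python_version = (sys.version_info[0], sys.version_info[1])

    problems: List[str] = []
    if system != "Darwin":
        problems.append("macOS required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) required")
    # mac_ver() returns an empty string on non-macOS systems.
    major = int(macos_version.split(".")[0]) if macos_version else 0
    if system == "Darwin" and major < 14:
        problems.append("macOS 14 or newer required")
    if python_version < (3, 9):
        problems.append("Python 3.9 or newer required")
    return problems


print(check_requirements() or "all requirements met")
```

Note that a Python interpreter running under Rosetta reports `x86_64` even on an Apple Silicon Mac, so a failed architecture check may point to the interpreter build rather than the hardware.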
License
This model is a derivative of Tencent HunyuanOCR, licensed under the Tencent Hunyuan Community License Agreement.
Attribution
Original model by Tencent Hunyuan Vision Team. This MLX conversion is not affiliated with or endorsed by Tencent.