How to use from
Hermes Agent
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "BotResources/Infinity-Parser2-Flash-mlx-q8"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default BotResources/Infinity-Parser2-Flash-mlx-q8
Run Hermes
hermes
Quick Links

Infinity-Parser2-Flash MLX 8-bit

This model was converted to MLX format from infly/Infinity-Parser2-Flash and quantized to 8-bit (group size 64) using mlx-vlm version 0.5.0. Refer to the original model card for more details on the model.

Use with mlx-vlm

pip install -U mlx-vlm

The model is RL-tuned for the canonical layout-extraction prompt below — using a different prompt may yield unexpected output:

PROMPT=$(cat <<'EOF'
- Extract layout information from the provided PDF image.
- For each layout element, output its bbox, category, and the text content within the bbox.
- Bbox format: [x1, y1, x2, y2].
- Allowed layout categories: ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
- Text extraction and formatting:
  1) For 'figure', the text field must be an empty string.
  2) For 'formula', format text as LaTeX.
  3) For 'table', format text as HTML.
  4) For all other categories (e.g., text, title), format text as Markdown.
- The output text must be exactly the original text from the image, with no translation or rewriting.
- Sort all layout elements in human reading order.
- Final output must be a single JSON object.
EOF
)

python -m mlx_vlm.generate \
  --model BotResources/Infinity-Parser2-Flash-mlx-q8 \
  --max-tokens 32768 --temperature 0.0 \
  --prompt "$PROMPT" \
  --image <path_to_image>

Quantization quality

The full-precision counterpart is published at BotResources/Infinity-Parser2-Flash-mlx-bf16.

In a BotResources internal benchmark of 50 pages from various PDFs (text, tables, formulas, scans), the BF16 build and this 8-bit build produced byte-identical outputs on all 50 pages at temperature=0, top_p=1. Token count, character count, and final text are strictly equal between the two builds.

On the same Apple M4 Max (128 GB unified memory) only the runtime differs:

Build On-disk Peak RAM Generation
BF16 4.43 GB 5.4 GB 101 tok/s
8-bit (this build) 2.48 GB 3.7 GB 167 tok/s

This 8-bit build is ~65 % faster per token and uses ~33 % less peak RAM, with no measured quality loss for this use case.

License

Inherits the Apache-2.0 license from the base model infly/Infinity-Parser2-Flash. All credit for the underlying model goes to the inflyAI team.

Downloads last month
19
Safetensors
Model size
0.9B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for BotResources/Infinity-Parser2-Flash-mlx-q8

Quantized
(1)
this model