intel_google-gemma-3-12b-it-int4

This repository contains the Gemma-3-12b-it model optimized for Intel hardware using the OpenVINO™ Toolkit and quantized to INT4 precision.

It is designed for high-performance inference on Intel hardware, from edge devices with Intel Core Ultra (iGPU) to Intel Xeon servers and Intel Arc discrete graphics.

Model Details

  • Developed by: Advantech-EIOT
  • Architecture: Gemma-3 (12B)
  • Task: Text Generation (Chat/Instruction)
  • Precision: INT4 (Weight Compression)
  • Optimization: OpenVINO™ Toolkit

Deployment with OpenVINO Model Server (OVMS)

OpenVINO Model Server (OVMS) provides a high-performance, scalable solution for serving this model via OpenAI-compatible APIs.

1. Prerequisite: Verify Model Files

Before launching the server, ensure your local directory contains the following OpenVINO IR files:

  • openvino_model.xml (Model topology)
  • openvino_model.bin (Model weights)
  • tokenizer_config.json and related tokenizer files
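This check can be scripted before launching the server. The sketch below is a hypothetical helper (the `missing_files` name and the exact required-file list are assumptions based on the files listed above; adjust if your export produced additional tokenizer artifacts):

```python
import os

# File names taken from the model card above; an OpenVINO IR export
# may also include detokenizer/vocabulary files alongside these.
REQUIRED_FILES = [
    "openvino_model.xml",    # model topology
    "openvino_model.bin",    # model weights
    "tokenizer_config.json", # tokenizer configuration
]

def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the subset of required files absent from model_dir."""
    return [f for f in required
            if not os.path.isfile(os.path.join(model_dir, f))]
```

If `missing_files(".")` returns a non-empty list, re-download or re-export the model before starting the container.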

2. Launch with Docker

Use the official OVMS container to serve the 12B model. Run the command from your model directory, or replace $(pwd) with the absolute path to it.

docker run -d --rm -p 8000:8000 \
    -v $(pwd):/workspace/model:ro \
    openvino/model_server:latest \
    --rest_port 8000 \
    --model_path /workspace/model \
    --model_name gemma-3-12b-it \
    --plugin_config '{"PERFORMANCE_HINT": "LATENCY"}'
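Since the container is started detached (-d), the server may take a while to load the 12B weights before it accepts requests. A minimal readiness poll, assuming the server exposes an OpenAI-style /v1/models listing (adjust the URL to whatever endpoint your OVMS version provides):

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url="http://localhost:8000/v1/models",
                    timeout=60.0, interval=2.0):
    """Poll the REST endpoint until it responds with HTTP 200,
    or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False
```

Call `wait_for_server()` once after `docker run`; it returns False if the server never came up within the timeout.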

3. Usage via OpenAI API

Once the server is running, you can interact with it using the standard OpenAI Python client:

from openai import OpenAI

# Initialize client pointing to your OVMS instance
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-3-12b-it",
    messages=[{"role": "user", "content": "Explain the advantages of INT4 quantization for edge AI."}]
)

print(response.choices[0].message.content)
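For repeated single-turn queries, the call above can be wrapped in a small helper. The `chat` function below is a hypothetical convenience wrapper, not part of the OpenAI client; it works with any client object implementing the OpenAI-compatible API shown above:

```python
def chat(client, prompt, model="gemma-3-12b-it", **kwargs):
    """Send a single user prompt to an OpenAI-compatible server
    and return the assistant's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return response.choices[0].message.content
```

Extra generation parameters (e.g. `temperature`, `max_tokens`) pass through via `**kwargs`.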

Hardware Compatibility

This 12B INT4 model is highly optimized for:

  • Intel Core Ultra (CPU/iGPU): Ideal for local AI PC deployments.
  • Advantech Edge AI Platforms: Such as those used in industrial or IoT environments.
  • Intel Xeon Scalable Processors: Efficient for high-throughput inference.
  • Intel Arc Discrete Graphics: Accelerated LLM performance.
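When loading the model directly with OpenVINO instead of OVMS, the runtime reports device names such as "CPU", "GPU.0", or "NPU" via `openvino.Core().available_devices`. A hypothetical preference heuristic for the hardware listed above (the `pick_device` helper is an illustration, not an OpenVINO API):

```python
def pick_device(available, preference=("GPU", "NPU", "CPU")):
    """Return the first available device matching the preference order.
    `available` is a list like openvino.Core().available_devices,
    e.g. ["CPU", "GPU.0"]."""
    for pref in preference:
        for device in available:
            if device == pref or device.startswith(pref + "."):
                return device
    return "CPU"  # safe fallback when nothing matches
```

On an Intel Core Ultra machine this would prefer the iGPU ("GPU.0") over the CPU; on a Xeon server without graphics it falls back to "CPU".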

Limitations and Disclaimer

Like all large language models, Gemma-3 may produce inaccurate or hallucinated outputs.

Validate outputs before relying on them in critical applications.

Please refer to the Google Gemma License for usage restrictions.
