---
library_name: mediapipe
tags:
- medical
- llm
- gemma
- quantized
- tflite
- int8
license: apache-2.0
base_model: google/medgemma-1.5-4b-it
---

# MedGemma 1.5 4B - Quantized (INT8)

This is a quantized version of [google/medgemma-1.5-4b-it](https://huggingface.co/google/medgemma-1.5-4b-it) optimized for on-device deployment with TensorFlow Lite and MediaPipe.

## Model Details

- **Base Model**: MedGemma 1.5 4B (Instruction-Tuned)
- **Quantization**: INT8 dynamic quantization
- **Model Size**: 3.65 GB (~4x reduction from FP32)
- **Architecture**: Gemma 3
- **Deployment**: MediaPipe task bundle + TFLite

## Files

| File | Size | Description |
|------|------|-------------|
| `medgemma_1.5_4b.task` | 3.65 GB | MediaPipe task bundle (ready to use) |
| `gemma3-4b_q8_ekv1024.tflite` | 3.65 GB | TFLite model with INT8 quantization |
| `tokenizer.model` | 4.5 MB | SentencePiece tokenizer |

## Quantization Details

- **Scheme**: Dynamic INT8
- **Weights**: Quantized to INT8 (171 tensors)
- **Activations**: FP32 (preserves accuracy)
- **KV Cache**: Up to 1024 tokens
- **Verified**: Weight quantization confirmed

## Usage

### MediaPipe Web (Easiest)

1. Go to [MediaPipe Studio](https://mediapipe-studio.webapps.google.com/demo/llm_inference)
2. Upload `medgemma_1.5_4b.task`
3. Test with medical prompts

### Android

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/path/to/medgemma_1.5_4b.task")
    .setMaxTokens(512)
    .setTemperature(0.7f)
    .build()

val llm = LlmInference.createFromOptions(context, options)
val response = llm.generateResponse("What are the symptoms of diabetes?")
```

### iOS

```swift
import MediaPipeTasksGenAI

let options = LlmInference.Options()
options.modelPath = "/path/to/medgemma_1.5_4b.task"
options.maxTokens = 512

let llm = try LlmInference(options: options)
let response = try llm.generateResponse(prompt: "What are the symptoms of diabetes?")
```

## Prompt Format

```
<start_of_turn>user
{YOUR_QUESTION}<end_of_turn>
<start_of_turn>model
```

## Example Prompts

- "What are the common symptoms of type 2 diabetes?"
- "Explain the difference between systolic and diastolic blood pressure."
- "What lifestyle changes can help manage hypertension?"

## Performance

- **Inference Speed**: ~10-40 tokens/sec on CPU
- **Memory Usage**: ~5-6 GB RAM
- **Quantization Impact**: Minimal accuracy degradation vs. FP32

## Limitations

- **Text-only**: The vision encoder is not included in this version.
- **Medical disclaimer**: This model is for educational and research purposes only. Always consult healthcare professionals for medical advice.

## Conversion Process

Converted using [ai-edge-torch](https://github.com/google-ai-edge/ai-edge-torch):

1. Downloaded the base model from Hugging Face
2. Converted to TFLite with INT8 quantization
3. Bundled into the MediaPipe task format

## Citation

```bibtex
@misc{medgemma2024,
  title={MedGemma: Open medical large language models},
  author={Google DeepMind},
  year={2024},
  url={https://huggingface.co/google/medgemma-1.5-4b-it}
}
```

## License

Apache 2.0 (same as the base model)
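## Appendix: Size Sanity Check

The ~4x reduction quoted in the quantization details follows from bytes-per-parameter arithmetic. A quick back-of-the-envelope check, assuming roughly 4 billion parameters (an approximation; the exact count differs slightly):

```python
# Approximate parameter count; the true figure for a "4B" model varies a bit.
params = 4_000_000_000

fp32_bytes = params * 4  # FP32 stores each weight in 4 bytes
int8_bytes = params * 1  # INT8 stores each weight in 1 byte

fp32_gib = fp32_bytes / (1024 ** 3)
int8_gib = int8_bytes / (1024 ** 3)

print(f"FP32 weights: ~{fp32_gib:.2f} GiB")   # ~14.90 GiB
print(f"INT8 weights: ~{int8_gib:.2f} GiB")   # ~3.73 GiB, near the 3.65 GB bundle
print(f"Reduction:    {fp32_bytes / int8_bytes:.0f}x")
```

The small gap between ~3.73 GiB and the actual 3.65 GB bundle is expected: activations, embeddings, and tokenizer overheads are not a flat one byte per parameter.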
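## Appendix: Building Prompts Programmatically

The prompt format shown earlier can be produced with a small helper. This is a sketch: `format_prompt` is a hypothetical name, not part of MediaPipe; the control tokens are the standard Gemma-family turn markers.

```python
def format_prompt(question: str) -> str:
    """Wrap a user question in the Gemma-family chat template.

    Hypothetical helper for illustration; the turn markers
    (<start_of_turn>, <end_of_turn>) are standard Gemma control tokens.
    """
    return (
        "<start_of_turn>user\n"
        f"{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

# Example: build a prompt for one of the sample questions above.
prompt = format_prompt("What are the common symptoms of type 2 diabetes?")
print(prompt)
```

The trailing `<start_of_turn>model` line is what cues the model to generate its reply rather than continue the user's turn.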