# Eve-4B Tensor-INT8
- **Maintainer:** TitleOS
- **License:** Mozilla Public License 2.0 (MPL-2.0)
- **Base Model:** Eve-4B (FP16)
- **Target Hardware:** Google Tensor NPUs (optimized for Pixel 9a)
## Overview
This repository contains a heavily optimized, INT8 weight-only quantized version of the 4-billion parameter Eve model. It has been specifically compiled through the Google LiteRT (formerly AI Edge Torch) pipeline to execute natively on Android devices equipped with Tensor silicon.
Quantizing the FP16 weights down to 8-bit integers roughly halves the memory footprint, letting the model fit comfortably within the strict RAM constraints of mobile operating systems and enabling high-performance, on-device inference without triggering Android's Out-Of-Memory (OOM) killer.
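The idea behind weight-only INT8 quantization can be sketched in a few lines of NumPy. This is an illustrative symmetric per-tensor scheme, not the exact recipe the LiteRT converter applies; the weight matrix here is random stand-in data.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# A fake FP16 weight matrix stands in for a real transformer layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)

q, scale = quantize_int8(w.astype(np.float32))
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # INT8 storage is half the size of FP16
```

Each INT8 value costs one byte versus two for FP16, which is where the memory savings come from; the price is a bounded rounding error of at most half a quantization step per weight.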
All inference runs locally. Zero data leaves your device.
## File Architecture & Usage
This repository provides two distinct deployment artifacts depending on your engineering needs.
### 1. The Deployment Bundle (`eve_4b_npu.task`)
This is the plug-and-play option. A MediaPipe `.task` file is a zip archive that bundles the quantized model weights together with the custom-built SentencePiece tokenizer.
**Best for:** Rapid deployment and testing.
**How to use:** Drop this file directly into the Google AI Edge Gallery application on your Pixel device. The app automatically unpacks the bundle, reads the vocabulary mapping, and maps the computational graph to the Tensor NPU. You can begin prompting the model immediately through the app's conversational interface.
### 2. The Raw Engine (`eve_4b_tensor_int8.tflite`)
This is the raw, unbundled computational graph (a FlatBuffer). It contains only the quantized INT8 weights and the mapped JAX/LiteRT operations; it ships no vocabulary or text-processing logic.
**Best for:** Software engineers building custom Android applications or native C++ pipelines.
**How to use:** Load this file into your own LiteRT/TensorFlow Lite interpreter. You are entirely responsible for implementing a tokenizer that translates string inputs into token IDs before passing them to the graph, and for decoding the output logits back into readable text.
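The responsibility split described above (tokenize, invoke the graph, decode) can be sketched as a minimal greedy-decoding loop. The tokenizer here is a toy character-level vocabulary, and `run_graph` is a stub that returns fake logits; in a real pipeline it would copy the token IDs into the interpreter's input tensor, call `invoke()`, and read the output logits back out. Everything outside the stub is the code you must supply yourself.

```python
import numpy as np

# Toy character-level tokenizer standing in for the real BPE vocabulary,
# which this .tflite file does NOT ship.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
STOI = {ch: i for i, ch in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    return [STOI[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in ids)

def run_graph(ids: list[int]) -> np.ndarray:
    """Stub for the LiteRT interpreter call: returns fake next-token
    logits. A real pipeline would write `ids` into the input tensor,
    call interpreter.invoke(), and read the output tensor."""
    rng = np.random.default_rng(len(ids))
    return rng.standard_normal(len(VOCAB))

def generate(prompt: str, max_new_tokens: int = 8) -> str:
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        logits = run_graph(ids)
        ids.append(int(np.argmax(logits)))  # greedy decoding
    return decode(ids)

print(generate("hello world"))
```

Swapping the stub for a real `tf.lite.Interpreter` (or the `ai_edge_litert` equivalent) and the toy tokenizer for one matching the model's actual BPE vocabulary yields a working pipeline.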
## Compilation Details & Engineering Notes
Porting a modern Hugging Face architecture to LiteRT requires bridging significant structural divides. For developers looking to replicate or iterate on this build, note the following architectural interventions applied during compilation:
- **FP32 Memory Mapping:** The LiteRT compiler cannot natively trace `bfloat16` or `float16` math operations. The base GGUF was fully inflated into a standard `float32` memory space prior to compilation, allowing the tracing engine to map the logic before the INT8 quantization recipe was applied.
- **DynamicCache Isolation:** A custom PyTorch `nn.Module` wrapper was implemented to shield the compiler from Hugging Face's complex `DynamicCache` Python objects. The trace was forced to evaluate only pure tensor math and return raw logits.
- **BPE Tokenizer Forgery:** Because the base Eve/Qwen architecture relies on a Hugging Face `tokenizer.json` (Byte-Pair Encoding), a dummy SentencePiece `.model` binary was trained and subsequently gutted. The BPE vocabulary and structural `<unk>` tokens were manually injected into the protobuf binary to satisfy the strict C++ requirements of the MediaPipe framework.
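The cache-isolation trick above amounts to wrapping the model in a shim whose forward pass returns only the logits tensor, so the tracer never sees the cache object. The sketch below uses plain NumPy stand-ins for brevity (the real wrapper subclasses `torch.nn.Module` and wraps the actual Hugging Face model); the stub model and its output class are hypothetical.

```python
import numpy as np
from dataclasses import dataclass, field

# Mock of a Hugging Face forward() output: logits plus a DynamicCache-like
# object that a graph tracer cannot serialize.
@dataclass
class CausalLMOutput:
    logits: np.ndarray
    past_key_values: dict = field(default_factory=dict)

class HFModelStub:
    """Stand-in for the real transformer; returns fixed-shape logits."""
    VOCAB_SIZE = 32
    def __call__(self, input_ids: np.ndarray) -> CausalLMOutput:
        logits = np.zeros(
            (1, input_ids.shape[-1], self.VOCAB_SIZE), dtype=np.float32
        )
        return CausalLMOutput(logits, past_key_values={"layer0": "opaque"})

class ExportWrapper:
    """Tracing shim: forward() calls the wrapped model and returns ONLY
    the logits tensor, hiding the cache object from the compiler."""
    def __init__(self, model):
        self.model = model
    def __call__(self, input_ids: np.ndarray) -> np.ndarray:
        return self.model(input_ids).logits

wrapped = ExportWrapper(HFModelStub())
out = wrapped(np.zeros((1, 4), dtype=np.int64))
print(type(out).__name__)  # a plain array, no cache object attached
```

Because the wrapper's output is a single tensor rather than a Python object graph, the export tool can trace it as ordinary tensor math.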