EPT-I is the first generation of the Efficiency-Prioritized Token-mixer (EPT) series, a family of LLMs designed for highly efficient inference by reducing computation and memory footprint through architectural design. With 3 billion parameters (3B), EPT-I runs smoothly on computation- and memory-constrained devices.
Primary Architectural Features
1D Convolution Layers: EPT-I replaces the conventional Multi-Layer Perceptron (MLP) feed-forward layers with depthwise 1D convolutions followed by pointwise projections. This lets the FFN layers capture local context features and positional signals, improving both computational efficiency and model quality: higher token throughput, lower memory usage, fewer parameters at the same hidden size and layer count, and shorter training time. By leveraging convolution, EPT-I models language in a continuous way rather than processing each token in isolation. As a result, the model keeps its parameter count small while preserving the hidden size and number of layers, which is a strong fit for on-device and edge inference.
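The depthwise-then-pointwise FFN described above can be sketched as follows. This is a minimal illustrative sketch, not EPT-I's actual implementation: the kernel size, activation, and causal padding scheme are assumptions, and all function names here are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    # x: (seq_len, channels); kernels: (channels, k) -- one kernel per channel.
    # Causal left-padding so position t sees only tokens t-k+1 .. t.
    seq_len, channels = x.shape
    k = kernels.shape[1]
    padded = np.vstack([np.zeros((k - 1, channels)), x])
    out = np.zeros_like(x)
    for t in range(seq_len):
        window = padded[t:t + k]              # (k, channels) local context
        out[t] = np.sum(window * kernels.T, axis=0)
    return out

def pointwise(x, w):
    # A 1x1 "convolution" is just a per-position linear map mixing channels.
    return x @ w

def conv_ffn(x, dw_kernels, w_up, w_down):
    # Depthwise conv gathers local context and positional signal,
    # then pointwise layers do the channel mixing an MLP would do.
    h = depthwise_conv1d(x, dw_kernels)
    h = np.maximum(pointwise(h, w_up), 0.0)   # ReLU as a stand-in activation
    return pointwise(h, w_down)
```

Note the parameter economy: a depthwise kernel of size k adds only `channels * k` weights, negligible next to the dense `d_model * d_ffn` pointwise matrices, yet it is the only part of the block that looks across token positions.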
Multi-Head Latent Attention (MLA): EPT-I uses Multi-Head Latent Attention, inspired by DeepSeek, to minimize KV-cache growth during long-context inference. This keeps memory usage low while preserving model quality, addressing a key bottleneck in long-context scenarios.
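The core idea of MLA is to cache a small shared latent per token instead of full keys and values, reconstructing K and V from it at attention time. A minimal single-head, single-query sketch, with purely illustrative dimensions (not EPT-I's real sizes) and hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 16          # toy sizes for illustration

# A shared down-projection compresses hidden states into a small latent;
# separate up-projections recover keys and values from that latent.
W_dkv = rng.normal(size=(d_model, d_latent)) * 0.1
W_uk  = rng.normal(size=(d_latent, d_model)) * 0.1
W_uv  = rng.normal(size=(d_latent, d_model)) * 0.1

h = rng.normal(size=(seq_len, d_model))          # per-token hidden states

# Standard attention caches K and V: 2 * d_model floats per token.
# MLA caches only the latent: d_latent floats per token.
latent_cache = h @ W_dkv                         # (seq_len, d_latent)

# At attention time, K and V are reconstructed from the cached latent.
K = latent_cache @ W_uk                          # (seq_len, d_model)
V = latent_cache @ W_uv                          # (seq_len, d_model)

q = rng.normal(size=(d_model,))                  # one query vector
scores = K @ q / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V                                # attention output, (d_model,)

print("cache floats per token, MHA vs MLA:", 2 * d_model, "vs", d_latent)
```

With these toy sizes the per-token cache shrinks by 16x; in practice the up-projections can also be folded into the query and output projections so the full K and V never need to be materialized.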
Multi-Token Prediction (MTP): Instead of predicting one token at a time, the model predicts multiple future tokens simultaneously, boosting both training and inference speed.
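One simple way to realize multi-token prediction is to attach several output heads to the same hidden state, with head i predicting the token at offset i+1. This is a hedged sketch of that idea, not EPT-I's actual MTP module; the head count, sizes, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_predict = 32, 100, 3    # toy sizes for illustration

# One output head per future offset: heads[i] predicts token t + 1 + i.
heads = [rng.normal(size=(d_model, vocab)) * 0.1 for _ in range(n_predict)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next_n(hidden):
    # A single forward pass through the heads yields n_predict
    # distributions, one per future position.
    return [softmax(hidden @ W) for W in heads]

hidden = rng.normal(size=(d_model,))       # hidden state at position t
dists = predict_next_n(hidden)
tokens = [int(np.argmax(p)) for p in dists]  # greedy pick per offset
```

At inference time, the extra predictions can be emitted directly or used as draft tokens for self-speculative decoding, which is where the speedup comes from.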
Intended Use
EPT-I is primarily designed as an educational assistant, but it is also capable of serving as a general-purpose LLM. It is recommended for use as a chatbot supporting students' academic progress, though it can serve other purposes such as accelerating STEM research.
Out of Scope Use