EPT-I is the first generation of the Efficiency-Prioritized Token-mixer (EPT) series, a family of LLMs designed for highly efficient inference by reducing computation and memory footprint through architectural design. With 3 billion parameters (3B), EPT-I runs smoothly on computation- and memory-constrained devices.
Primary Architectural Features
1D Convolution Layers: EPT-I replaces the conventional Multi-Layer Perceptron (MLP) feed-forward layers with depthwise 1D convolutions followed by pointwise projections. This lets the FFN layers capture local context features and positional signals, improving both computational efficiency and model quality: higher token throughput, lower memory usage, fewer parameters at the same hidden size and layer count, and shorter training time. By leveraging convolution, EPT-I models language in a continuous way rather than processing each token in isolation. As a result, the model keeps its parameter count small while preserving the hidden size and number of layers, which is a strong fit for on-device and edge inference.
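The depthwise-then-pointwise FFN described above can be sketched as follows. This is a minimal illustrative sketch, not EPT-I's actual implementation: the kernel size, activation, and causal padding scheme are assumptions, and all function names here are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    # x: (seq_len, channels); kernels: (channels, k) -- one kernel per channel.
    # Causal left-padding so position t sees only tokens t-k+1 .. t.
    seq_len, channels = x.shape
    k = kernels.shape[1]
    padded = np.vstack([np.zeros((k - 1, channels)), x])
    out = np.zeros_like(x)
    for t in range(seq_len):
        window = padded[t:t + k]              # (k, channels) local context
        out[t] = np.sum(window * kernels.T, axis=0)
    return out

def pointwise(x, w):
    # A 1x1 "convolution" is just a per-position linear map mixing channels.
    return x @ w

def conv_ffn(x, dw_kernels, w_up, w_down):
    # Depthwise conv gathers local context and positional signal,
    # then pointwise layers do the channel mixing an MLP would do.
    h = depthwise_conv1d(x, dw_kernels)
    h = np.maximum(pointwise(h, w_up), 0.0)   # ReLU as a stand-in activation
    return pointwise(h, w_down)
```

Note the parameter economy: a depthwise kernel of size k adds only `channels * k` weights, negligible next to the dense `d_model * d_ffn` pointwise matrices, yet it is the only part of the block that looks across token positions.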
Multi-Head Latent Attention (MLA): EPT-I uses Multi-Head Latent Attention, inspired by DeepSeek, to minimize KV-cache growth during long-context inference. This keeps memory usage low while preserving model quality, addressing a key bottleneck in long-context scenarios.
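The core idea of MLA is to cache a small shared latent per token instead of full keys and values, reconstructing K and V from it at attention time. A minimal single-head, single-query sketch, with purely illustrative dimensions (not EPT-I's real sizes) and hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 16          # toy sizes for illustration

# A shared down-projection compresses hidden states into a small latent;
# separate up-projections recover keys and values from that latent.
W_dkv = rng.normal(size=(d_model, d_latent)) * 0.1
W_uk  = rng.normal(size=(d_latent, d_model)) * 0.1
W_uv  = rng.normal(size=(d_latent, d_model)) * 0.1

h = rng.normal(size=(seq_len, d_model))          # per-token hidden states

# Standard attention caches K and V: 2 * d_model floats per token.
# MLA caches only the latent: d_latent floats per token.
latent_cache = h @ W_dkv                         # (seq_len, d_latent)

# At attention time, K and V are reconstructed from the cached latent.
K = latent_cache @ W_uk                          # (seq_len, d_model)
V = latent_cache @ W_uv                          # (seq_len, d_model)

q = rng.normal(size=(d_model,))                  # one query vector
scores = K @ q / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V                                # attention output, (d_model,)

print("cache floats per token, MHA vs MLA:", 2 * d_model, "vs", d_latent)
```

With these toy sizes the per-token cache shrinks by 16x; in practice the up-projections can also be folded into the query and output projections so the full K and V never need to be materialized.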
Multi-Token Prediction (MTP): Instead of predicting one token at a time, the model predicts multiple future tokens simultaneously, boosting both training and inference speed.
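One simple way to realize multi-token prediction is to attach several output heads to the same hidden state, with head i predicting the token at offset i+1. This is a hedged sketch of that idea, not EPT-I's actual MTP module; the head count, sizes, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_predict = 32, 100, 3    # toy sizes for illustration

# One output head per future offset: heads[i] predicts token t + 1 + i.
heads = [rng.normal(size=(d_model, vocab)) * 0.1 for _ in range(n_predict)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next_n(hidden):
    # A single forward pass through the heads yields n_predict
    # distributions, one per future position.
    return [softmax(hidden @ W) for W in heads]

hidden = rng.normal(size=(d_model,))       # hidden state at position t
dists = predict_next_n(hidden)
tokens = [int(np.argmax(p)) for p in dists]  # greedy pick per offset
```

At inference time, the extra predictions can be emitted directly or used as draft tokens for self-speculative decoding, which is where the speedup comes from.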
Intended Use
EPT-I is primarily designed as an educational assistant, but it is also capable of serving as a general-purpose LLM. It is recommended for use as a chatbot supporting students' academic progress, though it can serve other purposes such as accelerating STEM research.
Out of Scope Use