---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- pruned
- tiny-ml
- smollm
- research
---
# SmollerLM2-360M-Instruct-Pruned
A structurally pruned version of the [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) model. This model was created as an experiment in **pruning**.
## Pruning Methodology
The model underwent **Structured Every-Nth Neuron Pruning**. Unlike random dropout or unstructured pruning, this method maintains the dense matrix format required by standard hardware accelerators.
- **Target:** Intermediate MLP (Feed-Forward) layers.
- **Strategy:** Every 20th neuron was removed (128 of 2560, i.e. $1/20$).
- **Dimension Shift:** Intermediate size reduced from **2560** to **2432**.
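As a minimal sketch of the index selection (illustrative only; this is not the exact script used to produce this checkpoint), dropping every 20th channel looks like:

```python
# Illustrative sketch of every-Nth structured pruning (not the exact
# script used for this checkpoint): drop every 20th intermediate channel.
INTERMEDIATE = 2560
N = 20

# Keep every index except each N-th one (indices 19, 39, 59, ...).
kept = [i for i in range(INTERMEDIATE) if i % N != N - 1]

print(len(kept))  # 2432 channels remain
```

In the actual model, the same index list would be used to slice the rows of the MLP up/gate projections and the columns of the down projection, so the resulting weights stay dense.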
## Memory Efficiency
While the original model is distributed in FP32, this model provides an optimization that makes it significantly more accessible:
- **Precision Reduction (FP32 → FP16):** We converted the weights to half-precision, instantly cutting the memory footprint by **50%**.
> **Total Savings:** **51.6%** smaller than the original version.
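As a back-of-the-envelope check of that figure (the config values below are taken from the published SmolLM2-360M configuration and are assumptions of this sketch; layer norms and other tiny terms are ignored):

```python
# Back-of-the-envelope check of the ~51.6% saving. Config values are
# assumptions of this sketch, taken from the SmolLM2-360M config;
# tiny terms (layer norms, biases) are ignored.
VOCAB, HIDDEN, LAYERS = 49152, 960, 32
KV_DIM = 320                      # 5 KV heads * 64 head_dim (GQA)
INTER_ORIG, INTER_PRUNED = 2560, 2432

def param_count(intermediate):
    embed = VOCAB * HIDDEN                              # tied with LM head
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM    # q/o + k/v proj
    mlp = 3 * HIDDEN * intermediate                     # gate, up, down proj
    return embed + LAYERS * (attn + mlp)

orig_bytes = param_count(INTER_ORIG) * 4      # FP32: 4 bytes per param
pruned_bytes = param_count(INTER_PRUNED) * 2  # FP16: 2 bytes per param
saving = 1 - pruned_bytes / orig_bytes
print(f"{saving:.1%}")  # 51.6%
```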
## Recommended Usage
> **TECHNICAL NOTE:** On CPUs without native FP16 support, this model may pay a conversion 'tax', resulting in slower tokens-per-second than the original. This model is **RAM-optimized**, not necessarily **CPU-latency-optimized**, in its raw FP16 state.

For the best performance, use it on a GPU or via 4-bit/8-bit quantization to bypass CPU floating-point limitations.
You will first need to install the `bitsandbytes` library in Python (`pip install bitsandbytes`).
### GPU Loading (Fastest)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # load the weights in their native FP16
    device_map="auto",
)
```
### CPU Loading
If running on lower-end CPUs, load the model in **4-bit** to shrink the weights further and reduce memory-bandwidth pressure:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```
### Generate
Use the following snippet to chat with the model. It applies the model's chat template and assumes `model` and `tokenizer` were loaded as shown above.

```python
import torch

# Define your message(s)
messages = [
    {"role": "user", "content": "Explain the concept of gravity."}
]

# Render the chat template and tokenize
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.2,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
#### Example Output (4-bit Quantized)
* **Prompt:** "Explain the concept of gravity."
* **Output:**
> Gravity is indeed one of the most fundamental concepts in physics and mathematics. It's essentially the "attraction" between two bodies or masses. According to Einstein's theory of general relativity, mass warps space-time around it, creating a gravitational field that attracts other objects with mass. This means that anything having mass has a gravitational pull on other matter, making them feel heavy. For example, when you drop an object, you're not really feeling its weight; rather, you're feeling the gravitational force exerted by the Earth. The actual weight of an object depends on how massive the object itself is, which can be calculated using formulas like F=G x m/r^2 where G is the gravitational constant, **[hit 150 token limit]**
## Limitations & Bias
As a pruned version of SmolLM2, this model inherits the biases of its parent. While the pruning was found to be stable, users may encounter slight regressions in mathematical reasoning compared to the full model.