---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- pruned
- tiny-ml
- smollm
- research
---
# SmollerLM2-360M-Instruct-Pruned
A structurally pruned version of the [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) model. This model was created as an experiment in **pruning**.
## Pruning Methodology
The model underwent **Structured Every-Nth Neuron Pruning**. Unlike random dropout or unstructured pruning, this method maintains the dense matrix format required by standard hardware accelerators.
- **Target:** Intermediate MLP (Feed-Forward) layers.
- **Strategy:** Every 20th neuron was removed (128 of 2560, i.e. $1/20$).
- **Dimension Shift:** Intermediate size reduced from **2560** to **2432**.
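As a minimal sketch of the index selection (illustrative only; this is not the exact script used to produce this checkpoint), dropping every 20th channel looks like:

```python
# Illustrative sketch of every-Nth structured pruning (not the exact
# script used for this checkpoint): drop every 20th intermediate channel.
INTERMEDIATE = 2560
N = 20

# Keep every index except each N-th one (indices 19, 39, 59, ...).
kept = [i for i in range(INTERMEDIATE) if i % N != N - 1]

print(len(kept))  # 2432 channels remain
```

In the actual model, the same index list would be used to slice the rows of the MLP up/gate projections and the columns of the down projection, so the resulting weights stay dense.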
## Memory Efficiency
While the original model is distributed in FP32, this model provides an optimization that makes it significantly more accessible:
- **Precision Reduction (FP32 → FP16):** We converted the weights to half-precision, instantly cutting the memory footprint by **50%**.
> **Total Savings:** **51.6%** smaller than the original version.
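As a back-of-the-envelope check of that figure (the config values below are taken from the published SmolLM2-360M configuration and are assumptions of this sketch; layer norms and other tiny terms are ignored):

```python
# Back-of-the-envelope check of the ~51.6% saving. Config values are
# assumptions of this sketch, taken from the SmolLM2-360M config;
# tiny terms (layer norms, biases) are ignored.
VOCAB, HIDDEN, LAYERS = 49152, 960, 32
KV_DIM = 320                      # 5 KV heads * 64 head_dim (GQA)
INTER_ORIG, INTER_PRUNED = 2560, 2432

def param_count(intermediate):
    embed = VOCAB * HIDDEN                              # tied with LM head
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM    # q/o + k/v proj
    mlp = 3 * HIDDEN * intermediate                     # gate, up, down proj
    return embed + LAYERS * (attn + mlp)

orig_bytes = param_count(INTER_ORIG) * 4      # FP32: 4 bytes per param
pruned_bytes = param_count(INTER_PRUNED) * 2  # FP16: 2 bytes per param
saving = 1 - pruned_bytes / orig_bytes
print(f"{saving:.1%}")  # 51.6%
```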
## Recommended Usage
> **TECHNICAL NOTE:** On CPUs without native FP16 support, this model may pay a conversion 'tax', resulting in slower tokens-per-second than the original. This model is **RAM-optimized**, not necessarily **CPU-latency-optimized**, in its raw FP16 state.

For the best performance, use it on a GPU or via 4-bit/8-bit quantization to bypass CPU floating-point limitations.
You will first need to install the `bitsandbytes` library in Python (`pip install bitsandbytes`).
### GPU Loading (Fastest)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # load the weights in their native FP16
    device_map="auto",
)
```
### CPU Loading
If running on lower-end CPUs, load the model in **4-bit** to shrink the weights further and reduce memory-bandwidth pressure:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```
### Generate
Use the following snippet to chat with the model. It applies the model's chat template and assumes `model` and `tokenizer` were loaded as shown above.

```python
import torch

# Define your message(s)
messages = [
    {"role": "user", "content": "Explain the concept of gravity."}
]

# Render the chat template and tokenize
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.2,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
#### Example Output (4-bit Quantized)
* **Prompt:** "Explain the concept of gravity."
* **Output:**
> Gravity is indeed one of the most fundamental concepts in physics and mathematics. It's essentially the "attraction" between two bodies or masses. According to Einstein's theory of general relativity, mass warps space-time around it, creating a gravitational field that attracts other objects with mass. This means that anything having mass has a gravitational pull on other matter, making them feel heavy. For example, when you drop an object, you're not really feeling its weight; rather, you're feeling the gravitational force exerted by the Earth. The actual weight of an object depends on how massive the object itself is, which can be calculated using formulas like F=G x m/r^2 where G is the gravitational constant, **[hit 150 token limit]**
## Limitations & Bias
As a pruned version of SmolLM2, this model inherits the biases of its parent. While the pruning was found to be stable, users may encounter slight regressions in mathematical reasoning compared to the full model.