---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

From-scratch pretraining on English-only data: no synthetic data, no code, 3 epochs over 1 GB of data for the ~125M param model.

Test network using [Tensor Product Attention](https://arxiv.org/abs/2501.06425) (TPA). Other than some alterations to the attention, such as 16 heads instead of 9 and using TPA, this is the same setup as [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct).
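
For reference, here is a minimal PyTorch sketch of the TPA idea from the paper: each token's Q, K, and V heads are reconstructed as a sum of rank-1 outer products between a learned head-mixing factor and a feature factor. The ranks, head dim, and module layout below are illustrative assumptions (RoPE and other details are omitted), not this checkpoint's exact module; see the T6 repo linked under Code for the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPAttention(nn.Module):
    """Sketch of Tensor Product Attention (arXiv:2501.06425); dims illustrative."""

    def __init__(self, d_model=576, n_heads=16, head_dim=64, q_rank=6, kv_rank=2):
        super().__init__()
        self.h, self.d = n_heads, head_dim
        self.rq, self.rkv = q_rank, kv_rank
        # Per-token factors: the "a" projections produce head-mixing vectors
        # (size h), the "b" projections produce feature vectors (size d).
        self.qa = nn.Linear(d_model, q_rank * n_heads)
        self.qb = nn.Linear(d_model, q_rank * head_dim)
        self.ka = nn.Linear(d_model, kv_rank * n_heads)
        self.kb = nn.Linear(d_model, kv_rank * head_dim)
        self.va = nn.Linear(d_model, kv_rank * n_heads)
        self.vb = nn.Linear(d_model, kv_rank * head_dim)
        self.out = nn.Linear(n_heads * head_dim, d_model)

    def _factored_heads(self, x, a_proj, b_proj, rank):
        B, T, _ = x.shape
        a = a_proj(x).view(B, T, rank, self.h)  # (B, T, R, h)
        b = b_proj(x).view(B, T, rank, self.d)  # (B, T, R, d)
        # Sum of rank-1 outer products, scaled by 1/R, giving full (h x d) heads.
        return torch.einsum("btrh,btrd->bthd", a, b) / rank

    def forward(self, x):
        B, T, _ = x.shape
        q = self._factored_heads(x, self.qa, self.qb, self.rq)
        k = self._factored_heads(x, self.ka, self.kb, self.rkv)
        v = self._factored_heads(x, self.va, self.vb, self.rkv)
        # Standard causal attention over the reconstructed heads.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, h, T, d)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(o.transpose(1, 2).reshape(B, T, self.h * self.d))
```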

# Scripts:

- `inference.py` runs the model with a few test prompts
- `test_train.py` runs with the exact configuration used to train this model and serves as the reproduction script. Data is assumed to be in JSONL format, one JSON object per line with a `"text"` field, e.g. `{"text": "example text"}`; a minimal reader sketch follows below.
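
A minimal sketch of reading that layout (the file path and helper name are hypothetical):

```python
import json

def iter_texts(path="train.jsonl"):  # hypothetical path
    """Yield the "text" field from each non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]
```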

# Notes:

One of the primary reported benefits of TPA is at inference time (a much smaller KV cache), which is not really being leveraged here, although you can probably fit a larger batch size than with traditional MHA/GQA. TPA did save about 5% on parameters, and that saving should grow as the network size increases. Runtime is very similar to MHA/GQA at this scale.
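
For intuition, a rough per-token, per-layer KV-cache comparison. The head count matches this model (16), but the head dim and TPA rank are assumed values, so treat the ratio as illustrative only:

```python
h, d, r = 16, 64, 2        # heads, head dim, TPA K/V rank (illustrative)
mha_kv = 2 * h * d         # MHA caches full K and V heads: 2048 values/token
tpa_kv = 2 * r * (h + d)   # TPA caches only the rank-r factors: 320 values/token
print(f"{mha_kv} vs {tpa_kv} values per token per layer "
      f"({mha_kv / tpa_kv:.1f}x smaller)")  # -> 2048 vs 320 (6.4x smaller)
```

The smaller cache is where the larger-batch-size headroom comes from.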

# Training Metrics

## Dataset Information

- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results

- Final Train Loss: 3.0421
- Final Train Perplexity: 20.95
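
As a sanity check, perplexity here is just the exponential of the cross-entropy loss:

```python
import math

print(math.exp(3.0421))  # ~20.95, matching the reported perplexity
```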

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65e330e7edc2f7306e252448/UnPM7TUmR0EYbAI5exPLH.png)

# Code

The code for Tensor Product Attention is available at [tensorgi/T6](https://github.com/tensorgi/T6).