---
license: cc-by-4.0
datasets:
- sixf0ur/ScentSet
language:
- en
tags:
- chemistry
- biology
- climate
- medical
- text-generation-inference
- tiny
- scent
- smell
pipeline_tag: text-generation
---

# ScentLLaMA

A tiny LLaMA-based language model with ~600k parameters, pretrained specifically on the synthetic ScentSet dataset (572k entries, ~15M tokens). It is designed exclusively to describe and classify smells and aromas.

## Model Details

- **Parameters:** ~600,000
- **Task:** Text generation of smell descriptions
- **Training data:** ScentSet (synthetic dataset of smell descriptions)
- **Training date:** July 2025
- **License:** CC BY 4.0
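
As a quick sanity check (not part of the original card), the parameter count can be read off the loaded model with `num_parameters()`:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sixf0ur/ScentLLaMA")
# Counts all parameters (trainable and not); should be roughly 600k.
print(f"{model.num_parameters():,}")
```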

### Training & Evaluation Loss

The following plot shows the training and evaluation loss over time. Training was performed for approximately **160,000 steps**.

The evaluation loss remains consistently close to the training loss throughout training (within ~0.01), indicating that the model generalizes well and shows no signs of overfitting. The training arguments are shown below:
```python
import os

from transformers import TrainingArguments

# Checkpoint directory; the actual path used in training is not part of this card.
OUTPUT_DIR = "./scentllama"

TRAINING_ARGS = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    warmup_steps=500,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    logging_dir=os.path.join(OUTPUT_DIR, "logs"),
    logging_steps=100,
    save_steps=500,
    eval_steps=500,
    eval_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=2,
    fp16=True,
    report_to="tensorboard",
)
```

![Training and Eval Loss](./train_eval_loss.png)

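For context, these arguments would be passed to a `Trainer` together with the model and a tokenized copy of ScentSet. Below is a minimal sketch, assuming a CUDA device (`fp16=True` requires one); the config values and the `tokenizer`, `train_dataset`, and `eval_dataset` objects are placeholders, not the actual training setup:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LlamaConfig,
    LlamaForCausalLM,
    Trainer,
)

# Placeholder tiny LLaMA config; the real ScentLLaMA hyperparameters
# (hidden size, depth, vocabulary) are not listed on this card.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    vocab_size=tokenizer.vocab_size,
)
model = LlamaForCausalLM(config)

trainer = Trainer(
    model=model,
    args=TRAINING_ARGS,
    train_dataset=train_dataset,  # tokenized ScentSet train split (placeholder)
    eval_dataset=eval_dataset,    # tokenized ScentSet eval split (placeholder)
    # mlm=False makes the collator copy input_ids into labels for causal-LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
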
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hub.
model_name = "sixf0ur/ScentLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a smell description from a short prompt.
prompt = "A fresh and fruity aroma with hints of"
inputs = tokenizer(prompt, return_token_type_ids=False, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# > A fresh and fruity aroma with hints of green leaves and a hint of something earthy. It is a ripe plum.
```
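
`generate` defaults to greedy decoding, so the snippet above always returns the same continuation. Enabling sampling produces varied descriptions; the `temperature` and `top_p` values below are illustrative, not tuned for this model:

```python
# Sample instead of greedy decoding to get a different description per call.
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```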

### Citation

```bibtex
@misc{ScentLLaMA_2025,
  author       = {David S.},
  title        = {ScentLLaMA: A tiny LLaMA Model for Smell Description Generation},
  year         = {2025},
  publisher    = {Hugging Face Models},
  howpublished = {\url{https://huggingface.co/sixf0ur/ScentLLaMA}},
  note         = {Pretrained on the ScentSet dataset to generate natural language descriptions of smells}
}
```