	---
	license: mit
	language: en
	---

	# Smoothie: A Diffusion Model for Paraphrase Generation

	[![Generic badge](https://img.shields.io/badge/Model-Custom_Smoothie-blue.svg)](https://shields.io/)
	[![Generic badge](https://img.shields.io/badge/Dataset-QQP-green.svg)](https://huggingface.co/datasets/glue)
	[![Generic badge](https://img.shields.io/badge/Paper-arXiv:2505.18853v1-red.svg)](https://arxiv.org/abs/2505.18853)

	This repository contains a diffusion-based text generation model trained on the Quora Question Pairs (QQP) dataset for paraphrasing. The architecture and training methodology follow the paper [Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation](https://arxiv.org/abs/2505.18853).

	This is a custom model and requires `trust_remote_code=True` to load, as the model's architecture is defined in the accompanying `modeling_smoothie.py` file.

	## Model Description

	The "Smoothie" model is a non-autoregressive text generation model that uses a diffusion process. Unlike autoregressive models, which generate text one token at a time, it starts from pure random noise and iteratively refines the whole sequence over a fixed number of denoising steps (`DIFFUSION_STEPS = 200` in the example below) to produce a full sentence.

	The key features of the architecture are:
	- Diffusion Process: Operates in a continuous space based on the negative squared Euclidean distances between token embeddings. This allows the model to smoothly add and remove "semantic noise".
	- Backbone: A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks.
	- Conditional Generation: The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase).

	This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using `bert-base-cased` as the base for its token embeddings.
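
The smoothing idea in the first bullet can be illustrated with a toy example. The sketch below is not the model's code: it assumes a hypothetical four-token vocabulary with 2-D embeddings and uses only the standard library. It computes the negative squared Euclidean distance from one token's embedding to every vocabulary embedding, divides by σ², and applies a softmax. The helper names are introduced here for illustration.

```python
import math

# Hypothetical toy vocabulary: 4 tokens with 2-D embeddings (illustration only).
EMBEDDINGS = [
    [0.0, 0.0],  # token 0
    [1.0, 0.0],  # token 1 (close to token 0)
    [0.0, 2.0],  # token 2
    [3.0, 3.0],  # token 3 (far from token 0)
]

def neg_sq_distances(x, embeddings):
    """Return -||x - e||^2 for every embedding e in the vocabulary."""
    return [-sum((xi - ei) ** 2 for xi, ei in zip(x, e)) for e in embeddings]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def smoothed_distribution(token_id, sigma, embeddings=EMBEDDINGS):
    """Distribution over the vocabulary after smoothing token_id at noise level sigma."""
    d0 = neg_sq_distances(embeddings[token_id], embeddings)
    return softmax([d / sigma ** 2 for d in d0])

if __name__ == "__main__":
    for sigma in (0.3, 1.5, 200.0):
        probs = smoothed_distribution(0, sigma)
        print(sigma, [round(p, 3) for p in probs])
```

With a small σ the distribution stays concentrated on the original token; as σ grows, probability mass moves first to nearby embeddings and eventually spreads almost uniformly over the vocabulary.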

	---

	## How to Use

	The following is a complete, self-contained example of how to load the model and use it for inference. The `SmoothieDiffusion` class, which orchestrates the multi-step generation process, is included for convenience.

	First, make sure you have the necessary libraries installed:
	```bash
	pip install torch transformers accelerate huggingface_hub -q
	```

	Then, you can run the following Python script:

	```python
	import torch
	import torch.nn as nn
	from transformers import AutoTokenizer, AutoModel, BertModel
	from tqdm.auto import tqdm
	import math

	# =============================================================================
	# PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC)
	# This class is required to use the Smoothie model for generation.
	# =============================================================================

	def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
	    """Generates the noise schedule used during training."""
	    t = torch.arange(0, T + 1, dtype=torch.float32)
	    ratio = t / (T - t + epsilon)
	    arg = (1 / d) * ratio
	    schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min
	    # Pin the endpoints exactly: sigma_0 = s_min and sigma_T = s_max.
	    schedule[0] = s_min
	    schedule[T] = s_max
	    return schedule

	class SmoothieDiffusion:
	    """The inference pipeline for the Smoothie model."""
	    def __init__(self, E, schedule):
	        # Keep everything on E's device so the CPU fallback below also works.
	        self.E = E  # The semantic map (embedding matrix)
	        self.V, self.D = E.shape
	        self.sigmas = schedule.to(E.device)  # The blueprint (noise schedule)
	        self.T = len(schedule) - 1

	    @torch.no_grad()
	    def get_D0(self, target_embeddings):
	        """Memory-efficient calculation of the distance matrix D0."""
	        # Expands -||x - e||^2 as -(||x||^2 + ||e||^2 - 2 x.e),
	        # which avoids materializing a (batch, seq_len, vocab, dim) tensor.
	        term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True)
	        term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0)
	        term3 = -2 * torch.matmul(target_embeddings, self.E.T)
	        return -(term1 + term2 + term3)

	    @torch.no_grad()
	    def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None):
	        """A single reverse diffusion (denoising) step."""
	        p_t = torch.softmax(D_t, dim=-1)
	        weighted_avg_emb = torch.matmul(p_t, self.E)
	        # One timestep index per batch element.
	        t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long)

	        pred_E0 = model(
	            weighted_avg_emb=weighted_avg_emb,
	            t=t_tensor,
	            src_tokens=src_tokens,
	            src_mask=src_mask
	        )

	        pred_D0 = self.get_D0(pred_E0)
	        if t == 0:
	            return pred_D0

	        # Re-noise the prediction to the level of the previous timestep.
	        sigma_t_minus_1 = self.sigmas[t-1]
	        D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2)
	        if delta_gen > 0:
	            D_t_minus_1 += delta_gen * torch.randn_like(D_t)
	        return D_t_minus_1

	    @torch.no_grad()
	    def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None):
	        """The full denoising loop from T to 0."""
	        device = self.E.device
	        D_t = torch.randn(shape, device=device) * delta_gen
	        for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1):
	            D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask)
	        return D_t

	# =============================================================================
	# PART 2: LOADING THE MODEL AND RUNNING INFERENCE
	# =============================================================================

	# --- Configuration ---
	# Replace with your own username and repo name if you forked this
	repo_id = "your-hf-username/smoothie-diffusion-qqp"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# --- Load Model and Tokenizer from the Hub ---
	print(f"Loading tokenizer and model from: {repo_id}")
	tokenizer = AutoTokenizer.from_pretrained(repo_id)

	# `trust_remote_code=True` is essential to load the custom SmoothieModel architecture
	model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
	model.eval()
	print("\nModel loaded successfully from the Hub!")

	# --- Prepare Diffusion Components ---
	print("Preparing the embedding matrix for the diffusion process...")
	bert_for_embeddings = BertModel.from_pretrained("bert-base-cased")
	embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device)
	mean = embedding_matrix.mean(0, keepdim=True)
	std = embedding_matrix.std(0, keepdim=True)
	embedding_matrix = (embedding_matrix - mean) / std

	# Recreate the exact noise schedule and initialize the diffusion pipeline
	DIFFUSION_STEPS = 200
	DELTA_GEN = 0.25
	noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS)
	diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule)
	print("Diffusion components are ready.")

	# --- Run Inference ---
	source_question = "How can I become a better writer?"
	print(f"\nSource Question: {source_question}")

	inputs = tokenizer(
	    source_question,
	    max_length=model.config.max_seq_len,
	    padding="max_length",
	    truncation=True,
	    return_tensors="pt"
	)
	src_tokens = inputs['input_ids'].to(device)
	src_mask = (src_tokens == tokenizer.pad_token_id).to(device)

	generated_D0 = diffusion_pipeline.p_sample_loop(
	    model,
	    shape=(1, model.config.max_seq_len, model.config.vocab_size),
	    delta_gen=DELTA_GEN,
	    src_tokens=src_tokens,
	    src_mask=src_mask
	)

	# --- Decode and Display the Result ---
	output_tokens = torch.argmax(generated_D0, dim=-1)
	# argmax returns shape (1, seq_len); decode the single sequence in the batch.
	decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

	print("-" * 30)
	print(f"Generated Paraphrase: {decoded_text}")
	print("-" * 30)

	```
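
To inspect the values that `get_noise_schedule` produces without loading PyTorch, here is a standard-library sketch of the same arctan formula. It assumes the endpoints are meant to be pinned to `s_min` at step 0 and `s_max` at step `T`; the function name `noise_schedule` is introduced here for illustration only.

```python
import math

def noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Arctan noise schedule: sigma_t for t = 0..T, rising from s_min to s_max."""
    sigmas = []
    for t in range(T + 1):
        ratio = t / (T - t + epsilon)  # 0 at t = 0, very large at t = T
        sigma = (s_max - s_min) * (2 / math.pi) * math.atan(ratio / d) + s_min
        sigmas.append(sigma)
    # Assumption: pin both endpoints exactly, as the script's schedule appears to intend.
    sigmas[0] = s_min
    sigmas[T] = s_max
    return sigmas

if __name__ == "__main__":
    sigmas = noise_schedule(T=200)
    for t in (0, 50, 100, 150, 200):
        print(f"t={t:3d}  sigma={sigmas[t]:.2f}")
```

With the default `d=9.0`, σ stays comparatively small for most of the schedule and climbs steeply only as `t` approaches `T`, so most denoising steps operate at low noise levels.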

	---

	## Training Details

	This model was trained from scratch.

	- Dataset: `glue/qqp`, filtered for positive pairs (is_duplicate = 1).
	- Training Steps: 25,000
	- Batch Size: 16
	- Optimizer: AdamW
	- Learning Rate: 2e-4
	- Hardware: Trained on a single NVIDIA T4 GPU via Google Colab.
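
The dataset filter listed above can be sketched as follows. The rows are made-up stand-ins, and the sketch assumes the GLUE QQP column names (`question1`, `question2`, `label`), where `label == 1` marks a duplicate pair; the helper names are hypothetical and this is not the training script itself.

```python
# Made-up rows that mimic the assumed GLUE QQP schema (illustration only).
rows = [
    {"question1": "How can I become a better writer?",
     "question2": "How do I improve my writing skills?", "label": 1},
    {"question1": "What is the capital of France?",
     "question2": "How do I learn French?", "label": 0},
    {"question1": "How do I lose weight fast?",
     "question2": "What is the quickest way to lose weight?", "label": 1},
]

def is_paraphrase_pair(row):
    """Keep only pairs annotated as duplicates (positive pairs)."""
    return row["label"] == 1

def to_training_pairs(rows):
    """Build (source, target) tuples from the positive pairs only."""
    return [(r["question1"], r["question2"]) for r in rows if is_paraphrase_pair(r)]

if __name__ == "__main__":
    for source, target in to_training_pairs(rows):
        print(f"{source!r} -> {target!r}")
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter` on the real training split.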

	## Limitations and Bias

	- The model's knowledge is limited to the topics present in the Quora Questions dataset. It may perform poorly on highly specialized or out-of-domain topics.
	- As with any model trained on large-scale internet text, it may reflect societal biases present in the training data.
	- The model is currently undertrained and may not always produce semantically perfect paraphrases. Continued training would improve its accuracy.
