---
license: mit
language: en
---

# Smoothie: A Diffusion Model for Paraphrase Generation

[![Generic badge](https://img.shields.io/badge/Model-Custom_Smoothie-blue.svg)](https://shields.io/)
[![Generic badge](https://img.shields.io/badge/Dataset-QQP-green.svg)](https://huggingface.co/datasets/glue)
[![Generic badge](https://img.shields.io/badge/Paper-arXiv:2505.18853v1-red.svg)](https://arxiv.org/abs/2505.18853)

This repository contains a diffusion-based model for text generation, trained on the **Quora Question Pairs (QQP)** dataset for the task of **paraphrasing**. The architecture and training methodology are based on the paper *Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation*.

This is a custom model and **requires `trust_remote_code=True`** to load, because its architecture is defined in the accompanying `modeling_smoothie.py` file.

## Model Description

The "Smoothie" model is a non-autoregressive text generation model that uses a diffusion process. Unlike autoregressive models, which generate text token by token, this model starts from pure random noise and iteratively refines it over hundreds of steps to produce a full sentence.

The key features of the architecture are:

- **Diffusion Process:** Operates in a continuous space based on the negative squared Euclidean distances between token embeddings. This allows the model to smoothly add and remove "semantic noise".
- **Backbone:** A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks.
- **Conditional Generation:** The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase).

This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using `bert-base-cased` as the base for its token embeddings.

---

## How to Use

The following is a complete, self-contained example of how to load the model and use it for inference.
The `SmoothieDiffusion` class, which orchestrates the multi-step generation process, is included for convenience.

First, make sure you have the necessary libraries installed:

```bash
pip install torch transformers accelerate huggingface_hub -q
```

Then, you can run the following Python script:

```python
import math

import torch
from tqdm.auto import tqdm
from transformers import AutoModel, AutoTokenizer, BertModel

# =============================================================================
# PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC)
# This class is required to use the Smoothie model for generation.
# =============================================================================

def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Generates the noise schedule used during training."""
    t = torch.arange(0, T + 1, dtype=torch.float32)
    ratio = t / (T - t + epsilon)
    arg = (1 / d) * ratio
    schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min
    # Pin both endpoints of the schedule exactly.
    schedule[0] = s_min
    schedule[T] = s_max
    return schedule


class SmoothieDiffusion:
    """The inference pipeline for the Smoothie model."""

    def __init__(self, E, schedule):
        # Keep everything on the device of E so the pipeline also runs on CPU.
        self.E = E  # The semantic map (embedding matrix), shape (V, D)
        self.V, self.D = E.shape
        self.sigmas = schedule.to(E.device)  # The blueprint (noise schedule)
        self.T = len(schedule) - 1

    @torch.no_grad()
    def get_D0(self, target_embeddings):
        """Memory-efficient calculation of the distance matrix D0.

        Expands -||x - e||^2 as -(||x||^2 + ||e||^2 - 2 x.e) to avoid
        materializing a (batch, seq_len, V, D) tensor.
        """
        term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True)
        term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0)
        term3 = -2 * torch.matmul(target_embeddings, self.E.T)
        return -(term1 + term2 + term3)

    @torch.no_grad()
    def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None):
        """A single reverse diffusion (denoising) step."""
        p_t = torch.softmax(D_t, dim=-1)
        weighted_avg_emb = torch.matmul(p_t, self.E)

        # One timestep value per batch element.
        t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long)
        pred_E0 = model(
            weighted_avg_emb=weighted_avg_emb,
            t=t_tensor,
            src_tokens=src_tokens,
            src_mask=src_mask,
        )
        pred_D0 = self.get_D0(pred_E0)

        if t == 0:
            return pred_D0

        sigma_t_minus_1 = self.sigmas[t - 1]
        D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2)
        if delta_gen > 0:
            D_t_minus_1 += delta_gen * torch.randn_like(D_t)
        return D_t_minus_1

    @torch.no_grad()
    def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None):
        """The full denoising loop from T to 0."""
        device = self.E.device
        D_t = torch.randn(shape, device=device) * delta_gen

        for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1):
            D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask)
        return D_t


# =============================================================================
# PART 2: LOADING THE MODEL AND RUNNING INFERENCE
# =============================================================================

# --- Configuration ---
# Replace with your own username and repo name if you forked this
repo_id = "your-hf-username/smoothie-diffusion-qqp"
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load Model and Tokenizer from the Hub ---
print(f"Loading tokenizer and model from: {repo_id}")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# `trust_remote_code=True` is essential to load the custom SmoothieModel architecture
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()
print("\nModel loaded successfully from the Hub!")

# --- Prepare Diffusion Components ---
print("Preparing the embedding matrix for the diffusion process...")
bert_for_embeddings = BertModel.from_pretrained("bert-base-cased")
embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device)
# Standardize each embedding dimension across the vocabulary.
mean = embedding_matrix.mean(0, keepdim=True)
std = embedding_matrix.std(0, keepdim=True)
embedding_matrix = (embedding_matrix - mean) / std

# Recreate the exact noise schedule and initialize the diffusion pipeline
DIFFUSION_STEPS = 200
DELTA_GEN = 0.25
noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS)
diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule)
print("Diffusion components are ready.")

# --- Run Inference ---
source_question = "How can I become a better writer?"
print(f"\nSource Question: {source_question}")

inputs = tokenizer(
    source_question,
    max_length=model.config.max_seq_len,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
src_tokens = inputs["input_ids"].to(device)
# True at padding positions.
src_mask = (src_tokens == tokenizer.pad_token_id).to(device)

generated_D0 = diffusion_pipeline.p_sample_loop(
    model,
    shape=(1, model.config.max_seq_len, model.config.vocab_size),
    delta_gen=DELTA_GEN,
    src_tokens=src_tokens,
    src_mask=src_mask,
)

# --- Decode and Display the Result ---
# Pick the closest vocabulary entry at each position, then decode the single batch item.
output_tokens = torch.argmax(generated_D0, dim=-1)
decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("-" * 30)
print(f"Generated Paraphrase: {decoded_text}")
print("-" * 30)
```

---

## Training Details

This model was trained from scratch.

- **Dataset:** `glue/qqp`, filtered for positive pairs (is_duplicate = 1).
- **Training Steps:** 25,000
- **Batch Size:** 16
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Hardware:** Trained on a single NVIDIA T4 GPU on Google Colab.

### Limitations and Bias

- The model's knowledge is limited to the topics present in the Quora Question Pairs dataset. It may perform poorly on highly specialized or out-of-domain topics.
- As with any model trained on user-generated internet text, it may reflect societal biases present in the training data.
- The model is currently undertrained and may not always produce semantically faithful paraphrases. Continued training would likely improve its accuracy.
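
## Note on the Distance-Based Representation

The Model Description says the diffusion operates on negative squared Euclidean distances between token embeddings, and the inference script divides those distances by `sigma ** 2` before a softmax. The sketch below illustrates that idea on a toy example and is not the model's actual code. The three-token vocabulary, the 2-D embeddings, and the helper names `neg_sq_dist` and `softmax` are all hypothetical, and plain Python is used in place of PyTorch so the snippet runs without dependencies. It also omits the Gaussian noise term that the script adds when `delta_gen > 0`.

```python
import math


def neg_sq_dist(x, E):
    """Return -||x - e||^2 for every row e of the embedding table E."""
    return [-sum((xi - ei) ** 2 for xi, ei in zip(x, e)) for e in E]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


# Hypothetical 3-token vocabulary with 2-D embeddings (illustration only).
E = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
x = [0.9, 0.1]  # a predicted embedding that lies closest to token 1

# One row of the distance matrix: larger (less negative) means closer.
D0 = neg_sq_dist(x, E)

# Dividing by sigma^2 controls how peaked the token distribution is:
# a small sigma concentrates mass on the nearest token, a large sigma
# flattens the distribution toward uniform.
for sigma in (0.5, 1.5, 200.0):
    probs = softmax([d / sigma ** 2 for d in D0])
    print(f"sigma={sigma:>5}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Under these assumptions, the nearest token dominates at small `sigma`, while at `sigma = 200.0` (the `s_max` default in `get_noise_schedule`) the three probabilities are nearly equal. This is one way to read the phrase "semantic noise": higher noise levels blur the distinction between nearby and distant tokens rather than corrupting individual symbols.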