|
|
| --- |
| license: mit |
| language: en |
| --- |
| |
| # Smoothie: A Diffusion Model for Paraphrase Generation |
|
|
| [Dataset: GLUE](https://huggingface.co/datasets/glue) |
| [Paper: arXiv:2505.18853](https://arxiv.org/abs/2505.18853) |
|
|
| This repository contains a diffusion-based model for text generation, trained on the **Quora Question Pairs (QQP)** dataset for the task of **paraphrasing**. The architecture and training methodology are based on the paper *Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation*. |
|
|
| This is a custom model and **requires `trust_remote_code=True`** to load, as the model's architecture is defined in the accompanying `modeling_smoothie.py` file. |
|
|
| ## Model Description |
|
|
| The "Smoothie" model is a non-autoregressive text generation model that uses a diffusion process. Unlike autoregressive models, which generate text token by token, this model starts from pure random noise and iteratively refines it (over 200 diffusion steps in the example below) to produce a full sentence. |
|
|
| The key features of the architecture are: |
| - **Diffusion Process:** Operates in a continuous space based on the negative squared Euclidean distances between token embeddings. This allows the model to smoothly add and remove "semantic noise". |
| - **Backbone:** A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks. |
| - **Conditional Generation:** The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase). |
|
|
| This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using `bert-base-cased` as the base for its token embeddings. |
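
To make the distance-based diffusion space described above more concrete, here is a minimal toy sketch in plain Python. It assumes a hypothetical three-token vocabulary with 2-D embeddings (`toy_vocab`), and the helper names `neg_sq_distances` and `softmax` are illustrative only; they are not part of `modeling_smoothie.py`. The sketch scores one token against every vocabulary entry by negative squared Euclidean distance, then shows how dividing by a small or large `sigma ** 2` before the softmax gives a sharp or a smoothed distribution over tokens.

```python
import math

def neg_sq_distances(x, vocab):
    """Return -||x - e||^2 for every embedding e in vocab."""
    return [-sum((xi - ei) ** 2 for xi, ei in zip(x, e)) for e in vocab]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical 2-D embeddings for a three-token vocabulary (illustration only).
toy_vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]

# Distance row for token 0: zero to itself, more negative for farther tokens.
d0 = neg_sq_distances(toy_vocab[0], toy_vocab)

# A small sigma keeps almost all probability mass on the true token ...
sharp = softmax([d / 0.5 ** 2 for d in d0])
# ... while a large sigma spreads it almost evenly over the vocabulary.
smooth = softmax([d / 10.0 ** 2 for d in d0])

print("distances:", d0)
print("sigma=0.5 :", [round(p, 3) for p in sharp])
print("sigma=10  :", [round(p, 3) for p in smooth])
```

Nearby tokens keep more probability mass than distant ones at every noise level, which is one way to read the "semantic noise" idea: increasing `sigma` blurs a token toward its neighbours in embedding space first.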
|
|
| --- |
|
|
| ## How to Use |
|
|
| The following is a complete, self-contained example of how to load the model and use it for inference. The `SmoothieDiffusion` class, which orchestrates the multi-step generation process, is included for convenience. |
|
|
| First, make sure you have the necessary libraries installed: |
| ```bash |
| pip install torch transformers accelerate huggingface_hub -q |
| ``` |
|
|
| Then, you can run the following Python script: |
|
|
| ```python |
| import torch |
| import torch.nn as nn |
| from transformers import AutoTokenizer, AutoModel, BertModel |
| from tqdm.auto import tqdm |
| import math |
| |
| # ============================================================================= |
| # PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC) |
| # This class is required to use the Smoothie model for generation. |
| # ============================================================================= |
| |
| def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5): |
|     """Generates the noise schedule used during training.""" |
|     t = torch.arange(0, T + 1, dtype=torch.float32) |
|     ratio = t / (T - t + epsilon) |
|     arg = (1/d) * ratio |
|     schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min |
|     # Pin both endpoints exactly: index 0 is the least noisy step, index T the noisiest. |
|     schedule[0] = s_min |
|     schedule[T] = s_max |
|     return schedule |
| |
| class SmoothieDiffusion: |
|     """The inference pipeline for the Smoothie model.""" |
|     def __init__(self, E, schedule): |
|         # Keep everything on the device of E so the pipeline also runs on CPU. |
|         self.E = E  # The semantic map (embedding matrix) |
|         self.V, self.D = E.shape |
|         self.sigmas = schedule.to(E.device)  # The blueprint (noise schedule) |
|         self.T = len(schedule) - 1 |
| |
|     @torch.no_grad() |
|     def get_D0(self, target_embeddings): |
|         """Memory-efficient calculation of the distance matrix D0.""" |
|         # -||x - e||^2 expanded as -(||x||^2 + ||e||^2 - 2 x.e) |
|         term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True) |
|         term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0) |
|         term3 = -2 * torch.matmul(target_embeddings, self.E.T) |
|         return -(term1 + term2 + term3) |
| |
|     @torch.no_grad() |
|     def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None): |
|         """A single reverse diffusion (denoising) step.""" |
|         p_t = torch.softmax(D_t, dim=-1) |
|         weighted_avg_emb = torch.matmul(p_t, self.E) |
|         # One timestep value per batch element. |
|         t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long) |
| |
|         pred_E0 = model( |
|             weighted_avg_emb=weighted_avg_emb, |
|             t=t_tensor, |
|             src_tokens=src_tokens, |
|             src_mask=src_mask |
|         ) |
| |
|         pred_D0 = self.get_D0(pred_E0) |
|         if t == 0: |
|             return pred_D0 |
| |
|         sigma_t_minus_1 = self.sigmas[t-1] |
|         D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2) |
|         if delta_gen > 0: |
|             D_t_minus_1 += delta_gen * torch.randn_like(D_t) |
|         return D_t_minus_1 |
| |
|     @torch.no_grad() |
|     def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None): |
|         """The full denoising loop from T to 0.""" |
|         device = self.E.device |
|         D_t = torch.randn(shape, device=device) * delta_gen |
|         for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1): |
|             D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask) |
|         return D_t |
| |
| # ============================================================================= |
| # PART 2: LOADING THE MODEL AND RUNNING INFERENCE |
| # ============================================================================= |
| |
| # --- Configuration --- |
| # Replace with your own username and repo name if you forked this |
| repo_id = "your-hf-username/smoothie-diffusion-qqp" |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| |
| # --- Load Model and Tokenizer from the Hub --- |
| print(f"Loading tokenizer and model from: {repo_id}") |
| tokenizer = AutoTokenizer.from_pretrained(repo_id) |
| |
| # `trust_remote_code=True` is essential to load the custom SmoothieModel architecture |
| model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device) |
| model.eval() |
| print("\nModel loaded successfully from the Hub!") |
| |
| # --- Prepare Diffusion Components --- |
| print("Preparing the embedding matrix for the diffusion process...") |
| bert_for_embeddings = BertModel.from_pretrained("bert-base-cased") |
| embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device) |
| mean = embedding_matrix.mean(0, keepdim=True) |
| std = embedding_matrix.std(0, keepdim=True) |
| embedding_matrix = (embedding_matrix - mean) / std |
| |
| # Recreate the exact noise schedule and initialize the diffusion pipeline |
| DIFFUSION_STEPS = 200 |
| DELTA_GEN = 0.25 |
| noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS) |
| diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule) |
| print("Diffusion components are ready.") |
| |
| # --- Run Inference --- |
| source_question = "How can I become a better writer?" |
| print(f"\nSource Question: {source_question}") |
| |
| inputs = tokenizer( |
|     source_question, |
|     max_length=model.config.max_seq_len, |
|     padding="max_length", |
|     truncation=True, |
|     return_tensors="pt" |
| ) |
| src_tokens = inputs['input_ids'].to(device) |
| src_mask = (src_tokens == tokenizer.pad_token_id).to(device) |
| |
| generated_D0 = diffusion_pipeline.p_sample_loop( |
|     model, |
|     shape=(1, model.config.max_seq_len, model.config.vocab_size), |
|     delta_gen=DELTA_GEN, |
|     src_tokens=src_tokens, |
|     src_mask=src_mask |
| ) |
| |
| # --- Decode and Display the Result --- |
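| # argmax over the vocabulary axis gives token ids of shape (1, seq_len); decode the single sequence |
| output_tokens = torch.argmax(generated_D0, dim=-1) |
| decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True) |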
| |
| print("-" * 30) |
| print(f"Generated Paraphrase: {decoded_text}") |
| print("-" * 30) |
| |
| ``` |
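
To get a feel for the arctan noise schedule used in the script above, the following sketch recomputes it with only the standard library, so it can be inspected without a GPU or PyTorch. The function name `arctan_schedule` is hypothetical; the formula and the default values (`s_min=1.5`, `s_max=200.0`, `d=9.0`, `epsilon=1e-5`) are copied from `get_noise_schedule`, and the endpoints are pinned to `s_min` and `s_max` on the assumption that this is what the original function intends.

```python
import math

def arctan_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Pure-Python version of the arctan noise schedule with pinned endpoints."""
    sigmas = []
    for t in range(T + 1):
        ratio = t / (T - t + epsilon)  # grows from 0 toward a very large value
        sigmas.append((s_max - s_min) * (2 / math.pi) * math.atan(ratio / d) + s_min)
    # Pin the first and last entries exactly (assumed intent of the original code).
    sigmas[0], sigmas[T] = s_min, s_max
    return sigmas

sigmas = arctan_schedule(T=200)
for step in (0, 50, 100, 150, 200):
    print(f"t={step:3d}  sigma={sigmas[step]:.2f}")
```

With these defaults the noise level stays low for the first half of the steps and rises steeply near `t = T`, so most of the schedule is spent at noise levels where the token distribution is still fairly peaked.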
|
|
| --- |
|
|
| ## Training Details |
|
|
| This model was trained from scratch. |
|
|
| - **Dataset:** `glue/qqp`, filtered for positive pairs (is_duplicate = 1). |
| - **Training Steps:** 25,000 |
| - **Batch Size:** 16 |
| - **Optimizer:** AdamW |
| - **Learning Rate:** 2e-4 |
| - **Hardware:** Trained on a single NVIDIA T4 GPU via Google Colab. |
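
The positive-pair filtering mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the training script itself: the helper `keep_positive_pairs` is hypothetical, the field names `question1`, `question2`, and `label` follow the GLUE QQP layout on the Hub as far as I know (the original Quora release names the label column `is_duplicate`), and the toy rows are made up for the example.

```python
def keep_positive_pairs(rows):
    """Return (source, target) question pairs for rows marked as duplicates."""
    return [(row["question1"], row["question2"]) for row in rows if row["label"] == 1]

# Hypothetical rows shaped like GLUE QQP records (illustrative text, not real data).
toy_rows = [
    {"question1": "How do I learn Python?",
     "question2": "What is the best way to learn Python?",
     "label": 1},
    {"question1": "How do I learn Python?",
     "question2": "How do I cook rice?",
     "label": 0},
]

pairs = keep_positive_pairs(toy_rows)
print(pairs)
```

With the `datasets` library, the same predicate could presumably be applied to the real data through `Dataset.filter` after loading it with `load_dataset("glue", "qqp")`.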
| |
| ### Limitations and Bias |
| |
| - The model's knowledge is limited to the topics present in the Quora Question Pairs dataset. It may perform poorly on highly specialized or out-of-domain topics. |
| - As with any model trained on user-generated internet text, it may reflect societal biases present in the training data. |
| - The model is currently undertrained and may not always produce semantically accurate paraphrases. Continued training would likely improve its output quality. |
| |
| |