	---
	license: apache-2.0
	---
	# dLLM-Var

	## Model Description

	This model is a fine-tuned version of the LLaDA 8B Base model, obtained through a specialized Supervised Fine-Tuning (SFT) process. It drops the complex attention mask design typically associated with block diffusion and keeps full attention. At inference time the model still generates in a block diffusion style: it uses a KV cache to speed up generation and emits an EOS token when the response is complete, so generation stops at the end of the answer.

	Key innovations:
	- Full Attention Preservation: Maintains standard full attention without the overhead of intricate masking.
	- Block Diffusion Inference: Enables iterative block-wise generation via KV cache management, ensuring coherent and controlled outputs.
	- EOS Handling: Trained to naturally emit EOS tokens at response boundaries.

	This approach balances computational efficiency with high-quality generation, making it suitable for tasks requiring structured, multi-step reasoning.

	## Usage

	To load and use this model with Hugging Face Transformers:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_name = "maomaocun/dLLM-Var"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to("cuda")

	# Use the chat template
	messages = [
	    {"role": "user", "content": "Can you tell me an engaging short story about a brave young astronaut who discovers an ancient alien civilization on a distant planet? Make it adventurous and heartwarming, with a twist at the end."}
	]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	input_ids = inputs['input_ids']
	attention_mask = inputs.get('attention_mask', torch.ones_like(input_ids))
	result = model.generate(
	    input_ids=input_ids,
	    attention_mask=attention_mask,
	    max_gen_length=1024,
	    block_length=64,
	    threshold=0.9,
	    streaming=True,
	    eos_token_id=126348,
	)
	text = tokenizer.batch_decode(result, skip_special_tokens=True)
	print(text)
	```

	For block diffusion-style inference, customize the generation loop to manage KV cache and block outputs as needed.
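
As a rough picture of what such a loop might look like, here is a toy sketch in plain Python. It is not the model's actual `generate` implementation. `MASK`, `block_decode`, and `toy_predict` are hypothetical names, and `toy_predict` stands in for a real forward pass. The parameter names mirror the keyword arguments in the usage example, but their semantics here (a confidence threshold for committing tokens inside a block) are an assumption. A real loop would also reuse the KV cache for the prompt and completed blocks instead of re-scoring the whole sequence on every step.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical placeholder id for a position that is not decoded yet

# A predictor returns one (token, confidence) pair per position of the sequence.
Predictor = Callable[[List[int]], List[Tuple[int, float]]]


def block_decode(
    prompt: List[int],
    predict: Predictor,
    max_gen_length: int,
    block_length: int,
    threshold: float,
    eos_token_id: int,
) -> List[int]:
    """Toy block-wise decoding: fill one block at a time and stop at EOS."""
    seq = list(prompt)
    generated = 0
    while generated < max_gen_length:
        size = min(block_length, max_gen_length - generated)
        start = len(seq)
        seq.extend([MASK] * size)  # append a fully masked block
        # Refine the current block until no masked position remains.
        while any(tok == MASK for tok in seq[start:]):
            preds = predict(seq)
            masked = [i for i in range(start, len(seq)) if seq[i] == MASK]
            # Commit every masked position whose confidence clears the threshold.
            accept = [i for i in masked if preds[i][1] >= threshold]
            if not accept:
                # Commit at least the most confident position so the loop terminates.
                accept = [max(masked, key=lambda i: preds[i][1])]
            for i in accept:
                seq[i] = preds[i][0]
        generated += size
        # Once a finished block contains EOS, drop everything after it and stop.
        block = seq[start:]
        if eos_token_id in block:
            return seq[: start + block.index(eos_token_id) + 1]
    return seq


def toy_predict(seq: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for a model: position i predicts token i, or EOS (99) at i == 5."""
    return [(99 if i == 5 else i, 0.95 if i % 2 == 0 else 0.5) for i in range(len(seq))]


result = block_decode([0, 1], toy_predict, max_gen_length=8,
                      block_length=2, threshold=0.9, eos_token_id=99)
print(result)  # [0, 1, 2, 3, 4, 99]
```

In this toy, a block is only checked for EOS after it has been completely filled, which keeps the control flow simple; whether the real model stops earlier inside a block is not specified here.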

	## Benchmarks

	The following table compares performance across key evaluation benchmarks. Results are reported as accuracy percentages where applicable.

	| Model | GSM8K | GPQA | BBH | MATH | HumanEval | MBPP | MMLU-Generate |
	|--------------------------------|-------|-------|-------|-------|-----------|-------|---------------|
	| LLaDA 8B Base in Pure Diffusion | 69.06 | 31.91 | 44.77 | 30.84 | 32.92 | 40.80 | 65.9 |
	| LLaDA 8B Instruct in Semi-AR Diffusion | 77.48 | 29.01 | 51.49 | 22.32 | 38.71 | 39.20 | 65.5 |
	| dLLM-Var Block Diffusion | 77.40 | 33.03 | 48.74 | 31.94 | 40.24 | 42.00 | 65.53 |

	dLLM-Var posts the highest score of the three models on GPQA, MATH, HumanEval, and MBPP. It trails LLaDA 8B Instruct on BBH (48.74 vs. 51.49) and, marginally, on GSM8K (77.40 vs. 77.48), and all three models score within half a point of each other on MMLU-Generate.
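
To check the comparison by hand, the short script below recomputes the per-benchmark leader and the gap between two models. The scores are copied from the benchmark table; the shortened model labels and the helper names `best_per_benchmark` and `delta` are made up for this illustration.

```python
# Scores copied from the benchmark table (accuracy, %), in column order.
BENCHMARKS = ["GSM8K", "GPQA", "BBH", "MATH", "HumanEval", "MBPP", "MMLU-Generate"]
SCORES = {
    "LLaDA 8B Base": [69.06, 31.91, 44.77, 30.84, 32.92, 40.80, 65.9],
    "LLaDA 8B Instruct": [77.48, 29.01, 51.49, 22.32, 38.71, 39.20, 65.5],
    "dLLM-Var": [77.40, 33.03, 48.74, 31.94, 40.24, 42.00, 65.53],
}


def best_per_benchmark(scores, benchmarks):
    """Return {benchmark: name of the model with the highest score}."""
    return {
        name: max(scores, key=lambda model: scores[model][idx])
        for idx, name in enumerate(benchmarks)
    }


def delta(scores, benchmarks, model, baseline):
    """Return {benchmark: model score minus baseline score}, rounded to 2 decimals."""
    return {
        name: round(scores[model][idx] - scores[baseline][idx], 2)
        for idx, name in enumerate(benchmarks)
    }


leaders = best_per_benchmark(SCORES, BENCHMARKS)
gaps = delta(SCORES, BENCHMARKS, "dLLM-Var", "LLaDA 8B Instruct")
for name in BENCHMARKS:
    print(f"{name:14s} leader: {leaders[name]:18s} dLLM-Var vs Instruct: {gaps[name]:+.2f}")
```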