---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---

<div align="center" style="width: 600px">

</div>

**Try this model on [Entropy Studio](https://getEntropy.ai)**

# Entropy v1 FP8 (Gemma 3 27B IT)

Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning.

This repo is intended for **efficient inference in vLLM** without runtime LoRA.

## What It Does

Given an AI-sounding passage, the model rewrites it to be:

- More human and textured (less generic "professional polish")
- More varied in rhythm and word choice
- Meaning-preserving (a style change, not a content change)

## Prompt Trigger (Recommended)

This is the pattern used in our fine-tuning data. Place the passage on a new line after the trigger.

```text
Polish this AI passage to feel more human:
{passage}
```

Short variants that usually work similarly:

- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`

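For programmatic use, the trigger-plus-newline pattern can be wrapped in a small helper. A minimal sketch; the function name is ours, for illustration only:

```python
def build_rewrite_prompt(
    passage: str,
    trigger: str = "Polish this AI passage to feel more human:",
) -> str:
    """Return the fine-tuning prompt pattern: trigger, newline, passage."""
    return f"{trigger}\n{passage}"
```

Pass one of the short variants above as `trigger` to switch phrasings.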
## How To Run (vLLM)

### 1) Start an OpenAI-compatible server

```bash
vllm serve ysong21/entropy-v1-fp8 \
  --served-model-name entropy-v1-fp8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192
```

Notes:

- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC), so you do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.

### 2) Send a request

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-noop' \
  -d '{
    "model": "entropy-v1-fp8",
    "messages": [
      {
        "role": "user",
        "content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
  }'
```

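The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above (endpoint, model name, and sampling parameters match the server example; `sk-noop` is a placeholder key, since the example server does not check it):

```python
import json
import urllib.request


def build_payload(passage: str) -> dict:
    """Chat-completions payload matching the curl example."""
    return {
        "model": "entropy-v1-fp8",
        "messages": [{
            "role": "user",
            "content": f"Polish this AI passage to feel more human:\n{passage}",
        }],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 512,
    }


def rewrite(passage: str, url: str = "http://127.0.0.1:8000/v1/chat/completions") -> str:
    """POST to a running vLLM server and return the rewritten text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(passage)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-noop",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`rewrite(...)` requires the server from step 1 to be running; `build_payload(...)` can be inspected offline.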
## Validation Benchmark (70 Gutenberg Examples)

We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report a **character-normalized bits_per_char**:

- Let `NLL` be the sum of token NLLs (in nats) over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.

This character normalization makes the score more comparable across models and tokenizers than token-based perplexity. Lower is better.

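The metric follows directly from the definition above. A minimal sketch, assuming per-token NLLs (in nats) are already available from a teacher-forced pass:

```python
import math


def bits_per_char(token_nlls, num_chars: int) -> float:
    """Sum teacher-forced token NLLs (nats), normalize by the target's
    character count, then convert nats to bits by dividing by ln(2)."""
    nll = sum(token_nlls)
    return (nll / num_chars) / math.log(2)
```

As a sanity check, a target whose total NLL is `C * ln(2)` nats scores exactly 1.0 bit per character.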
### Results

Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.

| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |

Interpretation: Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation set.

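The relative-improvement column can be reproduced from the bits_per_char values: the baseline's score minus the system's, as a fraction of the baseline (helper name is ours):

```python
def relative_improvement_pct(system_bpc: float, baseline_bpc: float) -> float:
    """Percent improvement over the baseline; positive means a lower
    (better) bits_per_char than the baseline."""
    return (baseline_bpc - system_bpc) / baseline_bpc * 100.0


BASELINE = 0.37522  # N8Programs/Unslopper-30B-A3B-bf16
# relative_improvement_pct(0.35994, BASELINE) -> ~ +4.07  (Entropy v1 FP8)
# relative_improvement_pct(0.99565, BASELINE) -> ~ -165.35 (base Gemma 3 27B IT)
```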
## Quantization (Merged FP8_DYNAMIC)

This checkpoint is produced in two steps:

1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
   - Targets: all `Linear` layers in the language model
   - Weights: FP8, static per-channel scaling
   - Activations: FP8, dynamic per-token scaling
   - Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)

The model is saved in a vLLM-loadable **compressed-tensors** format.

Hardware notes (vLLM):

- Hopper-, Ada-, and Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.

## Throughput (vLLM)

Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:

- Input length: 512 tokens
- Output length: 256 tokens

| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |

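A quick consistency check on the table: with fixed 512-token inputs and 256-token outputs, total throughput should be roughly output throughput × (512 + 256) / 256 = 3×, which the measured numbers match closely:

```python
def expected_total_toks(output_toks_per_s: float,
                        input_len: int = 512,
                        output_len: int = 256) -> float:
    """Total tok/s implied by output tok/s for fixed-length requests."""
    return output_toks_per_s * (input_len + output_len) / output_len

# expected_total_toks(25.87)  -> ~77.61  vs measured 77.51
# expected_total_toks(412.60) -> ~1237.8 vs measured 1236.20
```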
## Limitations / Misuse

- Trained primarily on literary/public-domain style passages; performance may vary on technical or legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.

## Citation

If you use this model in research, please cite:

```bibtex
@misc{entropy_v1_fp8,
  title  = {Entropy v1 FP8 (Gemma 3 27B IT)},
  author = {ysong21},
  year   = {2026},
  note   = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```