---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---

**Try this model on [Entropy Studio](https://getEntropy.ai)**
# Entropy v1 FP8 (Gemma 3 27B IT)
Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning.
This repo is intended for **efficient inference in vLLM** without runtime LoRA.
## What It Does
Given an AI-sounding passage, the model rewrites it to be:
- More human and textured (less generic "professional polish")
- More varied in rhythm/word choice
- Meaning-preserving (style change, not content change)
## Prompt Trigger (Recommended)
This is the prompt pattern used in our fine-tuning data; place the passage on its own line after the trigger.
```text
Polish this AI passage to feel more human:
{passage}
```
Short variants that usually work similarly:
- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`
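The trigger-plus-newline format can be wrapped in a small helper. A minimal sketch; `build_prompt` is an illustrative name, not part of the model's tooling:

```python
def build_prompt(passage: str,
                 trigger: str = "Polish this AI passage to feel more human:") -> str:
    """Format a rewrite request: trigger line, newline, then the passage."""
    return f"{trigger}\n{passage.strip()}"
```

Any of the short variants above can be passed as `trigger`.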
## How To Run (vLLM)
### 1) Start an OpenAI-compatible server
```bash
vllm serve ysong21/entropy-v1-fp8 \
  --served-model-name entropy-v1-fp8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192
```
Notes:
- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC). You do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.
### 2) Send a request
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-noop' \
  -d '{
    "model": "entropy-v1-fp8",
    "messages": [
      {
        "role": "user",
        "content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
  }'
```
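The same request can be sent from Python against the OpenAI-compatible endpoint. A stdlib-only sketch, assuming the server above is running locally; `humanize` and `humanize_payload` are illustrative helper names:

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1/chat/completions"

def humanize_payload(passage: str) -> dict:
    """Build the chat-completions payload for a rewrite request."""
    return {
        "model": "entropy-v1-fp8",
        "messages": [
            {"role": "user",
             "content": f"Polish this AI passage to feel more human:\n{passage}"}
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 512,
    }

def humanize(passage: str) -> str:
    """POST the payload to the vLLM server and return the rewritten text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(humanize_payload(passage)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer sk-noop"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(humanize("This is a highly polished paragraph that sounds generic..."))
```

Any OpenAI-compatible client (e.g. the official `openai` package pointed at `base_url="http://127.0.0.1:8000/v1"`) works the same way.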
## Validation Benchmark (70 Gutenberg Examples)
We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report **character-normalized bits_per_char**:
- Let `NLL` be the sum of token NLL over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.
This char-normalization makes the score more comparable across models/tokenizers than token-based perplexity.
Lower is better.
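The metric reduces to a one-liner. A minimal sketch, assuming `nll_nats` is the teacher-forced sum of token NLLs (in nats) over the target rewrite and `n_chars` its character count:

```python
import math

def bits_per_char(nll_nats: float, n_chars: int) -> float:
    """Character-normalized NLL, converted from nats to bits."""
    return (nll_nats / n_chars) / math.log(2)
```

For example, a 100-character target with a summed NLL of 25 nats scores `bits_per_char(25, 100)` ≈ 0.361.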
### Results
Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.
| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |
Interpretation:
- Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation set.
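The relative-improvement column follows directly from the bits_per_char values (percent reduction relative to the Unslopper baseline); a quick check of the table's arithmetic:

```python
def rel_improvement(baseline_bpc: float, model_bpc: float) -> float:
    """Percent reduction in bits_per_char vs the baseline (positive = better)."""
    return 100.0 * (baseline_bpc - model_bpc) / baseline_bpc

baseline = 0.37522  # N8Programs/Unslopper-30B-A3B-bf16
print(round(rel_improvement(baseline, 0.35994), 2))  # Entropy v1 FP8 -> 4.07
print(round(rel_improvement(baseline, 0.99565), 2))  # base gemma-3-27b-it -> -165.35
```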
## Quantization (Merged FP8_DYNAMIC)
This checkpoint is produced in two steps:
1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
- Targets: all `Linear` layers in the language model
- Weights: FP8, static per-channel scaling
- Activations: FP8, dynamic per-token scaling
- Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)
The model is saved in a vLLM-loadable **compressed-tensors** format.
Hardware notes (vLLM):
- Hopper/Ada/Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.
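The quantization step can be reproduced with a recipe along these lines. This is a sketch based on the public `llm-compressor` FP8_DYNAMIC examples, not the exact script used for this repo: entry points vary between `llm-compressor` versions, and the model path and ignore patterns below are assumptions matching the description above.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 weights (static, per-channel) + FP8 activations (dynamic, per-token),
# skipping lm_head and the vision tower, which stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(
    model="path/to/merged-gemma-3-27b-it",  # hypothetical path to the merged BF16 checkpoint
    recipe=recipe,
    output_dir="entropy-v1-fp8",
)
```

FP8_DYNAMIC needs no calibration data, so `oneshot` runs without a dataset argument.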
## Throughput (vLLM)
Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:
- Input length: 512 tokens
- Output length: 256 tokens
| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |
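The two columns are consistent with each other: "total tok/s" counts prefill (input) plus decode (output) tokens, so with 512-token inputs and 256-token outputs it should be roughly 3x the output rate. A quick sanity check:

```python
def total_tok_s(output_tok_s: float, input_len: int = 512, output_len: int = 256) -> float:
    """Approximate total (prefill + decode) throughput from output throughput."""
    return output_tok_s * (input_len + output_len) / output_len

print(round(total_tok_s(25.87), 2))   # 77.61, vs 77.51 in the table
print(round(total_tok_s(412.60), 2))  # 1237.8, vs 1236.20 in the table
```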
## Limitations / Misuse
- Trained primarily on literary/public-domain style passages; performance may vary on technical/legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.
## Citation
If you use this model in research, please cite:
```bibtex
@misc{entropy_v1_fp8,
  title  = {Entropy v1 FP8 (Gemma 3 27B IT)},
  author = {ysong21},
  year   = {2026},
  note   = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```