---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---
**Try this model on [Entropy Studio](https://getEntropy.ai)**
# Entropy v1 FP8 (Gemma 3 27B IT)
Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning.
This repo is intended for **efficient inference in vLLM** without runtime LoRA.
## What It Does
Given an AI-sounding passage, the model rewrites it to be:
- More human and textured (less generic "professional polish")
- More varied in rhythm/word choice
- Meaning-preserving (style change, not content change)
## Prompt Trigger (Recommended)
This is the prompt pattern used in our fine-tuning data; place the passage on a new line after the instruction.
```text
Polish this AI passage to feel more human:
{passage}
```
Short variants that usually work similarly:
- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`
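For programmatic use, the trigger can be wrapped in a small helper. A minimal sketch (the function name is ours, not part of the model or its tooling):

```python
def build_rewrite_prompt(passage: str) -> str:
    """Wrap a passage in the fine-tuning prompt trigger.

    The instruction line matches the training data; the passage
    follows on a new line, as recommended above.
    """
    return f"Polish this AI passage to feel more human:\n{passage}"

print(build_rewrite_prompt("The quick brown fox jumps over the lazy dog."))
```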
## How To Run (vLLM)
### 1) Start an OpenAI-compatible server
```bash
vllm serve ysong21/entropy-v1-fp8 \
--served-model-name entropy-v1-fp8 \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
```
Notes:
- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC). You do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.
### 2) Send a request
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-noop' \
-d '{
"model": "entropy-v1-fp8",
"messages": [
{
"role": "user",
"content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 512
}'
```
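The same request can be issued from Python. A stdlib-only sketch that builds the JSON body used above (helper and variable names are ours; sending is left as a comment since it requires the running server):

```python
import json

def chat_payload(passage: str, temperature: float = 0.7,
                 top_p: float = 0.95, max_tokens: int = 512) -> dict:
    """Build the JSON body for the /v1/chat/completions request above."""
    return {
        "model": "entropy-v1-fp8",
        "messages": [
            {
                "role": "user",
                "content": f"Polish this AI passage to feel more human:\n{passage}",
            },
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }

body = json.dumps(chat_payload("This is a highly polished paragraph..."))
# POST `body` to http://127.0.0.1:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request or any
# OpenAI-compatible client pointed at the vLLM server).
```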
## Validation Benchmark (70 Gutenberg Examples)
We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report **character-normalized bits_per_char**:
- Let `NLL` be the sum of token NLL over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.
This char-normalization makes the score more comparable across models/tokenizers than token-based perplexity.
Lower is better.
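The metric can be computed directly from per-token NLLs. A sketch (function and variable names are ours):

```python
import math

def bits_per_char(token_nlls: list[float], target_text: str) -> float:
    """Character-normalized bits per character.

    token_nlls: per-token negative log-likelihoods (in nats) of the
    target rewrite under teacher forcing.
    target_text: the target rewrite whose characters normalize the score.
    """
    nll = sum(token_nlls)           # total NLL over the target, in nats
    c = len(target_text)            # character count of the target
    return (nll / c) / math.log(2)  # nats/char -> bits/char
```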
### Results
Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.
| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |
Interpretation:
- Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation study.
## Quantization (Merged FP8_DYNAMIC)
This checkpoint is produced in two steps:
1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
- Targets: all `Linear` layers in the language model
- Weights: FP8, static per-channel scaling
- Activations: FP8, dynamic per-token scaling
- Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)
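The two scaling schemes can be illustrated with plain Python. This is a numerical sketch of how the scales are derived, not the `llm-compressor` implementation; the FP8 E4M3 maximum of 448.0 comes from the format, and the helper names are ours:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(weight_rows: list[list[float]]) -> list[float]:
    """Static per-channel scales: one scale per output channel (weight row),
    computed once at quantization time."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight_rows]

def per_token_activation_scale(token_activations: list[float]) -> float:
    """Dynamic per-token scale: computed at runtime from each token's
    activation vector, so no calibration data is needed."""
    return max(abs(a) for a in token_activations) / FP8_E4M3_MAX
```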
The model is saved in a vLLM-loadable **compressed-tensors** format.
Hardware notes (vLLM):
- Hopper/Ada/Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.
## Throughput (vLLM)
Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:
- Input length: 512 tokens
- Output length: 256 tokens
| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |
## Limitations / Misuse
- Trained primarily on literary/public-domain style passages; performance may vary on technical/legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.
## Citation
If you use this model in research, please cite:
```bibtex
@misc{entropy_v1_fp8,
title = {Entropy v1 FP8 (Gemma 3 27B IT)},
author = {ysong21},
year = {2026},
note = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```