---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---
**Try this model on [Entropy Studio](https://getEntropy.ai)**
# Entropy v1 FP8 (Gemma 3 27B IT)
Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning.
This repo is intended for **efficient inference in vLLM** without runtime LoRA.
## What It Does
Given an AI-sounding passage, the model rewrites it to be:
- More human and textured (less generic "professional polish")
- More varied in rhythm/word choice
- Meaning-preserving (style change, not content change)
## Prompt Trigger (Recommended)
This is the prompt pattern used in our fine-tuning data; place the passage on a new line after the instruction.
```text
Polish this AI passage to feel more human:
{passage}
```
Short variants that usually work similarly:
- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`
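For programmatic use, the trigger can be wrapped in a small helper. A minimal sketch (the function name is ours, not part of the model or its tooling):

```python
def build_rewrite_prompt(passage: str) -> str:
    """Wrap a passage in the fine-tuning prompt trigger.

    The instruction line matches the training data; the passage
    follows on a new line, as recommended above.
    """
    return f"Polish this AI passage to feel more human:\n{passage}"

print(build_rewrite_prompt("The quick brown fox jumps over the lazy dog."))
```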
## How To Run (vLLM)
### 1) Start an OpenAI-compatible server
```bash
vllm serve ysong21/entropy-v1-fp8 \
--served-model-name entropy-v1-fp8 \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
```
Notes:
- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC). You do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.
### 2) Send a request
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-noop' \
-d '{
"model": "entropy-v1-fp8",
"messages": [
{
"role": "user",
"content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 512
}'
```
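The same request can be issued from Python. A stdlib-only sketch that builds the JSON body used above (helper and variable names are ours; sending is left as a comment since it requires the running server):

```python
import json

def chat_payload(passage: str, temperature: float = 0.7,
                 top_p: float = 0.95, max_tokens: int = 512) -> dict:
    """Build the JSON body for the /v1/chat/completions request above."""
    return {
        "model": "entropy-v1-fp8",
        "messages": [
            {
                "role": "user",
                "content": f"Polish this AI passage to feel more human:\n{passage}",
            },
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }

body = json.dumps(chat_payload("This is a highly polished paragraph..."))
# POST `body` to http://127.0.0.1:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request or any
# OpenAI-compatible client pointed at the vLLM server).
```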
## Validation Benchmark (70 Gutenberg Examples)
We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report **character-normalized bits_per_char**:
- Let `NLL` be the sum of token NLL over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.
This char-normalization makes the score more comparable across models/tokenizers than token-based perplexity.
Lower is better.
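The metric can be computed directly from per-token NLLs. A sketch (function and variable names are ours):

```python
import math

def bits_per_char(token_nlls: list[float], target_text: str) -> float:
    """Character-normalized bits per character.

    token_nlls: per-token negative log-likelihoods (in nats) of the
    target rewrite under teacher forcing.
    target_text: the target rewrite whose characters normalize the score.
    """
    nll = sum(token_nlls)           # total NLL over the target, in nats
    c = len(target_text)            # character count of the target
    return (nll / c) / math.log(2)  # nats/char -> bits/char
```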
### Results
Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.
| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |
Interpretation:
- Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation study.
## Quantization (Merged FP8_DYNAMIC)
This checkpoint is produced in two steps:
1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
- Targets: all `Linear` layers in the language model
- Weights: FP8, static per-channel scaling
- Activations: FP8, dynamic per-token scaling
- Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)
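The two scaling schemes can be illustrated with plain Python. This is a numerical sketch of how the scales are derived, not the `llm-compressor` implementation; the FP8 E4M3 maximum of 448.0 comes from the format, and the helper names are ours:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(weight_rows: list[list[float]]) -> list[float]:
    """Static per-channel scales: one scale per output channel (weight row),
    computed once at quantization time."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight_rows]

def per_token_activation_scale(token_activations: list[float]) -> float:
    """Dynamic per-token scale: computed at runtime from each token's
    activation vector, so no calibration data is needed."""
    return max(abs(a) for a in token_activations) / FP8_E4M3_MAX
```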
The model is saved in a vLLM-loadable **compressed-tensors** format.
Hardware notes (vLLM):
- Hopper/Ada/Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.
## Throughput (vLLM)
Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:
- Input length: 512 tokens
- Output length: 256 tokens
| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |
## Limitations / Misuse
- Trained primarily on literary/public-domain style passages; performance may vary on technical/legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.
## Citation
If you use this model in research, please cite:
```bibtex
@misc{entropy_v1_fp8,
title = {Entropy v1 FP8 (Gemma 3 27B IT)},
author = {ysong21},
year = {2026},
note = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```