---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---

<div align="center" style="width: 600px">

</div>

**Try this model on [Entropy Studio](https://getEntropy.ai)**

# Entropy v1 FP8 (Gemma 3 27B IT)

Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning.

This repo is intended for **efficient inference in vLLM** without runtime LoRA.

## What It Does

Given an AI-sounding passage, the model rewrites it to be:

- More human and textured (less generic "professional polish")
- More varied in rhythm and word choice
- Meaning-preserving (a style change, not a content change)

## Prompt Trigger (Recommended)

This is the pattern used in our fine-tuning data. Place the passage on a new line after the trigger.

```text
Polish this AI passage to feel more human:
{passage}
```

Short variants that usually work similarly:

- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`

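For programmatic use, the trigger-plus-newline pattern can be wrapped in a small helper. A minimal sketch; the function name is ours, for illustration only:

```python
def build_rewrite_prompt(
    passage: str,
    trigger: str = "Polish this AI passage to feel more human:",
) -> str:
    """Return the fine-tuning prompt pattern: trigger, newline, passage."""
    return f"{trigger}\n{passage}"
```

Pass one of the short variants above as `trigger` to switch phrasings.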
## How To Run (vLLM)

### 1) Start an OpenAI-compatible server

```bash
vllm serve ysong21/entropy-v1-fp8 \
  --served-model-name entropy-v1-fp8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192
```

Notes:

- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC), so you do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.

### 2) Send a request

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-noop' \
  -d '{
    "model": "entropy-v1-fp8",
    "messages": [
      {
        "role": "user",
        "content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
  }'
```

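The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above (endpoint, model name, and sampling parameters match the server example; `sk-noop` is a placeholder key, since the example server does not check it):

```python
import json
import urllib.request


def build_payload(passage: str) -> dict:
    """Chat-completions payload matching the curl example."""
    return {
        "model": "entropy-v1-fp8",
        "messages": [{
            "role": "user",
            "content": f"Polish this AI passage to feel more human:\n{passage}",
        }],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 512,
    }


def rewrite(passage: str, url: str = "http://127.0.0.1:8000/v1/chat/completions") -> str:
    """POST to a running vLLM server and return the rewritten text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(passage)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-noop",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`rewrite(...)` requires the server from step 1 to be running; `build_payload(...)` can be inspected offline.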
## Validation Benchmark (70 Gutenberg Examples)

We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report a **character-normalized bits_per_char**:

- Let `NLL` be the sum of token NLLs (in nats) over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.

This character normalization makes the score more comparable across models and tokenizers than token-based perplexity. Lower is better.

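The metric follows directly from the definition above. A minimal sketch, assuming per-token NLLs (in nats) are already available from a teacher-forced pass:

```python
import math


def bits_per_char(token_nlls, num_chars: int) -> float:
    """Sum teacher-forced token NLLs (nats), normalize by the target's
    character count, then convert nats to bits by dividing by ln(2)."""
    nll = sum(token_nlls)
    return (nll / num_chars) / math.log(2)
```

As a sanity check, a target whose total NLL is `C * ln(2)` nats scores exactly 1.0 bit per character.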
### Results

Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.

| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |

Interpretation: Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation set.

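The relative-improvement column can be reproduced from the bits_per_char values: the baseline's score minus the system's, as a fraction of the baseline (helper name is ours):

```python
def relative_improvement_pct(system_bpc: float, baseline_bpc: float) -> float:
    """Percent improvement over the baseline; positive means a lower
    (better) bits_per_char than the baseline."""
    return (baseline_bpc - system_bpc) / baseline_bpc * 100.0


BASELINE = 0.37522  # N8Programs/Unslopper-30B-A3B-bf16
# relative_improvement_pct(0.35994, BASELINE) -> ~ +4.07  (Entropy v1 FP8)
# relative_improvement_pct(0.99565, BASELINE) -> ~ -165.35 (base Gemma 3 27B IT)
```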
## Quantization (Merged FP8_DYNAMIC)

This checkpoint is produced in two steps:

1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
   - Targets: all `Linear` layers in the language model
   - Weights: FP8, static per-channel scaling
   - Activations: FP8, dynamic per-token scaling
   - Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)

The model is saved in a vLLM-loadable **compressed-tensors** format.

Hardware notes (vLLM):

- Hopper-, Ada-, and Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.

## Throughput (vLLM)

Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:

- Input length: 512 tokens
- Output length: 256 tokens

| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |

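A quick consistency check on the table: with fixed 512-token inputs and 256-token outputs, total throughput should be roughly output throughput × (512 + 256) / 256 = 3×, which the measured numbers match closely:

```python
def expected_total_toks(output_toks_per_s: float,
                        input_len: int = 512,
                        output_len: int = 256) -> float:
    """Total tok/s implied by output tok/s for fixed-length requests."""
    return output_toks_per_s * (input_len + output_len) / output_len

# expected_total_toks(25.87)  -> ~77.61  vs measured 77.51
# expected_total_toks(412.60) -> ~1237.8 vs measured 1236.20
```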
## Limitations / Misuse

- Trained primarily on literary/public-domain style passages; performance may vary on technical or legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.

## Citation

If you use this model in research, please cite:

```bibtex
@misc{entropy_v1_fp8,
  title  = {Entropy v1 FP8 (Gemma 3 27B IT)},
  author = {ysong21},
  year   = {2026},
  note   = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```