---
base_model: google/gemma-3-27b-it
library_name: vllm
pipeline_tag: text-generation
tags:
- gemma
- gemma3
- text-rewrite
- fp8
- quantized
- compressed-tensors
- vllm
license: gemma
datasets:
- N8Programs/unslop-good
---
![entropy-v1-fp8](https://cdn-uploads.huggingface.co/production/uploads/664e6989d6d71e3dad304d71/3QogW-Y_ts5jAXs16MS--.jpeg)
**Try this model on [Entropy Studio](https://getEntropy.ai)**

# Entropy v1 FP8 (Gemma 3 27B IT)

Entropy v1 FP8 is a **merged + FP8-quantized** checkpoint based on `google/gemma-3-27b-it`, fine-tuned to rewrite AI-polished text into more human-sounding prose while preserving meaning. This repo is intended for **efficient inference in vLLM** without runtime LoRA.

## What It Does

Given an AI-sounding passage, the model rewrites it to be:

- More human and textured (less generic "professional polish")
- More varied in rhythm and word choice
- Meaning-preserving (a style change, not a content change)

## Prompt Trigger (Recommended)

This is the pattern used in our fine-tuning data. Put the passage on its own line after the trigger.

```text
Polish this AI passage to feel more human:
{passage}
```

Short variants that usually work similarly:

- `Rephrase this AI passage to feel more human:\n{passage}`
- `Convert this AI passage into a more human-sounding version:\n{passage}`

## How To Run (vLLM)

### 1) Start an OpenAI-compatible server

```bash
vllm serve ysong21/entropy-v1-fp8 \
  --served-model-name entropy-v1-fp8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192
```

Notes:

- This checkpoint is already quantized (compressed-tensors FP8_DYNAMIC). You do not need to pass `--quantization fp8`.
- FP8 execution is hardware-dependent; see "Quantization" below.

### 2) Send a request

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-noop' \
  -d '{
    "model": "entropy-v1-fp8",
    "messages": [
      {
        "role": "user",
        "content": "Polish this AI passage to feel more human:\nThis is a highly polished paragraph that sounds generic and overly smooth..."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
  }'
```

## Validation Benchmark (70 Gutenberg Examples)

We evaluate by computing the **conditional negative log-likelihood** of the target (human) rewrite given the prompt, and report **character-normalized `bits_per_char`**:

- Let `NLL` be the sum of token NLLs over the target rewrite (teacher-forced).
- Let `C` be the number of characters in the target rewrite.
- `bits_per_char = (NLL / C) / ln(2)`.

This character normalization makes the score more comparable across models and tokenizers than token-based perplexity. Lower is better.

### Results

Baseline for relative improvement: `N8Programs/Unslopper-30B-A3B-bf16`.

| System | bits_per_char (↓) | Relative improvement vs Unslopper (↑) |
|---|---:|---:|
| **Entropy v1 FP8 (this repo)** | **0.35994** | **+4.07%** |
| N8Programs/Unslopper-30B-A3B-bf16 | 0.37522 | +0.00% |
| Base google/gemma-3-27b-it | 0.99565 | -165.35% |

Interpretation:

- Entropy v1 FP8 achieves the best bits_per_char on this 70-example Gutenberg validation set.

## Quantization (Merged FP8_DYNAMIC)

This checkpoint is produced in two steps:

1. **Merge**: a PEFT LoRA adapter is merged into the base Gemma 3 27B IT weights (no runtime LoRA).
2. **Quantize**: we apply **FP8_DYNAMIC (W8A8)** quantization with `llm-compressor`:
   - Targets: all `Linear` layers in the language model
   - Weights: FP8, static per-channel scaling
   - Activations: FP8, dynamic per-token scaling
   - Ignored: `lm_head` and the Gemma 3 vision tower (left in BF16)

The model is saved in a vLLM-loadable **compressed-tensors** format.

Hardware notes (vLLM):

- Hopper/Ada/Blackwell-class NVIDIA GPUs can execute FP8 efficiently.
- Other GPUs may fall back to less optimized modes.
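To make the scaling scheme above concrete, here is a minimal, self-contained Python sketch of the two scale granularities used by FP8_DYNAMIC: static per-channel scales for weights and dynamic per-token scales for activations. This is a toy illustration, not the `llm-compressor` implementation — it clamps to the FP8 E4M3 range (±448) but skips rounding values onto the actual FP8 grid.

```python
# Toy illustration of FP8_DYNAMIC (W8A8) scale computation.
# Weights get one static scale per output channel; activations get one
# dynamic scale per token (row), recomputed at runtime.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_token_scales(activations):
    """Dynamic per-token scaling: one scale per row (token)."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in activations]


def per_channel_scales(weights):
    """Static per-channel scaling: one scale per output channel (row)."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in weights]


def fake_quantize(row, scale):
    """Quantize-dequantize a row: divide by scale, clamp to the FP8 range,
    then rescale. (Real FP8 also rounds to the E4M3 grid; omitted here.)"""
    if scale == 0.0:
        return [0.0 for _ in row]
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) * scale for v in row]


# Two "tokens" with very different magnitudes each get their own scale —
# which is exactly why dynamic per-token scaling helps for activations.
acts = [[0.01, -0.02, 0.005], [120.0, -300.0, 45.0]]
scales = per_token_scales(acts)
deq = [fake_quantize(row, s) for row, s in zip(acts, scales)]
```

Because each row's scale is derived from that row's own max absolute value, the small-magnitude token is not crushed by the large-magnitude one; a single shared (per-tensor) scale would sacrifice precision on the first row.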
## Throughput (vLLM)

Measured on a single NVIDIA RTX PRO 6000 Blackwell 96GB using `vllm/vllm-openai:v0.11.2` with random prompts:

- Input length: 512 tokens
- Output length: 256 tokens

| max_concurrency | output tok/s | total tok/s |
|---:|---:|---:|
| 1 | 25.87 | 77.51 |
| 20 | 412.60 | 1236.20 |

## Limitations / Misuse

- Trained primarily on literary/public-domain style passages; performance may vary on technical or legal writing.
- Like other "humanizer" models, it can be misused for deceptive purposes. Use responsibly and follow applicable policies and disclosure norms.

## Citation

If you use this model in research, please cite:

```bibtex
@misc{entropy_v1_fp8,
  title  = {Entropy v1 FP8 (Gemma 3 27B IT)},
  author = {ysong21},
  year   = {2026},
  note   = {Merged + FP8_DYNAMIC quantized checkpoint for AI-to-human rewriting.}
}
```