Staticaliza committed · Commit 02addc8 · verified · 1 Parent(s): e64055b

Update README.md

Files changed (1): README.md (+9 −6)
README.md CHANGED
````diff
@@ -3,22 +3,26 @@ license: apache-2.0
 pipeline_tag: text-generation
 ---
 
-# Reasoning Reya: The LLM
+# Reasoning Reya: Information
 
 <img src="https://huggingface.co/Statical-Workspace/Storage/resolve/main/DepressedGirl.png" alt="icon" height="500" width="500">
 
-An low restriction reasoning creative roleplay and conversation model based on [ArliAI/QwQ-32B-ArliAI-RpR-v4](https://huggingface.co/ArliAI/QwQ-32B-ArliAI-RpR-v4) and [huihui-ai/DeepSeek-R1-0528-Qwen3-8B-abliterated](https://huggingface.co/huihui-ai/DeepSeek-R1-0528-Qwen3-8B-abliterated).
+This is a low-restriction, creative roleplay and conversational reasoning model based on [ArliAI/QwQ-32B-ArliAI-RpR-v4](https://huggingface.co/ArliAI/QwQ-32B-ArliAI-RpR-v4) and [huihui-ai/DeepSeek-R1-0528-Qwen3-8B-abliterated](https://huggingface.co/huihui-ai/DeepSeek-R1-0528-Qwen3-8B-abliterated).
 
-This is a distilled and quantized 4 bit GPTQ model (W4A16) that can run on GPU.
+I have distilled and quantized the model to 4-bit with GPTQ (W4A16), meaning it can run on most GPUs.
 
-# vLLM Use
+Established by Staticaliza.
+
+# vLLM: Usage Instructions
 
 ```python
 from huggingface_hub import snapshot_download
 from vllm import LLM, SamplingParams
+import torch  # needed for torch.cuda.device_count() below
 
 # Consider toggling "enforce_eager" to False if you want to load the model quicker, at the expense of tokens per second.
-llm = LLM(model=snapshot_download(repo_id="Staticaliza/Reya-Reasoning-8B-Distilled-GPTQ-Int4", allow_patterns=["*.json", "*.bin", "*.safetensors"]), dtype="auto", tensor_parallel_size=torch.cuda.device_count(), enforce_eager=True, trust_remote_code=True)
+repo = snapshot_download(repo_id="Staticaliza/Reya-Reasoning-8B-Distilled-GPTQ-Int4", allow_patterns=["*.json", "*.bin", "*.safetensors"])
+llm = LLM(model=repo, dtype="auto", tensor_parallel_size=torch.cuda.device_count(), enforce_eager=True, trust_remote_code=True)
 
 params = SamplingParams(
 max_tokens=256,
@@ -33,6 +37,6 @@ params = SamplingParams(
 seed=42,
 )
 
-result = model.generate(input, params)[0].outputs[0].text
+result = llm.generate(["Hello, Reya!"], params)[0].outputs[0].text
 print(result)
 ```
````
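
As a side note on the W4A16 scheme named in the updated description: weights are stored as 4-bit integers while activations (and compute) stay in 16-bit floats. A minimal sketch of symmetric 4-bit weight quantization with a single per-tensor scale — real GPTQ uses per-group scales and error compensation, so this is only illustrative:

```python
import numpy as np

def quantize_w4(w: np.ndarray):
    """Quantize float weights to signed 4-bit codes with one scale (sketch only)."""
    scale = np.abs(w).max() / 7.0  # map the largest weight to the int4 edge (-8..7 range)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Compute happens in 16-bit floats: this is the "A16" half of W4A16.
    return q.astype(np.float16) * np.float16(scale)

w = np.array([0.12, -0.5, 0.33, 0.7], dtype=np.float32)
q, s = quantize_w4(w)
w_hat = dequantize(q, s)
print(q)      # 4-bit integer codes, e.g. [ 1 -5  3  7]
print(w_hat)  # approximate reconstruction of w
```

The point of the sketch: storage drops to roughly a quarter of FP16, at the cost of a small per-weight rounding error.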
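
Both parent models are DeepSeek-R1-style reasoners, so generations typically wrap the chain of thought in `<think>…</think>` tags before the final answer. That is an assumption about this distill — verify against the model's chat template — but if it holds, a small helper can separate reasoning from answer:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split R1-style output into (reasoning, answer); tolerate a missing tag pair."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()  # no reasoning block found
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

sample = "<think>The user greeted me.</think>Hello! How can I help?"
r, a = split_reasoning(sample)
print(r)  # The user greeted me.
print(a)  # Hello! How can I help?
```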