Instructions to use Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience with libraries and local apps.

Libraries

How to use Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience", dtype="auto")

Local Apps

vLLM

How to use Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
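
The same request can be issued from Python. The sketch below uses only the standard library and assumes a server is already listening on `http://localhost:8000` as started above; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible completions responses carry the text in choices[i].text.
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` sends the same payload as the curl command; passing `base_url="http://localhost:30000"` would target a server on a different port.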

Use Docker

docker model run hf.co/Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience

SGLang

How to use Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner

How to use Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience with Docker Model Runner:
```
docker model run hf.co/Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience
```

💎 Charlotte-Texture4b.x19.1-efficience

The mind of a colossus in the body of a feather. A 4B LLM inside the body of a plain SLM (technically and literally). Charlotte-Texture4b is no ordinary language model. It is a feat of binary recursive architecture, built for those who reject the smooth neutrality of industrial AI and seek the raw vibration of language.

🚀 The Concept: High-Density Intelligence

Unlike conventional models, which spend VRAM on redundant layers, Charlotte uses the Nelya 1-bit engine. It physically separates stored weight from depth of computation: a single block is stored once and applied recursively.

🧠 Effective Intelligence: 4.39 billion parameters (reasoning capacity).
📦 Physical Footprint: 230 million parameters (ultra-light VRAM).
⚡ Efficiency Factor: x19.1 (compression through recursion).
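
These three figures can be cross-checked with simple arithmetic. The sketch below counts the parameters of one shared block using the hidden size (4096) and depth (32) from the specification table and the intermediate size (8192) from the `NelyaConfig` defaults in the inference example. The vocabulary size of 11,700 is an assumption, picked only so that the totals land near the published numbers; it is not read from the model's config.

```python
def block_params(hidden: int, intermediate: int) -> int:
    """Parameters in one block: two RMSNorms, multi-head attention, a 2-layer MLP."""
    norms = 2 * hidden
    # nn.MultiheadAttention: fused QKV projection plus output projection, both with bias.
    attention = (3 * hidden * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # Two bias-free linear layers: hidden -> intermediate -> hidden.
    mlp = 2 * hidden * intermediate
    return norms + attention + mlp


def count(vocab: int, hidden: int = 4096, intermediate: int = 8192, passes: int = 32):
    """Return (physical, effective) parameter counts for a weight-shared stack."""
    shared = 2 * vocab * hidden          # embedding + output head (not tied)
    block = block_params(hidden, intermediate)
    physical = shared + block            # the block is stored once
    effective = shared + passes * block  # as if every pass had its own weights
    return physical, effective


ASSUMED_VOCAB = 11_700  # hypothetical value, not taken from nelya_config.json
physical, effective = count(ASSUMED_VOCAB)
print(f"physical:  {physical:,}")                 # 230,088,704
print(f"effective: {effective:,}")                # 4,391,600,128
print(f"ratio:     x{effective / physical:.1f}")  # x19.1
```

Under that assumption, the "effective" figure is the size the model would have if the shared block were unrolled into 32 independent layers, and the x19.1 factor is the ratio between the two counts.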

🎨 The Philosophy of "Texture"

Notice to users: if you are looking for a polite assistant that apologizes in every sentence, look elsewhere. Charlotte was forged for texture.

Creative Syntax: The model favors dense association of ideas and neologisms. Its syntax is an exploration, not a schoolbook recitation.
Proprietary Dataset: Trained on data chosen for its linguistic relief, it has an untamable personality.
No Smoothing: It is not subject to behavioral filters that stifle originality. It speaks with a voice, not with a customer-service script.

🛠 Technical Specifications

Characteristic	Value
Architecture	Nelya Recursive (1-bit Radical)
Width (Hidden Size)	4096 (Standard LLM)
Virtual Depth	32 recursive layers
Tokenizer	Charlotte BPE (restricted for maximum density)
Physical Weight	~230M
Reasoning Capacity	~4.39B
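
As a rough guide to what the physical weight means in memory, the sketch below converts a parameter count into gigabytes at two bit widths. The F32 case reflects the tensor type listed in the checkpoint metadata. The ternary case is an idealized assumption (every parameter packed at log2(3) bits); the inference example on this page does not implement such packing and only quantizes the MLP layers at run time. The helper name `footprint_gb` is illustrative.

```python
import math


def footprint_gb(num_params: int, bits_per_param: float) -> float:
    """Size in gigabytes (1 GB = 1e9 bytes) of num_params values at a given bit width."""
    return num_params * bits_per_param / 8 / 1e9


PHYSICAL_PARAMS = 230_000_000  # "~230M" from the table above

# Weights as stored in the published F32 safetensors file.
print(f"F32 checkpoint: {footprint_gb(PHYSICAL_PARAMS, 32):.2f} GB")             # 0.92 GB
# Idealized lower bound if every weight were one of three values.
print(f"ternary ideal : {footprint_gb(PHYSICAL_PARAMS, math.log2(3)):.3f} GB")   # 0.046 GB
```

Activations and any framework overhead come on top of these weight-only numbers.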

🧩 Why Choose Charlotte?

Total Independence: Run a 4B LLM on hardware normally reserved for simple scripts.
Native Originality: Well suited to literary creation, complex role-playing, and the invention of languages.
Depth of Field: Every generated token goes through 32 cycles of internal reflection (32 passes through the same block) before it is produced.

"Charlotte does not fill space with useless words. It sculpts meaning."

❄️ Inference Example ✨

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download
import safetensors.torch
import json

# --- Custom Model Architecture (Copy from original training script) ---
# This is necessary to load the custom model from safetensors

class NelyaBitLinear(nn.Linear):
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()
        w_bit = w + (torch.round(torch.clamp(w / (scale + 1e-5), -1, 1)) - w).detach()
        x_norm = x - x.mean(dim=-1, keepdim=True)
        x_bit = x_norm + (torch.sign(x_norm) - x_norm).detach()
        return F.linear(x_bit, w_bit, self.bias)

class NelyaBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.RMSNorm(config.hidden_size)
        self.attn = nn.MultiheadAttention(config.hidden_size, config.num_heads, batch_first=True)
        self.ln2 = nn.RMSNorm(config.hidden_size)
        self.mlp = nn.Sequential(
            NelyaBitLinear(config.hidden_size, config.intermediate_size, bias=False),
            nn.SiLU(),
            NelyaBitLinear(config.intermediate_size, config.hidden_size, bias=False)
        )

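    def forward(self, x):
        # Pre-norm self-attention: normalize once and reuse as query, key, and value.
        # No attention mask is passed here.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x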

class NelyaConfig:
    def __init__(self, vocab_size, hidden_size=4096, num_layers=32, num_heads=32, intermediate_size=8192, max_pos=128):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.intermediate_size = intermediate_size
        self.max_pos = max_pos

class NelyaForLLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.block = NelyaBlock(config)
        self.head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.num_layers = config.num_layers

    def forward(self, input_ids, labels=None, attention_mask=None):
        x = self.embed(input_ids)
        for _ in range(self.num_layers):
            x = self.block(x)
        logits = self.head(x)
        if labels is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            return (loss,)
        return logits


# --- Configuration for loading ---
REPO_ID = "Finisha-F-scratch/Charlotte-Texture4b.x19.1-efficience" # Your Hugging Face repository ID

# Download model files from Hugging Face Hub
print(f"🚀 Downloading model files from {REPO_ID}...")
model_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")
config_path = hf_hub_download(repo_id=REPO_ID, filename="nelya_config.json")

# Load tokenizer
print("⏳ Loading tokenizer...")
bpe_obj = Tokenizer.from_file(tokenizer_path)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=bpe_obj, pad_token="[PAD]")

# Load custom config
print("⚙️ Loading custom NelyaConfig...")
with open(config_path, "r") as f:
    config_dict = json.load(f)
config = NelyaConfig(**config_dict)

# Instantiate and load model weights
print("🏗️ Instantiating model and loading weights...")
with torch.device("cuda"):
    model = NelyaForLLM(config)
    model.load_state_dict(safetensors.torch.load_file(model_path))

# Count and print parameters
num_params = sum(p.numel() for p in model.parameters())
print(f"\n🔥 TOTAL PARAMETERS of the loaded model: {num_params:,}")
print(f"🔥 CLASSIFICATION: {'LLM' if num_params > 800_000_000 else 'SLM'}")
# Theoretical size if every parameter were packed at 1.58 bits;
# the published checkpoint itself is stored as F32.
print(f"🔥 ESTIMATED VRAM (at 1.58 bits/param): ~{num_params * 1.58 / 8 / 1e9:.2f} GB")

# --- Inference Example ---
model.eval() # Set the model to evaluation mode

input_text = "Charlotte est"
print(f"\nInput: {input_text}")

# Tokenize input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda")

# Generation parameters
max_generation_length = 100 # Define the number of tokens to generate
temperature = 3.5 # Controls randomness: higher = more random, lower = more deterministic
top_k = 50 # Samples from the top_k most likely tokens

with torch.no_grad():
    generated_ids = input_ids # Start with the input tokens

    for _ in range(max_generation_length):
        # Get logits for the last token in the sequence
        logits = model(generated_ids)
        
        # Apply temperature
        logits = logits[0, -1, :] / temperature
        
        # Apply top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[-1]] = -float('Inf')

        # Convert to probabilities
        probabilities = F.softmax(logits, dim=-1)
        
        # Sample from the distribution
        predicted_token_id = torch.multinomial(probabilities, num_samples=1).item()
        
        # If it's an end-of-sequence token, stop generation
        if predicted_token_id == tokenizer.eos_token_id or predicted_token_id == tokenizer.pad_token_id:
            break
            
        # Append the predicted token to the generated sequence
        generated_ids = torch.cat([generated_ids, torch.tensor([[predicted_token_id]], device=generated_ids.device)], dim=-1)

# Decode the complete generated sequence
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"\nGenerated text ({len(generated_ids[0]) - len(input_ids[0])} new tokens):\n'{generated_text}'")
print('done')
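
To make the arithmetic inside `NelyaBitLinear.forward` concrete, here is a minimal sketch in plain Python (no PyTorch) of the values it computes in the forward pass. Weights are divided by their mean absolute value, clamped to [-1, 1], and rounded, so each one ends up as -1, 0, or +1; activations are centered on their mean and reduced to their sign. The function names `ternarize` and `sign_center` and the sample numbers are illustrative, and the `.detach()` straight-through trick that the original class uses for gradients is not reproduced.

```python
def ternarize(weights, eps=1e-5):
    """Map each weight to -1, 0 or +1, mirroring the forward value of w_bit."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    out = []
    for w in weights:
        clamped = max(-1.0, min(1.0, w / (scale + eps)))
        out.append(round(clamped))  # round-half-to-even, like torch.round
    return out


def sign_center(xs):
    """Subtract the mean, then keep only the sign (-1, 0 or +1), mirroring x_bit."""
    mean = sum(xs) / len(xs)
    return [(x > mean) - (x < mean) for x in xs]


sample_weights = [0.9, -0.05, 0.3, -1.2, 0.02, -0.4]  # illustrative values
print(ternarize(sample_weights))     # [1, 0, 1, -1, 0, -1]
print(sign_center([1.0, 2.0, 6.0]))  # [-1, -1, 1]
```

Three possible values per weight correspond to log2(3) ≈ 1.58 bits of information, which is presumably where the 1.58 factor in the VRAM estimate comes from.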


Safetensors
Model size: 0.2B params
Tensor type: F32
