HENLA-CONFED-3B · PREFIX-V4 · Step 300

Balanced short-prompt demo branch — best grammar/article probes in the HENLA-CONFED family

⚠️ Research artifact. This checkpoint is part of an experimental constrained-compute research project. It is not a production model, not an AGI claim, and not a leaderboard entry. Read the limitations section before using it.

Model summary

HENLA-CONFED-3B-PREFIX-V4-STEP300 is a prefix-specialized branch of the HENLA-CONFED-3B causal language model family. It was fine-tuned from the general CLEANLM-STEP70000 base using prefix-only loss over a small set of grammar/article and HENLA-identity continuations, for 300 steps.

Within the HENLA-CONFED family it is the best balanced demonstrator: strongest on targeted grammar/article probes, competitive HENLA-identity completions, and reasonable short general continuations — at the cost of some reduction in open-ended diversity compared to the base.

Checkpoint	Best use	Limitation
PREFIX-V4-STEP300 ← this one	Balanced controlled demo; grammar/article probes; short HENLA completions	More specialized than CLEANLM; prefix conditioning can reduce open-ended diversity
CLEANLM-STEP70000	General base for further training and continuation	Weak HENLA identity; grammar/article instability
HENLA-PREFIX-150	Focused HENLA-description prompts	Strongly biased toward HENLA vocabulary on unrelated prompts

Architecture

HENLA-CONFED is a non-standard causal LM. Each Transformer block replaces the single feed-forward module with K parallel cognitive-area MLPs. A learned softmax gate assigns a weight to each area, and the weighted sum of the area outputs is added to the residual stream before the next block.

token + position embeddings
→ N × HenlaConfederatedBlock
     → causal self-attention
     → K parallel cognitive-area MLPs   ← the confederated routing
     → learned gate (softmax over K areas)
     → weighted area fusion
     → residual stream
→ tied LM head
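
The gated fusion step in the diagram above can be illustrated with a toy, dependency-free sketch. It is not the HENLA implementation: the function names, the ReLU activation, the inner width, and the toy sizes are all assumptions, and attention, normalization, and biases are omitted.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def area_mlp(area, hidden):
    """One cognitive-area MLP: up-projection, ReLU (assumed), down-projection."""
    up = [max(0.0, u) for u in linear(area["w_up"], hidden)]
    return linear(area["w_down"], up)

def confederated_ffn(hidden, areas, gate_weights):
    """Gate over K areas, fuse their outputs, and add the result to the residual."""
    gate = softmax(linear(gate_weights, hidden))          # one weight per area
    outputs = [area_mlp(area, hidden) for area in areas]  # K parallel MLPs
    fused = [sum(g * out[i] for g, out in zip(gate, outputs))
             for i in range(len(hidden))]
    return [h + f for h, f in zip(hidden, fused)], gate

def random_matrix(rows, cols, rng):
    """Small random weights for the toy example."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D, K, INNER = 8, 4, 16  # toy sizes, not the real 1280-wide, 8-area configuration
areas = [
    {"w_up": random_matrix(INNER, D, rng), "w_down": random_matrix(D, INNER, rng)}
    for _ in range(K)
]
gate_weights = random_matrix(K, D, rng)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(D)]

new_hidden, gate = confederated_ffn(hidden, areas, gate_weights)
print("gate weights sum to", round(sum(gate), 6))
```

Because the gate is a softmax, every area contributes with a non-zero weight. This is a soft mixture, not a hard top-k selection.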

Configuration

Parameter	Value
Approximate parameters	~2.844 B
Layers	24
Hidden size	1280
Attention heads	16
Cognitive areas per block	8
Context length	512
Vocabulary	GPT-2 style, 50,257 tokens
Pad / EOS token ID	50256
Framework	PyTorch / Hugging Face Transformers (remote code)

Training lineage

Bootstrap (local corpus, smoke test)
  └─► FineWeb-Edu streaming (steps ~30k → 57k)
        └─► LR3E5 branch (plateaued ~57k, degenerate samples)
              └─► GATEFIX 65k (auxiliary gate-entropy + area-usage loss)
                    └─► CLEANLM 70k ← general base  ●
                          └─► PREFIX-V4 300 steps ← this checkpoint  ●

PREFIX-V4 fine-tuning details:

Setting	Value
Base	CLEANLM-STEP70000
Loss	Prefix-only (loss computed only on the forced continuation, not the prompt)
Steps	300 (checkpoint published at step 300)
Learning rate	2e-5
Batch size	2
Sequence length	128

Progression observed during training: at step 50 the target behavior was only partial; by step 200 HENLA identity was strong; between steps 250 and 300 the best grammar/identity balance was reached.

Expected behaviors (short-prompt probes)

These are illustrative examples from internal diagnostics. They are not guaranteed outputs and can vary with temperature, sampling parameters, and context.

Prompt	Expected completion style
`HENLA is`	experimental neuro-symbolic cognitive architecture …
`HENLA is not`	not conscious / not human-level …
`Artificial intelligence is`	an important tool …
`The researchers used`	a device …
`The solar energy system is`	an important source …

Internal benchmark results

HENLA family benchmark

Categories: HENLA identity, grammar/article probes, short general continuation, repetition, bad-pattern checks, top-token sanity.

Verdict within HENLA family:

Best overall: PREFIX-V4-STEP300
Best HENLA identity: HENLA-PREFIX-150
Best general base: CLEANLM-70K

Small-LM heuristic comparison

Non-HENLA short prompts, heuristic deterministic scoring. Not a standardized leaderboard.

Model	Overall (mean)	General (mean)	Grammar/article (mean)
Phi-3.5-mini-instruct	2.25	3.00	1.00
HENLA PREFIX-V4 ← this	2.13	2.00	2.33
Qwen2.5-3B	1.81	2.30	1.00
HENLA CLEANLM 70K	1.50	2.10	0.50

Correct interpretation: PREFIX-V4 ranked second overall and first on grammar/article probes under this heuristic. Phi-3.5-mini is stronger on general prompts. This comparison uses a small, targeted prompt set and should not be read as a broad claim of superiority over mature small LMs.
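
The overall column is not the plain average of the two category columns. It is consistent with a prompt-count-weighted mean in which general prompts outnumber grammar/article prompts 5 to 3. That ratio is inferred from the published numbers, not stated by the source, so the check below is a sanity sketch only.

```python
N_GENERAL, N_GRAMMAR = 5, 3  # inferred ratio only; actual prompt counts are not published

# (general mean, grammar/article mean, reported overall mean) from the table above
rows = {
    "Phi-3.5-mini-instruct": (3.00, 1.00, 2.25),
    "HENLA PREFIX-V4": (2.00, 2.33, 2.13),
    "Qwen2.5-3B": (2.30, 1.00, 1.81),
    "HENLA CLEANLM 70K": (2.10, 0.50, 1.50),
}

def weighted_overall(general, grammar):
    """Mean over all prompts, given per-category means and the assumed 5:3 ratio."""
    total = N_GENERAL + N_GRAMMAR
    return (general * N_GENERAL + grammar * N_GRAMMAR) / total

for name, (general, grammar, reported) in rows.items():
    estimate = weighted_overall(general, grammar)
    print(f"{name}: reported {reported:.2f}, recomputed {estimate:.3f}")
```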

Inference economics

HENLA-CONFED-3B-PREFIX-V4 is a constrained-compute experimental confederated neuro-symbolic language model produced with approximately €50 total compute cost.

On a rented NVIDIA A40 48 GB using Hugging Face Transformers bf16 greedy decoding, HENLA reaches 142.4 decode tokens/s in a short-prompt batch-4 scenario, with ~5.34 GB peak VRAM and an estimated €0.78 per 1M output tokens at €0.40/GPU-hour.

In a heavier telemetry run using a longer technical prompt, batch size 4, and 128 forced output tokens, HENLA reaches 92.3 decode tokens/s with ~5.37 GB allocated VRAM, 94% average GPU utilization, ~287 W average power draw, and an estimated €1.20 per 1M output tokens.

Scenario	Tokens/s	Peak VRAM	Cost / 1M tokens
Short prompt, batch 4, greedy	142.4	~5.34 GB	~€0.78
Long technical prompt, batch 4, 128 tokens	92.3	~5.37 GB	~€1.20

Measured on NVIDIA A40 48 GB, CUDA 12.8, bf16, HF Transformers 4.44.2, greedy decoding. Cost estimate assumes €0.40/GPU-hour.
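
The cost figures follow directly from throughput and the hourly GPU price. The short check below reproduces them, assuming the reported tokens/s is the aggregate throughput across the whole batch; the function name is illustrative.

```python
def cost_per_million_tokens(tokens_per_second, price_per_gpu_hour):
    """Cost in euros of generating one million output tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000

short_prompt = cost_per_million_tokens(142.4, 0.40)
long_prompt = cost_per_million_tokens(92.3, 0.40)
print(f"{short_prompt:.2f} {long_prompt:.2f}")  # 0.78 1.20
```

The same function can be reused with a different hourly price to estimate cost on other rental tiers, as long as throughput is re-measured on that hardware.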

Local diagnostics show that HENLA is weaker than mature external baselines on general and technical text quality, but obtains the lowest local perplexity on a small HENLA-domain corpus. The appropriate positioning is low-cost experimental architecture, low-memory inference, and HENLA-domain specialization — not general benchmark leadership.

How to load

Because the architecture is custom, loading requires trust_remote_code=True for both the tokenizer and the model.

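import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "RthItalia/HENLA-CONFED-3B-PREFIX-V4-STEP300"

# trust_remote_code=True lets Transformers run the custom HENLA modeling code shipped in the repo.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # bf16 weights, matching the measurements in this card
    device_map="auto",           # requires Accelerate; places the model on the available device(s)
)
model.eval()  # disable dropout for inference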

Basic generation

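prompt = "HENLA is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a short continuation; prompt plus new tokens must stay within the 512-token context.
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,  # pad and EOS share ID 50256
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))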

Stable software environment

Component	Version
PyTorch	2.4.1
Transformers	4.44.2
Tokenizers	0.19.1
Accelerate	0.33.0

Note: Updating Transformers beyond 4.44.2 can trigger compatibility issues with this checkpoint family. Pin the versions above for reproducible loading.

Serialization note: The tied embedding / LM-head weight requires safe_serialization=False at save time. Remote-code loading handles this transparently.
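
Before loading, the pinned versions from the table above can be compared against the current environment. This is a minimal sketch using only the standard library; the helper name and the decision to ignore local build tags such as "+cu121" are assumptions, and `torch` is the PyPI distribution name for PyTorch.

```python
from importlib import metadata

PINNED = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "tokenizers": "0.19.1",
    "accelerate": "0.33.0",
}

def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every deviation from the pins."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare only the public version so local tags such as "+cu121" are ignored.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[package] = (expected, installed)
    return mismatches

for package, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```

An empty result means the environment matches the pins. A missing package is reported with `None` as the installed version.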

Inference economics

Measured on a rented NVIDIA A40 48 GB, HF Transformers, bf16, greedy decoding.

Short-prompt scenario (batch 4)

Metric	Value
Decode throughput	142.4 tokens / s
Peak VRAM	~5.34 GB
Estimated cost	~€0.78 / 1M output tokens (at €0.40 / GPU-hour)

Heavy telemetry scenario (longer technical prompt, batch 4, 128 generated tokens)

Metric	Value
Decode throughput	92.3 tokens / s
Peak VRAM allocated	~5.37 GB
Average GPU utilization	~94 %
Average power draw	~287 W
Estimated cost	~€1.20 / 1M output tokens (at €0.40 / GPU-hour)

Notes

The ~5.3 GB peak VRAM is close to the weight-only floor for a ~2.84 B-parameter model in bf16 (2 bytes per parameter). Tying the embedding and LM-head weights avoids storing a second vocabulary-sized matrix.
The model fits comfortably on consumer GPUs with ≥8 GB VRAM (RTX 3070/4060 Ti class and above).
HENLA is not competitive with mature external baselines on general or technical text quality. It achieves the lowest perplexity on a small HENLA-domain corpus. The appropriate positioning is low-cost experimental architecture, low-memory inference, and HENLA-domain specialization — not general benchmark leadership.
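
The weight-only memory floor mentioned in the notes can be worked out from the parameter count. The sketch below assumes 2 bytes per parameter for bf16 and counts weights only, with no activations or KV cache. The reported ~5.34 and ~5.37 figures sit just above this floor only if they are expressed in GiB, which is an assumption; the card does not state the unit convention.

```python
PARAMS = 2.844e9       # approximate parameter count from the configuration table
BYTES_PER_PARAM = 2    # bf16 stores each parameter in 2 bytes

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_gb = weight_bytes / 1e9       # decimal gigabytes
weight_gib = weight_bytes / 2**30    # binary gibibytes

print(f"{weight_gb:.2f} GB = {weight_gib:.2f} GiB")  # 5.69 GB = 5.30 GiB
```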

Limitations

Internal validation only. All benchmarks are internal diagnostics, not independent external evaluations.
Prefix specialization. The 300-step prefix run narrows the conditional distribution. Open-ended generation on prompts far from the training prefixes may drift or degrade compared to the CLEANLM base.
Linguistically weak compared to mature small LMs. HENLA-CONFED was trained at a fraction of the compute and data scale of Phi, Qwen, Llama, or Gemma.
Short context. Maximum context length is 512 tokens.
No instruction-following. This is not an instruction-tuned model. It does not follow chat templates or system prompts.
Grammar and article errors. Expect residual grammatical instability, especially on longer continuations.
No multi-seed confidence. Results are single-run diagnostics without statistical confidence intervals.
No external human evaluation.
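
Because of the 512-token context limit, prompt length and generation length share one budget. One simple way to stay inside it is to keep only the most recent prompt tokens; the helper below is a hypothetical sketch that operates on plain lists of token IDs and is not part of the model's API.

```python
MAX_CONTEXT = 512  # context length from the configuration table

def fit_prompt(prompt_ids, max_new_tokens, max_context=MAX_CONTEXT):
    """Keep the most recent prompt tokens so prompt plus generation fits the context."""
    if not 0 < max_new_tokens < max_context:
        raise ValueError("max_new_tokens must be between 1 and max_context - 1")
    budget = max_context - max_new_tokens
    return list(prompt_ids[-budget:])

prompt_ids = list(range(600))  # stand-in for a tokenized prompt that is too long
trimmed = fit_prompt(prompt_ids, max_new_tokens=64)
print(len(trimmed), trimmed[0])  # 448 152
```

Truncating from the left keeps the text closest to the generation point, which is usually what matters most for a short continuation.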

Project context and related checkpoints

HENLA (Hypergraph Embodied Neural Learning Architecture) is a constrained-compute research program that progressed from an embodied hypergraph learner (HENLA-0) to a modular evidence-routing cognitive architecture (HENLA-MoC) and finally to this confederated-area causal LM family. The full development trajectory is documented in the companion white paper.

Total external compute for the HENLA-CONFED line: ~€325 of rented GPU time (NVIDIA A40, CUDA 12.8), over a ~2-day development cycle.

Family

Repository	Role
RthItalia/HENLA-CONFED-3B-FINEWEB-CLEANLM-STEP70000	General base — recommended for continuation and further fine-tuning
RthItalia/HENLA-CONFED-3B-PREFIX-V4-STEP300 ← you are here	Balanced demo branch — grammar/article probes, controlled HENLA identity
RthItalia/HENLA-CONFED-3B-HENLA-PREFIX-150	HENLA-identity branch — use only for HENLA-description prompts

License

HENLA Research and Education Non-Commercial License

license: other
license_name: henla-research-education-non-commercial

Permitted: academic research · independent research · educational use · student projects · evaluation and benchmarking · non-commercial experimentation.

Not permitted without prior written permission: commercial use · paid products or services · hosted commercial inference · resale · integration into commercial applications · training / distillation / fine-tuning for commercial deployment.

Citation / acknowledgment

If you use this model in research, please cite the companion white paper or link to this repository and the HENLA project logs.

Road / RthItalia (2026). HENLA: a constrained-compute study of hypergraph memory,
federated cognitive routing, and a 3B confederated language model. Draft white paper,
May 2026. https://huggingface.co/RthItalia

Ethics and safety

This checkpoint is a non-commercial research artifact. It is not intended for medical, legal, financial, safety-critical, or automated decision-making use. It does not represent AGI, consciousness, or human-level intelligence. Correct use is research, education, benchmarking, and non-commercial experimentation.
