GDN1 Long-Context 1B 32K Answer Full Fine-Tune

This is a research checkpoint from the Long-GDN workspace.

Base Model

Base checkpoint: linear-moe-hub/Gated-Deltanet-1.3B
Architecture: Gated DeltaNet / linear recurrent attention
Base training data reported by the upstream model card: SlimPajama 100B-token sample
License inherited from upstream model card: Apache-2.0

Training Run

Local source path: runs/gdn1_longctx_1b_32k_bs10_answer_ft/final
Tokenizer source: runs/gdn1_longctx_1b_32k_bs10_answer_ft/final
Training mode: full fine-tuning, no LoRA/adapter
Hardware target: 8x NVIDIA H200
Sequence length: 32768
Additional token budget: approximately 1B tokens
Manifest/config: configs/gdn1_memory_mix_long_context.json
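For a sense of scale, the sequence length and token budget listed above can be turned into a rough count of training sequences. This is only a back-of-the-envelope sketch: it assumes the "~1B" budget is exactly 1e9 tokens and that every sequence is packed to the full 32768 tokens, neither of which the card states.

```python
# Rough sequence count implied by the Training Run figures.
SEQUENCE_LENGTH = 32768        # from the Training Run list
TOKEN_BUDGET = 1_000_000_000   # assumption: "~1B" taken as exactly 1e9 tokens

full_sequences = TOKEN_BUDGET // SEQUENCE_LENGTH
leftover_tokens = TOKEN_BUDGET % SEQUENCE_LENGTH

print(full_sequences, leftover_tokens)  # prints: 30517 18944
```

Under those assumptions the run sees on the order of 30K full-length sequences.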

Intended Research Use

This checkpoint is intended for research on:

long-context associative recall
RULER/MQAR-style state tracking
recurrent-state contamination during long generation
Reference-State Reset with Rolling Replay, a GDN/RNN adaptation of the R-SWA idea
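The associative-recall setting in the list above can be probed with synthetic key-value prompts. The following is a minimal sketch, assuming the "Reference facts / Question / Answer" layout used in the single-GPU example in this card; `build_recall_prompt`, the key and value naming scheme, and the seed handling are illustrative choices, not the project's evaluation code.

```python
import random


def build_recall_prompt(num_pairs, seed=0):
    """Build one key-value recall prompt and return (prompt, expected_value)."""
    if num_pairs < 1:
        raise ValueError("num_pairs must be at least 1")
    rng = random.Random(seed)  # local RNG keeps the prompt reproducible
    keys = [f"key_{i:04d}" for i in range(num_pairs)]
    values = [f"value_{rng.randrange(100000):05d}" for _ in keys]
    query_index = rng.randrange(num_pairs)

    lines = ["Reference facts:"]
    lines += [f"- {key}: {value}" for key, value in zip(keys, values)]
    lines += ["", f"Question: {keys[query_index]}?", "Answer:"]
    return "\n".join(lines), values[query_index]


prompt, expected = build_recall_prompt(num_pairs=4, seed=0)
print(prompt)
print("expected:", expected)
```

Raising `num_pairs` lengthens the context, so the same generator can be reused to probe different context lengths once the prompt is tokenized.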

Usage

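These checkpoints use the FLA (flash-linear-attention) Gated DeltaNet implementation. In the current Long-GDN environment, a plain GatedDeltaNetForCausalLM.from_pretrained() call can hit a Transformers 5.x tied-weight metadata issue. The robust path is to patch the FLA tied-weight metadata before loading: when the installed FLA class declares _tied_weights_keys as a list, the examples in this section replace it with a dict that maps lm_head.weight to model.embeddings.weight.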
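The patch can be wrapped in a small idempotent helper so it is applied once, regardless of how many loaders call it. This is a sketch only: `patch_tied_weights` is a hypothetical name, the dict mapping is copied from the examples in this card, and the dummy class stands in for `GatedDeltaNetForCausalLM` so the snippet runs without FLA installed.

```python
def patch_tied_weights(model_cls):
    """Replace list-style tied-weight metadata with the dict form used in this card.

    Returns True if the class was patched, False if nothing needed changing.
    """
    if isinstance(getattr(model_cls, "_tied_weights_keys", None), list):
        model_cls._tied_weights_keys = {
            "lm_head.weight": "model.embeddings.weight"
        }
        return True
    return False


class DummyModel:
    """Stand-in for the FLA model class (illustration only)."""
    _tied_weights_keys = ["lm_head.weight"]


print(patch_tied_weights(DummyModel))  # prints: True
print(patch_tied_weights(DummyModel))  # prints: False (already patched)
```

With FLA installed, the call would be `patch_tied_weights(GatedDeltaNetForCausalLM)` before `from_pretrained()`.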

Install/runtime requirements:

pip install torch transformers safetensors huggingface_hub
# plus an FLA package/source tree that provides:
#   fla.models.gated_deltanet.GatedDeltaNetForCausalLM
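
Before loading the checkpoint, it can help to confirm which of these requirements are actually present. The sketch below uses only the standard library; `check_runtime` is a hypothetical helper, and FLA is checked by its import name `fla` because it may come from a package or from a local source tree.

```python
from importlib import metadata
from importlib.util import find_spec


def check_runtime(packages=("torch", "transformers", "safetensors", "huggingface_hub")):
    """Map each requirement to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    # FLA is looked up by import name rather than by distribution name.
    report["fla"] = "importable" if find_spec("fla") is not None else None
    return report


for name, status in check_runtime().items():
    print(f"{name:16s} {status or 'MISSING'}")
```

A `MISSING` entry for `fla` means the Gated DeltaNet class used in the examples cannot be imported yet.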

CPU Example

import torch
from transformers import AutoTokenizer
from fla.models.gated_deltanet import GatedDeltaNetForCausalLM

repo_id = "LLM-OS-Models/gdn1-longctx-1b-32k-answer-ft"

# Transformers 5.x compatibility patch for the installed FLA class.
if isinstance(getattr(GatedDeltaNetForCausalLM, "_tied_weights_keys", None), list):
    GatedDeltaNetForCausalLM._tied_weights_keys = {
        "lm_head.weight": "model.embeddings.weight"
    }

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = GatedDeltaNetForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float32,
)
model.eval()

prompt = "A special magic number is 12345. What is the special magic number?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
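
The example above decodes `output[0]`, which for a causal LM contains the prompt tokens followed by the generated continuation. When only the answer is wanted, the prompt prefix can be sliced off first. A minimal sketch, where `continuation_ids` is a hypothetical helper that works on any sliceable sequence, including a 1-D tensor:

```python
def continuation_ids(sequence_ids, prompt_length):
    """Return only the ids that were generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(sequence_ids):
        raise ValueError("prompt_length must be between 0 and len(sequence_ids)")
    return sequence_ids[prompt_length:]


# With the objects from the example above, the call would look like:
#   prompt_length = inputs["input_ids"].shape[1]
#   answer_ids = continuation_ids(output[0], prompt_length)
#   print(tokenizer.decode(answer_ids, skip_special_tokens=True))
print(continuation_ids([11, 12, 13, 901, 902], prompt_length=3))  # prints: [901, 902]
```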

Single-GPU bf16 Example

import torch
from transformers import AutoTokenizer
from fla.models.gated_deltanet import GatedDeltaNetForCausalLM

repo_id = "LLM-OS-Models/gdn1-longctx-1b-32k-answer-ft"

# Same Transformers 5.x compatibility patch as in the CPU example.
if isinstance(getattr(GatedDeltaNetForCausalLM, "_tied_weights_keys", None), list):
    GatedDeltaNetForCausalLM._tied_weights_keys = {
        "lm_head.weight": "model.embeddings.weight"
    }

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = GatedDeltaNetForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
).to("cuda")
model.eval()

prompt = "Reference facts:\n- key_alpha: value_123\n\nQuestion: key_alpha?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))

Long-GDN Local Loader

The project repository includes a more defensive loader, load_gdn1_causal_lm, in scripts/gdn1_common.py. It applies the compatibility patch and performs the key conversion for older public checkpoints used in local experiments.

from pathlib import Path
import torch
from transformers import AutoTokenizer
# Run from the project repository root so that the scripts package is importable.
from scripts.gdn1_common import load_gdn1_causal_lm

repo_or_local_path = Path("path/to/downloaded/checkpoint")
tokenizer = AutoTokenizer.from_pretrained(repo_or_local_path, use_fast=True)
model = load_gdn1_causal_lm(repo_or_local_path, torch_dtype=torch.bfloat16).to("cuda")

Known Results

1B 32K answer-focused run. MQAR likelihood scores improved at 4K and 16K, but the run degraded 1K and did not transfer to 32K or 64K. Final scores by context length: 1K 0.0000, 2K 0.0625, 4K 0.1562, 8K 0.0625, 16K 0.1875, 32K 0.0000, 64K 0.0000.
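
The final scores above are easier to compare side by side. The sketch below only restates the numbers from this card and picks out the peak and the zero-score lengths; the variable names are illustrative.

```python
# Final MQAR likelihood scores reported in this card, keyed by context length.
final_scores = {
    "1K": 0.0000, "2K": 0.0625, "4K": 0.1562, "8K": 0.0625,
    "16K": 0.1875, "32K": 0.0000, "64K": 0.0000,
}

best_length = max(final_scores, key=final_scores.get)
zero_lengths = [length for length, score in final_scores.items() if score == 0.0]

print(f"{'context':>8}  score")
for length, score in final_scores.items():
    print(f"{length:>8}  {score:.4f}")
print("best:", best_length)              # prints: best: 16K
print("zero:", ", ".join(zero_lengths))  # prints: zero: 1K, 32K, 64K
```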

Caveats

This is not the current best checkpoint. The run is useful for ablation and failure analysis; it should not be used to claim broad long-context improvement.

Citation Context

Relevant background papers include Gated Delta Networks, Gated DeltaNet-2, Log-Linear Attention, and Unlimited OCR / R-SWA. This checkpoint does not implement a new architecture by itself; it is part of a checkpoint-preserving full fine-tuning and inference-control study.
