GDN1 Long-Context 1B 32K Answer Full Fine-Tune

This is a research checkpoint from the Long-GDN workspace.

Base Model

  • Base checkpoint: linear-moe-hub/Gated-Deltanet-1.3B
  • Architecture: Gated DeltaNet / linear recurrent attention
  • Base training data reported by the upstream model card: SlimPajama 100B-token sample
  • License inherited from upstream model card: Apache-2.0

Training Run

  • Local source path: runs/gdn1_longctx_1b_32k_bs10_answer_ft/final
  • Tokenizer source: runs/gdn1_longctx_1b_32k_bs10_answer_ft/final
  • Training mode: full fine-tuning, no LoRA/adapter
  • Hardware target: 8x NVIDIA H200
  • Sequence length: 32768
  • Additional token budget: approximately 1B tokens
  • Manifest/config: configs/gdn1_memory_mix_long_context.json
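
As a rough sizing aid, the sequence length and token budget above imply a certain number of full-length training sequences. The sketch below works through that arithmetic. It treats "approximately 1B" as exactly 1e9 tokens. The sequences-per-step value is an assumption, not a figure stated in this card: the `bs10` fragment of the run name may refer to a batch size of 10, but that is only a guess. The function names are illustrative.

```python
# Rough sizing of the fine-tuning run from the figures listed above.
SEQ_LEN = 32768                # sequence length from the run description
TOKEN_BUDGET = 1_000_000_000   # "approximately 1B", treated as exactly 1e9

# ASSUMPTION: sequences consumed per optimizer step. Not stated in this card.
ASSUMED_SEQS_PER_STEP = 10


def num_sequences(token_budget: int, seq_len: int) -> int:
    """Number of full-length sequences that fit in the token budget."""
    return token_budget // seq_len


def num_steps(token_budget: int, seq_len: int, seqs_per_step: int) -> int:
    """Number of full optimizer steps under the assumed step size."""
    return num_sequences(token_budget, seq_len) // seqs_per_step


print("full-length sequences:", num_sequences(TOKEN_BUDGET, SEQ_LEN))
print("steps (assumed step size):",
      num_steps(TOKEN_BUDGET, SEQ_LEN, ASSUMED_SEQS_PER_STEP))
```

The actual step count depends on the real batch configuration in the manifest, which this card does not reproduce.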

Intended Research Use

This checkpoint is intended for research on:

  • long-context associative recall
  • RULER/MQAR-style state tracking
  • recurrent-state contamination during long generation
  • Reference-State Reset with Rolling Replay, a GDN/RNN adaptation of the R-SWA idea
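
For associative-recall probing, a synthetic key-value prompt can be generated programmatically. The sketch below follows the prompt layout used in the single-GPU example on this card ("Reference facts", "Question", "Answer"). It is not the RULER or MQAR benchmark format, and the helper names `build_recall_prompt` and `make_random_pairs` are hypothetical.

```python
import random


def make_random_pairs(n: int, seed: int = 0) -> dict:
    """Create n synthetic key/value facts with a reproducible seed."""
    rng = random.Random(seed)
    return {f"key_{i:04d}": f"value_{rng.randrange(10**6):06d}" for i in range(n)}


def build_recall_prompt(pairs: dict, query_key: str):
    """Return (prompt, expected_answer) for one query over the given facts."""
    if query_key not in pairs:
        raise KeyError(query_key)
    facts = "\n".join(f"- {key}: {value}" for key, value in pairs.items())
    prompt = f"Reference facts:\n{facts}\n\nQuestion: {query_key}?\nAnswer:"
    return prompt, pairs[query_key]


pairs = make_random_pairs(4, seed=0)
prompt, expected = build_recall_prompt(pairs, "key_0002")
print(prompt)
print("expected:", expected)
```

Increasing `n` lengthens the context, which is one simple way to probe recall at different distances between a fact and its query.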

Usage

This checkpoint uses the FLA Gated DeltaNet implementation. In the current Long-GDN environment, calling GatedDeltaNetForCausalLM.from_pretrained() without a patch can hit a Transformers 5.x tied-weight metadata issue. The robust path is to patch the FLA class's _tied_weights_keys attribute before loading, as the examples below do.

Install/runtime requirements:

pip install torch transformers safetensors huggingface_hub
# plus an FLA package/source tree that provides:
#   fla.models.gated_deltanet.GatedDeltaNetForCausalLM
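
The compatibility patch described above can be wrapped in a small helper that is safe to call more than once. This is a minimal sketch, not part of FLA or Transformers: `patch_tied_weights` is a hypothetical name, and the dummy class only stands in for `GatedDeltaNetForCausalLM` so the snippet runs without FLA installed. The default mapping is the one used in the examples on this card.

```python
def patch_tied_weights(model_cls, mapping=None) -> bool:
    """Replace a list-style `_tied_weights_keys` with a dict mapping.

    Returns True if the class was patched, False if no patch was needed.
    """
    if mapping is None:
        # Mapping taken from the loading examples on this card.
        mapping = {"lm_head.weight": "model.embeddings.weight"}
    if isinstance(getattr(model_cls, "_tied_weights_keys", None), list):
        model_cls._tied_weights_keys = dict(mapping)
        return True
    return False


class DummyModel:
    """Stand-in for the FLA model class (illustration only)."""
    _tied_weights_keys = ["lm_head.weight"]


print(patch_tied_weights(DummyModel))  # first call patches the class
print(patch_tied_weights(DummyModel))  # second call finds nothing to patch
```

In real use the helper would be called with the imported FLA class before `from_pretrained()`.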

CPU Example

import torch
from transformers import AutoTokenizer
from fla.models.gated_deltanet import GatedDeltaNetForCausalLM

repo_id = "LLM-OS-Models/gdn1-longctx-1b-32k-answer-ft"

# Transformers 5.x compatibility patch for the installed FLA class.
if isinstance(getattr(GatedDeltaNetForCausalLM, "_tied_weights_keys", None), list):
    GatedDeltaNetForCausalLM._tied_weights_keys = {
        "lm_head.weight": "model.embeddings.weight"
    }

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = GatedDeltaNetForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float32,
)
model.eval()

prompt = "A special magic number is 12345. What is the special magic number?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))

Single-GPU bf16 Example

import torch
from transformers import AutoTokenizer
from fla.models.gated_deltanet import GatedDeltaNetForCausalLM

repo_id = "LLM-OS-Models/gdn1-longctx-1b-32k-answer-ft"

if isinstance(getattr(GatedDeltaNetForCausalLM, "_tied_weights_keys", None), list):
    GatedDeltaNetForCausalLM._tied_weights_keys = {
        "lm_head.weight": "model.embeddings.weight"
    }

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = GatedDeltaNetForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
).to("cuda")
model.eval()

prompt = "Reference facts:\n- key_alpha: value_123\n\nQuestion: key_alpha?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
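
To check a generation like the one above against the expected value, a simple exact-match helper can be applied to the decoded string. This is a sketch under stated assumptions: it assumes the prompt contains a single "Answer:" marker and that the answer is the first non-empty line after it. The helper names are hypothetical, and this is not the MQAR likelihood metric reported under Known Results.

```python
ANSWER_MARKER = "Answer:"


def extract_answer(decoded: str) -> str:
    """Return the first non-empty line after the first 'Answer:' marker."""
    _, found, completion = decoded.partition(ANSWER_MARKER)
    if not found:
        return ""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


def exact_match(decoded: str, expected: str) -> bool:
    """True if the extracted answer equals the expected value exactly."""
    return extract_answer(decoded) == expected


# Hand-written example string standing in for a decoded model output.
sample = (
    "Reference facts:\n- key_alpha: value_123\n\n"
    "Question: key_alpha?\nAnswer: value_123\n\nQuestion: key_beta?"
)
print(extract_answer(sample))
print(exact_match(sample, "value_123"))
```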

Long-GDN Local Loader

The project repository includes a more defensive loader at scripts/gdn1_common.py::load_gdn1_causal_lm. It applies the compatibility patch and handles the key conversion for older public checkpoints used in local experiments.

from pathlib import Path
import torch
from transformers import AutoTokenizer
from scripts.gdn1_common import load_gdn1_causal_lm

repo_or_local_path = Path("path/to/downloaded/checkpoint")
tokenizer = AutoTokenizer.from_pretrained(repo_or_local_path, use_fast=True)
model = load_gdn1_causal_lm(repo_or_local_path, torch_dtype=torch.bfloat16).to("cuda")

Known Results

1B 32K answer-focused run. MQAR likelihood improved at 4K and 16K, but degraded at 1K and did not transfer to 32K or 64K. Final scores by context length:

  • 1K: 0.0000
  • 2K: 0.0625
  • 4K: 0.1562
  • 8K: 0.0625
  • 16K: 0.1875
  • 32K: 0.0000
  • 64K: 0.0000
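
For ablation scripts, the final scores above can be held in a small mapping and summarized. The sketch below only restates the numbers reported on this card. The helper names are illustrative, and no baseline scores are included because none are given here.

```python
# Final MQAR likelihood scores copied from the results above.
FINAL_MQAR = {
    "1K": 0.0000,
    "2K": 0.0625,
    "4K": 0.1562,
    "8K": 0.0625,
    "16K": 0.1875,
    "32K": 0.0000,
    "64K": 0.0000,
}


def nonzero_lengths(scores: dict) -> list:
    """Context lengths at which the score is above zero."""
    return [length for length, score in scores.items() if score > 0]


def best_length(scores: dict) -> str:
    """Context length with the highest score."""
    return max(scores, key=scores.get)


for length, score in FINAL_MQAR.items():
    print(f"{length:>4}  {score:.4f}")
print("nonzero:", nonzero_lengths(FINAL_MQAR))
print("best:", best_length(FINAL_MQAR))
```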

Caveats

This is not the current best checkpoint. The run is useful for ablation and failure analysis; it should not be used to claim broad long-context improvement.

Citation Context

Relevant background papers include Gated Delta Networks, Gated DeltaNet-2, Log-Linear Attention, and Unlimited OCR / R-SWA. This checkpoint does not implement a new architecture by itself; it is part of a checkpoint-preserving full fine-tuning and inference-control study.
