# summerMC/summerV2

summerMC/summerV2 is an experimental causal language model based on a custom `VanFastForCausalLM` architecture.
This model was developed by a first-year vocational school student in Japan, age 18, as an independent research and engineering project.
The project focuses on building and testing a custom fast causal language model with:
- custom Hugging Face-compatible model code
- KV-cache enabled autoregressive inference
- streaming decode support
- anti-repetition sampling utilities
- NaN/Inf guarded logits handling
- local `modeling_van_fast.py` loading support
The model is primarily intended for research and experimentation, not production deployment.
## Model Details
| Item | Value |
|---|---|
| Model name | summerMC/summerV2 |
| Architecture | VanFastForCausalLM |
| Task | Causal language modeling |
| Framework | PyTorch / Hugging Face Transformers |
| Inference style | Autoregressive text generation |
| Cache support | KV-cache enabled |
| Primary language | English |
| Developer | First-year vocational school student, age 18 |
| Status | Experimental |
## Developer Note
This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project.
The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding. The project is experimental, but it is designed to be reproducible and inspectable for other researchers, students, and engineers.
## Intended Use
This model is intended for:
- language-model architecture research
- custom Transformer inference experiments
- KV-cache decoding tests
- sampling strategy experiments
- small-to-mid scale causal LM prototyping
- comparison against GPT-style baselines
- student-led AI research demonstrations
This model is not intended for:
- safety-critical use
- medical, legal, or financial advice
- autonomous decision-making
- deployment without additional evaluation
- factual answering without retrieval or verification
## Installation

```bash
pip install -U torch transformers accelerate safetensors
```
For GPU inference, install a CUDA-compatible PyTorch build.
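For example, a CUDA 12.1 build can be installed from the official PyTorch wheel index (adjust the `cu121` tag to match your driver and CUDA version):

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```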
## Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)
model.to(device)
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
```
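Since the model advertises streaming decode support, a minimal streaming sketch using the standard `TextIteratorStreamer` from Transformers is shown below. It reuses `model`, `tokenizer`, and `inputs` from the snippet above; the sampling values are illustrative, not prescribed by this repository.

```python
from threading import Thread
from transformers import TextIteratorStreamer

# The streamer yields decoded text pieces as new tokens are generated.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    top_p=0.92,
    streamer=streamer,
    pad_token_id=tokenizer.pad_token_id,
)

# Run generation in a background thread and print pieces as they arrive.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for piece in streamer:
    print(piece, end="", flush=True)
thread.join()
print()
```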
## Direct Local Import Inference

If remote-code loading causes cache or import issues, the model can be loaded by directly importing `modeling_van_fast.py`.
```python
import os
import sys
import json
import importlib.util

import torch
from transformers import AutoTokenizer

HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

# Import modeling_van_fast.py as a fresh module to avoid stale cached imports.
module_name = "modeling_van_fast_runtime"
if module_name in sys.modules:
    del sys.modules[module_name]
spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

# Build the config from config.json and force KV-cache usage.
with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)
cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False

config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()
```
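With the model loaded this way, generation works the same as it does through `AutoModelForCausalLM`. A minimal sketch, reusing `model`, `tokenizer`, and `DEVICE` from above and the same illustrative sampling values as the basic-usage snippet:

```python
inputs = tokenizer(
    "Explain Transformer models in simple terms.\n\nAnswer:",
    return_tensors="pt",
    add_special_tokens=False,
).to(DEVICE)

with torch.inference_mode():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```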
## KV-cache Test
```python
import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    # Prefill: run the full prompt once and keep the returned KV cache.
    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )
    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)
    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")

    print("layers:", len(out.past_key_values))
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))

    # Decode step: feed only the next token together with the cached keys/values.
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )
    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
```
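To see the practical effect of the cache, a rough single-run timing comparison can be added after the test. This is a sketch assuming `model.generate` accepts the standard `use_cache` argument; timings will vary by hardware and sequence length.

```python
import time

def time_generate(use_cache: bool, n_tokens: int = 64) -> float:
    """Greedy-decode n_tokens and return elapsed seconds (rough, single run)."""
    ids = tokenizer(
        "Hello world", return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(
            ids,
            max_new_tokens=n_tokens,
            do_sample=False,
            use_cache=use_cache,
            pad_token_id=tokenizer.pad_token_id,
        )
    return time.perf_counter() - start

print("with KV cache:   ", round(time_generate(True), 3), "s")
print("without KV cache:", round(time_generate(False), 3), "s")
```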
## Recommended Sampling Settings

The following settings were used during local KV-cache inference testing:

```python
max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3
```

For more stable output, try:

```python
temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4
```

For more diverse output, try:

```python
temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
```
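These presets map directly onto `transformers.GenerationConfig` (or onto `model.generate(...)` keyword arguments). A sketch for the "stable" preset, reusing `model`, `tokenizer`, and `inputs` from the usage snippets above:

```python
from transformers import GenerationConfig

# "Stable" preset from the list above; adjust to taste.
stable_config = GenerationConfig(
    max_new_tokens=160,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.4,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

with torch.inference_mode():
    out_ids = model.generate(**inputs, generation_config=stable_config)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```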
## Example Prompt

```text
Explain Transformer models in simple terms.

Answer:
```
## Current Limitations
This is an experimental model. Output quality may include:
- repetition
- grammatical instability
- factual hallucination
- incomplete reasoning
- degraded long-form coherence
- unstable behavior with very high temperature
- weak instruction following compared with instruction-tuned models
The model should be evaluated carefully before any downstream use.
## Safety Notice
This model may generate incorrect, biased, unsafe, or misleading content.
Do not use it as the sole source of truth for high-stakes decisions.
Recommended mitigations:
- use retrieval for factual tasks
- apply output filtering
- evaluate on task-specific benchmarks
- use human review for sensitive outputs
- avoid deployment without safety tuning
## Research Notes
summerV2 is part of an experimental model-development line focused on fast training and inference for custom causal language models.
The current implementation emphasizes:
- Hugging Face compatibility
- direct model-code import fallback
- KV-cache streaming decode
- custom sampling controls
- inference stability checks (a NaN/Inf logit-guard sketch follows this list)
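The NaN/Inf logit guard itself lives inside `modeling_van_fast.py`. As a point of comparison only, an equivalent guard can be applied at generation time with a stock Transformers logits processor; the class below is a hypothetical illustration, not the model's internal implementation, and reuses `model`, `tokenizer`, and `inputs` from the usage snippets.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class NanInfGuard(LogitsProcessor):
    """Replace NaN/Inf logits so sampling never sees invalid values."""
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        return torch.nan_to_num(scores, nan=0.0, posinf=1e4, neginf=-1e4)

with torch.inference_mode():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        logits_processor=LogitsProcessorList([NanInfGuard()]),
        # Alternatively, the built-in flag below enables the equivalent guard:
        # remove_invalid_values=True,
        pad_token_id=tokenizer.pad_token_id,
    )
```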
Future work may include:
- better pretraining data mixture
- instruction tuning
- DPO or preference optimization
- stronger tokenizer/model alignment
- long-context stability improvements
- benchmark reporting
- model card expansion with training details
## Citation

If you use this model in experiments, cite the repository:

```bibtex
@misc{summerV2,
  title        = {summerMC/summerV2},
  author       = {summerMC},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}
```
## Disclaimer
This repository contains an experimental research model.
No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.