summerMC/summerV2

summerMC/summerV2 is an experimental causal language model based on a custom VanFastForCausalLM architecture.

This model was developed by a first-year vocational school student in Japan, age 18, as an independent research and engineering project.

The project focuses on building and testing a custom fast causal language model with:

  • custom Hugging Face-compatible model code
  • KV-cache enabled autoregressive inference
  • streaming decode support
  • anti-repetition sampling utilities
  • NaN/Inf-guarded logits handling
  • local modeling_van_fast.py loading support

The model is primarily intended for research and experimentation, not production deployment.
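As an illustration of the NaN/Inf-guard idea, here is a minimal sketch (not the model's actual implementation; the `sanitize_logits` name and fill values are hypothetical) of replacing non-finite logits before sampling:

```python
import torch

def sanitize_logits(logits: torch.Tensor, fill: float = -1e4) -> torch.Tensor:
    """Replace non-finite logits so softmax cannot emit NaN probabilities."""
    # nan -> large negative, +inf -> large positive, -inf -> large negative
    return torch.nan_to_num(logits, nan=fill, posinf=1e4, neginf=fill)

bad = torch.tensor([[1.0, float("nan"), float("inf"), float("-inf")]])
probs = torch.softmax(sanitize_logits(bad), dim=-1)
print(torch.isfinite(probs).all())
```

A guard like this keeps `torch.multinomial` from failing mid-generation when a single step produces a non-finite logit.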


Model Details

Item              Value
Model name        summerMC/summerV2
Architecture      VanFastForCausalLM
Task              Causal language modeling
Framework         PyTorch / Hugging Face Transformers
Inference style   Autoregressive text generation
Cache support     KV-cache enabled
Primary language  English
Developer         First-year vocational school student, age 18
Status            Experimental

Developer Note

This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project.

The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding. The project is experimental, but it is designed to be reproducible and inspectable for other researchers, students, and engineers.


Intended Use

This model is intended for:

  • language-model architecture research
  • custom Transformer inference experiments
  • KV-cache decoding tests
  • sampling strategy experiments
  • small-to-mid scale causal LM prototyping
  • comparison against GPT-style baselines
  • student-led AI research demonstrations

This model is not intended for:

  • safety-critical use
  • medical, legal, or financial advice
  • autonomous decision-making
  • deployment without additional evaluation
  • factual answering without retrieval or verification

Installation

pip install -U torch transformers accelerate safetensors

For GPU inference, install a CUDA-compatible PyTorch build.
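For example, a CUDA 12.1 wheel can be installed from the official PyTorch index (substitute the index URL matching your CUDA version, per the selector on pytorch.org):

```shell
pip install torch --index-url https://download.pytorch.org/whl/cu121
```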


Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)

model.to(device)
model.eval()

# Many custom models ship without a dedicated pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(text)

Direct Local Import Inference

If remote-code loading causes cache or import issues, the model can be loaded by directly importing modeling_van_fast.py.

import os
import sys
import json
import importlib.util
import torch
from transformers import AutoTokenizer

HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

module_name = "modeling_van_fast_runtime"

# Drop any stale copy so repeated runs (e.g. in a notebook) reload fresh code.
if module_name in sys.modules:
    del sys.modules[module_name]

spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)

cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False

config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)

model.to(DEVICE)
model.eval()

KV-cache Test

import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )

    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)

    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")

    print("layers:", len(out.past_key_values))

    # Per-layer (key, value) pair; assumes the legacy tuple cache layout.
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))

    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)

    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )

    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
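The same prefill-then-decode pattern can be exercised without downloading the checkpoint, using a tiny randomly initialized GPT-2 as a stand-in for VanFastForCausalLM (all sizes below are arbitrary illustration values, not this model's configuration):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model: no weights download, runs on CPU in well under a second.
config = GPT2Config(vocab_size=128, n_positions=64, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()

@torch.inference_mode()
def greedy_decode(input_ids, max_new_tokens=8):
    # Prefill: run the full prompt once and keep the cache.
    out = model(input_ids=input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    # Decode: feed only the newest token, reusing the cache each step.
    for _ in range(max_new_tokens - 1):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
    return torch.cat(generated, dim=-1)

prompt = torch.randint(0, 128, (1, 5))
tokens = greedy_decode(prompt)
print(tokens.shape)  # torch.Size([1, 8])
```

The loop mirrors what `model.generate` does internally: one full-prompt forward pass, then single-token steps that reuse `past_key_values`.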

Recommended Sampling Settings

The following settings were used during local KV-cache inference testing:

max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3

For more stable output, try:

temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4

For more diverse output, try:

temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
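These knobs correspond to standard logits-processing steps that `generate` applies internally. As an illustration only (not the repository's own sampling utilities), a minimal torch sketch of the repetition penalty and top-k/top-p filtering they control:

```python
import torch

def apply_repetition_penalty(logits, prev_ids, penalty=1.35):
    # HF-style penalty: divide positive logits by `penalty`, multiply
    # negative ones, for every token id that has already been generated.
    score = logits.gather(-1, prev_ids)
    score = torch.where(score < 0, score * penalty, score / penalty)
    return logits.scatter(-1, prev_ids, score)

def top_k_top_p_filter(logits, top_k=80, top_p=0.92):
    # Top-k: mask everything below the k-th largest logit.
    kth = torch.topk(logits, min(top_k, logits.size(-1)))[0][..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: mask tokens whose preceding cumulative probability exceeds top_p
    # (the first token is always kept, since its preceding mass is zero).
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    p = torch.softmax(sorted_logits, dim=-1)
    remove = p.cumsum(dim=-1) - p > top_p
    return logits.masked_fill(remove.scatter(-1, sorted_idx, remove), float("-inf"))

logits = torch.randn(1, 50)
prev_ids = torch.tensor([[3, 7, 7, 12]])
filtered = top_k_top_p_filter(apply_repetition_penalty(logits, prev_ids),
                              top_k=10, top_p=0.9)
probs = torch.softmax(filtered / 0.85, dim=-1)  # temperature = 0.85
next_id = torch.multinomial(probs, num_samples=1)
```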

Example Prompt

Explain Transformer models in simple terms.

Answer:

Current Limitations

This is an experimental model. Output quality may include:

  • repetition
  • grammatical instability
  • factual hallucination
  • incomplete reasoning
  • degraded long-form coherence
  • unstable behavior with very high temperature
  • weak instruction following compared with instruction-tuned models

The model should be evaluated carefully before any downstream use.


Safety Notice

This model may generate incorrect, biased, unsafe, or misleading content.

Do not use it as the sole source of truth for high-stakes decisions.

Recommended mitigations:

  • use retrieval for factual tasks
  • apply output filtering
  • evaluate on task-specific benchmarks
  • use human review for sensitive outputs
  • avoid deployment without safety tuning

Research Notes

summerV2 is part of an experimental model-development line focused on fast training and inference for custom causal language models.

The current implementation emphasizes:

  • Hugging Face compatibility
  • direct model-code import fallback
  • KV-cache streaming decode
  • custom sampling controls
  • inference stability checks

Future work may include:

  • better pretraining data mixture
  • instruction tuning
  • DPO or preference optimization
  • stronger tokenizer/model alignment
  • long-context stability improvements
  • benchmark reporting
  • model card expansion with training details

Citation

If you use this model in experiments, cite the repository:

@misc{summerV2,
  title        = {summerMC/summerV2},
  author       = {summerMC},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}

Disclaimer

This repository contains an experimental research model.

No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.

Model size: 0.4B parameters (F32, safetensors).