# summerMC/summerV2

summerMC/summerV2 is an experimental causal language model based on a custom `VanFastForCausalLM` architecture.
This model was developed by a first-year vocational school student in Japan, age 18, as an independent research and engineering project.
The project focuses on building and testing a custom fast causal language model with:
- custom Hugging Face-compatible model code
- KV-cache enabled autoregressive inference
- streaming decode support
- anti-repetition sampling utilities
- NaN/Inf guarded logits handling
- local `modeling_van_fast.py` loading support
The model is primarily intended for research and experimentation, not production deployment.
## Model Details
| Item | Value |
|---|---|
| Model name | summerMC/summerV2 |
| Architecture | VanFastForCausalLM |
| Task | Causal language modeling |
| Framework | PyTorch / Hugging Face Transformers |
| Inference style | Autoregressive text generation |
| Cache support | KV-cache enabled |
| Primary language | English |
| Developer | First-year vocational school student, age 18 |
| Status | Experimental |
## Developer Note
This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project.
The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding. The project is experimental, but it is designed to be reproducible and inspectable for other researchers, students, and engineers.
## Intended Use
This model is intended for:
- language-model architecture research
- custom Transformer inference experiments
- KV-cache decoding tests
- sampling strategy experiments
- small-to-mid scale causal LM prototyping
- comparison against GPT-style baselines
- student-led AI research demonstrations
This model is not intended for:
- safety-critical use
- medical, legal, or financial advice
- autonomous decision-making
- deployment without additional evaluation
- factual answering without retrieval or verification
## Installation

```bash
pip install -U torch transformers accelerate safetensors
```
For GPU inference, install a CUDA-compatible PyTorch build.
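For example, a CUDA 12.1 build can be installed from the official PyTorch wheel index (adjust the `cu121` tag to match your driver and CUDA version):

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```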
## Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)
model.to(device)
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
```
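Since the model advertises streaming decode support, a minimal streaming sketch using the standard `TextIteratorStreamer` from Transformers is shown below. It reuses `model`, `tokenizer`, and `inputs` from the snippet above; the sampling values are illustrative, not prescribed by this repository.

```python
from threading import Thread
from transformers import TextIteratorStreamer

# The streamer yields decoded text pieces as new tokens are generated.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    top_p=0.92,
    streamer=streamer,
    pad_token_id=tokenizer.pad_token_id,
)

# Run generation in a background thread and print pieces as they arrive.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for piece in streamer:
    print(piece, end="", flush=True)
thread.join()
print()
```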
## Direct Local Import Inference

If remote-code loading causes cache or import issues, the model can be loaded by directly importing `modeling_van_fast.py`.
```python
import os
import sys
import json
import importlib.util

import torch
from transformers import AutoTokenizer

HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

# Import modeling_van_fast.py as a fresh module to avoid stale cached imports.
module_name = "modeling_van_fast_runtime"
if module_name in sys.modules:
    del sys.modules[module_name]
spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

# Build the config from config.json and force KV-cache usage.
with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)
cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False

config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()
```
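With the model loaded this way, generation works the same as it does through `AutoModelForCausalLM`. A minimal sketch, reusing `model`, `tokenizer`, and `DEVICE` from above and the same illustrative sampling values as the basic-usage snippet:

```python
inputs = tokenizer(
    "Explain Transformer models in simple terms.\n\nAnswer:",
    return_tensors="pt",
    add_special_tokens=False,
).to(DEVICE)

with torch.inference_mode():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```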
## KV-cache Test
```python
import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    # Prefill: run the full prompt once and keep the returned KV cache.
    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )
    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)
    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")

    print("layers:", len(out.past_key_values))
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))

    # Decode step: feed only the next token together with the cached keys/values.
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )
    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
```
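To see the practical effect of the cache, a rough single-run timing comparison can be added after the test. This is a sketch assuming `model.generate` accepts the standard `use_cache` argument; timings will vary by hardware and sequence length.

```python
import time

def time_generate(use_cache: bool, n_tokens: int = 64) -> float:
    """Greedy-decode n_tokens and return elapsed seconds (rough, single run)."""
    ids = tokenizer(
        "Hello world", return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(
            ids,
            max_new_tokens=n_tokens,
            do_sample=False,
            use_cache=use_cache,
            pad_token_id=tokenizer.pad_token_id,
        )
    return time.perf_counter() - start

print("with KV cache:   ", round(time_generate(True), 3), "s")
print("without KV cache:", round(time_generate(False), 3), "s")
```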
## Recommended Sampling Settings

The following settings were used during local KV-cache inference testing:

```python
max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3
```

For more stable output, try:

```python
temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4
```

For more diverse output, try:

```python
temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
```
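These presets map directly onto `transformers.GenerationConfig` (or onto `model.generate(...)` keyword arguments). A sketch for the "stable" preset, reusing `model`, `tokenizer`, and `inputs` from the usage snippets above:

```python
from transformers import GenerationConfig

# "Stable" preset from the list above; adjust to taste.
stable_config = GenerationConfig(
    max_new_tokens=160,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.4,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

with torch.inference_mode():
    out_ids = model.generate(**inputs, generation_config=stable_config)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```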
## Example Prompt

```text
Explain Transformer models in simple terms.

Answer:
```
## Current Limitations
This is an experimental model. Output quality may include:
- repetition
- grammatical instability
- factual hallucination
- incomplete reasoning
- degraded long-form coherence
- unstable behavior with very high temperature
- weak instruction following compared with instruction-tuned models
The model should be evaluated carefully before any downstream use.
## Safety Notice
This model may generate incorrect, biased, unsafe, or misleading content.
Do not use it as the sole source of truth for high-stakes decisions.
Recommended mitigations:
- use retrieval for factual tasks
- apply output filtering
- evaluate on task-specific benchmarks
- use human review for sensitive outputs
- avoid deployment without safety tuning
## Research Notes
summerV2 is part of an experimental model-development line focused on fast training and inference for custom causal language models.
The current implementation emphasizes:
- Hugging Face compatibility
- direct model-code import fallback
- KV-cache streaming decode
- custom sampling controls
- inference stability checks (a NaN/Inf logit-guard sketch follows this list)
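The NaN/Inf logit guard itself lives inside `modeling_van_fast.py`. As a point of comparison only, an equivalent guard can be applied at generation time with a stock Transformers logits processor; the class below is a hypothetical illustration, not the model's internal implementation, and reuses `model`, `tokenizer`, and `inputs` from the usage snippets.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class NanInfGuard(LogitsProcessor):
    """Replace NaN/Inf logits so sampling never sees invalid values."""
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        return torch.nan_to_num(scores, nan=0.0, posinf=1e4, neginf=-1e4)

with torch.inference_mode():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        logits_processor=LogitsProcessorList([NanInfGuard()]),
        # Alternatively, the built-in flag below enables the equivalent guard:
        # remove_invalid_values=True,
        pad_token_id=tokenizer.pad_token_id,
    )
```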
Future work may include:
- better pretraining data mixture
- instruction tuning
- DPO or preference optimization
- stronger tokenizer/model alignment
- long-context stability improvements
- benchmark reporting
- model card expansion with training details
## Citation

If you use this model in experiments, cite the repository:

```bibtex
@misc{summerV2,
  title        = {summerMC/summerV2},
  author       = {summerMC},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}
```
## Disclaimer
This repository contains an experimental research model.
No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.