---
license: apache-2.0
language:
- en
tags:
- causal-lm
- text-generation
- transformer
- custom-code
- kv-cache
- pytorch
pipeline_tag: text-generation
library_name: transformers
---
# summerMC/summerV2
`summerMC/summerV2` is an experimental causal language model based on a custom `VanFastForCausalLM` architecture.
It was developed by an 18-year-old first-year vocational school student in Japan as an independent research and engineering project.
The project focuses on building and testing a fast, custom causal language model with:
- custom Hugging Face-compatible model code
- KV-cache enabled autoregressive inference
- streaming decode support
- anti-repetition sampling utilities
- NaN/Inf guarded logits handling (a generic sketch follows below)
- local `modeling_van_fast.py` loading support
The model is primarily intended for research and experimentation, not production deployment.
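The actual guard lives in the custom model code; as a minimal generic sketch of the idea (not the repository's implementation), a `LogitsProcessor` can sanitize invalid logits before sampling. Stock `transformers` offers the same behavior via `remove_invalid_values=True` in `generate`.
```python
import torch
from transformers import LogitsProcessor

class NanInfGuard(LogitsProcessor):
    """Replace NaN/Inf logits so softmax and sampling stay finite."""

    def __call__(self, input_ids, scores):
        # NaN -> large negative (never sampled), +/-Inf clamped to finite values.
        return torch.nan_to_num(scores, nan=-1e9, posinf=1e9, neginf=-1e9)

# Built-in equivalent: model.generate(..., remove_invalid_values=True)
```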
---
## Model Details
| Item | Value |
|---|---|
| Model name | `summerMC/summerV2` |
| Architecture | `VanFastForCausalLM` |
| Task | Causal language modeling |
| Framework | PyTorch / Hugging Face Transformers |
| Inference style | Autoregressive text generation |
| Cache support | KV-cache enabled |
| Primary language | English |
| Developer | First-year vocational school student, age 18 |
| Status | Experimental |
---
## Developer Note
This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project.
The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding. The project is experimental, but it is designed to be reproducible and inspectable for other researchers, students, and engineers.
---
## Intended Use
This model is intended for:
- language-model architecture research
- custom Transformer inference experiments
- KV-cache decoding tests
- sampling strategy experiments
- small-to-mid scale causal LM prototyping
- comparison against GPT-style baselines
- student-led AI research demonstrations
This model is not intended for:
- safety-critical use
- medical, legal, or financial advice
- autonomous decision-making
- deployment without additional evaluation
- factual answering without retrieval or verification
---
## Installation
```bash
pip install -U torch transformers accelerate safetensors
```
For GPU inference, install a CUDA-compatible PyTorch build.
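To confirm that the installed build actually sees the GPU:
```python
import torch

print(torch.__version__, torch.cuda.is_available())
```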
---
## Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

# trust_remote_code is required because the architecture is custom.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)
model.to(device)
model.eval()

# If the tokenizer ships without a pad token, reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
```
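The README lists streaming decode support; independent of the model's own utilities, the generic `TextStreamer` from `transformers` prints tokens as they are produced:
```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated,
# skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        streamer=streamer,
        pad_token_id=tokenizer.pad_token_id,
    )
```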
---
## Direct Local Import Inference
If remote-code loading causes cache or import issues, the model can be loaded by directly importing `modeling_van_fast.py`. Adjust `HF_OUT_DIR` below to the local directory containing the repository files; the path shown is a Colab example.
```python
import os
import sys
import json
import importlib.util
import torch
from transformers import AutoTokenizer

# Directory holding modeling_van_fast.py, config.json, tokenizer
# files, and weights (Colab example path; adjust as needed).
HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

# Import modeling_van_fast.py as a fresh module, bypassing any
# stale copy cached by a previous trust_remote_code load.
module_name = "modeling_van_fast_runtime"
if module_name in sys.modules:
    del sys.modules[module_name]
spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

# Build the config from config.json, forcing the KV cache on and
# keeping input/output embeddings untied.
with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)
cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False
config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()
```
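Once loaded this way, generation works the same as in the Basic Usage section, assuming `VanFastForCausalLM` exposes the standard `generate` API:
```python
prompt = "Explain Transformer models in simple terms.\n\nAnswer:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(DEVICE)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```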
---
## KV-cache Test
```python
import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    # Prefill: run the full prompt once and keep the cache.
    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )
    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)
    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")
    print("layers:", len(out.past_key_values))
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))

    # Decode step: feed only the next token plus the cache;
    # the cached key/value length should grow by one.
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )
    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
```
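To see what the cache buys in practice, here is a rough timing sketch (greedy decode; assumes the custom forward also honors `use_cache=False`, and note the first CUDA call includes warm-up overhead):
```python
import time

@torch.inference_mode()
def time_generate(use_cache, n_new=64):
    # Greedy decode of n_new tokens, timed end to end.
    ids = tokenizer(
        "Hello world",
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)
    start = time.perf_counter()
    model.generate(
        ids,
        max_new_tokens=n_new,
        do_sample=False,
        use_cache=use_cache,
        pad_token_id=tokenizer.pad_token_id,
    )
    return time.perf_counter() - start

time_generate(True)  # warm-up run, result discarded
print("with cache:   ", time_generate(True))
print("without cache:", time_generate(False))
```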
---
## Recommended Sampling Settings
The following settings were used during local KV-cache inference testing:
```python
max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3
```
For more stable output, try:
```python
temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4
```
For more diverse output, try:
```python
temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
```
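These are ordinary keyword arguments to `generate`; applying the stable profile, for example:
```python
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.4,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```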
---
## Example Prompt
```text
Explain Transformer models in simple terms.

Answer:
```
---
## Current Limitations
This is an experimental model. Outputs may exhibit:
- repetition
- grammatical instability
- factual hallucination
- incomplete reasoning
- degraded long-form coherence
- unstable behavior with very high temperature
- weak instruction following compared with instruction-tuned models
The model should be evaluated carefully before any downstream use.
---
## Safety Notice
This model may generate incorrect, biased, unsafe, or misleading content.
Do not use it as the sole source of truth for high-stakes decisions.
Recommended mitigations:
- use retrieval for factual tasks
- apply output filtering (one narrow example follows this list)
- evaluate on task-specific benchmarks
- use human review for sensitive outputs
- avoid deployment without safety tuning
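As one narrow example of output filtering, `generate` accepts `bad_words_ids` to block specific token sequences (a blunt instrument; real deployments need broader content filtering):
```python
# Hypothetical blocklist; substitute terms relevant to your use case.
blocked_phrases = ["example banned phrase"]
bad_words_ids = [
    tokenizer(phrase, add_special_tokens=False).input_ids
    for phrase in blocked_phrases
]

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    bad_words_ids=bad_words_ids,
    pad_token_id=tokenizer.pad_token_id,
)
```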
---
## Research Notes
`summerV2` is part of an experimental model-development line focused on fast training and inference for custom causal language models.
The current implementation emphasizes:
- Hugging Face compatibility
- direct model-code import fallback
- KV-cache streaming decode
- custom sampling controls
- inference stability checks
Future work may include:
- better pretraining data mixture
- instruction tuning
- DPO or preference optimization
- stronger tokenizer/model alignment
- long-context stability improvements
- benchmark reporting
- model card expansion with training details
---
## Citation
If you use this model in experiments, cite the repository:
```bibtex
@misc{summerV2,
title = {summerMC/summerV2},
author = {summerMC},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}
```
---
## Disclaimer
This repository contains an experimental research model.
No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.