---
license: apache-2.0
language:
- en
tags:
- causal-lm
- text-generation
- transformer
- custom-code
- kv-cache
- pytorch
pipeline_tag: text-generation
library_name: transformers
---

# summerMC/summerV2

`summerMC/summerV2` is an experimental causal language model based on a custom `VanFastForCausalLM` architecture.

This model was developed by a first-year vocational school student in Japan, age 18, as an independent research and engineering project.

The project focuses on building and testing a custom fast causal language model with:

- custom Hugging Face-compatible model code
- KV-cache enabled autoregressive inference
- streaming decode support
- anti-repetition sampling utilities
- NaN/Inf-guarded logits handling
- local `modeling_van_fast.py` loading support

The model is primarily intended for research and experimentation, not production deployment.

---

## Model Details

| Item | Value |
|---|---|
| Model name | `summerMC/summerV2` |
| Architecture | `VanFastForCausalLM` |
| Task | Causal language modeling |
| Framework | PyTorch / Hugging Face Transformers |
| Inference style | Autoregressive text generation |
| Cache support | KV-cache enabled |
| Primary language | English |
| Developer | First-year vocational school student, age 18 |
| Status | Experimental |

---

## Developer Note

This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project. The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding.

The project is experimental, but it is designed to be reproducible and inspectable by other researchers, students, and engineers.

---

## Intended Use

This model is intended for:

- language-model architecture research
- custom Transformer inference experiments
- KV-cache decoding tests
- sampling strategy experiments
- small-to-mid scale causal LM prototyping
- comparison against GPT-style baselines
- student-led AI research demonstrations

This model is not intended for:

- safety-critical use
- medical, legal, or financial advice
- autonomous decision-making
- deployment without additional evaluation
- factual answering without retrieval or verification

---

## Installation

```bash
pip install -U torch transformers accelerate safetensors
```

For GPU inference, install a CUDA-compatible PyTorch build.
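For example, a CUDA 12.1 wheel can be installed with the command below (current as of writing; check https://pytorch.org/get-started for the index URL matching your CUDA version):

```bash
# Example only: installs a cu121 build of PyTorch.
# Adjust the index URL to match your installed CUDA toolkit/driver.
pip install torch --index-url https://download.pytorch.org/whl/cu121
```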
---

## Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)
model.to(device)
model.eval()

# Fall back to EOS as the padding token if none is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
```

---

## Direct Local Import Inference

If remote-code loading causes cache or import issues, the model can be loaded by directly importing `modeling_van_fast.py`.

```python
import os
import sys
import json
import importlib.util

import torch
from transformers import AutoTokenizer

HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

# Import modeling_van_fast.py as a fresh module, replacing any cached copy.
module_name = "modeling_van_fast_runtime"
if module_name in sys.modules:
    del sys.modules[module_name]

spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)

cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False

config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()
```

---

## KV-cache Test

```python
import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    # Prefill: run the full prompt once and keep the cache.
    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )

    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)

    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")

    print("layers:", len(out.past_key_values))
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))

    # Decode step: feed only the next token plus the cached keys/values.
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )
    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
```
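---

## Streaming Decode Example

The feature list above mentions streaming decode support. The snippet below is a minimal, hedged sketch of one way to stream tokens: it assumes the `model` and `tokenizer` objects from Basic Usage are already loaded, and it uses the stock `TextIteratorStreamer` from `transformers` rather than any custom utility from this repository.

```python
from threading import Thread

from transformers import TextIteratorStreamer

# Yields decoded text chunks as tokens are generated.
streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
)

inputs = tokenizer(
    "Explain KV caching in simple terms.\n\nAnswer:",
    return_tensors="pt",
).to(model.device)

generation_kwargs = dict(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    streamer=streamer,
)

# generate() blocks until finished, so run it in a background thread
# and consume the stream on the main thread.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()
```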
---

## Recommended Sampling Settings

The following settings were used during local KV-cache inference testing:

```python
max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3
```

For more stable output, try:

```python
temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4
```

For more diverse output, try:

```python
temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
```

---

## Example Prompt

```text
Explain Transformer models in simple terms.

Answer:
```

---

## Current Limitations

This is an experimental model. Its output may exhibit:

- repetition
- grammatical instability
- factual hallucination
- incomplete reasoning
- degraded long-form coherence
- unstable behavior at very high temperatures
- weak instruction following compared with instruction-tuned models

The model should be evaluated carefully before any downstream use.

---

## Safety Notice

This model may generate incorrect, biased, unsafe, or misleading content. Do not use it as the sole source of truth for high-stakes decisions.

Recommended mitigations:

- use retrieval for factual tasks
- apply output filtering
- evaluate on task-specific benchmarks
- use human review for sensitive outputs
- avoid deployment without safety tuning

---

## Research Notes

`summerV2` is part of an experimental model-development line focused on fast training and inference for custom causal language models.

The current implementation emphasizes:

- Hugging Face compatibility
- direct model-code import fallback
- KV-cache streaming decode
- custom sampling controls
- inference stability checks

Future work may include:

- better pretraining data mixture
- instruction tuning
- DPO or preference optimization
- stronger tokenizer/model alignment
- long-context stability improvements
- benchmark reporting
- model card expansion with training details

---

## Citation

If you use this model in experiments, cite the repository:

```bibtex
@misc{summerV2,
  title        = {summerMC/summerV2},
  author       = {summerMC},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}
```

---

## Disclaimer

This repository contains an experimental research model. No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.
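---

## Appendix: NaN/Inf Logits Guard (Sketch)

The feature list mentions NaN/Inf-guarded logits handling. The repository's own implementation lives in `modeling_van_fast.py`; the snippet below is only an illustrative, generic sketch of the idea, not the actual code.

```python
import torch

def guarded_sample(logits: torch.Tensor, temperature: float = 0.85, top_k: int = 80) -> torch.Tensor:
    """Sample one token id per batch row from guarded logits.

    Illustrative only: replaces NaN/Inf values before softmax so a single
    bad logit cannot poison the sampling distribution.
    """
    # Map NaN to a large negative value and clamp +/-Inf to finite bounds.
    logits = torch.nan_to_num(logits, nan=-1e9, posinf=1e9, neginf=-1e9)
    logits = logits / max(temperature, 1e-5)

    # Simple top-k filter: mask everything below the k-th largest logit.
    if top_k is not None and top_k < logits.size(-1):
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```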