---
license: apache-2.0
language:
- en
tags:
- causal-lm
- text-generation
- transformer
- custom-code
- kv-cache
- pytorch
pipeline_tag: text-generation
library_name: transformers
---
# summerMC/summerV2

`summerMC/summerV2` is an experimental causal language model based on a custom `VanFastForCausalLM` architecture.

This model was developed by a first-year vocational school student in Japan, age 18, as an independent research and engineering project.

The project focuses on building and testing a custom fast causal language model with:

- custom Hugging Face-compatible model code
- KV-cache enabled autoregressive inference
- streaming decode support
- anti-repetition sampling utilities
- NaN/Inf guarded logits handling (sketched below)
- local `modeling_van_fast.py` loading support

The model is primarily intended for research and experimentation, not production deployment.
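As an illustration of the NaN/Inf guard idea, a logits sanitizer of roughly this shape can be expressed with the standard `transformers` `LogitsProcessor` interface. This is a hypothetical sketch, not the guard actually implemented inside `VanFastForCausalLM`:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class NanInfGuard(LogitsProcessor):
    """Illustrative sketch only: sanitize non-finite logits before sampling."""

    def __call__(self, input_ids, scores):
        # NaN -> very low score (effectively never sampled);
        # +/-inf -> large finite values so softmax stays well-defined.
        return torch.nan_to_num(scores, nan=-1e9, posinf=1e9, neginf=-1e9)

# Usage with any Hugging Face causal LM:
# model.generate(..., logits_processor=LogitsProcessorList([NanInfGuard()]))
```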
---

## Model Details

| Item | Value |
|---|---|
| Model name | `summerMC/summerV2` |
| Architecture | `VanFastForCausalLM` |
| Task | Causal language modeling |
| Framework | PyTorch / Hugging Face Transformers |
| Inference style | Autoregressive text generation |
| Cache support | KV-cache enabled |
| Primary language | English |
| Developer | First-year vocational school student, age 18 |
| Status | Experimental |

---
## Developer Note

This model was developed by an 18-year-old first-year vocational school student as part of an independent AI research project.

The goal is to explore practical custom language-model architecture design, Hugging Face compatibility, fast inference, and KV-cache decoding. The project is experimental, but it is designed to be reproducible and inspectable for other researchers, students, and engineers.

---

## Intended Use

This model is intended for:

- language-model architecture research
- custom Transformer inference experiments
- KV-cache decoding tests
- sampling strategy experiments
- small-to-mid scale causal LM prototyping
- comparison against GPT-style baselines
- student-led AI research demonstrations

This model is not intended for:

- safety-critical use
- medical, legal, or financial advice
- autonomous decision-making
- deployment without additional evaluation
- factual answering without retrieval or verification

---
## Installation

```bash
pip install -U torch transformers accelerate safetensors
```

For GPU inference, install a CUDA-compatible PyTorch build.
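For example, the CUDA 12.1 wheels can be installed from the official PyTorch index (adjust the `cu121` tag to your local CUDA version; see pytorch.org for the current compatibility matrix):

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```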
---

## Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/summerV2"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
)
model.to(device)
model.eval()

# The tokenizer may ship without a pad token; reuse EOS so generate() can pad.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain Transformer models in simple terms.\n\nAnswer:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        top_k=80,
        top_p=0.92,
        repetition_penalty=1.25,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
```
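Since the card lists streaming decode support, tokens can also be printed as they are generated. A minimal sketch using the stock `transformers.TextIteratorStreamer` utility (this assumes the `model`, `tokenizer`, and `device` from above; the streamer is standard `transformers`, not part of this repository):

```python
from threading import Thread

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
)
inputs = tokenizer(
    "Explain KV caching in one paragraph.\n\nAnswer:",
    return_tensors="pt",
).to(device)

# generate() blocks until decoding finishes, so run it on a worker thread
# and consume decoded text pieces from the streamer as they arrive.
thread = Thread(
    target=model.generate,
    kwargs=dict(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.85,
        streamer=streamer,
        pad_token_id=tokenizer.pad_token_id,
    ),
)
thread.start()
for piece in streamer:
    print(piece, end="", flush=True)
thread.join()
```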
---

## Direct Local Import Inference

If remote-code loading causes cache or import issues, the model can be loaded by directly importing `modeling_van_fast.py`.

```python
import os
import sys
import json
import importlib.util

import torch
from transformers import AutoTokenizer

HF_OUT_DIR = "/content/van_fast_transformer/hf_compatible"
MODELING_PATH = os.path.join(HF_OUT_DIR, "modeling_van_fast.py")
CONFIG_PATH = os.path.join(HF_OUT_DIR, "config.json")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float32

# Import modeling_van_fast.py as a fresh module from an explicit file path,
# bypassing the Hugging Face remote-code cache entirely.
module_name = "modeling_van_fast_runtime"
if module_name in sys.modules:
    del sys.modules[module_name]
spec = importlib.util.spec_from_file_location(module_name, MODELING_PATH)
mod = importlib.util.module_from_spec(spec)
sys.modules[module_name] = mod
spec.loader.exec_module(mod)

VanFastConfig = mod.VanFastConfig
VanFastForCausalLM = mod.VanFastForCausalLM

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_json = json.load(f)
cfg_json["use_cache"] = True
cfg_json["tie_word_embeddings"] = False

config = VanFastConfig(**cfg_json)
config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    HF_OUT_DIR,
    use_fast=True,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = VanFastForCausalLM.from_pretrained(
    HF_OUT_DIR,
    config=config,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()
```
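Once loaded this way, the model behaves like any Hugging Face causal LM. A quick smoke test (illustrative prompt, using the `model`, `tokenizer`, and `DEVICE` defined above):

```python
ids = tokenizer("Hello", return_tensors="pt").input_ids.to(DEVICE)
with torch.inference_mode():
    out = model.generate(ids, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```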
---

## KV-cache Test

```python
import torch

@torch.inference_mode()
def test_kv_cache(prompt="Hello world"):
    # Prefill: run the full prompt once and keep the returned cache.
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)
    out = model(
        input_ids=input_ids,
        use_cache=True,
        return_dict=True,
    )
    print("input shape:", tuple(input_ids.shape))
    print("logits:", tuple(out.logits.shape))
    print("past_key_values is None:", out.past_key_values is None)
    if out.past_key_values is None:
        raise RuntimeError("KV cache is inactive.")
    print("layers:", len(out.past_key_values))
    k0, v0 = out.past_key_values[0]
    print("layer0 k:", tuple(k0.shape))
    print("layer0 v:", tuple(v0.shape))
    # Decode step: feed only the next token plus the cached keys/values.
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
    out2 = model(
        input_ids=next_id,
        past_key_values=out.past_key_values,
        use_cache=True,
        return_dict=True,
    )
    k1, v1 = out2.past_key_values[0]
    print("after decode layer0 k:", tuple(k1.shape))
    print("after decode layer0 v:", tuple(v1.shape))
    print("KV cache OK")

test_kv_cache()
```
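To see why the cache matters, the following rough timing sketch contrasts full-recompute greedy decoding with cached decoding (same `model`/`tokenizer` as above; numbers are illustrative only, not a benchmark):

```python
import time

import torch

@torch.inference_mode()
def time_greedy_decode(use_cache, steps=32, prompt="Hello world"):
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past = None
    t0 = time.perf_counter()
    for _ in range(steps):
        if use_cache and past is not None:
            # Cached path: feed only the newest token.
            out = model(input_ids=ids[:, -1:], past_key_values=past, use_cache=True)
        else:
            # Uncached path: recompute the whole sequence every step.
            out = model(input_ids=ids, use_cache=use_cache)
        past = out.past_key_values if use_cache else None
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return time.perf_counter() - t0

print("no cache :", time_greedy_decode(False))
print("with cache:", time_greedy_decode(True))
```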
---

## Recommended Sampling Settings

The following settings were used during local KV-cache inference testing:

```python
max_new_tokens = 160
temperature = 0.85
top_k = 80
top_p = 0.92
repetition_penalty = 1.35
no_repeat_ngram_size = 3
```

For more stable output, try:

```python
temperature = 0.7
top_k = 50
top_p = 0.9
repetition_penalty = 1.4
```

For more diverse output, try:

```python
temperature = 1.0
top_k = 100
top_p = 0.95
repetition_penalty = 1.2
```
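These settings map one-to-one onto `model.generate` keyword arguments. For example, the test configuration above can be applied like this (assuming `inputs` from Basic Usage):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=160,
    do_sample=True,          # sampling must be on for temperature/top_k/top_p
    temperature=0.85,
    top_k=80,
    top_p=0.92,
    repetition_penalty=1.35,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
)
```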
---

## Example Prompt

```text
Explain Transformer models in simple terms.

Answer:
```

---

## Current Limitations

This is an experimental model. Output quality may include:

- repetition
- grammatical instability
- factual hallucination
- incomplete reasoning
- degraded long-form coherence
- unstable behavior with very high temperature
- weak instruction following compared with instruction-tuned models

The model should be evaluated carefully before any downstream use.
---

## Safety Notice

This model may generate incorrect, biased, unsafe, or misleading content.

Do not use it as the sole source of truth for high-stakes decisions.

Recommended mitigations:

- use retrieval for factual tasks
- apply output filtering
- evaluate on task-specific benchmarks
- use human review for sensitive outputs
- avoid deployment without safety tuning

---

## Research Notes

`summerV2` is part of an experimental model-development line focused on fast training and inference for custom causal language models.

The current implementation emphasizes:

- Hugging Face compatibility
- direct model-code import fallback
- KV-cache streaming decode
- custom sampling controls
- inference stability checks

Future work may include:

- better pretraining data mixture
- instruction tuning
- DPO or preference optimization
- stronger tokenizer/model alignment
- long-context stability improvements
- benchmark reporting
- model card expansion with training details
---

## Citation

If you use this model in experiments, cite the repository:

```bibtex
@misc{summerV2,
  title        = {summerMC/summerV2},
  author       = {summerMC},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/summerMC/summerV2}}
}
```

---

## Disclaimer

This repository contains an experimental research model.

No warranty is provided regarding factuality, safety, performance, or fitness for a particular use case.