Tags: Text Generation · Transformers · Safetensors · GGUF · Korean · English · llama · 3b · korean · from-scratch · orpo · instruction-tuned · preference-aligned · fp8 · b200 · Eval Results (legacy) · text-generation-inference
Instructions to use pathcosmos/frankenstallm with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use pathcosmos/frankenstallm with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pathcosmos/frankenstallm")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pathcosmos/frankenstallm")
model = AutoModelForCausalLM.from_pretrained("pathcosmos/frankenstallm")

- llama-cpp-python
How to use pathcosmos/frankenstallm with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="pathcosmos/frankenstallm",
    filename="gguf/frankenstallm-3b-Q4_K_M.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
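The snippet above performs raw text completion. Since the tags list the model as instruction-tuned, a chat-style call may give better results. A minimal sketch reusing the llm object from above, assuming the GGUF file embeds a chat template (the prompt is only illustrative):

# Sketch: chat-style request via llama-cpp-python's OpenAI-like helper.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Introduce yourself in one sentence."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])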
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use pathcosmos/frankenstallm with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Use Docker
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use pathcosmos/frankenstallm with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "pathcosmos/frankenstallm"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
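The vLLM server above exposes an OpenAI-compatible API, so it can also be called from Python instead of curl. A minimal sketch, assuming the openai client package is installed (pip install openai) and the server from the pip instructions is listening on port 8000:

# Sketch: query the local OpenAI-compatible endpoint started by `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally
completion = client.completions.create(
    model="pathcosmos/frankenstallm",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)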
- SGLang
How to use pathcosmos/frankenstallm with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "pathcosmos/frankenstallm" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "pathcosmos/frankenstallm" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

- Ollama
How to use pathcosmos/frankenstallm with Ollama:
ollama run hf.co/pathcosmos/frankenstallm:Q4_K_M
- Unsloth Studio
How to use pathcosmos/frankenstallm with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for pathcosmos/frankenstallm to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for pathcosmos/frankenstallm to start chatting
Use Hugging Face Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pathcosmos/frankenstallm to start chatting
- Docker Model Runner
How to use pathcosmos/frankenstallm with Docker Model Runner:
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
- Lemonade
How to use pathcosmos/frankenstallm with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull pathcosmos/frankenstallm:Q4_K_M
Run and chat with the model
lemonade run user.frankenstallm-Q4_K_M
List all available models
lemonade list
| """ | |
| Dataset classes for LLM training. | |
| TextDataset: Sliding window (stride 1) over a memory-mapped uint16 binary file. | |
| PackedDataset: Non-overlapping windows (stride = seq_len) over the same file format. | |
| """ | |
| from __future__ import annotations | |
| from pathlib import Path | |
| from typing import Tuple, Union | |
| import numpy as np | |
| import torch | |
| from torch.utils.data import Dataset | |
class TextDataset(Dataset):
    """
    Sliding-window dataset over a memory-mapped numpy uint16 binary token file.

    Each sample is a (input_ids, targets) pair of length seq_len, where
    targets is input_ids shifted by one position. Windows overlap by
    (seq_len - 1) tokens, i.e. stride = 1.

    Args:
        data_path: Path to the .bin file produced by data/prepare.py.
        seq_len: Number of tokens per sample (context length).
    """

    def __init__(self, data_path: Union[str, Path], seq_len: int) -> None:
        super().__init__()
        self.seq_len = seq_len
        path = Path(data_path)
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {path}")
        # Memory-map for zero-copy random access.
        self.data: np.ndarray = np.memmap(path, dtype="uint16", mode="r")
        # Hint OS to preload entire file into page cache (2.2TB RAM available)
        import mmap as _mmap
        try:
            self.data._mmap.madvise(_mmap.MADV_SEQUENTIAL)
        except (AttributeError, OSError):
            pass  # madvise not available on all platforms
        if len(self.data) < seq_len + 1:
            raise ValueError(
                f"Data file has only {len(self.data)} tokens, "
                f"need at least {seq_len + 1}."
            )

    def __len__(self) -> int:
        # Each window needs seq_len tokens plus one extra for the target shift.
        return len(self.data) - self.seq_len

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Slice from the memmap (returns a uint16 numpy view).
        chunk = self.data[idx : idx + self.seq_len + 1]
        # Cast to int32 (not int64) to halve CPU worker memory usage:
        # uint16 (2 B) → int32 (4 B) instead of uint16 → int64 (8 B, 4× bloat).
        # int32 is sufficient for vocab_size=64000 (max token id 65535 fits in int32).
        # The int32→int64 (long) promotion happens on GPU inside _step(), for free.
        chunk = torch.from_numpy(chunk.astype(np.int32))
        input_ids = chunk[:-1]  # [seq_len]
        targets = chunk[1:]     # [seq_len]
        return input_ids, targets
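A quick illustration of the stride-1 overlap described in the docstring. This is a sketch only; the path data/train.bin is a placeholder for a file produced by data/prepare.py:

# Sketch: adjacent indices return windows shifted by exactly one token.
import torch

ds = TextDataset("data/train.bin", seq_len=8)  # placeholder path
x0, y0 = ds[0]
x1, y1 = ds[1]
assert torch.equal(y0, x1)           # targets of window i are the inputs of window i+1
assert torch.equal(x0[1:], x1[:-1])  # the two windows overlap by seq_len - 1 tokens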
class PackedDataset(Dataset):
    """
    Non-overlapping packed dataset over a memory-mapped uint16 binary token file.

    Intended for data that has already been packed (documents concatenated with
    EOS tokens). Windows do not overlap; stride = seq_len.

    The target sequence is shifted by one token relative to input_ids, so the
    last target of a window is normally the first token of the *next* window.
    Only for the final window, where no following token exists, is the last
    target position filled with -1 (the standard ``ignore_index`` for
    ``nn.CrossEntropyLoss``).

    Args:
        data_path: Path to the .bin file produced by data/prepare.py.
        seq_len: Number of tokens per sample (context length).
    """
    def __init__(self, data_path: Union[str, Path], seq_len: int) -> None:
        super().__init__()
        self.seq_len = seq_len
        path = Path(data_path)
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {path}")
        self.data: np.ndarray = np.memmap(path, dtype="uint16", mode="r")
        # Optimize mmap for shuffled random access pattern (DistributedSampler)
        import mmap as _mmap
        try:
            self.data._mmap.madvise(_mmap.MADV_RANDOM)    # disable kernel read-ahead (random access)
            self.data._mmap.madvise(_mmap.MADV_WILLNEED)  # async prefault into page cache
        except (AttributeError, OSError):
            pass
        if len(self.data) < seq_len:
            raise ValueError(
                f"Data file has only {len(self.data)} tokens, "
                f"need at least {seq_len}."
            )

    def __len__(self) -> int:
        return len(self.data) // self.seq_len

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        start = idx * self.seq_len
        end = start + self.seq_len
        # Cast to int32 (not int64) to halve CPU worker memory usage.
        # int32 is sufficient for vocab_size=64000; int32→long promotion on GPU.
        input_ids = torch.from_numpy(
            self.data[start:end].astype(np.int32)
        )  # [seq_len]
        # Targets are shifted by one. If end < len(data) we can read the
        # extra token normally; otherwise pad the last position with -1.
        if end < len(self.data):
            targets = torch.from_numpy(
                self.data[start + 1 : end + 1].astype(np.int32)
            )  # [seq_len]
        else:
            # Last window: all but the final position can be computed.
            # Use int32 for the filled portion; -1 fits in int32.
            targets = torch.full((self.seq_len,), fill_value=-1, dtype=torch.int32)
            if end - start - 1 > 0:
                targets[: self.seq_len - 1] = torch.from_numpy(
                    self.data[start + 1 : end].astype(np.int32)
                )
        return input_ids, targets
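A minimal sketch of how these batches are typically consumed, under stated assumptions: a token file exists at data/train.bin (placeholder path, as produced by data/prepare.py), and the toy Embedding + Linear model merely stands in for the real LLM. The ignore_index=-1 matches the pad value PackedDataset emits for its final window.

# Sketch: one training step over PackedDataset batches (toy model, placeholder path).
import torch
from torch import nn
from torch.utils.data import DataLoader

dataset = PackedDataset("data/train.bin", seq_len=512)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

vocab_size = 64000  # matches the vocab size cited in the dtype comments above
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 positions contribute no loss

for input_ids, targets in loader:
    # Promote the int32 batches to int64 (long) here; the real trainer does this
    # on the GPU inside _step(), per the comments above.
    logits = model(input_ids.long())  # [batch, seq_len, vocab_size]
    loss = criterion(logits.view(-1, vocab_size), targets.long().view(-1))
    loss.backward()
    break  # a single step is enough for illustration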