---
language: ko
tags:
- causal-lm
- byteetm
license: mit
datasets:
- roneneldan/TinyStories
- HAERAE-HUB/KOREAN-WEBTEXT
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.8
    top_k: 200
inference_providers:
- cpu
- gpu
- t4
- a10g
library_name: transformers
widget:
- text: "오늘은 날씨가"
---
# ByteETM-Korean
A small byte-level text decoder LM.
- 133 MB byte-level causal LM trained on Korean web text.
- Training data: roneneldan/TinyStories, HAERAE-HUB/KOREAN-WEBTEXT (partial)
- Final validation perplexity on the HAERAE-HUB/KOREAN-WEBTEXT dataset: ≈ 3.4
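
Because the model is byte-level, one token presumably corresponds to one UTF-8 byte rather than one character. The sketch below is a plain-Python illustration (it does not load the model) of what that implies for Korean text: a Hangul syllable occupies three UTF-8 bytes, so a `max_new_tokens` budget covers roughly a third as many syllables. All helper names here are hypothetical, and the perplexity conversions assume the reported validation perplexity is measured per byte, which the card does not state explicitly.

```python
import math

def utf8_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) of text, assumed to be the token ids."""
    return list(text.encode("utf-8"))

def max_hangul_syllables(max_new_tokens: int) -> int:
    """Upper bound on whole Hangul syllables in a byte budget (3 bytes each)."""
    return max_new_tokens // 3

def bits_per_byte(byte_ppl: float) -> float:
    """Convert a per-byte perplexity (assumption) to bits per byte."""
    return math.log2(byte_ppl)

def per_syllable_perplexity(byte_ppl: float, bytes_per_syllable: int = 3) -> float:
    """Perplexity per Hangul syllable implied by a per-byte perplexity."""
    return byte_ppl ** bytes_per_syllable

prompt = "오늘은 날씨가"
ids = utf8_byte_ids(prompt)

# Six Hangul syllables at 3 bytes each plus one ASCII space at 1 byte.
print(f"characters: {len(prompt)}, byte tokens: {len(ids)}")
print(f"syllables in a 100-token budget: at most {max_hangul_syllables(100)}")
print(f"bits per byte at ppl 3.4: {bits_per_byte(3.4):.2f}")
print(f"implied per-syllable ppl: {per_syllable_perplexity(3.4):.1f}")
```

Under these assumptions, the widget setting `max_new_tokens: 100` yields at most 33 Hangul syllables, so larger budgets than one would use with a subword tokenizer may be needed for comparable output length.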

## Example
```python
# %% ByteETM inference (byte-based)
import torch
from transformers import AutoModelForCausalLM

# 1️⃣ Load the model
repo_id = "idah4/byteetm-korean-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,  # allows the repo's custom model code to run
).to(device).eval()

# 2️⃣ Byte-level encoder / decoder
def encode_bytes(text: str) -> torch.Tensor:
    # Each UTF-8 byte (0-255) becomes one token id; shape is (1, num_bytes).
    return torch.tensor([list(text.encode("utf-8"))], dtype=torch.long, device=device)

def decode_bytes(ids: torch.Tensor) -> str:
    # Drop ids outside the byte range, then skip any incomplete UTF-8 sequences.
    seq = [i for i in ids.tolist() if 0 <= i < 256]
    return bytes(seq).decode("utf-8", errors="ignore")

# 3️⃣ Text generation function
@torch.no_grad()
def generate_text(prompt: str, max_new_tokens=200, temperature=0.8, top_k=200):
    input_ids = encode_bytes(prompt)
    # Note: with the standard transformers generate(), temperature and top_k
    # only take effect when do_sample=True is also passed.
    out = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_k=top_k,
    )
    return decode_bytes(out[0])

# 4️⃣ Demo
prompt = "오늘은 날씨가 좋아서"
print(generate_text(prompt, max_new_tokens=150, temperature=0.9, top_k=150))
```