---
language:
- ko
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B
pipeline_tag: text-generation
library_name: transformers
license: other
new_version: haebo/meow-clovax-v3
datasets:
- haebo/meow-v1-dataset
---

# 🐾 meow-clovax-v1

> **meow-clovax-v1 is a Korean LLM that naturally rewrites sentences according to a target emotion (`emotion`) and animal persona (`post_type`).**

- nick_name: haebo/Meow-HyperCLOVAX-1.5B_FullFT_fp32_0615i
- λ³Έ λͺ¨λΈμ€ `naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B`λ₯Ό 기반으둜 Supervised Finetuning(SFT) λ°©μ‹μœΌλ‘œ ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€. --- ## 🧠 Model Details | ν•­λͺ© | μ„€λͺ… | |------|------| | **Base Model** | HyperCLOVAX-SEED-Text-Instruct-1.5B | | **Fine-tuning Method** | Supervised Finetuning (SFT) | | **Model Type** | Decoder-only | | **Language** | Korean (primary) | | **Parameters** | 1.5B | | **Precision** | fp16 / fp32 | | **Version** | v1 | | **Framework** | Transformers | | **license** | hyperclovax-seed | --- ## πŸ“¦ Training Details - **Dataset**: 감정 및 동물 λ§νˆ¬μ— 따라 μˆ˜μ§‘Β·ν•©μ„±λœ style transfer 데이터셋 (λΉ„κ³΅κ°œ) - 각 μƒ˜ν”Œμ€ `content`, `emotion`, `post_type`, `transformed_content` ν•„λ“œλ‘œ κ΅¬μ„±λœ jsonl 데이터셋 - **Task**: Instruct-style fine-tuning (prompt β†’ transformed response) - **Prompt ꡬ쑰**: - instruction:"λ‹€μŒ λ¬Έμž₯을 [동물]의 [감정]ν•œ 말투둜 λ°”κΏ”μ€˜.\nInput: ...\nOutput:" - **Epochs**: 3 - **Training Infrastructure**: Google Colab Pro+ (A100) - **Instruction Infrastructure**: Google Colab Pro+ (T4) / GCP T4 --- ## πŸ’‘ Intended Use - 감정 및 동물 말투 μŠ€νƒ€μΌ λ³€ν™˜ - 캐릭터 챗봇, 감정 ν‘œν˜„ 챗봇 λ“± ## ⚠️ Limitations & Bias - 감정 및 동물 μœ ν˜•μ— 따라 λ³€ν™˜μ΄ λΆ€μžμ—°μŠ€λŸ¬μšΈ 수 있음 - λ°μ΄ν„°μ…‹μ˜ ν•œκ³„λ‘œ νŠΉμ • 감정/동물 μœ ν˜•μ— 편ν–₯이 μžˆμ„ 수 있음 - λΆ€μ μ ˆν•œ μž…λ ₯에 λŒ€ν•΄ μ˜ˆμƒμΉ˜ λͺ»ν•œ 좜λ ₯을 생성할 수 있음 - 좜λ ₯에 λΆ€μ μ ˆν•œ μš”μ†Œκ°€ 많이 ν¬ν•¨λ˜μ–΄ μžˆμ–΄ ν›„μ²˜λ¦¬λ₯Ό μ§„ν–‰ν•œ κ²°κ³Όλ₯Ό μ‚¬μš© ## πŸš€ How to Use ```python from transformers import AutoTokenizer, AutoModelForCausalLM model_id = "haebo/meow-clovax-v1" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id) content = "μ§œμ¦λ‚¬κ² λ„€ λ‚˜λ„ μ•„μΉ¨λ§ˆλ‹€ μ§œμ¦λ‚¨" emotion = "angry" post_type = "cat" instruction = f"λ‹€μŒ λ¬Έμž₯을 {post_type}의 {emotion}ν•œ 말투둜 λ°”κΏ”μ€˜." prompt = ( f"### Instruction:\n{example['instruction']}\n" f"### Input:\n{example['input']}\n" f"### Output:\n{example['output']}" ) inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=400) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## πŸ—‚οΈ Dataset > v1 λͺ¨λΈμ—λŠ” μ•„λž˜μ™€ 같은 데이터셋이 μ‚¬μš©λ˜μ—ˆμŠ΅λ‹ˆλ‹€.이 데이터듀은 λ³„λ„μ˜ μ „μ²˜λ¦¬(ν΄λžœμ§•/필터링) 없이 원본 κ·ΈλŒ€λ‘œ ν™œμš©λ˜μ—ˆμŠ΅λ‹ˆλ‹€.
> νŒŒμΈνŠœλ‹ μ‹œ ν”„λ‘¬ν”„νŠΈ ꡬ쑰에 맞게 λ³€κ²½λ˜μ—ˆμŠ΅λ‹ˆλ‹€. - **데이터 ꡬ쑰** 각 μƒ˜ν”Œμ€ μ•„λž˜μ™€ 같은 ν•„λ“œλ‘œ κ΅¬μ„±λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. - `content`: 원본 λ¬Έμž₯ (일상 ν•œκ΅­μ–΄) - `emotion`: 감정 λ ˆμ΄λΈ” (예: happy, sad, angry λ“±) - `post_type`: 동물 μœ ν˜• (예: cat, dog) - `transformed_content`: 감정 및 동물 말투둜 λ³€ν™˜λœ λ¬Έμž₯ - **μ˜ˆμ‹œ** ```json { "content": "였늘 점심 뭐 λ¨Ήμ§€.", "emotion": "normal", "post_type": dog", "transformed_content": "였늘 점심 뭐 먹지멍? 🐾 λ§›μžˆλŠ” λƒ„μƒˆκ°€ λ‚˜λŠ” 것 같닀멍! μ£ΌμΈλ‹˜, μ € λ°₯ μ–΄λ”¨λƒμ™ˆ! 빨리 λ°₯그릇 μ±„μ›Œλ‹¬λΌλ©! 🦴 α“šβ‚Β΄ κ’³ `β‚Žαƒ" } ``` - **데이터셋 (총 4,827개)** - **dataset_0515_made (342개)**: 초기 μœ μ € 데이터 - **dataset_0527_made (818개)**: μœ μ € κ²Œμ‹œκΈ€ 기반 감정별/동물별 데이터 - **dataset_0530_made (2,986개)**: 감정별 증폭된 κ²Œμ‹œκΈ€ 기반 데이터 - **dataset_0613_made (681개)**: μœ μ € λŒ“κΈ€ μž…λ ₯에 λŒ€ν•œ κ·œμΉ™ 기반 λ³€ν™˜(cat)