# Model Card for Model ID

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
### Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
## Uses

### Direct Use

[More Information Needed]

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
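A minimal sketch, assuming the LoRA adapter produced by the training code below was pushed to the Hub under `new_model_id`; the repo name `your-username/llm-jp-3-13b-it` and the sample instruction are placeholders:

```python
# Sketch: load the base model and attach the fine-tuned LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "llm-jp/llm-jp-3-13b"
adapter_id = "your-username/llm-jp-3-13b-it"  # hypothetical repo name

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the adapter

# Use the same prompt format as the training code below.
inputs = tokenizer("### 指示\n自己紹介をしてください。\n### 回答\n", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```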
## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- Training regime: [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
[More Information Needed]
#### Factors
[More Information Needed]
#### Metrics
[More Information Needed]
### Results
[More Information Needed]
#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
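The calculator applies the estimate sketched below; all numbers are placeholder assumptions, not measured values for this model:

```python
# Rough form of the Machine Learning Impact estimate:
# emissions (kgCO2eq) ~ power draw (kW) x hours used x grid carbon intensity (kgCO2eq/kWh)
power_draw_kw = 0.4      # assumption: a single GPU drawing ~400 W
hours_used = 3.0         # assumption: training wall-clock hours
carbon_intensity = 0.4   # assumption: kgCO2eq per kWh; varies by compute region
print(f"~{power_draw_kw * hours_used * carbon_intensity:.2f} kgCO2eq")
```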
## Technical Specifications [optional]

### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]

**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]
The code used to generate the answers for elyza-tasks-100-TV_0.jsonl is given below.
```python
# Python 3.10.12

!pip install -U pip
!pip install -U transformers
!pip install -U bitsandbytes
!pip install -U accelerate
!pip install -U datasets
!pip install -U peft
!pip install -U trl
!pip install -U wandb
!pip install ipywidgets --upgrade

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
)
import os, torch, gc
from datasets import load_dataset
import bitsandbytes as bnb
from trl import SFTTrainer

# Hugging Face token
HF_TOKEN = "Your_Token"  # replace Your_Token with your own Hugging Face token (WRITE permission required)
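# Safer alternative (sketch): read the token from an environment variable
# instead of hardcoding it in the notebook:
# import os
# HF_TOKEN = os.environ["HF_TOKEN"]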
# Load the model.
# Note: model and tokenizer are re-created by the transformers + BitsAndBytes
# load further below.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from unsloth import FastLanguageModel
import torch

max_seq_length = 512  # unsloth supports RoPE scaling, so the context length can be set freely
dtype = None  # None lets the library choose the dtype automatically
load_in_4bit = True  # True because we are handling a 13B-class model here

base_model_id = "llm-jp/llm-jp-3-13b"
new_model_id = "llm-jp-3-13b-it"  # name for the fine-tuned model; "it" = Instruction Tuning

# Create a FastLanguageModel instance
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model_id,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    trust_remote_code=True,
)
# Snapshots of llm-jp-3 1.8B, 3.7B, and 13B have already been downloaded and
# stored in the models directory. Other models require consent before they can
# be fetched, so please download those yourself.
base_model_id = "models/models--llm-jp--llm-jp-3-13b/snapshots/cd3823f4c1fcbb0ad2e2af46036ab1b0ca13192a"  # base model to fine-tune
new_model_id = "llm-jp-3-13b-finetune"  # name for the fine-tuned model
""" bnb_config: éååã®èšå®
load_in_4bit:
- 4bitéåå圢åŒã§ã¢ãã«ãããŒã
bnb_4bit_quant_type:
- éååã®åœ¢åŒãæå®
bnb_4bit_compute_dtype:
- éååãããéã¿ãçšããŠèšç®ããéã®ããŒã¿å
"""
bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", # nf4ã¯éåžžã®INT4ãã粟床ãé«ãããã¥ãŒã©ã«ãããã¯ãŒã¯ã®ååžã«æé©ã§ã bnb_4bit_compute_dtype=torch.bfloat16, )
""" model: ã¢ãã«
base_model:
- èªã¿èŸŒãããŒã¹ã¢ãã« (äºåã«å®çŸ©ãããã®)
quantization_config:
- bnb_configã§èšå®ããéååèšå®
device_map:
- ã¢ãã«ãå²ãåœãŠãããã€ã¹ (CPU/GPU) "auto"ã§èªåã«å²ãåœãŠãããŸãã
tokenizer: ããŒã¯ãã€ã¶ãŒ
base_model:
- èªã¿èŸŒãããŒã¹ã¢ãã« (äºåã«å®çŸ©ãããã®)
trust_remote_code:
- ãªã¢ãŒãã³ãŒãã®å®è¡ãèš±å¯ (ã«ã¹ã¿ã ã¢ãã«ãªã©) """ model = AutoModelForCausalLM.from_pretrained( base_model_id, quantization_config=bnb_config, device_map="auto" )
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
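# Optional sanity check (sketch): with 4-bit NF4 quantization the 13B model's
# footprint should be roughly a quarter of its 16-bit size.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")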
""" find_all_linear_names: ã¢ãã«å ã®4bitéååç·åœ¢å±€ãæ¢ããŸãã """
def find_all_linear_names(model): cls = bnb.nn.Linear4bit # 4bitéååç·åœ¢å±€ã¯ã©ã¹ãæå® lora_module_names = set() # ããã«ååŸããç·åœ¢å±€ãä¿æããŸãã
# ã¢ãã«å
ã®å
šãŠã®ã¢ãžã¥ãŒã«ãæ¢çŽ¢ããŸã
for name, module in model.named_modules():
if isinstance(module, cls): # ã¢ãžã¥ãŒã«ã4bitéååç·åœ¢å±€ã®å Žå
names = name.split('.') # ã¢ãžã¥ãŒã«ã®ååãåå² (ãã¹ããããŠãéãªã©ã«å¯ŸåŠ)
lora_module_names.add(names[0] if len(names) == 1 else names[-1]) # æäžå±€ã®ååãlora_module_namesã«è¿œå
# 'lm_head' ã¯16ãããæŒç®ã®éã«é€å€ããå¿
èŠããããããlora_module_namesããåé€
if 'lm_head' in lora_module_names:
lora_module_names.remove('lm_head')
return list(lora_module_names) # lora_module_namesããªã¹ãã«å€æããŠè¿ããŸãã
modules = find_all_linear_names(model)
""" peft_config: PEFTã®æ§æèšå®
r
- LoRA ã®ã©ã³ã¯ (4, 8, 16 ,32...)
- å¢ããã»ã©åŠç¿ãæãã, éåŠç¿ã®ãªã¹ã¯ãé«ãŸãã®ã§æ³šæ
lora_alpha
- LoRAã®ã¹ã±ãŒãªã³ã°ä¿æ°
lora_dropout
- ããããã¢ãŠãçïŒéåŠç¿ãé²ãããã®å²åïŒ
bias
- ãã€ã¢ã¹é ã®æ±ã ("none"ã®å ŽåãLoRAã¯ãã€ã¢ã¹ãåŠç¿ããªã)
task_type
- ã¿ã¹ã¯ã¿ã€ã
target_modules
- LoRAãé©çšããã¿ãŒã²ããã¢ãžã¥ãŒã« (åã®ã³ãŒãã§ç¹å®ããå±€) """
peft_config = LoraConfig( r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", target_modules=modules, )
model = get_peft_model(model, peft_config)
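# Optional check (sketch): show how few parameters LoRA actually trains;
# with r=16 this is well under 1% of the 13B total.
model.print_trainable_parameters()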
""" åŠç¿ã«çšããããŒã¿ã»ããã®æå® ä»åã¯LLM-jp ã®å ¬éããŠãã Ichikara Instruction ã䜿ããŸããããŒã¿ã«ã¢ã¯ã»ã¹ããããã«ã¯ç³è«ãå¿ èŠã§ãã®ã§ã䜿ãããæ¹ã®ã¿ç³è«ãããŠãã ããã Ichikara Instruciton ã Hugging Face Hub ã«ãŠå ¬éããããšã¯ãæ§ããã ããã ãŸããCC-BY-NC-SAã§ãã®ã§ã¢ãã«ã¯ã©ã€ã»ã³ã¹ãç¶æ¿ããåæã§ã䜿ããã ããã
äžèšã®ãªã³ã¯ããç³è«ãçµããå ã« Google Drive ããããDistribution20241221_all ãšãããã©ã«ãããšããŠã³ããŒãããŠãã ããã ä»åã¯ãichikara-instruction-003-001-1.jsonãã䜿ããŸããå¿ èŠã§ããã°å±éïŒ!unzip ãªã©ïŒããããŒã¿ã»ããã®ãã¹ãé©åã«æå®ããŠãã ããã omnicampusã®éçºç°å¢ã§ã¯ååŸããããŒã¿ãå·ŠåŽã«ãã©ãã°ã¢ã³ãããããããŠã䜿ããã ããã
https://liat-aip.sakura.ne.jp/wp/llmã®ããã®æ¥æ¬èªã€ã³ã¹ãã©ã¯ã·ã§ã³ããŒã¿äœæ/llmã®ããã®æ¥æ¬èªã€ã³ã¹ãã©ã¯ã·ã§ã³ããŒã¿-å ¬é/ 颿 ¹è¡, å®è€ãŸã, åŸè€çŸç¥å, éŽæšä¹ çŸ, æ²³å倧èŒ, äºä¹äžçŽä¹, 也å¥å€ªé. ichikara-instruction: LLMã®ããã®æ¥æ¬èªã€ã³ã¹ãã©ã¯ã·ã§ã³ããŒã¿ã®æ§ç¯. èšèªåŠçåŠäŒç¬¬30å幎次倧äŒ(2024)
"""
dataset = load_dataset("json", data_files="./ichikara-instruction-003-001-1.json") dataset
# Define the prompt format used at training time
prompt = """### 指示
{}
### 回答
{}"""

"""
formatting_prompts_func: format each record to match the prompt template.
"""
EOS_TOKEN = tokenizer.eos_token  # the tokenizer's EOS (end-of-sequence) token
def formatting_prompts_func(examples):
    input = examples["text"]     # input data
    output = examples["output"]  # output data
    text = prompt.format(input, output) + EOS_TOKEN  # build the prompt
    return {"formatted_text": text}  # return a new field "formatted_text"

# Apply the formatting to every record
dataset = dataset.map(
    formatting_prompts_func,
    num_proc=4,  # number of parallel workers
)
dataset

# Inspect the data
print(dataset["train"]["formatted_text"][1])

# Split the data into train and test sets (at the test_size ratio)
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset
""" training_arguments: åŠç¿ã®èšå®
output_dir: -ãã¬ãŒãã³ã°åŸã®ã¢ãã«ãä¿åãããã£ã¬ã¯ããª
per_device_train_batch_size:
- ããã€ã¹ããšã®ãã¬ãŒãã³ã°ããããµã€ãº
per_device_ _batch_size:
- ããã€ã¹ããšã®è©äŸ¡ããããµã€ãº
gradient_accumulation_steps:
- åŸé ãæŽæ°ããåã«ã¹ããããç©ã¿éããåæ°
optim:
- ãªããã£ãã€ã¶ã®èšå®
num_train_epochs:
- ãšããã¯æ°
eval_strategy:
- è©äŸ¡ã®æŠç¥ ("no"/"steps"/"epoch")
eval_steps:
- eval_strategyã"steps"ã®ãšããè©äŸ¡ãè¡ãstepéé
logging_strategy:
- ãã°èšé²ã®æŠç¥
logging_steps:
- ãã°ãåºåããã¹ãããéé
warmup_steps:
- åŠç¿çã®ãŠã©ãŒã ã¢ããã¹ãããæ°
save_steps:
- ã¢ãã«ãä¿åããã¹ãããéé
save_total_limit:
- ä¿åããŠããcheckpointã®æ°
max_steps:
- ãã¬ãŒãã³ã°ã®æå€§ã¹ãããæ°
learning_rate:
- åŠç¿ç
fp16:
- 16bitæµ®åå°æ°ç¹ã®äœ¿çšèšå®ïŒç¬¬8åæŒç¿ãåèã«ãããšè¯ãã§ãïŒ
bf16:
- BFloat16ã®äœ¿çšèšå®
group_by_length:
- å ¥åã·ãŒã±ã³ã¹ã®é·ãã«ããããããã°ã«ãŒãå (ãã¬ãŒãã³ã°ã®å¹çå)
report_to:
- ãã°ã®éä¿¡å ("wandb"/"tensorboard"ãªã©) """
training_arguments = TrainingArguments( output_dir=new_model_id, per_device_train_batch_size=1, gradient_accumulation_steps=2, optim="paged_adamw_32bit", num_train_epochs=1, logging_strategy="steps", logging_steps=10, warmup_steps=10, save_steps=100, save_total_limit = 2, max_steps = -1, learning_rate=5e-5, fp16=False, bf16=False, seed = 3407, group_by_length=True, report_to="none" )
""" SFTTrainer: Supervised Fine-Tuningã«é¢ããèšå®
model:
- èªã¿èŸŒãã ããŒã¹ã®ã¢ãã«
train_dataset:
- ãã¬ãŒãã³ã°ã«äœ¿çšããããŒã¿ã»ãã
eval_dataset:
- è©äŸ¡ã«äœ¿çšããããŒã¿ã»ãã
peft_config:
- PEFTïŒParameter-Efficient Fine-TuningïŒã®èšå®ïŒLoRAãå©çšããå Žåã«æå®ïŒ
max_seq_length:
- ã¢ãã«ã«å ¥åãããã·ãŒã±ã³ã¹ã®æå€§ããŒã¯ã³é·
dataset_text_field:
- ããŒã¿ã»ããå ã®åŠç¿ã«äœ¿ãããã¹ããå«ããã£ãŒã«ãå
tokenizer:
- ã¢ãã«ã«å¯Ÿå¿ããããŒã¯ãã€ã¶ãŒ
args:
- ãã¬ãŒãã³ã°ã«äœ¿çšãããã€ããŒãã©ã¡ãŒã¿ïŒTrainingArgumentsã®èšå®ãæå®ïŒ
packing:
- å ¥åã·ãŒã±ã³ã¹ã®ãããã³ã°ãè¡ããã©ããã®èšå® (False ã«èšå®ããããšã§ãåå ¥åãç¬ç«ããŠæ±ã) """ trainer = SFTTrainer( model=model, train_dataset=dataset["train"], peft_config=peft_config, max_seq_length= 512, dataset_text_field="formatted_text", tokenizer=tokenizer, args=training_arguments, packing= False, )
model.config.use_cache = False # ãã£ãã·ã¥æ©èœãç¡å¹å trainer.train() # ãã¬ãŒãã³ã°ãå®è¡
# Load the task data.
# In the omnicampus development environment, drag and drop the task jsonl into
# the panel on the left before running.
import json

datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""
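# Note (sketch): the brace-matching loop above also copes with records split
# across lines. For standard one-record-per-line JSONL this is equivalent:
# datasets = [json.loads(l) for l in open("./elyza-tasks-100-TV_0.jsonl") if l.strip()]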
# Run inference on the tasks with the model.
from tqdm import tqdm

results = []
for data in tqdm(datasets):
    input = data["input"]

    prompt = f"""### 指示
{input}
### 回答
"""

    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(tokenized_input)

    with torch.no_grad():
        outputs = model.generate(
            tokenized_input,
            attention_mask=attention_mask,
            max_new_tokens=100,
            do_sample=False,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
        )[0]
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)

    results.append({"task_id": data["task_id"], "input": input, "output": output})
# Submit the jsonl file generated here.
# The records produced above also include "input", but extra fields are fine:
# only "task_id" and "output" are required.
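# Optional validation (sketch): confirm every result carries the required keys.
for r in results:
    assert "task_id" in r and "output" in r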
import re

jsonl_id = re.sub(".*/", "", new_model_id)
with open(f"./{jsonl_id}-outputs.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)  # ensure_ascii=False to handle non-ASCII characters
        f.write('\n')
# Upload the model and tokenizer to the Hugging Face Hub.
model.push_to_hub(new_model_id, token=HF_TOKEN, private=True)      # online saving
tokenizer.push_to_hub(new_model_id, token=HF_TOKEN, private=True)  # online saving
```