Instructions to use seong67360/Qwen2.5-7B-Instruct_v4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use seong67360/Qwen2.5-7B-Instruct_v4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="seong67360/Qwen2.5-7B-Instruct_v4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("seong67360/Qwen2.5-7B-Instruct_v4")
model = AutoModelForCausalLM.from_pretrained("seong67360/Qwen2.5-7B-Instruct_v4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use seong67360/Qwen2.5-7B-Instruct_v4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "seong67360/Qwen2.5-7B-Instruct_v4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seong67360/Qwen2.5-7B-Instruct_v4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/seong67360/Qwen2.5-7B-Instruct_v4

SGLang

How to use seong67360/Qwen2.5-7B-Instruct_v4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "seong67360/Qwen2.5-7B-Instruct_v4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seong67360/Qwen2.5-7B-Instruct_v4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "seong67360/Qwen2.5-7B-Instruct_v4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seong67360/Qwen2.5-7B-Instruct_v4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use seong67360/Qwen2.5-7B-Instruct_v4 with Docker Model Runner:
```
docker model run hf.co/seong67360/Qwen2.5-7B-Instruct_v4
```

Fine-Tuning the Qwen 2.5 7B Instruct Model

This repository contains code for fine-tuning the Qwen 2.5 7B Instruct model with Amazon SageMaker. The project uses QLoRA (Quantized Low-Rank Adaptation) to fine-tune the large language model efficiently.

How to Use the Model

Requirements

Python 3.8 or later
CUDA-capable GPU (at least 24GB VRAM recommended)
Required libraries:

pip install torch transformers accelerate

Basic Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Check whether CUDA is available
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Warning: CUDA not available, using CPU")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "seong67360/Qwen2.5-7B-Instruct_v4",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "seong67360/Qwen2.5-7B-Instruct_v4",
    trust_remote_code=True
)

# Example conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is quantum computing?"}
]

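# Generate a response: apply the chat template, then call generate()
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)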

Memory Optimization Options

If GPU memory is limited, you can load the model with 8-bit or 4-bit quantization (this requires the bitsandbytes package):

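from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "seong67360/Qwen2.5-7B-Instruct_v4",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

# Or 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "seong67360/Qwen2.5-7B-Instruct_v4",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)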

Setting Generation Parameters

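# Tokenize the conversation with the chat template
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,           # enable sampling so temperature and top_p take effect
    temperature=0.7,          # higher values give more creative responses
    top_p=0.9,                # cumulative probability threshold used for sampling
    max_new_tokens=512,       # maximum number of tokens to generate
    repetition_penalty=1.1    # penalty to discourage repetition (1.0 or higher)
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)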

Project Structure

.
├── scripts/
│   ├── train.py
│   ├── tokenization_qwen2.py
│   ├── requirements.txt
│   └── bootstrap.sh
├── sagemaker_train.py
└── README.md

Prerequisites

Access to Amazon SageMaker
Hugging Face account and access token
Configured AWS credentials
Python 3.10+

Environment Setup

Main dependencies used by the project:

PyTorch 2.1.0
Transformers (latest version from the main branch)
Accelerate >= 0.27.0
PEFT >= 0.6.0
BitsAndBytes >= 0.41.0

Model Configuration

Base model: Qwen/Qwen2.5-7B-Instruct
Training method: QLoRA (4-bit quantization)
Instance type: ml.p5.48xlarge
Distributed strategy: PyTorch DDP
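
The QLoRA setup named above could look roughly like the following sketch. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values taken from this repository, and `build_qlora_model` is a hypothetical helper name. The heavy imports sit inside the function so the settings can be inspected on a machine without a GPU.

```python
# Sketch of a QLoRA setup; the LoRA values are assumptions, not the repository's settings.

# Keyword arguments for transformers.BitsAndBytesConfig (4-bit NF4 quantization).
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # compute in bf16, matching bf16 training
}

# Keyword arguments for peft.LoraConfig (assumed values).
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def build_qlora_model(base_model_id="Qwen/Qwen2.5-7B-Instruct"):
    """Load the base model in 4-bit and attach trainable LoRA adapters."""
    # Imported lazily: these packages are only useful on a CUDA machine.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=BitsAndBytesConfig(**BNB_KWARGS),
    )
    # Upcasts the remaining non-quantized parameters and enables gradient checkpointing.
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Only the small adapter matrices are trained in this arrangement; the 4-bit base weights stay frozen, which is what keeps memory use low.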

Training Configuration

Hyperparameters

{
    'epochs': 3,
    'per_device_train_batch_size': 4,
    'gradient_accumulation_steps': 8,
    'learning_rate': 1e-5,
    'max_steps': 1000,
    'bf16': True,
    'max_length': 2048,
    'gradient_checkpointing': True,
    'optim': 'adamw_torch',
    'lr_scheduler_type': 'cosine',
    'warmup_ratio': 0.1,
    'weight_decay': 0.01,
    'max_grad_norm': 0.3
}
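
To make these values more concrete, the sketch below derives the effective batch size and the learning rate at a few steps. It assumes a single ml.p5.48xlarge instance with 8 GPUs and a linear warmup followed by cosine decay to zero; neither is confirmed by the hyperparameters alone, and `lr_at` is a hypothetical helper. If the script uses the Hugging Face Trainer, a positive `max_steps` takes precedence over the epoch count.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
NUM_GPUS = 8          # assumption: one ml.p5.48xlarge instance (8 GPUs)
MAX_STEPS = 1000
WARMUP_RATIO = 0.1
PEAK_LR = 1e-5

# Sequences contributing to each optimizer step under data parallelism.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at MAX_STEPS."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (MAX_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch)
print("warmup steps:", warmup_steps)
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```

Under these assumptions each optimizer step covers 4 × 8 × 8 = 256 sequences, and the learning rate reaches its peak of 1e-5 after 100 warmup steps.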

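Environment Variables

The training environment is configured with optimizations for distributed training and memory management:

CUDA device configuration
Memory optimization settings
EFA (Elastic Fabric Adapter) configuration for distributed training
Hugging Face token and cache settings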
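
As an illustration of the four groups above, the following sketch builds such an environment as a dictionary. The variable names are real CUDA, PyTorch, libfabric, NCCL, and Hugging Face settings, but the specific values and the `build_training_env` helper are assumptions; the repository's actual configuration may differ.

```python
import os


def build_training_env(hf_token, cache_dir="/tmp/hf_cache", num_gpus=8):
    """Return environment variables for a distributed training job (illustrative values)."""
    return {
        # CUDA device configuration: expose all GPUs on the instance
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(num_gpus)),
        # Memory optimization: cap allocator block splitting to reduce fragmentation
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",
        # EFA configuration for inter-node communication through libfabric
        "FI_PROVIDER": "efa",
        "FI_EFA_USE_DEVICE_RDMA": "1",
        "NCCL_DEBUG": "INFO",
        # Hugging Face token and cache location
        "HF_TOKEN": hf_token,
        "HF_HOME": cache_dir,
    }


env = build_training_env(os.environ.get("HF_TOKEN", "<secret>"))
print(env["CUDA_VISIBLE_DEVICES"])
```

A dictionary like this can be passed to the training job or applied with `os.environ.update(env)` before the training process starts.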

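Training Process

Environment preparation:
- Create requirements.txt with the required dependencies
- Create bootstrap.sh to install Transformers
- Set up the SageMaker training configuration
Model loading:
- Load the base Qwen 2.5 7B model with 4-bit quantization
- Configure BitsAndBytes for quantization
- Prepare the model for k-bit training
Dataset processing:
- Use the Sujet Finance dataset
- Format conversations in the Qwen2 format
- Tokenize with a maximum length of 2048 tokens
- Preprocess the data in parallel
Training:
- Use gradient checkpointing for memory efficiency
- Use a cosine learning rate schedule with warmup
- Save a checkpoint every 50 steps
- Log training metrics every 10 steps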
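
The conversation formatting step above can be sketched in plain Python. Qwen2 uses the ChatML layout with `<|im_start|>` and `<|im_end|>` markers; in practice `tokenizer.apply_chat_template` produces this text, so the `format_chatml` and `truncate` helpers and the sample conversation here are hypothetical and only show the shape of the data.

```python
def format_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the ChatML layout used by Qwen2."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def truncate(token_ids, max_length=2048):
    """Keep at most max_length token ids, mirroring the 2048-token limit."""
    return token_ids[:max_length]


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a bond yield?"},
    {"role": "assistant", "content": "The return an investor earns on a bond."},
]
print(format_chatml(example))
```

For training, the full conversation including the assistant turn is rendered; `add_generation_prompt=True` is only needed at inference time.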

Monitoring and Metrics

The following metrics are tracked during training:

Training loss
Evaluation loss

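Error Handling

The implementation includes comprehensive error handling and logging:

Environment validation
Dataset preparation checks
Training process monitoring
Detailed error messages and stack traces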
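
A minimal sketch of how environment validation and stack-trace logging might be combined is shown below. The list of required variables and the helper names are assumptions for illustration; the repository's actual checks are not shown in this README.

```python
import logging
import os

logger = logging.getLogger("train")

# Assumed list of required variables; the repository's actual checks may differ.
REQUIRED_ENV_VARS = ("HF_TOKEN", "AWS_DEFAULT_REGION")


def validate_environment(env=None):
    """Raise a descriptive error if any required environment variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
    logger.info("Environment validation passed")


def run_with_logging(step_name, func):
    """Run one pipeline step, logging the full stack trace if it fails."""
    try:
        return func()
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Step %r failed", step_name)
        raise
```

Failing early on a missing variable is cheaper than discovering it after a training instance has already been provisioned.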

Usage

Configure AWS credentials and the SageMaker role
Set the Hugging Face token
Run the training script:

python sagemaker_train.py

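Custom Components

Custom Tokenizer

The project includes a custom implementation of the Qwen2 tokenizer (tokenization_qwen2.py) with the following features:

Special token handling
Unicode normalization
Vocabulary management
Input preparation for model training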
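
The first two features can be illustrated with a short standard-library sketch. It assumes NFC normalization and the usual Qwen2 chat special tokens; `pre_tokenize` is a hypothetical function and is not taken from tokenization_qwen2.py.

```python
import re
import unicodedata

# Special tokens used by the Qwen2 chat format.
SPECIAL_TOKENS = ("<|im_start|>", "<|im_end|>", "<|endoftext|>")
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def pre_tokenize(text):
    """NFC-normalize text and split it so special tokens stay intact as single pieces."""
    text = unicodedata.normalize("NFC", text)
    # The capturing group keeps the special tokens in the result; drop empty pieces.
    return [piece for piece in _SPECIAL_RE.split(text) if piece]


# "e" followed by a combining acute accent is merged into a single code point.
print(pre_tokenize("<|im_start|>user\nCafe\u0301<|im_end|>"))
```

Normalizing before tokenization means visually identical strings map to the same token ids, and splitting on special tokens first prevents them from being broken into ordinary subword pieces.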

Notes

The training script is optimized for the ml.p5.48xlarge instance type
Training uses PyTorch Distributed Data Parallel
Gradient checkpointing is used for memory optimization
An automatic retry mechanism handles training failures
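
One way an automatic retry mechanism could be written is sketched below. The attempt count, the exponential backoff, and the `run_with_retries` name are assumptions; the README does not describe how the repository's own retries work.

```python
import logging
import time

logger = logging.getLogger("train")


def run_with_retries(train_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call train_fn until it succeeds, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except Exception:
            logger.exception("Training attempt %d/%d failed", attempt, max_attempts)
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the behavior can be exercised without real waiting; combined with the checkpoints saved every 50 steps, a retried run would not need to start from scratch if the training function resumes from the latest checkpoint.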

Downloads last month: 17

Safetensors
Model size: 8B params
Tensor type: F32

Model tree for seong67360/Qwen2.5-7B-Instruct_v4
Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-7B-Instruct