# Transformers Library Usage Verification
## Current Implementation
### ✅ Library Version
- **Dockerfile**: `transformers>=4.45.0` (updated from 4.40.0)
- **Minimum Required**: 4.37.0 for both Qwen1.5 and Qwen2.5 architectures
- **Recommended**: 4.45.0+ for the latest Qwen features and bug fixes
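To confirm the deployed environment actually satisfies the pin, a quick runtime check can be added (a minimal sketch; `packaging` is already a transformers dependency):
```python
# Minimal sketch: verify the installed transformers version meets the Dockerfile pin
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.45.0"), (
    f"transformers {transformers.__version__} is older than the required 4.45.0"
)
```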
### ✅ Correct Usage of the Transformers API
#### 1. Model Loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ✅ Correct: AutoModelForCausalLM matches Qwen's causal LM architecture
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    token=hf_token,
    trust_remote_code=True,      # ✅ Needed when the repo ships custom code
    torch_dtype=torch.bfloat16,  # ✅ Memory-efficient precision (newer releases also accept `dtype`)
    device_map="auto",           # ✅ Automatic device placement
    max_memory={0: "20GiB"},     # ✅ Memory management
    cache_dir=CACHE_DIR,
    low_cpu_mem_usage=True,      # ✅ Efficient loading
)
```
**Verification**:
- ✅ `AutoModelForCausalLM` is correct for Qwen (causal LM architecture)
- ✅ `trust_remote_code=True` loads any custom code shipped with the model repo (Qwen2/2.5 architectures are natively supported since 4.37, so it is a safe no-op otherwise)
- ✅ `torch_dtype=torch.bfloat16` is optimal for memory and performance
- ✅ `device_map="auto"` automatically handles GPU/CPU placement
- ✅ `max_memory` limits GPU memory usage
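As a sanity check after loading, placement and footprint can be inspected; `hf_device_map` is only populated when `device_map` is used (a minimal sketch):
```python
# Sketch: inspect where weights landed and how much memory they use
print(model.hf_device_map)                               # per-module device assignment from device_map="auto"
print(f"Weights: {model.get_memory_footprint() / 1e9:.2f} GB")  # approximate weight footprint
```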
#### 2. Tokenizer Loading
```python
# ✅ Correct: AutoTokenizer resolves the right tokenizer class for the checkpoint
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    token=hf_token,
    trust_remote_code=True,  # ✅ Needed when the repo ships a custom tokenizer
    cache_dir=CACHE_DIR,
)
```
**Verification**:
- ✅ `AutoTokenizer` automatically detects the Qwen tokenizer
- ✅ `trust_remote_code=True` loads Qwen's custom tokenizer code when present
- ✅ Chat template handling is correct
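One related gotcha: some checkpoints ship without a pad token, which breaks batched generation. A common fallback, shown as a minimal sketch (not part of the current code):
```python
# Sketch: fall back to the EOS token when the checkpoint defines no pad token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
```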
#### 3. Chat Template Usage
```python
# ✅ Correct: apply_chat_template formats messages using the model's own template
if hasattr(tokenizer, "apply_chat_template"):
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
```
**Verification**:
- ✅ `apply_chat_template` is the modern approach (replaces manual prompt formatting)
- ✅ `tokenize=False` returns a string (we tokenize separately)
- ✅ `add_generation_prompt=True` appends the assistant turn marker so generation starts in the right place
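For context, `messages` is the usual OpenAI-style list of role/content dicts. A minimal sketch of building it and producing the `inputs` used in the next section (the message contents are illustrative):
```python
# Sketch: build a chat, render it with the template, and tokenize for generation
messages = [
    {"role": "system", "content": "You are a helpful finance assistant."},
    {"role": "user", "content": "Summarize the main drivers of bond prices."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
```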
#### 4. Model Generation
```python
# ✅ Correct: model.generate() with explicit sampling controls
outputs = model.generate(
    **inputs,
    max_new_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    top_k=DEFAULT_TOP_K,
    do_sample=temperature > 0,
    pad_token_id=PAD_TOKEN_ID,
    eos_token_id=EOS_TOKENS,
    repetition_penalty=REPETITION_PENALTY,
    use_cache=True,
)
```
**Verification**:
- ✅ `max_new_tokens` is correct (not `max_length`)
- ✅ `do_sample` based on temperature is correct
- ✅ `pad_token_id` and `eos_token_id` properly configured
- ✅ `repetition_penalty` helps avoid repetition
- ✅ `use_cache=True` improves performance
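`generate` returns the prompt tokens followed by the completion, so decoding should slice off the prompt first; a minimal sketch:
```python
# Sketch: decode only the newly generated tokens, not the echoed prompt
prompt_len = inputs["input_ids"].shape[1]
completion = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```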
#### 5. Streaming Support
```python
# ✅ Correct: TextIteratorStreamer for token-by-token output
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
)
```
**Verification**:
- ✅ `TextIteratorStreamer` is the correct class for streaming
- ✅ `skip_prompt=True` avoids re-printing the prompt
- ✅ `skip_special_tokens=True` produces clean output
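The streamer only yields text while `generate` runs on a separate thread; the usual consumption pattern looks like this (a sketch, with `generation_kwargs` standing in for the full argument set from section 4):
```python
# Sketch: run generate() on a worker thread and consume tokens as they arrive
from threading import Thread

generation_kwargs = dict(**inputs, max_new_tokens=max_tokens, streamer=streamer)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text_chunk in streamer:  # yields decoded text pieces as they are produced
    print(text_chunk, end="", flush=True)
thread.join()
```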
## Qwen-Specific Considerations
### ✅ Model Architecture
- **Qwen-Open-Finance-R-8B** is based on the Qwen architecture
- Uses the **CausalLM** architecture (autoregressive generation)
- Compatible with `AutoModelForCausalLM`
### ✅ Tokenizer Features
- The Qwen tokenizer supports chat templates
- A custom chat template can be loaded from the model repo
- Handles special tokens correctly
### ✅ Generation Parameters
- Qwen works well with the following ranges (see the sketch after this list):
  - `temperature`: 0.1-1.0 (we use 0.7 by default)
  - `top_p`: 0.9-1.0 (we use 1.0 by default)
  - `top_k`: 50-100 (we use DEFAULT_TOP_K)
  - `repetition_penalty`: 1.0-1.2 (we use REPETITION_PENALTY)
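These defaults can be bundled into a `GenerationConfig` so they live in one place. The concrete values for `DEFAULT_TOP_K` and `REPETITION_PENALTY` below are illustrative assumptions, not the project's actual constants:
```python
# Sketch: collect the sampling defaults in one GenerationConfig (values are assumed)
from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=1.0,
    top_k=50,                # assumed value for DEFAULT_TOP_K
    repetition_penalty=1.1,  # assumed value for REPETITION_PENALTY
)
outputs = model.generate(**inputs, generation_config=generation_config)
```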
## Best Practices Followed
1. ✅ **Memory Management**: Using `bfloat16`, `low_cpu_mem_usage`, `max_memory`
2. ✅ **Device Handling**: `device_map="auto"` for automatic GPU/CPU placement
3. ✅ **Caching**: Using `cache_dir` for model/tokenizer caching
4. ✅ **Error Handling**: Proper exception handling during initialization
5. ✅ **Thread Safety**: Using locks for concurrent initialization (see the sketch below)
6. ✅ **Streaming**: Proper async streaming implementation
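For reference, thread-safe lazy initialization typically follows the double-checked locking pattern below; this is a sketch of the idea, not the project's exact code:
```python
# Sketch: double-checked locking so the model is loaded exactly once across threads
import threading

_model = None
_init_lock = threading.Lock()

def get_model():
    global _model
    if _model is None:          # fast path: already initialized
        with _init_lock:
            if _model is None:  # re-check inside the lock to avoid a race
                _model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    return _model
```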
## Potential Improvements
### 1. Consider Using `torch.compile()` (PyTorch 2.0+)
```python
# Optional: Compile model for faster inference
if hasattr(torch, "compile"):
    model = torch.compile(model, mode="reduce-overhead")
```
### 2. Consider Flash Attention 2
```python
# For faster attention computation (if supported)
model = AutoModelForCausalLM.from_pretrained(
    ...,
    attn_implementation="flash_attention_2",  # If available
)
```
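Transformers ships a helper for checking availability up front, which allows a clean fallback to the default SDPA implementation (a minimal sketch):
```python
# Sketch: pick Flash Attention 2 only when the flash-attn package and GPU support it
from transformers.utils import is_flash_attn_2_available

attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation=attn_impl)
```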
### 3. Consider Quantization (if memory constrained)
```python
# 8-bit quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# The config takes effect when passed at load time
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Version Compatibility Matrix
| Component | Minimum | Recommended | Current |
|-----------|---------|-------------|---------|
| Transformers | 4.37.0 | 4.45.0+ | 4.45.0+ ✅ |
| PyTorch | 2.0.0 | 2.5.0+ | 2.5.0+ ✅ |
| Python | 3.8 | 3.11+ | 3.11 ✅ |
| CUDA | 11.8 | 12.4 | 12.4 ✅ |
## Conclusion
✅ **Our Transformers implementation is correct and follows best practices.**
The code:
- Uses correct Transformers API methods
- Properly handles Qwen-specific requirements
- Implements efficient memory management
- Supports streaming correctly
- Uses appropriate generation parameters
The version update to 4.45.0+ ensures:
- Latest bug fixes
- Better Qwen support
- Improved performance
- Security updates