
Transformers Library Usage Verification

Current Implementation

βœ… Library Version

  • Dockerfile: transformers>=4.45.0 (updated from 4.40.0)
  • Minimum Required: 4.37.0 (Qwen1.5 and Qwen2/Qwen2.5 model support landed in Transformers 4.37.0)
  • Recommended: 4.45.0+ for latest Qwen features and bug fixes

βœ… Correct Usage of Transformers API

1. Model Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# βœ… Correct: Using AutoModelForCausalLM for causal language models
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    token=hf_token,
    trust_remote_code=True,      # βœ… Required for Qwen models
    torch_dtype=torch.bfloat16,  # βœ… Memory-efficient precision
    device_map="auto",           # βœ… Automatic device placement
    max_memory={0: "20GiB"},     # βœ… Memory management
    cache_dir=CACHE_DIR,
    low_cpu_mem_usage=True,      # βœ… Efficient loading
)

Verification:

  • βœ… AutoModelForCausalLM is correct for Qwen (causal LM architecture)
  • βœ… trust_remote_code=True is required for Qwen's custom code
  • βœ… torch_dtype=torch.bfloat16 halves weight memory versus float32 with good numerical stability
  • βœ… device_map="auto" automatically handles GPU/CPU placement
  • βœ… max_memory limits GPU memory usage
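
As a quick sanity check after loading, the device placement and weight footprint can be inspected (a minimal sketch using the model loaded above):

# Per-module device placement chosen by accelerate (set when device_map is used)
print(model.hf_device_map)

# Approximate size of the loaded weights
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")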

2. Tokenizer Loading

# βœ… Correct: Using AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    token=hf_token,
    trust_remote_code=True,  # βœ… Required for Qwen
    cache_dir=CACHE_DIR,
)

Verification:

  • βœ… AutoTokenizer automatically resolves the correct Qwen tokenizer class
  • βœ… trust_remote_code=True loads Qwen's custom tokenizer code
  • βœ… The chat template shipped in the model repo is picked up automatically
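
To confirm the template and special tokens loaded as expected (a minimal sketch using the tokenizer above):

# A chat template string is present when the repo provides one
print(tokenizer.chat_template is not None)

# Special tokens referenced during generation
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)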

3. Chat Template Usage

# βœ… Correct: Using apply_chat_template
if hasattr(tokenizer, "apply_chat_template"):
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

Verification:

  • βœ… apply_chat_template is the modern way (replaces manual formatting)
  • βœ… tokenize=False returns a string (we tokenize separately)
  • βœ… add_generation_prompt=True adds assistant prompt
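
For illustration, an OpenAI-style messages list and the resulting call (a sketch; the example prompts are assumptions, not the service's actual prompts):

messages = [
    {"role": "system", "content": "You are a helpful financial assistant."},  # assumed example
    {"role": "user", "content": "Summarize the key Q3 revenue drivers."},     # assumed example
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant turn header
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)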

4. Model Generation

# βœ… Correct: Using model.generate()
outputs = model.generate(
    **inputs,
    max_new_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    top_k=DEFAULT_TOP_K,
    do_sample=temperature > 0,
    pad_token_id=PAD_TOKEN_ID,
    eos_token_id=EOS_TOKENS,
    repetition_penalty=REPETITION_PENALTY,
    use_cache=True,
)

Verification:

  • βœ… max_new_tokens is correct (not max_length)
  • βœ… do_sample=temperature > 0 correctly falls back to greedy decoding at temperature 0
  • βœ… pad_token_id and eos_token_id properly configured
  • βœ… repetition_penalty helps avoid repetition
  • βœ… use_cache=True improves performance
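
Since generate() returns the prompt tokens followed by the completion, a typical follow-up is to slice off the prompt before decoding (a minimal sketch using the inputs and outputs above):

# Decode only the newly generated tokens
prompt_len = inputs["input_ids"].shape[1]
completion = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(completion)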

5. Streaming Support

# βœ… Correct: Using TextIteratorStreamer
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

Verification:

  • βœ… TextIteratorStreamer is the correct class for streaming
  • βœ… skip_prompt=True avoids re-printing the prompt
  • βœ… skip_special_tokens=True produces clean output
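
TextIteratorStreamer only yields text while generate() runs in a separate thread, so the usual pattern looks like this (a minimal sketch reusing inputs from above):

from threading import Thread

generation_kwargs = dict(**inputs, max_new_tokens=256, streamer=streamer)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Yields decoded chunks as they are produced; the loop ends when generation finishes
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
thread.join()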

Qwen-Specific Considerations

βœ… Model Architecture

  • Qwen-Open-Finance-R-8B is based on the Qwen architecture
  • Uses a causal LM architecture (autoregressive generation)
  • Compatible with AutoModelForCausalLM

βœ… Tokenizer Features

  • Qwen tokenizer supports chat templates
  • Custom chat template can be loaded from model repo
  • Handles special tokens correctly

βœ… Generation Parameters

  • Qwen works well with:
    • temperature: 0.1-1.0 (we use 0.7 default)
    • top_p: 0.9-1.0 (we use 1.0 default)
    • top_k: 50-100 (we use DEFAULT_TOP_K)
    • repetition_penalty: 1.0-1.2 (we use REPETITION_PENALTY)
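
Collected as module-level constants, those defaults might look like this (a sketch; the DEFAULT_TOP_K and REPETITION_PENALTY values are assumptions within the ranges above, not confirmed from the code):

DEFAULT_TEMPERATURE = 0.7   # stated default
DEFAULT_TOP_P = 1.0         # stated default
DEFAULT_TOP_K = 50          # assumed value within the 50-100 range
REPETITION_PENALTY = 1.1    # assumed value within the 1.0-1.2 range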

Best Practices Followed

  1. βœ… Memory Management: Using bfloat16, low_cpu_mem_usage, max_memory
  2. βœ… Device Handling: device_map="auto" for automatic GPU/CPU
  3. βœ… Caching: Using cache_dir for model/tokenizer caching
  4. βœ… Error Handling: Proper exception handling in initialization
  5. βœ… Thread Safety: Using locks for concurrent initialization (see the sketch after this list)
  6. βœ… Streaming: Proper async streaming implementation
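
Item 5 refers to guarding one-time model loading; a minimal sketch of that pattern (illustrative names, not the actual module):

import threading

_init_lock = threading.Lock()
_model = None

def get_model():
    """Load the model once, even under concurrent first requests."""
    global _model
    if _model is None:
        with _init_lock:
            if _model is None:  # double-checked locking
                _model = AutoModelForCausalLM.from_pretrained(
                    MODEL_NAME,
                    torch_dtype=torch.bfloat16,
                    device_map="auto",
                    trust_remote_code=True,
                )
    return _model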

Potential Improvements

1. Consider Using torch.compile() (PyTorch 2.0+)

# Optional: Compile model for faster inference
if hasattr(torch, "compile"):
    # "reduce-overhead" targets latency; the first calls pay a compilation cost
    model = torch.compile(model, mode="reduce-overhead")

2. Consider Flash Attention 2

# For faster attention computation (requires the flash-attn package
# and a supported GPU, with fp16/bf16 weights)
model = AutoModelForCausalLM.from_pretrained(
    ...,
    attn_implementation="flash_attention_2",  # If available
)

3. Consider Quantization (if memory constrained)

# 8-bit quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,  # Pass the config at load time
    device_map="auto",
)

Version Compatibility Matrix

Component     Minimum  Recommended  Current
------------  -------  -----------  -------
Transformers  4.37.0   4.45.0+      4.45.0+ βœ…
PyTorch       2.0.0    2.5.0+       2.5.0+  βœ…
Python        3.8      3.11+        3.11    βœ…
CUDA          11.8     12.4         12.4    βœ…
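
A quick runtime check against this matrix (a minimal sketch):

import sys
import torch
import transformers

# Versions actually present in the running environment
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("python:", sys.version.split()[0])
print("cuda:", torch.version.cuda)  # None when CUDA is unavailable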

Conclusion

βœ… Our Transformers implementation is correct and follows best practices.

The code:

  • Uses correct Transformers API methods
  • Properly handles Qwen-specific requirements
  • Implements efficient memory management
  • Supports streaming correctly
  • Uses appropriate generation parameters

The version update to 4.45.0+ ensures:

  • Latest bug fixes
  • Better Qwen support
  • Improved performance
  • Security updates