---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- base_model:adapter:Qwen/Qwen2.5-VL-3B-Instruct
- llama-factory
- lora
- transformers
- finance
- vision-language
---

# PyFi-QwenVL-3B-47K

This model is a parameter-efficient fine-tuned (LoRA) version of [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), specialized for financial image understanding. It was introduced as part of the **PyFi** framework.

- **Paper:** [PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents](https://arxiv.org/abs/2512.14735)
- **Repository:** [https://github.com/AgenticFinLab/PyFi](https://github.com/AgenticFinLab/PyFi)
- **Dataset:** [PyFi-600K](https://huggingface.co/datasets/AgenticFinLab/PyFi-600K)

## Model Description

PyFi (Pyramid-like Financial Image Understanding) is a framework designed to enable Vision Language Models (VLMs) to reason through financial images (such as stock charts, financial reports, and economic diagrams) in a progressive, simple-to-complex manner.

This checkpoint is the 3B variant, fine-tuned on approximately 47,000 reasoning chains. It was trained **without Chain-of-Thought (CoT)**, so it focuses on producing the final answer at each level of the financial reasoning pyramid.

The model is designed to handle tasks across six hierarchical capability levels:

1. **Perception**: Basic visual understanding.
2. **Data Extraction**: Information retrieval from charts and tables.
3. **Calculation Analysis**: Numerical analysis tasks.
4. **Pattern Recognition**: Identifying trends and patterns.
5. **Logical Reasoning**: Complex logical analysis.
6. **Decision Support**: Strategic decision-making assistance.

## Training Details

- **Fine-tuning approach:** LoRA (Parameter-Efficient Fine-Tuning) with full-module adaptation.
- **Training Data:** 47K sample chains from the PyFi-600K dataset.
- **Optimizer:** AdamW
- **Learning Rate:** $1.0 \times 10^{-4}$
- **Learning Rate Schedule:** Cosine scheduling with a warmup ratio of 0.1.
- **Training Epochs:** 1
- **Effective Batch Size:** 8
- **Hardware:** 4x NVIDIA RTX 5090 GPUs.

## Evaluation Results

In the PyFi benchmark, fine-tuning on pyramid-structured question chains showed significant improvements. The PyFi models (when using CoT) yielded average accuracy improvements of 19.52% for the 3B variant over baseline pre-trained models.

## Citation

If you use PyFi in your research, please cite:

```bibtex
@article{pyfi2025,
  title={PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents},
  author={Zhang, Yuqun and Zhao, Yuxuan and Chen, Sijia},
  journal={arXiv preprint arXiv:2512.14735},
  year={2025}
}
```
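
## Learning Rate Schedule Sketch

The Training Details list a peak learning rate of $1.0 \times 10^{-4}$, cosine scheduling with a warmup ratio of 0.1, one epoch, and an effective batch size of 8. The following is a minimal sketch of what such a schedule could look like, not the authors' training code. It assumes linear warmup from zero, cosine decay to zero, and that each of the roughly 47,000 chains counts as one training example; the function name `lr_at_step` and all constants other than the values quoted from this card are hypothetical.

```python
import math

# Values quoted from the Training Details section of this card.
PEAK_LR = 1.0e-4
WARMUP_RATIO = 0.1
BATCH_SIZE = 8  # effective batch size

# Assumption: ~47,000 chains, each treated as one training example.
NUM_SAMPLES = 47_000


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr over the first
    warmup_ratio fraction of steps, then cosine decay from peak_lr to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# One epoch over the assumed sample count at the stated effective batch size.
total_steps = math.ceil(NUM_SAMPLES / BATCH_SIZE)

# Print the learning rate at the start, end of warmup, midpoint, and end.
for s in (0, int(total_steps * WARMUP_RATIO), total_steps // 2, total_steps):
    print(f"step {s:>5}: lr = {lr_at_step(s, total_steps):.3e}")
```

Under these assumptions, one epoch corresponds to 5,875 optimizer steps, of which the first 587 are warmup. The actual step count depends on how chains are expanded into training examples, which this card does not specify.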