# None-HSFT\_Qwen\_Image (HSFT Optimized Model)
## Model Summary
I am proud to introduce **None-HSFT\_Qwen\_Image**, an upgraded version of the Qwen Image Transformer-2D (20B) architecture achieved through deep **Full Fine-Tuning**.
I utilized a unique **HSFT (Hybrid Supervised Fine-Tuning) strategy** and trained the model on a **single NVIDIA RTX Pro 6000 96G GPU**. This **ultra-low-cost** approach solved core challenges common to large-scale diffusion models at **2048x2048** pixel resolution, such as **high-frequency detail loss and the "plastic", over-smoothed appearance**.
### Key Features
* **HSFT Optimization:** I pioneered the calibration of high-precision frequency loss against structural loss at a **2:1 dominance ratio**, ensuring high-frequency details are prioritized during learning.
* **2K Pixel Native Support:** The model's intrinsic weights have been deeply optimized, enabling native output of extreme textures without relying on external upscalers.
* **Ultimate Texture Fidelity:** The model completely eliminates the smoothed aesthetic, showing superior detail in materials like skin, fabrics, and metals.
-----
## Training Methodology & Innovation
### 1\. Resource Efficiency & Cost Control (The 96G Challenge)
The key innovation of this project lies in the extreme utilization of computational resources. I proved that it is possible to efficiently complete deep fine-tuning tasks for a 20B-class DiT model, which traditionally requires large clusters, using limited single-card resources.
* **Hardware Challenge:** Pushing the VRAM limits of a single **NVIDIA RTX Pro 6000 96G GPU**.
* **Efficiency Strategy:** I adopted an aggressive strategy of **batch size 1 with 60-step gradient accumulation** ($BS=1$, $Acc=60$), effectively simulating a stable effective batch size (EBS) of 60 and avoiding the complex communication and high leasing costs of multi-card clusters.
* **Time/Cost:** The training, involving **4000 2K images**, was completed over approximately **144 hours**, drastically minimizing the otherwise high training expenditure.
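The accumulation strategy above can be sketched as follows. This is an illustrative outline, not the actual training code; the model, optimizer, and loss function here are placeholders:

```python
import torch

# Sketch of the BS=1 / Acc=60 strategy: gradients from 60 micro-batches of
# size 1 accumulate in the parameters' .grad buffers before a single
# optimizer step, simulating an effective batch size (EBS) of 60 on one GPU.
ACC_STEPS = 60

def train_epoch(model, optimizer, dataloader, loss_fn):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        # Scale each micro-batch loss so the accumulated gradient matches
        # the average over the effective batch.
        loss = loss_fn(model(inputs), targets) / ACC_STEPS
        loss.backward()  # gradients accumulate; no step yet
        if (step + 1) % ACC_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

The trade-off is wall-clock time: accumulation serializes the micro-batches, which is consistent with the ~144-hour training run reported above.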
### 2\. Core HSFT Loss Function (FFT-Perceptual Hybrid Loss)
My fine-tuning strategy was designed to counteract the inherent smoothing tendency of MSE.
* **Loss Ratio:** The final calibrated FFT contribution is **2 times** that of the Flow Loss ($\lambda_{FFT} : \lambda_{Flow} \approx 2:1$).
* **Effect:** This aggressive strategy effectively mimics the high-frequency refinement of Generative Adversarial Networks (GANs) while maintaining the structural stability of the diffusion model.
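A minimal sketch of such an FFT-perceptual hybrid loss is shown below. The exact frequency-loss formulation used in training is not specified here, so the L1 distance on FFT magnitude spectra is an assumption for illustration; only the 2:1 weighting follows the calibration described above:

```python
import torch

# Illustrative hybrid loss: a structural flow (MSE) term plus a
# frequency-domain term on 2D FFT magnitude spectra, weighted 2:1
# in favor of the FFT term.
LAMBDA_FFT, LAMBDA_FLOW = 2.0, 1.0

def hybrid_loss(pred, target):
    # Structural term: plain MSE between prediction and target.
    flow_loss = torch.nn.functional.mse_loss(pred, target)
    # High-frequency term: compare FFT magnitude spectra, which penalizes
    # the loss of fine detail that MSE alone tends to smooth away.
    pred_mag = torch.fft.fft2(pred.float()).abs()
    target_mag = torch.fft.fft2(target.float()).abs()
    fft_loss = torch.nn.functional.l1_loss(pred_mag, target_mag)
    return LAMBDA_FLOW * flow_loss + LAMBDA_FFT * fft_loss
```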
### 3\. Scheduling & Scope (Advanced Scheduling)
* **Sampling Strategy:** I used **Logit-Normal scheduling** combined with **Flow Shift 3.2**, focusing over 75% of the training resources on the late stages of the denoising process (the high-frequency detail formation phase).
* **LoRA Injection Scope:** I employed full layer coverage (Blocks 0-59), targeting only the **Image Stream** modules, to ensure consistent texture enhancement without interfering with the model's language understanding ability.
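The sampling strategy above can be sketched as follows. The logit-normal mean/std values are illustrative assumptions (not the training configuration), and the shift formula $t' = s \cdot t / (1 + (s-1) \cdot t)$ is the common flow-shift convention:

```python
import torch

# Sketch of logit-normal timestep sampling with a flow shift of 3.2:
# t is drawn as sigmoid of a Gaussian, then remapped by the shift formula,
# which biases the sampled timesteps toward one end of the schedule.
SHIFT = 3.2

def sample_timesteps(batch_size, mean=0.0, std=1.0):
    u = torch.randn(batch_size) * std + mean
    t = torch.sigmoid(u)  # logit-normal distribution over (0, 1)
    return SHIFT * t / (1.0 + (SHIFT - 1.0) * t)  # flow-shifted timesteps
```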
-----
## Usage Instructions
This model was trained directly on the Qwen official base model framework and can be loaded as a complete Qwen Image model.
### Notes
* **Resolution:** I recommend a **long side of at least 1536 pixels** for best results.
### Example Code (Python)
```python
import torch
from diffusers import QwenImageTransformer2DModel

# QwenImageTransformer2DModel is provided by the diffusers library.
model_id = "Your/Repo/ID"
model = QwenImageTransformer2DModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
# ... (subsequent inference logic)
```
-----
## Sponsorship & Contact
This model demonstrates my exceptional ability to perform stable, high-efficiency fine-tuning on large-scale, high-resolution (20B/2K) diffusion models using limited resources.
I am passionate about tackling cutting-edge algorithmic optimization and hardware bottlenecks, and have achieved breakthroughs in the following technical areas:
* **Photorealistic enhancement of diffusion models.**
* **Stable training of ultra-high resolution (4K+) models.**
* **Extreme efficiency optimization on single high-VRAM cards.**
* **Engineering application of customized hybrid loss functions.**
This model is open-source for non-commercial use. For commercial inquiries, please contact me. I also sincerely welcome sponsorship from anyone enthusiastic about diffusion models, as the high computational cost places significant pressure on me as an individual author.
**📧 Contact me for Sponsorship and Commercial Cooperation:** lihaonan1082@gmail.com
**Thank you:** I extend my gratitude to the kohya\_ss project and the open-source community for providing the underlying pipeline support.