None1082 commited on
Commit
249fea5
·
verified ·
1 Parent(s): f2b592f

Update README.md

Browse files

# None-HSFT\_Qwen\_Image (HSFT Optimized Model)

## Model Summary

I am proud to introduce **None-HSFT\_Qwen\_Image**, an upgraded version of the Qwen Image Transformer-2D (20B) architecture achieved through deep **Full Fine-Tuning**.

I utilized a unique **HSFT (Hybrid Supervised Fine-Tuning) strategy** and successfully trained the model on a **single NVIDIA RTX Pro 6000 96G GPU**. This **ultra-low-cost** approach successfully solved core challenges common in large-scale diffusion models at **2048x2048** pixel resolution, such as **high-frequency detail loss and the "plastic" or smoothed appearance**.

### Key Features

* **HSFT Optimization:** I pioneered the calibration of high-precision frequency loss against structural loss at a **2:1 dominance ratio**, ensuring high-frequency details are prioritized during learning.
* **2K Pixel Native Support:** The model's intrinsic weights have been deeply optimized, enabling native output of extreme textures without relying on external upscalers.
* **Ultimate Texture Fidelity:** The model completely eliminates the smoothed aesthetic, showing superior detail in materials like skin, fabrics, and metals.

-----

## Training Methodology & Innovation

### 1\. Resource Efficiency & Cost Control (The 96G Challenge)

The key innovation of this project lies in the extreme utilization of computational resources. I proved that it is possible to efficiently complete deep fine-tuning tasks for a 20B-class DiT model, which traditionally requires large clusters, using limited single-card resources.

* **Hardware Challenge:** Pushing the VRAM limits of a single **NVIDIA RTX Pro 6000 96G GPU**.
* **Efficiency Strategy:** I adopted an aggressive $\mathbf{BS=1 \text{ combined with } Acc=60}$ ultra-high gradient accumulation strategy, effectively simulating a stable EBS=60 environment, thus avoiding the complex communication and high leasing costs of multi-card clusters.
* **Time/Cost:** The training, involving **4000 2K images**, was completed over approximately **144 hours**, drastically minimizing the otherwise high training expenditure.

### 2\. Core HSFT Loss Function (FFT-Perceptual Hybrid Loss)

My fine-tuning strategy was designed to counteract the inherent smoothing tendency of MSE.

* **Loss Ratio:** The final calibrated FFT contribution is **2 times** that of the Flow Loss ($\lambda_{FFT} : \lambda_{Flow} \approx 2:1$).
* **Effect:** This aggressive strategy effectively mimics the high-frequency refinement of Generative Adversarial Networks (GANs) while maintaining the structural stability of the diffusion model.

### 3\. Scheduling & Scope (Advanced Scheduling)

* **Sampling Strategy:** I used **Logit-Normal scheduling** combined with **Flow Shift 3.2**, focusing over 75% of the training resources on the late stages of the denoising process (the high-frequency detail formation phase).
* **LoRA Injection Scope:** I employed full layer coverage (Blocks 0-59), targeting only the **Image Stream** modules, to ensure consistent texture enhancement without interfering with the model's language understanding ability.

-----

## Usage Instructions

This model was trained directly on the Qwen official base model framework and can be loaded as a complete Qwen Image model.

### Notes

* **Resolution:** I recommend a **long side of at least 1536 pixels** for best results.

### Example Code (Python)

```python
import torch
from transformers import QwenImageTransformer2DModel

# Use the Qwen official framework to load this fine-tuned model.
model_id = "Your/Repo/ID"
model = QwenImageTransformer2DModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# ... (subsequent inference logic)
```

-----

## Sponsorship & Contact

This model demonstrates my exceptional ability to perform stable, high-efficiency fine-tuning on large-scale, high-resolution (20B/2K) diffusion models using limited resources.

I am passionate about solving cutting-edge algorithmic optimization and hardware bottlenecks, having achieved breakthroughs in the following technical areas:

* **Photorealistic enhancement of diffusion models.**
* **Stable training of ultra-high resolution (4K+) models.**
* **Extreme efficiency optimization on single high-VRAM cards.**
* **Engineering application of customized hybrid loss functions.**

This model is non-commercial open-source. For commercial inquiries, please contact me. Furthermore, I sincerely need sponsorship from friends who are enthusiastic about diffusion models, as the high computational cost places significant pressure on me as an individual author.

**📧 Contact me for Sponsorship and Commercial Cooperation:** lihaonan1082@gmail.com

**Thank you:** I extend my gratitude to the kohya\_ss and the open-source community for providing the underlying pipeline support.

Files changed (1) hide show
  1. README.md +1 -0
README.md CHANGED
@@ -2,4 +2,5 @@
2
  license: apache-2.0
3
  base_model:
4
  - Qwen/Qwen-Image
 
5
  ---
 
2
  license: apache-2.0
3
  base_model:
4
  - Qwen/Qwen-Image
5
+ pipeline_tag: text-to-image
6
  ---