	---
	license: mit
	datasets:
	- uoft-cs/cifar10
	language:
	- zh
	- en
	metrics:
	- accuracy
	pipeline_tag: image-classification
	tags:
	- multimodal
	- cifar10
	- cnn
	- bert
	- vision-language
	---

	<div align="center">
	<img src="assets/log1.png" alt="UprmT_T AI" width="180"/>
	<h1>UprmT_T</h1>
	<p><strong>Multimodal Image Classification · From Relying on Text to Actually Looking at the Image</strong></p>

	<div style="display: flex; justify-content: center; gap: 12px; margin: 16px 0; flex-wrap: wrap;">
	<a href="https://huggingface.co/GQFth/Uprm-i1">
	<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Model-Uprm--i1-ffc107?style=for-the-badge" alt="HF"/>
	</a>
	<a href="https://swanlab.cn/@020202/multimodal-object-detection/runs/u2nvr8dtqnfs7iv86r7xs/chart">
	<img src="https://img.shields.io/badge/SwanLab-Run-4B8BF5?style=for-the-badge&logo=swan" alt="SwanLab"/>
	</a>
	<a href="https://github.com/GQFth/UprmT_T">
	<img src="https://img.shields.io/badge/GitHub-Code-181717?style=for-the-badge&logo=github" alt="GitHub"/>
	</a>
	</div>
	</div>

	---

	## Model Overview

	| Version | Image input | CNN layers | BatchNorm | Training noise | Accuracy | GPU utilization |
	|---------|-------------|------------|-----------|----------------|----------|-----------------|
	| v01 | 32×32 | 2 | ❌ | ✅ | 72.3% | 34% |
	| v02 | 128×128 | 3 | ✅ | ❌ | 86.7% | 78% |

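	> Core upgrade: a three-layer CNN + BatchNorm + higher-resolution input, which addresses two pain points: the model "can't see the image clearly" and the GPU is underfed.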
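To make the effect of the resolution and depth change concrete, here is a minimal back-of-the-envelope sketch in plain Python. It traces the feature-map size and the convolution workload per image for a v01-like and a v02-like stack. The channel widths, the 3×3 kernels with padding 1, and the 2×2 pooling after every layer are assumptions chosen for illustration; the actual layer configuration is not stated in this card.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a conv layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping pooling with the given window."""
    return size // window


def trace(input_size, channels, in_channels=3, kernel=3):
    """Return (final feature-map side length, conv multiply-accumulates per image)."""
    size, c_in, macs = input_size, in_channels, 0
    for c_out in channels:
        size = conv_out(size, kernel=kernel)  # padding=1 keeps the size unchanged
        macs += size * size * c_out * c_in * kernel * kernel
        size = pool_out(size)                 # halve height and width
        c_in = c_out
    return size, macs


# Hypothetical channel widths, not taken from the released checkpoints.
v01 = trace(32, channels=[32, 64])        # 32x32 input, 2 conv layers
v02 = trace(128, channels=[32, 64, 128])  # 128x128 input, 3 conv layers

for name, (side, macs) in (("v01", v01), ("v02", v02)):
    print(f"{name}: final feature map {side}x{side}, {macs:,} conv MACs per image")
```

Under these assumptions the v02-like stack ends with a larger feature map and does far more convolution work per image, which is consistent with both the accuracy gain and the higher GPU utilization in the table, although batch size and data loading also affect utilization.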

	---

	## Experiment Log (Full Record)

	<details>
	<summary><strong>2025/11/6 · test_v_02.py</strong> (click to expand)</summary>

	```text
	Training completed: 2025.10.31
	Model file: multimodal_cifar10_epoch10.pth

	Architecture upgrades:
	Image feature extraction: 2 layers → 3 layers
	Added: BatchNorm layers
	Removed: noise injection during training

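	Issues found:
	• Training-set resolution is too low (32×32), so the model cannot generalize to high-resolution images
	• GPU compute capacity increased, but utilization stays low (<40%)
	• Possible causes: input too small, batch_size too small, data-loading bottleneck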

	Model files:
	• multimodal_model_epoch50.pth
	• multimodal_model_epoch50_1.pth

	Architecture:
	• Two-layer image feature extraction
	• BERT decoding of pre-generated text

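	Issues:
	• The model relies heavily on the text prompt
	• Heavy noise was added to force the model to learn from the image
	• But the CNN is minimal, so the image features it learns are too shallow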

	Next steps

	- [ ] Raise the input resolution to 224×224
	- [ ] Replace the CNN with ViT-tiny
	- [ ] Add CLIP-style contrastive learning
	- [ ] Open up the Inference API