VeuReu committed on
Commit 5ef2eb1 · verified · 1 Parent(s): e9198d6

Upload 9 files
Files changed (4)
  1. README.md +8 -50
  2. app.py +80 -22
  3. clients/client_test.py +9 -0
  4. requirements.txt +0 -5
README.md CHANGED
@@ -1,62 +1,20 @@
  ---
- title: veureu-schat
  emoji: 🦎
  colorFrom: purple
  colorTo: indigo
  sdk: gradio
- sdk_version: "4.44.0"
  app_file: app.py
  pinned: false
  ---

- # 🦎 veureu-schat (Salamandra-Vision 7B · ZeroGPU)
-
- This Space deploys **[BSC-LT/salamandra-7b-vision](https://huggingface.co/BSC-LT/salamandra-7b-vision)**, a **LLaVA-OneVision** variant trained by the *Barcelona Supercomputing Center*, on **ZeroGPU** machines.
-
- It takes an **image and a text prompt** and returns an **automatically generated description**.
- It works both from the **web interface (Gradio)** and from **external clients** (for example, another Space running Streamlit, or a local Python app).
-
- ---
-
- ## 🚀 Features
-
- - **ZeroGPU**: GPU on demand, with no dedicated hardware.
- - **Multimodal input**: image + text.
- - **Output**: descriptive text (in Catalan or Spanish).
- - **Direct REST API** (`/api/describe_raw`) + **Gradio API** (`/api/predict/describe`).
- - Compatible with HTTP clients (`requests`) or `gradio_client`.
-
- ---
-
- ## 🧠 Model
-
- - **Model:** `BSC-LT/salamandra-7b-vision`
- - **Architecture:** LLaVA-OneVision 7B
- - **Framework:** PyTorch + Transformers
- - **Input layer:** `AutoProcessor`
- - **Generation:** `LlavaOnevisionForConditionalGeneration`
-
- The model combines vision and language to generate text from images, following the official OneVision conversation scheme ("chat template").
-
- ---
-
- ## ⚙️ Space configuration
-
- **Hardware:** ZeroGPU
- **SDK:** Gradio
- **Main file:** `app.py`
- **Requirements:** `requirements.txt`
-
- Example YAML configuration block (already present in the README header):
-
- ```yaml
- ---
- title: Salamandra-Vision 7B · ZeroGPU
- emoji: 🦎
- colorFrom: purple
- colorTo: indigo
- sdk: gradio
- sdk_version: "4.44.0"
- app_file: app.py
- pinned: false
- ---

  ---
+ title: veureu-svision
  emoji: 🦎
  colorFrom: purple
  colorTo: indigo
  sdk: gradio
+ sdk_version: "4.44.1"
  app_file: app.py
  pinned: false
  ---

+ # 🦎 veureu-svision (Salamandra-Vision 7B · ZeroGPU)
+
+ ## Endpoints
+ - **`/api/predict`** (Gradio): **batch**; input `[[<file1>, <file2>, ...], "{...context_json...}", 256, 0.7]` → output `["desc1", "desc2", ...]`.
+ - **`/api/describe_raw`** (multipart): `image`, `text`, `max_new_tokens`, `temperature` → `{"text": "..."}`.
+ - **`/api/describe`** (Gradio UI, single image).
+
+ `engine` compatibility: the engine's `VisionClient` calls **`api_name="/predict"`** with a *list of images* and a **`context_json`** string.
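
The batch input format listed above can be assembled as a plain Python payload before handing it to a client. A minimal sketch; the file paths and the `hint` key are placeholders, not part of this commit:

```python
import json

# Positional inputs for /api/predict, in the documented order:
# [[<file1>, <file2>, ...], "{...context_json...}", max_new_tokens, temperature]
files = ["tests/cat.jpg", "tests/dog.jpg"]        # placeholder image paths
context_json = json.dumps({"hint": "domestic animals"}, ensure_ascii=False)
payload = [files, context_json, 256, 0.7]

print(len(payload))  # -> 4
```

The second element stays a JSON *string* (not a dict), matching the `context_json` textbox the endpoint exposes.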
 
app.py CHANGED
@@ -1,6 +1,8 @@
- # app.py
  import os
- from typing import Dict
  import gradio as gr
  import spaces
  import torch
@@ -8,13 +10,14 @@ from PIL import Image
  from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

  MODEL_ID = os.environ.get("MODEL_ID", "BSC-LT/salamandra-7b-vision")
- DTYPE = torch.float16
- DEVICE = "cuda"

  _model = None
  _processor = None

- def _lazy_load():
      global _model, _processor
      if _model is None or _processor is None:
          _processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
@@ -26,32 +29,67 @@ def _lazy_load():
          use_safetensors=True,
          device_map=None,
      )
      return _model, _processor

- def _compose_prompt(user_text: str):
-     convo = [{"role": "user", "content": [{"type": "image"},
-                                           {"type": "text", "text": user_text or "Describe la imagen con detalle."}]}]
      return convo

- @spaces.GPU
- def infer_core(image: Image.Image, text: str, max_new_tokens: int = 256, temperature: float = 0.7) -> str:
      model, processor = _lazy_load()
-     prompt = processor.apply_chat_template(_compose_prompt(text), add_generation_prompt=True)
-     model = model.to(DEVICE)
-     inputs = processor(images=image, text=prompt, return_tensors="pt").to(DEVICE, DTYPE)
      with torch.inference_mode():
          out = model.generate(**inputs, max_new_tokens=int(max_new_tokens), temperature=float(temperature))
      return processor.decode(out[0], skip_special_tokens=True).strip()


- # ---------- Helper for API ----------
  def describe_raw(image: Image.Image, text: str = "Describe la imagen con detalle.",
                   max_new_tokens: int = 256, temperature: float = 0.7) -> Dict[str, str]:
-     result = infer_core(image, text, max_new_tokens, temperature)
      return {"text": result}


- # ---------- UI and API ----------
  with gr.Blocks(title="Salamandra Vision 7B · ZeroGPU") as demo:
      gr.Markdown("## Salamandra-Vision 7B · ZeroGPU\nImagen + texto → descripción.")
      with gr.Row():
@@ -64,12 +102,32 @@ with gr.Blocks(title="Salamandra Vision 7B · ZeroGPU") as demo:
      with gr.Column():
          out = gr.Textbox(label="Descripción", lines=18)

-     # Endpoint for UI
-     btn.click(infer_core, [in_img, in_txt, max_new, temp], out, api_name="describe")

-     # Endpoint for API (no UI)
-     demo.load(None, [gr.Image(label="image", type="pil"), gr.Textbox(value="Describe la imagen con detalle."),
-                      gr.Slider(16, 1024, value=256), gr.Slider(0.0, 1.5, value=0.7)],
-               describe_raw, api_name="describe_raw")

  demo.queue(concurrency_count=1, max_size=16).launch()

+ # app.py: veureu/svision (Salamandra Vision 7B · ZeroGPU), ENGINE-compatible
  import os
+ import json
+ from typing import Dict, List, Optional, Tuple, Union
+
  import gradio as gr
  import spaces
  import torch
  from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

  MODEL_ID = os.environ.get("MODEL_ID", "BSC-LT/salamandra-7b-vision")
+ DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

  _model = None
  _processor = None

+
+ def _lazy_load() -> Tuple[LlavaOnevisionForConditionalGeneration, AutoProcessor]:
      global _model, _processor
      if _model is None or _processor is None:
          _processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
          use_safetensors=True,
          device_map=None,
      )
+     _model.to(DEVICE)
      return _model, _processor

+
+ def _compose_prompt(user_text: str, context: Optional[Dict] = None) -> List[Dict]:
+     """Build the chat template with image + text + optional context."""
+     ctx_txt = ""
+     if context:
+         try:
+             # keep it brief, no noise
+             ctx_txt = "\n\nContexto adicional:\n" + json.dumps(context, ensure_ascii=False)[:2000]
+         except Exception:
+             pass
+     user_txt = (user_text or "Describe la imagen con detalle.") + ctx_txt
+     convo = [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image"},
+                 {"type": "text", "text": user_txt},
+             ],
+         }
+     ]
      return convo

+
+ @spaces.GPU  # on HF Spaces this runs on a GPU when one is available (ZeroGPU)
+ def _infer_one(image: Image.Image, text: str, max_new_tokens: int = 256, temperature: float = 0.7,
+                context: Optional[Dict] = None) -> str:
      model, processor = _lazy_load()
+     prompt = processor.apply_chat_template(_compose_prompt(text, context), add_generation_prompt=True)
+     inputs = processor(images=image, text=prompt, return_tensors="pt").to(DEVICE, dtype=DTYPE)
      with torch.inference_mode():
          out = model.generate(**inputs, max_new_tokens=int(max_new_tokens), temperature=float(temperature))
      return processor.decode(out[0], skip_special_tokens=True).strip()


+ # ----------------------------- API helpers -----------------------------------
+
  def describe_raw(image: Image.Image, text: str = "Describe la imagen con detalle.",
                   max_new_tokens: int = 256, temperature: float = 0.7) -> Dict[str, str]:
+     result = _infer_one(image, text, max_new_tokens, temperature, context=None)
      return {"text": result}


+ def describe_batch(images: List[Image.Image], context_json: str,
+                    max_new_tokens: int = 256, temperature: float = 0.7) -> List[str]:
+     """Batch endpoint for the ENGINE: list of images + context (JSON) → list of texts."""
+     try:
+         context = json.loads(context_json) if context_json else None
+     except Exception:
+         context = None
+     outputs: List[str] = []
+     for img in images:
+         outputs.append(_infer_one(img, text="Describe la imagen con detalle.", max_new_tokens=max_new_tokens,
+                                   temperature=temperature, context=context))
+     return outputs
+
+
+ # ----------------------------- UI & Endpoints --------------------------------
+
  with gr.Blocks(title="Salamandra Vision 7B · ZeroGPU") as demo:
      gr.Markdown("## Salamandra-Vision 7B · ZeroGPU\nImagen + texto → descripción.")
      with gr.Row():
      with gr.Column():
          out = gr.Textbox(label="Descripción", lines=18)

+     # UI
+     btn.click(_infer_one, [in_img, in_txt, max_new, temp], out, api_name="describe")
+
+     # Simple API (multipart), compatible with the previous version
+     demo.load(
+         None,
+         [gr.Image(label="image", type="pil"),
+          gr.Textbox(value="Describe la imagen con detalle."),
+          gr.Slider(16, 1024, value=256),
+          gr.Slider(0.0, 1.5, value=0.7)],
+         describe_raw,
+         api_name="describe_raw"
+     )

+     # BATCH API for the ENGINE (Gradio Client): images + context_json → list[str]
+     # Signature expected by the engine's VisionClient (api_name="/predict")
+     batch_in_images = gr.Gallery(label="Imágenes (batch)", show_label=False, columns=4, height="auto")
+     batch_context = gr.Textbox(label="context_json", value="{}", lines=4)
+     batch_max = gr.Slider(16, 1024, value=256, step=16, label="max_new_tokens")
+     batch_temp = gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="temperature")
+     batch_btn = gr.Button("Describir lote")
+     batch_out = gr.JSON(label="Descripciones (lista)")
+
+     # Note: Gradio Gallery delivers paths/objects; we rely on the client to load the files
+     batch_btn.click(describe_batch, [batch_in_images, batch_context, batch_max, batch_temp], batch_out,
+                     api_name="predict")

  demo.queue(concurrency_count=1, max_size=16).launch()
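
The context handling in `_compose_prompt` above can be exercised on its own. A standalone sketch of the same serialize-and-truncate step; the helper name is mine, while the strings mirror the committed code:

```python
import json

def context_suffix(context, limit=2000):
    # Mirror of the context step in _compose_prompt: serialize the optional
    # dict and cap the serialized form so the prompt stays bounded.
    if not context:
        return ""
    try:
        return "\n\nContexto adicional:\n" + json.dumps(context, ensure_ascii=False)[:limit]
    except Exception:
        return ""

print(len(context_suffix({"notes": "x" * 5000})))  # capped by the 2000-char limit
```

Note the cap applies to the serialized JSON, so an oversized context is silently cut mid-string rather than rejected.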
clients/client_test.py ADDED
@@ -0,0 +1,9 @@
+ from gradio_client import Client
+ c = Client("https://veureu-svision.hf.space")
+ out = c.predict(
+     ["tests/cat.jpg", "tests/dog.jpg"],  # list of images
+     '{"hint":"animales domésticos"}',    # context_json
+     256, 0.7,
+     api_name="/predict"
+ )
+ print(out)  # -> ["desc for cat", "desc for dog"]
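
As the comment in `app.py` notes, the Gallery input hands `describe_batch` file paths (or path/caption pairs) rather than PIL images. A hypothetical caller-side normalization sketch, not part of this commit:

```python
from PIL import Image

def to_pil_images(items):
    # Gradio Gallery values may be plain file paths or (path, caption)
    # pairs; normalize both shapes into RGB PIL images.
    images = []
    for item in items:
        path = item[0] if isinstance(item, (tuple, list)) else item
        images.append(Image.open(path).convert("RGB"))
    return images
```

A helper like this would sit between the Gallery payload and `_infer_one`, which expects `Image.Image` inputs.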
requirements.txt CHANGED
@@ -1,4 +1,3 @@
- # app (ZeroGPU Gradio)
  gradio>=4.44.1
  spaces>=0.25.0
  transformers>=4.44.0
@@ -6,7 +5,3 @@ torch>=2.2
  accelerate>=0.30.0
  safetensors>=0.4.2
  pillow>=10.3
-
- # clients
- #requests>=2.31.0
- #streamlit>=1.36.0

  gradio>=4.44.1
  spaces>=0.25.0
  transformers>=4.44.0
  torch>=2.2
  accelerate>=0.30.0
  safetensors>=0.4.2
  pillow>=10.3