small change for text
- README.md +2 -0
- conversion/README.md +12 -3
- conversion/convert.py +110 -70
README.md
CHANGED
```diff
@@ -19,6 +19,8 @@ pipeline_tag: visual-document-retrieval
 
 # ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy
 
+> Please read `conversion/README.md` for details about the conversion process and notes.
+
 ColPali is a model based on a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features.
 It is a [PaliGemma-3B](https://huggingface.co/google/paligemma-3b-mix-448) extension that generates [ColBERT](https://arxiv.org/abs/2004.12832)-style multi-vector representations of text and images.
 It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) and first released in [this repository](https://github.com/ManuelFay/colpali)
```
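Context for the "ColBERT-style multi-vector" phrase above: a query and a document each become a matrix of token/patch vectors, and relevance is computed by late interaction (MaxSim). Below is a minimal sketch of that scoring; the shapes are illustrative assumptions, not ColPali's actual dimensions.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query token vector, take the
    maximum similarity over all document vectors, then sum over query tokens."""
    sims = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_vectors)
    return sims.max(dim=1).values.sum()   # scalar relevance score

# Illustrative shapes only: 16 query tokens, 1024 image-patch vectors, 128 dims.
print(maxsim_score(torch.randn(16, 128), torch.randn(1024, 128)))
```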
conversion/README.md
CHANGED
````diff
@@ -1,6 +1,15 @@
 # ONNX Model Conversion Notes
 
 First of all, this was rather fun to do!
+
+It might not be so obvious that the convert script I originally made only applies to vision, and that you would do almost exactly the same for text inputs, so I have extended it a little and left this note. This matters especially because the intended use of ColPali is to run image embedding at offline time (getting your vector DB ready), while the text model is intended for online (query) time.
+
+I've now included that in the `convert.py` script as well; it's not much of a change. I've excluded uploading the text model files since the process is exactly the same as for vision, so the results will be the same, and uploading takes a long time with my home wifi unfortunately.
+
+I've opted for two models. In theory you could split the image and text inputs into several graphs and call them in the correct order, since they do share (some) weights for each input type. However, given ColPali's intended offline/online use case, that's not necessary and probably overkill for this exercise.
+
+
+## Some practical notes
 The `convert.py` script is based on code I made on Google Colab in order to have access to a GPU.
 The `requirements.txt` might not be perfect; I'd much rather use UV, which I use on a daily basis, but this was created in Google Colab in a hurry.
 
@@ -9,7 +18,7 @@ Also note that I checked the output of the converted models and the original to
 - The fp32 (default ONNX) is nearly the same as the original HF model.
 - However, the FP16 converted ONNX model is not exactly the same; there is a margin of error.
 
-Below is a code snippet that showcases the comparison:
+Below is a code snippet that showcases the comparison for image input:
 
 ```python
 import torch
@@ -25,8 +34,8 @@ DEVICE = "cpu"
 
 hf = (
     ColPaliForRetrieval
-    # NOTE change this to torch.
-    #
+    # NOTE change this to torch.float16 when we are doing ONNX fp16
+    # Also change
     .from_pretrained(MODEL_ID, torch_dtype=torch.float16)
     .to(DEVICE)
     .eval()
````
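To make the offline/online split concrete: here is a rough sketch of how the two exported graphs could be used with `onnxruntime`. The file names match the convert script below, but the feed-building detail and the `return_tensors="np"` usage are assumptions, not tested code from this repo.

```python
# Hedged sketch: file names follow convert.py; feeds are built dynamically
# because torch.onnx.export may prune inputs a graph does not use.
import onnxruntime as ort
from PIL import Image
from transformers import ColPaliProcessor

processor = ColPaliProcessor.from_pretrained("exported_artifacts")  # assumed local dir

def embed(session: ort.InferenceSession, np_inputs: dict):
    feeds = {i.name: np_inputs[i.name] for i in session.get_inputs() if i.name in np_inputs}
    return session.run(["embeddings"], feeds)[0]

# Offline (indexing time): embed document pages with the vision graph.
vision = ort.InferenceSession("model_vision.onnx")
page_emb = embed(vision, dict(processor(images=[Image.new("RGB", (32, 32))], return_tensors="np")))

# Online (query time): embed the user query with the text graph.
text = ort.InferenceSession("model_text.onnx")
query_emb = embed(text, dict(processor(text=["quarterly revenue table"], return_tensors="np")))
```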
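The comparison snippet shown in the diff above is truncated. A self-contained sketch of the kind of fp32 check it describes might look like the following; `MODEL_ID`, the ONNX file name, and the expected size of the difference are assumptions, not values from this repo.

```python
# Hedged sketch of the HF-vs-ONNX embedding comparison described above.
import numpy as np
import onnxruntime as ort
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

MODEL_ID = "vidore/colpali-v1.2-hf"  # assumed checkpoint
processor = ColPaliProcessor.from_pretrained(MODEL_ID)
hf = ColPaliForRetrieval.from_pretrained(MODEL_ID, torch_dtype=torch.float32).eval()

inputs = processor(images=[Image.new("RGB", (32, 32), color="white")], return_tensors="pt")
with torch.no_grad():
    hf_emb = hf(**inputs).embeddings.numpy()

sess = ort.InferenceSession("model_vision.onnx")
np_inputs = {k: v.numpy() for k, v in inputs.items()}
feeds = {i.name: np_inputs[i.name] for i in sess.get_inputs() if i.name in np_inputs}
onnx_emb = sess.run(["embeddings"], feeds)[0]

# fp32 export should agree closely; an fp16 export shows a larger margin of error.
print("max abs diff:", np.abs(hf_emb - onnx_emb.astype(np.float32)).max())
```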
conversion/convert.py
CHANGED
```diff
@@ -12,11 +12,16 @@ from onnxconverter_common import float16
 from onnx.external_data_helper import convert_model_to_external_data
 
 
-def export_model(model_id, output_dir, device, fp16=False):
-
+def export_model(
+    model_id: str,
+    output_dir: str,
+    device: str,
+    fp16: bool = False,
+    export_type: str = "both",
+):
+    """Export ColPaliForRetrieval to ONNX (vision, text, or both)."""
     os.makedirs(output_dir, exist_ok=True)
 
-    # Load HF model & processor
     model = (
         ColPaliForRetrieval.from_pretrained(
             model_id,
@@ -27,83 +32,118 @@ def export_model(model_id, output_dir, device, fp16=False):
         .eval()
     )
     processor = ColPaliProcessor.from_pretrained(model_id)
-
-    # Save HF artifacts
     model.config.save_pretrained(output_dir)
     processor.save_pretrained(output_dir)
 
-    # patched forward method
     _orig_forward = model.forward
 
-    def _patched_forward(
-        self, pixel_values=None, input_ids=None, attention_mask=None, **kwargs
-    ):
-        # Call the original .forward
-        out = _orig_forward(
-            pixel_values=pixel_values,
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            **kwargs,
-        )
-        return out.embeddings
-
-    model.forward = _patched_forward.__get__(model, model.__class__)
-
-    # check with dummy image batch
+    # dummy inputs
     dummy_img = Image.new("RGB", (32, 32), color="white")
     vision_pt = processor(images=[dummy_img], return_tensors="pt").to(device)
-    pv = vision_pt["pixel_values"]
-    ids = vision_pt["input_ids"]
-    msk = vision_pt["attention_mask"]
-
-    with torch.no_grad():
-        emb = model(pv, ids, msk)
-    print("Sanity-check embedding shape:", emb.shape)
-
-    GLOBALS.onnx_shape_inference = False  # Workaround shape bugs
-    onnx_path = os.path.join(output_dir, "model.onnx")
-    external_binfile = os.path.join(output_dir, "model.onnx_data")
-
-    torch.onnx.export(
-        model,
-        (pv, ids, msk),
-        onnx_path,
-        export_params=True,
-        opset_version=14,
-        do_constant_folding=True,
-        use_external_data_format=True,
-        all_tensors_to_one_file=True,
-        size_threshold=0,
-        external_data_filename=os.path.basename(external_binfile),
-        input_names=["pixel_values", "input_ids", "attention_mask"],
-        output_names=["embeddings"],
-        dynamic_axes={
-            "pixel_values": {0: "batch_size"},
-            "input_ids": {0: "batch_size", 1: "seq_len"},
-            "attention_mask": {0: "batch_size", 1: "seq_len"},
-            "embeddings": {0: "batch_size", 1: "seq_len"},
-        },
-    )
-    print("Exported ONNX to", onnx_path)
-
-    # Shape-infer & fix external-data refs
-    onnx_model = onnx.shape_inference.infer_shapes_path(onnx_path)
-    onnx_model = onnx.load(onnx_path)
-    check_and_save_model(onnx_model, onnx_path)
-    print("Shape-inference + external refs fixed")
-
-    # Minify tokenizer.json
-    tok = os.path.join(output_dir, "tokenizer.json")
-    if os.path.isfile(tok):
-        data = json.load(open(tok))
-        with open(tok, "w") as f:
-            json.dump(data, f, separators=(",", ":"))
-        print("✔ Minified tokenizer.json")
-
-    print("✅ ONNX + HF artifacts exported to", output_dir)
-    return onnx_path
+    pv, ids, msk = (
+        vision_pt["pixel_values"],
+        vision_pt["input_ids"],
+        vision_pt["attention_mask"],
+    )
+    fake_ids = torch.zeros((pv.size(0), 1), device=device, dtype=torch.long)
+    fake_mask = torch.zeros_like(fake_ids, device=device)
+    fake_pv = torch.zeros_like(pv)
+
+    out_paths = {}
+
+    # vision model
+    if export_type in ("vision", "both"):
+
+        def vision_forward(
+            self, pixel_values=None, input_ids=None, attention_mask=None, **kw
+        ):
+            return _orig_forward(
+                pixel_values=pixel_values,
+                input_ids=None,
+                attention_mask=None,
+                **kw,
+            ).embeddings
+
+        model.forward = vision_forward.__get__(model, model.__class__)
+
+        vision_onnx = os.path.join(output_dir, "model_vision.onnx")
+        vision_bin = "model_vision.onnx_data"
+        GLOBALS.onnx_shape_inference = False
+        torch.onnx.export(
+            model,
+            (pv, fake_ids, fake_mask),
+            vision_onnx,
+            export_params=True,
+            opset_version=14,
+            do_constant_folding=True,
+            use_external_data_format=True,
+            all_tensors_to_one_file=True,
+            size_threshold=0,
+            external_data_filename=vision_bin,
+            input_names=["pixel_values", "input_ids", "attention_mask"],
+            output_names=["embeddings"],
+            dynamic_axes={
+                "pixel_values": {0: "batch_size"},
+                "embeddings": {0: "batch_size", 1: "seq_len"},
+            },
+        )
+        print("✅ Exported VISION ONNX to", vision_onnx)
+
+        # fix shapes & external refs
+        m = onnx.shape_inference.infer_shapes_path(vision_onnx)
+        m = onnx.load(vision_onnx, load_external_data=True)
+        check_and_save_model(m, vision_onnx)
+        print("  (shape-inferred + external-data fixed)")
+
+        out_paths["vision"] = vision_onnx
+
+    # text model
+    if export_type in ("text", "both"):
+
+        def text_forward(
+            self, pixel_values=None, input_ids=None, attention_mask=None, **kw
+        ):
+            return _orig_forward(
+                pixel_values=None,
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                **kw,
+            ).embeddings
+
+        model.forward = text_forward.__get__(model, model.__class__)
+
+        text_onnx = os.path.join(output_dir, "model_text.onnx")
+        text_bin = "model_text.onnx_data"
+        torch.onnx.export(
+            model,
+            (fake_pv, ids, msk),
+            text_onnx,
+            export_params=True,
+            opset_version=14,
+            do_constant_folding=True,
+            use_external_data_format=True,
+            all_tensors_to_one_file=True,
+            size_threshold=0,
+            external_data_filename=text_bin,
+            input_names=["pixel_values", "input_ids", "attention_mask"],
+            output_names=["embeddings"],
+            dynamic_axes={
+                "input_ids": {0: "batch_size", 1: "seq_len"},
+                "attention_mask": {0: "batch_size", 1: "seq_len"},
+                "embeddings": {0: "batch_size", 1: "seq_len"},
+            },
+        )
+        print("✅ Exported TEXT ONNX to", text_onnx)
+
+        m = onnx.shape_inference.infer_shapes_path(text_onnx)
+        m = onnx.load(text_onnx, load_external_data=True)
+        check_and_save_model(m, text_onnx)
+        print("  (shape-inferred + external-data fixed)")
+
+        out_paths["text"] = text_onnx
+
+    print("🎉 Done exporting model(s):", out_paths)
+    return out_paths
 
 
 def quantize_fp16_and_externalize(
```
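A note on the patching trick in `export_model` above: assigning `vision_forward.__get__(model, model.__class__)` binds a plain function as a method of the instance, so `torch.onnx.export` traces the replacement `forward` (which returns a bare `embeddings` tensor) instead of the original one. Here is a tiny standalone illustration of just that mechanism, using a toy module rather than the real model:

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 2)

    def forward(self, x):
        # Structured output, analogous to the HF model returning an output object.
        return {"embeddings": self.lin(x)}

m = Toy()
_orig_forward = m.forward  # already bound to m

def patched(self, x):
    # Unwrap to a plain tensor so a traced/exported graph has a clean output.
    return _orig_forward(x)["embeddings"]

# __get__ turns the function into a bound method, as in the convert script.
m.forward = patched.__get__(m, m.__class__)
print(m.forward(torch.randn(1, 4)).shape)  # torch.Size([1, 2])
```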
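The diff ends at the signature of `quantize_fp16_and_externalize`, whose body is not shown. For orientation only, here is a minimal sketch of what an fp16 conversion step built on the imports above (`onnxconverter_common.float16`, `convert_model_to_external_data`) typically looks like; the function name, arguments, and flags are assumptions, not the repo's actual implementation.

```python
# Hedged sketch: NOT the repo's quantize_fp16_and_externalize, whose body
# this diff does not show; just the typical onnxconverter_common fp16 flow.
import onnx
from onnx.external_data_helper import convert_model_to_external_data
from onnxconverter_common import float16

def fp16_sketch(fp32_path: str, fp16_path: str, bin_name: str) -> None:
    model = onnx.load(fp32_path, load_external_data=True)
    # Cast weights/ops to float16; keep fp32 inputs/outputs for compatibility.
    model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
    # Re-externalize tensors so the .onnx protobuf stays under the 2 GB limit.
    convert_model_to_external_data(
        model_fp16,
        all_tensors_to_one_file=True,
        location=bin_name,
        size_threshold=0,
    )
    onnx.save(model_fp16, fp16_path)

fp16_sketch("model_vision.onnx", "model_vision_fp16.onnx", "model_vision_fp16.onnx_data")
```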