Upload folder using huggingface_hub
- .byaldi/image_index/doc_ids_to_file_names.json.gz +3 -0
- .byaldi/image_index/embed_id_to_doc_id.json.gz +3 -0
- .byaldi/image_index/embeddings/embeddings_0.pt +3 -0
- .byaldi/image_index/index_config.json.gz +3 -0
- .byaldi/image_index/metadata.json.gz +3 -0
- .github/workflows/update_space.yml +28 -0
- README.md +29 -7
- app.py +82 -0
- copali-qwen.ipynb +280 -0
- image.png +0 -0
- packages.txt +1 -0
- requirements.txt +10 -0
.byaldi/image_index/doc_ids_to_file_names.json.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a5adce3e520525f462d8f71c09a42b3ca10cc5039b79cd1640e0c0d97acd9e17
+size 68
.byaldi/image_index/embed_id_to_doc_id.json.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:60aedd13c343e38d2cb81b0c953b2f4b3db530f44b96af3167f63ff218c831ba
+size 79
.byaldi/image_index/embeddings/embeddings_0.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:404010331a12bd6c9dd18c87358a10c1c1ecf58e19c2dd402ae1757cded340e6
+size 264885
.byaldi/image_index/index_config.json.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:83da74cae33705bdcbdc43f436f45ec099b683245a56d7fd72336954916e9a3c
+size 174
.byaldi/image_index/metadata.json.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a23514d5d0a1b04d797c42e596342a4b3203e7ed7886d6cad63c97ee0ae49b58
+size 38
.github/workflows/update_space.yml
ADDED
@@ -0,0 +1,28 @@
+name: Run Python script
+
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.9'
+
+      - name: Install Gradio
+        run: python -m pip install gradio
+
+      - name: Log in to Hugging Face
+        run: python -c 'import huggingface_hub; huggingface_hub.login(token="${{ secrets.hf_token }}")'
+
+      - name: Deploy to Spaces
+        run: gradio deploy
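Review note: the login step in the workflow interpolates `${{ secrets.hf_token }}` straight into an inline `python -c` string. A minimal sketch of one alternative, assuming the secret is instead exposed to the step as an environment variable, is shown here. The variable name `HF_TOKEN` and the helpers `read_token` and `main` are illustrative and not part of this commit; `huggingface_hub.login(token=...)` is the same call the workflow already uses.

```python
import os


def read_token(env=None, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or blank.

    `HF_TOKEN` is an assumed variable name; the workflow step would need an
    `env:` entry that maps the repository secret to it.
    """
    env = os.environ if env is None else env
    token = (env.get(name) or "").strip()
    if not token:
        raise RuntimeError(f"environment variable {name} is not set")
    return token


def main():
    """Log in to the Hugging Face Hub with the token taken from the environment."""
    # Imported lazily so read_token() can be used without the package installed.
    import huggingface_hub

    huggingface_hub.login(token=read_token())
```

Calling `main()` from a small script in the login step would keep the token out of the command line and make the job fail with a clear message when the secret is missing.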
README.md
CHANGED
@@ -1,12 +1,34 @@
 ---
 title: ColPali
-emoji: π₯
-colorFrom: gray
-colorTo: pink
-sdk: gradio
-sdk_version: 4.44.0
 app_file: app.py
-
+sdk: gradio
+sdk_version: 4.41.0
 ---
+# RAG-based PDF Search and Keyword Extraction using Qwen2VL
+
+This repository implements a **RAG (Retrieval-Augmented Generation)** PDF search system that uses **ColPali**, loaded through the **Byaldi** library, for retrieval and **Qwen2VL** for generation. It also includes a Gradio app that uses **Qwen2VL** to extract text from images and highlight searched keywords.
+
+## Table of Contents
+- [Overview](#overview)
+- [Installation](#installation)
+- [Usage](#usage)
+- [RAG PDF Search](#rag-pdf-search)
+- [Gradio App for Keyword Extraction](#gradio-app-for-keyword-extraction)
+- [License](#license)
+
+## Overview
+
+### RAG PDF Search
+
+In `copali-qwen.ipynb`, you will find the complete implementation of the **RAG-based PDF search**. The pipeline is built with **ColPali** (via the Byaldi library) and **Qwen2VL**. By default, the code indexes and searches a single image (`image.png`), but you can change the path to point to a PDF file or another document.
+
+### Gradio App for Keyword Extraction
+
+The `app.py` file contains a **Gradio app** that uses only **Qwen2VL** to extract text from an image and highlight the keywords matching the user's search query. The app provides a simple interface for keyword search over text extracted from images.
+
+## Installation
+
+To run this project, you will need to install the following dependencies:
 
-
+```bash
+pip install transformers byaldi qwen-vl-utils gradio pillow torch
app.py
ADDED
@@ -0,0 +1,82 @@
+import gradio as gr
+import torch
+from PIL import Image
+from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
+from qwen_vl_utils import process_vision_info
+import re
+
+min_pixels = 256 * 28 * 28
+max_pixels = 1280 * 28 * 28
+
+def model_inference(images):
+    model = Qwen2VLForConditionalGeneration.from_pretrained(
+        "Qwen/Qwen2-VL-2B-Instruct",
+        trust_remote_code=True,
+        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
+    )
+    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
+
+    images = [{"type": "image", "image": Image.open(image[0])} for image in images]
+
+    messages = [{"role": "user", "content": images}]
+
+    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    image_inputs, video_inputs = process_vision_info(messages)
+    inputs = processor(
+        text=[text],
+        images=image_inputs,
+        videos=video_inputs,
+        padding=True,
+        return_tensors="pt",
+    )
+
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    inputs = inputs.to(device)
+    model = model.to(device)
+
+    generated_ids = model.generate(**inputs, max_new_tokens=512)
+    generated_ids_trimmed = [
+        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+    ]
+
+    output_text = processor.batch_decode(
+        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )
+
+    del model
+    del processor
+    return output_text[0]
+
+def search_and_highlight(text, keywords):
+    if not keywords:
+        return text
+
+    keywords = [kw.strip().lower() for kw in keywords.split(',')]
+    highlighted_text = text
+
+    for keyword in keywords:
+        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
+        highlighted_text = pattern.sub(f'**{keyword}**', highlighted_text)
+
+    return highlighted_text
+
+def extract_and_search(images, keywords):
+    extracted_text = model_inference(images)
+    highlighted_text = search_and_highlight(extracted_text, keywords)
+    return extracted_text, highlighted_text
+
+with gr.Blocks(theme=gr.themes.Soft()) as demo:
+    with gr.Row():
+        output_gallery = gr.Gallery(label="Image", height=300, show_label=True)
+        keywords = gr.Textbox(placeholder="Enter keywords to search (comma-separated)", label="Search Keywords")
+
+    extract_button = gr.Button("Extract Text and Search", variant="primary")
+
+    with gr.Row():
+        raw_output = gr.Textbox(label="Interpreted Text")
+        highlighted_output = gr.Markdown(label="Highlighted Search Results")
+
+    extract_button.click(extract_and_search, inputs=[output_gallery, keywords], outputs=[raw_output, highlighted_output])
+
+if __name__ == "__main__":
+    demo.queue(max_size=10).launch(share=True)
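Review note: `search_and_highlight` in `app.py` substitutes the lowercased keyword for every match, so a word such as "Lexical" would come back as `**lexical**`, and an empty entry (for example from a trailing comma) compiles to an empty pattern that matches at every position. The following is a minimal sketch of a variant that keeps the original casing and skips empty entries. The function name `highlight_keywords` is hypothetical and is not part of this commit.

```python
import re


def highlight_keywords(text, keywords):
    """Wrap each keyword match in ** **, keeping the casing found in `text`."""
    # Drop empty entries such as the one produced by a trailing comma.
    terms = [kw.strip() for kw in keywords.split(",") if kw.strip()]
    if not terms:
        return text
    # Longest first, so "tokens" is preferred over "token" at the same position.
    terms.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    # One pass over the text avoids re-matching inside markup added earlier.
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)


print(highlight_keywords("Lexical Analysis and lexical analyzers", "lexical,"))
```

Using a single alternation pattern is a design choice for this sketch: it means overlapping keywords cannot wrap each other's markers, which the per-keyword loop in `app.py` does not guarantee.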
copali-qwen.ipynb
ADDED
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Implementing Colpali with Qwen2VL"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "c:\\Users\\atuli\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      " from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Verbosity is set to 1 (active). Pass verbose=0 to make quieter.\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.\n",
+      "Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use\n",
+      "`config.hidden_activation` if you want to override this behaviour.\n",
+      "See https://github.com/huggingface/transformers/pull/29402 for more details.\n",
+      "Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.01it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from byaldi import RAGMultiModalModel\n",
+    "\n",
+    "RAG = RAGMultiModalModel.from_pretrained(\"vidore/colpali\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.\n",
+      "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Added page 1 of document 0 to index.\n",
+      "Index exported to .byaldi\\image_index\n",
+      "Index exported to .byaldi\\image_index\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{0: 'image.png'}"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "RAG.index(\n",
+    " input_path=\"image.png\",\n",
+    " index_name=\"image_index\",\n",
+    " store_collection_with_index=False,\n",
+    " overwrite=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'doc_id': 0, 'page_num': 1, 'score': 18.75, 'metadata': {}, 'base64': None}]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "text_query = \"What is the structure of the compiler?\"\n",
+    "results = RAG.search(text_query, k=1)\n",
+    "results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.\n",
+      "Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}\n",
+      "Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00, 6.88s/it]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor\n",
+    "from qwen_vl_utils import process_vision_info\n",
+    "import torch\n",
+    "\n",
+    "model = Qwen2VLForConditionalGeneration.from_pretrained(\n",
+    " \"Qwen/Qwen2-VL-2B-Instruct\",\n",
+    " trust_remote_code=True,\n",
+    " torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32\n",
+    " )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "results[0][\"page_num\"] -1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from PIL import Image\n",
+    "processor = AutoProcessor.from_pretrained(\"Qwen/Qwen2-VL-2B-Instruct\", trust_remote_code=True)\n",
+    "\n",
+    "messages = [\n",
+    " {\n",
+    " \"role\": \"user\",\n",
+    " \"content\": [\n",
+    " {\n",
+    " \"type\": \"image\",\n",
+    " \"image\": Image.open(\"image.png\"),\n",
+    " },\n",
+    " {\"type\": \"text\", \"text\": text_query},\n",
+    " ],\n",
+    " }\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = processor.apply_chat_template(\n",
+    " messages, tokenize=False, add_generation_prompt=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "image_inputs, video_inputs = process_vision_info(messages)\n",
+    "inputs = processor(\n",
+    " text=[text],\n",
+    " images=image_inputs,\n",
+    " videos=video_inputs,\n",
+    " padding=True,\n",
+    " return_tensors=\"pt\",\n",
+    ")\n",
+    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+    "inputs = inputs.to(device)\n",
+    "model = model.to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "generated_ids = model.generate(**inputs, max_new_tokens=50)\n",
+    "generated_ids_trimmed = [\n",
+    " out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n",
+    "]\n",
+    "output_text = processor.batch_decode(\n",
+    " generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['The structure of the compiler, as described in the syllabus, includes the following components:\\n\\n1. **Lexical Analysis**: This involves the role of the lexical analyzer, input buffering, and the design of lexical analyzers, specification and recognition of tokens']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(output_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
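Review note: the notebook computes `results[0]["page_num"] - 1` but then opens `image.png` directly, so the retrieval result never selects the page that is passed to Qwen2VL. The following is a minimal sketch of how the two steps could be connected. It assumes the search hits can be treated as dicts shaped like the output shown in the notebook, with 1-based `page_num` values. The helper name `top_page_index` is hypothetical, and how the page images are loaded (for example a list produced by `pdf2image`) is left as an assumption.

```python
def top_page_index(results):
    """Return (doc_id, zero_based_page) for the highest-scoring search hit.

    `results` is assumed to be a list of dict-like hits with `doc_id`,
    `page_num` (1-based) and `score` keys, as in the notebook output.
    """
    if not results:
        raise ValueError("search returned no results")
    best = max(results, key=lambda hit: hit["score"])
    return best["doc_id"], best["page_num"] - 1


# Hit copied from the notebook output for the compiler query.
results = [{"doc_id": 0, "page_num": 1, "score": 18.75, "metadata": {}, "base64": None}]
doc_id, page = top_page_index(results)
print(doc_id, page)  # prints "0 0"
```

With a multi-page document, `page` could index into the list of page images for `doc_id`, and that image would replace the hard-coded `Image.open("image.png")` in the message sent to the model.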
image.png
ADDED
packages.txt
ADDED
@@ -0,0 +1 @@
+poppler-utils
requirements.txt
ADDED
@@ -0,0 +1,10 @@
+colpali-engine==0.2.0
+pdf2image
+GPUtil
+accelerate==0.30.1
+mteb>=1.12.22
+git+https://github.com/huggingface/transformers
+qwen-vl-utils
+torchvision
+fastapi<0.113.0
+byaldi