diff --git a/Paper2Video/LICENSE b/Paper2Video/LICENSE
deleted file mode 100644
index dd76e7908a4ba73fe1bd0cf725a597bed9352f2a..0000000000000000000000000000000000000000
--- a/Paper2Video/LICENSE
+++ /dev/null
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2025 Show Lab
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
diff --git a/Paper2Video/README-CN.md b/Paper2Video/README-CN.md
deleted file mode 100644
index 1a8106ac91d695af96e07bd92b2b3f98bd2206c3..0000000000000000000000000000000000000000
--- a/Paper2Video/README-CN.md
+++ /dev/null
@@ -1,248 +0,0 @@
-# Paper2Video
-
-
- English | 简体中文
-
-
-
-
- Paper2Video: 从学术论文自动生成演讲视频
-
-
-
-
- Zeyu Zhu*,
- Kevin Qinghong Lin*,
- Mike Zheng Shou
- 新加坡国立大学 Show Lab
-
-
-
-
- 📄 论文 |
- 🤗 Daily Paper |
- 📊 数据集 |
- 🌐 项目主页 |
- 💬 推特
-
-
-- **输入:** 一篇论文 ➕ 一张图像 ➕ 一段音频
-
-| 论文 | 图像 | 音频 |
-|--------|--------|--------|
-| [🔗 论文链接](https://arxiv.org/pdf/1509.01626) | Hinton 的图像 | [🔗 音频样本](https://github.com/showlab/Paper2Video/blob/page/assets/hinton/ref_audio_10.wav) |
-
-
-- **输出:** 演讲视频
-
-
-
-https://github.com/user-attachments/assets/39221a9a-48cb-4e20-9d1c-080a5d8379c4
-
-
-
-
-查看更多生成结果请访问 [🌐 项目主页](https://showlab.github.io/Paper2Video/)。
-
-## 🔥 Update
-- [x] [2025.10.11] 我们的工作在[YC Hacker News](https://news.ycombinator.com/item?id=45553701)上受到关注.
-- [x] [2025.10.9] 感谢AK在[Twitter](https://x.com/_akhaliq/status/1976099830004072849)上分享我们的工作!
-- [x] [2025.10.9] 我们的工作被 [Medium](https://medium.com/@dataism/how-ai-learned-to-make-scientific-videos-from-slides-to-a-talking-head-0d807e491b27)报道.
-- [x] [2025.10.8] 下方查看我们的demo视频!
-- [x] [2025.10.7] 我们发布了 [Arxiv 论文](https://arxiv.org/abs/2510.05096).
-- [x] [2025.10.6] 我们发布了 [代码](https://github.com/showlab/Paper2Video) 和 [数据集](https://huggingface.co/datasets/ZaynZhu/Paper2Video).
-- [x] [2025.9.28] Paper2Video 已经被 **Scaling Environments for Agents Workshop ([SEA](https://sea-workshop.github.io/)) at NeurIPS 2025** 接受.
-
-
-https://github.com/user-attachments/assets/a655e3c7-9d76-4c48-b946-1068fdb6cdd9
-
-
-
-
----
-
-### Table of Contents
-- [🌟 项目总览](#-项目总览)
-- [🚀 快速上手: PaperTalker](#-快速上手-PaperTalker)
- - [1. 环境配置](#1-环境配置)
- - [2. 大语言模型配置](#2-大语言模型配置)
- - [3. 推理](#3-推理)
-- [📊 评价指标: Paper2Video](#-评价指标-Paper2Video)
-- [😼 乐趣: Paper2Video 生成 Paper2Video 演讲视频](#-乐趣-Paper2Video生成Paper2Video演讲视频)
-- [🙏 致谢](#-致谢)
-- [📌 引用](#-引用)
----
-
-## 🌟 项目总览
-
-
-
-
-这项工作解决了学术演讲的两个核心问题:
-
-- **左边: 如何根据论文制作学术演讲?**
- *PaperTalker* — 集成**幻灯片**、**字幕**、**光标**、**语音合成**和**演讲者视频渲染**的多智能体。
-
-- **右边: 如何评估学术演讲视频?**
- *Paper2Video* — 一个具有精心设计的指标来评估演示质量的基准。
-
-
----
-
-## 🚀 尝试 PaperTalker 为你的论文制作演讲视频 !
-
-
-
-
-### 1. 环境配置
-准备Python环境:
-```bash
-cd src
-conda create -n p2v python=3.10
-conda activate p2v
-pip install -r requirements.txt
-conda install -c conda-forge tectonic
-```
-下载所依赖代码,并按照[Hallo2](https://github.com/fudan-generative-vision/hallo2)中的说明下载模型权重。
-```bash
-git clone https://github.com/fudan-generative-vision/hallo2.git
-```
-您需要**单独准备用于 talking-head generation 的环境**，以避免潜在的软件包冲突，详情请参考 Hallo2。安装完成后，使用 `which python` 命令获取 Python 环境路径（见下方示例）。
-```bash
-cd hallo2
-conda create -n hallo python=3.10
-conda activate hallo
-pip install -r requirements.txt
-```
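-
-例如，可以在激活 `hallo` 环境后记录其解释器路径，稍后通过 `--talking_head_env` 传入（仅为示意，shell 变量名可任取）：
-
-```bash
-conda activate hallo
-TALKING_HEAD_ENV=$(which python)   # 将该路径通过 --talking_head_env 传给 pipeline.py
-```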
-
-### 2. 大语言模型配置
-在终端配置您的**API 凭证**:
-```bash
-export GEMINI_API_KEY="your_gemini_key_here"
-export OPENAI_API_KEY="your_openai_key_here"
-```
-最佳实践是针对 LLM 和 VLM 使用 **GPT-4.1** 或 **Gemini 2.5 Pro**。我们也支持本地部署的开源模型（例如 Qwen），详情请参阅 Paper2Poster。
-
-### 3. 推理
-脚本 `pipeline.py` 提供了一个自动化的学术演示视频生成流程。它以 **LaTeX 论文素材** 和 **参考图像/音频** 作为输入，经过多个子模块（幻灯片 → 字幕 → 语音 → 光标 → 头部特写）生成完整的演示视频。⚡ 运行此流程的最低推荐 GPU 为 **NVIDIA A6000**（48 GB 显存）。
-
-#### 示例用法
-
-运行以下命令来启动完整生成:
-
-```bash
-python pipeline.py \
- --model_name_t gpt-4.1 \
- --model_name_v gpt-4.1 \
- --model_name_talking hallo2 \
- --result_dir /path/to/output \
- --paper_latex_root /path/to/latex_proj \
- --ref_img /path/to/ref_img.png \
- --ref_audio /path/to/ref_audio.wav \
- --talking_head_env /path/to/hallo2_env \
- --gpu_list [0,1,2,3,4,5,6,7]
-```
-
-| 参数名 | 类型 | 默认值 | 说明 |
-|----------|------|---------|-------------|
-| `--model_name_t` | `str` | `gpt-4.1` | 文本大语言模型(LLM) |
-| `--model_name_v` | `str` | `gpt-4.1` | 视觉语言模型(VLM) |
-| `--model_name_talking` | `str` | `hallo2` | Talking Head 模型。目前仅支持 **hallo2** |
-| `--result_dir` | `str` | `/path/to/output` | 输出目录(包括幻灯片、字幕、视频等) |
-| `--paper_latex_root` | `str` | `/path/to/latex_proj` | 论文 LaTeX 项目的根目录 |
-| `--ref_img` | `str` | `/path/to/ref_img.png` | 参考图像(必须为**正方形**人像) |
-| `--ref_audio` | `str` | `/path/to/ref_audio.wav` | 参考音频(建议时长约为 10 秒) |
-| `--ref_text` | `str` | `None` | 可选参考文本(用于字幕风格指导) |
-| `--beamer_templete_prompt` | `str` | `None` | 可选参考文本(用于幻灯片风格指导) |
-| `--gpu_list` | `list[int]` | `""` | GPU 列表,用于并行执行(适用于**光标生成**与 **Talking Head 渲染**) |
-| `--if_tree_search` | `bool` | `True` | 是否启用树搜索(用于幻灯片布局优化) |
-| `--stage` | `str` | `"[0]"` | 需要运行的阶段（例如 `[0]` 表示完整流程，`[1,2,3]` 表示部分阶段，见下方示例） |
-| `--talking_head_env` | `str` | `/path/to/hallo2_env` | Talking Head 生成的 Python 环境路径 |
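-
-例如，跳过前面阶段的部分重跑命令大致如下（仅为示意，阶段编号请以 `pipeline.py` 中的实际定义为准）：
-
-```bash
-python pipeline.py \
-    --result_dir /path/to/output \
-    --paper_latex_root /path/to/latex_proj \
-    --ref_img /path/to/ref_img.png \
-    --ref_audio /path/to/ref_audio.wav \
-    --talking_head_env /path/to/hallo2_env \
-    --stage "[1,2,3]"
-```
-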
----
-
-## 📊 评价指标: Paper2Video
-
-
-
-
-与自然视频生成不同,学术演示视频发挥着高度专业化的作用:它们不仅关乎视觉保真度,更关乎**学术交流**。这使得直接应用视频合成中的传统指标(例如 FVD、IS 或基于 CLIP 的相似度)变得困难。相反,它们的价值在于它们如何有效地**传播研究成果**并**提升学术知名度**。从这个角度来看,我们认为,评判高质量的学术演示视频应该从两个互补的维度进行评判:
-#### 对于观众
-- 视频应**忠实传达论文的核心思想**。
-- 视频应**易于不同受众观看**。
-
-#### 对于作者
-- 视频应**突出作者的智力贡献和身份**。
-- 视频应**提升作品的知名度和影响力**。
-
-为了实现这些目标,我们引入了专门为学术演示视频设计的评估指标:Meta Similarity, PresentArena, PresentQuiz, IP Memory.
-
-### 运行评价
-- 准备环境:
-```bash
-cd src/evaluation
-conda create -n p2v_e python=3.10
-conda activate p2v_e
-pip install -r requirements.txt
-```
-- 对于 Meta Similarity 和 PresentArena:
-```bash
-python MetaSim_audio.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-python MetaSim_content.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-```bash
-python PresentArena.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-- 对于**PresentQuiz**,首先基于论文生成问题并使用 Gemini 进行评估:
-```bash
-cd PresentQuiz
-python create_paper_questions.py --paper_folder /path/to/data
-python PresentQuiz.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-
-- 对于**IP Memory**,首先从生成的视频中生成问题对,然后使用 Gemini 进行评估:
-```bash
-cd IPMemory
-python construct.py
-python ip_qa.py
-```
-更多详情请查看代码!
-
-👉 Paper2Video 数据集可在以下网址获取:
-[HuggingFace](https://huggingface.co/datasets/ZaynZhu/Paper2Video)
-
----
-
-## 😼 乐趣: Paper2Video 生成 Paper2Video 演讲视频
-查看 **Paper2Video 生成 Paper2Video 演讲视频**:
-
-https://github.com/user-attachments/assets/ff58f4d8-8376-4e12-b967-711118adf3c4
-
-## 🙏 致谢
-
-* 数据集中演示视频的来源是 SlidesLive 和 YouTube。
-* 感谢所有为制作演示视频付出辛勤努力的作者!
-* 感谢 [CAMEL](https://github.com/camel-ai/camel) 开源了组织良好的多智能体框架代码库。
-* 感谢 [Hallo2](https://github.com/fudan-generative-vision/hallo2.git) 和 [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git) 作者开源代码。
-* 感谢 [Wei Jia](https://github.com/weeadd) 在数据收集和基线实现方面所做的努力。我们也感谢所有参与用户调研的参与者。
-* 感谢所有 **Show Lab @ NUS** 成员的支持!
-
-
-
----
-
-## 📌 引用
-
-
-如果我们的工作对您有帮助,欢迎引用我们的工作:
-
-```bibtex
-@misc{paper2video,
- title={Paper2Video: Automatic Video Generation from Scientific Papers},
- author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
- year={2025},
- eprint={2510.05096},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2510.05096},
-}
-```
diff --git a/Paper2Video/README.md b/Paper2Video/README.md
deleted file mode 100644
index 258f23247b73bc980a3d0d67dcbcc8cd9d5bc45c..0000000000000000000000000000000000000000
--- a/Paper2Video/README.md
+++ /dev/null
@@ -1,251 +0,0 @@
-# Paper2Video
-
-
- English | 简体中文
-
-
-
-
- Paper2Video: Automatic Video Generation from Scientific Papers
-
-从学术论文自动生成演讲视频
-
-
-
- Zeyu Zhu*,
- Kevin Qinghong Lin*,
- Mike Zheng Shou
- Show Lab, National University of Singapore
-
-
-
-
- 📄 Paper |
- 🤗 Daily Paper |
- 📊 Dataset |
- 🌐 Project Website |
- 💬 X (Twitter)
-
-
-- **Input:** a paper ➕ an image ➕ an audio clip
-
-| Paper | Image | Audio |
-|--------|--------|--------|
-| [🔗 Paper link](https://arxiv.org/pdf/1509.01626) | Hinton's photo | [🔗 Audio sample](https://github.com/showlab/Paper2Video/blob/page/assets/hinton/ref_audio_10.wav) |
-
-
-- **Output:** a presentation video
-
-
-
-https://github.com/user-attachments/assets/39221a9a-48cb-4e20-9d1c-080a5d8379c4
-
-
-
-
-Check out more examples at [🌐 project page](https://showlab.github.io/Paper2Video/).
-
-## 🔥 Update
-- [x] [2025.10.11] Our work receives attention on [YC Hacker News](https://news.ycombinator.com/item?id=45553701).
-- [x] [2025.10.9] Thanks AK for sharing our work on [Twitter](https://x.com/_akhaliq/status/1976099830004072849)!
-- [x] [2025.10.9] Our work is reported by [Medium](https://medium.com/@dataism/how-ai-learned-to-make-scientific-videos-from-slides-to-a-talking-head-0d807e491b27).
-- [x] [2025.10.8] Check out our demo video below!
-- [x] [2025.10.7] We release the [arxiv paper](https://arxiv.org/abs/2510.05096).
-- [x] [2025.10.6] We release the [code](https://github.com/showlab/Paper2Video) and [dataset](https://huggingface.co/datasets/ZaynZhu/Paper2Video).
-- [x] [2025.9.28] Paper2Video has been accepted to the **Scaling Environments for Agents Workshop ([SEA](https://sea-workshop.github.io/)) at NeurIPS 2025**.
-
-
-https://github.com/user-attachments/assets/a655e3c7-9d76-4c48-b946-1068fdb6cdd9
-
-
-
-
----
-
-### Table of Contents
-- [🌟 Overview](#-overview)
-- [🚀 Quick Start: PaperTalker](#-try-papertalker-for-your-paper-)
- - [1. Requirements](#1-requirements)
- - [2. Configure LLMs](#2-configure-llms)
- - [3. Inference](#3-inference)
-- [📊 Evaluation: Paper2Video](#-evaluation-paper2video)
-- [😼 Fun: Paper2Video for Paper2Video](#-fun-paper2video-for-paper2video)
-- [🙏 Acknowledgements](#-acknowledgements)
-- [📌 Citation](#-citation)
-
----
-
-## 🌟 Overview
-
-
-
-
-This work solves two core problems for academic presentations:
-
-- **Left: How to create a presentation video from a paper?**
- *PaperTalker* — an agent that integrates **slides**, **subtitling**, **cursor grounding**, **speech synthesis**, and **talking-head video rendering**.
-
-- **Right: How to evaluate a presentation video?**
- *Paper2Video* — a benchmark with well-designed metrics to evaluate presentation quality.
-
-
----
-
-## 🚀 Try PaperTalker for your Paper!
-
-
-
-
-### 1. Requirements
-Prepare the environment:
-```bash
-cd src
-conda create -n p2v python=3.10
-conda activate p2v
-pip install -r requirements.txt
-conda install -c conda-forge tectonic
-```
-Download the dependent code and follow the instructions in **[Hallo2](https://github.com/fudan-generative-vision/hallo2)** to download the model weights.
-```bash
-git clone https://github.com/fudan-generative-vision/hallo2.git
-```
-You need to **prepare a separate environment for talking-head generation** to avoid potential package conflicts; please refer to Hallo2 for details. After installation, use `which python` to get the Python environment path (see the snippet below).
-```bash
-cd hallo2
-conda create -n hallo python=3.10
-conda activate hallo
-pip install -r requirements.txt
-```
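-
-For instance, you can record the environment's interpreter path after activating it and pass it later via `--talking_head_env` (a minimal sketch; the shell variable name is arbitrary):
-
-```bash
-conda activate hallo
-TALKING_HEAD_ENV=$(which python)   # pass this path to pipeline.py via --talking_head_env
-```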
-
-### 2. Configure LLMs
-Export your **API credentials**:
-```bash
-export GEMINI_API_KEY="your_gemini_key_here"
-export OPENAI_API_KEY="your_openai_key_here"
-```
-The best practice is to use **GPT-4.1** or **Gemini 2.5 Pro** for both the LLM and the VLM. We also support locally deployed open-source models (e.g., Qwen); for details, please refer to Paper2Poster.
-
-### 3. Inference
-The script `pipeline.py` provides an automated pipeline for generating academic presentation videos. It takes **LaTeX paper sources** together with a **reference image/audio** as input and goes through multiple sub-modules (Slides → Subtitles → Speech → Cursor → Talking Head) to produce a complete presentation video. ⚡ The minimum recommended GPU for running this pipeline is an **NVIDIA A6000** with 48 GB of VRAM.
-
-#### Example Usage
-
-Run the following command to launch a full generation:
-
-```bash
-python pipeline.py \
- --model_name_t gpt-4.1 \
- --model_name_v gpt-4.1 \
- --model_name_talking hallo2 \
- --result_dir /path/to/output \
- --paper_latex_root /path/to/latex_proj \
- --ref_img /path/to/ref_img.png \
- --ref_audio /path/to/ref_audio.wav \
- --talking_head_env /path/to/hallo2_env \
- --gpu_list [0,1,2,3,4,5,6,7]
-```
-
-| Argument | Type | Default | Description |
-|----------|------|---------|-------------|
-| `--model_name_t` | `str` | `gpt-4.1` | LLM |
-| `--model_name_v` | `str` | `gpt-4.1` | VLM |
-| `--model_name_talking` | `str` | `hallo2` | Talking Head model. Currently only **hallo2** is supported |
-| `--result_dir` | `str` | `/path/to/output` | Output directory (slides, subtitles, videos, etc.) |
-| `--paper_latex_root` | `str` | `/path/to/latex_proj` | Root directory of the LaTeX paper project |
-| `--ref_img` | `str` | `/path/to/ref_img.png` | Reference image (must be **square** portrait) |
-| `--ref_audio` | `str` | `/path/to/ref_audio.wav` | Reference audio (recommended: ~10s) |
-| `--ref_text` | `str` | `None` | Optional reference text (for style guidance for subtitles) |
-| `--beamer_templete_prompt` | `str` | `None` | Optional reference text (for style guidance for slides) |
-| `--gpu_list` | `list[int]` | `""` | GPU list for parallel execution (used in **cursor generation** and **Talking Head rendering**) |
-| `--if_tree_search` | `bool` | `True` | Whether to enable tree search for slide layout refinement |
-| `--stage` | `str` | `"[0]"` | Pipeline stages to run (e.g., `[0]` full pipeline, `[1,2,3]` partial stages; see the example below) |
-| `--talking_head_env` | `str` | `/path/to/hallo2_env` | Python environment path for talking-head generation |
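-
-For example, a partial re-run that skips the earlier stages might look like the sketch below (illustrative only; check `pipeline.py` for the exact stage numbering):
-
-```bash
-python pipeline.py \
-    --result_dir /path/to/output \
-    --paper_latex_root /path/to/latex_proj \
-    --ref_img /path/to/ref_img.png \
-    --ref_audio /path/to/ref_audio.wav \
-    --talking_head_env /path/to/hallo2_env \
-    --stage "[1,2,3]"
-```
-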
----
-
-## 📊 Evaluation: Paper2Video
-
-
-
-
-Unlike natural video generation, academic presentation videos serve a highly specialized role: they are not merely about visual fidelity but about **communicating scholarship**. This makes it difficult to directly apply conventional metrics from video synthesis (e.g., FVD, IS, or CLIP-based similarity). Instead, their value lies in how well they **disseminate research** and **amplify scholarly visibility**. From this perspective, we argue that a high-quality academic presentation video should be judged along two complementary dimensions:
-#### For the Audience
-- The video is expected to **faithfully convey the paper’s core ideas**.
-- It should remain **accessible to diverse audiences**.
-
-#### For the Author
-- The video should **foreground the authors’ intellectual contribution and identity**.
-- It should **enhance the work’s visibility and impact**.
-
-To capture these goals, we introduce evaluation metrics specifically designed for academic presentation videos: Meta Similarity, PresentArena, PresentQuiz, IP Memory.
-
-### Run Eval
-- Prepare the environment:
-```bash
-cd src/evaluation
-conda create -n p2v_e python=3.10
-conda activate p2v_e
-pip install -r requirements.txt
-```
-- For MetaSimilarity and PresentArena:
-```bash
-python MetaSim_audio.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-python MetaSim_content.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-```bash
-python PresentArena.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-- For **PresentQuiz**, first generate questions from the paper, then evaluate with Gemini:
-```bash
-cd PresentQuiz
-python create_paper_questions.py --paper_folder /path/to/data
-python PresentQuiz.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
-```
-
-- For **IP Memory**, first generate question pairs from the generated videos, then evaluate with Gemini:
-```bash
-cd IPMemory
-python construct.py
-python ip_qa.py
-```
-See the code for more details!
-
-👉 Paper2Video Benchmark is available at:
-[HuggingFace](https://huggingface.co/datasets/ZaynZhu/Paper2Video)
-
----
-
-## 😼 Fun: Paper2Video for Paper2Video
-Check out **how Paper2Video presents Paper2Video**:
-
-https://github.com/user-attachments/assets/ff58f4d8-8376-4e12-b967-711118adf3c4
-
-## 🙏 Acknowledgements
-
-* The sources of the presentation videos are SlidesLive and YouTube.
-* We thank all the authors who spend a great effort to create presentation videos!
-* We thank [CAMEL](https://github.com/camel-ai/camel) for open-sourcing a well-organized multi-agent framework codebase.
-* We thank the authors of [Hallo2](https://github.com/fudan-generative-vision/hallo2.git) and [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git) for open-sourcing their code.
-* We thank [Wei Jia](https://github.com/weeadd) for his effort in collecting the data and implementing the baselines. We also thank all the participants involved in the human studies.
-* We thank all the **Show Lab @ NUS** members for support!
-
-
-
----
-
-## 📌 Citation
-
-
-If you find our work useful, please cite:
-
-```bibtex
-@misc{paper2video,
- title={Paper2Video: Automatic Video Generation from Scientific Papers},
- author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
- year={2025},
- eprint={2510.05096},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2510.05096},
-}
-```
-[Star History](https://star-history.com/#showlab/Paper2Video&Date)
diff --git a/Paper2Video/__init__.py b/Paper2Video/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/Paper2Video/src/__init__.py b/Paper2Video/src/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/Paper2Video/src/evaluation/IPMemory/construct.py b/Paper2Video/src/evaluation/IPMemory/construct.py
deleted file mode 100644
index 7389f8ff8c768e331c2657cbfd1e9a1ad9d0244a..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/IPMemory/construct.py
+++ /dev/null
@@ -1,69 +0,0 @@
-"""
- construct question about Academic IP
- input query: 4 video clips from 4 different paper presentation + query (image/audio)
- input question: 4 understanding qa from corresponding paper
- output task: choose the right question to ask
-"""
-import os, re
-import json
-import random
-import itertools
-from os import path
-from typing import List
-from pathlib import Path
-from tqdm import tqdm
-
-def generate_combinations(total_num, comb_size):
- return list(itertools.combinations(range(total_num), comb_size))
-
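-# Each sampled group of 4 papers defines one identity-matching task: given a
-# reference face image / voice, the judge must pick the question that comes
-# from the queried presenter's paper.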
-def generate_ip_task(valid_data_name, num_qa_pair):
-    combs = list(itertools.combinations(range(len(valid_data_name)), 4))
- combs = random.sample(combs, num_qa_pair)
-
- qa_list = []
- for comb in combs:
- ## questions
- question_list = []
- question_index = random.randint(1, 50)
- for index in comb:
-            question_path = path.join(valid_data_name[index][1], "4o-mini_qa.json")
- with open(question_path, 'r') as f: question = json.load(f)["understanding"]["questions"]
- question_list.append(question["Question {}".format(str(question_index))]["question"])
- ## query
- query_list = []
- for index in comb:
-            ref_img_path = path.join(valid_data_name[index][1], "ref_img.png")
-            ref_audio_path = path.join(valid_data_name[index][1], "ref_audio.wav")
- query_list.append((ref_img_path, ref_audio_path))
- ## qa
- qa = {}
- qa["videos"] = []
- for idx in range(len(comb)):
-            qa["videos"].append(valid_data_name[comb[idx]][0])
-
- qa["querys"] = query_list
- qa["questions"] = question_list
- qa_list.append(qa)
- with open("ip_qa.json", 'w') as f: json.dump(qa_list, f, indent=4)
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-
-if __name__ == "__main__":
-    num_qa_pair = 10  # number of sampled tasks (at most C(num_data, 4) combinations)
- root_dir = "/path/to/result"
- gt_dir = "/path/to/data"
-
- all_data_name = sort_by_leading_number(os.listdir(root_dir))
- all_groundtruth = sort_by_leading_number(os.listdir(gt_dir))
-    valid_data_name = []
-    for data_idx in range(len(all_data_name)):
-        if path.basename(root_dir) == "paper2video":
-            video_result_1 = path.join(root_dir, all_data_name[data_idx], "3_merage.mp4")
-            video_result_2 = path.join(root_dir.replace("paper2video", "presentagent"), all_data_name[data_idx], "result.mp4")
-            # assumed completion of the truncated filter: keep samples whose generated
-            # videos exist for both systems, paired with their ground-truth directories
-            if path.exists(video_result_1) and path.exists(video_result_2):
-                valid_data_name.append((all_data_name[data_idx], path.join(gt_dir, all_groundtruth[data_idx])))
-    generate_ip_task(valid_data_name, num_qa_pair)
diff --git a/Paper2Video/src/evaluation/IPMemory/ip_qa.py b/Paper2Video/src/evaluation/IPMemory/ip_qa.py
deleted file mode 100644
index 0c3d7a6c5f5b2f4cb4105739edbb3e2fe1a5346a..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/IPMemory/ip_qa.py
+++ /dev/null
@@ -1,142 +0,0 @@
-import os
-import re
-import json
-import time
-import random
-import argparse, pdb
-from os import path
-import google.generativeai as genai
-from moviepy.editor import VideoFileClip
-from camel.models import ModelFactory
-from camel.types import ModelType, ModelPlatformType
-from camel.configs import GeminiConfig
-from typing import List
-from pathlib import Path
-
-
-genai.configure(api_key=os.environ.get("GEMINI_API_KEY", ""))  # expects GEMINI_API_KEY to be exported (see README)
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-dataset_path = "/path/to/data"
-dataset_list = sort_by_leading_number(os.listdir(dataset_path))
-
-
-def eval_ip(root_path, clip_duration, model_list, prompt_path, question_path, test_type='image'):
- tmp_dir = "tmp"
- os.makedirs(tmp_dir, exist_ok=True)
-    gemini_model = genai.GenerativeModel("models/gemini-2.5-flash")  # assumed; the original id "gemini-2.5-pro-flash" is not a valid model
-
-    with open(prompt_path, 'r') as f: prompt = f.read()
- with open(question_path, 'r') as f: questions = json.load(f)
-
- result_each_question = []
- for question in questions:
- video_ids = question["videos"]
- querys = question["querys"]
- qs = question["questions"]
-
- ## get video clips
- video_clips_path = {}
- for model in model_list: video_clips_path[model] = []
-
- start_p2v = None
- for vid_id in video_ids:
- tmp_dir_id = path.join(tmp_dir, str(vid_id))
- os.makedirs(tmp_dir_id, exist_ok=True)
- for model in model_list:
- if model == 'p2v': video_path = path.join(root_path, "paper2video", str(vid_id), '3_merage.mp4')
- elif model == 'p2v-o': video_path = path.join(root_path, "paper2video_wo_presenter", str(vid_id), 'result.mp4')
- elif model == 'veo3': video_path = path.join(root_path, "veo3", str(vid_id)+".mp4")
- elif model == 'wan2.2': video_path = path.join(root_path, "wan2.2", str(int(vid_id)-1), "result.mp4")
- elif model == 'presentagent': video_path = path.join(root_path, "presentagent", str(vid_id), "result.mp4")
- elif model == 'human-made': video_path = path.join(dataset_path, dataset_list[int(vid_id)-1], "gt_presentation_video.mp4")
-
-                video = VideoFileClip(video_path)
-                if model in ('p2v', 'p2v-o'):
-                    # reuse one clip window across both Paper2Video variants for a fair comparison
-                    if start_p2v is None:
-                        start_p2v = random.uniform(0, video.duration - clip_duration - 1)
-                    start = start_p2v
-                else:
-                    start = random.uniform(0, video.duration - clip_duration - 1)
-                end = start + clip_duration
-
- clip_save_path = path.join(tmp_dir_id, model+".mp4")
- subclip = video.subclip(start, end)
- subclip.write_videofile(clip_save_path, codec="libx264", audio_codec="aac")
- video_clips_path[model].append(clip_save_path)
- ## test for each model, 4 qas
- result_each_model = {}
- for model in model_list:
- video_input = video_clips_path[model]
- videos = upload_videos(video_input)
- result_each_model[model] = []
-            for idx, query in enumerate(querys):
-                if test_type == 'image':
-                    query = query[0]
-                    query_state = genai.upload_file(path=query, mime_type="image/png")
-                elif test_type == 'audio':
-                    query = query[1]
-                    # assumed fix: upload the audio query so it can be attached below
-                    query_state = genai.upload_file(path=query, mime_type="audio/wav")
-
-                answer = idx
-                # shuffle the candidate questions and remember where the correct one lands
-                order = [0, 1, 2, 3]
-                random.shuffle(order)
-                new_qs = [qs[i] for i in order]
-                new_answer = order.index(answer)
-
-                contents = [prompt, "Here is the query", genai.get_file(query_state.name), "Here are the video clips"]
- contents.extend(videos)
- contents.extend(["Here are the questions"])
- contents.extend(new_qs)
-
-                response = gemini_model.generate_content(contents)
-                match = re.search(r"My choice:\s*(\d+)", response.text)
-                if match is None: continue  # skip unparseable judge responses
-                choice_num = int(match.group(1)) - 1
-                result_each_model[model].append([query, new_qs, choice_num, new_answer, choice_num == new_answer])
- result_each_question.append(result_each_model)
- print(result_each_question)
- with open("ip_qa_result.json", 'w') as f: json.dump(result_each_question, f, indent=4)
-
-def upload_videos(video_list):
- videos = video_list.copy()
- for idx, value in enumerate(videos):
- videos[idx] = genai.upload_file(path=value, mime_type="video/mp4")
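-    # poll until every uploaded file has been processed by the API (state ACTIVE)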
- while True:
- flag = True
- for idx, value in enumerate(videos):
- file_state = genai.get_file(videos[idx].name)
- if file_state.state.name != "ACTIVE":
- flag = False
- time.sleep(5)
- print(f"waiting 5 seconds...")
- break
- if flag: break
- for idx, value in enumerate(videos):
- videos[idx] = genai.get_file(videos[idx].name)
- return videos
-
-if __name__ == "__main__":
- clip_duration = 4
- prompt_path = "./prompt/ip_qa.txt"
- model_list = ["p2v", "p2v-o", "veo3", "wan2.2", "presentagent", "human-made"]
- root_path = "/path/to/result"
- question_path = "ip_qa.json"
- eval_ip(root_path, clip_duration, model_list, prompt_path, question_path)
\ No newline at end of file
diff --git a/Paper2Video/src/evaluation/MetaSim_audio.py b/Paper2Video/src/evaluation/MetaSim_audio.py
deleted file mode 100644
index 6ef3456099602b22d97e8276a59eba463533b6f4..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/MetaSim_audio.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import os, re, json
-import random
-import argparse
-import moviepy.editor as mp
-from os import path
-from pathlib import Path
-from typing import List
-from pyannote.audio import Audio
-from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
-from scipy.spatial.distance import cosine
-
-
-def extract_random_audio_segment(video_path: str, output_wav_path: str, duration: float = 5.0):
- print(video_path)
- video = mp.VideoFileClip(video_path)
- audio = video.audio
-
- total_duration = audio.duration
- if duration >= total_duration: start_time = 0
- else: start_time = random.uniform(0, total_duration - duration)
-
- audio_subclip = audio.subclip(start_time, start_time + duration)
- audio_subclip.write_audiofile(output_wav_path, codec='pcm_s16le', fps=16000)
-
-def compute_speaker_similarity(audio_path_1: str, audio_path_2: str, device: str = "cuda") -> float:
- embedding_model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb", device=device)
- audio_loader = Audio(sample_rate=16000)
-
- wav1, _ = audio_loader(audio_path_1)
- wav2, _ = audio_loader(audio_path_2)
-
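-    # keep only the first channel and add the batch dimension the embedding model expects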
- wav1 = wav1[0:1].unsqueeze(0)
- wav2 = wav2[0:1].unsqueeze(0)
-
- embedding1 = embedding_model(wav1)
- embedding2 = embedding_model(wav2)
- embedding1 = embedding1.reshape(embedding1.shape[1])
- embedding2 = embedding2.reshape(embedding2.shape[1])
-
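-    # cosine similarity between the two speaker embeddings (closer to 1 = more similar voices)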
- similarity = 1 - cosine(embedding1, embedding2)
- return similarity
-
-
-def get_audio_sim_score(gen_video_path, gt_video_path):
- extract_random_audio_segment(gen_video_path, gen_video_path.replace('.mp4', '.wav'), duration=5)
- extract_random_audio_segment(gt_video_path, gt_video_path.replace('.mp4', '.wav'), duration=5)
- similarity = compute_speaker_similarity(gen_video_path.replace('.mp4', '.wav'),
- gt_video_path.replace('.mp4', '.wav'))
- return similarity
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("-r", "--result_dir", default="/path/to/result_dir")
- parser.add_argument("-g", "--gt_dir", default="/path/to/gt_dir")
- parser.add_argument("-s", "--save_dir", default="/path/to/save_dir")
- args = parser.parse_args()
-
- ## load exist result if have
- save_dir = args.save_dir
- save_dir = path.join(save_dir, path.basename(args.result_dir))
- save_path = path.join(save_dir, "audio_sim.json")
- os.makedirs(save_dir, exist_ok=True)
- if path.exists(save_path):
- with open(save_path, 'r') as f: audio_similarity_list = json.load(f)
- else: audio_similarity_list = []
-
- ## path
- gt_dir, result_dir = args.gt_dir, args.result_dir
- groundtruth_list = sort_by_leading_number([path.join(gt_dir, name) for name in os.listdir(gt_dir)])
- result_list = sort_by_leading_number([path.join(result_dir, name) for name in os.listdir(result_dir)])
-
- for index in range(len(audio_similarity_list), 40):
- if path.basename(args.result_dir) == "paper2video":
- p2v_video_path = path.join(result_list[index], "3_merage.mp4")
- elif path.basename(args.result_dir) == "veo3":
- p2v_video_path = path.join(result_list[index])
- else:
- p2v_video_path = path.join(result_list[index], "result.mp4")
- if path.exists(p2v_video_path) is False: continue
- gt_video_path = path.join(groundtruth_list[index], "gt_presentation_video.mp4")
- if path.exists(gt_video_path) is False: continue
- print(p2v_video_path, gt_video_path)
- similarity = get_audio_sim_score(p2v_video_path, gt_video_path)
- audio_similarity_list.append({
- "data_idx": index,
- "score": similarity.item()
- })
- print(audio_similarity_list)
- with open(save_path, 'w') as f: json.dump(audio_similarity_list, f, indent=4)
-
- # import numpy as np
- # avg = np.average(similarity_all)
- # var = np.var(similarity_all)
- # print(avg, var)
\ No newline at end of file
diff --git a/Paper2Video/src/evaluation/MetaSim_content.py b/Paper2Video/src/evaluation/MetaSim_content.py
deleted file mode 100644
index f4913e0919a3777a334a9b741fe277efd9048e3a..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/MetaSim_content.py
+++ /dev/null
@@ -1,144 +0,0 @@
-import os, re, pdb, json
-from PIL import Image
-import pytesseract
-
-import whisperx
-import argparse
-import torch
-import numpy as np
-from os import path
-from pathlib import Path
-from typing import List
-from camel.models import ModelFactory
-from camel.types import ModelType, ModelPlatformType
-from camel.configs import GeminiConfig
-
-
-os.environ.setdefault("GEMINI_API_KEY", "")  # expects the key to be exported (see README)
-prompt_path = "./prompt/content_sim_score.txt"
-
-agent_config = {
- "model_type": ModelType.GEMINI_2_5_FLASH,
- "model_config": GeminiConfig().as_dict(),
- "model_platform": ModelPlatformType.GEMINI,}
-actor_model = ModelFactory.create(
- model_platform=agent_config['model_platform'],
- model_type=agent_config['model_type'],
- model_config_dict=agent_config['model_config'],)
-
-def extract_slide_texts(slide_dir):
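-    # OCR each slide image (in filename order) and collect the recognized text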
- slide_texts = []
- for fname in sorted(os.listdir(slide_dir)):
- if fname.lower().endswith(('.png', '.jpg', '.jpeg')):
- path = os.path.join(slide_dir, fname)
- text = pytesseract.image_to_string(Image.open(path))
- slide_texts.append(text.strip())
- return slide_texts
-
-def load_subtitles(sub_path):
- with open(sub_path, "r") as f:
- lines = f.readlines()
- return [line.strip() for line in lines if line.strip()]
-
-def build_prompt(slides_1, subs_1, slides_2, subs_2):
- prompt = (
- "Human Presentation:\n"
- "Slides:\n" + "\n".join(slides_1) + "\n"
- "Subtitles:\n" + "\n".join(subs_1) + "\n\n"
- "Generated Presentation:\n"
- "Slides:\n" + "\n".join(slides_2) + "\n"
- "Subtitles:\n" + "\n".join(subs_2) + "\n\n")
- return prompt
-
-def run_similarity_eval(slide_dir_1, slide_dir_2, sub_path_1, sub_path_2):
- slides_1 = extract_slide_texts(slide_dir_1)
- slides_2 = extract_slide_texts(slide_dir_2)
- subs_1 = load_subtitles(sub_path_1)
- subs_2 = load_subtitles(sub_path_2)
-
-    with open(prompt_path, 'r') as f: prompt = f.read()
-    prompt_q = build_prompt(slides_1, subs_1, slides_2, subs_2)
-    prompt = prompt + '\n' + prompt_q
-
- output = actor_model.run([{"role": "user", "content": prompt}])
- print("=== Similarity Evaluation ===\n")
- print(output.choices[0].message.content)
- return output.choices[0].message.content
-
-def extract_plain_subtitle_with_whisperx(video_path: str, output_path: str, model_name: str = "large-v3", language: str = "en"):
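-    # transcribe the video's audio track with WhisperX, writing one subtitle segment per line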
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model = whisperx.load_model(model_name, device=device, language=language)
-
- audio = whisperx.load_audio(video_path)
- result = model.transcribe(audio, batch_size=16)
-
- with open(output_path, "w") as f:
- for seg in result["segments"]:
- f.write(seg["text"].strip() + "\n")
-
-def extract_similarity_scores(text):
- content_match = re.search(r"Content Similarity:\s*(\d+)/5", text)
- if content_match:
- content_score = int(content_match.group(1))
- return content_score
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("-r", "--result_dir", default="/path/to/result_dir")
- parser.add_argument("-g", "--gt_dir", default="/path/to/gt_dir")
- parser.add_argument("-s", "--save_dir", default="/path/to/save_dir")
- args = parser.parse_args()
-
- ## load exist result if have
- save_dir = args.save_dir
- save_dir = path.join(save_dir, path.basename(args.result_dir))
- save_path = path.join(save_dir, "content_sim.json")
- os.makedirs(save_dir, exist_ok=True)
- if path.exists(save_path):
- with open(save_path, 'r') as f: content_sim_list = json.load(f)
- else: content_sim_list = []
-
- ## path
- gt_dir, result_dir = args.gt_dir, args.result_dir
- groundtruth_list = sort_by_leading_number([path.join(gt_dir, name) for name in os.listdir(gt_dir)])
- result_list = sort_by_leading_number([path.join(result_dir, name) for name in os.listdir(result_dir)])
-
- ## eval
- for index in range(25, 100):
- # video -> subtitle
- if path.basename(args.result_dir) == "paper2video":
- p2v_video_path = path.join(result_list[index], "3_merage.mp4")
- if path.exists(p2v_video_path) is False: continue
- else:
- p2v_video_path = path.join(result_list[index], "result.mp4")
- if path.exists(p2v_video_path) is False: continue
- gt_video_path = path.join(groundtruth_list[index], "gt_presentation_video.mp4")
- extract_plain_subtitle_with_whisperx(gt_video_path, gt_video_path.replace(".mp4", "_sub.txt"))
- extract_plain_subtitle_with_whisperx(p2v_video_path, p2v_video_path.replace(".mp4", "_sub.txt"))
-
- # slide dir
- gt_slide_dir = path.join(groundtruth_list[index], "slide_imgs")
- p2v_slide_dir = path.join(result_list[index], "slide_imgs")
-
- # eval
- result = run_similarity_eval(
- slide_dir_1=gt_slide_dir,
- slide_dir_2=p2v_slide_dir,
- sub_path_1=gt_video_path.replace(".mp4", "_sub.txt"),
- sub_path_2=p2v_video_path.replace(".mp4", "_sub.txt"))
- content_score = extract_similarity_scores(result)
- content_sim_list.append({
- "data_idx": index,
- "score": content_score
- })
-
- with open(save_path, 'w') as f: json.dump(content_sim_list, f)
\ No newline at end of file
diff --git a/Paper2Video/src/evaluation/PresentArena.py b/Paper2Video/src/evaluation/PresentArena.py
deleted file mode 100644
index dfe79f7bbf87880ccae2b50525a8e9fa19b98255..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentArena.py
+++ /dev/null
@@ -1,106 +0,0 @@
-'''
-    Use a video LLM (Gemini) as the judge.
-'''
-import os, re, json
-import time
-import argparse
-import google.generativeai as genai
-from os import path
-from typing import List
-from pathlib import Path
-from tqdm import tqdm
-
-
-genai.configure(api_key=os.environ.get("GEMINI_API_KEY", ""))  # expects GEMINI_API_KEY to be exported (see README)
-def eval_gemini(gt_vid_path, gen_vid_path):
- model = genai.GenerativeModel("models/gemini-2.5-pro")
- gt_vid = genai.upload_file(path=gt_vid_path, mime_type="video/mp4")
- gen_vid = genai.upload_file(path=gen_vid_path, mime_type="video/mp4")
- while True:
- refreshed_1 = genai.get_file(gt_vid.name)
- refreshed_2 = genai.get_file(gen_vid.name)
- if refreshed_1.state.name == "ACTIVE" and refreshed_2.state.name == "ACTIVE": break
- elif refreshed_1.state.name == "FAILED" or refreshed_2.state.name == "FAILED":
- #raise RuntimeError("❌ File processing failed.")
- return None
- else:
- print(f"waiting 5 seconds...")
- time.sleep(5)
-
-    prompt_path = "./prompt/which_is_better.txt"
-    with open(prompt_path, 'r') as f: prompt = f.read()
- print("Sending prompt to Gemini...")
- response = model.generate_content([prompt, refreshed_1, refreshed_2])
- print("\n===== Evaluation Result =====")
- print(response.text)
- print("=============================\n")
-
- return response.text
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("-r", "--result_dir", default="/path/to/result_dir")
- parser.add_argument("-g", "--gt_dir", default="/path/to/gt_dir")
- parser.add_argument("-s", "--save_dir", default="/path/to/save_dir")
- args = parser.parse_args()
-
- ## load exist result if have
- save_dir = args.save_dir
-    save_dir = path.join(save_dir, path.basename(args.result_dir))
-
- save_path = path.join(save_dir, "video_arena.json")
- os.makedirs(save_dir, exist_ok=True)
- if path.exists(save_path):
- with open(save_path, 'r') as f: arena_score_list = json.load(f)
- else: arena_score_list = []
-
- ## path
- gt_dir, result_dir = args.gt_dir, args.result_dir
- groundtruth_list = sort_by_leading_number([path.join(gt_dir, name) for name in os.listdir(gt_dir)])
- result_list = sort_by_leading_number([path.join(result_dir, name) for name in os.listdir(result_dir)])
-
- ## Generated v.s GT (1)
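-    ## Each pair is judged twice with the order swapped to cancel position bias;
-    ## the generated video earns 0.5 per win, so 1.0 means it won both orderings.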
-    for index in tqdm(range(len(result_list))):
- if path.basename(args.result_dir) == "paper2video":
- test_video_path = path.join(result_list[index], "3_merage.mp4")
- elif path.basename(args.result_dir) == 'veo3':
- test_video_path = result_list[index]
- else:
- test_video_path = path.join(result_list[index], "result.mp4")
-
- if path.exists(test_video_path) is False: continue
- gt_video_path = path.join(groundtruth_list[index], "gt_presentation_video.mp4")
- if path.exists(gt_video_path) is False:
- gt_video_path = path.join(groundtruth_list[index], "raw_video.mp4")
- if path.exists(gt_video_path) is False: continue
- result = eval_gemini(gt_video_path, test_video_path)
- if result is None: continue
-
-        pat = r"\[(?:A|B)\]"
-        m = re.findall(pat, result, flags=re.I)
-        score = 0
-        if m and m[0][1].upper() == "B": score += 1
-
- result = eval_gemini(test_video_path, gt_video_path)
- if result is None: continue
-
-        pat = r"\[(?:A|B)\]"
-        m = re.findall(pat, result, flags=re.I)
-        if m and m[0][1].upper() == "A": score += 1
-
- arena_score_list.append({
- "data_idx": index,
- "score": score/2
- })
- with open(save_path, 'w') as f: json.dump(arena_score_list, f, indent=4)
diff --git a/Paper2Video/src/evaluation/PresentQuiz/PresentQuiz.py b/Paper2Video/src/evaluation/PresentQuiz/PresentQuiz.py
deleted file mode 100644
index 4d3c0647843abe57897c4aee4a79ee5e650deee7..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/PresentQuiz.py
+++ /dev/null
@@ -1,264 +0,0 @@
-import os
-import re
-import json
-import yaml
-import argparse
-import numpy as np
-from os import path
-from pathlib import Path
-from typing import List
-from collections import defaultdict
-from moviepy.editor import VideoFileClip
-from jinja2 import Environment, StrictUndefined
-
-from camel.agents import ChatAgent
-from camel.messages import BaseMessage
-from camel.models import ModelFactory
-from camel.types import ModelPlatformType, ModelType
-from camel.configs import GeminiConfig
-from utils.wei_utils import account_token
-from utils.src.utils import get_json_from_response
-
-
-os.environ.setdefault("GEMINI_API_KEY", "")  # expects the key to be exported (see README)
-
-def compute_accuracy(predicted, ground_truth, aspects):
-    """
-    Parameters
-    ----------
-    predicted : dict
-        {question: {'answer': <letter>, 'reference': ...}, ...}
-    ground_truth : dict
-        {question: '<letter>. full answer', ...}
-    aspects : dict
-        {question: '<aspect>', ...}
-
-    Returns
-    -------
-    overall_accuracy : float
-    aspect_summary : dict
-        {
-            '<aspect>': {
-                'total': <int>,      # questions in this aspect
-                'correct': <int>,    # correctly answered questions
-                'accuracy': <float>  # correct / total (0–1)
-            },
-            ...
-        }
-    """
- correct_global = 0
- total_global = len(ground_truth)
-
- total_by_aspect = defaultdict(int)
- correct_by_aspect = defaultdict(int)
-
- for q, pred_info in predicted.items():
- letter_pred = pred_info['answer']
- aspect = aspects.get(q, 'Unknown')
- total_by_aspect[aspect] += 1
-
- if q in ground_truth:
- letter_gt = ground_truth[q].split('.')[0].strip()
-
- if len(letter_pred) > 0:
- letter_pred = letter_pred[0].upper()
- if letter_pred == letter_gt:
- correct_global += 1
- correct_by_aspect[aspect] += 1
-
- overall_accuracy = correct_global / total_global if total_global else 0.0
-
- # Build the per-aspect dictionary
- aspect_summary = {}
- for aspect, total in total_by_aspect.items():
- correct = correct_by_aspect[aspect]
- acc = correct / total if total else 0.0
- aspect_summary[aspect] = {
- 'total': total,
- 'correct': correct,
- 'accuracy': acc
- }
-
- return overall_accuracy, aspect_summary
-
-def eval_qa_get_answer(video_input, questions, answers, aspects, agent_config, input_type='video'):
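-    # Render the prompt for this input type, attach the video, query the VLM,
-    # then grade the parsed answers against the ground truth (overall + per-aspect).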
- agent_name = f'answer_question_from_{input_type}'
- with open(f"prompt/{agent_name}.yaml", "r") as f: config = yaml.safe_load(f)
-
- actor_model = ModelFactory.create(
- model_platform=agent_config['model_platform'],
- model_type=agent_config['model_type'],
- model_config_dict=agent_config['model_config'],)
-
- actor_sys_msg = config['system_prompt']
-
- actor_agent = ChatAgent(system_message=actor_sys_msg, model=actor_model, message_window_size=None,)
- actor_agent.reset()
-
- jinja_env = Environment(undefined=StrictUndefined)
- template = jinja_env.from_string(config["template"])
- with open(video_input, "rb") as f: video_bytes = f.read()
- if input_type == 'video':
- prompt = template.render(**{'questions': questions,})
-
- clip = VideoFileClip(video_input)
- duration = clip.duration
- msg = BaseMessage.make_user_message(
- role_name="User",
- content=prompt+"The video length is {}, you should NOT reference the timesteps if it exceeds video length".format(str(duration)),
- video_bytes=video_bytes,
- video_detail="low")
- response = actor_agent.step(msg)
- agent_answers = get_json_from_response(response.msgs[0].content)
-
- input_token, output_token = account_token(response)
- accuracy, aspect_accuracy = compute_accuracy(agent_answers, answers, aspects)
- return accuracy, aspect_accuracy, agent_answers, input_token, output_token
-
-def run_qa_metric(question_path, video_path, result_path, test_model):
- if test_model == "gemini":
- agent_config = {
- "model_type": ModelType.GEMINI_2_5_FLASH,
- "model_config": GeminiConfig().as_dict(),
- "model_platform": ModelPlatformType.GEMINI,
- }
- overall_qa_result = {"qa_result": {}}
-
- qa_dict = json.load(open(question_path, 'r'))
- detail_qa, understanding_qa = qa_dict['detail'], qa_dict['understanding']
-    input_token_all, output_token_all = 0, 0
- detail_accuracy, detail_aspect_accuracy, detail_agent_answers, input_token, output_token = eval_qa_get_answer(
- video_input=video_path,
- questions=detail_qa['questions'],
- answers=detail_qa['answers'],
- aspects=detail_qa['aspects'],
- agent_config=agent_config,
- input_type='video')
- input_token_all += input_token
- output_token_all += output_token
- understanding_accuracy, understanding_aspect_accuracy, understanding_agent_answers, input_token, output_token = eval_qa_get_answer(
- video_input=video_path,
- questions=understanding_qa['questions'],
- answers=understanding_qa['answers'],
- aspects=understanding_qa['aspects'],
- agent_config=agent_config,
- input_type='video')
- input_token_all += input_token
- output_token_all += output_token
- overall_qa_result['qa_result'][test_model] = {
- 'detail_accuracy': detail_accuracy,
- 'detail_aspect_accuracy': detail_aspect_accuracy,
- 'detail_agent_answers': detail_agent_answers,
- 'understanding_accuracy': understanding_accuracy,
- 'understanding_aspect_accuracy': understanding_aspect_accuracy,
- 'understanding_agent_answers': understanding_agent_answers}
- all_models_in_file = list(overall_qa_result['qa_result'].keys())
- detail_accs = []
- understanding_accs = []
- for m in all_models_in_file:
- detail_accs.append(overall_qa_result['qa_result'][m]['detail_accuracy'])
- understanding_accs.append(overall_qa_result['qa_result'][m]['understanding_accuracy'])
-
- avg_detail_accuracy = float(np.mean(detail_accs)) if detail_accs else 0.0
- avg_understanding_accuracy = float(np.mean(understanding_accs)) if understanding_accs else 0.0
-
- overall_qa_result['avg_detail_accuracy'] = avg_detail_accuracy
- overall_qa_result['avg_understanding_accuracy'] = avg_understanding_accuracy
-
- # Finally, overwrite the same JSON file with the updated results
- with open(result_path, 'w') as f: json.dump(overall_qa_result, f, indent=4)
- print(detail_accuracy, detail_aspect_accuracy, detail_agent_answers, input_token, output_token)
-
-_num_at_start = re.compile(r'^\s*["\']?(\d+)')
-def sort_by_leading_number(paths: List[str]) -> List[str]:
- def key(p: str):
- name = Path(p).name
- m = _num_at_start.match(name)
- return (int(m.group(1)) if m else float('inf'), name)
- return sorted(paths, key=key)
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("-r", "--result_dir", default="/path/to/result")
- parser.add_argument("-g", "--data_dir", default="/path/to/data")
- parser.add_argument("-s", "--save_dir", default="/path/to/data")
- args = parser.parse_args()
- ## mkdirs
- save_dir = args.save_dir
-    save_dir = path.join(save_dir, path.basename(args.result_dir))
-
- save_path = path.join(save_dir, "qa_result")
- os.makedirs(save_dir, exist_ok=True)
- os.makedirs(save_path, exist_ok=True)
-
- ## run test
- gt_dir, result_dir = args.data_dir, args.result_dir
- groundtruth_list = sort_by_leading_number([path.join(gt_dir, name) for name in os.listdir(gt_dir)])
- if path.basename(args.result_dir) == "human_made": result_list = [] # from dataset
- else: result_list = sort_by_leading_number([path.join(result_dir, name) for name in os.listdir(result_dir)])
-
-    start, end = 1, 100
-    without_presenter_flag = False  # assumed default: evaluate the full variant with the presenter
- for index in range(start, end):
- qa_json_path = path.join(groundtruth_list[index], "4o-mini_qa.json")
-
- ## paper2video
- if path.basename(args.result_dir) == 'paper2video':
- if without_presenter_flag is False:
- test_video_path = path.join(result_list[index], "3_merage.mp4")
- else:
- test_video_path = path.join(result_list[index], "1_merage.mp4")
- if path.exists(test_video_path) is False: continue
- ## human made as baseline
- elif path.basename(args.result_dir) == 'human_made':
- test_video_path = path.join(groundtruth_list[index], "gt_presentation_video.mp4")
- if path.exists(test_video_path) is False:
- test_video_path = path.join(groundtruth_list[index], "raw_video.mp4")
- ## veo3
- elif path.basename(args.result_dir) == 'veo3':
- test_video_path = result_list[index]
- elif path.basename(args.result_dir) == 'wan2.1':
- test_video_path = path.join(result_list[index], "result.mp4")
- ## presentagent
- else:
- test_video_path = path.join(result_list[index], "result.mp4")
- if path.exists(test_video_path) is False: continue
- result_save_path = path.join(save_path, "qa_result_{}.json".format(index))
- print("start")
- run_qa_metric(qa_json_path, test_video_path, result_save_path, 'gemini')
\ No newline at end of file
diff --git a/Paper2Video/src/evaluation/PresentQuiz/create_paper_questions.py b/Paper2Video/src/evaluation/PresentQuiz/create_paper_questions.py
deleted file mode 100644
index 487793657d599beea560ac4bb89cf1000c64ecee..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/create_paper_questions.py
+++ /dev/null
@@ -1,47 +0,0 @@
-from utils.poster_eval_utils import *
-import argparse
-import os
-import json
-
-
-os.environ["OPENAI_API_KEY"] = ""
-
-
-if __name__ == '__main__':
- parser = argparse.ArgumentParser()
-    parser.add_argument('--paper_folder', type=str, default='/path/to/data')
- parser.add_argument('--model_name', type=str, default='4o')
- args = parser.parse_args()
-
- paper_text = get_poster_text(os.path.join(args.paper_folder, 'pdf', 'paper.pdf'))
-
- if args.model_name == '4o':
- model_type = ModelType.GPT_4O
- elif args.model_name == 'o3':
- model_type = ModelType.O3
-    elif args.model_name == 'gemini':
-        model_type = ModelType.GEMINI_2_5_PRO
-    else:
-        raise ValueError(f"Unknown model name: {args.model_name}")
-
- detail_qa = get_questions(paper_text, 'detail', model_type)
- understanding_qa = get_questions(paper_text, 'understanding', model_type)
-
- detail_q, detail_a, detail_aspects = get_answers_and_remove_answers(detail_qa)
- understanding_q, understanding_a, understanding_aspects = get_answers_and_remove_answers(understanding_qa)
-
- final_qa = {}
- detail_qa = {
- 'questions': detail_q,
- 'answers': detail_a,
- 'aspects': detail_aspects,
- }
-
- understanding_qa = {
- 'questions': understanding_q,
- 'answers': understanding_a,
- 'aspects': understanding_aspects,
- }
- final_qa['detail'] = detail_qa
- final_qa['understanding'] = understanding_qa
-
- with open(os.path.join(args.paper_folder, f'{args.model_name}_qa.json'), 'w') as f:
- json.dump(final_qa, f, indent=4)
\ No newline at end of file
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/__init__.py b/Paper2Video/src/evaluation/PresentQuiz/docling/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/__init__.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/abstract_backend.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/abstract_backend.py
deleted file mode 100644
index 491330b36f71c364fe96695fcfaa3ab752bac1e2..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/abstract_backend.py
+++ /dev/null
@@ -1,63 +0,0 @@
-from abc import ABC, abstractmethod
-from io import BytesIO
-from pathlib import Path
-from typing import TYPE_CHECKING, Set, Union
-
-from docling_core.types.doc import DoclingDocument
-
-if TYPE_CHECKING:
- from docling.datamodel.base_models import InputFormat
- from docling.datamodel.document import InputDocument
-
-
-class AbstractDocumentBackend(ABC):
- @abstractmethod
- def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
- self.file = in_doc.file
- self.path_or_stream = path_or_stream
- self.document_hash = in_doc.document_hash
- self.input_format = in_doc.format
-
- @abstractmethod
- def is_valid(self) -> bool:
- pass
-
- @classmethod
- @abstractmethod
- def supports_pagination(cls) -> bool:
- pass
-
- def unload(self):
- if isinstance(self.path_or_stream, BytesIO):
- self.path_or_stream.close()
-
- self.path_or_stream = None
-
- @classmethod
- @abstractmethod
- def supported_formats(cls) -> Set["InputFormat"]:
- pass
-
-
-class PaginatedDocumentBackend(AbstractDocumentBackend):
-    """PaginatedDocumentBackend.
-
-    A paginated document backend is a backend that exposes the document's
-    page structure, e.g. its page count.
-    """
-
- @abstractmethod
- def page_count(self) -> int:
- pass
-
-
-class DeclarativeDocumentBackend(AbstractDocumentBackend):
- """DeclarativeDocumentBackend.
-
- A declarative document backend is a backend that can transform to DoclingDocument
- straight without a recognition pipeline.
- """
-
- @abstractmethod
- def convert(self) -> DoclingDocument:
- pass
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/asciidoc_backend.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/asciidoc_backend.py
deleted file mode 100644
index 397bfc44b91666c24ee38b3191978698a923d0c3..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/asciidoc_backend.py
+++ /dev/null
@@ -1,430 +0,0 @@
-import logging
-import re
-from io import BytesIO
-from pathlib import Path
-from typing import Set, Union
-
-from docling_core.types.doc import (
- DocItemLabel,
- DoclingDocument,
- DocumentOrigin,
- GroupItem,
- GroupLabel,
- ImageRef,
- Size,
- TableCell,
- TableData,
-)
-
-from docling.backend.abstract_backend import DeclarativeDocumentBackend
-from docling.datamodel.base_models import InputFormat
-from docling.datamodel.document import InputDocument
-
-_log = logging.getLogger(__name__)
-
-
-class AsciiDocBackend(DeclarativeDocumentBackend):
- def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
- super().__init__(in_doc, path_or_stream)
-
- self.path_or_stream = path_or_stream
-
- try:
- if isinstance(self.path_or_stream, BytesIO):
- text_stream = self.path_or_stream.getvalue().decode("utf-8")
- self.lines = text_stream.split("\n")
- if isinstance(self.path_or_stream, Path):
- with open(self.path_or_stream, "r", encoding="utf-8") as f:
- self.lines = f.readlines()
- self.valid = True
-
- except Exception as e:
- raise RuntimeError(
- f"Could not initialize AsciiDoc backend for file with hash {self.document_hash}."
- ) from e
- return
-
- def is_valid(self) -> bool:
- return self.valid
-
- @classmethod
- def supports_pagination(cls) -> bool:
- return False
-
- def unload(self):
- return
-
- @classmethod
- def supported_formats(cls) -> Set[InputFormat]:
- return {InputFormat.ASCIIDOC}
-
- def convert(self) -> DoclingDocument:
- """
- Parses the ASCII into a structured document model.
- """
-
- origin = DocumentOrigin(
- filename=self.file.name or "file",
- mimetype="text/asciidoc",
- binary_hash=self.document_hash,
- )
-
- doc = DoclingDocument(name=self.file.stem or "file", origin=origin)
-
- doc = self._parse(doc)
-
- return doc
-
- def _parse(self, doc: DoclingDocument):
- """
-        Main function that orchestrates the parsing, dispatching each line to
-        handlers for the title, section headers, lists, tables, pictures,
-        captions, and plain text.
- """
-
- in_list = False
- in_table = False
-
- text_data: list[str] = []
- table_data: list[str] = []
- caption_data: list[str] = []
-
-        # Map each outline level to its current parent item / list indent.
-        parents: dict[int, Union[GroupItem, None]] = {}
-        indents: dict[int, Union[int, None]] = {}
-
- for i in range(0, 10):
- parents[i] = None
- indents[i] = None
-
- for line in self.lines:
-
- # Title
- if self._is_title(line):
- item = self._parse_title(line)
- level = item["level"]
-
- parents[level] = doc.add_text(
- text=item["text"], label=DocItemLabel.TITLE
- )
-
- # Section headers
- elif self._is_section_header(line):
- item = self._parse_section_header(line)
- level = item["level"]
-
- parents[level] = doc.add_heading(
- text=item["text"], level=item["level"], parent=parents[level - 1]
- )
- for k, v in parents.items():
- if k > level:
- parents[k] = None
-
- # Lists
- elif self._is_list_item(line):
-
- _log.debug(f"line: {line}")
- item = self._parse_list_item(line)
- _log.debug(f"parsed list-item: {item}")
-
- level = self._get_current_level(parents)
-
- if not in_list:
- in_list = True
-
- parents[level + 1] = doc.add_group(
- parent=parents[level], name="list", label=GroupLabel.LIST
- )
- indents[level + 1] = item["indent"]
-
- elif in_list and item["indent"] > indents[level]:
- parents[level + 1] = doc.add_group(
- parent=parents[level], name="list", label=GroupLabel.LIST
- )
- indents[level + 1] = item["indent"]
-
- elif in_list and item["indent"] < indents[level]:
-
- # print(item["indent"], " => ", indents[level])
- while item["indent"] < indents[level]:
- # print(item["indent"], " => ", indents[level])
- parents[level] = None
- indents[level] = None
- level -= 1
-
- doc.add_list_item(
- item["text"], parent=self._get_current_parent(parents)
- )
-
- elif in_list and not self._is_list_item(line):
- in_list = False
-
- level = self._get_current_level(parents)
- parents[level] = None
-
- # Tables
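-            # AsciiDoc tables are fenced by "|===" lines; rows between the fences match _is_table_line.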
- elif line.strip() == "|===" and not in_table: # start of table
- in_table = True
-
- elif self._is_table_line(line): # within a table
- in_table = True
- table_data.append(self._parse_table_line(line))
-
- elif in_table and (
- (not self._is_table_line(line)) or line.strip() == "|==="
- ): # end of table
-
- caption = None
- if len(caption_data) > 0:
- caption = doc.add_text(
- text=" ".join(caption_data), label=DocItemLabel.CAPTION
- )
-
- caption_data = []
-
- data = self._populate_table_as_grid(table_data)
- doc.add_table(
- data=data, parent=self._get_current_parent(parents), caption=caption
- )
-
- in_table = False
- table_data = []
-
- # Picture
- elif self._is_picture(line):
-
- caption = None
- if len(caption_data) > 0:
- caption = doc.add_text(
- text=" ".join(caption_data), label=DocItemLabel.CAPTION
- )
-
- caption_data = []
-
- item = self._parse_picture(line)
-
- size = None
- if "width" in item and "height" in item:
- size = Size(width=int(item["width"]), height=int(item["height"]))
-
- uri = None
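-                # Best-effort: local image paths become file: URIs; http(s) sources are not rewritten, so uri stays None for them.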
- if (
- "uri" in item
- and not item["uri"].startswith("http")
- and item["uri"].startswith("//")
- ):
- uri = "file:" + item["uri"]
- elif (
- "uri" in item
- and not item["uri"].startswith("http")
- and item["uri"].startswith("/")
- ):
- uri = "file:/" + item["uri"]
- elif "uri" in item and not item["uri"].startswith("http"):
- uri = "file://" + item["uri"]
-
- image = ImageRef(mimetype="image/png", size=size, dpi=70, uri=uri)
- doc.add_picture(image=image, caption=caption)
-
- # Caption
- elif self._is_caption(line) and len(caption_data) == 0:
- item = self._parse_caption(line)
- caption_data.append(item["text"])
-
- elif (
- len(line.strip()) > 0 and len(caption_data) > 0
- ): # allow multiline captions
- item = self._parse_text(line)
- caption_data.append(item["text"])
-
- # Plain text
- elif len(line.strip()) == 0 and len(text_data) > 0:
- doc.add_text(
- text=" ".join(text_data),
- label=DocItemLabel.PARAGRAPH,
- parent=self._get_current_parent(parents),
- )
- text_data = []
-
- elif len(line.strip()) > 0: # allow multiline texts
-
- item = self._parse_text(line)
- text_data.append(item["text"])
-
- if len(text_data) > 0:
- doc.add_text(
- text=" ".join(text_data),
- label=DocItemLabel.PARAGRAPH,
- parent=self._get_current_parent(parents),
- )
- text_data = []
-
- if in_table and len(table_data) > 0:
- data = self._populate_table_as_grid(table_data)
- doc.add_table(data=data, parent=self._get_current_parent(parents))
-
- in_table = False
- table_data = []
-
- return doc
-
-    def _get_current_level(self, parents):
-        for k, v in parents.items():
-            if v is None and k > 0:
-                return k - 1
-
-        return 0
-
-    def _get_current_parent(self, parents):
-        for k, v in parents.items():
-            if v is None and k > 0:
-                return parents[k - 1]
-
-        return None
-
- # ========= Title
- def _is_title(self, line):
- return re.match(r"^= ", line)
-
- def _parse_title(self, line):
- return {"type": "title", "text": line[2:].strip(), "level": 0}
-
- # ========= Section headers
- def _is_section_header(self, line):
- return re.match(r"^==+", line)
-
- def _parse_section_header(self, line):
- match = re.match(r"^(=+)\s+(.*)", line)
-
-        marker = match.group(1)  # The header marker (the run of '=' characters)
-        text = match.group(2)  # The actual text of the header
-
-        header_level = marker.count("=")  # number of '=' encodes the level
- return {
- "type": "header",
- "level": header_level - 1,
- "text": text.strip(),
- }
-
- # ========= Lists
- def _is_list_item(self, line):
- return re.match(r"^(\s)*(\*|-|\d+\.|\w+\.) ", line)
-
- def _parse_list_item(self, line):
- """Extract the item marker (number or bullet symbol) and the text of the item."""
-
- match = re.match(r"^(\s*)(\*|-|\d+\.)\s+(.*)", line)
- if match:
- indent = match.group(1)
- marker = match.group(2) # The list marker (e.g., "*", "-", "1.")
- text = match.group(3) # The actual text of the list item
-
-            return {
-                "type": "list_item",
-                "marker": marker,
-                "text": text.strip(),
-                "numbered": marker not in ("*", "-"),
-                "indent": len(indent),
-            }
- else:
- # Fallback if no match
- return {
- "type": "list_item",
- "marker": "-",
- "text": line,
- "numbered": False,
- "indent": 0,
- }
-
- # ========= Tables
- def _is_table_line(self, line):
- return re.match(r"^\|.*\|", line)
-
- def _parse_table_line(self, line):
- # Split table cells and trim extra spaces
- return [cell.strip() for cell in line.split("|") if cell.strip()]
-
- def _populate_table_as_grid(self, table_data):
-
- num_rows = len(table_data)
-
- # Adjust the table data into a grid format
- num_cols = max(len(row) for row in table_data)
-
- data = TableData(num_rows=num_rows, num_cols=num_cols, table_cells=[])
- for row_idx, row in enumerate(table_data):
- # Pad rows with empty strings to match column count
- # grid.append(row + [''] * (max_cols - len(row)))
-
- for col_idx, text in enumerate(row):
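-                # Cell spans are not parsed from AsciiDoc; every cell is emitted as 1x1.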
- row_span = 1
- col_span = 1
-
- cell = TableCell(
- text=text,
- row_span=row_span,
- col_span=col_span,
- start_row_offset_idx=row_idx,
- end_row_offset_idx=row_idx + row_span,
- start_col_offset_idx=col_idx,
- end_col_offset_idx=col_idx + col_span,
- col_header=False,
- row_header=False,
- )
- data.table_cells.append(cell)
-
- return data
-
- # ========= Pictures
- def _is_picture(self, line):
- return re.match(r"^image::", line)
-
- def _parse_picture(self, line):
- """
- Parse an image macro, extracting its path and attributes.
- Syntax: image::path/to/image.png[Alt Text, width=200, height=150, align=center]
- """
- mtch = re.match(r"^image::(.+)\[(.*)\]$", line)
- if mtch:
- picture_path = mtch.group(1).strip()
- attributes = mtch.group(2).split(",")
- picture_info = {"type": "picture", "uri": picture_path}
-
- # Extract optional attributes (alt text, width, height, alignment)
-            if attributes:
-                picture_info["alt"] = attributes[0].strip() if attributes[0] else ""
-                for attr in attributes[1:]:
-                    if "=" in attr:
-                        key, value = attr.split("=", 1)
-                        picture_info[key.strip()] = value.strip()
-
- return picture_info
-
- return {"type": "picture", "uri": line}
-
- # ========= Captions
- def _is_caption(self, line):
- return re.match(r"^\.(.+)", line)
-
- def _parse_caption(self, line):
- mtch = re.match(r"^\.(.+)", line)
- if mtch:
- text = mtch.group(1)
- return {"type": "caption", "text": text}
-
- return {"type": "caption", "text": ""}
-
- # ========= Plain text
- def _parse_text(self, line):
- return {"type": "text", "text": line.strip()}
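-
-
-# Usage sketch (illustrative, untested; assumes an InputDocument `in_doc`
-# prepared by docling's loader for an .adoc file):
-#
-#   backend = AsciiDocBackend(in_doc, Path("slides.adoc"))
-#   if backend.is_valid():
-#       doc = backend.convert()  # -> DoclingDocument
-#       print(doc.export_to_markdown())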
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_backend.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_backend.py
deleted file mode 100644
index 6d22127bbfcb129791cf43fbfa749e0e437ff58a..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_backend.py
+++ /dev/null
@@ -1,227 +0,0 @@
-import logging
-import random
-from io import BytesIO
-from pathlib import Path
-from typing import Iterable, List, Optional, Union
-
-import pypdfium2 as pdfium
-from docling_core.types.doc import BoundingBox, CoordOrigin, Size
-from docling_parse.pdf_parsers import pdf_parser_v1
-from PIL import Image, ImageDraw
-from pypdfium2 import PdfPage
-
-from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
-from docling.datamodel.base_models import Cell
-from docling.datamodel.document import InputDocument
-
-_log = logging.getLogger(__name__)
-
-
-class DoclingParsePageBackend(PdfPageBackend):
- def __init__(
- self, parser: pdf_parser_v1, document_hash: str, page_no: int, page_obj: PdfPage
- ):
- self._ppage = page_obj
- parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
-
- self.valid = "pages" in parsed_page
- if self.valid:
- self._dpage = parsed_page["pages"][0]
- else:
- _log.info(
- f"An error occurred when loading page {page_no} of document {document_hash}."
- )
-
- def is_valid(self) -> bool:
- return self.valid
-
- def get_text_in_rect(self, bbox: BoundingBox) -> str:
- if not self.valid:
- return ""
- # Find intersecting cells on the page
- text_piece = ""
- page_size = self.get_size()
- parser_width = self._dpage["width"]
- parser_height = self._dpage["height"]
-
-        scale = 1  # FIX: replace with a param in get_text_in_rect across backends (optional)
-
- for i in range(len(self._dpage["cells"])):
- rect = self._dpage["cells"][i]["box"]["device"]
- x0, y0, x1, y1 = rect
- cell_bbox = BoundingBox(
- l=x0 * scale * page_size.width / parser_width,
- b=y0 * scale * page_size.height / parser_height,
- r=x1 * scale * page_size.width / parser_width,
- t=y1 * scale * page_size.height / parser_height,
- coord_origin=CoordOrigin.BOTTOMLEFT,
- ).to_top_left_origin(page_height=page_size.height * scale)
-
- overlap_frac = cell_bbox.intersection_area_with(bbox) / cell_bbox.area()
-
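-            # Keep cells whose area lies mostly (>50%) inside the query box.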
- if overlap_frac > 0.5:
- if len(text_piece) > 0:
- text_piece += " "
- text_piece += self._dpage["cells"][i]["content"]["rnormalized"]
-
- return text_piece
-
- def get_text_cells(self) -> Iterable[Cell]:
- cells: List[Cell] = []
- cell_counter = 0
-
- if not self.valid:
- return cells
-
- page_size = self.get_size()
-
- parser_width = self._dpage["width"]
- parser_height = self._dpage["height"]
-
- for i in range(len(self._dpage["cells"])):
- rect = self._dpage["cells"][i]["box"]["device"]
- x0, y0, x1, y1 = rect
-
- if x1 < x0:
- x0, x1 = x1, x0
- if y1 < y0:
- y0, y1 = y1, y0
-
- text_piece = self._dpage["cells"][i]["content"]["rnormalized"]
- cells.append(
- Cell(
- id=cell_counter,
- text=text_piece,
- bbox=BoundingBox(
- # l=x0, b=y0, r=x1, t=y1,
- l=x0 * page_size.width / parser_width,
- b=y0 * page_size.height / parser_height,
- r=x1 * page_size.width / parser_width,
- t=y1 * page_size.height / parser_height,
- coord_origin=CoordOrigin.BOTTOMLEFT,
- ).to_top_left_origin(page_size.height),
- )
- )
- cell_counter += 1
-
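-        # Debug helper: draws each parsed cell's bounding box on a fresh page image.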
- def draw_clusters_and_cells():
- image = (
- self.get_page_image()
- ) # make new image to avoid drawing on the saved ones
- draw = ImageDraw.Draw(image)
- for c in cells:
- x0, y0, x1, y1 = c.bbox.as_tuple()
- cell_color = (
- random.randint(30, 140),
- random.randint(30, 140),
- random.randint(30, 140),
- )
- draw.rectangle([(x0, y0), (x1, y1)], outline=cell_color)
- image.show()
-
-        # Uncomment to visually inspect the parsed cells:
-        # draw_clusters_and_cells()
-
- return cells
-
- def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
- AREA_THRESHOLD = 0 # 32 * 32
-
- for i in range(len(self._dpage["images"])):
- bitmap = self._dpage["images"][i]
- cropbox = BoundingBox.from_tuple(
- bitmap["box"], origin=CoordOrigin.BOTTOMLEFT
- ).to_top_left_origin(self.get_size().height)
-
- if cropbox.area() > AREA_THRESHOLD:
- cropbox = cropbox.scaled(scale=scale)
-
- yield cropbox
-
- def get_page_image(
- self, scale: float = 1, cropbox: Optional[BoundingBox] = None
- ) -> Image.Image:
-
- page_size = self.get_size()
-
- if not cropbox:
- cropbox = BoundingBox(
- l=0,
- r=page_size.width,
- t=0,
- b=page_size.height,
- coord_origin=CoordOrigin.TOPLEFT,
- )
- padbox = BoundingBox(
- l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
- )
- else:
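-            # pdfium crops by margins from each edge (left, bottom, right, top), so convert the cropbox into those distances.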
- padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
- padbox.r = page_size.width - padbox.r
- padbox.t = page_size.height - padbox.t
-
- image = (
- self._ppage.render(
- scale=scale * 1.5,
- rotation=0, # no additional rotation
- crop=padbox.as_tuple(),
- )
- .to_pil()
- .resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
- ) # We resize the image from 1.5x the given scale to make it sharper.
-
- return image
-
- def get_size(self) -> Size:
- return Size(width=self._ppage.get_width(), height=self._ppage.get_height())
-
- def unload(self):
- self._ppage = None
- self._dpage = None
-
-
-class DoclingParseDocumentBackend(PdfDocumentBackend):
- def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
- super().__init__(in_doc, path_or_stream)
-
- self._pdoc = pdfium.PdfDocument(self.path_or_stream)
- self.parser = pdf_parser_v1()
-
- success = False
- if isinstance(self.path_or_stream, BytesIO):
- success = self.parser.load_document_from_bytesio(
- self.document_hash, self.path_or_stream
- )
- elif isinstance(self.path_or_stream, Path):
- success = self.parser.load_document(
- self.document_hash, str(self.path_or_stream)
- )
-
- if not success:
- raise RuntimeError(
- f"docling-parse could not load document with hash {self.document_hash}."
- )
-
- def page_count(self) -> int:
- return len(self._pdoc) # To be replaced with docling-parse API
-
- def load_page(self, page_no: int) -> DoclingParsePageBackend:
- return DoclingParsePageBackend(
- self.parser, self.document_hash, page_no, self._pdoc[page_no]
- )
-
- def is_valid(self) -> bool:
- return self.page_count() > 0
-
- def unload(self):
- super().unload()
- self.parser.unload_document(self.document_hash)
- self._pdoc.close()
- self._pdoc = None
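-
-
-# Usage sketch (illustrative, untested; `in_doc` is an InputDocument for a PDF):
-#
-#   backend = DoclingParseDocumentBackend(in_doc, Path("paper.pdf"))
-#   if backend.is_valid():
-#       page = backend.load_page(0)
-#       cells = list(page.get_text_cells())
-#       backend.unload()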
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_v2_backend.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_v2_backend.py
deleted file mode 100644
index 27a368f92e11a26041a701012b96f875544385f0..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/docling_parse_v2_backend.py
+++ /dev/null
@@ -1,250 +0,0 @@
-import logging
-import random
-from io import BytesIO
-from pathlib import Path
-from typing import TYPE_CHECKING, Iterable, List, Optional, Union
-
-import pypdfium2 as pdfium
-from docling_core.types.doc import BoundingBox, CoordOrigin
-from docling_parse.pdf_parsers import pdf_parser_v2
-from PIL import Image, ImageDraw
-from pypdfium2 import PdfPage
-
-from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
-from docling.datamodel.base_models import Cell, Size
-
-if TYPE_CHECKING:
- from docling.datamodel.document import InputDocument
-
-_log = logging.getLogger(__name__)
-
-
-class DoclingParseV2PageBackend(PdfPageBackend):
- def __init__(
- self, parser: pdf_parser_v2, document_hash: str, page_no: int, page_obj: PdfPage
- ):
- self._ppage = page_obj
- parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
-
- self.valid = "pages" in parsed_page and len(parsed_page["pages"]) == 1
- if self.valid:
- self._dpage = parsed_page["pages"][0]
- else:
- _log.info(
- f"An error occurred when loading page {page_no} of document {document_hash}."
- )
-
- def is_valid(self) -> bool:
- return self.valid
-
- def get_text_in_rect(self, bbox: BoundingBox) -> str:
- if not self.valid:
- return ""
- # Find intersecting cells on the page
- text_piece = ""
- page_size = self.get_size()
-
- parser_width = self._dpage["sanitized"]["dimension"]["width"]
- parser_height = self._dpage["sanitized"]["dimension"]["height"]
-
-        scale = 1  # FIX: replace with a param in get_text_in_rect across backends (optional)
-
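-        # v2 output is columnar: "header" names the fields, "data" holds one row per cell.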
- cells_data = self._dpage["sanitized"]["cells"]["data"]
- cells_header = self._dpage["sanitized"]["cells"]["header"]
-
- for i, cell_data in enumerate(cells_data):
- x0 = cell_data[cells_header.index("x0")]
- y0 = cell_data[cells_header.index("y0")]
- x1 = cell_data[cells_header.index("x1")]
- y1 = cell_data[cells_header.index("y1")]
-
- cell_bbox = BoundingBox(
- l=x0 * scale * page_size.width / parser_width,
- b=y0 * scale * page_size.height / parser_height,
- r=x1 * scale * page_size.width / parser_width,
- t=y1 * scale * page_size.height / parser_height,
- coord_origin=CoordOrigin.BOTTOMLEFT,
- ).to_top_left_origin(page_height=page_size.height * scale)
-
- overlap_frac = cell_bbox.intersection_area_with(bbox) / cell_bbox.area()
-
- if overlap_frac > 0.5:
- if len(text_piece) > 0:
- text_piece += " "
- text_piece += cell_data[cells_header.index("text")]
-
- return text_piece
-
- def get_text_cells(self) -> Iterable[Cell]:
- cells: List[Cell] = []
- cell_counter = 0
-
- if not self.valid:
- return cells
-
- page_size = self.get_size()
-
- parser_width = self._dpage["sanitized"]["dimension"]["width"]
- parser_height = self._dpage["sanitized"]["dimension"]["height"]
-
- cells_data = self._dpage["sanitized"]["cells"]["data"]
- cells_header = self._dpage["sanitized"]["cells"]["header"]
-
- for i, cell_data in enumerate(cells_data):
- x0 = cell_data[cells_header.index("x0")]
- y0 = cell_data[cells_header.index("y0")]
- x1 = cell_data[cells_header.index("x1")]
- y1 = cell_data[cells_header.index("y1")]
-
- if x1 < x0:
- x0, x1 = x1, x0
- if y1 < y0:
- y0, y1 = y1, y0
-
- text_piece = cell_data[cells_header.index("text")]
- cells.append(
- Cell(
- id=cell_counter,
- text=text_piece,
- bbox=BoundingBox(
- # l=x0, b=y0, r=x1, t=y1,
- l=x0 * page_size.width / parser_width,
- b=y0 * page_size.height / parser_height,
- r=x1 * page_size.width / parser_width,
- t=y1 * page_size.height / parser_height,
- coord_origin=CoordOrigin.BOTTOMLEFT,
- ).to_top_left_origin(page_size.height),
- )
- )
- cell_counter += 1
-
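-        # Debug helper: draws each parsed cell's bounding box on a fresh page image.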
- def draw_clusters_and_cells():
- image = (
- self.get_page_image()
- ) # make new image to avoid drawing on the saved ones
- draw = ImageDraw.Draw(image)
- for c in cells:
- x0, y0, x1, y1 = c.bbox.as_tuple()
- cell_color = (
- random.randint(30, 140),
- random.randint(30, 140),
- random.randint(30, 140),
- )
- draw.rectangle([(x0, y0), (x1, y1)], outline=cell_color)
- image.show()
-
- # draw_clusters_and_cells()
-
- return cells
-
- def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
- AREA_THRESHOLD = 0 # 32 * 32
-
- images = self._dpage["sanitized"]["images"]["data"]
- images_header = self._dpage["sanitized"]["images"]["header"]
-
- for row in images:
- x0 = row[images_header.index("x0")]
- y0 = row[images_header.index("y0")]
- x1 = row[images_header.index("x1")]
- y1 = row[images_header.index("y1")]
-
- cropbox = BoundingBox.from_tuple(
- (x0, y0, x1, y1), origin=CoordOrigin.BOTTOMLEFT
- ).to_top_left_origin(self.get_size().height)
-
- if cropbox.area() > AREA_THRESHOLD:
- cropbox = cropbox.scaled(scale=scale)
-
- yield cropbox
-
- def get_page_image(
- self, scale: float = 1, cropbox: Optional[BoundingBox] = None
- ) -> Image.Image:
-
- page_size = self.get_size()
-
- if not cropbox:
- cropbox = BoundingBox(
- l=0,
- r=page_size.width,
- t=0,
- b=page_size.height,
- coord_origin=CoordOrigin.TOPLEFT,
- )
- padbox = BoundingBox(
- l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
- )
- else:
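-            # As in the v1 backend: convert the cropbox into pdfium's per-edge crop margins.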
- padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
- padbox.r = page_size.width - padbox.r
- padbox.t = page_size.height - padbox.t
-
- image = (
- self._ppage.render(
- scale=scale * 1.5,
- rotation=0, # no additional rotation
- crop=padbox.as_tuple(),
- )
- .to_pil()
- .resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
- ) # We resize the image from 1.5x the given scale to make it sharper.
-
- return image
-
- def get_size(self) -> Size:
- return Size(width=self._ppage.get_width(), height=self._ppage.get_height())
-
- def unload(self):
- self._ppage = None
- self._dpage = None
-
-
-class DoclingParseV2DocumentBackend(PdfDocumentBackend):
- def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
- super().__init__(in_doc, path_or_stream)
-
- self._pdoc = pdfium.PdfDocument(self.path_or_stream)
- self.parser = pdf_parser_v2("fatal")
-
- success = False
- if isinstance(self.path_or_stream, BytesIO):
- success = self.parser.load_document_from_bytesio(
- self.document_hash, self.path_or_stream
- )
- elif isinstance(self.path_or_stream, Path):
- success = self.parser.load_document(
- self.document_hash, str(self.path_or_stream)
- )
-
- if not success:
- raise RuntimeError(
- f"docling-parse v2 could not load document {self.document_hash}."
- )
-
- def page_count(self) -> int:
- len_1 = len(self._pdoc)
- len_2 = self.parser.number_of_pages(self.document_hash)
-
- if len_1 != len_2:
- _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")
-
- return len_2
-
- def load_page(self, page_no: int) -> DoclingParseV2PageBackend:
- return DoclingParseV2PageBackend(
- self.parser, self.document_hash, page_no, self._pdoc[page_no]
- )
-
- def is_valid(self) -> bool:
- return self.page_count() > 0
-
- def unload(self):
- super().unload()
- self.parser.unload_document(self.document_hash)
- self._pdoc.close()
- self._pdoc = None
diff --git a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/html_backend.py b/Paper2Video/src/evaluation/PresentQuiz/docling/backend/html_backend.py
deleted file mode 100644
index 286dfbfaedbfee4c058a70c86a2f1520712f7b69..0000000000000000000000000000000000000000
--- a/Paper2Video/src/evaluation/PresentQuiz/docling/backend/html_backend.py
+++ /dev/null
@@ -1,442 +0,0 @@
-import logging
-from io import BytesIO
-from pathlib import Path
-from typing import Optional, Set, Union
-
-from bs4 import BeautifulSoup, Tag
-from docling_core.types.doc import (
- DocItemLabel,
- DoclingDocument,
- DocumentOrigin,
- GroupLabel,
- TableCell,
- TableData,
-)
-
-from docling.backend.abstract_backend import DeclarativeDocumentBackend
-from docling.datamodel.base_models import InputFormat
-from docling.datamodel.document import InputDocument
-
-_log = logging.getLogger(__name__)
-
-
-class HTMLDocumentBackend(DeclarativeDocumentBackend):
- def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
- super().__init__(in_doc, path_or_stream)
- _log.debug("About to init HTML backend...")
- self.soup: Optional[Tag] = None
- # HTML file:
- self.path_or_stream = path_or_stream
- # Initialise the parents for the hierarchy
- self.max_levels = 10
- self.level = 0
- self.parents = {} # type: ignore
- for i in range(0, self.max_levels):
- self.parents[i] = None
- self.labels = {} # type: ignore
-
- try:
- if isinstance(self.path_or_stream, BytesIO):
- text_stream = self.path_or_stream.getvalue()
- self.soup = BeautifulSoup(text_stream, "html.parser")
- if isinstance(self.path_or_stream, Path):
- with open(self.path_or_stream, "rb") as f:
- html_content = f.read()
- self.soup = BeautifulSoup(html_content, "html.parser")
- except Exception as e:
- raise RuntimeError(
- f"Could not initialize HTML backend for file with hash {self.document_hash}."
- ) from e
-
- def is_valid(self) -> bool:
- return self.soup is not None
-
- @classmethod
- def supports_pagination(cls) -> bool:
- return False
-
- def unload(self):
- if isinstance(self.path_or_stream, BytesIO):
- self.path_or_stream.close()
-
- self.path_or_stream = None
-
- @classmethod
- def supported_formats(cls) -> Set[InputFormat]:
- return {InputFormat.HTML}
-
- def convert(self) -> DoclingDocument:
- # access self.path_or_stream to load stuff
- origin = DocumentOrigin(
- filename=self.file.name or "file",
- mimetype="text/html",
- binary_hash=self.document_hash,
- )
-
- doc = DoclingDocument(name=self.file.stem or "file", origin=origin)
- _log.debug("Trying to convert HTML...")
-
- if self.is_valid():
- assert self.soup is not None
- content = self.soup.body or self.soup
-            # Replace <br> tags with newline characters
- for br in content.find_all("br"):
- br.replace_with("\n")
- doc = self.walk(content, doc)
- else:
- raise RuntimeError(
- f"Cannot convert doc with {self.document_hash} because the backend failed to init."
- )
- return doc
-
-    def walk(self, element: Tag, doc: DoclingDocument):
-        try:
-            # Iterate over the child elements of the given tag
-            for idx, child in enumerate(element.children):
-                try:
-                    self.analyse_element(child, idx, doc)
-                except Exception as exc_child:
-                    _log.error("Error treating child: %s", exc_child)
-                    _log.error("Offending element: %s", child)
-                    raise exc_child
-        except Exception:
-            # Nodes without iterable children (e.g. plain text) end up here
-            pass
-
-        return doc
-
- def analyse_element(self, element: Tag, idx: int, doc: DoclingDocument):
-        """Dispatch an element to the handler for its tag type."""
-
- if element.name in self.labels:
- self.labels[element.name] += 1
- else:
- self.labels[element.name] = 1
-
- if element.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
- self.handle_header(element, idx, doc)
- elif element.name in ["p"]:
- self.handle_paragraph(element, idx, doc)
- elif element.name in ["pre"]:
- self.handle_code(element, idx, doc)
- elif element.name in ["ul", "ol"]:
- self.handle_list(element, idx, doc)
- elif element.name in ["li"]:
- self.handle_listitem(element, idx, doc)
- elif element.name == "table":
- self.handle_table(element, idx, doc)
- elif element.name == "figure":
- self.handle_figure(element, idx, doc)
- elif element.name == "img":
- self.handle_image(element, idx, doc)
- else:
- self.walk(element, doc)
-
- def get_direct_text(self, item: Tag):
- """Get the direct text of the element (ignoring nested lists)."""
- text = item.find(string=True, recursive=False)
- if isinstance(text, str):
- return text.strip()
-
- return ""
-
-    # Recursively extract the text content of an element and its children.
-    def extract_text_recursively(self, item: Tag) -> str:
-        if isinstance(item, str):
-            return item
-
-        result = []
-        if item.name not in ["ul", "ol"]:
-            try:
-                # Gather the text of every child node (and its tail)
-                for child in item:
-                    try:
-                        result.append(self.extract_text_recursively(child))
-                    except Exception:
-                        pass
-            except Exception:
-                _log.warning("item has no children")
-
-        return "".join(result) + " "
-
- def handle_header(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles header tags (h1, h2, etc.)."""
-        hlevel = int(element.name.replace("h", ""))
-        text = element.text.strip()
-
- if hlevel == 1:
- for key, val in self.parents.items():
- self.parents[key] = None
-
- self.level = 1
- self.parents[self.level] = doc.add_text(
- parent=self.parents[0], label=DocItemLabel.TITLE, text=text
- )
- else:
- if hlevel > self.level:
-
- # add invisible group
- for i in range(self.level + 1, hlevel):
- self.parents[i] = doc.add_group(
- name=f"header-{i}",
- label=GroupLabel.SECTION,
- parent=self.parents[i - 1],
- )
- self.level = hlevel
-
- elif hlevel < self.level:
-
- # remove the tail
- for key, val in self.parents.items():
- if key > hlevel:
- self.parents[key] = None
- self.level = hlevel
-
- self.parents[hlevel] = doc.add_heading(
- parent=self.parents[hlevel - 1],
- text=text,
- level=hlevel,
- )
-
- def handle_code(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles monospace code snippets (pre)."""
- if element.text is None:
- return
- text = element.text.strip()
- if len(text) == 0:
- return
- doc.add_code(parent=self.parents[self.level], text=text)
-
- def handle_paragraph(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles paragraph tags (p)."""
- if element.text is None:
- return
- text = element.text.strip()
- label = DocItemLabel.PARAGRAPH
- if len(text) == 0:
- return
- doc.add_text(parent=self.parents[self.level], label=label, text=text)
-
- def handle_list(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles list tags (ul, ol) and their list items."""
-
- if element.name == "ul":
- # create a list group
- self.parents[self.level + 1] = doc.add_group(
- parent=self.parents[self.level], name="list", label=GroupLabel.LIST
- )
- elif element.name == "ol":
- # create a list group
- self.parents[self.level + 1] = doc.add_group(
- parent=self.parents[self.level],
- name="ordered list",
- label=GroupLabel.ORDERED_LIST,
- )
- self.level += 1
-
- self.walk(element, doc)
-
- self.parents[self.level + 1] = None
- self.level -= 1
-
- def handle_listitem(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles listitem tags (li)."""
- nested_lists = element.find(["ul", "ol"])
-
- parent_list_label = self.parents[self.level].label
- index_in_list = len(self.parents[self.level].children) + 1
-
- if nested_lists:
- name = element.name
- # Text in list item can be hidden within hierarchy, hence
- # we need to extract it recursively
- text = self.extract_text_recursively(element)
- # Flatten text, remove break lines:
- text = text.replace("\n", "").replace("\r", "")
- text = " ".join(text.split()).strip()
-
- marker = ""
- enumerated = False
- if parent_list_label == GroupLabel.ORDERED_LIST:
- marker = str(index_in_list)
- enumerated = True
-
- if len(text) > 0:
- # create a list-item
- self.parents[self.level + 1] = doc.add_list_item(
- text=text,
- enumerated=enumerated,
- marker=marker,
- parent=self.parents[self.level],
- )
- self.level += 1
-
- self.walk(element, doc)
-
- self.parents[self.level + 1] = None
- self.level -= 1
-
- elif isinstance(element.text, str):
- text = element.text.strip()
-
- marker = ""
- enumerated = False
- if parent_list_label == GroupLabel.ORDERED_LIST:
-                marker = f"{index_in_list}."
- enumerated = True
- doc.add_list_item(
- text=text,
- enumerated=enumerated,
- marker=marker,
- parent=self.parents[self.level],
- )
- else:
-            _log.warning("list-item has no text: %s", element)
-
- def handle_table(self, element: Tag, idx: int, doc: DoclingDocument):
- """Handles table tags."""
-
- nested_tables = element.find("table")
- if nested_tables is not None:
-            _log.warning("detected nested tables: skipping for now")
- return
-
-        # Count the number of rows (number of <tr> elements)
- num_rows = len(element.find_all("tr"))
-
- # Find the number of columns (taking into account colspan)
- num_cols = 0
- for row in element.find_all("tr"):
- col_count = 0
- for cell in row.find_all(["td", "th"]):
- colspan = int(cell.get("colspan", 1))
- col_count += colspan
- num_cols = max(num_cols, col_count)
-
- grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]
-
- data = TableData(num_rows=num_rows, num_cols=num_cols, table_cells=[])
-
- # Iterate over the rows in the table
- for row_idx, row in enumerate(element.find_all("tr")):
-
-            # For each row, find all the column cells (both <td> and <th>)
- cells = row.find_all(["td", "th"])
-
-            # A row whose cells are all <th> marks a column-header row
-            col_header = True
-            for html_cell in cells:
-                if html_cell.name == "td":
-                    col_header = False
-
- col_idx = 0
- # Extract and print the text content of each cell
- for _, html_cell in enumerate(cells):
-
- text = html_cell.text
- try:
- text = self.extract_table_cell_text(html_cell)
-                except Exception as exc:
-                    _log.warning("exception: %s", exc)
-                    raise
-
- col_span = int(html_cell.get("colspan", 1))
- row_span = int(html_cell.get("rowspan", 1))
-
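-                # Skip grid slots already filled by a rowspan/colspan from an earlier cell.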
- while grid[row_idx][col_idx] is not None:
- col_idx += 1
- for r in range(row_span):
- for c in range(col_span):
- grid[row_idx + r][col_idx + c] = text
-
- cell = TableCell(
- text=text,
- row_span=row_span,
- col_span=col_span,
- start_row_offset_idx=row_idx,
- end_row_offset_idx=row_idx + row_span,
- start_col_offset_idx=col_idx,
- end_col_offset_idx=col_idx + col_span,
- col_header=col_header,
- row_header=((not col_header) and html_cell.name == "th"),
- )
- data.table_cells.append(cell)
-
- doc.add_table(data=data, parent=self.parents[self.level])
-
- def get_list_text(self, list_element: Tag, level=0):
-        """Recursively extract text from <ul> or <ol> with proper indentation."""