---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/UbiquantAI/Fleming-R1-32B/blob/main/LICENSE
pipeline_tag: text-generation
---

# Fleming-VL-8B
<p align="center" style="margin: 0;">
  <a href="https://github.com/UbiquantAI/Fleming-R1" aria-label="GitHub Repository" style="text-decoration:none;">
    <span style="display:inline-flex;align-items:center;gap:.35em;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"
           width="16" height="16" aria-hidden="true"
           style="vertical-align:text-bottom;fill:currentColor;">
        <path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8Z"/>
      </svg>
      <span>GitHub</span>
    </span>
  </a>
  <span style="margin:0 .75em;opacity:.6;">•</span>
  <a href="https://arxiv.org/abs/2509.15279" aria-label="Paper">📑 Paper</a>
</p>
|
## 📖 Model Overview

Fleming-VL is a multimodal reasoning model for medical scenarios. It processes and analyzes several types of medical data, including 2D images, 3D volumetric data, and video sequences. The model analyzes complex multimodal medical problems step by step and produces reliable answers. Building on the GRPO (Group Relative Policy Optimization) reasoning paradigm, Fleming-VL extends these capabilities to diverse medical imaging modalities while maintaining strong reasoning performance.

**Model Features:**

* **Multimodal Processing:** Supports a range of medical data types, including 2D images (X-rays, pathology slides), 3D volumes (CT/MRI scans), and videos (ultrasound, endoscopy, surgical recordings).
* **Medical Reasoning:** Performs step-by-step chain-of-thought reasoning on complex medical problems, combining visual information with medical knowledge to provide reliable diagnostic insights.
## 📦 Releases

- **Fleming-VL-8B**: trained on InternVL3-8B
  🤗 [`UbiquantAI/Fleming-VL-8B`](https://huggingface.co/UbiquantAI/Fleming-VL-8B)
- **Fleming-VL-38B**: trained on InternVL3-38B
  🤗 [`UbiquantAI/Fleming-VL-38B`](https://huggingface.co/UbiquantAI/Fleming-VL-38B)
|
## 📊 Performance

### Main Benchmark Results

<div align="center">
  <img src="images/exp_result.png" alt="Benchmark Results" width="60%">
</div>
|
## 🔧 Quick Start

```python
"""
Fleming-VL-8B Multi-Modal Inference Script

This script demonstrates three inference modes:
1. Single image inference
2. Video inference (uniformly sampled frames)
3. 3D medical image (CT/MRI) inference from .npy files

Model: UbiquantAI/Fleming-VL-8B
Based on: InternVL_chat-1.2 template
"""

from transformers import AutoTokenizer, AutoModel, CLIPImageProcessor
from decord import VideoReader, cpu
from PIL import Image
import numpy as np
import shutil
import torch
import os


# ============================================================================
# Configuration
# ============================================================================

MODEL_PATH = "UbiquantAI/Fleming-VL-8B"
REQUIRED_FILES_DIR = './required_files'

# Prompt template for reasoning-based responses
REASONING_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user a concise "
    "final answer in a short word or phrase. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> "
    "tags, respectively, i.e., <think> reasoning process here </think>"
    "<answer> answer here </answer>"
)


# ============================================================================
# Utility Functions
# ============================================================================

def copy_necessary_files(target_path, source_path):
    """
    Copy required model configuration files to the model directory.

    Args:
        target_path: Destination directory (model path)
        source_path: Source directory containing required files
    """
    # MODEL_PATH may be a Hub repo ID rather than a local directory;
    # copying only makes sense for a local checkout of the model.
    if not os.path.isdir(target_path):
        print(f"{target_path} is not a local directory, skipping file copy")
        return

    required_files = [
        "modeling_internvl_chat.py",
        "conversation.py",
        "modeling_intern_vit.py",
        "preprocessor_config.json",
        "configuration_internvl_chat.py",
        "configuration_intern_vit.py",
    ]

    for filename in required_files:
        target_file = os.path.join(target_path, filename)
        source_file = os.path.join(source_path, filename)

        if not os.path.exists(target_file):
            print(f"File {filename} not found in target path, copying from source...")

            if os.path.exists(source_file):
                try:
                    shutil.copy2(source_file, target_file)
                    print(f"Successfully copied {filename}")
                except Exception as e:
                    print(f"Error copying {filename}: {str(e)}")
            else:
                print(f"Warning: Source file {filename} does not exist, cannot copy")
        else:
            print(f"File {filename} already exists")


def load_model(model_path, use_flash_attn=True):
    """
    Load the vision-language model and tokenizer.

    Args:
        model_path: Path to the pretrained model
        use_flash_attn: Whether to use flash attention (default: True)

    Returns:
        tuple: (model, tokenizer)
    """
    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=use_flash_attn,
        trust_remote_code=True
    ).eval().cuda()

    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        use_fast=False
    )

    return model, tokenizer


# ============================================================================
# Image Inference
# ============================================================================

def inference_single_image(model, tokenizer, image_path, question, prompt=REASONING_PROMPT):
    """
    Perform inference on a single image.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        image_path: Path to the input image
        question: Question to ask about the image
        prompt: System prompt template

    Returns:
        str: Model response
    """
    # Load and preprocess image; convert to RGB so grayscale or RGBA
    # inputs are handled the same way as the video frames below
    image_processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
    image = Image.open(image_path).convert('RGB').resize((448, 448))
    pixel_values = image_processor(
        images=image,
        return_tensors='pt'
    ).pixel_values.to(torch.bfloat16).cuda()

    # Prepare question with prompt and image token
    full_question = f"{prompt}\n<image>\n{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response = model.chat(tokenizer, pixel_values, full_question, generation_config)

    return response


# ============================================================================
# Video Inference
# ============================================================================

def get_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=32):
    """
    Calculate evenly distributed frame indices for video sampling.

    Args:
        bound: Tuple of (start_time, end_time) in seconds, or None for full video
        fps: Frames per second of the video
        max_frame: Maximum frame index
        first_idx: First frame index to consider
        num_segments: Number of frames to sample

    Returns:
        np.array: Array of frame indices
    """
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000

    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments

    # Take the frame at the midpoint of each equally sized segment
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])

    return frame_indices


def load_video(video_path, model_path, bound=None, num_segments=32):
    """
    Load and preprocess video frames.

    Args:
        video_path: Path to the video file
        model_path: Path to the model (for image processor)
        bound: Time boundary tuple (start, end) in seconds
        num_segments: Number of frames to extract

    Returns:
        tuple: (pixel_values tensor, list of num_patches per frame)
    """
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list = []
    num_patches_list = []
    image_processor = CLIPImageProcessor.from_pretrained(model_path)

    frame_indices = get_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=num_segments)

    for frame_index in frame_indices:
        # Extract and preprocess frame
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB').resize((448, 448))
        pixel_values = image_processor(images=img, return_tensors='pt').pixel_values
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)

    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list


def inference_video(model, tokenizer, video_path, video_duration, question, prompt=REASONING_PROMPT):
    """
    Perform inference on a video by sampling frames.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        video_path: Path to the video file
        video_duration: Duration of video in seconds
        question: Question to ask about the video
        prompt: System prompt template

    Returns:
        str: Model response
    """
    # Sample frames from video (1 frame per second); use at least one
    # segment so clips shorter than a second do not divide by zero
    num_segments = max(1, int(video_duration))
    pixel_values, num_patches_list = load_video(video_path, MODEL_PATH, num_segments=num_segments)
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Create image token prefix for all frames
    video_prefix = ''.join(['<image>\n' for _ in range(len(num_patches_list))])

    # Prepare question with prompt and image tokens
    full_question = f"{prompt}\n{video_prefix}{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response, history = model.chat(
        tokenizer,
        pixel_values,
        full_question,
        generation_config,
        num_patches_list=num_patches_list,
        history=None,
        return_history=True
    )

    return response


# ============================================================================
# 3D Medical Image (NPY) Inference
# ============================================================================

def normalize_image(image):
    """
    Normalize image array to 0-255 range.

    Args:
        image: NumPy array of image data

    Returns:
        np.array: Normalized image as uint8
    """
    img_min = np.min(image)
    img_max = np.max(image)

    # A constant image has no dynamic range; return all zeros
    if img_max - img_min == 0:
        return np.zeros_like(image, dtype=np.uint8)

    return ((image - img_min) / (img_max - img_min) * 255).astype(np.uint8)


def convert_npy_to_images(npy_path, model_path, num_slices=11):
    """
    Convert 3D medical image (.npy) to multiple 2D RGB images.

    Expected input shape: (32, 256, 256) or (1, 32, 256, 256)
    Extracts evenly distributed slices and converts to RGB format.

    Args:
        npy_path: Path to the .npy file
        model_path: Path to the model (for image processor)
        num_slices: Number of slices to extract (default: 11)

    Returns:
        tuple: (pixel_values tensor, list of num_patches per slice) or False if error
    """
    try:
        # Load .npy file
        data = np.load(npy_path)

        # Handle shape (1, 32, 256, 256) -> (32, 256, 256)
        if data.ndim == 4 and data.shape[0] == 1:
            data = data[0]

        # Validate shape
        if data.shape != (32, 256, 256):
            print(f"Warning: {npy_path} has shape {data.shape}, expected (32, 256, 256), skipping")
            return False

        # Select evenly distributed slices from 32 slices
        indices = np.linspace(0, 31, num_slices, dtype=int)

        image_processor = CLIPImageProcessor.from_pretrained(model_path)
        pixel_values_list = []
        num_patches_list = []

        # Process each selected slice
        for idx in indices:
            # Get slice
            slice_img = data[idx]

            # Normalize to 0-255
            normalized = normalize_image(slice_img)

            # Convert grayscale to RGB by stacking
            rgb_img = np.stack([normalized, normalized, normalized], axis=-1)

            # Convert to PIL Image
            img = Image.fromarray(rgb_img)

            # Preprocess with CLIP processor
            pixel_values = image_processor(images=img, return_tensors='pt').pixel_values
            num_patches_list.append(pixel_values.shape[0])
            pixel_values_list.append(pixel_values)

        pixel_values = torch.cat(pixel_values_list)
        return pixel_values, num_patches_list

    except Exception as e:
        print(f"Error processing {npy_path}: {str(e)}")
        return False


def inference_3d_medical_image(model, tokenizer, npy_path, question, prompt=REASONING_PROMPT):
    """
    Perform inference on 3D medical images stored as .npy files.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        npy_path: Path to the .npy file (shape: 32x256x256)
        question: Question to ask about the image
        prompt: System prompt template

    Returns:
        str: Model response or None if error
    """
    # Convert 3D volume to multiple 2D slices
    result = convert_npy_to_images(npy_path, MODEL_PATH)

    if result is False:
        return None

    pixel_values, num_patches_list = result
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Create image token prefix for all slices
    image_prefix = ''.join(['<image>\n' for _ in range(len(num_patches_list))])

    # Prepare question with prompt and image tokens
    full_question = f"{prompt}\n{image_prefix}{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response, history = model.chat(
        tokenizer,
        pixel_values,
        full_question,
        generation_config,
        num_patches_list=num_patches_list,
        history=None,
        return_history=True
    )

    return response


# ============================================================================
# Main Execution Examples
# ============================================================================

def main():
    """
    Main function demonstrating all three inference modes.
    """
    # Copy necessary files
    copy_necessary_files(MODEL_PATH, REQUIRED_FILES_DIR)

    # ========================================================================
    # Example 1: Single Image Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 1: Single Image Inference")
    print("="*80)

    image_path = "./test.png"
    question = (
        "What imaging technique was employed to obtain this picture?\n"
        "A. PET scan. B. CT scan. C. Blood test. D. Fundus imaging."
    )

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=True)
    response = inference_single_image(model, tokenizer, image_path, question)

    print(f"\nUser: {question}")
    print(f"Assistant: {response}")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()

    # ========================================================================
    # Example 2: Video Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 2: Video Inference")
    print("="*80)

    video_path = "./test.mp4"
    video_duration = 6  # seconds
    question = "Please describe the video."

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=False)
    response = inference_video(model, tokenizer, video_path, video_duration, question)

    print(f"\nUser: {question}")
    print(f"Assistant: {response}")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()

    # ========================================================================
    # Example 3: 3D Medical Image Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 3: 3D Medical Image Inference")
    print("="*80)

    npy_path = "./test.npy"
    question = "What device is observed on the chest wall?"

    # Example cases:
    # Case 1: /path/to/test_1016_d_2.npy
    #   Question: "Where is the largest lymph node observed?"
    #   Answer: "Right hilar region."
    #
    # Case 2: /path/to/test_1031_a_2.npy
    #   Question: "What device is observed on the chest wall?"
    #   Answer: "Pacemaker."

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=False)
    response = inference_3d_medical_image(model, tokenizer, npy_path, question)

    if response:
        print(f"\nUser: {question}")
        print(f"Assistant: {response}")
    else:
        print("\nError: Failed to process 3D medical image")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()


if __name__ == "__main__":
    main()
```
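The reasoning prompt in the script asks the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. The following is a minimal sketch of splitting such a response into the two parts. `parse_response` and `ParsedResponse` are hypothetical names introduced here for illustration and are not part of the model's API; the sample string is made up, and the helper assumes the model actually follows the tag format (it returns `None` for any part that is missing).

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    """Container for the two parts of a tagged response (hypothetical)."""
    reasoning: Optional[str]
    answer: Optional[str]


def parse_response(text: str) -> ParsedResponse:
    """Extract the <think> and <answer> contents from a response string.

    Each field is None when the corresponding tag pair is absent.
    """
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


# Made-up response string in the format the prompt requests
example = "<think> The image shows the retina. </think><answer> D </answer>"
parsed = parse_response(example)
print(parsed.answer)  # prints: D
```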
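To make the sampling arithmetic in the script concrete, here is a small worked sketch of the two index computations: the segment-midpoint rule used in `get_frame_indices` and the `np.linspace` slice selection used in `convert_npy_to_images`. The helper name `segment_midpoints` and the clip length (last frame index 180) are assumptions chosen for illustration; only NumPy is needed, so it runs without the model or a GPU.

```python
import numpy as np


def segment_midpoints(start_idx: int, end_idx: int, num_segments: int) -> np.ndarray:
    """Pick one frame index near the middle of each equal-length segment.

    Mirrors the formula in get_frame_indices for the case where the
    start and end indices are already known (hypothetical helper).
    """
    seg_size = float(end_idx - start_idx) / num_segments
    return np.array([
        int(start_idx + seg_size / 2 + np.round(seg_size * i))
        for i in range(num_segments)
    ])


# Hypothetical clip whose last frame index is 180, sampled into 6 segments:
# each segment spans 30 frames and contributes its midpoint.
frames = segment_midpoints(0, 180, 6)
print(frames.tolist())  # [15, 45, 75, 105, 135, 165]

# Slice selection for a 32-slice volume: 11 evenly spaced positions between
# the first (0) and last (31) slice, truncated to integer indices.
slices = np.linspace(0, 31, 11, dtype=int)
print(slices.tolist())  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 31]
```

Both rules spread the samples across the whole input. The video rule avoids the very first and last frames by taking segment midpoints, while the volume rule always includes both end slices.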
|
## ⚠️ Safety Statement

This project is for research and non-clinical reference only; it must not be used for actual diagnosis or treatment decisions.
The generated reasoning traces are an auditable intermediate process and do not constitute medical advice.
In medical scenarios, results must be reviewed and approved by qualified professionals, and all applicable laws, regulations, and privacy compliance requirements in your region must be followed.
|
## 📚 Citation

```bibtex
@misc{flemingr1,
  title={Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning},
  author={Chi Liu and Derek Li and Yan Shu and Robin Chen and Derek Duan and Teng Fang and Bryan Dai},
  year={2025},
  eprint={2509.15279},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.15279},
}
```
|