	---
	license: apache-2.0
	base_model:
	- google/owlv2-base-patch16
	pipeline_tag: object-detection
	---
	# NoctOWL: Fine-Grained Open-Vocabulary Object Detector


	## Model Description

	NoctOWL (*Not only coarse-text OWL*) is an adaptation of OWL-ViT (NoctOWL) and OWLv2 (NoctOWLv2), designed for Fine-Grained Open-Vocabulary Detection (FG-OVD). Unlike standard open-vocabulary object detectors, which focus primarily on class-level recognition, NoctOWL enhances the ability to detect and distinguish fine-grained object attributes such as color, material, transparency, and pattern.

	It maintains a balanced trade-off between fine- and coarse-grained detection, making it particularly effective in scenarios requiring detailed object descriptions.

	The original code to train and evaluate the model is available in the [FG-OVD repository](https://github.com/lorebianchi98/FG-OVD/tree/main/benchmarks).

	### Model Variants
	- NoctOWL Base (`lorebianchi98/NoctOWL-base-patch16`)
	- NoctOWLv2 Base (`lorebianchi98/NoctOWLv2-base-patch16`)
	- NoctOWL Large (`lorebianchi98/NoctOWL-large-patch14`)
	- NoctOWLv2 Large (`lorebianchi98/NoctOWLv2-large-patch14`)
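
The Usage section pairs each Base checkpoint with the processor of the Google model it was adapted from. The sketch below extends that pairing to all four variants. `VARIANTS` and `load_variant` are hypothetical names, and the processor IDs for the Large variants are an assumption made by analogy with the Base ones; this card does not state them.

```python
# Hypothetical lookup table: variant -> (model repo, processor repo, model family).
VARIANTS = {
    "NoctOWL Base": ("lorebianchi98/NoctOWL-base-patch16", "google/owlvit-base-patch16", "owlvit"),
    "NoctOWLv2 Base": ("lorebianchi98/NoctOWLv2-base-patch16", "google/owlv2-base-patch16", "owlv2"),
    # Assumption: Large variants reuse the processors of the matching Google checkpoints.
    "NoctOWL Large": ("lorebianchi98/NoctOWL-large-patch14", "google/owlvit-large-patch14", "owlvit"),
    "NoctOWLv2 Large": ("lorebianchi98/NoctOWLv2-large-patch14", "google/owlv2-large-patch14", "owlv2"),
}


def load_variant(name):
    """Return (model, processor) for a variant name; downloads weights on first use."""
    # Imported lazily so the table can be inspected without transformers installed.
    from transformers import (
        OwlViTForObjectDetection, OwlViTProcessor,
        Owlv2ForObjectDetection, Owlv2Processor,
    )
    model_id, processor_id, family = VARIANTS[name]
    if family == "owlv2":
        return (Owlv2ForObjectDetection.from_pretrained(model_id),
                Owlv2Processor.from_pretrained(processor_id))
    return (OwlViTForObjectDetection.from_pretrained(model_id),
            OwlViTProcessor.from_pretrained(processor_id))
```

For example, `model, processor = load_variant("NoctOWLv2 Base")` would load the checkpoint this card describes.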

	## Usage

	### Loading the Model
	```python
	from transformers import OwlViTForObjectDetection, Owlv2ForObjectDetection, OwlViTProcessor, Owlv2Processor

	# Load NoctOWL model
	model = OwlViTForObjectDetection.from_pretrained("lorebianchi98/NoctOWL-base-patch16")
	processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")

	# Load NoctOWLv2 model
	model_v2 = Owlv2ForObjectDetection.from_pretrained("lorebianchi98/NoctOWLv2-base-patch16")
	processor_v2 = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
	```

	### Inference Example
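	```python
	from PIL import Image
	import torch

	# Load image (convert to RGB in case the file is grayscale or has an alpha channel)
	image = Image.open("example.jpg").convert("RGB")

	# Define text prompts (fine-grained descriptions)
	text_queries = ["a red patterned dress", "a dark brown wooden chair"]

	# Process inputs
	inputs = processor(images=image, text=text_queries, return_tensors="pt")

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)

	# Extract raw predictions
	logits = outputs.logits      # (batch, num_patches, num_queries), raw scores
	boxes = outputs.pred_boxes   # (batch, num_patches, 4), normalized (cx, cy, w, h)

	# Post-processing (score thresholding, box rescaling) can be applied to visualize results
	```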
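
The post-processing step mentioned in the last comment can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes the usual OWL-ViT output convention (one raw logit per patch and query, boxes as normalized center-format coordinates). The `postprocess` function, its parameters, and the default threshold are hypothetical. The `square_pad` switch reflects the assumption that OWLv2 preprocessing pads images to a square, so boxes are scaled by the longer side.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def postprocess(logits, boxes, image_size, threshold=0.1, square_pad=False):
    """Turn raw per-patch outputs for one image into pixel-space detections.

    logits: list of per-patch lists, one raw score per text query
    boxes: list of per-patch [cx, cy, w, h], normalized to [0, 1]
    image_size: (width, height) in pixels, e.g. PIL's image.size
    """
    width, height = image_size
    if square_pad:
        # Assumption: the image was padded to a square before resizing (OWLv2).
        width = height = max(width, height)
    detections = []
    for patch_logits, (cx, cy, w, h) in zip(logits, boxes):
        # Keep the best-matching query for this patch and map its logit to (0, 1).
        label = max(range(len(patch_logits)), key=patch_logits.__getitem__)
        score = sigmoid(patch_logits[label])
        if score > threshold:
            # Center format -> corner format, then normalized -> pixels.
            box = ((cx - w / 2) * width, (cy - h / 2) * height,
                   (cx + w / 2) * width, (cy + h / 2) * height)
            detections.append({"label": label, "score": score, "box": box})
    return detections
```

With the variables from the example above, a call could look like `postprocess(outputs.logits[0].tolist(), outputs.pred_boxes[0].tolist(), image.size)`, where each `label` indexes into `text_queries`. In practice the post-processing helpers shipped with the Transformers processors are likely the better choice.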

	## Results
	We report the mean Average Precision (mAP) on the Fine-Grained Open-Vocabulary Detection ([FG-OVD](https://lorebianchi98.github.io/FG-OVD/)) benchmarks across different difficulty levels, as well as performance on rare classes from the LVIS dataset.

	| Model | LVIS (Rare) | Trivial | Easy | Medium | Hard | Color | Material | Pattern | Transparency |
	|-------|-------------|---------|------|--------|------|-------|----------|---------|--------------|
	| OWL (B/16) | 20.6 | 53.9 | 38.4 | 39.8 | 26.2 | 45.3 | 37.3 | 26.6 | 34.1 |
	| OWL (L/14) | 31.2 | 65.1 | 44.0 | 39.3 | 26.5 | 43.8 | 44.9 | 36.0 | 29.2 |
	| OWLv2 (B/16) | 29.6 | 52.9 | 40.0 | 38.5 | 25.3 | 45.1 | 33.5 | 19.2 | 28.5 |
	| OWLv2 (L/14) | 34.9 | 63.2 | 42.8 | 41.2 | 25.4 | 53.3 | 36.9 | 23.3 | 12.2 |
	| NoctOWL (B/16) | 11.6 | 46.6 | 44.4 | 45.6 | 40.0 | 44.7 | 46.0 | 46.1 | 53.6 |
	| NoctOWL (L/14) | 26.0 | 57.4 | 54.2 | 54.8 | 48.6 | 53.1 | 56.9 | 49.8 | 57.2 |
	| NoctOWLv2 (B/16) | 17.5 | 48.3 | 49.1 | 47.1 | 42.1 | 46.8 | 48.2 | 42.2 | 50.2 |
	| NoctOWLv2 (L/14) | 27.2 | 57.5 | 55.5 | 57.2 | 50.2 | 55.6 | 57.0 | 49.2 | 55.9 |
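
To make the trade-off in the table concrete, the snippet below computes how much each NoctOWL variant gains or loses against its baseline on two columns, LVIS (Rare) and Hard. The numbers are copied from the table; `MAP` and `delta` are illustrative names only.

```python
# (LVIS rare mAP, Hard mAP) copied from the results table.
MAP = {
    "OWL (B/16)": (20.6, 26.2),
    "OWL (L/14)": (31.2, 26.5),
    "OWLv2 (B/16)": (29.6, 25.3),
    "OWLv2 (L/14)": (34.9, 25.4),
    "NoctOWL (B/16)": (11.6, 40.0),
    "NoctOWL (L/14)": (26.0, 48.6),
    "NoctOWLv2 (B/16)": (17.5, 42.1),
    "NoctOWLv2 (L/14)": (27.2, 50.2),
}


def delta(variant):
    """Return (LVIS rare change, Hard change) of a NoctOWL variant vs. its baseline."""
    # "NoctOWLv2 (B/16)" -> "OWLv2 (B/16)"
    baseline = variant.replace("Noct", "", 1)
    return tuple(round(fine - base, 1) for fine, base in zip(MAP[variant], MAP[baseline]))


for name in MAP:
    if name.startswith("Noct"):
        lvis, hard = delta(name)
        print(f"{name}: LVIS rare {lvis:+.1f}, Hard {hard:+.1f}")
```

For NoctOWLv2 (B/16), for instance, this gives -12.1 mAP on LVIS (Rare) and +16.8 mAP on the Hard benchmark relative to OWLv2 (B/16).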