Experiment 23
This post describes a fork of the DE-ViT algorithm that adapts it for zero-shot object detection on satellite imagery, using DINOv3 vision transformers pretrained on the SAT-493M dataset, with the target application of detecting objects in the xView dataset.
Zero-Shot Object Detection for Overhead Imagery: A DE-ViT Adaptation
The DE-ViT Approach
DE-ViT (Detection Transformer with Vision Transformers) represents a significant advancement in few-shot object detection, leveraging the powerful feature representations learned by vision transformers. The approach combines a pretrained vision transformer backbone with a region propagation network to enable detection with minimal training examples. The core innovation lies in how DE-ViT propagates information between support (example) images and query (target) images through a sophisticated attention mechanism, allowing the model to generalize to novel object categories with limited data.
The original DE-ViT architecture employs a subspace projection mechanism to align features from the vision transformer backbone, followed by region proposal generation and a region propagation network that refines detections based on support set prototypes. For a comprehensive understanding of the theoretical foundations, architectural details, and empirical validations, readers are directed to the original research paper (arXiv:2309.12969).
Overhead Imagery: Unique Challenges and the xView Dataset
Overhead imagery presents distinct challenges that differentiate it from natural image domains. The most critical challenge is rotation invariance: objects in satellite and aerial imagery can appear at arbitrary orientations, unlike natural images where vehicles are typically upright and buildings follow gravity-aligned perspectives. A car photographed from above may appear at any angle from 0 to 360 degrees, and detection systems must recognize it regardless of orientation.
The xView dataset has become a cornerstone benchmark for evaluating object detection algorithms in overhead imagery. Released in 2018, xView contains over 1 million object instances across 60 categories in high-resolution satellite imagery, spanning diverse geographic regions and imaging conditions. The dataset was specifically designed to challenge computer vision systems with the complexities of overhead imagery: dense object clustering, extreme scale variation (objects ranging from small vehicles to large buildings), and the aforementioned rotation problem.
Historically, xView has driven innovation in several areas: anchor-free detection methods to handle arbitrary orientations, multi-scale feature pyramids to address the extreme scale variation, and attention mechanisms to manage dense scenes. However, most successful approaches have required extensive training on large labeled datasets, limiting their applicability in scenarios where labeled data is scarce or when adapting to new object categories.
Novel Contributions: Toward Zero-Shot Detection
This project introduces several key modifications to the original DE-ViT framework, fundamentally shifting from a few-shot to a zero-shot detection paradigm optimized for overhead imagery.
1. Removal of the Region Propagation Network
The most significant architectural change is the elimination of the region propagation component. While the original DE-ViT uses this network to refine detections through learned interactions between support and query features, this requires training on a base dataset. By removing this component, we transition to a true zero-shot scenario where no training is required. Instead, the system relies entirely on the rich feature representations from the DINOv3 vision transformer, which has been pretrained on 493 million satellite images (SAT-493M dataset). This pretrained knowledge serves as the foundation for detection without any task-specific fine-tuning.
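To make the resulting pipeline concrete, the sketch below assigns each candidate box to its nearest class prototype by cosine similarity and falls back to a background label when no prototype matches well. This is a minimal conceptual sketch rather than the project's actual code; the function name, tensor shapes, and similarity threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_boxes_to_prototypes(box_feats, prototypes, class_names,
                               sim_threshold=0.5):
    """Label candidate boxes by nearest prototype; no learned detection head.

    box_feats:  (R, D) frozen DINOv3 features pooled inside each proposal box.
    prototypes: (C, D) one prototype vector per class.
    Returns one (label, similarity) pair per box; weak matches become background.
    """
    sims = F.normalize(box_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    best_sim, best_cls = sims.max(dim=1)
    return [(class_names[c] if s >= sim_threshold else "background", float(s))
            for c, s in zip(best_cls.tolist(), best_sim)]

# Toy usage: random vectors stand in for pooled backbone features.
labels = assign_boxes_to_prototypes(torch.randn(4, 768), torch.randn(3, 768),
                                    ["building", "bus", "excavator"])
```

With the propagation network gone there is nothing left to train, so detection quality rests entirely on how discriminative the frozen features and the prototypes are.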
2. Differentiated Prototype Generation
The approach to prototype generation differs fundamentally between base and novel categories:
Base Data Prototypes: For established object categories in the base dataset, prototypes are generated using standard feature extraction and pooling techniques. These prototypes serve as canonical representations of each category, extracted from single-orientation examples under the assumption that the DINOv3 backbone has already learned rotation-invariant features during pretraining.
Novel Data Prototypes with Rotation Augmentation: For novel categories where rotation invariance cannot be assumed, the system employs a rotation augmentation strategy. Multiple rotated versions of support images are generated (at incremental angles), and features are extracted from each rotation. These multi-orientation features are then aggregated to create rotation-robust prototypes. This ensures that when a novel object appears at an arbitrary angle in a query image, it can be matched against prototypes that encode appearance information across all orientations.
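A minimal sketch of this rotation-augmented aggregation is shown below. It assumes a generic `extract_features` callable that maps an image crop to a single pooled feature vector (for example, mean-pooled DINOv3 patch tokens); that helper, the angle step, and the mean aggregation are illustrative assumptions rather than the project's exact implementation.

```python
import torch
from torchvision.transforms import functional as TF

@torch.no_grad()
def build_rotation_prototype(extract_features, support_crops, angle_step=30):
    """Aggregate features over rotated copies of the support crops.

    extract_features: callable mapping a (C, H, W) image tensor to a (D,) vector.
    support_crops:    list of (C, H, W) tensors cropped around support objects.
    angle_step:       rotation increment in degrees (30 -> 12 orientations).
    """
    feats = []
    for crop in support_crops:
        for angle in range(0, 360, angle_step):
            rotated = TF.rotate(crop, float(angle), expand=True)  # keep corners visible
            feats.append(extract_features(rotated))
    return torch.stack(feats).mean(dim=0)  # single rotation-robust prototype
```

A base-class prototype corresponds to the degenerate case of a single 0-degree rotation, i.e. plain feature extraction and pooling over the original orientation.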
3. Support Vector Machine Ensemble for Confidence Scoring
The final novel component is the integration of a Support Vector Machine (SVM) ensemble as a confidence gating mechanism. Rather than directly thresholding similarity scores between query features and prototypes, an ensemble of SVMs is trained to distinguish between genuine matches and false positives. Each SVM in the ensemble is trained on different feature subspaces or using different kernel functions, and their outputs are combined through a voting or weighted averaging scheme.
This ensemble approach provides several benefits: it captures non-linear decision boundaries that simple cosine similarity cannot, it reduces false positives by requiring consensus across multiple classifiers, and it produces calibrated confidence scores that better reflect true match probability. The SVM ensemble acts as a final filter, ensuring that only high-confidence detections propagate to the final output.
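One way such a gate could be assembled with scikit-learn is sketched below, with member SVMs using different kernels combined by soft voting; the feature construction, kernels, and hyperparameters here are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_match_gate():
    """Ensemble of SVMs voting on whether a box/prototype pair is a true match."""
    members = [
        ("linear", make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))),
        ("rbf",    make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
        ("poly",   make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, probability=True))),
    ]
    # Soft voting averages the members' probability estimates into one score.
    return VotingClassifier(estimators=members, voting="soft")

# Toy usage: X holds per-detection features (similarity scores, distances, ...),
# y marks genuine matches (1) versus false positives (0).
X, y = np.random.rand(200, 8), np.random.randint(0, 2, 200)
gate = build_match_gate().fit(X, y)
confidence = gate.predict_proba(X)[:, 1]  # keep detections above a chosen cutoff
```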
Experiment & Results
I will start by summarizing the results and then briefly describe the experiment. The results were disappointing. Despite the favorable results published in the paper, the author admits on the project's GitHub page that:
[DeVIT] tends to detect objects that do not belong to the prototypes, especially for retailed products that are not presented in the training data. For example, if you have “can”, “bottle”, and “toy” in the scene, but you only have “can” and “bottle” in the class prototypes. The ideal performance is to mark “toy” as background, but DE-ViT tends to detect “toy” as either “can” or “bottle”.
I noticed a similar trend. While the approach is promising for tapping into the power of foundation vision models for few-shot learning, it fails to provide good performance at scale for satellite imagery.
Experiment
I did not perform a formal experiment, but here is a rough description of the anecdotal test:
- 2 prototype base classes that cover over 70% of xView’s dataset (building & bus)
- 1 background base class
- 1 novel class for testing (excavator)
- 5 full-view satellite images with known excavators present
Conclusion
This project explores whether effective zero-shot object detection in overhead imagery can be achieved through careful architectural modifications and domain-specific adaptations. Removing the trainable region propagation network eliminates the need for training data and leaves the system relying entirely on strong pretrained features from DINOv3. The rotation-aware prototype generation strategy directly addresses the orientation challenge inherent to overhead imagery, and the SVM ensemble provides a confidence filter intended to maintain detection precision.
In practice, however, the anecdotal results above show that these pieces are not yet enough: like the original DE-ViT, the system struggles to reject objects that do not belong to any prototype, and performance on full satellite scenes remains poor. The approach is still a useful step toward detection systems for remote sensing that require no labeled training data, but it is not deployment-ready. Future work may explore adaptive rotation augmentation strategies, hierarchical prototype refinement, and extension to additional overhead imagery datasets beyond xView.
Code for this project can be found at https://github.com/dlfelps/devit-xview.
CLAUDE PROTIPS
In the last few blog posts I provided examples where CLAUDE needed extra guidance. I thought I would do something different in this post.
- PROTIP #1: When working in complex code bases (such as a research level machine learning project), work in small measured steps.
- PROTIP #2: After using CLAUDE to implement a new feature (or in this case replace an existing module with something else), ask CLAUDE to write a simple script to demonstrate/test the new feature. Optional: treat this as a temporary test - you don’t have to add it to git.
- PROTIP #3: If there is a problem that “feels small” and I think I have a good idea how to fix it, I just give CLAUDE the error and CLAUDE will usually fix it.
- PROTIP #4: If there is a problem that “feels big” and I don’t understand it or know how to fix it, I ask CLAUDE to write a debugging script to log values up to the point at which the error occurs. Then I review the output. Next I ask CLAUDE to fix the problem and then run the debugging script again. If all goes well and values make sense I rerun the original code.
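As a concrete illustration of PROTIP #4, the debugging script can be as simple as logging shapes and value ranges right up to the call that raises the error. The example below is hypothetical and uses made-up tensors; the point is the pattern, not the specific values.

```python
import logging
import torch

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")

def debug_prototype_matching(box_feats, prototypes):
    """Log everything leading up to the step that was failing, then run it."""
    logging.debug("box_feats: shape=%s dtype=%s", tuple(box_feats.shape), box_feats.dtype)
    logging.debug("prototypes: shape=%s dtype=%s", tuple(prototypes.shape), prototypes.dtype)
    logging.debug("box_feats range: min=%.4f max=%.4f",
                  box_feats.min().item(), box_feats.max().item())
    sims = box_feats @ prototypes.T  # the call that originally raised the error
    logging.debug("sims: shape=%s", tuple(sims.shape))
    return sims

debug_prototype_matching(torch.randn(4, 768), torch.randn(3, 768))
```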