Experiment 24
The blog post demonstrates that few-shot LLMs can match 96% of fine-tuned model accuracy on scene graph extraction with 103x faster inference.
Comparing Scene Graph Extraction Methods: From Fine-tuned T5 to Few-Shot LLMs
Introduction
What is Scene Graph Parsing?
In computer vision and natural language processing, scene graph parsing is the task of extracting structured representations from image captions or visual scenes. A scene graph breaks down a description into three key components:
- Entities: The objects, people, or things in the scene (e.g., “dog”, “bench”, “person”)
- Attributes: Properties describing entities (e.g., “brown dog”, “wooden bench”)
- Relationships: Interactions and spatial relationships between entities (e.g., “dog sits on bench”, “person next to tree”)
For example, given the caption:
“A brown dog sitting on a wooden bench in a park”
A scene graph would extract:
- Entities: dog, bench, park
- Attributes: (dog, brown), (dog, sitting), (bench, wooden)
- Relationships: (dog, sit on, bench), (bench, in, park)
Why Scene Graphs Matter
Scene graphs provide structured, machine-readable representations of visual information, enabling:
- Visual Question Answering: Understanding spatial relationships to answer questions like “What is the dog sitting on?”
- Image Retrieval: Finding images based on specific object relationships
- Robotics: Helping robots understand and navigate physical environments
- Accessibility: Generating rich descriptions for visually impaired users
The Challenge
Traditional approaches to scene graph extraction rely on fine-tuning specialized models on domain-specific datasets. While effective, this approach has limitations:
- Requires large amounts of labeled training data
- Models are domain-specific and don’t generalize well
- Fine-tuning is computationally expensive
- Slow inference times due to model size
Can modern large language models (LLMs) match or exceed fine-tuned models using only few-shot learning?
This blog post documents a series of experiments comparing:
- A fine-tuned T5 model (baseline)
- LLM-based extraction using Google’s LangExtract framework with Gemini
- Native Gemini structured output
- Iteratively improved few-shot prompting
All experiments use the FACTUAL Scene Graph dataset, which contains 50,000+ image captions with ground-truth scene graph annotations.
Experiment 1: T5 Baseline Performance
Establish baseline performance using a fine-tuned T5 model pre-trained on the FACTUAL dataset.
- Model:
flan-t5-base-VG-factual-sg
(220M parameters) - Test Set: 100 complex samples (captions with >20 words)
- Evaluation: Precision, Recall, and F1 for entities, attributes, and relationships
Metric | Precision | Recall | F1 Score | Support |
---|---|---|---|---|
Entities | 0.946 | 0.884 | 0.907 | 343 |
Attributes | 0.822 | 0.770 | 0.782 | 305 |
Relationships | 0.722 | 0.635 | 0.662 | 227 |
Macro F1 | - | - | 0.784 | - |
Inference Speed: 4.64 seconds per sample
The T5 baseline demonstrates that fine-tuning on domain-specific data yields strong performance, particularly for relationship extraction. However, the 4.6-second inference time and lack of flexibility make it less practical for production use cases requiring fast inference or cross-domain generalization. The high performance on the FACTUAL dataset is to be expected since this model was specifically trained using FACTUAL data. In the following experiments we only provide a handful of examples to the LLM during the prompt to achieve a similar level of performance.
Experiment 2: LangExtract Proof of Concept
Validate whether LangExtract with Gemini can extract scene graphs with reasonable accuracy using few-shot learning (no fine-tuning required).
- Framework: Google LangExtract with Gemini 2.5 Flash
- Test Set: 30 diverse samples (subset of Experiment 1)
- Format: Flat Entities (separate classes for entity, attribute, relationship)
- Examples: 5 few-shot examples demonstrating extraction patterns
- Processing: Batch processing (all 30 samples in single API call)
Metric | Precision | Recall | F1 Score | Support |
---|---|---|---|---|
Entities | 0.956 | 0.932 | 0.944 | 114 |
Attributes | 0.899 | 0.888 | 0.893 | 99 |
Relationships | 0.237 | 0.139 | 0.174 | 72 |
Macro F1 | - | - | 0.670 | - |
Inference Speed: 0.045 seconds per sample (103x faster than T5)
LangExtract demonstrates the power of few-shot learning: with just 5 examples, it matches or exceeds T5’s performance on entities and attributes while being dramatically faster. However, relationship extraction is significantly weaker, suggesting the few-shot examples need improvement or the task requires more sophisticated prompting.
Experiment 3: Format Optimization
Identify the optimal output format for LangExtract to maximize extraction accuracy.
- Framework: LangExtract + Gemini 2.5 Flash
- Test Set: 50 diverse samples (subset of Experiment 1)
- Formats Tested:
- Flat Entities: Separate classes (entity, attribute, relationship)
- Tuple Format: Direct FACTUAL format
(subject, predicate, object)
- Hierarchical: Nested objects with properties
- JSON Structured: Clean JSON with entities/attributes/relationships arrays
Format | Entities F1 | Attributes F1 | Relationships F1 | Macro F1 |
---|---|---|---|---|
Flat Entities | 0.922 | 0.865 | 0.161 | 0.649 |
Tuple Format | 0.905 | 0.851 | 0.145 | 0.634 |
Hierarchical | 0.898 | 0.842 | 0.138 | 0.626 |
JSON Structured | 0.928 | 0.878 | 0.173 | 0.660 |
Winner: JSON Structured (0.660 macro F1)
While JSON Structured emerges as the winner, the relatively small differences between formats (0.626-0.660) suggest that representation format is not the primary bottleneck. The consistent weakness in relationship extraction across all formats points to a deeper issue with the few-shot examples or prompting strategy.
Experiment 4: Backend Comparison (LangExtract vs Native Gemini)
Compare LangExtract framework against native Gemini structured output using the winning JSON format.
- Approaches:
- LangExtract: Framework with 2 few-shot examples
- Native Gemini: Direct API with
response_schema
+ same 2 examples
- Test Set: 50 samples (same as Experiment 3)
- Format: JSON Structured (winner from Experiment 3)
After implementing centralized dataset loading to ensure all experiments use identical test data:
Approach | Entities F1 | Attributes F1 | Relationships F1 | Macro F1 | Speed (s) |
---|---|---|---|---|---|
LangExtract | 0.944 | 0.893 | 0.174 | 0.670 | 0.045 |
Native Gemini | 0.784 | 0.713 | 0.392 | 0.630 | 3.762 |
LangExtract’s framework optimizations for batch processing and few-shot learning make it more effective than raw Gemini API calls for entity and attribute extraction. However, Native Gemini’s structured schema enforcement provides better relationship extraction. The dramatic speed difference (83x) favors LangExtract for production use cases.
Experiment 4b: Targeted Improvement with Analysis-Driven Examples
Improve LangExtract’s relationship extraction by analyzing specific failures and creating targeted few-shot examples.
Step 1: Failure Analysis
- Compared detailed results from LangExtract vs Native Gemini
- Identified specific relationships Native Gemini extracted correctly but LangExtract missed
- Found the root cause of failures
Key Discovery:
Analysis Results:
- Total failed relationships: 52
- "sit on" predicate failures: 48 (92%)
- "with" predicate failures: 4 (8%)
Root Cause: LangExtract was extracting "sitting on" instead of "sit on"
despite having instructions to use base verb forms.
Step 2: Create Targeted Examples Added 5 new examples specifically demonstrating “sitting on” → “sit on” normalization:
- “A white teddy bear sitting on a green carpeted stair” → uses “sit on”
- “A small child sitting on a wooden chair” → uses “sit on”
- “A dog sitting beside a tree” → uses “beside” (not “sitting beside”)
- “People sitting on benches next to tables” → uses “sit on”
- “A man standing on a ladder” → uses “stand on” (same pattern)
Step 3: Re-run Evaluation
- Same 50 samples as Experiment 4
- Same LangExtract framework
- Now with 7 total examples (2 original + 5 targeted)
Approach | Entities F1 | Attributes F1 | Relationships F1 | Macro F1 |
---|---|---|---|---|
Improved LangExtract | 0.901 | 0.900 | 0.450 | 0.750 |
Original LangExtract | 0.944 | 0.893 | 0.174 | 0.670 |
Native Gemini | 0.784 | 0.713 | 0.392 | 0.630 |
T5 Baseline | 0.907 | 0.782 | 0.662 | 0.784 |
Improvement from Targeted Examples:
- Relationships: 0.174 → 0.450 (+159% improvement!)
- Macro F1: 0.670 → 0.750 (+12% improvement)
This experiment demonstrates a powerful methodology for improving LLM performance:
- Analyze failures: Don’t just look at aggregate metrics
- Identify patterns: Find common error modes (92% were “sit on” failures)
- Create targeted examples: Address specific weaknesses with focused demonstrations
- Iterate: Measure impact and repeat
The result: Improved LangExtract now outperforms all other approaches including Native Gemini, while remaining extremely fast. It comes within 4.3% of T5’s performance (0.750 vs 0.784) despite using only 7 examples instead of full fine-tuning.
Comprehensive Comparison
Approach | Test Samples | Entities F1 | Attributes F1 | Relationships F1 | Macro F1 | Speed (s) |
---|---|---|---|---|---|---|
T5 Baseline | 100 | 0.907 | 0.782 | 0.662 | 0.784 | 4.641 |
LangExtract Original | 50 | 0.944 | 0.893 | 0.174 | 0.670 | 0.045 |
Native Gemini | 50 | 0.784 | 0.713 | 0.392 | 0.630 | 3.762 |
LangExtract Improved | 50 | 0.901 | 0.900 | 0.450 | 0.750 | 0.045 |
Conclusions and Recommendations
- Few-Shot Learning is Surprisingly Effective
- With just 7 examples, LangExtract achieved 96% of T5’s performance (0.750 vs 0.784)
- No fine-tuning, no domain-specific training data required
- Dramatically faster inference (103x speedup)
- Analysis-Driven Few-Shot Engineering is Powerful
- Analyzing specific failures revealed 92% of errors came from one pattern
- Creating 5 targeted examples improved relationship F1 by +159%
- This methodology can be repeated iteratively to close the gap with fine-tuned models
- Dataset Consistency is Critical
- Initial experiment 4 results were misleading due to inconsistent test data
- Centralizing dataset loading (dataset_utils.py) ensured fair comparisons
- Lesson: Always use identical test sets when comparing approaches
- LLMs Excel at Different Tasks Than Fine-tuned Models
- LLMs better at: Entities (0.90+ F1), Attributes (0.90 F1)
- Fine-tuned models better at: Relationships (0.66 vs 0.45 F1)
- This suggests relationships require more domain-specific knowledge
- Speed vs Accuracy Trade-offs
- T5: Best accuracy, slowest (4.6s/sample)
- Improved LangExtract: Excellent accuracy, extremely fast (0.045s/sample)
- Native Gemini: Worst of both worlds (lower accuracy, slow)
Recommendations by Use Case
Choose T5 Fine-tuned Model if:
✅ You need the absolute best relationship extraction (0.662 F1)
✅ Inference speed is not a constraint
✅ You’re working with FACTUAL-style data
✅ You have computational resources for model loading
Choose Improved LangExtract if:
✅ You need fast inference (22 samples/second)
✅ You want excellent entity/attribute extraction (0.90 F1)
✅ You need flexibility to adapt to new domains (just change examples)
✅ You want good all-around performance without fine-tuning
✅ Recommended for most production use cases
Final Thoughts
This series of experiments demonstrates that modern LLMs with few-shot learning can approach the performance of fine-tuned models while offering dramatic speed advantages and flexibility. The key insight: analysis-driven few-shot engineering is a powerful technique for iteratively improving LLM performance on structured extraction tasks.
For scene graph extraction specifically:
- Improved LangExtract achieved 96% of T5’s performance with 103x faster inference
- Adding just 5 targeted examples improved relationship extraction by +159%
- The gap between few-shot and fine-tuning continues to narrow
As LLMs continue to improve and few-shot learning techniques become more sophisticated, we expect the gap to close further. For practitioners, this means:
- Start with few-shot LLMs for their speed and flexibility
- Use fine-tuned models only when you need the absolute best accuracy
- Invest time in analysis-driven example engineering rather than collecting labeled data for fine-tuning
The future of structured extraction is fast, flexible, and increasingly accurate—powered by few-shot learning with large language models.
Appendix: Reproduction
All experiments are reproducible using:
# Install dependencies
uv add FactualSceneGraph langextract anthropic datasets torch pandas numpy matplotlib seaborn python-dotenv scikit-learn
# Run experiments
uv run experiment_1_t5_baseline.py
uv run experiment_2_langextract_poc_batched.py
uv run experiment_3_format_optimization_batched.py
uv run experiment_4_backend_comparison.py
uv run experiment_4b_improved_langextract.py
# Compare all results
uv run compare_all_experiments.py
Dataset: FACTUAL Scene Graph Dataset
Code: Available in this repository