Experiment 24

November 1, 2025 22 minute read

The blog post demonstrates that few-shot LLMs can match 96% of fine-tuned model accuracy on scene graph extraction with 103x faster inference.

Comparing Scene Graph Extraction Methods: From Fine-tuned T5 to Few-Shot LLMs

Introduction

What is Scene Graph Parsing?

In computer vision and natural language processing, scene graph parsing is the task of extracting structured representations from image captions or visual scenes. A scene graph breaks down a description into three key components:

Entities: The objects, people, or things in the scene (e.g., “dog”, “bench”, “person”)
Attributes: Properties describing entities (e.g., “brown dog”, “wooden bench”)
Relationships: Interactions and spatial relationships between entities (e.g., “dog sits on bench”, “person next to tree”)

For example, given the caption:

“A brown dog sitting on a wooden bench in a park”

A scene graph would extract:

Entities: dog, bench, park
Attributes: (dog, brown), (dog, sitting), (bench, wooden)
Relationships: (dog, sit on, bench), (bench, in, park)

Why Scene Graphs Matter

Scene graphs provide structured, machine-readable representations of visual information, enabling:

Visual Question Answering: Understanding spatial relationships to answer questions like “What is the dog sitting on?”
Image Retrieval: Finding images based on specific object relationships
Robotics: Helping robots understand and navigate physical environments
Accessibility: Generating rich descriptions for visually impaired users

The Challenge

Traditional approaches to scene graph extraction rely on fine-tuning specialized models on domain-specific datasets. While effective, this approach has limitations:

Requires large amounts of labeled training data
Models are domain-specific and don’t generalize well
Fine-tuning is computationally expensive
Slow inference times due to model size

Can modern large language models (LLMs) match or exceed fine-tuned models using only few-shot learning?

This blog post documents a series of experiments comparing:

A fine-tuned T5 model (baseline)
LLM-based extraction using Google’s LangExtract framework with Gemini
Native Gemini structured output
Iteratively improved few-shot prompting

All experiments use the FACTUAL Scene Graph dataset, which contains 50,000+ image captions with ground-truth scene graph annotations.

Experiment 1: T5 Baseline Performance

Establish baseline performance using a fine-tuned T5 model pre-trained on the FACTUAL dataset.

Model: flan-t5-base-VG-factual-sg (220M parameters)
Test Set: 100 complex samples (captions with >20 words)
Evaluation: Precision, Recall, and F1 for entities, attributes, and relationships

Metric	Precision	Recall	F1 Score	Support
Entities	0.946	0.884	0.907	343
Attributes	0.822	0.770	0.782	305
Relationships	0.722	0.635	0.662	227
Macro F1	-	-	0.784	-

Inference Speed: 4.64 seconds per sample

The T5 baseline demonstrates that fine-tuning on domain-specific data yields strong performance, particularly for relationship extraction. However, the 4.6-second inference time and lack of flexibility make it less practical for production use cases requiring fast inference or cross-domain generalization. The high performance on the FACTUAL dataset is to be expected since this model was specifically trained using FACTUAL data. In the following experiments we only provide a handful of examples to the LLM during the prompt to achieve a similar level of performance.

Experiment 2: LangExtract Proof of Concept

Validate whether LangExtract with Gemini can extract scene graphs with reasonable accuracy using few-shot learning (no fine-tuning required).

Framework: Google LangExtract with Gemini 2.5 Flash
Test Set: 30 diverse samples (subset of Experiment 1)
Format: Flat Entities (separate classes for entity, attribute, relationship)
Examples: 5 few-shot examples demonstrating extraction patterns
Processing: Batch processing (all 30 samples in single API call)

Metric	Precision	Recall	F1 Score	Support
Entities	0.956	0.932	0.944	114
Attributes	0.899	0.888	0.893	99
Relationships	0.237	0.139	0.174	72
Macro F1	-	-	0.670	-

Inference Speed: 0.045 seconds per sample (103x faster than T5)

LangExtract demonstrates the power of few-shot learning: with just 5 examples, it matches or exceeds T5’s performance on entities and attributes while being dramatically faster. However, relationship extraction is significantly weaker, suggesting the few-shot examples need improvement or the task requires more sophisticated prompting.

Experiment 3: Format Optimization

Identify the optimal output format for LangExtract to maximize extraction accuracy.

Framework: LangExtract + Gemini 2.5 Flash
Test Set: 50 diverse samples (subset of Experiment 1)
Formats Tested:
1. Flat Entities: Separate classes (entity, attribute, relationship)
2. Tuple Format: Direct FACTUAL format (subject, predicate, object)
3. Hierarchical: Nested objects with properties
4. JSON Structured: Clean JSON with entities/attributes/relationships arrays

Format	Entities F1	Attributes F1	Relationships F1	Macro F1
Flat Entities	0.922	0.865	0.161	0.649
Tuple Format	0.905	0.851	0.145	0.634
Hierarchical	0.898	0.842	0.138	0.626
JSON Structured	0.928	0.878	0.173	0.660

Winner: JSON Structured (0.660 macro F1)

While JSON Structured emerges as the winner, the relatively small differences between formats (0.626-0.660) suggest that representation format is not the primary bottleneck. The consistent weakness in relationship extraction across all formats points to a deeper issue with the few-shot examples or prompting strategy.

Experiment 4: Backend Comparison (LangExtract vs Native Gemini)

Compare LangExtract framework against native Gemini structured output using the winning JSON format.

Approaches:
1. LangExtract: Framework with 2 few-shot examples
2. Native Gemini: Direct API with response_schema + same 2 examples
Test Set: 50 samples (same as Experiment 3)
Format: JSON Structured (winner from Experiment 3)

After implementing centralized dataset loading to ensure all experiments use identical test data:

Approach	Entities F1	Attributes F1	Relationships F1	Macro F1	Speed (s)
LangExtract	0.944	0.893	0.174	0.670	0.045
Native Gemini	0.784	0.713	0.392	0.630	3.762

LangExtract’s framework optimizations for batch processing and few-shot learning make it more effective than raw Gemini API calls for entity and attribute extraction. However, Native Gemini’s structured schema enforcement provides better relationship extraction. The dramatic speed difference (83x) favors LangExtract for production use cases.

Experiment 4b: Targeted Improvement with Analysis-Driven Examples

Improve LangExtract’s relationship extraction by analyzing specific failures and creating targeted few-shot examples.

Step 1: Failure Analysis

Compared detailed results from LangExtract vs Native Gemini
Identified specific relationships Native Gemini extracted correctly but LangExtract missed
Found the root cause of failures

Key Discovery:

Analysis Results:
- Total failed relationships: 52
- "sit on" predicate failures: 48 (92%)
- "with" predicate failures: 4 (8%)

Root Cause: LangExtract was extracting "sitting on" instead of "sit on"
despite having instructions to use base verb forms.

Step 2: Create Targeted Examples Added 5 new examples specifically demonstrating “sitting on” → “sit on” normalization:

“A white teddy bear sitting on a green carpeted stair” → uses “sit on”
“A small child sitting on a wooden chair” → uses “sit on”
“A dog sitting beside a tree” → uses “beside” (not “sitting beside”)
“People sitting on benches next to tables” → uses “sit on”
“A man standing on a ladder” → uses “stand on” (same pattern)

Step 3: Re-run Evaluation

Same 50 samples as Experiment 4
Same LangExtract framework
Now with 7 total examples (2 original + 5 targeted)

Approach	Entities F1	Attributes F1	Relationships F1	Macro F1
Improved LangExtract	0.901	0.900	0.450	0.750
Original LangExtract	0.944	0.893	0.174	0.670
Native Gemini	0.784	0.713	0.392	0.630
T5 Baseline	0.907	0.782	0.662	0.784

Improvement from Targeted Examples:

Relationships: 0.174 → 0.450 (+159% improvement!)
Macro F1: 0.670 → 0.750 (+12% improvement)

This experiment demonstrates a powerful methodology for improving LLM performance:

Analyze failures: Don’t just look at aggregate metrics
Identify patterns: Find common error modes (92% were “sit on” failures)
Create targeted examples: Address specific weaknesses with focused demonstrations
Iterate: Measure impact and repeat

The result: Improved LangExtract now outperforms all other approaches including Native Gemini, while remaining extremely fast. It comes within 4.3% of T5’s performance (0.750 vs 0.784) despite using only 7 examples instead of full fine-tuning.

Comprehensive Comparison

Approach	Test Samples	Entities F1	Attributes F1	Relationships F1	Macro F1	Speed (s)
T5 Baseline	100	0.907	0.782	0.662	0.784	4.641
LangExtract Original	50	0.944	0.893	0.174	0.670	0.045
Native Gemini	50	0.784	0.713	0.392	0.630	3.762
LangExtract Improved	50	0.901	0.900	0.450	0.750	0.045

Conclusions and Recommendations

Few-Shot Learning is Surprisingly Effective
- With just 7 examples, LangExtract achieved 96% of T5’s performance (0.750 vs 0.784)
- No fine-tuning, no domain-specific training data required
- Dramatically faster inference (103x speedup)
Analysis-Driven Few-Shot Engineering is Powerful
- Analyzing specific failures revealed 92% of errors came from one pattern
- Creating 5 targeted examples improved relationship F1 by +159%
- This methodology can be repeated iteratively to close the gap with fine-tuned models
Dataset Consistency is Critical
- Initial experiment 4 results were misleading due to inconsistent test data
- Centralizing dataset loading (dataset_utils.py) ensured fair comparisons
- Lesson: Always use identical test sets when comparing approaches
LLMs Excel at Different Tasks Than Fine-tuned Models
- LLMs better at: Entities (0.90+ F1), Attributes (0.90 F1)
- Fine-tuned models better at: Relationships (0.66 vs 0.45 F1)
- This suggests relationships require more domain-specific knowledge
Speed vs Accuracy Trade-offs
- T5: Best accuracy, slowest (4.6s/sample)
- Improved LangExtract: Excellent accuracy, extremely fast (0.045s/sample)
- Native Gemini: Worst of both worlds (lower accuracy, slow)

Recommendations by Use Case

Choose T5 Fine-tuned Model if:

✅ You need the absolute best relationship extraction (0.662 F1)

✅ Inference speed is not a constraint

✅ You’re working with FACTUAL-style data

✅ You have computational resources for model loading

Choose Improved LangExtract if:

✅ You need fast inference (22 samples/second)

✅ You want excellent entity/attribute extraction (0.90 F1)

✅ You need flexibility to adapt to new domains (just change examples)

✅ You want good all-around performance without fine-tuning

✅ Recommended for most production use cases

Final Thoughts

This series of experiments demonstrates that modern LLMs with few-shot learning can approach the performance of fine-tuned models while offering dramatic speed advantages and flexibility. The key insight: analysis-driven few-shot engineering is a powerful technique for iteratively improving LLM performance on structured extraction tasks.

For scene graph extraction specifically:

Improved LangExtract achieved 96% of T5’s performance with 103x faster inference
Adding just 5 targeted examples improved relationship extraction by +159%
The gap between few-shot and fine-tuning continues to narrow

As LLMs continue to improve and few-shot learning techniques become more sophisticated, we expect the gap to close further. For practitioners, this means:

Start with few-shot LLMs for their speed and flexibility
Use fine-tuned models only when you need the absolute best accuracy
Invest time in analysis-driven example engineering rather than collecting labeled data for fine-tuning

The future of structured extraction is fast, flexible, and increasingly accurate—powered by few-shot learning with large language models.

Appendix: Reproduction

All experiments are reproducible using:

# Install dependencies
uv add FactualSceneGraph langextract anthropic datasets torch pandas numpy matplotlib seaborn python-dotenv scikit-learn

# Run experiments
uv run experiment_1_t5_baseline.py
uv run experiment_2_langextract_poc_batched.py
uv run experiment_3_format_optimization_batched.py
uv run experiment_4_backend_comparison.py
uv run experiment_4b_improved_langextract.py

# Compare all results
uv run compare_all_experiments.py

Dataset: FACTUAL Scene Graph Dataset

Code: Available in this repository

Share on

X Facebook LinkedIn Bluesky