
This post demonstrates that reinforcement learning agents can be trained with sparse reward signals as effectively as carefully tuned dense rewards.

Optimizing Last-Mile Delivery with Reinforcement Learning: From Dense to Sparse Rewards

Introduction: The Problem of Last-Mile Delivery

Last-mile delivery—the final leg of a package’s journey from distribution center to customer—represents one of the most expensive and operationally challenging aspects of modern logistics. According to industry research, last-mile costs can account for 50-60% of total shipping expenses, and with the explosive growth of e-commerce, the problem has only become more acute.

The core challenge is deceptively simple to state: Given a fleet of vehicles with limited capacity, a stream of daily package arrivals with varying priorities and delivery deadlines, and geographic constraints on which truck can serve which zone, how do you decide which packages to deliver today and which to defer to maximize operational value while meeting customer expectations?

This is a constrained optimization problem that sits at the intersection of:

  • Vehicle Routing Problems (VRP): Classical logistics optimization
  • Knapsack Problems: Capacity-constrained selection
  • Temporal Scheduling: Deadline-aware decision making
  • Reinforcement Learning: Learning optimal policies from experience

Traditional approaches to logistics optimization have relied on integer linear programming, metaheuristics (genetic algorithms, simulated annealing), and handcrafted rule-based systems. But what if we could train an agent to learn better strategies through experience? And what if we could do it without designing a complex reward function?

That’s the premise of this project, and the surprising answer lies in sparse reward learning.


The Simulation Model

We model the delivery optimization problem as a capacity-constrained selection problem with 10 trucks, 10 zones, and 30 daily package arrivals:

Infrastructure:

  • Warehouse: Central depot (location 0)
  • Zones: 10 geographic zones, each with 4 delivery addresses (40 total addresses)
  • Trucks: 10 trucks, one assigned per zone
    • Capacity: 2 packages per truck per day
    • Total daily fleet capacity: 20 packages

Daily Arrivals:

  • Volume: 30 packages per day
  • Shortage: 33% capacity shortage (must defer 10 packages)
  • Priority: Bimodal distribution
    • 40% low-priority (1-3)
    • 40% high-priority (8-10)
    • 20% medium-priority (4-7)
  • Deadlines: 1-7 days from arrival
  • Destinations: Uniformly distributed across zones
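
To make the setup concrete, here is a minimal sketch of the arrival process in Python (the constant and class names are illustrative, not taken from the project's codebase):

import random
from dataclasses import dataclass

# Illustrative constants mirroring the setup above (names are hypothetical)
NUM_ZONES = 10
NUM_TRUCKS = 10            # one truck per zone
ADDRESSES_PER_ZONE = 4     # 40 delivery addresses in total
TRUCK_CAPACITY = 2         # packages per truck per day (fleet capacity: 20)
DAILY_ARRIVALS = 30        # 10 more than the fleet can deliver in a day

@dataclass
class Package:
    priority: int  # 1 (low) to 10 (high)
    deadline: int  # days from arrival, 1-7
    zone: int      # destination zone, 0-9

def sample_daily_arrivals(rng: random.Random) -> list:
    """Draw one day's packages with the bimodal priority mix described above."""
    packages = []
    for _ in range(DAILY_ARRIVALS):
        u = rng.random()
        if u < 0.4:
            priority = rng.randint(1, 3)    # 40% low priority
        elif u < 0.8:
            priority = rng.randint(8, 10)   # 40% high priority
        else:
            priority = rng.randint(4, 7)    # 20% medium priority
        packages.append(Package(priority=priority,
                                deadline=rng.randint(1, 7),
                                zone=rng.randrange(NUM_ZONES)))
    return packages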

The Core Decision Problem

Each day, each truck independently faces the decision:

“Of my available packages (new arrivals + deferred backlog), which 2 should I deliver today?”

This is non-trivial because:

  1. Capacity is limited: Can’t deliver everything
  2. Deadlines create urgency: Some packages expire soon
  3. Priorities are heterogeneous: High-priority packages are worth more
  4. Deferrals have costs: Deferred packages tie up future capacity
  5. Information is incomplete: Don’t know future arrivals

Algorithmic Solutions: From Dense to Sparse Rewards

1. Greedy Baseline

Algorithm: Deliver the highest-priority available packages, up to the capacity limit

For each truck:
  available_packages = deferred + new_arrivals
  deliver = sort_by_priority(available_packages)[:capacity]
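
A runnable version of this heuristic might look like the following (a sketch assuming package objects expose a priority attribute, as in the arrival sketch above):

def greedy_select(deferred: list, new_arrivals: list, capacity: int = 2) -> list:
    """Greedy baseline: deliver the highest-priority available packages, up to capacity."""
    available = deferred + new_arrivals
    # Rank by priority only; deadlines are ignored, which is this baseline's main weakness
    ranked = sorted(available, key=lambda p: p.priority, reverse=True)
    return ranked[:capacity]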

Characteristics:

  • Deterministic
  • No learning required
  • Simple to implement
  • Ranks by priority only (ignores deadline pressure)

Performance Baseline:

  • Average daily reward: 805.7
  • On-time delivery: 75.6%
  • No variance (deterministic)

2. Dense Reward PPO/SAC (Traditional RL Approach)

Reward Function:

Total Reward = Σ(delivered_package_value) - Σ(missed_deadline_penalty) - distance_cost

Where:
  delivered_package_value = 100 × (priority / 10)
  missed_deadline_penalty = 50 × (priority / 10)
  distance_cost = 0.01 × kilometers_traveled
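
As a sketch, the dense reward for one truck-day could be computed like this (function and argument names are illustrative, not the project's actual API):

def dense_reward(delivered: list, missed: list, km_traveled: float) -> float:
    """Hand-tuned dense reward: value for deliveries, penalties for misses and distance."""
    delivery_value = sum(100 * (p.priority / 10) for p in delivered)
    miss_penalty = sum(50 * (p.priority / 10) for p in missed)
    distance_cost = 0.01 * km_traveled
    return delivery_value - miss_penalty - distance_cost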

Key Characteristics:

  • Complex reward engineering required
  • Hyperparameters: 100 (delivery weight), -50 (miss penalty), -0.01 (distance)
  • Requires extensive domain knowledge to set correctly
  • Dense signal ranges from -500 to +1000 per decision
  • Scaling factors need careful tuning

3. Sparse Reward PPO/SAC (Novel Approach)

Core Insight: What if we don’t need a complex reward function at all? What if agents can learn from comparative signals only?

Sparse Reward Signal:

Instead of -500 to +1000, agents receive:

  • +1: When policy outperforms what the greedy baseline would have done
  • -1: When policy underperforms the baseline
  • 0: When policy matches the baseline
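
In code, the per-decision signal reduces to a three-way comparison; here is a minimal sketch (the tolerance for treating near-equal rewards as a tie is an assumption, not part of the original setup):

def sparse_signal(agent_reward: float, baseline_reward: float, tol: float = 1e-6) -> float:
    """Comparative reward: +1 if the agent beats the baseline, -1 if it loses, 0 on a tie."""
    if agent_reward > baseline_reward + tol:
        return 1.0
    if agent_reward < baseline_reward - tol:
        return -1.0
    return 0.0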

Why This Matters:

  • No hyperparameter tuning needed (just +1, 0, and -1)
  • No domain knowledge required to scale rewards
  • Agents learn from pure comparison: “Am I better or worse?”
  • Signal is invariant to absolute reward magnitude
  • Simpler, more robust, more generalizable

Architecture Unchanged:

  • Same zone-agnostic shared policy
  • Same observations: priority, deadline_urgency, distance, truck_load
  • Same action space: continuous confidence scores for packages
  • Same training algorithms (PPO and SAC) and training duration (500k timesteps)

The only difference is the reward signal used during training.
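
For concreteness, a Gymnasium-style declaration of the shared observation and action spaces might look like this (the candidate cap, bounds, and normalization are assumptions for illustration; the original implementation may differ):

import numpy as np
from gymnasium import spaces

MAX_CANDIDATES = 10  # assumed cap on packages a truck scores per day

# One row per candidate package: [priority, deadline_urgency, distance, truck_load],
# each normalized to [0, 1]
observation_space = spaces.Box(low=0.0, high=1.0,
                               shape=(MAX_CANDIDATES, 4), dtype=np.float32)

# One continuous confidence score per candidate; the top-scoring packages
# (up to the truck's capacity) are loaded for delivery
action_space = spaces.Box(low=0.0, high=1.0,
                          shape=(MAX_CANDIDATES,), dtype=np.float32)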


The Surprising Results: Sparse Wins Over Dense

We evaluated all agents (Greedy, Dense PPO, Dense SAC, Sparse PPO, Sparse SAC) using a relative value metric: How much better/worse than the baseline?

Agent              Daily Advantage    Win Rate    Max Advantage    Std Dev
Sparse PPO 🏆      +464.8             98.6%       +1020.2          200.7
Dense PPO          +461.5             99.6%       +980.0           192.5
Dense SAC          +399.7             96.8%       +999.5           220.0
Sparse SAC         +362.4             96.6%       +860.1           198.8
Greedy Baseline    0 (reference)      50%         0                -
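
Given per-day reward traces for an agent and the greedy baseline, the table's metrics boil down to a few NumPy reductions (a sketch; array names are placeholders):

import numpy as np

def relative_metrics(agent_daily: np.ndarray, greedy_daily: np.ndarray) -> dict:
    """Summarize an agent's daily performance relative to the greedy baseline."""
    advantage = agent_daily - greedy_daily
    return {
        "daily_advantage": float(advantage.mean()),
        "win_rate": float((advantage > 0).mean()),  # share of days the agent beats greedy
        "max_advantage": float(advantage.max()),
        "std_dev": float(advantage.std()),
    }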

Absolute Rewards

Both PPO agents significantly outperform SAC and the greedy baseline. The Sparse PPO's slight edge over Dense PPO (127,054 vs 126,720) is noteworthy because it achieves this with a drastically simpler reward signal: three discrete values instead of a finely scaled range.


Relative Values

This is the most important metric: How much better/worse than the greedy baseline?

The relative advantages are nearly identical for the PPO variants, showing that sparse and dense rewards achieve essentially the same end result. The slight Sparse PPO edge (+3.3 points) is not statistically significant, but it is philosophically important: the simpler signal achieved slightly better results.

The SAC agents’ lower performance suggests that SAC’s off-policy nature may require denser reward signals for effective learning, or that our SAC hyperparameters weren’t as well-tuned as PPO’s.


Win Rates

Win rate = percentage of days the agent beats the greedy baseline

Interpretation:

  • All agents are highly reliable, beating the baseline >96% of the time
  • Dense PPO has a slight edge in consistency (99.6% vs 98.6%)
  • But this comes at the cost of complex reward engineering
  • The trade-off is acceptable: Sparse PPO’s 98.6% win rate is still excellent
  • Only 7 days out of 500 where Sparse PPO underperforms (1.4% failure rate)

Relative Evolution

This plot shows how the daily advantage changes throughout a 100-day episode.

Interpretation:

  • All agents maintain stable performance across the full episode
  • No degradation as days progress (agents don’t “tire out”)
  • No improvement trend (agents don’t learn within an episode; learning happens across episodes)
  • Sparse PPO’s consistency mirrors Dense PPO perfectly
  • This stability is crucial for real-world deployment: reliable day-after-day performance

Why Does Sparse Reward Learning Work?

Agents trained with only -1, 0, +1 signals matched or exceeded agents trained with -500 to +1000 dense rewards. This challenges conventional RL wisdom. Why?

Hypothesis 1: Signal Noise Reduction

Dense rewards encode both:

  1. Direction: Should I improve or degrade my behavior?
  2. Magnitude: By how much?

The magnitude component introduces noise:

  • How much should I weight a +500 reward vs +550?
  • Are the scaling factors (100, -50, -0.01) optimal?
  • What if a scaling factor is doubled? The trade-offs shift, and the agent has to relearn them.

Sparse signals provide only direction, removing scaling noise:

  • Better than baseline? → +1 (learn this policy)
  • Worse? → -1 (unlearn this policy)
  • Same? → 0 (neutral)

This cleaner signal might be more learnable.

Hypothesis 2: Relative Framing Built-In

Dense rewards are absolute: “This decision is worth 547 points.”

Sparse rewards are relative: “This decision beats the baseline.”

Relative framing is more robust:

  • Generalizes across different problem scales
  • Adapts if baseline changes (e.g., competitor improves)
  • Built-in regularization against extreme behaviors
  • Agents learn what matters: being better, not just good

Hypothesis 3: Simpler Loss Landscape

Dense reward functions create complex, multi-objective loss surfaces:

  • Deliver high-priority: +100 × priority/10
  • Avoid misses: -50 × priority/10
  • Minimize distance: -0.01 × km

Three conflicting objectives, each with scaling factors.

Sparse signals create simpler landscapes:

  • Single objective: maximize probability of beating baseline
  • Binary feedback (better/worse)
  • Cleaner gradient flow

Simpler landscapes = faster convergence, better local optima.

Real-World Advantages of Sparse Reward Learning

In academic settings, we can carefully design reward functions. In the real world, this is hard:

  1. Unknown Objectives: What’s the true business objective?
    • Maximize revenue? Minimize costs? Customer satisfaction? All of the above?
    • Often these conflict
  2. Scaling Uncertainty: How to weight different objectives?
    • Is on-time delivery worth 2x cost savings or 10x?
    • How much does customer satisfaction matter vs revenue?
    • These weights differ by market, season, customer segment
  3. Lack of Domain Knowledge: What if you’re optimizing something new?
    • Traditional logistics expertise doesn’t transfer perfectly
    • New market = new constraints
    • Hand-tuned weights from one domain fail in another
  4. Hyperparameter Sensitivity: Dense rewards require constant retuning
    • Market conditions change → tune weights again
    • Competitor behavior changes → tune again
    • Seasonal variations → more tuning
    • No end to the tuning work

With sparse rewards, you only need one thing: A baseline policy

# Define baseline (e.g., greedy heuristic, existing system, competitor)
baseline_action = get_greedy_action(state)
baseline_reward = evaluate(baseline_action)

# Define learned policy
learned_action = agent.predict(state)
learned_reward = evaluate(learned_action)

# That's it! No hyperparameter tuning needed
if learned_reward > baseline_reward:
    reward_signal = +1
elif learned_reward < baseline_reward:
    reward_signal = -1
else:
    reward_signal = 0

Advantages:

  • ✅ No reward scaling needed
  • ✅ Works with any baseline (greedy, heuristic, human expert, previous system)
  • ✅ Automatically adapts as baseline changes
  • ✅ Same code works for different objectives (just change baseline)
  • ✅ Explainable: “We learned a policy better than your existing system”

This is profoundly more practical for real systems.

Conclusion

This project demonstrates two critical insights:

First: Reinforcement learning can discover sophisticated logistics optimization strategies. Our shared policy agents beat a greedy baseline by 57-58% on the same problem instances that professional logistics systems solve with static heuristics.

Second, and more surprising: You might not need a complex reward function to achieve this. A simple comparative signal (+1 if better, -1 if worse) matched or exceeded carefully tuned dense rewards (-500 to +1000).

This challenges a fundamental assumption in RL: that bigger, richer reward signals are always better. In practice, simpler signals that provide clear direction (“beat the baseline”) might be more learnable and generalizable than complex signals with precise magnitudes.

The counter-intuitive result—that sparse PPO (+464.8 advantage) slightly outperformed dense PPO (+461.5)—suggests that how you train matters more than what you train with. A cleaner learning signal, even if simpler, can outweigh richer but more complex information.

For practitioners building real-world RL systems: Stop optimizing your reward function. Start comparing against your baseline. It’s simpler, more practical, and achieves the same (or better) results.

Final Thoughts

Optimization problems in logistics have been solved for decades using mathematical programming. Traditional RL adds learning capability. But sparse reward learning adds something different: simplicity and adaptability.

A traditional solver finds the optimal solution given a model. But if your model is wrong or incomplete, you’ve optimized the wrong thing.

A dense reward agent learns a policy that maximizes your hand-tuned objective. But if that objective changes (market shifts, new constraint discovered), the training loses relevance.

A sparse reward agent learns one simple thing: How to beat what you already have. This is more robust, more adaptable, and more practical.

The next generation of logistics software—and optimization software broadly—won’t just optimize given constraints. It will learn baseline-relative improvement through sparse, adaptive signals.

And sometimes, the simpler signal is the stronger one.

Code: Available in this repository