Reinforcement Learning Variants for Stochastic Dynamic Combinatorial Optimization Problems in Transportation


  • Florentin D. Hildebrandt
  • Alexander Bode
  • Marlin W. Ulmer
  • Dirk C. Mattfeld



stochastic dynamic transportation problems, sequential decision processes, mixed integer linear programming, reinforcement learning, approximate dynamic programming


With rising customer expectations and increasing computational potential, many transportation services face real-time decision making in stochastic and dynamic environments. They often need to find and adapt complex plans that are effective now but also flexible with respect to future developments. Mathematically, these three challenges of searching the large and complex decision space for effective and flexible decisions are reflected in the three parts of the famous Bellman Equation, namely the reward function (effective), the value function (flexible), and the decision space (search). In the transportation literature, reinforcement learning (RL) has shown potential to quickly evaluate the reward- and value function of the Bellman Equation for a limited number of decisions but struggles to search a complex, constrained decision space immanent in most transportation problems. The question of how to combine the thorough search of the complex decision space with RL-evaluation techniques is still open.We propose three RL-based solutions, each inspired by one component of the Bellman Equation, to search for and evaluate decisions in an integrated manner. The first and second method learn to dynamically manipulate the reward function and decision space to encourage effective and flexible and prohibit inflexible decisions, respectively. The third method models the Bellman Equation as a mixedinteger linear programming formulation in which the value function is given by a neural network approximator. We compare our proposed solution methods in a structured analysis for carefully designed problem classes based on longhaul, medium-haul, and short-haul transportation logistics. We demonstrate the overall effectiveness of our methods compared to prominent benchmark methods and highlight how the methods’ performances depend not only on the problem classes but also on the instances’ parameterizations.