Red Flags for RL Agents: A Quality Indicator for Reinforcement Learning in Electricity Markets

Reinforcement learning agents in electricity market simulations learn bidding strategies on their own. That is powerful, but it comes with a catch: how do you know whether the result is "good"? For small models, the optimal decision at each time step can be computed exactly. For large, realistic simulations it can no longer be computed, and it is precisely there that non-experts are supposed to be able to trust the results.
In this thesis, you develop an indicator that predicts when an RL strategy deviates from the optimal decision, using only signals the agent itself provides (e.g. uncertainty about the value of an action, or how familiar a situation is). You calibrate this indicator where the optimal solution is computable, and test whether it transfers to more complex scenarios.
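To make the idea of optimum-free signals concrete, here is a minimal sketch of the two examples named above. It is an illustration under assumptions, not the method to be developed in the thesis: the ensemble of Q-value estimates, the buffer of visited states, the function names, and the weighted sum are all hypothetical choices.

```python
import numpy as np

def value_disagreement(q_ensemble: np.ndarray) -> float:
    """Uncertainty about the value of the chosen action.

    q_ensemble has shape (n_members, n_actions). The greedy action is taken
    from the ensemble mean; the signal is the standard deviation of that
    action's value across ensemble members (assumed setup).
    """
    greedy = int(np.argmax(q_ensemble.mean(axis=0)))
    return float(q_ensemble[:, greedy].std())

def novelty(state: np.ndarray, visited: np.ndarray) -> float:
    """How unfamiliar a situation is.

    Euclidean distance from `state` to its nearest neighbour among the
    states seen during training (`visited`, shape (n_states, state_dim)).
    """
    return float(np.min(np.linalg.norm(visited - state, axis=1)))

def red_flag_score(q_ensemble, state, visited, w_value=1.0, w_novelty=1.0):
    """Hypothetical indicator: weighted sum of both signals.

    The weights are placeholders; in practice they would be calibrated
    on small models where the optimal decision is known.
    """
    return (w_value * value_disagreement(q_ensemble)
            + w_novelty * novelty(state, visited))
```

Neither signal needs the optimal solution as input, which is what would allow such a score to be evaluated in large simulations as well.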


Specifically:

  • Setting up small market models that are solvable both by optimization and by RL (unique ground truth)
  • Developing and implementing a suboptimality indicator from optimum-free signals
  • Validation: does the indicator reflect the true optimality gap, and does it react to deliberately introduced errors?
  • Testing how far the predictive quality carries as complexity grows
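One possible way to approach the validation step is sketched below. It assumes the optimality gap is measured as the profit of the exact solution minus the profit of the RL agent, and that agreement between indicator and gap is checked with a Spearman rank correlation. Both choices, and all names, are illustrative assumptions.

```python
import numpy as np

def optimality_gap(optimal_profit, agent_profit) -> np.ndarray:
    """Per-step gap between the exact optimum and the RL agent (assumed metric)."""
    return np.asarray(optimal_profit, dtype=float) - np.asarray(agent_profit, dtype=float)

def spearman(x, y) -> float:
    """Spearman rank correlation of x and y.

    Simplified: ranks come from a double argsort, so tied values are
    not averaged. Adequate for a sketch with distinct values.
    """
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Made-up numbers for illustration only.
gap = optimality_gap([10, 10, 10, 10], [9, 7, 10, 4])
indicator = np.array([0.2, 0.5, 0.1, 0.9])
rho = spearman(indicator, gap)
```

A correlation close to 1 would mean the indicator ranks time steps in the same order as the true gap. Repeating the check after deliberately introducing an error into the agent would show whether the indicator reacts to it.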

What you bring:

 Solid Python skills and an interest in reinforcement learning and machine learning. A background in optimization or energy economics is an advantage, but not required.


What you get:

 Close supervision, including support with training and scenario definition, and a topic with a clear research contribution.

If you are interested, please send a cover letter and transcript of records by email to julius.grams∂kit.edu.

References:


Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50-80. doi: 10.1518/hfes.46.1.50_30392. PMID: 15151155.
Timbers, F., Bard, N., Lockhart, E., Lanctot, M., Schmid, M., Burch, N., Schrittwieser, J., Hubert, T., & Bowling, M. (2022). Approximate exploitability: Learning a best response. 3462-3468. doi: 10.24963/ijcai.2022/481.
Römer, R., Kobras, A., Worbis, L., & Schoellig, A. P. (2025). Failure prediction at runtime for generative robot policies. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/pdf/2510.09459
Wellman, M. P., Tuyls, K., & Greenwald, A. (2025). Empirical game theoretic analysis: A survey. Journal of Artificial Intelligence Research, 82, 1017-1076.