Exploration and exploitation are often balanced with an epsilon-greedy policy. A real-life strategy is the heuristic of choosing the option that leaves the most doors open. I am trying to see whether this heuristic can be captured mathematically.

Notation: let \(\rho(n)\) stand for the gate of the \(n^{th}\) episode. Let \(\eta\) stand for the increment applied per experimental unit (a step, an episode, or n steps). Let \(\pi^{\zeta}(s)\) stand for the zeta-greedy policy.


  1. Set \(\eta\) and \(\rho(0) = 0\).
  2. \(\pi^{\zeta}(s) = \arg\max_a q(s,a) > \rho(n)\)
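
To make the two steps concrete, here is a minimal Python sketch of one possible reading. It rests on assumptions that the notation above does not settle: that the gate grows linearly, \(\rho(n) = \rho(0) + n\eta\); that the greedy action is taken only when its value exceeds the gate; and that the policy falls back to a uniformly random action otherwise. The names `gate` and `zeta_greedy` are hypothetical.

```python
import random


def gate(n, eta, rho0=0.0):
    """Gate value rho(n). Assumption: it grows by eta per experimental unit."""
    return rho0 + n * eta


def zeta_greedy(q_values, rho_n, rng=random):
    """Return an action index for one state under the gate rho_n.

    Assumed reading of step 2: take the greedy action if its value
    exceeds the gate; otherwise explore with a uniformly random action.
    The random fallback is an assumption, not part of the notation.
    """
    # Index of the action with the largest estimated value q(s, a).
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    if q_values[best] > rho_n:
        return best
    return rng.randrange(len(q_values))


# Usage: q-values for three actions in one state, gate at episode 2.
action = zeta_greedy([0.1, 0.9, 0.3], gate(2, eta=0.2))
```

Under this reading a rising gate makes the agent explore more as episodes pass unless the value estimates keep pace, so whether \(\rho(n)\) should rise or fall is a design choice the sketch leaves open.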