The balancing of exploration and exploitation is often done via epsilon greedy policy. A real life strategy is the heuristic of choosing the option that leaves most doors open. I am trying to see if it is possible to capture this heuristics mathematically.

notation: let $$\rho(n)$$ stand for the gate of the $$n^{th}$$ episode. Let $$\eta$$ stand for the increment of each experimental unit (step, episode, or n steps). Let $$\pi^{\zeta}(s)$$ stand for the zeta greedy policy.

Algorithm:

1. set $$\eta$$ and $$\rho(0)$$=0
2.  $$\pi^{\zeta}(s)$$=argmax q(s,a)>$$\rho(n)$$