Implicit Q-Learning

Revised September 28, 2025

The Deep Q-Learning note is optional but recommended background reading.

Offline reinforcement learning assumes access to a fixed dataset \(\mathcal{B} = \{(s_t,a_t,r_t,s_{t+1})\}_{t=1}^Z\) generated by a behavior policy \(\pi_b\), which may be optimal, suboptimal, or a mixture of policies. The learning agent has no additional access to the environment and must train entirely from this static dataset. In offline RL, the maximization over \(a'\) in the mean squared Bellman error: \[ \begin{equation*} L_{TD}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \left[ \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^2 \right] \;, \end{equation*} \] induces systematic overestimation. For actions \(a'\) outside the support of \(\mathcal{B}\), the estimates \(Q(s',a';\hat{\theta})\) (where \(\hat{\theta}\) denotes the target-network parameters) are purely extrapolated, since \(\mathcal{B}\) provides no evidence for their value. If such actions receive spuriously high estimates, the \(\max\) operator selects them and Bellman updates propagate the bias through successive targets. In online RL such errors can be corrected by evaluating actions through additional environmental interaction. Offline RL lacks this validation, so extrapolated estimates persist. Reliable offline learning therefore requires constraining policies to the support of \(\mathcal{B}\). Common approaches penalize out-of-distribution actions through value regularization or constrain the learned policy directly, though these methods do not fully prevent extrapolation error. Implicit Q-Learning (IQL) mitigates extrapolation error by (i) learning a state-value function \(V(s)\) via expectile regression over in-dataset action values \(Q(s,a)\), and then (ii) training \(Q\) with targets \(r+\gamma V(s')\). This replaces the explicit maximization over (potentially out-of-support) actions and biases value estimates toward higher in-support actions through the expectile parameter \(\tau\).

Expectile Regression

Out-of-distribution actions can be trivially avoided by defining the backup only with actions \(a'\) sampled from the dataset: \[ \begin{equation}\label{eq:sarsa-like-loss} L(\theta) = \mathbb{E}_{(s,a,r,s',\color{red}{a'})\sim\mathcal{B}} \left[\big(r + \gamma Q(s',a';\hat{\theta}) - Q(s,a;\theta)\big)^2\right] \;. \end{equation} \] Minimizing Equation \(\eqref{eq:sarsa-like-loss}\) recovers the Q-function of the behavior policy, \(Q^{\pi_b}\). A new policy \(\pi\) can then be extracted by \(\arg \max_a Q^{\pi_b}(s,a)\), which corresponds to a single step of policy iteration: first evaluate the behavior policy, then take the greedy improvement. Such one-step dynamic programming methods, however, are known to fail on complex tasks. The loss in Equation \(\eqref{eq:sarsa-like-loss}\) minimizes the Bellman error under the behavior policy, so the learned value of each state reflects the average action value under \(\pi_b\): \[ \begin{equation*} \mathbb{E}_{a \sim \pi_b}[Q(s,a)] \;. \end{equation*} \] The mean arises because the squared error penalizes positive and negative deviations symmetrically. To bias the estimate toward higher values supported by the dataset, IQL replaces the mean squared error with expectile regression. For a random variable \(X\), the \(\tau\)-expectile \(m_\tau\) is defined by: \[ \begin{equation}\label{eq:expectile} m_\tau = \arg\min_m \mathbb{E}_{x \sim X}\big[L_2^\tau(x-m)\big] \;, \end{equation} \] with the asymmetric loss: \[ \begin{equation*} L_2^\tau(u) = |\tau - \mathbf{1}_{\{u<0\}}| u^2 \;. \end{equation*} \]
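As a concrete reference, the asymmetric loss \(L_2^\tau\) can be written as a short function. The sketch below uses PyTorch; the function name and the choice of operating on a tensor of residuals are illustrative, not part of IQL's specification.

```python
import torch

def expectile_loss(u: torch.Tensor, tau: float) -> torch.Tensor:
    """Asymmetric squared loss L_2^tau(u) = |tau - 1{u < 0}| * u^2.

    For tau > 0.5, positive residuals (under-estimates of the target)
    are penalized more heavily than negative ones.
    """
    weight = torch.abs(tau - (u < 0).float())  # |tau - 1{u < 0}|
    return weight * u.pow(2)
```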
An expectile can be viewed as a squared-loss analogue of a quantile. The first-order condition of Equation \(\eqref{eq:expectile}\) is: \[ \begin{equation*} \mathbb{E}\big[(\tau-\mathbf{1}_{\{X<m_\tau\}})(X-m_\tau)\big] = 0 \;. \end{equation*} \] As \(\tau\) increases, positive deviations \((X-m_\tau)>0\) are weighted more heavily, so \(m_\tau\) shifts upward. As \(\tau \to 1\), the weight on negative deviations vanishes and \(m_\tau\) approaches the supremum of the support of \(X\). Applied to the distribution of \(Q(s,a)\) under \(a \sim \pi_b(\cdot\mid s)\), the \(\tau\)-expectile interpolates between the mean action value (\(\tau=0.5\)) and, in the limit, the maximum in-support action value: \[ \begin{equation*} \lim_{\tau \to 1} \text{Expectile}_\tau\big(Q(s,a), a \sim \pi_b(\cdot\mid s)\big) = \max_{a \in \Omega(s)} Q(s,a) \;, \end{equation*} \] where \(\Omega(s)=\{a \in A : \pi_b(a\mid s) > 0\}\) denotes the dataset support. Thus, when the expectile is used as the backup target in the SARSA-style update (Equation \(\eqref{eq:sarsa-like-loss}\)), the Q-function is trained to approximate, in the limit, the maximum value attainable from actions in the support of \(\pi_b\): \[ \begin{equation*} L_{TD}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \Big[ \big(r + \gamma \max_{a' \in \Omega(s')} Q(s',a';\hat{\theta}) - Q(s,a;\theta)\big)^2 \Big] \;. \end{equation*} \]
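The interpolation between mean and maximum can be checked numerically. The sketch below estimates \(m_\tau\) by directly minimizing Equation \(\eqref{eq:expectile}\) with gradient descent; the sample of action values, the optimizer settings, and the function names are all made-up for illustration.

```python
import torch

# Hypothetical sample of in-dataset action values Q(s, a) with a ~ pi_b(.|s).
q_values = torch.tensor([0.0, 1.0, 2.0, 5.0, 10.0])

def expectile(x: torch.Tensor, tau: float, steps: int = 5000, lr: float = 0.1) -> float:
    """Estimate m_tau = argmin_m E[L_2^tau(x - m)] by gradient descent."""
    m = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([m], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        u = x - m
        loss = (torch.abs(tau - (u < 0).float()) * u.pow(2)).mean()
        loss.backward()
        opt.step()
    return m.item()

for tau in (0.5, 0.7, 0.9, 0.99):
    print(f"tau={tau}: m_tau ~= {expectile(q_values, tau):.2f}")
# tau = 0.5 recovers the sample mean (3.6); as tau grows,
# m_tau moves toward max(q_values) = 10.
```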
A naive approach to bias Q-learning toward higher in-support values would be to replace the squared loss in Equation \(\eqref{eq:sarsa-like-loss}\) with an asymmetric expectile loss: \[ \begin{equation}\label{eq:expectile-loss} L(\theta) = \mathbb{E}_{(s,a,r,s',a') \sim \mathcal{B}} \left[ L_2^\tau \left(r + \gamma Q(s',a';\hat{\theta}) - Q(s,a;\theta)\right) \right] \;. \end{equation} \] For \(\tau > 0.5\), positive errors \(u>0\) receive larger weight than negative errors \(u<0\); that is, underestimates of the target are penalized more heavily than overestimates. The resulting estimator is biased upward, shifting \(Q(s,a;\theta)\) toward the upper tail of the distribution of Bellman targets. However, applying expectile regression directly to the Bellman targets conflates two sources of variability: the action distribution under \(\pi_b\) and the stochasticity of the transition dynamics \(s' \sim P(\cdot \mid s,a)\). A “lucky”, high value of \(r + \gamma Q(s',a';\hat{\theta})\) may arise from a rare positive transition rather than from a systematically good action. IQL therefore does not apply the expectile loss to \(Q\) directly. Instead, it uses expectile regression only to learn a state-value function \(V_\psi\) over in-dataset actions, estimating an expectile with respect to the action distribution alone: \[ \begin{equation*} L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{B}} \Big[ L_2^\tau \big(Q(s,a;\hat{\theta}) - V(s;\psi)\big) \Big] \;. \end{equation*} \] The expectile value \(V_\psi\) then defines the targets for Q-learning through a standard mean-squared loss: \[ \begin{equation}\label{eq:iql-loss} L_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \Big[\big(r + \gamma V(s';\psi) - Q(s,a;\theta)\big)^2\Big] \;. \end{equation} \] Because the expectile is taken only over actions while the Q-update averages over next states, the targets average out transition stochasticity, avoiding spurious values from rare “lucky” outcomes.
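The two losses \(L_V\) and \(L_Q\) translate directly into code. The sketch below is a minimal PyTorch rendering under assumed interfaces: `q_net(s, a)` and `q_target_net(s, a)` return \(Q(s,a)\), `v_net(s)` returns \(V(s)\), and `batch` is a dict of tensors with keys `"s"`, `"a"`, `"r"`, `"s_next"`, `"done"`. These names and the termination mask are implementation choices, not part of the derivation above.

```python
import torch
import torch.nn.functional as F

def iql_value_loss(q_target_net, v_net, batch, tau: float) -> torch.Tensor:
    """L_V(psi): expectile regression of V(s) toward in-dataset values Q(s, a; theta_hat)."""
    with torch.no_grad():
        q = q_target_net(batch["s"], batch["a"])   # Q(s, a; theta_hat), gradients blocked
    u = q - v_net(batch["s"])                      # residual Q - V
    weight = torch.abs(tau - (u < 0).float())      # |tau - 1{u < 0}|
    return (weight * u.pow(2)).mean()

def iql_q_loss(q_net, v_net, batch, gamma: float) -> torch.Tensor:
    """L_Q(theta): mean-squared error toward the target r + gamma * V(s'; psi)."""
    with torch.no_grad():
        # (1 - done) is the usual termination mask; the note's equations omit it.
        target = batch["r"] + gamma * (1.0 - batch["done"]) * v_net(batch["s_next"])
    return F.mse_loss(q_net(batch["s"], batch["a"]), target)
```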
Recovering a Policy

Equation \(\eqref{eq:iql-loss}\) trains \(Q\) toward targets \(r+\gamma V(s')\), where \(V\) is an upper-biased in-support estimate (controlled by \(\tau\)). This implicitly favors higher in-support actions without taking an explicit \(\max\). The policy for which these Q-values are approximated is “implicit”: the Bellman update evaluates the policy that selects the maximal in-support action, but this policy is never constructed explicitly. Greedy extraction from \(Q\) is unstable in the offline setting, since \(\max_{a'} Q(s,a')\) may still favor unsupported actions. IQL therefore extracts the policy with a constrained method, advantage-weighted regression (AWR). AWR fits a parametric policy \(\pi_\phi\) to the dataset actions while reweighting them according to their advantage: \[ \begin{equation*} A(s,a) = Q(s,a) - V(s) \;, \end{equation*} \] so that actions with higher advantage are imitated more strongly. Concretely, the policy is obtained by minimizing the weighted log-loss \[ \begin{equation*} L_\pi(\phi) = \mathbb{E}_{(s,a)\sim \mathcal{B}} \left[ - \exp\!\left(\tfrac{A(s,a)}{\beta}\right)\,\log \pi_\phi(a\mid s) \right] \;, \end{equation*} \] where \(\beta>0\) is a temperature parameter controlling how aggressively the policy concentrates on high-advantage actions. This procedure ensures that policy learning remains within the dataset support, while still biasing toward actions that improve on \(\pi_b\).
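A minimal sketch of this extraction step follows, under the same assumed interfaces as above plus a hypothetical `policy.log_prob(s, a)` method returning \(\log \pi_\phi(a \mid s)\). Clipping the exponentiated advantage is a common practical safeguard against very large weights, not part of the loss as written.

```python
import torch

def awr_policy_loss(policy, q_target_net, v_net, batch, beta: float,
                    max_weight: float = 100.0) -> torch.Tensor:
    """Advantage-weighted regression: weighted behavioral cloning of dataset actions."""
    with torch.no_grad():
        adv = q_target_net(batch["s"], batch["a"]) - v_net(batch["s"])  # A(s, a) = Q - V
        weight = torch.exp(adv / beta).clamp(max=max_weight)            # exp(A / beta), clipped
    log_prob = policy.log_prob(batch["s"], batch["a"])                  # log pi_phi(a | s)
    return -(weight * log_prob).mean()
```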