Fisher Information

Definition

The score function is defined as:

\[s(\theta; X) = \frac{\partial \log f(X; \theta)}{\partial \theta}\]

Setting the score to zero yields the maximum likelihood estimator:

\[s(\hat{\theta}; X) = 0 \implies \hat{\theta} = \arg\max_{\theta} \log f(X; \theta)\]
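As a quick numerical sanity check (a sketch assuming a Bernoulli model and numpy; the names `score` and `p_hat` are illustrative), we can verify that the closed-form MLE of a Bernoulli parameter, the sample mean, zeroes the score:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)   # sample from Bernoulli(p0 = 0.3)

def score(p, x):
    # s(p; x) = d/dp sum_i log f(x_i; p) for a Bernoulli likelihood
    return x.sum() / p - (len(x) - x.sum()) / (1 - p)

p_hat = x.mean()              # closed-form MLE for a Bernoulli
print(score(p_hat, x))        # ~0: the MLE is a root of the score
```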

The expectation of the score function evaluated at the true parameter value \(\theta_0\) is:

\[E_{\theta_0}[s(\theta; X) |_{\theta = \theta_0}] = \int f(x; \theta_0)\,\frac{\partial f(x; \theta_0) / \partial \theta}{f(x; \theta_0)}\, dx = \frac{\partial}{\partial \theta} \int f(x; \theta)\, dx \,\Big|_{\theta = \theta_0} = 0\]
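The zero-mean property is easy to check by Monte Carlo (a sketch assuming a normal model with known variance; `mu0` and `sigma` are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma = 1.5, 2.0                     # true parameter (assumed values)
x = rng.normal(mu0, sigma, size=1_000_000)

# score of N(mu, sigma^2) in mu is (x - mu) / sigma^2
scores = (x - mu0) / sigma**2
print(scores.mean())                      # ≈ 0: zero mean at the true mu0
```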

This makes sense: the score has zero mean at the true parameter, because \(\int f(x;\theta)\,dx = 1\) for every \(\theta\), so its derivative vanishes (exchanging differentiation and integration, valid under standard regularity conditions). We are now interested in the curvature of the log-likelihood around the true parameter:

\[-E_{\theta_0} \left[\frac{\partial s(\theta; X)}{\partial \theta} \Big|_{\theta = \theta_0}\right] = E_{\theta_0}[s(\theta ; X)^2 |_{\theta=\theta_0}] = \text{Var}_{\theta_0}(s(\theta ; X) |_{\theta=\theta_0})\]

Thus the expected curvature coincides with the variance of the score function at the true parameter value (the second equality uses the zero-mean property above). This quantity is called the Fisher information at \(\theta_0\):

\[I(\theta_0) = \text{Var}_{\theta_0}(s(\theta ; X) |_{\theta=\theta_0})\]
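The identity can be verified numerically. The sketch below (assuming a Bernoulli model, where the closed form \(I(p) = 1/(p(1-p))\) is known) compares the variance of the score, the negative mean of its derivative, and the analytic value:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = 0.3
x = rng.binomial(1, p0, size=1_000_000)

# per-observation score and its derivative for Bernoulli(p), at p = p0
s  = x / p0 - (1 - x) / (1 - p0)            # d/dp log f(x; p)
ds = -x / p0**2 - (1 - x) / (1 - p0)**2     # d^2/dp^2 log f(x; p)

var_score    = s.var()                      # Var of the score at p0
neg_mean_hes = -ds.mean()                   # -E[ds/dp] at p0
analytic     = 1 / (p0 * (1 - p0))          # known closed form I(p0)
print(var_score, neg_mean_hes, analytic)    # all three agree
```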

If the likelihood is flat around \(\theta_0\), then \(I(\theta_0)\) is small, indicating low confidence in the location of the maximum; a sharply peaked likelihood yields a large \(I(\theta_0)\) and high confidence.

Relation to KL Divergence

Expanding \(KL(\theta_0 \| \theta)\) in \(\theta\) around \(\theta_0\), the zeroth-order term is zero and the first-order term vanishes by the zero-mean property of the score, so the leading behavior is governed by the Hessian, which is exactly the Fisher information:

\[\nabla^2_{\theta} KL(\theta_0 \| \theta) |_{\theta = \theta_0} = I(\theta_0)\]
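This relation can also be checked numerically. A sketch assuming a Bernoulli model, estimating the curvature of the KL divergence at \(p_0\) by a central finite difference and comparing it with \(I(p_0) = 1/(p_0(1-p_0))\):

```python
import numpy as np

def kl(p0, p):
    # KL divergence between Bernoulli(p0) and Bernoulli(p)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

p0, h = 0.3, 1e-4
# central finite difference for d^2/dp^2 KL(p0 || p) at p = p0
curv = (kl(p0, p0 + h) - 2 * kl(p0, p0) + kl(p0, p0 - h)) / h**2
fisher = 1 / (p0 * (1 - p0))    # I(p0) for a Bernoulli
print(curv, fisher)             # curvature of KL matches I(p0)
```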