Fisher Information

Definition

The score function is defined as:

\[s(\theta; X) = \frac{\partial \log f(X; \theta)}{\partial \theta}\]

Setting the score to zero yields the maximum likelihood estimator:

\[s(\hat{\theta}; X) = 0 \implies \hat{\theta} = \arg\max_{\theta} \log f(X; \theta)\]
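As a quick numerical sanity check (a sketch assuming a Bernoulli model and numpy; the names `score` and `p_hat` are illustrative), we can verify that the closed-form MLE of a Bernoulli parameter, the sample mean, zeroes the score:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)   # sample from Bernoulli(p0 = 0.3)

def score(p, x):
    # s(p; x) = d/dp sum_i log f(x_i; p) for a Bernoulli likelihood
    return x.sum() / p - (len(x) - x.sum()) / (1 - p)

p_hat = x.mean()              # closed-form MLE for a Bernoulli
print(score(p_hat, x))        # ~0: the MLE is a root of the score
```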

The expectation of the score function evaluated at the true parameter value \(\theta_0\) is:

\[E_{\theta_0}[s(\theta; X) |_{\theta = \theta_0}] = \int f(x; \theta_0)\,\frac{\partial f(x; \theta_0) / \partial \theta}{f(x; \theta_0)}\, dx = \frac{\partial}{\partial \theta} \int f(x; \theta)\, dx \,\Big|_{\theta = \theta_0} = 0\]
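The zero-mean property is easy to check by Monte Carlo (a sketch assuming a normal model with known variance; `mu0` and `sigma` are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma = 1.5, 2.0                     # true parameter (assumed values)
x = rng.normal(mu0, sigma, size=1_000_000)

# score of N(mu, sigma^2) in mu is (x - mu) / sigma^2
scores = (x - mu0) / sigma**2
print(scores.mean())                      # ≈ 0: zero mean at the true mu0
```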

This makes sense: the score has zero mean at the true parameter, because \(\int f(x;\theta)\,dx = 1\) for every \(\theta\), so its derivative vanishes (exchanging differentiation and integration, valid under standard regularity conditions). We are now interested in the curvature of the log-likelihood around the true parameter:

\[-E_{\theta_0} \left[\frac{\partial s(\theta; X)}{\partial \theta} \Big|_{\theta = \theta_0}\right] = E_{\theta_0}[s(\theta ; X)^2 |_{\theta=\theta_0}] = \text{Var}_{\theta_0}(s(\theta ; X) |_{\theta=\theta_0})\]

Thus the expected curvature coincides with the variance of the score function at the true parameter value (the second equality uses the zero-mean property above). This quantity is called the Fisher information at \(\theta_0\):

\[I(\theta_0) = \text{Var}_{\theta_0}(s(\theta ; X) |_{\theta=\theta_0})\]
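The identity can be verified numerically. The sketch below (assuming a Bernoulli model, where the closed form \(I(p) = 1/(p(1-p))\) is known) compares the variance of the score, the negative mean of its derivative, and the analytic value:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = 0.3
x = rng.binomial(1, p0, size=1_000_000)

# per-observation score and its derivative for Bernoulli(p), at p = p0
s  = x / p0 - (1 - x) / (1 - p0)            # d/dp log f(x; p)
ds = -x / p0**2 - (1 - x) / (1 - p0)**2     # d^2/dp^2 log f(x; p)

var_score    = s.var()                      # Var of the score at p0
neg_mean_hes = -ds.mean()                   # -E[ds/dp] at p0
analytic     = 1 / (p0 * (1 - p0))          # known closed form I(p0)
print(var_score, neg_mean_hes, analytic)    # all three agree
```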

If the likelihood is flat around \(\theta_0\), then \(I(\theta_0)\) is small, indicating low confidence in the location of the maximum; a sharply peaked likelihood yields a large \(I(\theta_0)\) and high confidence.

Relation to KL Divergence

Expanding \(KL(\theta_0 \| \theta)\) in \(\theta\) around \(\theta_0\), the zeroth-order term is zero and the first-order term vanishes by the zero-mean property of the score, so the leading behavior is governed by the Hessian, which is exactly the Fisher information:

\[\nabla^2_{\theta} KL(\theta_0 \| \theta) |_{\theta = \theta_0} = I(\theta_0)\]
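This relation can also be checked numerically. A sketch assuming a Bernoulli model, estimating the curvature of the KL divergence at \(p_0\) by a central finite difference and comparing it with \(I(p_0) = 1/(p_0(1-p_0))\):

```python
import numpy as np

def kl(p0, p):
    # KL divergence between Bernoulli(p0) and Bernoulli(p)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

p0, h = 0.3, 1e-4
# central finite difference for d^2/dp^2 KL(p0 || p) at p = p0
curv = (kl(p0, p0 + h) - 2 * kl(p0, p0) + kl(p0, p0 - h)) / h**2
fisher = 1 / (p0 * (1 - p0))    # I(p0) for a Bernoulli
print(curv, fisher)             # curvature of KL matches I(p0)
```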