Intro

Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
The word "sample" distinguishes it from the mean/expectation in probability theory.


Definition (median): The sample median of observed values is
$$\operatorname{med}(x_1, \ldots, x_n) = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd}, \\ \tfrac{1}{2}\bigl(x_{n/2} + x_{n/2+1}\bigr) & \text{if } n \text{ is even}, \end{cases}$$
with $x_1 \leq x_2 \leq … \leq x_n$ being the sorted data points.

If the number of observations is odd, take the middle value; if it is even, take the average of the two middle values.
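As a quick sanity check, here is a minimal Python sketch (assuming NumPy is installed; the data values are purely illustrative) that computes both quantities by hand and compares them with NumPy's built-ins.

```python
import numpy as np

x = np.array([3.1, 1.4, 2.7, 5.0, 4.2])   # illustrative data

# Sample mean: (1/n) * sum of the observations
mean = x.sum() / len(x)

# Sample median: middle value of the sorted data,
# or the average of the two middle values when n is even
xs = np.sort(x)
n = len(xs)
median = xs[n // 2] if n % 2 == 1 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(mean, median)              # 3.28 3.1
print(np.mean(x), np.median(x))  # the built-ins agree
```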


Definition (Statistical model): A statistical model is a family $\mathcal{P}$ of probability distributions on the sample space of the observation; a parametric model is written as $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$.


Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to \text{some set } \mathcal{T}$.


Examples:

- Mean/expectation

- Variance

- Correlations

Construction of Estimators

Definition (Estimator): An estimator is a function that maps data to estimates of quantities of interest.

Put simply, an estimator is a function whose input is the data and whose output values are called estimates; the quantity of interest being estimated is called the estimand.


Plug-in Estimator

Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by
$$\hat{P}_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \in A\}, \qquad A \subseteq \mathbb{R} \text{ measurable}.$$

This is a discrete probability distribution that places mass $1/n$ on each observed value.


Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \le t\}, \qquad t \in \mathbb{R}.$$

This is a right-continuous step function defined on all of $\mathbb{R}$.


Theorem (Glivenko–Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then
$$\sup_{t \in \mathbb{R}} \bigl| \hat{F}_n(t) - F(t) \bigr| \xrightarrow{\text{a.s.}} 0 \qquad \text{as } n \to \infty.$$

In other words, once we have enough samples, this estimator converges to the true distribution function.
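A small simulation sketch of this statement (assuming NumPy and SciPy are available; the standard normal is just an illustrative choice of $F$): the sup-distance between $\hat{F}_n$ and $F$, evaluated on a fine grid, shrinks as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ecdf(data, t):
    """Empirical distribution function: fraction of data points <= t (vectorized in t)."""
    return np.mean(data[:, None] <= t[None, :], axis=0)

grid = np.linspace(-4, 4, 801)
for n in [10, 100, 10000]:
    x = rng.standard_normal(n)                               # i.i.d. sample with cdf F = N(0,1)
    sup_dist = np.max(np.abs(ecdf(x, grid) - norm.cdf(grid)))
    print(n, sup_dist)                                       # decreases as n grows
```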


Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$.
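This is the basis of inverse-transform sampling. A minimal sketch (assuming NumPy; the Exponential(1) distribution is used as an example because its cdf $F(x) = 1 - e^{-x}$ has the explicit inverse $F^{-1}(u) = -\log(1-u)$):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)   # U ~ Unif(0,1)

# X = F^{-1}(U) with F(x) = 1 - exp(-x), i.e. F^{-1}(u) = -log(1 - u)
x = -np.log(1.0 - u)

# X should behave like an Exponential(1) sample: mean and variance both close to 1
print(x.mean(), x.var())
```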


Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F_n})$.


Example:

Consider the expectation $\gamma(F) = \int x \, dF(x)$. Then
$$\hat{\gamma} = \gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}_n.$$


Put simply, we first ask how the parameter could be obtained from the distribution if everything were known, and quantify this recipe as a function $\gamma$. We then feed our observations with unknown parameter, via their empirical distribution, into $\gamma$ to compute the parameter estimate.

Take the Bernoulli distribution as an example: the samples satisfy $X_i \in \{0,1\}$, and since $p = \mathbb{E}[X]$, we naturally obtain
$$\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i = \overline{X}_n.$$
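A minimal sketch of the plug-in idea in code (assuming NumPy; the data and the two functionals are illustrative): whatever functional $\gamma$ we apply to $F$, we apply the same functional to $\hat{P}_n$, which puts mass $1/n$ on each observation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)   # observed data, N(1, 4) for illustration

# gamma(F) = E[X]       ->  plug-in: integrate x against F_n, i.e. average the data
mean_plugin = np.mean(x)

# gamma(F) = P(X <= 1)  ->  plug-in: fraction of observations <= 1
prob_plugin = np.mean(x <= 1.0)

print(mean_plugin, prob_plugin)   # close to 1 and 0.5 for N(1, 4) data
```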

M-Estimator

Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n m_\theta(X_i),$$
where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).


Examples:

- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.

- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median (see the sketch after this list).
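A minimal numerical sketch of these two examples (assuming NumPy and SciPy; the data are arbitrary illustrative values). Maximizing $\sum_i m_\theta(X_i)$ is the same as minimizing its negative, so we can hand the negated criterion to a generic scalar optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 2.5, 7.0, 9.0])   # illustrative data

# m_theta(x) = -(x - theta)^2  ->  maximizing the criterion = least squares
ls = minimize_scalar(lambda t: np.sum((x - t) ** 2))
# m_theta(x) = -|x - theta|    ->  maximizing the criterion = least absolute deviations
lad = minimize_scalar(lambda t: np.sum(np.abs(x - t)))

print(ls.x, x.mean())        # both ~4.3: the sample mean
print(lad.x, np.median(x))   # both ~2.5: the sample median
```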

Method of Moments (MOM)

Given a parametric model for real-valued observations
$$X_1, \ldots, X_n \ \text{i.i.d.} \sim P_\theta, \qquad \theta \in \Theta \subseteq \mathbb{R}^k,$$
consider the moments
$$m_j(\theta) = \mathbb{E}_\theta\bigl[X_1^j\bigr], \qquad j = 1, \ldots, k.$$
If it exists, the $j$-th moment may be estimated by
$$\hat{m}_j = \frac{1}{n} \sum_{i=1}^n X_i^j.$$

Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system
$$m_j(\theta) = \hat{m}_j, \qquad j = 1, \ldots, k.$$


Example (Gaussian):

Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2)$ and
$$m_1(\theta) = \mu, \qquad m_2(\theta) = \mu^2 + \sigma^2.$$

The density:
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{ -\frac{(x-\mu)^2}{2\sigma^2} \Bigr\}.$$

The system of equations to solve:
$$\hat{m}_1 = \mu, \qquad \hat{m}_2 = \mu^2 + \sigma^2.$$

Solving it yields the sample mean and the empirical variance:
$$\hat{\mu} = \overline{X}_n, \qquad \hat{\sigma}^2 = \hat{m}_2 - \hat{m}_1^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2.$$
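A minimal sketch of this method-of-moments computation (assuming NumPy; the true values $\mu = 2$, $\sigma^2 = 9$ are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # i.i.d. N(mu=2, sigma^2=9)

# Empirical moments
m1_hat = np.mean(x)        # estimates m1 = mu
m2_hat = np.mean(x ** 2)   # estimates m2 = mu^2 + sigma^2

# Solve m1 = mu, m2 = mu^2 + sigma^2 for (mu, sigma^2)
mu_mom = m1_hat
sigma2_mom = m2_hat - m1_hat ** 2   # equals the empirical variance

print(mu_mom, sigma2_mom)           # close to 2 and 9
```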

Maximum Likelihood Estimator (MLE)

Consider a parametric model for the observation
$$X \sim P_\theta, \qquad \theta \in \Theta.$$

Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities
$$p_\theta(x) = \frac{dP_\theta}{d\nu}(x).$$

(The conclusion after "so" is a result from probability theory, namely the Radon–Nikodym theorem.)

Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.

When a density function is available, the likelihood is simply the density function evaluated at the observed data.


Definition: The maximum likelihood estimate (MLE) of $\theta$ is
$$\hat{\theta}(x) \in \operatorname*{arg\,max}_{\theta \in \Theta} L_x(\theta).$$
If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called the maximum likelihood estimator (MLE) of $\theta$.


In practice, however, we usually work with the log-likelihood function
$$\ell_x(\theta) = \log L_x(\theta).$$

This has two advantages:

  • It avoids numerical underflow/overflow (see the sketch after this list);
  • It is more convenient to compute with.
    • If $L_x$ has the form of a product $\prod$, then $\ell_x$ becomes a sum $\sum$.
    • Once we have a sum, terms that do not depend on $\theta$ can simply be dropped when differentiating.
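A small numerical sketch of the first point (assuming NumPy and SciPy; i.i.d. standard normal data are just an example): the product of many densities underflows to zero in double precision, while the sum of log-densities remains a perfectly ordinary number.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)

# Product of 2000 densities, each < 0.4: underflows to exactly 0.0
likelihood = np.prod(norm.pdf(x))

# Sum of log-densities: finite and easy to work with
log_likelihood = np.sum(norm.logpdf(x))

print(likelihood)       # 0.0
print(log_likelihood)   # about -n/2 * log(2*pi) - sum(x**2)/2
```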

Example (Gaussian):

Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.

Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0$ a.s.

The log-likelihood function:
$$\ell_X(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.$$

It is then straightforward to obtain:
$$\hat{\mu} = \overline{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2.$$
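A minimal sketch that checks the closed-form Gaussian MLE against a direct numerical maximization of the log-likelihood (assuming NumPy and SciPy; the simulated data are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)   # i.i.d. N(1, 4), for illustration

# Closed-form MLE: sample mean and the 1/n empirical variance
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

# Numerical check: minimize the negative log-likelihood over (mu, log sigma)
def neg_loglik(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(mu_hat, sigma2_hat)               # close to 1 and 4
print(res.x[0], np.exp(2 * res.x[1]))   # the numerical optimum agrees
```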

Bayes Estimators

In the constructions so far, the estimate depends entirely on the observed data and is not influenced by any prior knowledge or experience.

We now want to change this: the final estimate should depend both on the data and on our prior knowledge/judgment.

The workflow of Bayesian inference:

1. Treat the previously fixed constant $\theta$ as a random variable and choose a prior distribution for it (i.e., our judgment made before observing the data).

2. Treat $P_\theta$ as the conditional distribution of $X$ given $\theta$.

3. After observing the data $x$, base statistical inference on the posterior distribution of $\theta$, i.e., the conditional distribution of $\theta$ given $X = x$.

Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$

Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.

Then the posterior distribution has density (w.r.t. $\nu$):
$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{p(x)},$$

where
$$p(x) = \int_\Theta p(x \mid \theta)\, \pi(\theta)\, d\nu(\theta)$$

is the prior predictive density of $X$.


Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean:
$$\mathbb{E}[\theta \mid X] = \int_\Theta \theta\, \pi(\theta \mid X)\, d\nu(\theta).$$

Example (Gaussian):

Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so
$$\pi(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\Bigl\{ -\frac{(\mu - m)^2}{2\tau^2} \Bigr\}.$$
The likelihood function is equal to
$$L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{ -\frac{(X_i - \mu)^2}{2\sigma^2} \Bigr\}.$$
The posterior density is
$$p(\mu \mid X) \propto \pi(\mu)\, L_X(\mu) \propto \exp\Bigl\{ -\frac{(\mu - m)^2}{2\tau^2} - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \Bigr\}.$$
We recognize that the posterior distribution will be a normal distribution. More precisely, completing the square in $\mu$,
$$p(\mu \mid X) \propto \exp\Bigl\{ -\frac{a}{2} \bigl(\mu - b/a\bigr)^2 \Bigr\},$$

where
$$a = \frac{1}{\tau^2} + \frac{n}{\sigma^2}, \qquad b = \frac{m}{\tau^2} + \frac{n \overline{X}_n}{\sigma^2}.$$

We conclude that since $p(\mu \mid X) \propto \exp\{ -\tfrac{a}{2} (\mu - b/a)^2 \}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean and variance:
$$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{\sigma^2 m + n \tau^2 \overline{X}_n}{\sigma^2 + n \tau^2}, \qquad \mathrm{Var}[\mu \mid X] = \frac{1}{a} = \frac{\sigma^2 \tau^2}{\sigma^2 + n \tau^2}.$$
The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$:
$$\mathbb{E}[\mu \mid X] = \frac{n \tau^2}{n \tau^2 + \sigma^2}\, \overline{X}_n + \frac{\sigma^2}{n \tau^2 + \sigma^2}\, m.$$

If we set
$$\lambda = \frac{n \tau^2}{n \tau^2 + \sigma^2},$$

then
$$\mathbb{E}[\mu \mid X] = \lambda\, \overline{X}_n + (1 - \lambda)\, m.$$

Note that as $n \to \infty$, $\lambda \to 1$. That is, the larger $n$ is, the more the resulting mean is driven by the data; the smaller $n$ is, the more it is driven by our prior knowledge.
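A minimal sketch of this conjugate update (assuming NumPy; all numerical values are illustrative). It also shows the weight $\lambda$ moving towards 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0          # known observation variance
m, tau2 = 0.0, 1.0    # prior: mu ~ N(m, tau2)
mu_true = 2.0         # used only to simulate data

for n in [5, 50, 5000]:
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    lam = n * tau2 / (n * tau2 + sigma2)              # weight on the data
    post_mean = lam * x.mean() + (1 - lam) * m        # posterior mean
    post_var = sigma2 * tau2 / (n * tau2 + sigma2)    # posterior variance
    print(n, round(lam, 3), round(post_mean, 3), round(post_var, 4))
```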

Mean Square Error, Bias and Variance

Definition: The mean square error is defined as
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\bigl[ (\hat{\theta} - \theta)^2 \bigr].$$


Theorem: The mean square error decomposes as
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2,$$
where $\mathrm{Bias}_\theta[\hat{\theta}] := \mathbb{E}_\theta[\hat{\theta}] - \theta$ is the bias of $\hat{\theta}$.

Proof:

Write
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Bigl[ \bigl(\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta \bigr)^2 \Bigr]$$
and expand:
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Bigl[ \bigl(\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}]\bigr)^2 \Bigr] + 2\,\bigl(\mathbb{E}_\theta[\hat{\theta}] - \theta\bigr)\, \mathbb{E}_\theta\bigl[\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}]\bigr] + \bigl(\mathbb{E}_\theta[\hat{\theta}] - \theta\bigr)^2.$$
The middle term vanishes because $\mathbb{E}_\theta\bigl[\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}]\bigr] = 0$, and the remaining two terms are $\mathrm{Var}_\theta[\hat{\theta}]$ and $\mathrm{Bias}_\theta[\hat{\theta}]^2$.
$\square$
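A small simulation sketch (assuming NumPy) that checks the decomposition numerically for the estimator $\hat{\theta} = \overline{X}_n$ of a normal mean; the chosen constants are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, reps = 1.0, 2.0, 10, 200_000

# Repeatedly draw samples of size n and compute theta_hat = sample mean
estimates = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

mse = np.mean((estimates - theta) ** 2)
bias = np.mean(estimates) - theta
var = np.var(estimates)

print(mse, var + bias ** 2)   # both close to sigma^2 / n = 0.4
```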


Sufficient Statistics

Definition: The statistic $T$ is sufficient for model $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ if there exists a determination of the conditional distribution of $X$ given $T(X)=t$ that does not depend on $\theta$.


Theorem (Neyman's Factorization Criterion): Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$, i.e., each $P_\theta$ has a density
$$p_\theta(x) = \frac{dP_\theta}{d\nu}(x).$$

Then a statistic $T:\mathcal{X}\to\mathcal{Y}$ is sufficient for $\mathcal{P}$ if and only if there exist measurable non-negative functions $h$ and $g_\theta$ such that for all $\theta\in\Theta$ the density of $P_\theta$ may be given the form
$$p_\theta(x) = g_\theta\bigl(T(x)\bigr)\, h(x).$$

So a sufficient statistic can often be found simply by factorizing the density function.
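As a quick worked example of the criterion (a standard one, added here for illustration): if $X = (X_1, \ldots, X_n)$ with $X_i$ i.i.d. $\mathrm{Bernoulli}(\theta)$, then
$$p_\theta(x) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \underbrace{\theta^{\sum_i x_i} (1-\theta)^{\,n - \sum_i x_i}}_{g_\theta(T(x))} \cdot \underbrace{1}_{h(x)},$$
so $T(x) = \sum_{i=1}^n x_i$ is sufficient by the factorization criterion.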


Definition: A sufficient statistic $T$ is minimal sufficient if $T$ is a function of every other sufficient statistic $T'$, or more precisely if for every sufficient statistic $T'$ there exists a measurable map $S$ such that $T = S(T')$ almost everywhere.

'Almost everywhere' means that $T(x)=S(T'(x))$ for $x$ in a set $\mathcal{X}^\ast\subseteq\mathcal{X}$ with $P_\theta(\mathcal{X}^\ast)=1$ for all $\theta\in\Theta$. The existence of such a map $S$ is equivalent to
$$T'(x) = T'(\tilde{x}) \;\Longrightarrow\; T(x) = T(\tilde{x}) \qquad \text{for almost every } x, \tilde{x}.$$


Theorem: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$ with densities $p_\theta(x) = \frac{dP_\theta}{d\nu}(x)$.

Define the support $\operatorname{supp}(\mathcal{P})=\{x:\exists\,\theta\text{ with }p_\theta(x)>0\}$.

Let $T$ be a statistic such that for all $x,\tilde{x}\in\operatorname{supp}(\mathcal{P})$ it holds that
$$T(x) = T(\tilde{x}) \quad \Longleftrightarrow \quad p_\theta(x) = c(x, \tilde{x})\, p_\theta(\tilde{x}) \ \text{ for all } \theta \in \Theta, \text{ for some constant } c(x, \tilde{x}) > 0.$$

Then $T$ is minimal sufficient.


Completeness

Theorem (Basu's Theorem): Let $X$ be an observation from the statistical model $\mathcal{P} = \{ P_{\theta}: \theta \in \Theta \}$ and $T$ a complete and sufficient statistic for $\theta$. If $A$ is an ancillary statistic, i.e. the distribution of $A(X)$ does not depend on $\theta$, then $A(X)$ is independent of $T(X)$.

Proof:

Since $A$ is ancillary, the probability
$$P_\theta\bigl(A(X)\in B\bigr)$$
will be constant with respect to $\theta$, for every measurable set $B$ in the range of $A$.
Since $T$ is sufficient, $E_\theta\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right]$ is a function of $T$ (independent of $\theta$).

Then, by the tower property,
$$P\bigl(A(X)\in B\bigr) = \mathbb{E}_\theta\bigl[\mathbf{1}_{\{A(X)\in B\}}\bigr] = \mathbb{E}_\theta\Bigl[\, E\bigl[\mathbf{1}_{\{A(X)\in B\}} \mid T\bigr] \Bigr] \qquad \text{for all } \theta\in\Theta.$$

Hence,
$$\mathbb{E}_\theta\bigl[h(T)\bigr] = 0 \qquad \text{for all } \theta \in \Theta,$$
with $h(T)=P\bigl(X\in A^{-1}(B)\bigr)-E\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right].$

But since $T$ is complete, this implies
$$h(T) = 0 \quad P_\theta\text{-a.s. for all } \theta \in \Theta, \qquad \text{i.e.,} \qquad E_{\ast}\bigl[\mathbf{1}_{\{A(X)\in B\}} \mid T\bigr] = P_{\ast}\bigl(A(X)\in B\bigr) \quad \text{a.s.},$$
with $P_{\ast}$ and $E_{\ast}$ indicating that these quantities do not depend on $\theta$.

Let $B$ and $C$ be arbitrary (measurable) sets in the ranges of $A$ and $T$, respectively. Then we have:
$$P_\theta\bigl(A(X)\in B,\ T(X)\in C\bigr) = \mathbb{E}_\theta\Bigl[\mathbf{1}_{\{T(X)\in C\}}\, E\bigl[\mathbf{1}_{\{A(X)\in B\}} \mid T\bigr]\Bigr] = P\bigl(A(X)\in B\bigr)\, P_\theta\bigl(T(X)\in C\bigr)$$
for all $\theta\in\Theta$, which means $A(X)$ and $T(X)$ are independent.

Exponential Families

Definition: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be a statistical model on $(\mathcal{X},\mathcal{A})$ that is dominated by a $\sigma$-finite measure $\nu$. Suppose there are functions
$$\eta = (\eta_1, \ldots, \eta_k) : \Theta \to \mathbb{R}^k, \qquad T = (T_1, \ldots, T_k) : \mathcal{X} \to \mathbb{R}^k, \qquad B : \Theta \to \mathbb{R}, \qquad h : \mathcal{X} \to [0, \infty)$$
such that the densities $p_\theta(x)=\frac{dP_\theta}{d\nu}(x)$ are of the form
$$p_\theta(x) = h(x) \exp\bigl\{ \eta(\theta)^\top T(x) - B(\theta) \bigr\}.$$

Then $\mathcal{P}$ is called a $k$-dimensional exponential family with natural parameters $\eta(\theta)$ and sufficient statistic $T$ (recall Neyman's criterion).
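As a standard illustration (added here, not part of the original notes): the Poisson family with $P_\lambda(\{x\}) = e^{-\lambda} \lambda^x / x!$ on $\{0, 1, 2, \ldots\}$ is a one-dimensional exponential family, since
$$p_\lambda(x) = \frac{1}{x!} \exp\{ x \log\lambda - \lambda \},$$
i.e. $h(x) = 1/x!$, $T(x) = x$, natural parameter $\eta(\lambda) = \log\lambda$, and $B(\lambda) = \lambda$.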

Corollary: If $\eta(\Theta)=\{\eta(\theta):\theta\in\Theta\}\subseteq\mathbb{R}^k$ contains $k+1$ points $\eta^{(0)},\ldots,\eta^{(k)}$ s.t. $\eta^{(j)}-\eta^{(0)}$, $j=1,\ldots,k$ are linearly independent, then the sufficient statistic $T$ is minimal sufficient.

By re-parametrizing in terms of natural parameters, we may bring an exponential family into its canonical form
$$p_\eta(x) = h(x) \exp\bigl\{ \eta^\top T(x) - A(\eta) \bigr\}.$$

We write $P_\eta$ for the distribution given by the density $p_\eta$.

The density $p_\eta(x)$ is well-defined for every vector $\eta$ in the natural parameter space
$$\mathcal{H} = \Bigl\{ \eta \in \mathbb{R}^k : \int_{\mathcal{X}} h(x)\, e^{\eta^\top T(x)}\, d\nu(x) < \infty \Bigr\}.$$

Indeed, for $\eta\in\mathcal{H}$, a well-defined density is obtained by setting
$$A(\eta) = \log \int_{\mathcal{X}} h(x)\, e^{\eta^\top T(x)}\, d\nu(x).$$
The function $A(\eta)$ is called the cumulant generating function, or log-Laplace transform.


Theorem: For $\eta$ in the interior of $\mathcal{H}$, the cumulant generating function is differentiable and
$$\mathbb{E}_\eta\bigl[T(X)\bigr] = \nabla A(\eta), \qquad \mathrm{Cov}_\eta\bigl[T(X)\bigr] = \nabla^2 A(\eta).$$


Sometimes this formula can be used to compute moments quickly.
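For example (a standard computation, assuming the moment identities above): for the Poisson family in canonical form, $\eta = \log\lambda$ and
$$A(\eta) = \log \sum_{x=0}^{\infty} \frac{e^{\eta x}}{x!} = e^{\eta},$$
so $\mathbb{E}_\eta[T(X)] = A'(\eta) = e^{\eta} = \lambda$ and $\mathrm{Var}_\eta[T(X)] = A''(\eta) = e^{\eta} = \lambda$, recovering the fact that a Poisson distribution has equal mean and variance.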

Theorem:

1. The natural parameter space $\mathcal{H}$ is a convex set.

2. The cumulant generating function $A(\eta)$ is a convex function on $\mathcal{H}$. It is strictly convex if $P_\eta \neq P_{\eta'}$ for all $\eta, \eta' \in \mathcal{H}$ with $\eta \neq \eta'$.

3. The log-likelihood function $\ell_x(\eta)=\log p_\eta(x)$ is concave on $\mathcal{H}$. It is strictly concave if $P_\eta \neq P_{\eta'}$ for all $\eta, \eta' \in \mathcal{H}$ with $\eta \neq \eta'$.