
Intro

Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is
$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i.$

The word "sample" distinguishes it from the mean/expectation of probability theory.


Definition (median): The sample median of observed values is
$\operatorname{med}(x_1, \ldots, x_n) = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd}, \\ \tfrac{1}{2}\left( x_{n/2} + x_{n/2+1} \right) & \text{if } n \text{ is even}, \end{cases}$

with $x_1 \leq x_2 \leq … \leq x_n$ being the sorted data points.

If the number of observations is odd, take the middle value; if it is even, take the average of the two middle values.
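As a quick numerical illustration (a minimal sketch using NumPy; the data values are made up), both definitions can be computed directly:

```python
import numpy as np

# Hypothetical observed values (made up for illustration).
x = np.array([3.1, 4.7, 2.2, 5.0, 3.8])

# Sample mean: (1/n) * sum of the observations.
mean = x.sum() / len(x)

# Sample median: middle value of the sorted data
# (average of the two middle values when n is even).
xs = np.sort(x)
n = len(xs)
median = xs[n // 2] if n % 2 == 1 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(mean, np.mean(x))      # both give the sample mean
print(median, np.median(x))  # both give the sample median
```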


Definition (Statistical model): A statistical model is a set $\mathcal{P}$ of probability distributions on the sample space of the observations; in the parametric case we write $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$.


Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to T$ for some set $T$.


Examples:

- Mean/expectation

- Variance

- Correlations

Construction of Estimators

Definition (Estimators): An estimator is a function that maps data to estimates of quantities of interest.

In short, an estimator is just a function: its input is the data, the quantity of interest it targets is called the estimand, and the values it outputs are called estimates.


Plug-in Estimator

Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by
$\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i},$
where $\delta_x$ denotes the point mass at $x$.

This is a discrete probability distribution, placing mass $1/n$ on each observed value.


Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is
$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ x_i \le t \}, \quad t \in \mathbb{R}.$

This is a distribution function: a right-continuous step function that jumps by $1/n$ at each observation.


Theorem (Glivenko–Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then
$\sup_{t \in \mathbb{R}} \left| \hat{F}_n(t) - F(t) \right| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.$

In other words, once we have enough samples, this estimator converges (uniformly) to the true distribution function.
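A small simulation illustrates the theorem (a sketch, assuming SciPy is available and taking a standard normal as the true $F$): the supremum distance between $\hat{F}_n$ and $F$ shrinks as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_distance(sample):
    """Compute sup_t |F_hat_n(t) - F(t)| for a standard-normal sample.

    The ecdf is a step function, so the supremum is attained at the jump
    points; it suffices to compare F to the ecdf just before and at each
    sorted observation.
    """
    xs = np.sort(sample)
    n = len(xs)
    ecdf_at = np.arange(1, n + 1) / n   # value of F_hat_n at each sorted point
    ecdf_before = np.arange(0, n) / n   # value just to the left of each point
    F = norm.cdf(xs)                    # true cdf at the jump points
    return max(np.abs(ecdf_at - F).max(), np.abs(F - ecdf_before).max())

for n in [10, 100, 1000, 10000]:
    print(n, sup_distance(rng.standard_normal(n)))
# The printed distances decrease towards 0 as n grows (Glivenko-Cantelli).
```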


Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$, where $F^{-1}(u) := \inf\{ x : F(x) \ge u \}$ denotes the (generalized) inverse of $F$.
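This theorem is the basis of inverse-transform sampling. A minimal sketch (the Exponential(1) distribution is chosen here only because its cdf $F(x) = 1 - e^{-x}$ has the explicit inverse $F^{-1}(u) = -\log(1-u)$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw U ~ Unif(0, 1) and map it through the inverse cdf of Exponential(1).
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u)   # F^{-1}(u) for F(x) = 1 - exp(-x)

# Sanity check: an Exponential(1) variable has mean 1 and variance 1.
print(x.mean(), x.var())
```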


Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F}_n)$.


Example:

Consider the mean $\gamma(F) = \int x \, dF(x)$. Then
$\hat{\gamma} = \gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}_n,$
i.e., the plug-in estimator of the mean is the sample mean.

M-Estimator

Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form
$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n m_\theta(X_i),$

where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).


Examples:

- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.

- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median (see the numerical sketch below).
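A numerical sketch of these two examples (assuming SciPy; the data values are made up): maximizing each criterion, i.e. minimizing its negative, recovers the sample mean and the sample median respectively.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 2.5, 7.0, 9.5])   # made-up data

# M-estimator with m_theta(x) = -(x - theta)^2  -> sample mean.
res_mean = minimize_scalar(lambda t: np.sum((x - t) ** 2))

# M-estimator with m_theta(x) = -|x - theta|    -> sample median.
res_median = minimize_scalar(lambda t: np.sum(np.abs(x - t)))

print(res_mean.x, x.mean())        # both ~ 4.4
print(res_median.x, np.median(x))  # both ~ 2.5
```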

Method of Moments (MOM)

Given a parametric model for real-valued observations
$X_1, \ldots, X_n \ \text{i.i.d.} \sim P_\theta, \quad \theta \in \Theta \subseteq \mathbb{R}^k,$
consider the moments
$m_j(\theta) = \mathbb{E}_\theta[X_1^j], \quad j = 1, \ldots, k.$
If it exists, the $j$-th moment may be estimated by its sample counterpart
$\hat{m}_j = \frac{1}{n} \sum_{i=1}^n X_i^j.$

Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system
$m_j(\theta) = \hat{m}_j, \quad j = 1, \ldots, k.$


Example (Gaussian):

Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2)$ and
$m_1(\theta) = \mathbb{E}_\theta[X_1] = \mu, \qquad m_2(\theta) = \mathbb{E}_\theta[X_1^2] = \mu^2 + \sigma^2.$

The density:
$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$

The system of equations to solve:
$\mu = \frac{1}{n} \sum_{i=1}^n X_i, \qquad \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^n X_i^2.$

Solving yields the sample mean and the empirical variance:
$\hat{\mu} = \overline{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \overline{X}_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2.$
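A small simulation of this example (a sketch; the "true" parameter values are arbitrary), matching the first two sample moments to the theoretical ones:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2_true = 1.5, 4.0          # arbitrary "true" parameters
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=10_000)

m1_hat = np.mean(x)        # first sample moment
m2_hat = np.mean(x ** 2)   # second sample moment

# Solve  mu = m1_hat  and  mu^2 + sigma^2 = m2_hat.
mu_mom = m1_hat
sigma2_mom = m2_hat - m1_hat ** 2        # equals (1/n) * sum (x_i - x_bar)^2

print(mu_mom, sigma2_mom)  # close to 1.5 and 4.0
```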

Maximum Likelihood Estimator (MLE)

Consider a parametric model for the observation
$X \sim P_\theta, \quad \theta \in \Theta.$

Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities
$p_\theta = \frac{dP_\theta}{d\nu}, \quad \theta \in \Theta.$

(The existence of these densities is a result from measure-theoretic probability, namely the Radon–Nikodym theorem.)

Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.

If a density function is available, the likelihood is simply that density function evaluated at the observed data.


Definition: The maximum likelihood estimate (MLE) of $\theta$ is
$\hat{\theta}(x) \in \arg\max_{\theta \in \Theta} L_x(\theta).$

If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called the maximum likelihood estimator (MLE) of $\theta$.


In practice, however, one usually works with the so-called log-likelihood function
$\ell_x(\theta) := \log L_x(\theta).$

This has two advantages:

  • it avoids numerical overflow/underflow;
  • it is more convenient to compute with (for example, if $L_x$ is a product, then $\ell_x$ becomes a sum); see the sketch below.
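Here is a sketch of numerical maximum likelihood (assuming SciPy; Gaussian data with arbitrary "true" parameters). It optimizes the negative log-likelihood, exactly for the reasons above: a product of densities would underflow, while a sum of log-densities is stable and easy to minimize.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # arbitrary "true" parameters

def neg_log_likelihood(params):
    """Negative Gaussian log-likelihood, parametrized by (mu, log sigma) so the
    optimizer can search over all of R^2 without violating sigma > 0."""
    mu, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat ** 2)   # ~ sample mean and (1/n)-variance of x
```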

Example (Gaussian):

Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.

Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0 \quad \text{a.s.}$

The log-likelihood function:
$\ell_X(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.$

Setting the partial derivatives with respect to $\mu$ and $\sigma^2$ to zero, it is not hard to obtain:
$\hat{\mu} = \overline{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2.$

Bayes Estimators

In all of the constructions so far, the estimate depends only on the observed data and is not influenced by any prior experience or knowledge.

We now want to change this: the resulting estimate should depend both on the data and on our prior beliefs.

The workflow of Bayesian inference:

1. Treat the previously fixed parameter $\theta$ as a random variable and choose a prior distribution for it (our belief before observing the data).

2. Interpret $P_\theta$ as the conditional distribution of $X$ given $\theta$.

3. After observing the data $x$, base inference about $\theta$ on its posterior distribution, i.e., the conditional distribution of $\theta$ given $X = x$.

Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$

Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.

Then the posterior distribution has density (w.r.t. $\nu$):
$\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{p(x)},$
where
$p(x) = \int_\Theta p(x \mid \theta)\, \pi(\theta)\, d\nu(\theta)$
is the prior predictive density of $X$.


Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean:
$\hat{\theta}_{\mathrm{Bayes}}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \theta\, \pi(\theta \mid x)\, d\nu(\theta).$

Example (Gaussian):

Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so
$\pi(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left( -\frac{(\mu - m)^2}{2\tau^2} \right).$

The likelihood function is equal to
$L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right) \propto \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \right).$

The posterior density is
$p(\mu \mid X) \propto L_X(\mu)\, \pi(\mu) \propto \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 - \frac{(\mu - m)^2}{2\tau^2} \right).$

We recognize that the posterior distribution will again be a normal distribution.
More precisely, collecting the terms involving $\mu$ in the exponent gives
$p(\mu \mid X) \propto \exp\!\left\{ -\tfrac{1}{2}\left( a\mu^2 - 2b\mu \right) \right\},$
where
$a = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \qquad b = \frac{n \overline{X}_n}{\sigma^2} + \frac{m}{\tau^2}.$

We conclude that since $p(\mu \mid X) \propto \exp\!\left\{ -\tfrac{a}{2} \left( \mu - b/a \right)^2 \right\}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean and variance:
$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{\frac{n}{\sigma^2}\overline{X}_n + \frac{1}{\tau^2} m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}, \qquad \operatorname{Var}[\mu \mid X] = \frac{1}{a} = \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2}.$

The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$:
$\mathbb{E}[\mu \mid X] = \frac{n\tau^2}{n\tau^2 + \sigma^2}\, \overline{X}_n + \frac{\sigma^2}{n\tau^2 + \sigma^2}\, m.$

If we set
$\lambda = \frac{n\tau^2}{n\tau^2 + \sigma^2},$
then
$\mathbb{E}[\mu \mid X] = \lambda\, \overline{X}_n + (1 - \lambda)\, m.$

Note that as $n \to \infty$, $\lambda \to 1$. In other words, the larger $n$ is, the more the posterior mean is determined by the data; the smaller $n$ is, the more it is determined by our prior knowledge.
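A minimal sketch of this posterior update (NumPy only; all numbers are arbitrary), showing how $\lambda$ and the posterior mean move towards the data as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma = 3.0, 2.0        # true mean and known standard deviation
m, tau = 0.0, 1.0                # prior: mu ~ N(m, tau^2)

for n in [1, 10, 100, 10_000]:
    x = rng.normal(mu_true, sigma, size=n)
    xbar = x.mean()

    lam = n * tau**2 / (n * tau**2 + sigma**2)                # weight on the data
    post_mean = lam * xbar + (1 - lam) * m                    # posterior mean
    post_var = (sigma**2 * tau**2) / (n * tau**2 + sigma**2)  # posterior variance

    print(n, round(lam, 3), round(post_mean, 3), round(post_var, 4))
# As n grows, lam -> 1, the posterior mean approaches X_bar (and mu_true),
# and the posterior variance shrinks towards 0.
```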

Mean Square Error, Bias and Variance

Definition: The mean square error of an estimator $\hat{\theta}$ of $\theta$ is defined as
$\mathrm{MSE}_\theta[\hat{\theta}] := \mathbb{E}_\theta\big[ (\hat{\theta} - \theta)^2 \big].$


Theorem: The mean square error decomposes as
$\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2,$

where $\mathrm{Bias}_\theta[\hat{\theta}]:= \mathbb{E}_\theta[\hat{\theta}] -\theta$ is the bias of $\hat{\theta}.$

Proof:

Write
$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta \Big[ \big( \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2 \Big]$
and expand the square:
$\mathrm{MSE}_\theta[\hat{\theta}] = \underbrace{\mathbb{E}_\theta\big[ (\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}])^2 \big]}_{= \mathrm{Var}_\theta[\hat{\theta}]} + 2\, \underbrace{\mathbb{E}_\theta\big[ \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] \big]}_{= 0} \big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big) + \underbrace{\big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2}_{= \mathrm{Bias}_\theta[\hat{\theta}]^2}.$
The cross term vanishes, which gives the claimed decomposition.

$\square$
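The decomposition can also be checked by simulation. A minimal Monte Carlo sketch (NumPy only; the Gaussian setup and the shrinkage factor 0.8 are arbitrary choices for illustration), estimating $\theta = \mu$ with the unbiased sample mean and with a deliberately shrunk, biased estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n, reps = 2.0, 1.0, 20, 200_000

# For each replication, compute two estimators of theta = mu.
samples = rng.normal(theta, sigma, size=(reps, n))
est_mean = samples.mean(axis=1)   # unbiased: X_bar
est_shrunk = 0.8 * est_mean       # biased shrinkage estimator

for name, est in [("X_bar", est_mean), ("0.8 * X_bar", est_shrunk)]:
    mse = np.mean((est - theta) ** 2)   # Monte Carlo estimate of the MSE
    bias = est.mean() - theta           # Monte Carlo estimate of the bias
    var = est.var()                     # Monte Carlo estimate of the variance
    print(name, mse, var + bias ** 2)   # the two numbers agree up to MC error
```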