Intro
Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is $\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i$.
The word "sample" distinguishes it from the mean/expectation of probability theory.
Definition (median): The sample median of observed values is $\operatorname{med}(x_1, \ldots, x_n) = x_{(n+1)/2}$ if $n$ is odd and $\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even,
with $x_1 \leq x_2 \leq \ldots \leq x_n$ being the sorted data points.
If the number of observations is odd, take the single middle value; if it is even, take the average of the two middle values.
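A minimal numerical sketch of these two definitions, assuming NumPy is available (the array `x` is made-up data):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # made-up observations

# Sample mean: (1/n) * sum of the observations.
mean = x.sum() / len(x)

# Sample median: the middle value of the sorted data (odd n),
# or the average of the two middle values (even n).
xs = np.sort(x)
n = len(xs)
median = xs[n // 2] if n % 2 == 1 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(mean, median)   # same results as np.mean(x) and np.median(x)
```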
Definition (Statistical model): A statistical model is a family $\mathcal{P}$ of probability distributions on a sample space $(\mathcal{X}, \mathcal{A})$, one of which is assumed to be the distribution of the observed data $X$.
Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to \mathcal{T}$ into some set $\mathcal{T}$.
Examples:
- Mean/expectation
- Variance
- Correlations
Construction of Estimators
Definition (Estimator): An estimator is a function that maps data to estimates of quantities of interest.
In short, an estimator is a function whose input is the data; the quantity of interest being estimated is called the estimand, and the values the estimator outputs are called estimates.
Plug-in Estimator
Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, i.e., $\hat{P}_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i \in A\}$.
A discrete probability distribution that puts mass $1/n$ on each observation.
Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i \leq x\}$.
A right-continuous step function defined on all of $\mathbb{R}$.
Theorem (Glivenko–Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then $\sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \to 0$ almost surely as $n \to \infty$.
In other words, once we have enough samples, this estimator converges to the true distribution function.
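A small simulation sketch of this convergence, assuming NumPy and SciPy are available (the standard normal distribution and the sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ecdf(data, x):
    # \hat{F}_n(x) = (1/n) * #{i : data_i <= x}, evaluated on a grid of x values
    return np.mean(data[:, None] <= x[None, :], axis=0)

grid = np.linspace(-4, 4, 1001)
for n in [10, 100, 10_000]:
    sample = rng.standard_normal(n)
    sup_diff = np.max(np.abs(ecdf(sample, grid) - norm.cdf(grid)))
    print(n, sup_diff)   # the sup-distance shrinks as n grows
```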
Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$.
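For instance, with $F(x) = 1 - e^{-x}$ (the $\mathrm{Exp}(1)$ cdf) we have $F^{-1}(u) = -\log(1-u)$, so $X := -\log(1-U)$ with $U \sim \mathrm{Unif}(0,1)$ has the $\mathrm{Exp}(1)$ distribution.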
Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F}_n)$.
Example:
Consider the mean $\gamma(F) = \int x \, dF(x)$. Then $\gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n X_i = \overline{X}_n$.
In short: we first work out how the parameter could be read off from the distribution if everything were known (encoding this recipe as the map $\gamma$). We then plug the empirical distribution of our observations into the same map $\gamma$ to compute an estimate of the unknown parameter.
Take the Bernoulli distribution as an example: the samples satisfy $X_i \in \{0,1\}$, and since $p = \mathbb{E}[X]$, we naturally obtain the plug-in estimator $\hat{p} = \overline{X}_n$.
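A quick sketch of the plug-in idea in code, assuming NumPy (the Bernoulli parameter $0.3$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(p) samples; gamma(F) = E[X] = p, so the plug-in estimator
# evaluates the same functional at the empirical distribution: the mean of the data.
x = rng.binomial(1, 0.3, size=1000)
p_hat = x.mean()                      # gamma(F_n_hat) = (1/n) * sum x_i

# The same recipe works for other functionals, e.g. the variance:
var_hat = np.mean((x - x.mean()) ** 2)

print(p_hat, var_hat, 0.3 * 0.7)      # estimates vs. the true variance p(1-p)
```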
M-Estimator
Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form $M_n(\theta) = \frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$,
where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).
Examples:
- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.
- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median (both cases are illustrated in the sketch below).
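A numerical sketch of these two cases, assuming NumPy (the data and the grid of candidate $\theta$ values are made up): maximizing $\frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$ over the grid recovers, approximately, the sample mean and the sample median.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(200) + 1.5           # made-up data

thetas = np.linspace(-2, 5, 7001)            # grid of candidate theta values

def m_estimate(m):
    # Return the theta maximizing the criterion (1/n) * sum_i m(x_i, theta).
    crit = np.array([np.mean(m(x, t)) for t in thetas])
    return thetas[np.argmax(crit)]

theta_sq  = m_estimate(lambda xi, t: -(xi - t) ** 2)   # ~ sample mean
theta_abs = m_estimate(lambda xi, t: -np.abs(xi - t))  # ~ sample median

print(theta_sq, x.mean())
print(theta_abs, np.median(x))
```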
Method of Moments (MOM)
Given a parametric model for real-valued observations, $X_1, \ldots, X_n$ i.i.d. $P_\theta$ with $\theta \in \Theta \subseteq \mathbb{R}^k$,
consider the moments $m_j(\theta) = \mathbb{E}_\theta\big[X_1^j\big]$, $j = 1, \ldots, k$.
If it exists, the $j$-th moment may be estimated by the empirical moment $\hat{m}_j = \frac{1}{n}\sum_{i=1}^n X_i^j$.
Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system $m_j(\theta) = \hat{m}_j$, $j = 1, \ldots, k$.
Example (Gaussian):
Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2)$.
The density: $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$.
The equation system to solve: $m_1(\theta) = \mu = \hat{m}_1$ and $m_2(\theta) = \mu^2 + \sigma^2 = \hat{m}_2$.
Solving it gives the sample mean and the empirical variance: $\hat{\mu} = \overline{X}_n$ and $\hat{\sigma}^2 = \hat{m}_2 - \hat{m}_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$.
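A minimal sketch of this MOM computation, assuming NumPy (the true values $\mu = 2$ and $\sigma = 1.5$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # made-up sample

# Empirical moments: m_hat_j = (1/n) * sum_i X_i^j
m1 = np.mean(x)
m2 = np.mean(x ** 2)

# Solve m1 = mu and m2 = mu^2 + sigma^2 for (mu, sigma^2).
mu_mom     = m1
sigma2_mom = m2 - m1 ** 2

print(mu_mom, sigma2_mom)   # close to 2 and 1.5**2 = 2.25
```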
Maximum Likelihood Estimator (MLE)
Consider a parametric model for the observation $X \sim P_\theta$, $\theta \in \Theta$.
Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities $p_\theta(x) = \frac{dP_\theta}{d\nu}(x)$.
(The existence of these densities is a result from probability theory: the Radon–Nikodym theorem.)
Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.
If a density function is available, we simply use it as the likelihood.
Definition: The maximum likelihood estimate (MLE) of $\theta$ is $\hat{\theta}(x) = \arg\max_{\theta \in \Theta} L_x(\theta)$.
If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called maximum likelihood estimator (MLE) of $\theta$.
In practice, however, we usually work with the so-called log-likelihood function $\ell_x(\theta) = \log L_x(\theta)$.
This has two advantages:
- It avoids numerical overflow/underflow (see the small illustration after this list);
- It is easier to work with:
  - If $L_x$ is a product ($\prod$), then $\ell_x$ becomes a sum ($\sum$);
  - once it is a sum, the terms that do not involve $\theta$ can simply be dropped when differentiating.
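A tiny illustration of the overflow/underflow point, assuming NumPy and SciPy (the standard-normal data are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.standard_normal(2000)

dens = norm.pdf(x)            # 2000 density values, each well below 1
print(np.prod(dens))          # the product underflows to 0.0 in double precision
print(np.sum(np.log(dens)))   # the log-likelihood is an ordinary finite number
```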
Example (Gaussian):
Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.
Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0$ a.s.
The log-likelihood function: $\ell_X(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$.
It is then straightforward to obtain the MLE: $\hat{\mu} = \overline{X}_n$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$.
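A sketch comparing the closed-form Gaussian MLE with a direct numerical maximization of the log-likelihood, assuming NumPy and SciPy (the data-generating values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(loc=-1.0, scale=2.0, size=1000)
n = len(x)

def neg_log_lik(params):
    mu, log_sigma2 = params            # optimize log(sigma^2) to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_num, sigma2_num = res.x[0], np.exp(res.x[1])

# Closed form: mu_hat = X_bar, sigma2_hat = (1/n) * sum (X_i - X_bar)^2
mu_cf = x.mean()
sigma2_cf = np.mean((x - mu_cf) ** 2)

print(mu_num, mu_cf)
print(sigma2_num, sigma2_cf)
```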
Bayes Estimators
In the constructions so far, every estimate depends only on the data we obtained and is not influenced by any prior experience or knowledge.
We now want to change this: the final estimate should depend both on the data and on our prior knowledge or judgment.
The workflow of Bayesian inference:
1. Treat the formerly fixed value $\theta$ as a random variable and choose a prior distribution for it (our judgment before seeing the data).
2. Treat $P_\theta$ as the conditional distribution of $X$ given $\theta$.
3. After observing the data $x$, base statistical inference on the posterior distribution of $\theta$, i.e., the conditional distribution of $\theta$ given $X = x$.
Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$
Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.
Then the posterior distribution has density (w.r.t. $\nu$): $\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}$,
where
$p(x) = \int_\Theta p(x \mid \theta)\,\pi(\theta)\, d\nu(\theta)$ is the prior predictive density of $X$.
Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean: $\hat{\theta}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \theta \, \pi(\theta \mid x)\, d\nu(\theta)$.
Example (Gaussian):
Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so the prior density is $\pi(\mu) \propto \exp\left\{ -\frac{(\mu - m)^2}{2\tau^2} \right\}$.
The likelihood function is equal to $L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(X_i - \mu)^2}{2\sigma^2} \right\} \propto \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \right\}$.
The posterior density is $p(\mu \mid X) \propto \pi(\mu)\, L_X(\mu) \propto \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 - \frac{(\mu - m)^2}{2\tau^2} \right\}$.
We recognize that the posterior distribution will be a normal distribution.
More precisely, collecting the terms that involve $\mu$ gives
$p(\mu \mid X) \propto \exp\left\{ -\tfrac{1}{2}\left( a\mu^2 - 2b\mu \right) \right\}$, where
$a = \frac{n}{\sigma^2} + \frac{1}{\tau^2}$ and $b = \frac{n\overline{X}_n}{\sigma^2} + \frac{m}{\tau^2}$.
We conclude that since $p(\mu \mid X) \propto \exp\left\{ -\tfrac{a}{2}\,(\mu - b/a)^2 \right\}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean $b/a$ and variance $1/a$.
The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$:
$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{\frac{n}{\sigma^2}\,\overline{X}_n + \frac{1}{\tau^2}\, m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}$.
If we set $\lambda = \frac{n\tau^2}{n\tau^2 + \sigma^2}$,
then $\mathbb{E}[\mu \mid X] = \lambda\, \overline{X}_n + (1 - \lambda)\, m$.
Note that as $n \to \infty$, $\lambda \to 1$. That is, the larger $n$ is, the more the resulting mean is determined by the data; conversely, the smaller $n$ is, the more it is determined by our prior knowledge.
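A sketch of the posterior-mean formula, assuming NumPy (the prior $m, \tau^2$, the known $\sigma^2$ and the true $\mu$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2 = 1.0          # known variance of the data
m, tau2 = 0.0, 0.5    # prior: mu ~ N(m, tau2)
mu_true = 3.0         # value used to simulate data (unknown in practice)

for n in [1, 10, 100, 10_000]:
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    lam = n * tau2 / (n * tau2 + sigma2)          # weight on the data
    post_mean = lam * x.mean() + (1 - lam) * m    # convex combination
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    print(n, round(lam, 3), round(post_mean, 3), round(post_var, 5))
# lam -> 1 as n grows, so the posterior mean moves away from the prior mean m
# towards the sample mean X_bar.
```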
Mean Square Error, Bias and Variance
Definition: The mean square error is defined as $\mathrm{MSE}_\theta[\hat{\theta}] := \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big]$.
Theorem: The mean square error decomposes as $\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2$,
where $\mathrm{Bias}_\theta[\hat{\theta}] := \mathbb{E}_\theta[\hat{\theta}] - \theta$ is the bias of $\hat{\theta}$.
Proof:
Write $\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Big[\big(\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta\big)^2\Big]$ and expand the square; the cross term vanishes because $\mathbb{E}_\theta\big[\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}]\big] = 0$, leaving $\mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2$.
$\square$
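A Monte Carlo sketch of this decomposition, assuming NumPy, using the (biased) variance estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$ on $\mathcal{N}(0,1)$ samples:

```python
import numpy as np

rng = np.random.default_rng(6)

n, reps = 10, 200_000
theta = 1.0                                  # true variance of N(0, 1)

# Draw `reps` samples of size n and compute the estimator on each.
samples = rng.standard_normal((reps, n))
est = samples.var(axis=1)                    # ddof=0: the (biased) 1/n version

mse  = np.mean((est - theta) ** 2)
bias = np.mean(est) - theta                  # E[theta_hat] - theta (here about -1/n)
var  = np.var(est)

print(mse, var + bias ** 2)                  # the two agree up to Monte Carlo error
```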
Sufficient Statistics
Definition: The statistic $T$ is sufficient for model $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ if there exists a determination of the conditional distribution of $X$ given $T(X)=t$ that does not depend on $\theta$.
Theorem (Neyman’s Factorization Criterion): Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$, i.e., each $P_\theta$ has a density $p_\theta = \frac{dP_\theta}{d\nu}$.
Then a statistic $T:\mathcal{X}\to\mathcal{Y}$ is sufficient for $\mathcal{P}$ if and only if there exist measurable non-negative functions $h$ and $g_\theta$ such that for all $\theta\in\Theta$ the density of $P_\theta$ may be given the form $p_\theta(x) = g_\theta\big(T(x)\big)\, h(x)$.
So a sufficient statistic can be obtained simply by factorizing the density in this way.
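For example, for $X_1, \ldots, X_n$ i.i.d. $\mathrm{Ber}(p)$ the joint density is $p_\theta(x) = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}$, which factorizes with $g_\theta(t) = p^{t}(1-p)^{n-t}$ and $h(x) = 1$, so $T(x) = \sum_{i=1}^n x_i$ is sufficient.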
Definition: A sufficient statistic $T$ is minimal sufficient if $T$ is a function of every other sufficient statistic $T'$, or more precisely, if for every sufficient statistic $T'$ there is a measurable map $S$ such that $T(x) = S(T'(x))$ for almost every $x$.
'Almost every' means that $T(x)=S(T'(x))$ for $x$ in a set $\mathcal{X}^* \subseteq \mathcal{X}$ with $P_\theta(\mathcal{X}^*)=1$ for all $\theta\in\Theta$. The existence of such a map $S$ is equivalent to the implication $T'(x) = T'(\tilde{x}) \Rightarrow T(x) = T(\tilde{x})$ holding for almost every pair $x, \tilde{x}$.
Theorem: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$ with densities $p_\theta = \frac{dP_\theta}{d\nu}$.
Define the support $\operatorname{supp}(\mathcal{P})=\{x:\exists\,\theta\text{ with }p_\theta(x)>0\}$.
Let $T$ be a statistic such that for all $x,\tilde{x}\in\operatorname{supp}(\mathcal{P})$ it holds that $T(x)=T(\tilde{x})$ if and only if $p_\theta(x) = c(x,\tilde{x})\, p_\theta(\tilde{x})$ for all $\theta\in\Theta$ and some constant $c(x,\tilde{x})>0$ (i.e., the likelihood ratio does not depend on $\theta$).
Then $T$ is minimal sufficient.
Completeness
Theorem (Basu's Theorem): Let $X$ be an observation from the statistical model $\mathcal{P} = \{ P_{\theta}: \theta \in \Theta \}$ and $T$ a complete and sufficient statistic for $\theta$. If $A$ is an ancillary statistic, i.e. the distribution of $A(X)$ does not depend on $\theta$, then $A(X)$ is independent of $T(X)$.
Proof:
Since $A$ is ancillary, the probability $P_\theta\bigl(A(X)\in B\bigr)$
will be constant with respect to $\theta$ for all $B$ in the range of $A$.
Since $T$ is sufficient, $E_\theta\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right]$ is a function of $T$ (independent of $\theta$).
Then, by the tower property, $P_\theta\bigl(A(X)\in B\bigr) = \mathbb{E}_\theta\Bigl[E_\theta\bigl[\mathbf{1}_{\{A(X)\in B\}}\mid T\bigr]\Bigr]$ for all $\theta\in\Theta$.
Hence, $\mathbb{E}_\theta\bigl[h(T)\bigr] = 0$ for all $\theta\in\Theta$,
with $h(T)=P\bigl(X\in A^{-1}(B)\bigr)-E\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right].$
But since $T$ is complete (i.e., $\mathbb{E}_\theta[g(T)]=0$ for all $\theta$ implies $g(T)=0$ $P_\theta$-a.s. for all $\theta$), this implies $h(T) = 0$ almost surely, i.e., $E_{\ast}\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right] = P_{\ast}\bigl(A(X)\in B\bigr)$ a.s.,
where $P_{\ast}$ and $E_{\ast}$ indicate quantities that do not depend on $\theta$.
Let $B$ and $C$ be arbitrary sets from the ranges of $A$ and $T$, respectively.
Then, we have: $P_\theta\bigl(A(X)\in B,\ T(X)\in C\bigr) = \mathbb{E}_\theta\Bigl[\mathbf{1}_{\{T(X)\in C\}}\, E\bigl[\mathbf{1}_{\{A(X)\in B\}}\mid T\bigr]\Bigr] = P_{\ast}\bigl(A(X)\in B\bigr)\, P_\theta\bigl(T(X)\in C\bigr) = P_\theta\bigl(A(X)\in B\bigr)\, P_\theta\bigl(T(X)\in C\bigr)$
for all $\theta\in\Theta$, which means $A(X)$ and $T(X)$ are independent.
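A classical application: for $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, $\overline{X}_n$ is complete and sufficient for $\mu$, while the sample variance is ancillary (its distribution does not depend on $\mu$), so Basu's theorem yields that $\overline{X}_n$ and the sample variance are independent.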
Exponential Families
Definition: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be a statistical model on $(\mathcal{X},\mathcal{A})$ that is dominated by a $\sigma$-finite measure $\nu$. Suppose there are functions $\eta:\Theta\to\mathbb{R}^k$, $T:\mathcal{X}\to\mathbb{R}^k$, $B:\Theta\to\mathbb{R}$ and $h:\mathcal{X}\to[0,\infty)$
such that the densities $p_\theta(x)=\frac{dP_\theta}{d\nu}(x)$ are of the form $p_\theta(x) = h(x)\,\exp\bigl\{\eta(\theta)^\top T(x) - B(\theta)\bigr\}$.
Then $\mathcal{P}$ is called a $k$-dimensional exponential family with natural parameters $\eta(\theta)$ and sufficient statistic $T$ (recall Neyman’s criterion).
Corollary: If $\eta(\Theta)=\{\eta(\theta):\theta\in\Theta\}\subseteq\mathbb{R}^k$ contains $k+1$ points $\eta^{(0)},\ldots,\eta^{(k)}$ s.t. $\eta^{(j)}-\eta^{(0)}$, $j=1,\ldots,k$ are linearly independent, then the sufficient statistic $T$ is minimal sufficient.
By re-parametrizing in terms of natural parameters, we may bring an exponential family into its canonical form $p_\eta(x) = h(x)\,\exp\bigl\{\eta^\top T(x) - A(\eta)\bigr\}$.
We write $P_\eta$ for the distribution given by the density $p_\eta$.
The density $p_\eta(x)$ is well-defined for every vector $\eta$ in the natural parameter space $\mathcal{H} = \Bigl\{\eta\in\mathbb{R}^k : \int h(x)\, e^{\eta^\top T(x)}\, d\nu(x) < \infty\Bigr\}$.
Indeed, for $\eta\in\mathcal{H}$ a well-defined density is obtained by setting $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\, d\nu(x)$.
The function $A(\eta)$ is the cumulant generating function, also called the log-Laplace transform.
Theorem: For $\eta$ in the interior of $\mathcal{H}$, $\mathbb{E}_\eta[T(X)] = \nabla A(\eta)$ and $\mathrm{Cov}_\eta[T(X)] = \nabla^2 A(\eta)$.
This formula can sometimes be used to compute moments quickly.
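A quick numerical sanity check of this, assuming NumPy, for the Poisson family in canonical form, where $T(x) = x$, $h(x) = 1/x!$, $\eta = \log\lambda$ and $A(\eta) = e^{\eta}$ (the rate $2.5$ is arbitrary): the first derivative of $A$ should match the mean of $T(X)$ and the second derivative its variance.

```python
import numpy as np

rng = np.random.default_rng(7)

lam = 2.5
eta = np.log(lam)             # natural parameter of the Poisson family
A = np.exp                    # cumulant generating function: A(eta) = e^eta

eps = 1e-4                    # finite-difference approximations of A' and A''
dA  = (A(eta + eps) - A(eta - eps)) / (2 * eps)
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2

x = rng.poisson(lam, size=1_000_000)   # here T(x) = x
print(dA, x.mean())                    # both close to lam
print(d2A, x.var())                    # both close to lam
```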
Theorem:
1. The natural parameter space $\mathcal{H}$ is a convex set.
2. The cumulant generating function $A(\eta)$ is a convex function on $\mathcal{H}$. It is strictly convex if $P_\eta\neq P_{\eta'}$ for all $\eta,\eta'\in\mathcal{H}$ with $\eta\neq \eta'$.
3. The log-likelihood function $\ell_x(\eta)=\log p_\eta(x)$ is concave on $\mathcal{H}$. It is strictly concave if $P_\eta\neq P_{\eta'}$ for all $\eta,\eta'\in\mathcal{H}$ with $\eta\neq \eta'$.