Intro
Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is $\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i$.
The word "sample" distinguishes it from the mean/expectation of probability theory.
Definition (median): The sample median of observed values is $\operatorname{med}(x_1, \ldots, x_n) = x_{(n+1)/2}$ if $n$ is odd and $\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even,
with $x_1 \leq x_2 \leq \ldots \leq x_n$ being the sorted data points.
If the number of observations is odd, take the single middle value; if it is even, take the average of the two middle values.
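A minimal numerical sketch of these two definitions, assuming NumPy is available (the array `x` is made-up data):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # made-up observations

# Sample mean: (1/n) * sum of the observations.
mean = x.sum() / len(x)

# Sample median: the middle value of the sorted data (odd n),
# or the average of the two middle values (even n).
xs = np.sort(x)
n = len(xs)
median = xs[n // 2] if n % 2 == 1 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(mean, median)   # same results as np.mean(x) and np.median(x)
```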
Definition (Statistical model): A statistical model is a family $\mathcal{P}$ of probability distributions on a sample space $(\mathcal{X}, \mathcal{A})$, one of which is assumed to be the distribution of the observed data $X$.
Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to \mathcal{T}$ into some set $\mathcal{T}$.
Examples:
- Mean/expectation
- Variance
- Correlations
Construction of Estimators
Definition (Estimator): An estimator is a function that maps data to estimates of quantities of interest.
In short, an estimator is a function whose input is the data; the quantity of interest being estimated is called the estimand, and the values the estimator outputs are called estimates.
Plug-in Estimator
Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, i.e., $\hat{P}_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i \in A\}$.
A discrete probability distribution that puts mass $1/n$ on each observation.
Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i \leq x\}$.
A right-continuous step function defined on all of $\mathbb{R}$.
Theorem (Glivenko–Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then $\sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \to 0$ almost surely as $n \to \infty$.
In other words, once we have enough samples, this estimator converges to the true distribution function.
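A small simulation sketch of this convergence, assuming NumPy and SciPy are available (the standard normal distribution and the sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ecdf(data, x):
    # \hat{F}_n(x) = (1/n) * #{i : data_i <= x}, evaluated on a grid of x values
    return np.mean(data[:, None] <= x[None, :], axis=0)

grid = np.linspace(-4, 4, 1001)
for n in [10, 100, 10_000]:
    sample = rng.standard_normal(n)
    sup_diff = np.max(np.abs(ecdf(sample, grid) - norm.cdf(grid)))
    print(n, sup_diff)   # the sup-distance shrinks as n grows
```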
Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$.
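For instance, with $F(x) = 1 - e^{-x}$ (the $\mathrm{Exp}(1)$ cdf) we have $F^{-1}(u) = -\log(1-u)$, so $X := -\log(1-U)$ with $U \sim \mathrm{Unif}(0,1)$ has the $\mathrm{Exp}(1)$ distribution.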
Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F}_n)$.
Example:
Consider the mean $\gamma(F) = \int x \, dF(x)$. Then $\gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n X_i = \overline{X}_n$.
In short: we first work out how the parameter could be read off from the distribution if everything were known (encoding this recipe as the map $\gamma$). We then plug the empirical distribution of our observations into the same map $\gamma$ to compute an estimate of the unknown parameter.
Take the Bernoulli distribution as an example: the samples satisfy $X_i \in \{0,1\}$, and since $p = \mathbb{E}[X]$, we naturally obtain the plug-in estimator $\hat{p} = \overline{X}_n$.
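A quick sketch of the plug-in idea in code, assuming NumPy (the Bernoulli parameter $0.3$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(p) samples; gamma(F) = E[X] = p, so the plug-in estimator
# evaluates the same functional at the empirical distribution: the mean of the data.
x = rng.binomial(1, 0.3, size=1000)
p_hat = x.mean()                      # gamma(F_n_hat) = (1/n) * sum x_i

# The same recipe works for other functionals, e.g. the variance:
var_hat = np.mean((x - x.mean()) ** 2)

print(p_hat, var_hat, 0.3 * 0.7)      # estimates vs. the true variance p(1-p)
```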
M-Estimator
Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form $M_n(\theta) = \frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$,
where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).
Examples:
- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.
- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median (both cases are illustrated in the sketch below).
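A numerical sketch of these two cases, assuming NumPy (the data and the grid of candidate $\theta$ values are made up): maximizing $\frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$ over the grid recovers, approximately, the sample mean and the sample median.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(200) + 1.5           # made-up data

thetas = np.linspace(-2, 5, 7001)            # grid of candidate theta values

def m_estimate(m):
    # Return the theta maximizing the criterion (1/n) * sum_i m(x_i, theta).
    crit = np.array([np.mean(m(x, t)) for t in thetas])
    return thetas[np.argmax(crit)]

theta_sq  = m_estimate(lambda xi, t: -(xi - t) ** 2)   # ~ sample mean
theta_abs = m_estimate(lambda xi, t: -np.abs(xi - t))  # ~ sample median

print(theta_sq, x.mean())
print(theta_abs, np.median(x))
```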
Method of Moments (MOM)
Given a parametric model for real-valued observations, $X_1, \ldots, X_n$ i.i.d. $P_\theta$ with $\theta \in \Theta \subseteq \mathbb{R}^k$,
consider the moments $m_j(\theta) = \mathbb{E}_\theta\big[X_1^j\big]$, $j = 1, \ldots, k$.
If it exists, the $j$-th moment may be estimated by the empirical moment $\hat{m}_j = \frac{1}{n}\sum_{i=1}^n X_i^j$.
Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system $m_j(\theta) = \hat{m}_j$, $j = 1, \ldots, k$.
Example (Gaussian):
Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2)$.
The density: $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$.
The equation system to solve: $m_1(\theta) = \mu = \hat{m}_1$ and $m_2(\theta) = \mu^2 + \sigma^2 = \hat{m}_2$.
Solving it gives the sample mean and the empirical variance: $\hat{\mu} = \overline{X}_n$ and $\hat{\sigma}^2 = \hat{m}_2 - \hat{m}_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$.
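A minimal sketch of this MOM computation, assuming NumPy (the true values $\mu = 2$ and $\sigma = 1.5$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # made-up sample

# Empirical moments: m_hat_j = (1/n) * sum_i X_i^j
m1 = np.mean(x)
m2 = np.mean(x ** 2)

# Solve m1 = mu and m2 = mu^2 + sigma^2 for (mu, sigma^2).
mu_mom     = m1
sigma2_mom = m2 - m1 ** 2

print(mu_mom, sigma2_mom)   # close to 2 and 1.5**2 = 2.25
```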
Maximum Likelihood Estimator (MLE)
Consider a parametric model for the observation $X \sim P_\theta$, $\theta \in \Theta$.
Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities $p_\theta(x) = \frac{dP_\theta}{d\nu}(x)$.
(The existence of these densities is a result from probability theory: the Radon–Nikodym theorem.)
Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.
If a density function is available, we simply use it as the likelihood.
Definition: The maximum likelihood estimate (MLE) of $\theta$ is $\hat{\theta}(x) = \arg\max_{\theta \in \Theta} L_x(\theta)$.
If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called maximum likelihood estimator (MLE) of $\theta$.
In practice, however, we usually work with the so-called log-likelihood function $\ell_x(\theta) = \log L_x(\theta)$.
This has two advantages:
- It avoids numerical overflow/underflow (see the small illustration after this list);
- It is easier to work with:
  - If $L_x$ is a product ($\prod$), then $\ell_x$ becomes a sum ($\sum$);
  - once it is a sum, the terms that do not involve $\theta$ can simply be dropped when differentiating.
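A tiny illustration of the overflow/underflow point, assuming NumPy and SciPy (the standard-normal data are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.standard_normal(2000)

dens = norm.pdf(x)            # 2000 density values, each well below 1
print(np.prod(dens))          # the product underflows to 0.0 in double precision
print(np.sum(np.log(dens)))   # the log-likelihood is an ordinary finite number
```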
Example (Gaussian):
Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.
Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0$ a.s.
The log-likelihood function: $\ell_X(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$.
It is then straightforward to obtain the MLE: $\hat{\mu} = \overline{X}_n$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$.
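A sketch comparing the closed-form Gaussian MLE with a direct numerical maximization of the log-likelihood, assuming NumPy and SciPy (the data-generating values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(loc=-1.0, scale=2.0, size=1000)
n = len(x)

def neg_log_lik(params):
    mu, log_sigma2 = params            # optimize log(sigma^2) to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_num, sigma2_num = res.x[0], np.exp(res.x[1])

# Closed form: mu_hat = X_bar, sigma2_hat = (1/n) * sum (X_i - X_bar)^2
mu_cf = x.mean()
sigma2_cf = np.mean((x - mu_cf) ** 2)

print(mu_num, mu_cf)
print(sigma2_num, sigma2_cf)
```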
Bayes Estimators
In the constructions so far, every estimate depends only on the data we obtained and is not influenced by any prior experience or knowledge.
We now want to change this: the final estimate should depend both on the data and on our prior knowledge or judgment.
The workflow of Bayesian inference:
1. Treat the formerly fixed value $\theta$ as a random variable and choose a prior distribution for it (our judgment before seeing the data).
2. Treat $P_\theta$ as the conditional distribution of $X$ given $\theta$.
3. After observing the data $x$, base statistical inference on the posterior distribution of $\theta$, i.e., the conditional distribution of $\theta$ given $X = x$.
Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$
Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.
Then the posterior distribution has density (w.r.t. $\nu$): $\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}$,
where
$p(x) = \int_\Theta p(x \mid \theta)\,\pi(\theta)\, d\nu(\theta)$ is the prior predictive density of $X$.
Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean: $\hat{\theta}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \theta \, \pi(\theta \mid x)\, d\nu(\theta)$.
Example (Gaussian):
Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so the prior density is $\pi(\mu) \propto \exp\left\{ -\frac{(\mu - m)^2}{2\tau^2} \right\}$.
The likelihood function is equal to $L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(X_i - \mu)^2}{2\sigma^2} \right\} \propto \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \right\}$.
The posterior density is $p(\mu \mid X) \propto \pi(\mu)\, L_X(\mu) \propto \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 - \frac{(\mu - m)^2}{2\tau^2} \right\}$.
We recognize that the posterior distribution will be a normal distribution.
More precisely, collecting the terms that involve $\mu$ gives
$p(\mu \mid X) \propto \exp\left\{ -\tfrac{1}{2}\left( a\mu^2 - 2b\mu \right) \right\}$, where
$a = \frac{n}{\sigma^2} + \frac{1}{\tau^2}$ and $b = \frac{n\overline{X}_n}{\sigma^2} + \frac{m}{\tau^2}$.
We conclude that since $p(\mu \mid X) \propto \exp\left\{ -\tfrac{a}{2}\,(\mu - b/a)^2 \right\}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean $b/a$ and variance $1/a$.
The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$:
$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{\frac{n}{\sigma^2}\,\overline{X}_n + \frac{1}{\tau^2}\, m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}$.
If we set $\lambda = \frac{n\tau^2}{n\tau^2 + \sigma^2}$,
then $\mathbb{E}[\mu \mid X] = \lambda\, \overline{X}_n + (1 - \lambda)\, m$.
Note that as $n \to \infty$, $\lambda \to 1$. That is, the larger $n$ is, the more the resulting mean is determined by the data; conversely, the smaller $n$ is, the more it is determined by our prior knowledge.
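A sketch of the posterior-mean formula, assuming NumPy (the prior $m, \tau^2$, the known $\sigma^2$ and the true $\mu$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2 = 1.0          # known variance of the data
m, tau2 = 0.0, 0.5    # prior: mu ~ N(m, tau2)
mu_true = 3.0         # value used to simulate data (unknown in practice)

for n in [1, 10, 100, 10_000]:
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    lam = n * tau2 / (n * tau2 + sigma2)          # weight on the data
    post_mean = lam * x.mean() + (1 - lam) * m    # convex combination
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    print(n, round(lam, 3), round(post_mean, 3), round(post_var, 5))
# lam -> 1 as n grows, so the posterior mean moves away from the prior mean m
# towards the sample mean X_bar.
```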
Mean Square Error, Bias and Variance
Definition: The mean square error is defined as $\mathrm{MSE}_\theta[\hat{\theta}] := \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big]$.
Theorem: The mean square error decomposes as $\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2$,
where $\mathrm{Bias}_\theta[\hat{\theta}] := \mathbb{E}_\theta[\hat{\theta}] - \theta$ is the bias of $\hat{\theta}$.
Proof:
Write $\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Big[\big(\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta\big)^2\Big]$ and expand the square; the cross term vanishes because $\mathbb{E}_\theta\big[\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}]\big] = 0$, leaving $\mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2$.
$\square$
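A Monte Carlo sketch of this decomposition, assuming NumPy, using the (biased) variance estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2$ on $\mathcal{N}(0,1)$ samples:

```python
import numpy as np

rng = np.random.default_rng(6)

n, reps = 10, 200_000
theta = 1.0                                  # true variance of N(0, 1)

# Draw `reps` samples of size n and compute the estimator on each.
samples = rng.standard_normal((reps, n))
est = samples.var(axis=1)                    # ddof=0: the (biased) 1/n version

mse  = np.mean((est - theta) ** 2)
bias = np.mean(est) - theta                  # E[theta_hat] - theta (here about -1/n)
var  = np.var(est)

print(mse, var + bias ** 2)                  # the two agree up to Monte Carlo error
```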
Sufficient Statistics
Definition: The statistic $T$ is sufficient for model $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ if there exists a determination of the conditional distribution of $X$ given $T(X)=t$ that does not depend on $\theta$.
Theorem (Neyman’s Factorization Criterion): Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$, i.e., each $P_\theta$ has a density $p_\theta = \frac{dP_\theta}{d\nu}$.
Then a statistic $T:\mathcal{X}\to\mathcal{Y}$ is sufficient for $\mathcal{P}$ if and only if there exist measurable non-negative functions $h$ and $g_\theta$ such that for all $\theta\in\Theta$ the density of $P_\theta$ may be given the form $p_\theta(x) = g_\theta\big(T(x)\big)\, h(x)$.
So a sufficient statistic can be obtained simply by factorizing the density in this way.
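For example, for $X_1, \ldots, X_n$ i.i.d. $\mathrm{Ber}(p)$ the joint density is $p_\theta(x) = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}$, which factorizes with $g_\theta(t) = p^{t}(1-p)^{n-t}$ and $h(x) = 1$, so $T(x) = \sum_{i=1}^n x_i$ is sufficient.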
Definition: A sufficient statistic $T$ is minimal sufficient if $T$ is a function of every other sufficient statistic $T'$, or more precisely, if for every sufficient statistic $T'$ there is a measurable map $S$ such that $T(x) = S(T'(x))$ for almost every $x$.
'Almost every' means that $T(x)=S(T'(x))$ for $x$ in a set $\mathcal{X}^* \subseteq \mathcal{X}$ with $P_\theta(\mathcal{X}^*)=1$ for all $\theta\in\Theta$. The existence of such a map $S$ is equivalent to the implication $T'(x) = T'(\tilde{x}) \Rightarrow T(x) = T(\tilde{x})$ holding for almost every pair $x, \tilde{x}$.
Theorem: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be dominated by a $\sigma$-finite measure $\nu$ with densities $p_\theta = \frac{dP_\theta}{d\nu}$.
Define the support $\operatorname{supp}(\mathcal{P})=\{x:\exists\,\theta\text{ with }p_\theta(x)>0\}$.
Let $T$ be a statistic such that for all $x,\tilde{x}\in\operatorname{supp}(\mathcal{P})$ it holds that $T(x)=T(\tilde{x})$ if and only if $p_\theta(x) = c(x,\tilde{x})\, p_\theta(\tilde{x})$ for all $\theta\in\Theta$ and some constant $c(x,\tilde{x})>0$ (i.e., the likelihood ratio does not depend on $\theta$).
Then $T$ is minimal sufficient.
Completeness
Theorem (Basu's Theorem): Let $X$ be an observation from the statistical model $\mathcal{P} = \{ P_{\theta}: \theta \in \Theta \}$ and $T$ a complete and sufficient statistic for $\theta$. If $A$ is an ancillary statistic, i.e. the distribution of $A(X)$ does not depend on $\theta$, then $A(X)$ is independent of $T(X)$.
Proof:
Since $A$ is ancillary, the probability $P_\theta\bigl(A(X)\in B\bigr)$
will be constant with respect to $\theta$ for all $B$ in the range of $A$.
Since $T$ is sufficient, $E_\theta\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right]$ is a function of $T$ (independent of $\theta$).
Then, by the tower property, $P_\theta\bigl(A(X)\in B\bigr) = \mathbb{E}_\theta\Bigl[E_\theta\bigl[\mathbf{1}_{\{A(X)\in B\}}\mid T\bigr]\Bigr]$ for all $\theta\in\Theta$.
Hence, $\mathbb{E}_\theta\bigl[h(T)\bigr] = 0$ for all $\theta\in\Theta$,
with $h(T)=P\bigl(X\in A^{-1}(B)\bigr)-E\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right].$
But since $T$ is complete (i.e., $\mathbb{E}_\theta[g(T)]=0$ for all $\theta$ implies $g(T)=0$ $P_\theta$-a.s. for all $\theta$), this implies $h(T) = 0$ almost surely, i.e., $E_{\ast}\left[\mathbf{1}_{\{A(X)\in B\}}\mid T\right] = P_{\ast}\bigl(A(X)\in B\bigr)$ a.s.,
where $P_{\ast}$ and $E_{\ast}$ indicate quantities that do not depend on $\theta$.
Let $B$ and $C$ be arbitrary sets from the ranges of $A$ and $T$, respectively.
Then, we have: $P_\theta\bigl(A(X)\in B,\ T(X)\in C\bigr) = \mathbb{E}_\theta\Bigl[\mathbf{1}_{\{T(X)\in C\}}\, E\bigl[\mathbf{1}_{\{A(X)\in B\}}\mid T\bigr]\Bigr] = P_{\ast}\bigl(A(X)\in B\bigr)\, P_\theta\bigl(T(X)\in C\bigr) = P_\theta\bigl(A(X)\in B\bigr)\, P_\theta\bigl(T(X)\in C\bigr)$
for all $\theta\in\Theta$, which means $A(X)$ and $T(X)$ are independent.
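A classical application: for $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, $\overline{X}_n$ is complete and sufficient for $\mu$, while the sample variance is ancillary (its distribution does not depend on $\mu$), so Basu's theorem yields that $\overline{X}_n$ and the sample variance are independent.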
Exponential Families
Definition: Let $\mathcal{P}=\{P_\theta:\theta\in\Theta\}$ be a statistical model on $(\mathcal{X},\mathcal{A})$ that is dominated by a $\sigma$-finite measure $\nu$. Suppose there are functions $\eta:\Theta\to\mathbb{R}^k$, $T:\mathcal{X}\to\mathbb{R}^k$, $B:\Theta\to\mathbb{R}$ and $h:\mathcal{X}\to[0,\infty)$
such that the densities $p_\theta(x)=\frac{dP_\theta}{d\nu}(x)$ are of the form $p_\theta(x) = h(x)\,\exp\bigl\{\eta(\theta)^\top T(x) - B(\theta)\bigr\}$.
Then $\mathcal{P}$ is called a $k$-dimensional exponential family with natural parameters $\eta(\theta)$ and sufficient statistic $T$ (recall Neyman’s criterion).
Corollary: If $\eta(\Theta)=\{\eta(\theta):\theta\in\Theta\}\subseteq\mathbb{R}^k$ contains $k+1$ points $\eta^{(0)},\ldots,\eta^{(k)}$ s.t. $\eta^{(j)}-\eta^{(0)}$, $j=1,\ldots,k$ are linearly independent, then the sufficient statistic $T$ is minimal sufficient.
By re-parametrizing in terms of natural parameters, we may bring an exponential family into its canonical form $p_\eta(x) = h(x)\,\exp\bigl\{\eta^\top T(x) - A(\eta)\bigr\}$.
We write $P_\eta$ for the distribution given by the density $p_\eta$.
The density $p_\eta(x)$ is well-defined for every vector $\eta$ in the natural parameter space $\mathcal{H} = \Bigl\{\eta\in\mathbb{R}^k : \int h(x)\, e^{\eta^\top T(x)}\, d\nu(x) < \infty\Bigr\}$.
Indeed, for $\eta\in\mathcal{H}$ a well-defined density is obtained by setting $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\, d\nu(x)$.
The function $A(\eta)$ is the cumulant generating function, also called the log-Laplace transform.
Theorem: For $\eta$ in the interior of $\mathcal{H}$, $\mathbb{E}_\eta[T(X)] = \nabla A(\eta)$ and $\mathrm{Cov}_\eta[T(X)] = \nabla^2 A(\eta)$.
This formula can sometimes be used to compute moments quickly.
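A quick numerical sanity check of this, assuming NumPy, for the Poisson family in canonical form, where $T(x) = x$, $h(x) = 1/x!$, $\eta = \log\lambda$ and $A(\eta) = e^{\eta}$ (the rate $2.5$ is arbitrary): the first derivative of $A$ should match the mean of $T(X)$ and the second derivative its variance.

```python
import numpy as np

rng = np.random.default_rng(7)

lam = 2.5
eta = np.log(lam)             # natural parameter of the Poisson family
A = np.exp                    # cumulant generating function: A(eta) = e^eta

eps = 1e-4                    # finite-difference approximations of A' and A''
dA  = (A(eta + eps) - A(eta - eps)) / (2 * eps)
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2

x = rng.poisson(lam, size=1_000_000)   # here T(x) = x
print(dA, x.mean())                    # both close to lam
print(d2A, x.var())                    # both close to lam
```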
Theorem:
1. The natural parameter space $\mathcal{H}$ is a convex set.
2. The cumulant generating function $A(\eta)$ is a convex function on $\mathcal{H}$. It is strictly convex if $P_\eta\neq P_{\eta'}$ for all $\eta,\eta'\in\mathcal{H}$ with $\eta\neq \eta'$.
3. The log-likelihood function $\ell_x(\eta)=\log p_\eta(x)$ is concave on $\mathcal{H}$. It is strictly concave if $P_\eta\neq P_{\eta'}$ for all $\eta,\eta'\in\mathcal{H}$ with $\eta\neq \eta'$.