<p>Taihong Xiao, Software Engineer at Google Research (taihong.xiao@gmail.com)</p>
<p><strong>The Right Way for an AI to Play WeChat Jump</strong> (2018-01-04)</p>
<p>Recently, the WeChat mini-game Jump Jump (Tiao Yi Tiao) has taken the whole country by storm; everyone from small kids to big kids seems to be hooked on it. As AI programmers who can do anything (or so we like to think), we wondered: could we play this game with artificial intelligence (AI) and computer vision (CV)? So we developed the WeChat Auto-Jump algorithm, redefining the right way to play. Our algorithm not only far surpasses human players, it also beats every previously known algorithm in both speed and accuracy; you could call it the state of the art of Jump Jump. Below we describe the algorithm in detail.</p>
<p>The first step of the algorithm is to capture screenshots of the phone and control its touch input. Our <a href="https://github.com/Prinsphield/Wechat_AutoJump">github</a> repo documents the setup for both Android and iOS phones in detail; just connect the phone to a computer and follow the tutorial. Once we have a screenshot, this becomes a simple vision problem: find the position of the little player and the center of the next platform to jump onto.
As shown below, the green dot marks the player's current position and the red dot marks the target.</p>
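On Android, both steps can be driven through adb from Python. The sketch below is a hedged illustration of that path only: the repo's actual scripts (and the iOS WebDriverAgent route) differ, and the device paths and tap coordinates here are my own illustrative assumptions.

```python
import subprocess

def capture_screen(out_path='screen.png'):
    """Pull a screenshot from an attached Android phone via adb."""
    subprocess.run(['adb', 'shell', 'screencap', '-p', '/sdcard/screen.png'], check=True)
    subprocess.run(['adb', 'pull', '/sdcard/screen.png', out_path], check=True)

def make_jump_cmd(press_ms, x=500, y=1600):
    """Build the adb command for a long press of press_ms milliseconds:
    a swipe whose start and end points coincide acts as a timed touch."""
    return ['adb', 'shell', 'input', 'swipe',
            str(x), str(y), str(x), str(y), str(int(press_ms))]
```

A command built by `make_jump_cmd(700)` would then be handed to `subprocess.run` to trigger a jump.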
<div align="center">
<img align="center" src="/extra/wechat_jump/state_000.png" width="250" alt="state_000" />
<img align="center" src="/extra/wechat_jump/state_000_res.png" width="250" alt="state_000_res" />
</div>
<p><br /></p>
<h3 id="多尺度搜索multiscale-search">多尺度搜索(Multiscale Search)</h3>
<p>There are many ways to solve this problem. To climb the leaderboard quickly and dirtily, my first approach was a multiscale template search.
I grabbed an arbitrary screenshot and cropped out the player, like this.</p>
<div align="center">
<img align="center" src="https://raw.githubusercontent.com/Prinsphield/Wechat_AutoJump/master/resource/player.png" width="50" alt="player" />
</div>
<p><br /></p>
<p>I also noticed that the player's apparent size varies slightly with its position on screen, so I designed a multiscale search: match the template at several different sizes
and keep the match with the highest confidence score.</p>
<p>The multiscale search code looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">multi_scale_search</span><span class="p">(</span><span class="n">pivot</span><span class="p">,</span> <span class="n">screen</span><span class="p">,</span> <span class="nb">range</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">H</span><span class="p">,</span> <span class="n">W</span> <span class="o">=</span> <span class="n">screen</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">h</span><span class="p">,</span> <span class="n">w</span> <span class="o">=</span> <span class="n">pivot</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">found</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">for</span> <span class="n">scale</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="nb">range</span><span class="p">,</span> <span class="mi">1</span><span class="o">+</span><span class="nb">range</span><span class="p">,</span> <span class="n">num</span><span class="p">)[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="n">resized</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">screen</span><span class="p">,</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">W</span> <span class="o">*</span> <span class="n">scale</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">H</span> <span class="o">*</span> <span class="n">scale</span><span class="p">)))</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">W</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">resized</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">if</span> <span class="n">resized</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o"><</span> <span class="n">h</span> <span class="ow">or</span> <span class="n">resized</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o"><</span> <span class="n">w</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">matchTemplate</span><span class="p">(</span><span class="n">resized</span><span class="p">,</span> <span class="n">pivot</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">TM_CCOEFF_NORMED</span><span class="p">)</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">res</span> <span class="o">>=</span> <span class="n">res</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">pos_h</span><span class="p">,</span> <span class="n">pos_w</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">loc</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">if</span> <span class="n">found</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">res</span><span class="p">.</span><span class="nb">max</span><span class="p">()</span> <span class="o">></span> <span class="n">found</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="n">found</span> <span class="o">=</span> <span class="p">(</span><span class="n">pos_h</span><span class="p">,</span> <span class="n">pos_w</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">res</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="k">if</span> <span class="n">found</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span>
<span class="n">pos_h</span><span class="p">,</span> <span class="n">pos_w</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">score</span> <span class="o">=</span> <span class="n">found</span>
<span class="n">start_h</span><span class="p">,</span> <span class="n">start_w</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">pos_h</span> <span class="o">*</span> <span class="n">r</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">pos_w</span> <span class="o">*</span> <span class="n">r</span><span class="p">)</span>
<span class="n">end_h</span><span class="p">,</span> <span class="n">end_w</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">pos_h</span> <span class="o">+</span> <span class="n">h</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span><span class="p">),</span> <span class="nb">int</span><span class="p">((</span><span class="n">pos_w</span> <span class="o">+</span> <span class="n">w</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span><span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">start_h</span><span class="p">,</span> <span class="n">start_w</span><span class="p">,</span> <span class="n">end_h</span><span class="p">,</span> <span class="n">end_w</span><span class="p">,</span> <span class="n">score</span><span class="p">]</span>
</code></pre></div></div>
<p>Let's give it a try. It works quite well, fast and accurate; in all my experiments, finding the player never failed once.
Note, however, that the bottom center of the detection box is not the player's actual position; the true position is slightly above it.</p>
<div align="center">
<img align="center" src="/extra/wechat_jump/player_detection.png" width="250" alt="player" />
</div>
<p><br /></p>
<p>The target platform can be found the same way, but we need to collect templates of the different platforms: circles, boxes,
convenience stores, manhole covers, prisms, and so on. With many templates and multiple scales, the search slows down, so we need to speed it up.
First, note that the target is always above the player, so once the player is located we can discard everything below it,
which shrinks the search space.</p>
<p>But that is still not enough; we have to dig deeper into the game. The player and the target platform are roughly symmetric about the center of the screen.
This suggests a very effective way to shrink the search space further. Suppose the screen resolution is (1280, 720) and the player's bottom is at (h1, w1);
the point symmetric about the center is then (1280 - h1, 720 - w1). Running the multiscale search only inside a square of side 300 centered on that point
makes it both fast and accurate. See the figure below: the blue box is the (300, 300) search region, the red box is the detected platform, and the center of that rectangle is the target point.</p>
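The center-symmetry shortcut amounts to a few lines of arithmetic. A minimal sketch, using the (h, w) coordinate convention of the text; the clamping to screen bounds is my own addition:

```python
def target_search_region(h1, w1, H=1280, W=720, size=300):
    """Given the player's bottom point (h1, w1) on an H x W screen, return
    (top, left, bottom, right) of a size x size window centered on the
    point symmetric about the screen center."""
    ch, cw = H - h1, W - w1          # center-symmetric point
    half = size // 2
    top    = max(ch - half, 0)       # clamp the window to the screen
    left   = max(cw - half, 0)
    bottom = min(ch + half, H)
    right  = min(cw + half, W)
    return top, left, bottom, right
```

Only this window is then fed to the multiscale search, which is what makes it fast.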
<div align="center">
<img align="center" src="/extra/wechat_jump/target_detection.png" width="250" alt="player" />
</div>
<p><br /></p>
<h3 id="加速的奇技淫巧fast-search">加速的奇技淫巧(Fast-Search)</h3>
<p>Playing games well takes careful observation. Whenever the player lands exactly on the center of a platform, the center of the next target platform shows a small white dot, as in the figure
shown earlier. A more attentive observer will notice that the dot's RGB value is (245, 245, 245), which gives a very simple and
efficient trick: search for this white dot directly. The dot is a connected region, and the number of pixels with value (245, 245, 245) stays stably
between 280 and 310, so we can use it to locate the target directly. This only works when the previous jump hit the center, but that is fine:
we can always try this nearly free check first and fall back to the multiscale search when it fails.</p>
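The white-dot check can be sketched with NumPy as follows. The (245, 245, 245) value and the 280-310 pixel-count band come from the text; the masking code is my own sketch, and it skips the connected-component check, relying on the count band alone to reject false positives.

```python
import numpy as np

def fast_search(img):
    """Return the (h, w) centroid of the white target dot in an HxWx3 image,
    or None when the pixel count falls outside the stable 280-310 band."""
    mask = np.all(img == 245, axis=-1)   # pixels exactly (245, 245, 245)
    count = int(mask.sum())
    if not 280 <= count <= 310:
        return None                      # caller falls back to multiscale search
    hs, ws = np.nonzero(mask)
    return int(hs.mean()), int(ws.mean())
```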
<p>By this point the method already runs remarkably well, essentially a perpetual motion machine. Below is the state after playing on my phone for about an hour and a half, 859 jumps in.
Our method still computed the player and target positions correctly, but I chose to end it there because the phone had become unbearably laggy.</p>
<div align="center">
<img align="center" src="https://raw.githubusercontent.com/Prinsphield/Wechat_AutoJump/master/resource/state_859.png" width="250" alt="state_859" />
<img align="center" src="https://raw.githubusercontent.com/Prinsphield/Wechat_AutoJump/master/resource/state_859_res.png" width="250" alt="state_859_res" />
<img align="center" src="https://raw.githubusercontent.com/Prinsphield/Wechat_AutoJump/master/resource/sota.png" width="250" alt="sota" />
</div>
<p><br /></p>
<p>Here is a demo video, enjoy!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OeTI2Kx8Ehc" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen=""></iframe>
<p>Is that it? Then how would we be any different from amateur players? Time for some serious academic work. Non-combatants, please evacuate immediately!</p>
<h3 id="cnn-coarse-to-fine-模型">CNN Coarse-to-Fine 模型</h3>
<p>Screen capture on iOS devices is constrained: screenshots obtained through WebDriverAgent are compressed, so the pixel values are altered for reasons we have not identified (improvement suggestions from anyone familiar with the details are welcome~), and fast-search therefore cannot be used. To handle this, and to support devices with different resolutions, we built a faster and more robust target detection model with a convolutional neural network. We describe the algorithm in four parts: data collection and preprocessing, the coarse model, the fine model, and the cascade.</p>
<h4 id="数据采集与预处理">数据采集与预处理</h4>
<p>Using our highly accurate multiscale-search and fast-search pipelines, we collected data over seven runs, roughly 3000 screenshots in total, each annotated with the target position. Every image was preprocessed in two different ways, one for training the coarse model and one for the fine model, as described below.</p>
<h5 id="coarse-模型数据预处理">coarse 模型数据预处理</h5>
<p>In each screenshot, only the central region actually matters for the current decision, namely where the player and the target are; the top and bottom strips carry no useful information. We therefore trim (320, 720) from both the top and the bottom of each (1280, 720) image along the h axis, keeping the central (640, 720) region as training data.</p>
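The crop is a single array slice. A minimal sketch; the quarter fractions just generalize the fixed 320/640 split described above:

```python
import numpy as np

def crop_for_coarse(img):
    """Keep the central band of a screenshot: for a (1280, 720) image this
    drops 320 rows from the top and 320 from the bottom, leaving (640, 720)."""
    H = img.shape[0]
    return img[H // 4 : 3 * H // 4]
```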
<p>We also observed that in the game, whenever the player lands exactly on the center of a platform, a white dot appears at the center of the next target,</p>
<div align="center">
<img align="center" src="/extra/wechat_jump/state_000.png" width="250" alt="state_000" />
</div>
<p><br />
Since fast-search contributes many training images that contain this white dot, and to keep the dot from leaking the answer to the network during training, we removed it from every image by filling its region with the surrounding solid background color.</p>
<h5 id="fine-模型数据预处理">fine 模型数据预处理</h5>
<p>To push the accuracy further, we built a separate dataset for the fine model: from each training image, we crop a 320x320 patch around the target point as training data,</p>
<div align="center">
<img align="center" src="/extra/wechat_jump/piece.png" width="250" alt="piece" />
</div>
<p><br />
To keep the network from learning a trivial solution (the target always sitting at the patch center), we shift each crop by a random offset of up to 50 pixels. The white dot is removed from the fine data as well.</p>
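The jittered crop can be sketched as below. The 320x320 patch size and the 50-pixel jitter come from the text; the function and parameter names, and returning the label in patch-relative coordinates, are my own assumptions about how the dataset is laid out.

```python
import numpy as np

def crop_for_fine(img, th, tw, size=320, jitter=50, rng=np.random):
    """Crop a size x size patch roughly centered on the target (th, tw),
    shifted by a random offset of up to `jitter` pixels so the target is
    not always at the patch center (preventing a trivial solution)."""
    dh = rng.randint(-jitter, jitter + 1)
    dw = rng.randint(-jitter, jitter + 1)
    top  = int(np.clip(th - size // 2 + dh, 0, img.shape[0] - size))
    left = int(np.clip(tw - size // 2 + dw, 0, img.shape[1] - size))
    patch = img[top:top + size, left:left + size]
    # label relative to the patch, so the network regresses in-patch coordinates
    label = (th - top, tw - left)
    return patch, label
```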
<h4 id="coarse-模型">coarse 模型</h4>
<p>We cast the task as a regression problem: the coarse model is a convolutional neural network that regresses the target position,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">img</span><span class="p">,</span> <span class="n">is_training</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'coarse'</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="s">'conv1'</span><span class="p">,</span> <span class="n">img</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">input_channle</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># out = tf.layers.batch_normalization(out, name='bn1', training=is_training)
</span> <span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'relu1'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv2'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv3'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv4'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv5'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">256</span> <span class="o">*</span> <span class="mi">20</span> <span class="o">*</span> <span class="mi">23</span><span class="p">])</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_fc</span><span class="p">(</span><span class="s">'fc1'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">20</span> <span class="o">*</span> <span class="mi">23</span><span class="p">,</span> <span class="mi">256</span><span class="p">],</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_fc</span><span class="p">(</span><span class="s">'fc2'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">256</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="k">return</span> <span class="n">out</span>
</code></pre></div></div>
<p>After ten hours of training, the coarse model reached an accuracy of 6 pixels on the test set and about 10 pixels in live play, with an inference time of 0.4 s on the test machine (MacBook Pro Retina, 15-inch, Mid 2015, 2.2 GHz Intel Core i7). This model alone easily scores over 1k, already far beyond human level and most automatic methods, and plenty for everyday fun. But if you think we stopped here, you are sorely mistaken~</p>
<h4 id="fine-模型">fine 模型</h4>
<p>The fine model has a structure similar to the coarse model with slightly more parameters, and acts as a refinement step on top of the coarse prediction,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">img</span><span class="p">,</span> <span class="n">is_training</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'fine'</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="s">'conv1'</span><span class="p">,</span> <span class="n">img</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">input_channle</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># out = tf.layers.batch_normalization(out, name='bn1', training=is_training)
</span> <span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'relu1'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv2'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv3'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv4'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_conv_bn_relu</span><span class="p">(</span><span class="s">'conv5'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">512</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">is_training</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">512</span> <span class="o">*</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">10</span><span class="p">])</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_fc</span><span class="p">(</span><span class="s">'fc1'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">512</span> <span class="o">*</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">512</span><span class="p">],</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_fc</span><span class="p">(</span><span class="s">'fc2'</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="p">[</span><span class="mi">512</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="k">return</span> <span class="n">out</span>
</code></pre></div></div>
<p>After ten hours of training, the fine model reached an accuracy of 0.5 pixels on the test set and about 1 pixel in live play, with an inference time of 0.2 s on the test machine.</p>
<h4 id="cascade">cascade</h4>
<div align="center">
<img align="center" src="/extra/wechat_jump/we_jump_nn.png" width="750" alt="nn" />
</div>
<p><br /></p>
<div align="center">
<img align="center" src="/extra/wechat_jump/nn1.png" width="250" alt="nn" />
<img align="center" src="/extra/wechat_jump/nn2.png" width="250" alt="nn" />
<img align="center" src="/extra/wechat_jump/nn3.png" width="250" alt="nn" />
</div>
<p><br />
The overall accuracy is about 1 pixel, and the total time is 0.6 seconds.</p>
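How the two models compose at inference time can be sketched as follows; `coarse_net` and `fine_net` stand in for the trained regressors, and the clamping of the patch to the image bounds is my own assumption.

```python
import numpy as np

def cascade_predict(img, coarse_net, fine_net, size=320):
    """Coarse-to-fine inference: the coarse net regresses a rough target
    position on the full cropped screenshot, a patch around it is cut out,
    and the fine net refines the position inside that patch."""
    th, tw = coarse_net(img)                             # rough estimate, ~10 px
    top  = max(0, min(int(th) - size // 2, img.shape[0] - size))
    left = max(0, min(int(tw) - size // 2, img.shape[1] - size))
    patch = img[top:top + size, left:left + size]
    ph, pw = fine_net(patch)                             # refined, ~1 px
    return top + ph, left + pw                           # back to image coordinates
```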
<h3 id="总结">总结</h3>
<p>For this problem, we combined AI and CV techniques into a complete solution for both iOS and Android devices, one that anyone with a little technical background can successfully set up and run. We proposed three algorithms, Multiscale-Search, Fast-Search, and CNN Coarse-to-Fine, which complement each other to deliver fast and accurate detection and jumping; after lightly tuning the jump parameters for their own device, a user can get close to a perpetual motion machine. At this point we are tempted to declare that our work terminates this problem: game over, WeChat Jump!</p>
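The per-device "jump parameter" mentioned above is the coefficient mapping the player-to-target pixel distance to a touch press duration. The linear form and the value 1.35 below are illustrative assumptions, not the repo's exact numbers.

```python
import math

def press_duration(player, target, coeff=1.35):
    """Map the pixel distance between player and target to a press time in ms.
    `coeff` is the per-device parameter users tune (assumed linear relation)."""
    dist = math.hypot(player[0] - target[0], player[1] - target[1])
    return int(dist * coeff)
```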
<p><a href="https://github.com/Prinsphield/Wechat_AutoJump">代码链接</a></p>
<p><strong>Friendly reminder: gaming in moderation exercises the brain, while gaming addiction harms the body. The fun of a technical approach lies in the technique itself, not in the game ranking. Please treat leaderboards and the techniques presented here rationally, and use games to entertain your life.</strong></p>
<p><strong>Disclaimer: the algorithm and open-source code presented here are released under the MIT license; anyone who uses the algorithm for commercial purposes bears full responsibility for the consequences.</strong></p>
<h3 id="contributors">Contributors</h3>
<ul>
<li>Xiao Taihong <a href="mailto:xiaotaihong@126.com">xiaotaihong@126.com</a></li>
<li>An Jie <a href="mailto:jie.an@pku.edu.cn">jie.an@pku.edu.cn</a></li>
</ul>

<p><strong>A Mathematical View towards CycleGAN</strong> (2017-07-05)</p>

<p>Recently, CycleGAN has become a very popular image translation method, arousing many people's interest. Lots of people are busy reproducing it or designing interesting image applications by swapping in their own training data. However, few people have thought about its limitations, even though the original <a href="https://arxiv.org/pdf/1703.10593.pdf">paper</a> gave some discussion of them.</p>
<p>A failure case of CycleGAN was given by the author himself.</p>
<div align="center">
<img src="https://junyanz.github.io/CycleGAN/images/failure_putin.jpg" width="450" alt="A failure case of CycleGAN." />
</div>
<div align="center">
A failure case of CycleGAN.
</div>
<h2 id="limitations-of-cyclegan">Limitations of CycleGAN</h2>
<p>We can actually obtain lots of information from this picture.</p>
<ul>
<li>CycleGAN is not able to disentangle the object from the background. Putin and the background are both zebra-striped at the same time, as shown in the picture.</li>
<li>CycleGAN is suitable for global image style transfer, but weak at object transfiguration.
What is object transfiguration? We want to segment a certain part of an image and seamlessly implant it into another image.
For example, generating a smiling face for a person by imitating the smile of another smiling face.</li>
</ul>
<div align="center">
<img src="https://raw.githubusercontent.com/Prinsphield/GeneGAN/master/images/cross.jpg" width="450" alt="Smile Transfiguration" />
</div>
<div align="center">
Smile transfiguration.
</div>
<ul>
<li>Lack of diversity in generated images, also known as the single-modal phenomenon. For example, suppose we use CycleGAN to translate between two domains: facial images without eyeglasses and facial images with eyeglasses. CycleGAN can generate novel images with eyeglasses from images without them, but the generated eyeglasses almost always look like the same black sunglasses. This phenomenon was observed by Shuchang Zhou, who was very prescient (see our paper <a href="https://arxiv.org/abs/1705.04932">GeneGAN</a>). I believe Jun-Yan Zhu, the author of CycleGAN, has also noticed this limitation. Otherwise, he would not have published another paper, <a href="https://papers.nips.cc/paper/6650-toward-multimodal-image-to-image-translation.pdf">BiCycleGAN</a>, to generate multi-modal images at NIPS 2017, right after the ICCV 2017 submission deadline, though the idea of BiCycleGAN is somewhat simple and its training process is very similar to <a href="https://arxiv.org/abs/1611.06355">IcGAN</a>.</li>
<li>Weak at learning the shape of objects. It is hopeless to use CycleGAN to generate a round object from a square one.</li>
</ul>
<h2 id="the-reason-behind-it">The Reason Behind It</h2>
<p>What causes these limitations of CycleGAN? Recall its structure: there are two image domains $X$ and $Y$.
The ultimate goal of CycleGAN is to learn two maps $G$ and $F$, where $G$ maps domain $X$ to domain $Y$, and $F$ maps
domain $Y$ to domain $X$.</p>
<div align="center">
<img src="/extra/cyclegan/cyclegan_framework.png" width="250" />
</div>
<div align="center">
CycleGAN framework.
</div>
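<p>The two maps and the cycle constraint can be written down in a few lines. The following is a toy Python sketch with 1-D stand-ins for $G$ and $F$ (the real generators are convolutional networks); it only illustrates the L1 cycle-consistency term, not the adversarial losses.</p>

```python
import numpy as np

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """L1 cycle-consistency term: F(G(x)) should reconstruct x and
    G(F(y)) should reconstruct y."""
    loss_x = np.mean(np.abs(F(G(x_batch)) - x_batch))
    loss_y = np.mean(np.abs(G(F(y_batch)) - y_batch))
    return loss_x + loss_y

# Toy check: if G and F are exact inverses, the cycle loss vanishes --
# exactly the situation in which X and Y must be homeomorphic.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
x = np.linspace(-1.0, 1.0, 5)
y = np.linspace(0.0, 3.0, 5)
loss = cycle_consistency_loss(G, F, x, y)
```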
<p>Recalling a simple result in topology, <a href="https://en.wikipedia.org/wiki/Invariance_of_domain">Invariance of domain</a> theorem states</p>
<blockquote>
<p>If $U$ is an open subset of $\mathbb{R}^n$ and $f:U\to \mathbb{R}^n$ is an injective continuous map, then $V = f(U)$ is open
and $f$ is a homeomorphism between $U$ and $V$.</p>
</blockquote>
<p>We will leave out the proof, since it uses tools from algebraic topology.
The theorem has an important consequence:</p>
<blockquote>
<p>$\mathbb{R}^n$ can not be homeomorphic to $\mathbb{R}^m$ if $m\neq n$.
Indeed, no non-empty open subset of $\mathbb{R}^n$ can be homeomorphic to any open subset of $\mathbb{R}^m$ if $m\neq n$.</p>
</blockquote>
<p>Back to our discussion of CycleGAN: $G$ and $F$ are generally neural networks with auto-encoder structures, and therefore continuous
(a composition of continuous maps is continuous).
The cycle consistency guarantees that $G$ and $F$ are inverse to each other,
so domain $X$ and domain $Y$ are homeomorphic.
By the invariance of domain theorem, the intrinsic dimensions of $X$ and $Y$ must then be the same.</p>
<p><strong>This is the fundamental reason for its limitations!!!</strong>
Because the intrinsic dimensions of $X$ and $Y$ may not be the same.</p>
<p>For example, suppose we want to do image translation between a domain $X$ of facial images with eyeglasses and a domain $Y$ of facial images without eyeglasses. Let's compare the intrinsic dimensions of the two domains. The intrinsic dimension of domain $Y$ comes from the variety of facial images.
However, the intrinsic dimension of domain $X$ is larger, because the eyeglasses themselves also vary, which adds extra intrinsic dimensions.</p>
<p>In conclusion, the limitations of CycleGAN come from the difference in intrinsic dimension between the source and target image domains.
If you are interested, please read our paper <a href="https://arxiv.org/abs/1705.04932">GeneGAN</a> for an alternative method.</p>

<h2 id="an-overview-on-optimization-algorithms-in-deep-learning-1">An Overview on Optimization Algorithms in Deep Learning 1 (2016-02-04)</h2>

<p>Recently, I have been learning about optimization algorithms in deep learning, and I think it is worthwhile to sum them up, so I plan to write a series of articles about them. This article covers the basic optimization algorithms used in machine learning and deep learning.</p>
<h2 id="gradient-descent">Gradient Descent</h2>
<p>Gradient descent is the most basic gradient-based algorithm for training a deep model. Once we have the function to optimize, say $L$, especially when it is convex, we can easily apply this method to approach its minimum. The method updates the model parameters $\theta$ with a small step in the direction opposite to the gradient of the objective function. For the case of supervised learning with data pairs $(x^{(i)}, y^{(i)})$, we have</p>

\[\theta\leftarrow \theta - \epsilon\nabla_\theta \sum_{i} L(f(x^{(i)};\theta), y^{(i)}).\]
<p>where $\epsilon$ is the learning rate, or step size, which controls how far the parameters move in each iteration.</p>
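<p>As a concrete illustration, here is a minimal Python sketch of the update rule on the toy convex loss $L(\theta)=(\theta-3)^2$; the toy loss and function names are ours, not part of any library.</p>

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient
    with a fixed learning rate (the epsilon in the update rule above)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

<p>On this convex loss the iterate contracts toward the minimizer $\theta=3$ by a constant factor per step.</p>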
<h2 id="stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</h2>
<p>In spite of its appealing convergence properties, batch gradient descent is rarely used in machine learning, because the cost of computing the sum of gradients over every sample becomes enormous when the number of training samples grows large. A computationally efficient alternative is stochastic gradient descent, in which we use a stochastic estimator of the gradient to perform the update. Based on the assumption that all samples are i.i.d., we sample one or a small subset of $m$ training examples and compute their gradient. Then we use this gradient to update the parameter $\theta$.</p>
<p>When $m=1$, this algorithm is sometimes called <em>online gradient descent</em>. When $m>1$, it is sometimes called <em>minibatch SGD</em>. The algorithm is shown below.</p>
<p><img src="/extra/optimization/SGD.jpg" /></p>
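<p>A minimal sketch of the minibatch variant, again on a toy problem (estimating the mean of the data by minimizing a per-sample squared error); the sampling scheme and hyperparameters are illustrative.</p>

```python
import random

def sgd(data, grad, theta0, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Minibatch SGD: at each step draw m examples and update with the
    average gradient over the minibatch."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        g = sum(grad(theta, x) for x in batch) / batch_size
        theta -= lr * g
    return theta

# the per-sample loss (theta - x)^2 has gradient 2 * (theta - x),
# so SGD drives theta toward the sample mean of the data
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
theta_hat = sgd(data, lambda t, x: 2.0 * (t - x), theta0=0.0)
```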
<p>Both GD and SGD face the same dilemma: if the learning rate is too large, the weights travel back and forth across the ravine and fail to converge to the minimum; if it is too small, training takes a long time or converges to a poor local minimum. Thus we need to adjust the learning rate accordingly: if the error is falling fairly consistently but slowly, increase the learning rate; if the error keeps getting worse or oscillates wildly, reduce it. Are there algorithms that adapt the learning rate automatically?</p>
<h2 id="momentum-method">Momentum Method</h2>
<p>One method of speeding up training is the momentum method, perhaps the simplest extension to SGD that has been successfully used for decades. The intuition behind momentum, as the name suggests, comes from a physical interpretation of the optimization process. Imagine a ball rolling on a slope: its track is a combination of its velocity and the instantaneous force pulling it downhill, and the momentum plays the role of accumulating gradient contributions.</p>
<p><img src="/extra/optimization/momentum.jpg" width="400" height="300" alt="momentum" /></p>
<p>Back to our optimization process, we want to accelerate progress along dimensions in which gradient consistently point in the same direction and to slow progress along dimensions where the sign of the gradient continues to change. This is done by keeping track of past parameter updates with an exponential decay:</p>
\[\begin{align*}
\Delta \theta \leftarrow \rho\Delta \theta + \eta g\\
\theta \leftarrow \theta - \Delta \theta
\end{align*}\]
<p>which is mathematically equivalent to</p>
\[\begin{align*}
\Delta \theta \leftarrow \rho\Delta \theta - \eta g\\
\theta \leftarrow \theta + \Delta \theta
\end{align*}\]
<p>where $\rho$ is a constant controlling the decay of the previous parameter updates and $\eta$ is the learning rate. The algorithm is as follows.</p>
<p><img src="/extra/optimization/SGD-momentum.jpg" /></p>
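<p>The two update equations above translate directly into code. A minimal sketch on the same toy quadratic loss (our illustration, not a library routine):</p>

```python
def momentum_sgd(grad, theta0, lr=0.05, rho=0.9, steps=300):
    """Momentum update as written above: delta accumulates an
    exponentially decaying sum of past gradients."""
    theta, delta = theta0, 0.0
    for _ in range(steps):
        delta = rho * delta + lr * grad(theta)
        theta = theta - delta
    return theta

theta_star = momentum_sgd(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```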
<p>This gives a nice improvement over SGD when optimizing difficult cost surfaces. The issue with SGD, where a high learning rate causes oscillations back and forth across the valley, is effectively resolved: when the sign of the gradient flips, the momentum term damps down the updates and slows progress across the valley, while progress along the valley is unaffected.</p>
<h2 id="nesterov-momentum">Nesterov Momentum</h2>
<p>The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. Ilya Sutskever (2012, unpublished) suggested a form of momentum that often works better, called Nesterov momentum: first make a big jump in the direction of the previously accumulated gradient, then measure the gradient where you end up and make a correction. In the stochastic gradient case, however, Nesterov momentum does not improve the rate of convergence.</p>
<p><img src="/extra/optimization/Nesterov Momentum.jpg" alt="A picture of the Nesterov method" /></p>
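<p>The jump-then-correct order can be sketched as follows; this is an illustrative toy implementation of the idea, not Sutskever's original code.</p>

```python
def nesterov_sgd(grad, theta0, lr=0.05, rho=0.9, steps=300):
    """Nesterov momentum sketch: first jump along the accumulated
    velocity, then measure the gradient at the lookahead point and
    use it as the correction."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        lookahead = theta - rho * v         # big jump along previous velocity
        v = rho * v + lr * grad(lookahead)  # correction at the lookahead point
        theta = theta - v
    return theta

theta_star = nesterov_sgd(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```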
<p>The algorithm is as follows.</p>
<p><img src="/extra/optimization/SGD-Nesterov-momentum.jpg" /></p>

<h2 id="an-overview-on-optimization-algorithms-in-deep-learning-2">An Overview on Optimization Algorithms in Deep Learning 2 (2016-02-04)</h2>

<div style="display:none">
$$\DeclareMathOperator{\E}{E}$$
$$\DeclareMathOperator{\RMS}{RMS}$$
</div>
<p>In the <a href="/posts/2016/02/overview_opt_alg_deep_learning1/">last article</a>,
I introduced several basic optimization algorithms. However, those algorithms rely on a hyperparameter, the learning rate $\eta$, which has a significant impact on model performance. Although momentum can go some way toward alleviating this, it does so by introducing another hyperparameter $\rho$ that may be just as difficult to set as the original learning rate. In light of this, it is natural to look for ways to set the learning rate automatically.</p>
<h2 id="adagrad">AdaGrad</h2>
<p>AdaGrad algorithm adapts the learning rates of all model parameters by scaling them inversely proportional to the accumulated sum of squared partial derivatives over all training iterations. The update rule for AdaGrad is as follows:</p>
\[\Delta \theta = -{\eta\over \sqrt{\sum_{\tau=1}^t g_{\tau}^2}}g_t\]
<p>Here the denominator computes the $l^2$ norm of all previous gradients on a per-dimension basis and $\eta$ is a global learning rate shared by all dimensions.</p>
<p><img src="/extra/optimization/AdaGrad.jpg" /></p>
<p>The AdaGrad algorithm relies only on first-order information but has some properties of second-order methods and of annealing. Since the dynamic rate grows with the inverse of the gradient magnitudes, large gradients get smaller learning rates and small gradients get larger learning rates. This nice property, as in second-order methods, makes progress along each dimension even out over time, which is very useful in deep learning models, because the scale of the gradients in each layer can vary by several orders of magnitude. Additionally, the denominator of the scaling coefficient has the same effect as annealing, reducing the learning rate over time.</p>
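<p>A minimal one-parameter sketch of the AdaGrad update (the toy loss and constants are ours):</p>

```python
def adagrad(grad, theta0, lr=0.5, steps=500, eps=1e-8):
    """AdaGrad: divide the global rate by the root of the accumulated
    squared gradients; eps guards against division by zero."""
    theta, acc = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        acc += g * g                        # ever-growing accumulator
        theta -= lr * g / (acc ** 0.5 + eps)
    return theta

theta_star = adagrad(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```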
<h2 id="rmsprop">RMSprop</h2>
<p>The RMSprop algorithm addresses a deficiency of AdaGrad by changing the gradient accumulation into an exponentially weighted moving average. In deep networks, directions in parameter space that had strong partial derivatives early on may flatten out later, yet AdaGrad's accumulated history would keep their learning rates small forever; RMSprop therefore introduces a new hyperparameter $\rho$ that controls the length scale of the moving average to prevent this from happening.</p>
<p><img src="/extra/optimization/RMSprop.jpg" /></p>
<p>RMSprop with Nesterov momentum algorithm is shown below.</p>
<p><img src="/extra/optimization/RMSprop-Nesterov-momentum.jpg" /></p>
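<p>The only change relative to the AdaGrad sketch is the accumulator: a toy illustration (hyperparameters are ours), not a library implementation.</p>

```python
def rmsprop(grad, theta0, lr=0.05, rho=0.9, steps=500, eps=1e-8):
    """RMSprop: replace AdaGrad's ever-growing sum with an exponentially
    weighted moving average of squared gradients, controlled by rho."""
    theta, avg = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        avg = rho * avg + (1.0 - rho) * g * g
        theta -= lr * g / (avg ** 0.5 + eps)
    return theta

theta_star = rmsprop(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```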
<h2 id="adam">Adam</h2>
<p>Adam is another adaptive learning rate algorithm, presented below. It can be seen as a combination of RMSprop and momentum. Adam includes bias corrections to the estimates of both the first-order and second-order moments, to compensate for their initialization at zero early in training.</p>
<p><img src="/extra/optimization/Adam.jpg" /></p>
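<p>The moment estimates and their bias corrections look like this in a minimal sketch (the betas are the commonly cited defaults; the toy loss is ours):</p>

```python
def adam(grad, theta0, lr=0.05, beta1=0.9, beta2=0.999, steps=1000, eps=1e-8):
    """Adam sketch: momentum-style first moment plus RMSprop-style second
    moment, each bias-corrected for its initialization at zero."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
        theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta

theta_star = adam(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```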
<h2 id="adadelta">AdaDelta</h2>
<p>This method was derived from AdaGrad in order to improve upon the two main drawbacks of the method:</p>
<ol>
<li>the continual decay of learning rates throughout training;</li>
<li>the need for a manually selected global learning rate.</li>
</ol>
<p>Instead of accumulating the sum of squared gradients over all time, AdaDelta restricts the window of accumulated past gradients to some fixed size $w$, rather than size $t$ where $t$ is the current iteration as in AdaGrad. Since storing $w$ previous squared gradients is inefficient, the method implements this accumulation as an exponentially decaying average of the squared gradients. Assume at time $t$ this running average is $\E(g^2)_t$; then we compute</p>
\[\E(g^2)_t = \rho \E(g^2)_{t-1} +(1-\rho)g_t^2\]
<p>Since we require the square root of this quantity in the parameter updates, this effectively becomes the RMS of previous squared gradients up to time $t$</p>
\[\RMS(g)_t=\sqrt{\E(g^2)_t+\epsilon}\]
<p>The resulting parameter update is then</p>
\[\Delta\theta = -{\eta\over \RMS(g)_t}g_t \tag 1\label{eq-1}\]
<p>Since the RMS of the previous gradients already appears in the denominator of \eqref{eq-1}, we consider a measure of the $\Delta\theta$ quantity for the numerator. Computing the exponentially decaying RMS over a window of size $w$ of previous $\Delta\theta$ gives the AdaDelta update:</p>
\[\Delta\theta=-{\RMS(\Delta\theta)_{t-1}\over \RMS(g)_t}g_t\]
<p>where the same constant $\epsilon$ is added to the numerator RMS as well, to ensure that progress continues to be made even when previous updates become small.</p>
<p><img src="/extra/optimization/AdaDelta.jpg" /></p>
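<p>Putting the two running averages together gives the following toy sketch of the AdaDelta update; note that no global learning rate appears. The constants and toy loss are illustrative choices of ours.</p>

```python
def adadelta(grad, theta0, rho=0.95, eps=1e-6, steps=2000):
    """AdaDelta sketch: RMS of past squared gradients in the denominator,
    RMS of past squared updates in the numerator."""
    theta = theta0
    eg2, ed2 = 0.0, 0.0   # running averages E(g^2) and E(dtheta^2)
    for _ in range(steps):
        g = grad(theta)
        eg2 = rho * eg2 + (1.0 - rho) * g * g
        delta = -((ed2 + eps) ** 0.5) / ((eg2 + eps) ** 0.5) * g
        ed2 = rho * ed2 + (1.0 - rho) * delta * delta
        theta += delta
    return theta

theta_star = adadelta(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```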
<p>Extra boon: a PDF cheat sheet containing all these algorithms can be downloaded <a href="/extra/algorithms.pdf">here</a> for reference.</p>
<h2 id="a-summary-of-gfw-circumvention-methods">翻墙方法总结 (A Summary of GFW Circumvention Methods, 2016-02-02)</h2>

<h2 id="hosts文件位置">Location of the hosts File</h2>
<p>First, locate the hosts file for your operating system:</p>
<ul>
<li>Windows: <code class="language-plaintext highlighter-rouge">C:\windows\system32\drivers\etc</code></li>
<li>Mac: <code class="language-plaintext highlighter-rouge">/private/etc/hosts</code></li>
<li>Linux: <code class="language-plaintext highlighter-rouge">/etc/hosts</code></li>
<li>Android: <code class="language-plaintext highlighter-rouge">/system/etc/hosts</code></li>
<li>iOS: <code class="language-plaintext highlighter-rouge">/etc/hosts</code></li>
</ul>
<p>Note that on Linux you need <code class="language-plaintext highlighter-rouge">sudo</code> to edit the file, and it is best to keep the original localhost entries. Android users need root access, and iOS users need to jailbreak.</p>
<h2 id="如何获得hosts">How to Obtain hosts Files</h2>
<p>Here are some websites that continually provide updated hosts files:</p>
<ul>
<li><a href="http://laod.cn/hosts/2015-google-hosts.html">老D博客 (Laod blog)</a></li>
<li><a href="https://serve.netsh.org/pub/ipv4-hosts/">Netsh</a>: select the sites you want to unblock, and a box below lets you copy the resulting hosts entries. The site also provides IPv6 hosts; if you are on a university (education) network, try the IPv6 version.</li>
<li><a href="http://pan.baidu.com/s/1kUvoncF">hosts configuration tool</a> (password: khn8), convenient for the lazy.</li>
</ul>
<h2 id="检查hosts中地址是否有效">Checking Whether the Addresses in hosts Are Usable</h2>
<p>Once you have obtained a hosts file, simply replace your current one with it. How do you know whether an address in the hosts file is usable? Try the <code class="language-plaintext highlighter-rouge">ping</code> command to test whether that host can be reached. For example, on Windows press Win+R to open CMD, then type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ping [ip address]
</code></pre></div></div>
<p>and you will know whether that proxy IP address in the hosts file works or not.</p>
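<p>If you prefer scripting the check, a small Python helper can test many addresses at once. This is our illustrative sketch, not an official tool: it only checks TCP reachability on one port, which is a rougher signal than ping.</p>

```python
import socket

def host_reachable(ip, port=443, timeout=3.0):
    """Try to open a TCP connection to the given IP (port 443 = HTTPS).
    Returns True if the connection succeeds within the timeout; this
    checks reachability only, not whether the content is usable."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```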
<h2 id="其他工具和资源">Other Tools and Resources</h2>
<ul>
<li><a href="http://pan.baidu.com/s/1mgR0haK">自由门 (Freegate) v7.39</a> (password: fng7), for Windows; Ubuntu users who have installed wine can also use it</li>
<li><a href="http://pan.baidu.com/s/1dEc339Z">xskywalker browser</a> (password: vzq9), which looks just like Chrome (it is in fact modified from Chrome) with circumvention preconfigured</li>
<li><a href="http://pan.baidu.com/s/1pKlCHHx">火狐范免费版吉阿姨免配置包 (a zero-configuration Firefox package)</a> (password: pivz), for Windows, very convenient</li>
<li><a href="http://pan.baidu.com/share/link?uk=1678373798&shareid=148812518&third=0&adapt=pc&fr=ftw">fqrouter</a>, for Android, requires root</li>
<li><a href="https://github.com/shadowsocks/shadowsocks-android/releases">shadowsocks</a>, for Android, requires root</li>
<li><a href="http://pan.baidu.com/s/1i4ylk4t">VPN Master</a> (password: gdod), for Android, no root required</li>
<li><a href="https://itunes.apple.com/cn/app/tian-xingvpn-wang-luo-jia/id1071016473?mt=8">天行VPN (Tianxing VPN)</a>, for iOS, very easy to use and free forever</li>
<li><a href="https://github.com/bannedbook/fanqiang/wiki/Chrome%E4%B8%80%E9%94%AE%E7%BF%BB%E5%A2%99%E5%8C%85">Chrome one-click circumvention package</a></li>
<li><a href="https://github.com/bannedbook/fanqiang/wiki/%E7%81%AB%E7%8B%90firefox%E4%B8%80%E9%94%AE%E7%BF%BB%E5%A2%99%E5%8C%85">Firefox one-click circumvention package</a></li>
<li><a href="http://www.zhiyanblog.com/goagent-chrome-switchyomega-proxy-2015-latest.html">Circumventing with GoAgent + Chrome + SwitchyOmega</a></li>
</ul>

<h2 id="maximum-likelihood-and-bayes-estimation">Maximum Likelihood and Bayes Estimation (2016-01-29)</h2>

<div style="display:none">
$$\DeclareMathOperator{\E}{E}$$
$$\DeclareMathOperator{\KL}{KL}$$
$$\DeclareMathOperator{\Var}{Var}$$
$$\DeclareMathOperator{\Bias}{Bias}$$
</div>
<p>As we know, maximum likelihood estimation (MLE) and Bayes estimation (BE) are two methods for parameter estimation in machine learning. They represent different viewpoints but are closely interconnected with each other. In this article, I would like to discuss their differences and connections.</p>
<h2 id="maximum-likelihood-estimation">Maximum Likelihood Estimation</h2>
<p>Consider a set of $N$ examples $\mathscr{X}=\{x^{(1)},\ldots, x^{(N)}\}$ drawn independently from the true but unknown data generating distribution $p_{data}(x)$. Let $p_{model}(x; \theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{model}(x;\theta)$ maps any $x$ to a real number estimating the true probability $p_{data}(x)$.</p>
<p>The maximum likelihood estimator for $\theta$ is then defined by</p>
\[\theta_{ML} = \mathop{\arg\max}_\theta p_{model}(\mathscr{X};\theta) = \mathop{\arg\max}_\theta \prod_{i=1}^N p_{model}(x^{(i)};\theta)\]
<p>For convenience, we usually maximize its logarithm instead:</p>
\[\theta_{ML} = \mathop{\arg\max}_\theta \sum_{i=1}^N \log p_{model}(x^{(i)};\theta)\]
<p>Since rescaling the cost function does not change the result of $\mathop{\arg\max}$, we can divide by $N$ to obtain a formula expressed as an expectation:</p>
\[\theta_{ML} = \mathop{\arg\max}_\theta \E_{x\sim \hat{p}_{data}}\log p_{model}(x;\theta)\]
<p>Maximizing a function is equivalent to minimizing its negative, thus we have</p>
\[\theta_{ML} = \mathop{\arg\min}_\theta -\E_{x\sim \hat{p}_{data}}\log p_{model}(x;\theta)\]
<p>One way to interpret MLE is to view it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution, with the degree of dissimilarity between the two distributions measured by the KL divergence. The KL divergence is given by</p>
\[\KL(\hat{p}_{data}||p_{model})=\E_{x\sim \hat{p}_{data}}(\log \hat{p}_{data}(x) - \log p_{model}(x;\theta)).\]
<p>Since the expectation of $\log \hat{p}_{data}(x)$ does not depend on $\theta$, we can see that the maximum likelihood principle attempts to minimize the KL divergence.</p>
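<p>A tiny numerical illustration of the principle (the dataset and helper function are ours): for a Gaussian model with known variance, the average log-likelihood is maximized exactly at the sample mean.</p>

```python
import math

def gaussian_avg_loglik(data, mu, sigma=1.0):
    """Average log-likelihood of the data under N(mu, sigma^2); maximizing
    this over mu is the MLE objective discussed above."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data) / len(data)

data = [1.2, 0.7, 2.3, 1.8, 1.0]
mu_ml = sum(data) / len(data)   # the closed-form MLE of the mean
```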
<h2 id="bayes-estimation">Bayes Estimation</h2>
<p>As discussed above, the frequentist perspective is that the true parameter $\theta$ is fixed but unknown, while the MLE $\theta_{ML}$
is a random variable, on account of it being a function of the data. The Bayesian perspective on statistics is quite different:
the data are directly observed rather than viewed as random, and a prior probability distribution $p(\theta)$ reflects the
knowledge we have about the parameter before seeing any data. Now that we have observed a set of data samples
$\mathscr{X}=\{x^{(1)},\ldots,x^{(N)}\}$, we can update our belief about a certain value of $\theta$ by combining
the prior with the conditional distribution $p(\mathscr{X}|\theta)$ via Bayes' formula</p>
\[p(\theta|\mathscr{X}) = {p(\mathscr{X}|\theta)p(\theta)\over p(\mathscr{X})},\]
<p>which is the posterior probability.</p>
<p>Unlike what we did in MLE, Bayes estimation works with the full distribution over $\theta$.
The quintessential idea of Bayes estimation is to minimize the conditional risk, or expected loss, $R(\hat{\theta}|X)$, given by</p>
\[R(\hat{\theta}|X) = \int_\Theta \lambda(\hat{\theta},\theta)p(\theta|X)d\theta,\]
<p>where $\Theta$ is the parameter space of $\theta$. If we take the loss function to be quadratic function, i.e. $\lambda(\hat{\theta},\theta)=(\theta-\hat{\theta})^2$, then the bayes estimation of $\theta$ is</p>
\[\theta_{BE} = \E(\theta|X) = \int_\Theta \theta p(\theta|X)d\theta.\]
<p>The proof is easy.</p>
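<p>To spell out the easy proof: expand the quadratic risk and set its derivative with respect to $\hat{\theta}$ to zero,</p>

\[\begin{align*}
{\partial \over \partial\hat{\theta}} R(\hat{\theta}|X)
= {\partial \over \partial\hat{\theta}} \int_\Theta (\theta-\hat{\theta})^2 p(\theta|X)d\theta
= -2\int_\Theta (\theta-\hat{\theta}) p(\theta|X)d\theta
= -2\left(\E(\theta|X) - \hat{\theta}\right) = 0,
\end{align*}\]

<p>so $\hat{\theta} = \E(\theta|X)$, using the fact that the posterior integrates to $1$; the second derivative equals $2>0$, so this stationary point is indeed a minimum.</p>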
<p>It is worth mentioning that in Bayes learning, we need not estimate $\theta$ at all. Instead, we can give the probability distribution of a new sample $x$ directly. For example, after observing $N$ data samples, the predictive distribution of the next example $x^{(N+1)}$ is given by</p>
\[p(x^{(N+1)}|\mathscr{X}) = \int p(x^{(N+1)}|\theta)p(\theta|\mathscr{X})d\theta.\]
<h2 id="maximum-a-posteriori-estimation">Maximum A Posteriori Estimation</h2>
<p>A more common way to estimate parameters is the so-called maximum a posteriori (MAP) method. The MAP estimate chooses the point of maximal posterior probability:</p>
\[\theta_{MAP} = \mathop{\arg\max}_\theta p(\theta|\mathscr{X}) = \mathop{\arg\max}_\theta \log p(\mathscr{X}|\theta) + \log p(\theta)\]
<h2 id="relations">Relations</h2>
<p>As discussed above, maximizing the likelihood function is equivalent to minimizing the KL divergence between the model distribution and the empirical distribution. In a Bayesian view of this, we can say that MLE is equivalent to minimizing the empirical risk when the loss function is taken to be the logarithmic loss (cross-entropy loss).</p>
<p>The advantage brought by introducing a prior into the MAP estimate is that it leverages additional information beyond the observed data. This additional information helps reduce the variance of the MAP point estimate in comparison to the MLE, at the expense of increased bias. A good example helps illustrate this idea.</p>
<p><strong>Example: (Linear Regression)</strong> The problem is to find an appropriate $w$ such that the mapping defined by</p>
\[y=w^T x\]
<p>gives the best prediction of $y$ over the entire training set $\mathscr{X}=\{x^{(1)}, \ldots,x^{(N)}\}$. Stacking the samples as rows of $\mathscr{X}$ and expressing the prediction in matrix form,</p>

\[y= \mathscr{X} w\]
<p>Besides, let us assume the conditional distribution of $y$ given $w$ and $\mathscr{X}$ is a Gaussian distribution with mean vector $\mathscr{X} w$ and covariance matrix $I$.
In this case, the MLE gives the estimate</p>
\[\hat{w}_{ML} = (\mathscr{X}^T\mathscr{X})^{-1}\mathscr{X}^T y. \tag 1 \label{eq-1}\]
<p>We also assume the prior of $w$ is another Gaussian distribution parametrized by mean $0$ and variance matrix $\Lambda_0=\lambda_0I$. With the prior specified, we can now determine the posterior distribution over the model parameters.</p>
\[\begin{align*}
p(w|\mathscr{X},y) \propto p(y|\mathscr{X},w)p(w)\\
\propto \exp\left(-{1\over2}(y-\mathscr{X}w)^T(y-\mathscr{X}w)\right)\exp\left(-{1\over2}w^T\Lambda_0^{-1}w\right)\\
\propto \exp\left(-{1\over2}\left(-2y^T\mathscr{X}w + w^T\mathscr{X}^T\mathscr{X}w + w^T\Lambda_0^{-1}w\right)\right)\\
\propto \exp\left(-{1\over2}(w-\mu_N)^T\Lambda_N^{-1}(w-\mu_N)\right).
\end{align*}\]
<p>where</p>
\[\begin{align*}
\Lambda_N = (\mathscr{X}^T\mathscr{X} + \Lambda_0^{-1})^{-1}\\
\mu_N = \Lambda_N\mathscr{X}^Ty
\end{align*}\]
<p>Thus the MAP estimate of the $w$ becomes</p>
\[\hat{w}_{MAP} = (\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I)^{-1}\mathscr{X}^T y. \tag 2 \label{eq-2}\]
<p>Comparing \eqref{eq-2} with \eqref{eq-1}, we see that the MAP estimate amounts to adding, inside the parentheses of the MLE, a weighted term related to the variance of the prior distribution. Also, it is easy to show that the MLE is unbiased, i.e. $\E(\hat{w}_{ML})=w$, and that it has variance given by</p>
\[\Var(\hat{w}_{ML})=(\mathscr{X}^T\mathscr{X})^{-1}. \tag 3\label{eq-3}\]
<p>In order to derive the bias of the MAP estimate, we need to evaluate the expectation</p>
\[\begin{align*}
\E(\hat{w}_{MAP}) = E(\Lambda_N \mathscr{X}^Ty)\\
= \E(\Lambda_N \mathscr{X}^T(\mathscr{X}w + \epsilon))\\
= \Lambda_N(\mathscr{X}^T\mathscr{X}w) + \Lambda_N\mathscr{X}^T \E(\epsilon)\\
= (\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I)^{-1}\mathscr{X}^T\mathscr{X}w\\
= (\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I)^{-1} (\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I - \lambda_0^{-1}I)w\\
= (I - (\lambda_0\mathscr{X}^T\mathscr{X} + I)^{-1} )w
\end{align*}\]
<p>Thus, the bias can be derived as</p>
\[\Bias(\hat{w}_{MAP}) = \E(\hat{w}_{MAP}) - w = -(\lambda_0\mathscr{X}^T\mathscr{X} + I)^{-1}w.\]
<p>Therefore, we can conclude that the MAP estimate is biased. As the variance of the prior $\lambda_0 \to \infty$, the bias tends to $0$; this limit recovers the ML estimate, because a prior variance tending to $\infty$ implies that the prior distribution is asymptotically uniform. In other words, knowing nothing about the parameter, we assign the same probability to every value of $w$. Conversely, as $\lambda_0 \to 0$, the bias tends to $-w$, i.e. the estimate is shrunk all the way to the prior mean $0$.</p>
<p>Before computing the variance, we need to compute</p>
\[\begin{align*}
\E(\hat{w}_{MAP}\hat{w}_{MAP}^T) = \E(\Lambda_N \mathscr{X}^T yy^T \mathscr{X} \Lambda_N)\\
= \E(\Lambda_N \mathscr{X}^T (\mathscr{X}w+\epsilon)(\mathscr{X}w+\epsilon)^T \mathscr{X} \Lambda_N) \\
= \Lambda_N \mathscr{X}^T\mathscr{X}ww^T\mathscr{X}^T\mathscr{X}\Lambda_N + \Lambda_N \mathscr{X}^T\E(\epsilon\epsilon^T)\mathscr{X}\Lambda_N \\
= \Lambda_N \mathscr{X}^T\mathscr{X}ww^T\mathscr{X}^T\mathscr{X}\Lambda_N + \Lambda_N \mathscr{X}^T\mathscr{X}\Lambda_N \\
= \E(\hat{w}_{MAP})\E(\hat{w}_{MAP})^T + \Lambda_N \mathscr{X}^T\mathscr{X}\Lambda_N.
\end{align*}\]
<p>Therefore, the variance of the MAP estimate of our linear regression model is given by</p>
\[\begin{align*}
\Var(\hat{w}_{MAP}) = \E(\hat{w}_{MAP}\hat{w}_{MAP}^T) - \E(\hat{w}_{MAP})\E(\hat{w}_{MAP})^T\\
= \Lambda_N \mathscr{X}^T\mathscr{X}\Lambda_N\\
= (\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I)^{-1}\mathscr{X}^T \mathscr{X}(\mathscr{X}^T\mathscr{X} + \lambda_0^{-1}I)^{-1}. \tag 4 \label{eq-4}
\end{align*}\]
<p>It is perhaps difficult to compare \eqref{eq-3} and \eqref{eq-4} directly. But if we look at the one-dimensional case, it becomes easy to see that</p>

\[\Var(\hat{w}_{ML})={1\over \sum_{i=1}^N x_i^2} > {\lambda_0^2\sum_{i=1}^N x_i^2\over (1+\lambda_0\sum_{i=1}^N x_i^2)^2 } = \Var(\hat{w}_{MAP}),\]

<p>since $(1+\lambda_0\sum_{i=1}^N x_i^2)^2 > \lambda_0^2 (\sum_{i=1}^N x_i^2)^2$ for any $\lambda_0 > 0$.</p>
<p>From the above analysis, we can see that the MAP estimate reduces the variance at the expense of increasing the bias; the point of accepting this trade-off is to prevent overfitting.</p>
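<p>The comparison between the ML and MAP estimates is easy to verify numerically. Below is an illustrative NumPy check on synthetic data (the sample size, prior scale $\lambda_0$, and true $w$ are arbitrary choices of ours): the MAP estimate is the MLE shrunk toward the prior mean $0$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lambda0 = 200, 3, 1.0
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(size=n)   # unit-variance Gaussian noise

# MLE: ordinary least squares
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
# MAP: ridge-style estimate with prior covariance lambda0 * I
w_map = np.linalg.solve(X.T @ X + (1.0 / lambda0) * np.eye(d), X.T @ y)
```

<p>Since the MAP solution rescales each eigen-component of the MLE by $\lambda_i/(\lambda_i + \lambda_0^{-1}) < 1$, its norm is always smaller, matching the shrinkage-toward-the-prior story above.</p>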