Comparison of Three Gradient Descent Methods (BGD & SGD & MBGD)

2022-12-09

The commonly used gradient descent methods are the following three; their parameter-update rules are sketched right after the list:

Batch Gradient Descent (BGD)
Stochastic Gradient Descent (SGD)
Mini-Batch Gradient Descent (MBGD)
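
For a linear model x_b·θ fitted with the mean-squared-error loss used throughout this post, the three variants differ only in how many samples the gradient is averaged over per update. A sketch of the update rules, in the same notation as the code below (m samples, mini-batch size b, learning rate η):

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_b^{(i)}\theta - y^{(i)}\bigr)^2

\text{BGD:}\quad  \theta \leftarrow \theta - \eta \cdot \frac{2}{m}\, X_b^{T}\bigl(X_b\theta - y\bigr)

\text{SGD:}\quad  \theta \leftarrow \theta - \eta \cdot 2\, x_b^{(i)T}\bigl(x_b^{(i)}\theta - y^{(i)}\bigr)

\text{MBGD:}\quad \theta \leftarrow \theta - \eta \cdot \frac{2}{b}\, X_B^{T}\bigl(X_B\theta - y_B\bigr)

These correspond to J, dj, dj_sgd and dj_bgd in the implementations below; X_B and y_B denote one mini-batch of b rows.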

Simple example implementations

Data

import numpy as np

x = np.random.uniform(-3, 3, 100)
X = x.reshape(-1, 1)
y = x * 2 + 5 + np.random.normal(0, 1, 100)

BGD

A simple implementation of batch gradient descent:

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    def J(theta):
        # mean squared error of the current theta
        return np.mean((X_b.dot(theta) - y) ** 2)

    def dj(theta):
        # gradient of the loss over the full data set
        return X_b.T.dot(X_b.dot(theta) - y) * (2 / len(y))

    theta = initial_theta
    for i in range(1, int(n_iters)):
        gradient = dj(theta)            # compute the gradient on all samples
        last_theta = theta
        theta = theta - eta * gradient  # take a gradient step
        if np.absolute(J(theta) - J(last_theta)) < epsilon:
            break                       # stop when the loss barely changes
    return theta

Running it:

X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time gradient_descent(X_b, y, initial_theta, eta)
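
As a quick sanity check (not part of the original post), the same least-squares problem can also be solved in closed form; the result should land near the true parameters (intercept ≈ 5, slope ≈ 2, up to the injected noise) and gives a reference point for all three iterative methods:

# Sanity check (hypothetical addition): closed-form least-squares solution.
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_exact)  # expected to be close to [5, 2] because y = 2x + 5 + noise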

SGD

Here n_iters is the number of full passes (epochs) over the data set.

def s_gradient_descent(X_b, y, initial_theta, eta, batch_size=10, n_iters=10, epsilon=1e-8):
    def J(theta):
        return np.mean((X_b.dot(theta) - y) ** 2)

    # stochastic gradient: the gradient of the loss on a single sample
    def dj_sgd(X_b_i, y_i, theta):
        # return X_b.T.dot((X_b.dot(theta) - y)) * (2 / len(y))
        return 2 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

    theta = initial_theta
    for i in range(0, int(n_iters)):
        for j in range(batch_size, len(y), batch_size):
            gradient = dj_sgd(X_b[j, :], y[j], theta)  # gradient from one sample
            last_theta = theta
            theta = theta - eta * gradient             # take a gradient step
            if np.absolute(J(theta) - J(last_theta)) < epsilon:
                break                                  # stop when the loss barely changes
    return theta

The result is:

X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time s_gradient_descent(X_b, y, initial_theta, eta, n_iters=1) ## array([4.72619109, 3.08239321])
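
Note that s_gradient_descent above walks through every batch_size-th sample in a fixed order rather than drawing samples at random. A more textbook-style SGD loop (a sketch, not from the original post) picks a random sample index for every update:

def sgd_random(X_b, y, initial_theta, eta, n_iters=10):
    # Sketch of conventional SGD: one randomly drawn sample per update.
    theta = initial_theta
    m = len(y)
    for epoch in range(int(n_iters)):
        for _ in range(m):
            i = np.random.randint(m)                            # random sample index
            gradient = 2 * X_b[i] * (X_b[i].dot(theta) - y[i])  # single-sample gradient
            theta = theta - eta * gradient                      # take a gradient step
    return theta

In practice SGD is usually paired with a decaying learning rate, since a fixed eta keeps the iterate bouncing around the minimum.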

MBGD

Compared with the stochastic version, only dj changes slightly: batch_size sets the batch size, and dj now computes the gradient over batch_size samples at a time and averages it.

It has to be said that, given the same single pass over the data, mini-batch gradient descent is noticeably more accurate than stochastic gradient descent (the comparison sketch after the MBGD result below makes this concrete).

def b_gradient_descent(X_b, y, initial_theta, eta, batch_size=10, n_iters=10, epsilon=1e-8):
    def J(theta):
        return np.mean((X_b.dot(theta) - y) ** 2)

    # mini-batch gradient: the gradient averaged over batch_size samples
    def dj_bgd(X_b_b, y_b, theta):
        # return X_b.T.dot((X_b.dot(theta) - y)) * (2 / len(y))
        return X_b_b.T.dot(X_b_b.dot(theta) - y_b) * (2 / len(y_b))

    theta = initial_theta
    for i in range(0, int(n_iters)):
        for j in range(batch_size, len(y), batch_size):
            gradient = dj_bgd(X_b[j-batch_size:j, :], y[j-batch_size:j], theta)  # one mini-batch
            last_theta = theta
            theta = theta - eta * gradient  # take a gradient step
            if np.absolute(J(theta) - J(last_theta)) < epsilon:
                break                       # stop when the loss barely changes
    return theta

The result is:

X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time b_gradient_descent(X_b, y, initial_theta, eta, n_iters=1) ## array([4.4649369 , 2.27164876])
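
To make the accuracy claim from the MBGD section concrete, the one-epoch runs of SGD and MBGD can be compared against the parameters used to generate the data (a sketch, assuming the definitions above are in scope; the exact numbers vary with the random data):

# Hypothetical comparison: one epoch of SGD vs. MBGD on the same data.
true_theta = np.array([5.0, 2.0])   # intercept and slope used to generate y
for name, fn in [("SGD", s_gradient_descent), ("MBGD", b_gradient_descent)]:
    theta = fn(X_b, y, np.ones(X_b.shape[1]), eta, n_iters=1)
    print(name, theta, "error:", np.linalg.norm(theta - true_theta))

On the outputs shown above, MBGD ends up closer to (5, 2) than SGD after a single epoch, which is consistent with the earlier claim.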
