What is the difference between an MLP implemented from scratch and one implemented in PyTorch?


Following up on the question "How to update the learning rate in a two layered multi-layered perceptron?":

Consider the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

A simple

  • two layered Multi-Layered Perceptron (MLP) with a sigmoid activation between the layers, and
  • Mean Squared Error (MSE) as the loss function / optimization criterion.

If we train the model from scratch:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Set the training hyperparameters.
num_epochs = 5000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())
    #print(cost_delta)


# Visualize the losses
plt.plot(losses)
plt.show()

We see the loss drop drastically at epoch 0 and then quickly saturate:

[Image: loss curve of the hand-rolled model]

However, if we train a similar model with PyTorch, the training curve drops gradually before it saturates:

[Image: loss curve of the PyTorch model]

What is the difference between the from-scratch MLP and the PyTorch code?

Why do they reach convergence at different points?

Other than the weight initialization (np.random.rand() in the from-scratch code vs. the default torch initialization), I can't seem to see a difference between the models.

The PyTorch code:

from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

import matplotlib.pyplot as plt

torch.manual_seed(0)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# XOR gate inputs and outputs.
X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)


# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
print('Inputs Dim:', input_dim) # i.e. n=2 

num_data, output_dim = Y.shape
print('Output Dim:', output_dim) 
print('No. of Data:', num_data) # i.e. n=4

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
            # Use nn.Linear to get our simple perceptron.
            nn.Linear(input_dim, hidden_dim),
            # Use nn.Sigmoid to get our sigmoid non-linearity.
            nn.Sigmoid(),
            # Second layer neurons.
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )
model

# Initialize the optimizer
learning_rate = 0.3
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

# Initialize the stopping criteria
# For simplicity, just stop training after certain no. of epochs.
num_epochs = 5000 

losses = [] # Keeps track of the loses.

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Foward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: Optimizer take a step and update the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    losses.append(loss.data.item())


plt.plot(losses)

It might help if you could tell us how quickly the loss drops in the hand-rolled example. 2 epochs? 20? To me the obvious interpretation of the plots is that the learning rates are somehow very different. (As a separate note: MSE is probably not the appropriate error function here; in practice you would want negative log loss / cross-entropy loss for problems where the output lies in $[0,1]$, but for a problem this simple it doesn't really matter, and it certainly has little to do with the question itself.) - Mees de Vries
Your from-scratch code throws the following exception: ---> 60 layer1_error = np.dot(layer2_delta, W2.T) ..... ValueError: shapes (4,50) and (1,5) not aligned: 50 (dim 1) != 1 (dim 0) - cs95
@alvas At the end of training your loss should ideally be 0.0, right? Doesn't that mean there's something wrong with the PyTorch code? @coldspeed I was able to reproduce the OP's results from the from-scratch code. When you ran it, it seems layer2_delta ended up with shape (4,50)? (for me, layer2_delta.shape is (4,1)). - tel
Ok, it took a little while, but I've figured out how to get your hand-rolled code to produce exactly the same results as the PyTorch code. There are four significant differences to account for. They're all small tweaks, so it seems the core of your hand-rolled code is fine (aside from having to double the learning rate, which is probably a math error somewhere). - tel
1 Answer


List of differences between your hand-rolled code and the PyTorch code

It turns out there are quite a few differences between your hand-rolled code and the PyTorch code. Here's what I found, listed roughly in order of how much impact they have on the results:

  • Your code and the PyTorch code use two different functions to report the loss.
  • Your code and the PyTorch code set up the initial weights very differently. You mention this in your question, but it turns out to have a fairly significant effect on the results.
  • By default, the torch.nn.Linear layers add an extra bunch of "bias" weights to the model. Thus, the first layer of the PyTorch model effectively has 3x5 weights and the second layer 6x1 weights, while the layers in the hand-rolled code have 2x5 and 5x1 weights, respectively.
    • The bias seems to help the model learn and adapt somewhat faster. If you turn the bias off, it takes roughly twice as many training epochs for the PyTorch model to reach near-0 loss.
  • Oddly, the PyTorch model seems to use a learning rate that is effectively half of what you specify. Alternatively, there may be a stray factor of 2 somewhere in the hand-rolled math.

How to get identical results from the hand-rolled and PyTorch code

By carefully accounting for the above 4 factors, it is possible to achieve complete equivalence between the hand-rolled and the PyTorch code. With the right tweaks and settings, the two snippets produce identical results:

[Image: identical loss curves from the hand-rolled and PyTorch code]

The most important tweak - make the loss reporting functions match

The critical difference is that you end up using two completely different functions to measure the loss in the two code snippets:

  • In the hand rolled code, you measure the loss as layer2_error.mean(). If you unpack the variable, you can see that layer2_error.mean() is a somewhat screwy and meaningless value:

    layer2_error.mean()
    == np.dot(cost_delta, cost_error).mean()
    == np.dot(mse_derivative(layer2, Y), mse(layer2, Y)).mean()
    == np.sum(.5 * (layer2 - Y) * ((layer2 - Y)**2).mean()).mean()
    
  • On the other hand, in the PyTorch code the loss is measured in terms of the traditional definition of the mse, ie as the equivalent of np.mean((layer2 - Y)**2). You can prove this to yourself by modifying your PyTorch loop like so:

    def mse(x, y):
        return np.mean((x - y)**2)
    
    torch_losses = [] # Keeps track of the loses.
    torch_losses_manual = [] # for comparison
    
    # Step 2-4 of training routine.
    
    for _e in tqdm(range(num_epochs)):
        # Reset the gradient after every epoch. 
        optimizer.zero_grad() 
        # Step 2: Foward Propagation
        predictions = model(X)
    
        # Step 3: Back Propagation 
        # Calculate the cost between the predictions and the truth.
        loss = criterion(predictions, Y)
        # Remember to back propagate the loss you've computed above.
        loss.backward()
    
        # Step 4: Optimizer take a step and update the weights.
        optimizer.step()
    
        # Log the loss value as we proceed through the epochs.
        torch_losses.append(loss.data.item())
        torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))
    
    plt.plot(torch_losses, lw=5, label='torch_losses')
    plt.plot(torch_losses_manual, lw=2, label='torch_losses_manual')
    plt.legend()
    

Output:

[Image: torch_losses and torch_losses_manual curves coincide exactly]

Also important - use the same initial weights

PyTorch uses its own special routine for setting the initial weights, which produces very different results from np.random.rand. I haven't been able to replicate it exactly yet, but as the next best thing we can just hijack PyTorch. Here's a function that will get the same initial weights that a PyTorch model uses:

import torch
from torch import nn
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]
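
For reference (my note, not part of the original answer): nn.Linear stores each weight matrix with shape (out_features, in_features), while the hand-rolled code expects (inputs, outputs), so the tensors returned by torch_weights need to be transposed before use, exactly as done in the full listing further down:

W1, W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]
print(W1.shape, W2.shape)  # (2, 5) and (5, 1), matching the hand-rolled layout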

Finally - turn off all of the bias weights in PyTorch and double the learning rate

Eventually, you may want to implement bias weights in your own code. For now, we'll just turn the bias off in the PyTorch model and compare the results of the hand-rolled model to those of the bias-less PyTorch model.

Also, in order to make the results match, you need to double the learning rate of the PyTorch model. This effectively scales the results along the x axis (i.e. doubling the rate means that it takes half as many epochs to reach any specific feature on the loss curve).
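
As a side note on where the stray factor of 2 might come from (my own guess, not part of the original answer): the hand-rolled mse_derivative returns predicted - truth, while the gradient of nn.MSELoss with respect to the predictions is 2 * (predicted - truth) / N, which for N = 4 samples is exactly half as large. A quick numerical check with made-up predictions:

import numpy as np
import torch

# Hypothetical predictions and the XOR targets, both shaped (4, 1).
p = np.array([[0.3], [0.6], [0.4], [0.7]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Hand-rolled convention: cost_delta = predicted - truth.
hand_rolled_grad = p - y

# PyTorch convention: d/dp of mean((p - y)**2) = 2 * (p - y) / N.
p_t = torch.tensor(p, requires_grad=True)
torch.nn.MSELoss()(p_t, torch.tensor(y)).backward()

print(hand_rolled_grad / p_t.grad.numpy())  # every entry is 2.0

With the hand-rolled gradient twice as large, the hand-rolled model effectively trains at twice the nominal learning rate, which is consistent with having to double the PyTorch learning rate to match it.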

Putting it all together

To reproduce the hand_rolled_losses data from the plot at the start of my post, all you need to do is take your hand-rolled code and replace the mse function with:

def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

the lines that initialize the weights with:

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

and the line that tracks the losses with:

losses.append(cost_error)

and you should be good to go.

To reproduce the torch_losses data from the plot, we also need to turn the bias weights off in the PyTorch model. To do that, just change the lines defining the PyTorch model like so:

model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

You also need to change the line that defines the learning_rate:

learning_rate = 0.3 * 2

Complete code listings

The hand-rolled code

Here's my complete listing of the hand-rolled neural network code, to help with reproducing my results:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats
import torch
from torch import nn

np.random.seed(0)
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]

def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Define the shape of the output vector. 
output_dim = len(Y.T)

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

num_epochs = 5000
learning_rate = 0.3
losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    cost_delta = mse_derivative(layer2, Y)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

    # Log the loss value as we proceed through the epochs.
    losses.append(mse(layer2, Y))

# Visualize the losses
plt.plot(losses)
plt.show()

The PyTorch code

import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

torch.manual_seed(0)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

num_epochs = 5000
learning_rate = 0.3 * 2

# XOR gate inputs and outputs.
X = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = tensor([[0],[1],[1],[0]]).float().to(device)

# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
num_data, output_dim = Y.shape

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

def mse(x, y):
    return np.mean((x - y)**2)

torch_losses = [] # Keeps track of the loses.
torch_losses_manual = [] # for comparison

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Foward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: Optimizer take a step and update the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    torch_losses.append(loss.data.item())
    torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))

plt.plot(torch_losses, lw=5, c='C1', label='torch_losses')
plt.plot(torch_losses_manual, lw=2, c='C2', label='torch_losses_manual')
plt.legend()

Notes

Bias weights

You can find some very instructive examples of what bias weights are and how to implement them in this tutorial. They list a bunch of pure-Python implementations of neural networks very similar to your hand-rolled one, so it's likely you could adapt some of their code to make your own bias weight implementation.
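
For what it's worth, here is a minimal sketch (mine, not taken from that tutorial) of one way to bolt bias terms onto the hand-rolled training loop: keep separate bias vectors b1 and b2, add them before each sigmoid, and update them with the per-column sums of the corresponding deltas:

# Illustrative sketch only: bias terms added to the hand-rolled loop.
b1 = np.zeros((1, hidden_dim))
b2 = np.zeros((1, output_dim))

for epoch_n in range(num_epochs):
    layer0 = X
    layer1 = sigmoid(np.dot(layer0, W1) + b1)
    layer2 = sigmoid(np.dot(layer1, W2) + b2)

    layer2_delta = mse_derivative(layer2, Y) * sigmoid_derivative(layer2)
    layer1_delta = np.dot(layer2_delta, W2.T) * sigmoid_derivative(layer1)

    W2 -= learning_rate * np.dot(layer1.T, layer2_delta)
    W1 -= learning_rate * np.dot(layer0.T, layer1_delta)
    # The bias gradient is just the layer delta summed over the batch dimension.
    b2 -= learning_rate * layer2_delta.sum(axis=0, keepdims=True)
    b1 -= learning_rate * layer1_delta.sum(axis=0, keepdims=True)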

A function for producing an initial guess of the weights

Here's a function I adapted from that same tutorial that produces reasonable initial values for the weights. I think the algorithm PyTorch uses internally is somewhat different, but this produces similar results:

import scipy as sp
import scipy.stats

def tnorm_weights(nodes_in, nodes_out, bias_node=0):
    # see https://www.python-course.eu/neural_network_mnist.php
    wshape = (nodes_out, nodes_in + bias_node)
    bound = 1 / np.sqrt(nodes_in)
    X = sp.stats.truncnorm(-bound, bound)
    return X.rvs(np.prod(wshape)).reshape(wshape) 
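
As with torch_weights, this returns weights in (nodes_out, nodes_in) layout (a note of mine, not from the original answer), so to drop them into the hand-rolled code you would transpose them:

W1 = tnorm_weights(input_dim, hidden_dim).T   # shape (2, 5)
W2 = tnorm_weights(hidden_dim, output_dim).T  # shape (5, 1)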

Thanks so much for such a detailed answer! A quick question though: you've replicated the "hand-rolled" NN with PyTorch. What about the other way round, i.e. replicating the PyTorch results with the "hand-rolled" numpy NN? - alvas
If you like, feel free to reply with another answer and also pick up the bounty for the bi-directional replication =) - alvas
Wait, I managed to do the reverse by halving the LR of the "hand-rolled" version and using the proper definitions of MSE and its derivative: 'mse = return np.square(predicted - truth).mean()' and 'mse_derivative = return 2 * (predicted - truth)' - alvas
