
Understanding the Multilayer Perceptron (MLP) Implementation in Depth - Based on the homemade-machine-learning Project

2025-07-05 06:03:11  Author: 傅爽业Veleda

Introduction

The multilayer perceptron (MLP) is one of the most fundamental feedforward neural networks and is widely used for classification and regression problems. Based on an open-source implementation, this article walks through the core principles and key technical details of an MLP.

MLP Architecture Basics

An MLP consists of an input layer, one or more hidden layers, and an output layer, with full connections between adjacent layers. Each neuron receives the outputs of all neurons in the previous layer, computes a weighted sum, and passes it through an activation function to produce its own output.
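To make that computation concrete, here is a minimal numpy sketch (not code from the project) of a single fully connected layer with a sigmoid activation; the (out_count, in_count + 1) weight shape and the prepended bias column mirror the conventions of the code analyzed below:

import numpy as np

# Hypothetical mini-example: 3 samples with 4 features each.
x = np.random.rand(3, 4)
x_with_bias = np.hstack((np.ones((3, 1)), x))        # prepend the bias unit
theta = np.random.uniform(-0.12, 0.12, (5, 4 + 1))   # 5 output units, 4 inputs + bias
layer_output = 1 / (1 + np.exp(-(x_with_bias @ theta.T)))  # sigmoid activation, shape (3, 5)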

Core Components

  1. Network initialization: randomly initialize the weight matrices
  2. Forward propagation: compute the network output
  3. Backpropagation: compute the gradients
  4. Parameter update: optimize the weights with gradient descent

Code Walkthrough

1. Network Initialization

def __init__(self, data, labels, layers, epsilon, normalize_data=False):
    # Preprocess the data
    data_processed = prepare_for_training(data, normalize_data=normalize_data)[0]
    
    self.data = data_processed
    self.labels = labels
    self.layers = layers  # e.g. [784, 30, 10]: 784 input units, 30 hidden units, 10 output units
    self.epsilon = epsilon  # range for the random weight initialization
    
    # Randomly initialize the weight matrices
    self.thetas = MultilayerPerceptron.thetas_init(layers, epsilon)

The weights are initialized with uniformly distributed random numbers in the range [-ε, ε]. Starting from small, symmetric random values helps avoid the vanishing or exploding gradients that overly large initial weights can cause.
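The thetas_init helper itself is not listed in this article; based on the description above, it plausibly looks like the following sketch, where the uniform sampling in [-ε, ε] and the (out_count, in_count + 1) matrix shapes are inferred from the surrounding code:

@staticmethod
def thetas_init(layers, epsilon):
    """Randomly initialize one weight matrix per pair of adjacent layers."""
    thetas = {}
    for layer_index in range(len(layers) - 1):
        in_count = layers[layer_index]
        out_count = layers[layer_index + 1]
        # The extra column accounts for the bias unit of the previous layer.
        thetas[layer_index] = np.random.rand(out_count, in_count + 1) * 2 * epsilon - epsilon
    return thetas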

2. Forward Propagation

@staticmethod
def feedforward_propagation(data, thetas, layers):
    num_layers = len(layers)
    num_examples = data.shape[0]
    
    in_layer_activation = data  # activations of the input layer
    
    for layer_index in range(num_layers - 1):
        theta = thetas[layer_index]
        out_layer_activation = sigmoid(in_layer_activation @ theta.T)
        out_layer_activation = np.hstack((np.ones((num_examples, 1)), out_layer_activation))
        in_layer_activation = out_layer_activation
    
    return in_layer_activation[:, 1:]  # drop the bias unit from the output layer

The forward propagation procedure:

  1. The input data is used directly as the activations of the input layer
  2. For each subsequent layer, compute the weighted sum and apply the sigmoid activation function
  3. Prepend a bias unit to each layer's output (the one on the final output is stripped before returning)
  4. The activations of the output layer are returned as the predictions
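The sigmoid function used here comes from a helper module of the project and is not listed in this article; a minimal equivalent would be:

import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + e^(-z))."""
    return 1 / (1 + np.exp(-z))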

3. Backpropagation

Backpropagation is the heart of MLP training; it computes the gradient of the loss function with respect to every weight:

@staticmethod
def back_propagation(data, labels, thetas, layers, regularization_param):
    num_layers = len(layers)
    (num_examples, num_features) = data.shape
    num_label_types = layers[-1]
    
    # Initialize the accumulated gradients
    deltas = {}
    for layer_index in range(num_layers - 1):
        in_count = layers[layer_index]
        out_count = layers[layer_index + 1]
        deltas[layer_index] = np.zeros((out_count, in_count + 1))
    
    # Compute the gradient contribution of each example
    for example_index in range(num_examples):
        # Store the inputs and activations of every layer
        layers_inputs = {}
        layers_activations = {}
        
        # Forward pass, recording the intermediate values
        layer_activation = data[example_index, :].reshape((num_features, 1))
        layers_activations[0] = layer_activation
        
        for layer_index in range(num_layers - 1):
            layer_theta = thetas[layer_index]
            layer_input = layer_theta @ layer_activation
            layer_activation = np.vstack((np.array([[1]]), sigmoid(layer_input)))
            layers_inputs[layer_index + 1] = layer_input
            layers_activations[layer_index + 1] = layer_activation
        
        # Per-example error terms for each layer
        delta = {}
        
        # Output layer error (prediction minus one-hot label)
        output_layer_activation = layer_activation[1:, :]
        bitwise_label = np.zeros((num_label_types, 1))
        bitwise_label[labels[example_index][0]] = 1
        delta[num_layers - 1] = output_layer_activation - bitwise_label
        
        # Propagate the error backwards through the hidden layers
        for layer_index in range(num_layers - 2, 0, -1):
            layer_theta = thetas[layer_index]
            next_delta = delta[layer_index + 1]
            layer_input = layers_inputs[layer_index]
            layer_input = np.vstack((np.array([[1]]), layer_input))
            delta[layer_index] = (layer_theta.T @ next_delta) * sigmoid_gradient(layer_input)
            delta[layer_index] = delta[layer_index][1:, :]
        
        # Accumulate the gradients
        for layer_index in range(num_layers - 1):
            layer_delta = delta[layer_index + 1] @ layers_activations[layer_index].T
            deltas[layer_index] = deltas[layer_index] + layer_delta
    
    # Average the gradients and add the L2 regularization term (bias weights are not regularized)
    for layer_index in range(num_layers - 1):
        current_theta = thetas[layer_index]
        current_theta = np.hstack((np.zeros((current_theta.shape[0], 1)), current_theta[:, 1:]))
        regularization = (regularization_param / num_examples) * current_theta
        deltas[layer_index] = (1 / num_examples) * deltas[layer_index] + regularization
    
    return deltas

Key points of backpropagation:

  1. For each example, first run a forward pass and record the inputs and activations of every layer
  2. Compute the output layer error (prediction minus true label)
  3. Propagate the error backwards, layer by layer
  4. Use the chain rule to compute the gradient of each layer's weights
  5. Add L2 regularization to prevent overfitting
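Similarly, sigmoid_gradient is imported from a project helper and not listed here; since it is applied to the recorded pre-activation values, it is presumably the derivative of the logistic function, along these lines:

def sigmoid_gradient(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)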

4. Parameter Updates

@staticmethod
def gradient_descent(data, labels, unrolled_theta, layers, regularization_param, max_iteration, alpha):
    optimized_theta = unrolled_theta
    cost_history = []
    
    for _ in range(max_iteration):
        # Record the current cost so convergence can be monitored.
        cost = MultilayerPerceptron.cost_function(
            data, labels, 
            MultilayerPerceptron.thetas_roll(optimized_theta, layers), 
            layers, regularization_param
        )
        cost_history.append(cost)
        
        # One full-batch gradient step on the unrolled parameter vector.
        theta_gradient = MultilayerPerceptron.gradient_step(
            data, labels, optimized_theta, layers, regularization_param
        )
        optimized_theta = optimized_theta - alpha * theta_gradient
    
    return optimized_theta, cost_history

The parameters are updated with batch gradient descent: every iteration uses the entire training set to compute the gradient.
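gradient_step is called above but not listed in this article; a plausible sketch, assuming it simply rolls the flat parameter vector back into matrices, runs back_propagation, and unrolls the resulting gradients:

@staticmethod
def gradient_step(data, labels, unrolled_thetas, layers, regularization_param):
    # Restore the per-layer weight matrices from the flat vector.
    thetas = MultilayerPerceptron.thetas_roll(unrolled_thetas, layers)
    # Compute the per-layer gradients via backpropagation.
    thetas_rolled_gradients = MultilayerPerceptron.back_propagation(
        data, labels, thetas, layers, regularization_param
    )
    # Flatten the gradients so gradient descent can treat them as one vector.
    return MultilayerPerceptron.thetas_unroll(thetas_rolled_gradients)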

Key Implementation Details

1. Unrolling and Rolling the Weight Matrices

Because the weight matrices of different layers have different shapes, they are unrolled into a single long vector (and rolled back again) so that gradient descent can treat all parameters uniformly:

@staticmethod
def thetas_unroll(thetas):
    """Unroll the per-layer weight matrices into one long vector"""
    unrolled_thetas = np.array([])
    for theta_layer_index in range(len(thetas)):
        unrolled_thetas = np.hstack((unrolled_thetas, thetas[theta_layer_index].flatten()))
    return unrolled_thetas

@staticmethod
def thetas_roll(unrolled_thetas, layers):
    """Restore the per-layer weight matrices from the long vector"""
    thetas = {}
    unrolled_shift = 0
    for layer_index in range(len(layers) - 1):
        in_count = layers[layer_index]
        out_count = layers[layer_index + 1]
        thetas_volume = (in_count + 1) * out_count
        layer_thetas_unrolled = unrolled_thetas[unrolled_shift:unrolled_shift + thetas_volume]
        thetas[layer_index] = layer_thetas_unrolled.reshape((out_count, in_count + 1))
        unrolled_shift += thetas_volume
    return thetas
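A quick, illustrative way to confirm that the two helpers are inverses of each other (the [784, 30, 10] network and ε = 0.12 are just example values):

layers = [784, 30, 10]
thetas = MultilayerPerceptron.thetas_init(layers, 0.12)

flat = MultilayerPerceptron.thetas_unroll(thetas)
restored = MultilayerPerceptron.thetas_roll(flat, layers)

print(flat.shape)  # (23860,) = 30 * (784 + 1) + 10 * (30 + 1)
for layer_index in thetas:
    assert np.array_equal(thetas[layer_index], restored[layer_index])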

2. Cost Function

The loss is the cross-entropy loss plus an L2 regularization term:

@staticmethod
def cost_function(data, labels, thetas, layers, regularization_param):
    num_examples = data.shape[0]
    num_labels = layers[-1]
    
    # Forward propagation to obtain the predictions
    predictions = MultilayerPerceptron.feedforward_propagation(data, thetas, layers)
    
    # Convert the labels to one-hot encoding
    bitwise_labels = np.zeros((num_examples, num_labels))
    for example_index in range(num_examples):
        bitwise_labels[example_index][labels[example_index][0]] = 1
    
    # Regularization term
    theta_square_sum = 0
    for layer_index in range(len(layers) - 1):
        theta = thetas[layer_index]
        theta_square_sum += np.sum(theta[:, 1:] ** 2)  # bias weights are not regularized
    
    regularization = (regularization_param / (2 * num_examples)) * theta_square_sum
    
    # Cross-entropy loss
    bit_set_cost = np.sum(np.log(predictions[bitwise_labels == 1]))
    bit_not_set_cost = np.sum(np.log(1 - predictions[bitwise_labels == 0]))
    cost = (-1 / num_examples) * (bit_set_cost + bit_not_set_cost) + regularization
    
    return cost
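Because the backpropagated gradients must stay consistent with this cost, numerical gradient checking is a useful sanity test. Below is an illustrative sketch (not part of the project); it assumes data is already in the preprocessed form the static methods expect and is only practical for very small networks, since it evaluates the cost twice per parameter:

def check_gradients(data, labels, layers, regularization_param=0, eps=1e-4):
    """Compare backpropagation gradients against finite-difference estimates."""
    thetas = MultilayerPerceptron.thetas_init(layers, 0.12)
    unrolled = MultilayerPerceptron.thetas_unroll(thetas)

    # Gradients from backpropagation, flattened into one vector.
    analytic = MultilayerPerceptron.thetas_unroll(
        MultilayerPerceptron.back_propagation(data, labels, thetas, layers, regularization_param)
    )

    # Central finite differences on the cost function.
    numeric = np.zeros_like(unrolled)
    for i in range(len(unrolled)):
        shifted = unrolled.copy()
        shifted[i] += eps
        cost_plus = MultilayerPerceptron.cost_function(
            data, labels, MultilayerPerceptron.thetas_roll(shifted, layers), layers, regularization_param)
        shifted[i] -= 2 * eps
        cost_minus = MultilayerPerceptron.cost_function(
            data, labels, MultilayerPerceptron.thetas_roll(shifted, layers), layers, regularization_param)
        numeric[i] = (cost_plus - cost_minus) / (2 * eps)

    # The two estimates should agree to several decimal places.
    return np.max(np.abs(analytic - numeric))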

Practical Usage

Training the Model

# Initialize the MLP
mlp = MultilayerPerceptron(
    data=train_data, 
    labels=train_labels,
    layers=[784, 30, 10],  # 784 input units (28x28 images), 30 hidden units, 10 output units (digits 0-9)
    epsilon=0.12,
    normalize_data=True
)

# Train the model
thetas, cost_history = mlp.train(
    regularization_param=1,
    max_iterations=1000,
    alpha=0.1
)
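The returned cost_history makes it easy to check convergence, for example with matplotlib (assuming it is installed):

import matplotlib.pyplot as plt

plt.plot(range(len(cost_history)), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Gradient Descent Progress')
plt.show()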

Making Predictions

predictions = mlp.predict(test_data)
accuracy = np.mean(predictions == test_labels) * 100
print(f"Test Accuracy: {accuracy:.2f}%")
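The predict method is not listed in this article. Judging from feedforward_propagation, it presumably runs a forward pass and returns the index of the largest output unit; the sketch below also assumes that train() stored the optimized weights in self.thetas and that test data is preprocessed the same way as the training data:

def predict(self, data):
    """Return the most probable class index for each example (a sketch, not the project's listing)."""
    # normalize_data=True is hardcoded here; the real method presumably reuses the constructor's flag.
    data_processed = prepare_for_training(data, normalize_data=True)[0]
    num_examples = data_processed.shape[0]
    # Forward pass through the trained network.
    predictions = MultilayerPerceptron.feedforward_propagation(
        data_processed, self.thetas, self.layers
    )
    # Index of the largest output unit, as a column vector to match the label shape.
    return np.argmax(predictions, axis=1).reshape((num_examples, 1))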

Suggestions for Performance Optimization

  1. Learning-rate schedule: implement learning-rate decay, e.g. gradually reduce the learning rate as the number of iterations grows
  2. Momentum: add a momentum term to speed up convergence
  3. Mini-batches: implement mini-batch gradient descent to reduce memory consumption (see the sketch after this list)
  4. Activation functions: try modern activations such as ReLU instead of sigmoid
  5. Early stopping: stop training when performance on a validation set no longer improves
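As an illustration of the mini-batch suggestion, the descent loop above could be adapted roughly as follows (batch_size and the per-epoch shuffling are illustrative choices, not project code):

def mini_batch_gradient_descent(data, labels, unrolled_theta, layers,
                                regularization_param, max_epochs, alpha, batch_size=64):
    optimized_theta = unrolled_theta
    num_examples = data.shape[0]
    for _ in range(max_epochs):
        # Shuffle once per epoch so batches differ between epochs.
        permutation = np.random.permutation(num_examples)
        for start in range(0, num_examples, batch_size):
            batch_idx = permutation[start:start + batch_size]
            # Gradient computed on the current mini-batch only.
            theta_gradient = MultilayerPerceptron.gradient_step(
                data[batch_idx], labels[batch_idx], optimized_theta, layers, regularization_param
            )
            optimized_theta = optimized_theta - alpha * theta_gradient
    return optimized_theta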

Summary

This article dissected the core implementation of an MLP, covering forward propagation, backpropagation, and parameter updates. Working through this implementation gives a solid understanding of how neural networks operate. Although it is a basic implementation, it contains the most essential ideas behind neural networks and lays the groundwork for studying more complex architectures.