
DeepNLP-models-Pytorch Project: The Skip-gram Negative Sampling Model Explained and Implemented

2025-07-10 04:38:42 · Author: 彭桢灵Jeremy

1. Model Background and Principles

The Skip-gram model is one of the key methods in the Word2Vec framework for learning distributed word representations (word vectors). Unlike the continuous bag-of-words (CBOW) model, Skip-gram predicts the context words from the center word, which makes it particularly well suited to small datasets and rare words.

Negative sampling is an optimization technique for Skip-gram: rather than computing a softmax over the entire vocabulary, it samples a small number of negative examples to approximate it, which greatly speeds up training. Based on the DeepNLP-models-Pytorch project, this article walks through the implementation of the Skip-gram model with negative sampling step by step.
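
Concretely, for each (center word, context word) training pair the model maximizes the standard Word2Vec negative sampling objective, which is exactly what the forward pass in Section 5 computes:

\log\sigma(u_o^{\top} v_c) + \sum_{k=1}^{K} \log\sigma(-u_{w_k}^{\top} v_c), \quad w_k \sim P_n(w)

where v_c is the center-word embedding, u_o is the context-word (output) embedding of the observed context word, and the K negative words w_k are drawn from a noise distribution P_n(w) described in Section 4.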

2. Environment Setup and Data Loading

First, set up the Python environment and import the required libraries:

import torch
import torch.nn as nn
from torch.autograd import Variable  # kept for compatibility with the original notebook; not needed in PyTorch >= 0.4
import torch.optim as optim
import torch.nn.functional as F
import nltk
import random
import numpy as np
from collections import Counter
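
If the Gutenberg corpus has not been downloaded yet, NLTK will raise a LookupError when loading it. A one-time download (not shown in the original code) fixes this:

nltk.download('gutenberg')  # raw Gutenberg texts used below
nltk.download('punkt')      # tokenizer models used by the corpus reader for sentence splitting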

We use the Gutenberg corpus from the NLTK library and load part of the text of Moby Dick as example data:

corpus = list(nltk.corpus.gutenberg.sents('melville-moby_dick.txt'))[:500]
corpus = [[word.lower() for word in sent] for sent in corpus]
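
The preprocessing code below relies on a flatten helper that the excerpt does not define. A minimal sketch, assuming it simply flattens a list of tokenized sentences into a single list of tokens:

# Flatten a list of lists (e.g. tokenized sentences) into one flat list of tokens.
flatten = lambda l: [item for sublist in l for item in sublist]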

3. Data Preprocessing

3.1 Filtering Low-Frequency Words

To reduce noise and improve the quality of the embeddings, we filter out words that occur too rarely:

word_count = Counter(flatten(corpus))
MIN_COUNT = 3
exclude = [w for w, c in word_count.items() if c < MIN_COUNT]

3.2 Building the Vocabulary

vocab = list(set(flatten(corpus)) - set(exclude))
word2index = {vo:i for i, vo in enumerate(vocab)}
index2word = {v:k for k, v in word2index.items()}
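
The later sections also use a few small utilities (prepare_word, prepare_sequence, getBatch) and a USE_CUDA flag that are defined elsewhere in the project but not shown in this excerpt. A minimal CPU-only sketch that is consistent with how they are used below:

USE_CUDA = torch.cuda.is_available()

def prepare_word(word, word2index):
    # Map a single word to a 1-element LongTensor holding its vocabulary index.
    return torch.tensor([word2index[word]], dtype=torch.long)

def prepare_sequence(seq, word2index):
    # Map a list of words to a LongTensor of vocabulary indices.
    return torch.tensor([word2index[w] for w in seq], dtype=torch.long)

def getBatch(batch_size, train_data):
    # Shuffle the training pairs and yield them in mini-batches.
    random.shuffle(train_data)
    for i in range(0, len(train_data), batch_size):
        yield train_data[i:i + batch_size]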

3.3 Building the Training Data

We use a sliding window to build (center word, context word) pairs:

WINDOW_SIZE = 5
windows = flatten([list(nltk.ngrams(['<DUMMY>'] * WINDOW_SIZE + c + ['<DUMMY>'] * WINDOW_SIZE, 
                  WINDOW_SIZE * 2 + 1)) for c in corpus])

train_data = []
for window in windows:
    for i in range(WINDOW_SIZE * 2 + 1):
        if window[i] in exclude or window[WINDOW_SIZE] in exclude:
            continue
        if i == WINDOW_SIZE or window[i] == '<DUMMY>':
            continue
        train_data.append((window[WINDOW_SIZE], window[i]))
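
The training loop in Section 6 concatenates the batch elements with torch.cat, so the (center, context) string pairs need to be converted to index tensors first. This conversion is not shown in the excerpt; a sketch using the helpers above:

# Convert each (center, context) word pair into a pair of 1 x 1 index tensors
# so that a mini-batch of them can later be concatenated with torch.cat.
X_p, y_p = [], []
for center, context in train_data:
    X_p.append(prepare_word(center, word2index).view(1, -1))
    y_p.append(prepare_word(context, word2index).view(1, -1))
train_data = list(zip(X_p, y_p))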

4. Implementing Negative Sampling

4.1 Building the Unigram Distribution Table

Negative samples are drawn from a smoothed unigram distribution:

P(w) = U(w)^{3/4} / Z

where U(w) is the unigram frequency of word w and Z is a normalization constant. Raising the frequency to the 3/4 power damps the dominance of very frequent words while still favoring them over rare ones.

The implementation builds a lookup table in which each word appears a number of times roughly proportional to U(w)^{3/4}; here Z also controls the granularity of the table:

word_count = Counter(flatten(corpus))
num_total_words = sum([c for w, c in word_count.items() if w not in exclude])

unigram_table = []
Z = 0.001
for vo in vocab:
    unigram_table.extend([vo] * int(((word_count[vo]/num_total_words)**0.75)/Z))
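
Drawing a negative sample is then just a uniform draw from this table: frequent words occupy more slots, so the draw approximates P(w). A quick sanity check (illustrative, not part of the original code):

# Frequent words should dominate the samples, but less overwhelmingly than
# they would under the raw, unsmoothed unigram distribution.
print(len(unigram_table))
print(Counter(random.choices(unigram_table, k=1000)).most_common(5))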

4.2 The Negative Sampling Function

def negative_sampling(targets, unigram_table, k):
    batch_size = targets.size(0)
    neg_samples = []
    for i in range(batch_size):
        nsample = []
        # index of the positive target word for this example
        target_index = targets[i].data.cpu().tolist()[0] if USE_CUDA else targets[i].data.tolist()[0]
        # keep drawing from the unigram table until we have k negatives distinct from the target
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            if word2index[neg] == target_index:
                continue
            nsample.append(neg)
        neg_samples.append(prepare_sequence(nsample, word2index).view(1, -1))  # 1 x k index tensor
    return torch.cat(neg_samples)  # batch_size x k
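
As a quick shape check (assuming the helper definitions and the tensor conversion above), sampling negatives for a small batch of targets should produce a batch_size x k index tensor:

# Hypothetical smoke test on a few converted training pairs.
_, sample_targets = zip(*train_data[:4])
sample_targets = torch.cat(sample_targets)                           # 4 x 1
print(negative_sampling(sample_targets, unigram_table, 10).size())   # torch.Size([4, 10])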

5. Skip-gram Model Implementation

The model contains two embedding layers: one for center words and one for context words:

class SkipgramNegSampling(nn.Module):
    def __init__(self, vocab_size, projection_dim):
        super(SkipgramNegSampling, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, projection_dim) # center-word embeddings
        self.embedding_u = nn.Embedding(vocab_size, projection_dim) # context-word embeddings
        self.logsigmoid = nn.LogSigmoid()
        
        # Xavier-style initialization for the center-word embeddings
        initrange = (2.0 / (vocab_size + projection_dim))**0.5
        self.embedding_v.weight.data.uniform_(-initrange, initrange)
        self.embedding_u.weight.data.uniform_(-0.0, 0.0) # context-word embeddings start at zero
        
    def forward(self, center_words, target_words, negative_words):
        center_embeds = self.embedding_v(center_words) # B x 1 x D
        target_embeds = self.embedding_u(target_words) # B x 1 x D
        neg_embeds = -self.embedding_u(negative_words) # B x K x D
        
        positive_score = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2) # B x 1
        negative_score = torch.sum(neg_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2), 1).view(negative_words.size(0), -1) # B x K -> B x 1
        
        loss = self.logsigmoid(positive_score) + self.logsigmoid(negative_score)
        return -torch.mean(loss)
    
    def prediction(self, inputs):
        return self.embedding_v(inputs)

6. Model Training

Set the training hyperparameters and start training:

EMBEDDING_SIZE = 30 
BATCH_SIZE = 256
EPOCH = 100
NEG = 10 # number of negative samples per positive pair

model = SkipgramNegSampling(len(word2index), EMBEDDING_SIZE)
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(EPOCH):
    for batch in getBatch(BATCH_SIZE, train_data):
        inputs, targets = zip(*batch)
        inputs = torch.cat(inputs)
        targets = torch.cat(targets)
        negs = negative_sampling(targets, unigram_table, NEG)
        
        model.zero_grad()
        loss = model(inputs, targets, negs)
        loss.backward()
        optimizer.step()
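
The loop above does not report progress. For longer runs it helps to track the loss; the same loop with per-epoch logging (an illustrative variant, not part of the original excerpt):

losses = []
for epoch in range(EPOCH):
    epoch_losses = []
    for batch in getBatch(BATCH_SIZE, train_data):
        inputs, targets = zip(*batch)
        inputs, targets = torch.cat(inputs), torch.cat(targets)
        negs = negative_sampling(targets, unigram_table, NEG)

        model.zero_grad()
        loss = model(inputs, targets, negs)
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())  # record the batch loss
    losses.append(np.mean(epoch_losses))
    if epoch % 10 == 0:
        print(f'epoch {epoch}, mean loss {losses[-1]:.3f}')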

7. Testing and Using the Model

After training is complete, we can query the words most similar to a given word:

def word_similarity(target, vocab):
    target_V = model.prediction(prepare_word(target, word2index))
    similarities = []
    for word in vocab:
        if word == target: continue
        vector = model.prediction(prepare_word(word, word2index))
        cosine_sim = F.cosine_similarity(target_V, vector).data.tolist()[0]
        similarities.append([word, cosine_sim])
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]

# test with a random word
test_word = random.choice(list(vocab))
similar_words = word_similarity(test_word, vocab)

8. Summary

This article walked through a PyTorch implementation of the Skip-gram model with negative sampling, covering:

  1. Data preprocessing and vocabulary construction
  2. The principle and implementation of negative sampling
  3. The design of the Skip-gram model architecture
  4. Model training and evaluation

With its efficient training procedure and the quality of the word vectors it produces, Skip-gram with negative sampling has become one of the foundational techniques in natural language processing. Tuning parameters such as the window size and the number of negative samples can further improve its performance on different tasks.