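# Assignment 1

In this assignment we implement the classic DeepWalk algorithm. It consists of the following parts:
* The random-walk part of DeepWalk is left for you to complete;
* The model part relies on the Word2Vec implementation provided by gensim;
* The Cora node-classification dataset is loaded through CogDL;
* The vectors learned by DeepWalk are evaluated with a linear classifier.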

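This assignment requires [CogDL](https://github.com/THUDM/cogdl): `pip install cogdl`

To use the GPU version, first install the GPU build of [PyTorch](https://pytorch.org/get-started/locally/), then install cogdl.

This assignment was prepared by the Zhipu GNN Center and the course team, with technical support from the CogDL team.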

## 0. Install CogDL

In [None]:
!pip install cogdl

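## 1. Load the dataset

Load the Cora paper-citation network from cogdl. In Cora, each node is a paper, each edge is a citation between two papers, and each node label is the category the paper belongs to (7 classes).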

In [None]:
from cogdl.datasets import build_dataset_from_name

dataset = build_dataset_from_name("cora")
graph = dataset[0]
print(graph)

train_mask = graph.train_mask
val_mask = graph.val_mask
test_mask = graph.test_mask
labels = graph.y.numpy()

graph = graph.to_networkx()

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Downloading https://cloud.tsinghua.edu.cn/d/6808093f7f8042bfa1f0/files/?p=%2Fcora.zip&dl=1
unpacking cora.zip
Processing...
Done!
Graph(x=[2708, 1433], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_index=[2, 10556])
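
The `edge_index=[2, 10556]` entry in the printed graph is a 2 x E array of edges. The following is a minimal sketch of how such an array maps to the neighbor lists that a random walk needs. It assumes row 0 holds source nodes and row 1 holds target nodes; the helper `edge_index_to_adjacency` and the toy triangle graph are hypothetical illustrations, not part of CogDL.

```python
from collections import defaultdict


def edge_index_to_adjacency(edge_index):
    """Convert a 2 x E edge list (row 0 = sources, row 1 = targets) to {node: [neighbors]}."""
    sources, targets = edge_index
    adjacency = defaultdict(list)
    for src, dst in zip(sources, targets):
        adjacency[src].append(dst)
    return dict(adjacency)


# Toy undirected triangle 0-1-2, with each edge stored once per direction.
toy_edge_index = [
    [0, 1, 1, 2, 2, 0],  # sources
    [1, 0, 2, 1, 0, 2],  # targets
]
toy_adjacency = edge_index_to_adjacency(toy_edge_index)
print(toy_adjacency)
```

In the notebook itself this conversion is not needed, because `graph.to_networkx()` already returns a graph object that exposes each node's neighbors.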


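## 2. Implement the DeepWalk algorithm

In this assignment, first read the DeepWalk framework provided below, then complete the part of the code that generates paths by random walk.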

In [None]:
import random
import numpy as np
from gensim.models import Word2Vec
from tqdm import tqdm

class DeepWalk:
    r"""The DeepWalk model from the `"DeepWalk: Online Learning of Social Representations"
    <https://arxiv.org/abs/1403.6652>`_ paper
    Args:
        dimension (int) : The dimension of node representation.
        walk_length (int) : The walk length.
        walk_num (int) : The number of walks to sample for each node.
        window_size (int) : The actual context size which is considered in language model.
        worker (int) : The number of workers for word2vec.
        iteration (int) : The number of training iterations in word2vec.
    """
    def __init__(self, dimension, walk_length, walk_num, window_size, worker=1, iteration=10):
        super(DeepWalk, self).__init__()
        self.dimension = dimension
        self.walk_length = walk_length
        self.walk_num = walk_num
        self.window_size = window_size
        self.worker = worker
        self.iteration = iteration

    def train(self, graph):
        nx_nodes = graph.nodes()
        num_nodes = len(nx_nodes)
        
        '''
        Implement the random-walk procedure to obtain `walks` as a list of lists, e.g. [[1,2,3], [2,3,4]].
        Start walk_num walks from every node in the graph; each walk follows a path of length walk_length.
        '''
        ###################
        #  YOUR CODE HERE #
        ###################
        
        # walks = ...
        
        walks = [[str(node) for node in walk] for walk in walks]  # convert walk elements to str, as Word2Vec requires
        print("training word2vec...")
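        # Note: `size` and `iter` are the gensim < 4.0 argument names;
        # gensim >= 4.0 renamed them to `vector_size` and `epochs`.
        model = Word2Vec(
            walks,
            size=self.dimension,
            window=self.window_size,
            min_count=0,
            sg=1,  # 1 = skip-gram
            workers=self.worker,
            iter=self.iteration,
        )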
        id2node = dict([(vid, node) for vid, node in enumerate(graph.nodes())])
        embeddings = np.asarray([model.wv[str(id2node[i])] for i in range(len(id2node))])

        features_matrix = np.zeros((num_nodes, embeddings.shape[1]))
        features_matrix[nx_nodes] = embeddings[np.arange(num_nodes)]
        return features_matrix
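
The random-walk step described in the docstring above can be sketched as follows. This is one possible shape under stated assumptions, not the reference solution: the function name `generate_random_walks`, the plain-dict adjacency input, the `seed` parameter, and the reading of `walk_length` as the number of nodes in a walk are all choices made for illustration.

```python
import random


def generate_random_walks(adjacency, walk_length, walk_num, seed=None):
    """Start `walk_num` uniform random walks from every node of `adjacency`.

    `adjacency` maps each node to a list of its neighbors. Each walk holds at
    most `walk_length` nodes and stops early at a node with no neighbors.
    """
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walk_num):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order on each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: keep the shorter walk
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy graph: a triangle 0-1-2 plus an isolated node 3.
toy_adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
toy_walks = generate_random_walks(toy_adjacency, walk_length=5, walk_num=2, seed=0)
print(len(toy_walks))
```

With the networkx graph used in this notebook, a comparable adjacency mapping could be built as `{node: list(graph.neighbors(node)) for node in graph.nodes()}`, or the loop could call `graph.neighbors` directly.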


## 3. Train the DeepWalk model


In [None]:
model = DeepWalk(dimension=128, walk_length=20, walk_num=10, window_size=5)
emb = model.train(graph)
print(emb.shape)

node number: 2708
generating random walks...
100%|██████████| 10/10 [00:01<00:00,  7.60it/s]
training word2vec...
(2708, 128)
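
The `window_size=5` argument controls how far apart two nodes in a walk may be and still count as each other's context when Word2Vec is trained in skip-gram mode (`sg=1`). As a rough illustration only, the (center, context) pairs for one walk can be enumerated as below. The helper `skipgram_pairs` is hypothetical, and gensim's actual training loop differs in details such as sampling, so treat this as a sketch of the idea rather than of the library's internals.

```python
def skipgram_pairs(walk, window_size):
    """List (center, context) pairs for nodes at most `window_size` steps apart in `walk`."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


# One short walk, with node ids already converted to str as in DeepWalk.train.
toy_walk = ["3", "7", "1", "4"]
toy_pairs = skipgram_pairs(toy_walk, window_size=1)
print(toy_pairs)
```

A larger window pairs each node with more distant nodes on the same walk, which tends to pull the embeddings of nodes from the same neighborhood closer together.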


## 4. Train a downstream linear classifier and obtain predictions

In [None]:
from sklearn.linear_model import LogisticRegression

train_X = emb[train_mask]
test_X = emb[test_mask]
train_y = labels[train_mask]
test_y = labels[test_mask]

clf = LogisticRegression(solver="liblinear")
clf.fit(train_X, train_y)
pred = clf.predict(test_X)

acc = (pred == test_y).sum() / len(pred)
print(f"Prediction accuracy: {acc*100:.1f}%")

Prediction accuracy: 68.4%
