CS224W Colab1: Learning Node Embeddings

In this colab, we classify the nodes of the Karate Club Network using learned node embeddings.

This post mainly follows CS224W's Colab 1.

Graph Basics

As in the lectures, we use NetworkX here. First, load the Karate Club Network:

import networkx as nx
G = nx.karate_club_graph() # G is an undirected graph
nx.draw(G, with_labels=True)

Average node degree of the Karate Club Network

def average_degree(num_edges, num_nodes):
    # TODO: Implement this function that takes number of edges
    # and number of nodes, and returns the average node degree of
    # the graph. Round the result to the nearest integer (for example,
    # 3.3 will be rounded to 3 and 3.7 will be rounded to 4).
    avg_degree = 0
    ############# Your code here ############
    # Each edge contributes to the degree of both of its endpoints.
    avg_degree = round((num_edges * 2) / num_nodes)
    #########################################
    return avg_degree

num_edges = G.number_of_edges()
num_nodes = G.number_of_nodes()
avg_degree = average_degree(num_edges, num_nodes)
print("Average degree of karate club network is {}".format(avg_degree))

Output: Average degree of karate club network is 5
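
As a quick sanity check (my own addition, not part of the colab), the handshake lemma says the degrees sum to \(2|E|\), so averaging NetworkX's degree view directly should give the same answer:

# Sanity check: average the degree view directly (handshake lemma).
degrees = [d for _, d in G.degree()]
assert round(sum(degrees) / len(degrees)) == avg_degree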

Average clustering coefficient of the Karate Club Network

def average_clustering_coefficient(G):
    # TODO: Implement this function that takes a nx.Graph
    # and returns the average clustering coefficient. Round
    # the result to 2 decimal places (for example, 3.333 will
    # be rounded to 3.33 and 3.7571 will be rounded to 3.76).
    avg_cluster_coef = 0
    ############# Your code here ############
    ## Note:
    ## 1: Please use the appropriate NetworkX clustering function
    avg_cluster_coef = round(nx.average_clustering(G), 2)
    #########################################
    return avg_cluster_coef

avg_cluster_coef = average_clustering_coefficient(G)
print("Average clustering coefficient of karate club network is {}".format(avg_cluster_coef))

Output: Average clustering coefficient of karate club network is 0.57
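
For intuition, the clustering coefficient of a single node \(v\) is \(C_v = 2T_v / (k_v(k_v - 1))\), where \(k_v\) is its degree and \(T_v\) the number of triangles through it. A minimal sketch (not part of the colab; node 0 is an arbitrary choice) that checks this against NetworkX:

# Sketch: clustering coefficient of node 0 computed from its definition.
v = 0
nbrs = list(G.neighbors(v))
k = len(nbrs)
# Each edge between two neighbors of v closes one triangle through v.
triangles = sum(1 for i in range(k) for j in range(i + 1, k)
                if G.has_edge(nbrs[i], nbrs[j]))
c_v = 2 * triangles / (k * (k - 1))
assert abs(c_v - nx.clustering(G, v)) < 1e-9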

PageRank of node 0 after one iteration

The PageRank update rule is \(r_{j}=\sum_{i \rightarrow j} \beta \frac{r_{i}}{d_{i}}+(1-\beta) \frac{1}{N}\), where \(d_i\) is the degree of node \(i\) and \(N\) is the number of nodes.

def one_iter_pagerank(G, beta, r0, node_id):
    # TODO: Implement this function that takes a nx.Graph, beta, r0 and node id.
    # The return value r1 is the PageRank value of the input node after one
    # iteration. Please round r1 to 2 decimal places.
    r1 = 0
    ############# Your code here ############
    ## Note:
    ## 1: You should not use nx.pagerank
    # All nodes start with the same value r0, so the sum over neighbors
    # factors as r0 * sum(1 / d_i).
    temp = 0
    for n in G.neighbors(node_id):
        temp += 1 / G.degree(n)
    r1 = round(beta * r0 * temp + (1 - beta) / G.number_of_nodes(), 2)
    #########################################
    return r1

beta = 0.8
r0 = 1 / G.number_of_nodes()
node = 0
r1 = one_iter_pagerank(G, beta, r0, node)
print("The PageRank value for node 0 after one iteration is {}".format(r1))

Output: The PageRank value for node 0 after one iteration is 0.13
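
If we keep applying the same update to every node, the values converge to the stationary PageRank vector. As a sketch (my own check, not required by the colab; the karate club graph is connected, so the plain update suffices), the fixed point matches nx.pagerank:

# Sketch: run the full PageRank iteration to convergence and compare
# with NetworkX's implementation.
N = G.number_of_nodes()
r = {v: 1 / N for v in G.nodes()}
for _ in range(100):
    r = {j: beta * sum(r[i] / G.degree(i) for i in G.neighbors(j))
            + (1 - beta) / N
         for j in G.nodes()}
r_nx = nx.pagerank(G, alpha=beta, tol=1e-10)
assert max(abs(r[v] - r_nx[v]) for v in G.nodes()) < 1e-8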

Closeness centrality of node 5

def closeness_centrality(G, node=5):
    # TODO: Implement the function that calculates closeness centrality
    # for a node in the karate club network. G is the input karate club
    # network and node is the node id in the graph. Please round the
    # closeness centrality result to 2 decimal places.
    closeness = 0
    ############# Your code here ############
    ## Note:
    ## 1: You can use networkx closeness centrality function.
    ## 2: Notice that networkx closeness centrality returns the normalized
    ## closeness directly, which is different from the raw (unnormalized)
    ## one that we learned in the lecture.
    # Undo the normalization by (n - 1), where n is the size of the
    # connected component containing the node.
    closeness = round(nx.closeness_centrality(G, node) / (len(nx.node_connected_component(G, node)) - 1), 2)
    #########################################
    return closeness

node = 5
closeness = closeness_centrality(G, node=node)
print("Node {} has closeness centrality {}".format(node, closeness))

Output: Node 5 has closeness centrality 0.01
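
Equivalently, the raw closeness is \(c(v) = 1 / \sum_{u \neq v} d(v, u)\); a short check from shortest-path lengths (my own sketch):

# Sketch: raw (unnormalized) closeness of node 5 from its definition.
dist = nx.shortest_path_length(G, source=node)  # node == 5 here
raw = 1 / sum(dist.values())  # the distance from node to itself is 0
assert round(raw, 2) == closeness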

Representing the Graph with Tensors

Represent the edge list of the Karate Club Network as a torch.LongTensor

import torch

def graph_to_edge_list(G):
    # TODO: Implement the function that returns the edge list of
    # an nx.Graph. The returned edge_list should be a list of tuples
    # where each tuple represents an edge connecting two nodes.
    edge_list = []
    ############# Your code here ############
    edge_list = list(G.edges())
    #########################################
    return edge_list

def edge_list_to_tensor(edge_list):
    # TODO: Implement the function that transforms the edge_list to
    # a tensor. The input edge_list is a list of tuples and the resulting
    # tensor should have the shape [2 x len(edge_list)].
    edge_index = torch.tensor([])
    ############# Your code here ############
    edge_index = torch.tensor(edge_list, dtype=torch.long).t()
    #########################################
    return edge_index

pos_edge_list = graph_to_edge_list(G)
pos_edge_index = edge_list_to_tensor(pos_edge_list)
print("The pos_edge_index tensor has shape {}".format(pos_edge_index.shape))
print("The pos_edge_index tensor has sum value {}".format(torch.sum(pos_edge_index)))

Output: 
The pos_edge_index tensor has shape torch.Size([2, 78])
The pos_edge_index tensor has sum value 2535
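
A couple of basic checks on the resulting tensor (my own sketch, assuming the two functions above): the dtype should be torch.long, and every column should be an edge of G, with each undirected edge appearing exactly once since G.edges() reports each pair only once.

# Sketch: sanity checks on pos_edge_index.
assert pos_edge_index.dtype == torch.long
assert pos_edge_index.shape == (2, G.number_of_edges())
assert all(G.has_edge(u.item(), v.item()) for u, v in pos_edge_index.t())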

Implement a negative-sampling function, and discuss whether each of the five given edges can be a negative edge (i.e., an edge not present in the original graph)

import random
import numpy as np

def sample_negative_edges(G, num_neg_samples):
    # TODO: Implement the function that returns a list of negative edges.
    # The number of sampled negative edges is num_neg_samples. You do not
    # need to consider the corner case when the number of possible negative edges
    # is less than num_neg_samples. It should be ok as long as your implementation
    # works on the karate club network. In this implementation, self loops should
    # not be considered as either positive or negative edges. Also, notice that
    # the karate club network is an undirected graph; if (0, 1) is a positive
    # edge, do you think (1, 0) can be a negative one?
    neg_edge_list = []
    ############# Your code here ############
    # nx.non_edges yields every node pair that is not an edge (no self loops,
    # each unordered pair once); sample num_neg_samples of them without replacement.
    neg_edge_list = random.sample(list(nx.non_edges(G)), num_neg_samples)
    #########################################
    return neg_edge_list

# Sample 78 negative edges
neg_edge_list = sample_negative_edges(G, len(pos_edge_list))
# Transform the negative edge list to tensor
neg_edge_index = edge_list_to_tensor(neg_edge_list)
print("The neg_edge_index tensor has shape {}".format(neg_edge_index.shape))

# Which of the following edges can be negative ones?
edge_1 = (7, 1)
edge_2 = (1, 33)
edge_3 = (33, 22)
edge_4 = (0, 4)
edge_5 = (4, 2)
############# Your code here ############
## Note:
## 1: For each of the 5 edges, print whether it can be a negative edge
for name, edge in [('edge_1', edge_1), ('edge_2', edge_2), ('edge_3', edge_3),
                   ('edge_4', edge_4), ('edge_5', edge_5)]:
    print(name + (" can't" if G.has_edge(*edge) else ' can') + ' be a negative edge')
#########################################
#########################################

Output:
The neg_edge_index tensor has shape torch.Size([2, 78])
edge_1 can't be a negative edge
edge_2 can be a negative edge
edge_3 can't be a negative edge
edge_4 can't be a negative edge
edge_5 can be a negative edge
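
Because the graph is undirected, G.has_edge(u, v) and G.has_edge(v, u) always agree, so the reverse of a positive edge such as (0, 1) can never be a negative edge. A one-line check on the sample (my own sketch):

# Sketch: no sampled negative edge may exist in G (has_edge is symmetric
# for an undirected nx.Graph, so both directions are covered).
assert all(not G.has_edge(u, v) for u, v in neg_edge_list)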

Node Embedding Learning

First, import the required packages and create the initial embeddings

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

torch.manual_seed(1)

def create_node_emb(num_node=34, embedding_dim=16):
    # TODO: Implement this function that will create the node embedding matrix.
    # A torch.nn.Embedding layer will be returned. You do not need to change
    # the values of num_node and embedding_dim. The weight matrix of the returned
    # layer should be initialized under a uniform distribution.
    emb = None
    ############# Your code here ############
    emb = nn.Embedding(num_node, embedding_dim)
    emb.weight.data = torch.rand(emb.weight.data.shape)
    #########################################
    return emb

emb = create_node_emb()
ids = torch.LongTensor([0, 3])
# Print the embedding layer
print("Embedding: {}".format(emb))
# An example that gets the embeddings for nodes 0 and 3
print(emb(ids))

Output:
Embedding: Embedding(34, 16)
tensor([[0.2114, 0.7335, 0.1433, 0.9647, 0.2933, 0.7951, 0.5170, 0.2801, 0.8339,
        0.1185, 0.2355, 0.5599, 0.8966, 0.2858, 0.1955, 0.1808],
        [0.7486, 0.6546, 0.3843, 0.9820, 0.6012, 0.3710, 0.4929, 0.9915, 0.8358,
        0.4629, 0.9902, 0.7196, 0.2338, 0.0450, 0.7906, 0.9689]],
    grad_fn=<EmbeddingBackward>)
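
For reference, nn.Embedding draws its initial weights from a standard normal distribution, which is why the code above overwrites emb.weight.data with torch.rand to obtain the required uniform initialization. An equivalent in-place variant (a sketch with the same effect):

# Sketch: uniform re-initialization of an embedding layer, done in place.
emb_alt = nn.Embedding(34, 16)
emb_alt.weight.data.uniform_(0, 1)  # U(0, 1), same distribution as torch.rand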

Use PCA to reduce the 16-dimensional embeddings to two dimensions and visualize them

def visualize_emb(emb):
    X = emb.weight.data.numpy()
    pca = PCA(n_components=2)
    components = pca.fit_transform(X)
    plt.figure(figsize=(6, 6))
    club1_x = []
    club1_y = []
    club2_x = []
    club2_y = []
    # Color each node by its ground-truth club label.
    for node in G.nodes(data=True):
        if node[1]['club'] == 'Mr. Hi':
            club1_x.append(components[node[0]][0])
            club1_y.append(components[node[0]][1])
        else:
            club2_x.append(components[node[0]][0])
            club2_y.append(components[node[0]][1])
    plt.scatter(club1_x, club1_y, color="red", label="Mr. Hi")
    plt.scatter(club2_x, club2_y, color="blue", label="Officer")
    plt.legend()
    plt.show()

# Visualize the initial random embedding
visualize_emb(emb)

As we can see, the randomly initialized embeddings do not separate the two classes of nodes at all.

Next, we learn the embeddings with SGD. The objective is edge classification: the sigmoid of the dot product between the embeddings of two adjacent nodes should be close to 1 (a positive edge), while that of a sampled non-adjacent pair should be close to 0 (a negative edge), as made precise below.
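
Concretely, writing \(\mathbf{z}_u\) for the embedding of node \(u\), \(\sigma\) for the sigmoid, and \(y_{uv} \in \{0, 1\}\) for the edge label (1 for positive edges, 0 for sampled negative ones), the training below minimizes the binary cross-entropy

\[\mathcal{L} = -\frac{1}{|E_{\text{train}}|} \sum_{(u, v) \in E_{\text{train}}} \left[ y_{uv} \log \sigma(\mathbf{z}_u^{\top} \mathbf{z}_v) + (1 - y_{uv}) \log \left(1 - \sigma(\mathbf{z}_u^{\top} \mathbf{z}_v)\right) \right]\]

which is exactly nn.BCELoss (with its default mean reduction) applied to the sigmoid of the dot products.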

from torch.optim import SGD

def accuracy(pred, label):
    # TODO: Implement the accuracy function. This function takes the
    # pred tensor (the resulting tensor after sigmoid) and the label
    # tensor (torch.LongTensor). Predicted values greater than 0.5 will
    # be classified as label 1. Else they will be classified as label 0.
    # The returned accuracy should be rounded to 4 decimal places.
    # For example, accuracy 0.82956 will be rounded to 0.8296.
    accu = 0.0
    ############# Your code here ############
    accu = round(((torch.round(pred) == label).sum() / len(pred)).item(), 4)
    #########################################
    return accu

def train(emb, loss_fn, sigmoid, train_label, train_edge):
    # TODO: Train the embedding layer here. You can also change epochs and
    # learning rate. In general, you need to implement:
    # (1) Get the embeddings of the nodes in train_edge
    # (2) Dot product the embeddings between each node pair
    # (3) Feed the dot product result into sigmoid
    # (4) Feed the sigmoid output into the loss_fn
    # (5) Print both loss and accuracy of each epoch
    # (as a sanity check, the loss should decrease during training)
    epochs = 500
    learning_rate = 0.1
    optimizer = SGD(emb.parameters(), lr=learning_rate, momentum=0.9)
    for i in range(epochs):
        ############# Your code here ############
        optimizer.zero_grad()
        # emb(train_edge) has shape [2, num_edges, emb_dim]; the dot product
        # of the two endpoint embeddings gives one score per edge.
        node_emb = emb(train_edge)
        pred = sigmoid(torch.sum(node_emb[0] * node_emb[1], dim=1))
        loss = loss_fn(pred, train_label)
        loss.backward()   # Derive gradients.
        optimizer.step()  # Update parameters based on gradients.
        print("Epoch {} Loss: {}, Accuracy: {}".format(i, loss, accuracy(pred, train_label)))
        #########################################

loss_fn = nn.BCELoss()
sigmoid = nn.Sigmoid()
# Generate the positive and negative labels
pos_label = torch.ones(pos_edge_index.shape[1])
neg_label = torch.zeros(neg_edge_index.shape[1])
# Concat positive and negative labels into one tensor
train_label = torch.cat([pos_label, neg_label], dim=0)
# Concat positive and negative edges into one tensor
# Since the network is very small, we do not split the edges into val/test sets
train_edge = torch.cat([pos_edge_index, neg_edge_index], dim=1)
train(emb, loss_fn, sigmoid, train_label, train_edge)
visualize_emb(emb)
visualize_emb(emb)

As we can see, the learned embeddings separate the two classes of nodes fairly well.

References

CS224W - Colab1

cs224w(图机器学习)2021冬季课程学习笔记5 Colab 1:Node Embeddings
