G1
Lecture Slides
In-class Exercises
Set up a graph
We will work with a very simple example here just to get things going and to demonstrate some of the graph concepts discussed in the lecture.
Preliminaries
First, import graphframes. You could do from graphframes import *, but this is bad form and should be avoided.
import graphframes as GF
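Now import SQLContext and create a SQL context to set up our data structures. This assumes that a SparkContext named sc already exists, as it does in the PySpark shell.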
from pyspark.sql import SQLContext
sql_context = SQLContext(sc)
The example is a tiny social network consisting of 7 people. Each person is given an id, a name, and an age. The vertices of the graph are the people. The vertex labels are the attributes (id, name, age). The edges indicate how the people are connected. Each edge consists of a src (the starting vertex), a dst (the ending vertex), and a relationship (how the people are connected).
Create a dataframe of graph vertices
# Vertex DataFrame
v = sql_context.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)
], ["id", "name", "age"])
Create a dataframe of graph edges
# Edge DataFrame
e = sql_context.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])
Create a graph from the vertices and the edges
Creating the graph using graphframes is easy once you have the dataframes set up.
g = GF.GraphFrame(v, e)
We can even save the graph
g.vertices.write.parquet("vertices")
g.edges.write.parquet("edges")
If you want to load the saved graph, simply do the following:
# Load the vertices and edges.
# v = sql_context.read.parquet("hdfs://myLocation/vertices")
# e = sql_context.read.parquet("hdfs://myLocation/edges")
# Create a graph
# g = GF.GraphFrame(v, e)
Note that the above code assumes that you have a SQL context.
Exercise
Draw by hand the directed multigraph that you just created.
Exploring the Graph
Visualize the vertices
g.vertices.show()
Visualize the edges
g.edges.show()
See the in-degree of each vertex (the number of incoming edges)
vertexInDegrees = g.inDegrees
vertexInDegrees.show()
Now check the out-degree of each vertex (the number of outgoing edges)
vertexOutDegrees = g.outDegrees
vertexOutDegrees.show()
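As a cross-check on the tables above, the same degrees can be counted by hand in plain Python. This is only a sketch of what in-degree and out-degree mean, not how GraphFrames computes them; the names edges, in_degree, and out_degree are ours and are not part of the GraphFrames API.

```python
from collections import Counter

# Same edges as the Edge DataFrame above: (src, dst, relationship).
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]

# In-degree: how many edges end at each vertex.
in_degree = Counter(dst for _, dst, _ in edges)
# Out-degree: how many edges start at each vertex.
out_degree = Counter(src for src, _, _ in edges)

# Print one line per vertex that touches at least one edge.
for vertex in sorted(set(in_degree) | set(out_degree)):
    print(vertex, "in:", in_degree[vertex], "out:", out_degree[vertex])
```

Gabby (g) has no edges, so she does not appear in either count here.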
Group and filter vertices based on specified criteria
# Minimum age in network
g.vertices.groupBy().min("age").show()
# Count the edges whose relationship is 'follow'
numFollows = g.edges.filter("relationship = 'follow'").count()
print("There are {0} 'follow' edges in the network.".format(numFollows))
We can also create subgraphs
To create a subgraph from our graph, just filter the edges and vertices according to some criterion. Then you'll have some new vertices and edges. You can create a new graph from those vertices and edges. This new graph is a subgraph.
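The filtering logic can be sketched in plain Python, using a different criterion from the exercise (people younger than 35, connected by friend edges). The helper subgraph and its arguments are hypothetical names for illustration and are not GraphFrames functions. One detail worth noticing is that an edge only survives if both of its endpoints survive the vertex filter.

```python
# Same data as the Vertex and Edge DataFrames above.
vertices = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]


def subgraph(vertices, edges, keep_vertex, keep_edge):
    """Hypothetical helper: filter vertices and edges, dropping dangling edges."""
    kept_vertices = [v for v in vertices if keep_vertex(v)]
    kept_ids = {v[0] for v in kept_vertices}
    kept_edges = [
        e for e in edges
        if keep_edge(e) and e[0] in kept_ids and e[1] in kept_ids
    ]
    return kept_vertices, kept_edges


# People younger than 35, connected by "friend" edges.
sub_v, sub_e = subgraph(
    vertices, edges,
    keep_vertex=lambda v: v[2] < 35,
    keep_edge=lambda e: e[2] == "friend",
)
print(sub_v)
print(sub_e)
```

The friend edge from a to b is dropped even though it passes the edge filter, because Bob (36) is not in the filtered vertex set.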
Exercise
Create a subgraph of the graph that we just created where the vertices are people over 32 and the edges are followers.
Introduction to Some Graph Algorithms
Several algorithms were mentioned in class. GraphFrames provides support for many algorithms. In this section, we walk through a handful of the algorithms.
Algorithm 1: Breadth-First Search
First we'll search the graph for users younger than 32, starting from the person named Esther.
# Search from "Esther" for users of age < 32.
paths = g.bfs("name = 'Esther'", "age < 32")
paths.show()
We can also search based on the edge attributes and the maximum path length.
# Specify edge filters or max path lengths.
paths = g.bfs("name = 'Esther'", "age < 32",
              edgeFilter="relationship != 'friend'", maxPathLength=3)
paths.show()
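To see what the two searches above are looking for, here is a minimal plain-Python sketch of breadth-first search on the same graph. It is not the GraphFrames implementation: the function bfs_path and its parameters edge_ok and max_path_length are hypothetical names chosen to mirror the edge filter and path-length limit, and the sketch returns a single shortest path as a list of vertex ids.

```python
from collections import deque

# Same data as the Vertex and Edge DataFrames above.
vertices = {
    "a": ("Alice", 34), "b": ("Bob", 36), "c": ("Charlie", 30),
    "d": ("David", 29), "e": ("Esther", 32), "f": ("Fanny", 36),
    "g": ("Gabby", 60),
}
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]


def bfs_path(start_ids, is_target, edge_ok=lambda e: True, max_path_length=10):
    """Hypothetical helper: return one shortest path of vertex ids, or None."""
    # Build the adjacency list from the edges that pass the filter.
    adjacency = {}
    for edge in edges:
        if edge_ok(edge):
            adjacency.setdefault(edge[0], []).append(edge[1])

    queue = deque([s] for s in start_ids)
    visited = set(start_ids)
    while queue:
        path = queue.popleft()
        if is_target(path[-1]):
            return path
        if len(path) - 1 >= max_path_length:
            continue  # do not extend paths beyond the length limit
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


start = [vid for vid, (name, _) in vertices.items() if name == "Esther"]
is_young = lambda vid: vertices[vid][1] < 32

# Unrestricted search, then the same search without "friend" edges.
any_edge = bfs_path(start, is_young)
no_friend = bfs_path(start, is_young,
                     edge_ok=lambda e: e[2] != "friend", max_path_length=3)
print(any_edge)
print(no_friend)
```

Without a filter, Esther reaches David (29) in one step along a friend edge. Once friend edges are excluded, the shortest route to someone under 32 goes through Fanny to Charlie (30).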
Algorithm 2: Label Propagation
result = g.labelPropagation(maxIter=3)
result.select("id", "label").show()
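Label propagation is a community-detection algorithm: every vertex starts with its own label and repeatedly adopts the most common label among its neighbors. The plain-Python sketch below illustrates that idea on our graph. It is not the GraphFrames implementation; in particular, treating edges as undirected and breaking ties by taking the smallest label are assumptions made here to keep the sketch deterministic, so its labels may differ from what GraphFrames returns.

```python
from collections import Counter

vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
# Same edges as above, without the relationship column.
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def label_propagation(vertex_ids, edges, max_iter=3):
    """Hypothetical helper: synchronous label propagation sketch."""
    # Assumption: labels travel in both directions along every edge.
    neighbors = {v: [] for v in vertex_ids}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    # Every vertex starts in its own community.
    labels = {v: v for v in vertex_ids}
    for _ in range(max_iter):
        new_labels = {}
        for v in vertex_ids:
            received = Counter(labels[n] for n in neighbors[v])
            if not received:
                new_labels[v] = labels[v]  # isolated vertex keeps its label
                continue
            best = max(received.values())
            # Assumption: ties are broken by taking the smallest label.
            new_labels[v] = min(l for l, c in received.items() if c == best)
        labels = new_labels
    return labels


labels = label_propagation(vertex_ids, edges, max_iter=3)
for v in vertex_ids:
    print(v, labels[v])
```

Gabby (g) has no neighbors, so she stays in a community of her own.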
Algorithm 3: PageRank
Try on your own. See the API docs for details.