G1



Lecture Slides

Lecture G1 slides

In-class Exercises

Set up a graph

We will work with a very simple example here just to get things going and to demonstrate some of the graph concepts discussed in the lecture.

Preliminaries

First, import graphframes. You could write from graphframes import *, but wildcard imports pollute the namespace and should be avoided.

import graphframes as GF

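Now import SQLContext and create a SQL context, which we will use to build our data structures. This assumes that a SparkContext named sc already exists, as it does in the pyspark shell.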

from pyspark.sql import SQLContext
sql_context = SQLContext(sc)

The example is a tiny social network consisting of 7 people. Each person is given an id, a name, and an age. The vertices of the graph are the people. The vertex labels are the attributes (id, name, age). The edges indicate how the people are connected. Each edge consists of a src (the starting vertex), a dst (the ending vertex), and a relationship (how the two people are connected).

Create a dataframe of graph vertices

# Vertex DataFrame
v = sql_context.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])

Create a dataframe of graph edges

# Edge DataFrame
e = sql_context.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

Create a graph from the vertices and the edges

Creating the graph with graphframes is easy once you have the dataframes set up.

# Build the graph from the vertex and edge DataFrames
g = GF.GraphFrame(v, e)

We can also save the graph by writing its vertices and edges to Parquet files:

g.vertices.write.parquet("vertices")
g.edges.write.parquet("edges")

If you want to load the saved graph, simply do the following:

# Load the vertices and edges.
# v = sql_context.read.parquet("hdfs://myLocation/vertices")
# e = sql_context.read.parquet("hdfs://myLocation/edges")
# Create a graph
# g = GF.GraphFrame(v, e)

Note that the above code assumes that you have a SQL context.

Exercise

Draw by hand the directed multigraph that you just created.

Exploring the Graph

Visualize the vertices

g.vertices.show()

Visualize the edges

g.edges.show()

See the in-degree of each vertex (the number of edges coming in)

vertexInDegrees = g.inDegrees
vertexInDegrees.show()

Now check the out-degree of each vertex (the number of edges going out)

vertexOutDegrees = g.outDegrees
vertexOutDegrees.show()
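
To check the degree tables against your hand-drawn graph, the same counts can be reproduced in plain Python. This is only an illustrative sketch, not how GraphFrames computes degrees; the names edges, in_degrees, and out_degrees are introduced here for the example.

```python
from collections import Counter

# Same edges as the edge DataFrame above: (src, dst, relationship).
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]

# In-degree: how many edges end at each vertex.
in_degrees = Counter(dst for _, dst, _ in edges)
# Out-degree: how many edges start at each vertex.
out_degrees = Counter(src for src, _, _ in edges)

print("in-degrees: ", dict(in_degrees))
print("out-degrees:", dict(out_degrees))
```

Vertex g (Gabby) has no edges, so it appears in neither count. GraphFrames likewise omits vertices with zero in-edges from inDegrees and vertices with zero out-edges from outDegrees.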

Group and filter vertices based on specified criteria

# Minimum age in network
g.vertices.groupBy().min("age").show()
# Count the edges whose relationship is "follow"
numFollows = g.edges.filter("relationship = 'follow'").count()
print("There are {0} 'follow' relationships in the network.".format(numFollows))

We can also create subgraphs

To create a subgraph from our graph, just filter the edges and vertices according to some criterion. Then you'll have some new vertices and edges. You can create a new graph from those vertices and edges. This new graph is a subgraph.
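
The filtering steps described above can be sketched in plain Python on the same data. The criteria here (people younger than 35, "friend" edges only) are hypothetical and deliberately differ from the exercise below; the names kept_vertices, kept_ids, and kept_edges are introduced only for this illustration.

```python
# Same data as the vertex and edge DataFrames above.
vertices = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]

# Step 1: filter the vertices (hypothetical criterion: age < 35).
kept_vertices = [v for v in vertices if v[2] < 35]
kept_ids = {v[0] for v in kept_vertices}

# Step 2: filter the edges (hypothetical criterion: "friend" only),
# and drop any edge whose endpoint was removed in step 1.
kept_edges = [
    (src, dst, rel) for src, dst, rel in edges
    if rel == "friend" and src in kept_ids and dst in kept_ids
]

print(sorted(kept_ids))
print(kept_edges)
```

With GraphFrames, the analogous steps would be to filter g.vertices and g.edges and pass the two resulting DataFrames to GF.GraphFrame. Dropping edges whose endpoints were filtered out keeps the result a well-formed graph.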

Exercise

Create a subgraph of the graph that we just created in which the vertices are people older than 32 and the edges are "follow" relationships.

Introduction to Some Graph Algorithms

Several algorithms were mentioned in class. GraphFrames provides support for many algorithms. In this section, we walk through a handful of the algorithms.

Algorithm 1: Breadth-First Search

First, we'll search the graph for users younger than 32, starting from people named Esther.

# Search from "Esther" for users of age < 32.
paths = g.bfs("name = 'Esther'", "age < 32")
paths.show()

We can also search based on the edge attributes and the maximum path length.

# Specify edge filters or max path lengths.
paths = g.bfs("name = 'Esther'", "age < 32",
  edgeFilter="relationship != 'friend'", maxPathLength=3)
paths.show()
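
To see what the two searches above are doing, here is a minimal plain-Python breadth-first search over the same edges. It is a sketch for intuition only and not GraphFrames' implementation: the function bfs_path and its parameters are hypothetical, and it returns a single shortest path rather than a DataFrame of paths.

```python
from collections import deque

ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32, "f": 36, "g": 60}
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]

def bfs_path(start, is_target, edge_ok=lambda rel: True, max_path_length=10):
    """Return one shortest directed path from start to a target vertex, or None."""
    # Build an adjacency list from the edges that pass the edge filter.
    adjacency = {}
    for src, dst, rel in edges:
        if edge_ok(rel):
            adjacency.setdefault(src, []).append(dst)

    queue = deque([[start]])   # each queue entry is a path (list of vertex ids)
    visited = {start}
    while queue:
        path = queue.popleft()
        if is_target(path[-1]):
            return path
        if len(path) - 1 >= max_path_length:
            continue           # do not extend paths beyond the edge limit
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Search from Esther ("e") for someone with age < 32.
path_any = bfs_path("e", lambda v: ages[v] < 32)
# Same search, but ignoring "friend" edges and limiting the path to 3 edges.
path_follow = bfs_path("e", lambda v: ages[v] < 32,
                       edge_ok=lambda rel: rel != "friend", max_path_length=3)
print(path_any)
print(path_follow)
```

Without an edge filter the search stops at David, who is one hop from Esther. Once "friend" edges are excluded, the only route to someone under 32 goes through Fanny to Charlie.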

Algorithm 2: Label Propagation

result = g.labelPropagation(maxIter=3)
result.select("id", "label").show()
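
For intuition about what label propagation does, the following plain-Python sketch runs a simple synchronous variant on the same graph. It rests on stated assumptions rather than on GraphFrames' actual implementation: edges are treated as undirected, every vertex starts with its own id as its label, and ties are broken by choosing the smallest label. The function label_propagation is hypothetical, and its labels need not match the output of g.labelPropagation.

```python
from collections import Counter

vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]

def label_propagation(vertex_ids, edges, max_iter=3):
    """Synchronous label propagation; returns a dict of vertex id -> label."""
    # Assumption: messages flow in both directions along every edge.
    neighbors = {v: [] for v in vertex_ids}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    labels = {v: v for v in vertex_ids}  # every vertex starts in its own community
    for _ in range(max_iter):
        new_labels = {}
        for v in vertex_ids:
            counts = Counter(labels[n] for n in neighbors[v])
            if not counts:
                new_labels[v] = labels[v]  # isolated vertex keeps its label
                continue
            best = max(counts.values())
            # Assumption: break ties by taking the smallest label.
            new_labels[v] = min(l for l, c in counts.items() if c == best)
        labels = new_labels
    return labels

labels = label_propagation(vertex_ids, edges, max_iter=3)
for vertex, label in sorted(labels.items()):
    print(vertex, label)
```

Each vertex repeatedly adopts the label that is most common among its neighbors, so densely connected groups tend to converge on a shared label, while an isolated vertex such as g keeps its own.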

Algorithm 3: PageRank

Try on your own. See the API docs for details.