The Graph Advantage

Comparing ML using tabular vs graph data models with the Cora dataset

Author

Andrew Wang

This demo compares classification using a tabular data model against a graph data model. Using Neo4j and the Graph Data Science (GDS) library, we show that modelling the data as a graph lets us achieve better classification accuracy!

For the full notebook and other demos, see the repo.

Let’s get started by importing Python libraries:

!pip install git+https://github.com/hyperopt/hyperopt-sklearn
!pip install graphdatascience
import numpy as np
import pandas as pd

# For performing downstream classification
from sklearn.model_selection import train_test_split
from hpsklearn import HyperoptEstimator, random_forest_classifier

# For connecting to Neo4j instance
from graphdatascience import GraphDataScience

1. Problem statement

Classify academic papers into subject categories using the Cora dataset:

  • A research paper citation network
  • 2708 Paper nodes
  • 5429 CITES relationships
  • Each Paper node has a 1433-dimensional binary feature vector
    • Each dimension represents a keyword
    • Value 1 if the paper contains the keyword, 0 otherwise
  • Each Paper node belongs to a subject (the class we want to predict)

nodes = pd.read_csv("data/nodes.csv", converters={"features": eval})
nodes.set_index("id")[["subject", "features"]]
id       subject                 features
31336    Neural_Networks         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1061127  Rule_Learning           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
1106406  Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
13195    Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
37879    Probabilistic_Methods   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
...      ...                     ...
1128975  Genetic_Algorithms      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1128977  Genetic_Algorithms      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1128978  Genetic_Algorithms      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
117328   Case_Based              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
24043    Neural_Networks         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

2708 rows × 2 columns

Papers also cite other papers; each citation connects a source paper to a target paper:

edges = pd.read_csv("data/edges.csv")
edges[["source", "target"]]
        source  target
0         1033      35
1       103482      35
2       103515      35
3      1050679      35
4      1103960      35
...        ...     ...
5424     19621  853116
5425    853155  853116
5426   1140289  853118
5427    853118  853155
5428   1155073  954315

5429 rows × 2 columns

Look at the class distribution:

nodes["subject"].value_counts().plot.bar(title="Number of papers per class");

2. Tabular classification

In a traditional ML system, we classify each sample using its own features, independently of the other samples. A tabular data format cannot easily model each paper's citations.

Therefore we will only use the text features to classify our papers.

Note: We could construct aggregate numerical features, such as how many times each paper cites papers in a particular category. However, this loses information about the spatial structure of the graph, and it is hard to perform a random train-test split whilst preventing data leakage, as the sketch below illustrates.
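As a purely hypothetical sketch (not used in this demo), such aggregate features could be built from the nodes and edges DataFrames loaded above:

# Hypothetical sketch: for each citing paper, count how many of its
# citations land in each subject category.
subject_by_id = nodes.set_index("id")["subject"]
cited = edges.assign(cited_subject=edges["target"].map(subject_by_id))
agg_features = pd.crosstab(cited["source"], cited["cited_subject"])
# These counts derive from neighbouring papers' labels, so a naive random
# train-test split would leak test-set labels into the training features.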

Get features and classes from dataset:

# Expand each list-valued feature vector into columns, then convert to a matrix
X = nodes["features"].apply(pd.Series).to_numpy()
# Encode each subject string as an integer class code
y = nodes["subject"].astype("category").cat.codes.to_numpy()
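As a quick sanity check, the shapes should match the dataset description above (2708 papers, 1433 keyword features):

print(X.shape, y.shape)  # expect (2708, 1433) and (2708,)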

Perform a train-test split using scikit-learn, then use hyperopt-sklearn to train a Random Forest classifier and automatically tune its hyperparameters:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, seed=1)
classifier.fit(X_train, y_train)
100%|████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.37s/trial, best loss: 0.7272727272727273]
100%|█████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  3.25s/trial, best loss: 0.371007371007371]
100%|████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  4.53s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  2.76s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.30s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  5.01s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 7/7 [00:11<00:00, 11.70s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00, 10.32s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 9/9 [00:11<00:00, 11.63s/trial, best loss: 0.2628992628992629]
100%|██████████████████████████████████████████████████████████████| 10/10 [00:06<00:00,  6.63s/trial, best loss: 0.2604422604422605]

Calculate the classifier accuracy on the test set:

f"Accuracy: {format(classifier.score(X_test, y_test), '.0%')}"
'Accuracy: 71%'

3. Graph data classification

This is a classic graph data modelling problem. We can now use the citation data, since citations are naturally modelled as a graph.

We can leverage this data model to visualise the data and extract useful information from it.

How do we use the graph for classification? We create spatial features ("embeddings") for each data point based on its position in the graph. These features can then be used for classification.

What are embeddings?

Node embedding algorithms compute low-dimensional vector representations of nodes in a graph. These vectors, also called embeddings, can be used for machine learning.

The FastRP algorithm operates on graphs, where we care about preserving the similarity between nodes and their neighbours. Two nodes with similar neighbourhoods should be assigned similar embedding vectors; conversely, two nodes that are not similar should not be assigned similar embedding vectors.

We create the embeddings using the FastRP algorithm.

Note: Both FastRP and the downstream classification are statistical techniques, analogous to simple NLP methods. We don’t touch on GNNs here (analogous to transformers for text, see here for interesting commentary).
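To build intuition, below is a minimal NumPy sketch of the core idea behind FastRP: start from a sparse random projection and repeatedly average over neighbourhoods, summing the iterations with the given weights. This is an illustration under simplifying assumptions, not Neo4j's implementation (which, for instance, also applies degree normalisation controlled by normalizationStrength):

import numpy as np

def fastrp_sketch(A, dim=128, weights=(0.0, 0.0, 1.0, 1.0), seed=0):
    """Toy FastRP on a dense adjacency matrix A (illustration only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Very sparse random projection: entries -1, 0, +1
    M = rng.choice([-1.0, 0.0, 1.0], size=(n, dim), p=[1/6, 2/3, 1/6])
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    emb = np.zeros((n, dim))
    for w in weights:
        M = (A @ M) / deg   # one more hop: average neighbours' current vectors
        emb += w * M        # weighted sum over iterations
    return emb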

3.1 Load dataset into Neo4j

  1. In Neo4j Desktop, start a local DBMS and install the Graph Data Science Library and APOC plugins.
  2. Move the data files (cora/data/*.csv in this repo) to the database’s import folder. You can find this under Open/Open folder/Import in Neo4j Desktop.
  3. Load the CSVs into Neo4j using the Cypher commands below.
Connect to the Neo4j instance:
from getpass import getpass
auth = ("neo4j", getpass("Password:"))
bolt = "bolt://localhost:7687/neo4j"
gds = GraphDataScience(bolt, auth=auth)
Password: ········
Load the data from the CSVs into Neo4j:
gds.run_cypher("""
LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
WITH toInteger(row.id) AS paperId, row.subject AS subject, row.features AS features
MERGE (p:Paper {paper_Id: paperId})
SET p.subject = subject, p.features = apoc.convert.fromJsonList(features)
RETURN count(p)
""")

gds.run_cypher("""
LOAD CSV WITH HEADERS FROM "file:///edges.csv" AS row
MATCH (source:Paper {paper_Id: toInteger(row.source)})
MATCH (target:Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target)
""")

Visualise the graph using Cypher queries in Neo4j Browser, such as the following (after colouring nodes by subject):

MATCH (p:Paper) -[:CITES]-> (q:Paper) RETURN p,q LIMIT 200

3.2. Preprocess and project

Use the gds GraphDataScience client to connect to Neo4j and process the graph. First, run a simple Cypher query to create a categorical label from the subject:

gds.run_cypher("""
MATCH (p:Paper)
SET p.subjectClass = {
    Neural_Networks: 0, Rule_Learning: 1, Reinforcement_Learning: 2, Probabilistic_Methods: 3, Theory: 4, Genetic_Algorithms: 5, Case_Based: 6
}[p.subject];
""")

Then project graph to in-memory and return Python object:

G, _ = gds.graph.project(
   "cora-graph",
   {"Paper": {"properties": ["subjectClass"]} },
   {"CITES": {"orientation": "UNDIRECTED", "aggregation": "SINGLE"}}
);

Check graph memory usage:

G.memory_usage()

'901 KiB'

3.3. Create node embedding vectors

We use an embedding dimension of 128. We can call Neo4j’s FastRP algorithm directly from the Python client.

result = gds.fastRP.mutate(
   G,
   featureProperties=None,
   embeddingDimension=128,
   iterationWeights=[0, 0, 1.0, 1.0],
   normalizationStrength=0.05,
   mutateProperty="fastRP_Extended_Embedding"
)
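The returned result summarises the run. For example, the number of node properties written should equal the number of papers (field name per the GDS mutate result; an assumption here, check your client version):

result["nodePropertiesWritten"]  # expect 2708: one embedding per Paper node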

After calling FastRP, stream the results back into Python. Each row of X is a 128-dimensional vector representing one paper.

X = gds.graph.streamNodeProperty(G, 'fastRP_Extended_Embedding')["propertyValue"].apply(pd.Series)
y = gds.graph.streamNodeProperty(G, 'subjectClass')["propertyValue"]
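A quick assertion that the streamed embeddings have the expected shape (G.node_count() gives the projected node count):

assert X.shape == (G.node_count(), 128)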

To visualise the embeddings, we use the t-SNE algorithm to compress the 128-dimensional vectors into 2D.

from sklearn.manifold import TSNE
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

Plot the 2D embeddings of the papers, coloured by subject:

import seaborn as sns
df_tsne = pd.DataFrame({"x_1": X_tsne[:, 0], "x_2": X_tsne[:, 1], "y": nodes["subject"]})
sns.scatterplot(df_tsne, x="x_1", y="x_2", hue="y", edgecolor="none").legend(bbox_to_anchor=(1, 1));

We see that, even in this severely compressed 2D view, papers cluster by class.

3.4 Perform classification with embedding vectors

Note: GDS offers complete ML pipelines for data splitting, training classification models, hyperparameter tuning, and inference. These are advantageous as they act directly on embeddings stored in the in-memory graph, reducing the need to move data. Here, however, we export the embeddings and use scikit-learn, for familiarity.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, seed=1)
classifier.fit(X_train, y_train)
100%|████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.20s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 11.57s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  2.17s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 12.17s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 5/5 [00:11<00:00, 11.69s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 6/6 [00:11<00:00, 11.56s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  1.93s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  2.51s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 9/9 [00:03<00:00,  3.29s/trial, best loss: 0.1744471744471745]
100%|██████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  2.86s/trial, best loss: 0.1744471744471745]

Calculate the classifier accuracy on the test set:

f"Accuracy: {format(classifier.score(X_test, y_test), '.0%')}"
'Accuracy: 84%'
