!pip install git+https://github.com/hyperopt/hyperopt-sklearn
!pip install graphdatascience
The Graph Advantage
Comparing ML using tabular vs graph data models with the Cora dataset
This demo compares classification using a tabular dataset vs a graph dataset. Using Neo4j and the Graph Data Science library, we show that modelling the data as a graph allows us to achieve better classification accuracy!
For the full notebook and other demos, see the repo.
Let’s get started by importing Python libraries:
import numpy as np
import pandas as pd
# For performing downstream classification
from sklearn.model_selection import train_test_split
from hpsklearn import HyperoptEstimator, random_forest_classifier
# For connecting to Neo4j instance
from graphdatascience import GraphDataScience
1. Problem statement
Classify academic papers into subject categories using the Cora dataset:
- A research paper citation network
- 2708 Paper nodes
- 5429 CITES relationships
- Each Paper node has a binary feature vector of length 1433
  - Each dimension represents a keyword
  - Value 1 if the paper contains the keyword, 0 otherwise
- Each Paper node belongs to a subject
= pd.read_csv("data/nodes.csv", converters={"features": eval})
nodes "id")[["subject", "features"]] nodes.set_index(
| id | subject | features |
|---|---|---|
| 31336 | Neural_Networks | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1061127 | Rule_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... |
| 1106406 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 13195 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 37879 | Probabilistic_Methods | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| ... | ... | ... |
| 1128975 | Genetic_Algorithms | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1128977 | Genetic_Algorithms | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1128978 | Genetic_Algorithms | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 117328 | Case_Based | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 24043 | Neural_Networks | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |

2708 rows × 2 columns
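As a quick sanity check on the feature vectors (a small sketch using the `nodes` DataFrame loaded above), we can confirm each vector has length 1433 and look at how many keywords each paper contains:

# Every feature vector should have length 1433
assert nodes["features"].apply(len).eq(1433).all()
# Number of keywords present in each paper (sum of the binary vector)
nodes["features"].apply(sum).describe()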
Papers also have citations to other papers; citations connect source papers to target papers:
= pd.read_csv("data/edges.csv")
edges "source", "target"]] edges[[
|   | source | target |
|---|---|---|
| 0 | 1033 | 35 |
| 1 | 103482 | 35 |
| 2 | 103515 | 35 |
| 3 | 1050679 | 35 |
| 4 | 1103960 | 35 |
| ... | ... | ... |
| 5424 | 19621 | 853116 |
| 5425 | 853155 | 853116 |
| 5426 | 1140289 | 853118 |
| 5427 | 853118 | 853155 |
| 5428 | 1155073 | 954315 |

5429 rows × 2 columns
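For a quick look at the citation structure (a small sketch on the `edges` DataFrame above), we can count how many times each paper is cited:

# In-degree: number of incoming citations per paper
citation_counts = edges["target"].value_counts()
citation_counts.describe()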
Look at the class distribution:
"subject"].value_counts().plot.bar(title="Number of papers per class"); nodes[
2. Tabular classification
In a traditional ML system, we use the features of each sample independently of the other samples. We cannot easily model each paper's citations in a tabular data format, so we will use only the text features to classify our papers.
Note: We could construct aggregate numerical features, such as how many times each paper cites papers in a particular category. However, this loses information about the structure of the graph, and it is hard to randomly split such features into train and test sets whilst preventing data leakage.
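For illustration only, a hypothetical sketch of such an aggregate feature might look like the following; note that it uses the labels of the cited papers, which is exactly where train-test leakage can creep in:

# Hypothetical aggregate feature: for each source paper, count its
# citations into each subject. Building features from neighbours' labels
# like this leaks test-set labels unless the split is done very carefully.
cited = edges.merge(nodes[["id", "subject"]], left_on="target", right_on="id")
citations_per_subject = cited.groupby(["source", "subject"]).size().unstack(fill_value=0)
citations_per_subject.head()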
Get features and classes from dataset:
= nodes["features"].apply(pd.Series).to_numpy()
X = nodes["subject"].astype("category").cat.codes.to_numpy() y
Perform a train-test split using scikit-learn and use hyperopt-sklearn to train a Random Forest classifier, tuning its hyperparameters automatically:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, seed=1)
classifier.fit(X_train, y_train)
100%|████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.37s/trial, best loss: 0.7272727272727273]
100%|█████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 3.25s/trial, best loss: 0.371007371007371]
100%|████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 4.53s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 2.76s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.30s/trial, best loss: 0.3390663390663391]
100%|████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00, 5.01s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 7/7 [00:11<00:00, 11.70s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00, 10.32s/trial, best loss: 0.2628992628992629]
100%|████████████████████████████████████████████████████████████████| 9/9 [00:11<00:00, 11.63s/trial, best loss: 0.2628992628992629]
100%|██████████████████████████████████████████████████████████████| 10/10 [00:06<00:00, 6.63s/trial, best loss: 0.2604422604422605]
Calculate the classifier accuracy on the test set:
f"Accuracy: {format(classifier.score(X_test, y_test), '.0%')}"
'Accuracy: 71%'
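For context, we can compare this against the majority-class baseline (a quick check using the `nodes` DataFrame from earlier; always predicting the most common class):

# Accuracy of always predicting the most common subject
baseline = nodes["subject"].value_counts(normalize=True).max()
print(f"Majority-class baseline: {baseline:.0%}")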
3. Graph data classification
This is a classic graph data modelling problem. We can now use the citation data, since it is naturally modelled in a graph data format.
We can leverage this data model to visualise and extract useful information about the data.
How do we use the graph for classification? We create spatial features (“embeddings”) for each data point based on their position in the graph. These features can be used for classification.
What are embeddings?
Node embedding algorithms compute low-dimensional vector representations of nodes in a graph. These vectors, also called embeddings, can be used for machine learning.
The FastRP algorithm operates on graphs, in which case we care about preserving the similarity between nodes and their neighbors. This means that two nodes that have similar neighborhoods should be assigned similar embedding vectors. Conversely, two nodes that are not similar should not be assigned similar embedding vectors.
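As a toy illustration of this property (hypothetical vectors, not the Cora embeddings we compute below), embedding similarity is typically measured with cosine similarity:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: u and v play the role of neighbours, w an unrelated node
u = np.array([0.9, 0.1, 0.2])
v = np.array([0.8, 0.2, 0.1])
w = np.array([-0.5, 0.7, -0.4])
print(cosine_similarity(u, v))  # close to 1: similar neighbourhoods
print(cosine_similarity(u, w))  # much lower: dissimilar nodes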
We create the embeddings using the FastRP algorithm.
Note: Both FastRP and the downstream classification are statistical techniques, analogous to simple NLP techniques. We don't touch on GNNs here (analogous to transformers for text; see here for interesting commentary).
3.1 Load dataset into Neo4j
- In Neo4j Desktop, start a local DBMS and install the Graph Data Science Library and APOC plugins.
- Move the data (source) from this repo (`cora/data/*.csv`) to the database's `import` folder. You can find this under `Open/Open folder/Import` in Neo4j Desktop.
- Load the CSVs into Neo4j using the following Cypher commands:
Connect to Neo4j instance
from getpass import getpass

auth = ("neo4j", getpass("Password:"))
bolt = "bolt://localhost:7687/neo4j"
gds = GraphDataScience(bolt, auth=auth)
Password: ········
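To verify the connection works, we can ask the server for its library version (a minimal check; `gds.version()` is part of the `graphdatascience` client):

# Confirm connectivity by printing the GDS library version
print(gds.version())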
Load data from CSV into Neo4j
"""
gds.run_cypher(LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
WITH toInteger(row.id) AS paperId, row.subject AS subject, row.features AS features
MERGE (p:Paper {paper_Id: paperId})
SET p.subject = subject, p.features = apoc.convert.fromJsonList(features)
RETURN count(p)
""")
"""
gds.run_cypher(LOAD CSV WITH HEADERS FROM "file:///edges.csv" AS row
MATCH(source: Paper {paper_Id: toInteger(row.source)})
MATCH(target: Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target)
""")
Visualise the graph using Cypher queries in Neo4j Browser, for example (after colouring nodes by subject):
MATCH (p:Paper) -[:CITES]-> (q:Paper) RETURN p,q LIMIT 200
3.2. Preprocess and project
Use the `gds` GraphDataScience client to connect to Neo4j and process the graph. First, run a simple Cypher query to make a categorical label from the subject:
"""
gds.run_cypher(MATCH (p:Paper)
SET p.subjectClass = {
Neural_Networks: 0, Rule_Learning: 1, Reinforcement_Learning: 2, Probabilistic_Methods: 3, Theory: 4, Genetic_Algorithms: 5, Case_Based: 6
}[p.subject];
""")
Then project the graph into memory and return a Python object representing it:
G, _ = gds.graph.project(
    "cora-graph",
    {"Paper": {"properties": ["subjectClass"]}},
    {"CITES": {"orientation": "UNDIRECTED", "aggregation": "SINGLE"}},
)
# Check graph memory usage
G.memory_usage()
'901 KiB'
3.3. Create node embedding vectors
We use an embedding vector dimension of 128. We can call Neo4j's FastRP algorithm directly from the Python client.
result = gds.fastRP.mutate(
    G,
    featureProperties=None,
    embeddingDimension=128,
    iterationWeights=[0, 0, 1.0, 1.0],
    normalizationStrength=0.05,
    mutateProperty="fastRP_Extended_Embedding",
)
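The returned result summarises the run. For instance, the number of node properties written should match the number of Paper nodes (field names follow GDS's mutate-mode result and may vary between versions):

# One embedding should have been written per Paper node
print(result["nodePropertiesWritten"])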
After calling FastRP, stream the results back into Python. Each row of `X` is a 128-dimensional vector representing one paper.
X = gds.graph.streamNodeProperty(G, "fastRP_Extended_Embedding")["propertyValue"].apply(pd.Series)
y = gds.graph.streamNodeProperty(G, "subjectClass")["propertyValue"]
To visualise the embeddings, we use the t-SNE algorithm to compress the 128-dimensional vectors into 2D.
from sklearn.manifold import TSNE

X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
Plot the 2D embeddings of the papers, coloured by subject:
import seaborn as sns

df_tsne = pd.DataFrame({"x_1": X_tsne[:, 0], "x_2": X_tsne[:, 1], "y": nodes["subject"]})
sns.scatterplot(df_tsne, x="x_1", y="x_2", hue="y", edgecolor="none").legend(bbox_to_anchor=(1, 1));
We see that, even in this severely compressed 2D view, the papers cluster by class.
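To go beyond the visual impression, one rough quantitative check (a sketch using scikit-learn; scores range from -1 to 1, higher meaning better-separated classes) is the silhouette score of the embeddings against the subject labels:

from sklearn.metrics import silhouette_score

# How well the 128-dimensional embeddings separate the subject classes
print(silhouette_score(X, y))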
3.4 Perform classification with embedding vectors
Note: GDS offers complete ML pipelines for data splitting, training classification models, hyperparameter tuning and inference. This is advantageous as it acts directly on the embeddings stored in the in-memory graph, reducing the need to move data. Here, however, we export the embeddings and demonstrate with scikit-learn for familiarity; a rough sketch of the in-database route is shown below.
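For reference, the in-database route might look roughly like this (a hedged sketch; the pipeline namespace and method names, e.g. `gds.beta.pipeline.nodeClassification`, vary across GDS client versions, so treat the exact calls as assumptions):

# Sketch only: method names may differ in your GDS client version
pipe, _ = gds.beta.pipeline.nodeClassification.create("cora-pipe")
pipe.addNodeProperty("fastRP", mutateProperty="embedding", embeddingDimension=128)
pipe.selectFeatures("embedding")
pipe.addLogisticRegression()
model, _ = pipe.train(G, modelName="cora-model", targetProperty="subjectClass", metrics=["ACCURACY"])

Returning to the scikit-learn route: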
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, seed=1)
classifier.fit(X_train, y_train)
100%|████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.20s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 11.57s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 2.17s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 12.17s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 5/5 [00:11<00:00, 11.69s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 6/6 [00:11<00:00, 11.56s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 1.93s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 2.51s/trial, best loss: 0.1744471744471745]
100%|████████████████████████████████████████████████████████████████| 9/9 [00:03<00:00, 3.29s/trial, best loss: 0.1744471744471745]
100%|██████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 2.86s/trial, best loss: 0.1744471744471745]
Calculate the classifier accuracy on the test set:
f"Accuracy: {format(classifier.score(X_test, y_test), '.0%')}"
'Accuracy: 84%'
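Putting the two results side by side (using the accuracies reported above), the graph-based features cut the test error by roughly 45%:

# Accuracies from sections 2 and 3
tabular_acc, graph_acc = 0.71, 0.84
print(f"Relative reduction in test error: {(graph_acc - tabular_acc) / (1 - tabular_acc):.0%}")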
References
This demo follows this tutorial. Some images are sourced from here; embedding images from source and source.