Getting started with SimSpread.jl

SimSpread is a novel approach for predicting interactions between two distinct sets of nodes, query and target nodes. It uses a vector of similarity measures between query nodes as a meta-description, in combination with network-based inference for link prediction.

In this tutorial, we will skim through the basic workflow of SimSpread.jl, using as an example the prediction of drug-target interactions for the Nuclear Receptor dataset from Yamanishi et al. (2008).

Preparing our problem

First, we will download the known drug-target interaction matrix and drug-drug SIMCOMP similarity matrix for the 'Nuclear Receptor' dataset. The package provides a helper for easy download and preparation for this group of datasets (refer to ??getyamanishi for more information).

using SimSpread

DT, DD = getyamanishi("nr")

Let's visualize our data as heatmaps:

Data splitting

Next, we will train a model using SimSpread to predict the targets for a subset of drugs in the dataset. For this, we will split our dataset into two groups: a training set, which will contain roughly 90% of the drugs, and a testing set, which will contain the remaining 10%:

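N = size(DT, 1)          # number of rows of DT (one row per drug)

train = rand(N) .< 0.9   # each drug is assigned to training with probability 0.9
test = .!train           # the remaining drugs form the testing set

ytrain = DT[train, :]
ytest = DT[test, :]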

Training set size: (49, 26)
Testing set size:  (5, 26)

As seen here, around 90% of the drugs fall in the training set and the rest in the testing set. From this split, we will proceed to construct our query network and predict the interactions for the testing set.
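Because `rand(N) .< 0.9` draws each drug independently, the split sizes change from run to run. A minimal sketch of a reproducible, exact 90/10 split is shown here; the seed value and the use of `randperm` are illustrative choices rather than part of the original workflow, and `N = 54` is taken from the sizes printed above.

```julia
using Random

# Illustrative seed; any fixed value makes the split repeatable.
rng = MersenneTwister(42)

N = 54                         # number of drugs (49 + 5 from the printed sizes)
ntrain = round(Int, 0.9 * N)   # exact number of training drugs

# Shuffle the row indices and mark the first `ntrain` as training rows.
perm = randperm(rng, N)
train = falses(N)
train[perm[1:ntrain]] .= true
test = .!train

println("Training rows: ", sum(train))
println("Testing rows:  ", sum(test))
```

The resulting `train` and `test` masks can be used for row indexing in the same way as the masks built with `rand`.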

Similarity-based meta-description preparation

SimSpread uses a meta-description constructed from the similarity between source nodes (drugs in the working example). For this, a similarity threshold (denoted α) is applied, keeping only the links between source nodes whose weight is greater than or equal to this threshold.

This procedure encodes the question "Is drug i similar to drug j?", which is later used by the resource spreading algorithm for link prediction.

α = 0.35
Xtrain = featurize(DD[train, train], α, false)
Xtest = featurize(DD[test, train], α, false)

Let's compare the similarity matrices before and after the featurization procedure:

Training set:

Testing set:

As seen here, all comparisons with a weight lower than our threshold α are eliminated (i.e., set to zero), and a structure emerges in the featurized matrix.
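The effect of the threshold can be illustrated on a small hand-made similarity matrix. The helper `threshold_similarity` below is hypothetical and only mimics the idea described in this section (keep entries at or above α, zero out the rest); it is not the package's `featurize`, and its `weighted` keyword is an assumption made for the illustration.

```julia
# Hypothetical helper: keeps entries with weight >= α and zeroes the rest.
# With weighted=false the kept entries become 1.0; otherwise they keep their weight.
function threshold_similarity(S::AbstractMatrix{<:Real}, α::Real; weighted::Bool=false)
    keep = S .>= α
    return weighted ? S .* keep : Float64.(keep)
end

# Toy symmetric similarity matrix for three drugs (made-up values).
S = [1.0 0.4  0.2;
     0.4 1.0  0.35;
     0.2 0.35 1.0]

X_binary = threshold_similarity(S, 0.35)
X_weighted = threshold_similarity(S, 0.35; weighted=true)

display(X_binary)
display(X_weighted)
```

In the binary case each entry answers "Is drug i similar to drug j?" with 1 or 0, so only the pair with similarity 0.2 is dropped in this toy matrix.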

Predicting DTIs with SimSpread

Now that we have all the information necessary for SimSpread, we can construct the query graph that is used to predict links with the network-based inference resource allocation algorithm.

First, we need to construct the query network for label prediction:

G = construct(ytrain, ytest, Xtrain, Xtest)

From this, we can predict the labels as follows:

ŷtrain = predict(G, ytrain)
ŷtest = predict(G, ytest)
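For intuition about resource allocation, the sketch below runs the classic two-step network-based inference on a toy bipartite drug-target matrix. It is a generic illustration under stated assumptions: the function `nbi_scores` and the toy matrix are hypothetical, and this is neither the query network built by `construct` nor the implementation behind `predict`.

```julia
using LinearAlgebra

# Two-step resource spreading on a bipartite graph (rows: drugs, columns: targets).
# Step 1: each target splits its resource equally among the drugs linked to it.
# Step 2: each drug splits what it received equally among its own targets.
function nbi_scores(Y::AbstractMatrix{<:Real})
    k_drugs = vec(sum(Y, dims=2))     # degree of each drug
    k_targets = vec(sum(Y, dims=1))   # degree of each target
    inv_drugs = [k > 0 ? 1 / k : 0.0 for k in k_drugs]
    inv_targets = [k > 0 ? 1 / k : 0.0 for k in k_targets]
    return Y * Diagonal(inv_targets) * Y' * Diagonal(inv_drugs) * Y
end

# Toy interaction matrix: 3 drugs × 3 targets (made-up values).
Y = [1.0 1.0 0.0;
     0.0 1.0 1.0;
     1.0 0.0 0.0]

F = nbi_scores(Y)
display(F)
```

Drug 3 is only linked to target 1, yet it receives a nonzero score for target 2 because drug 1 shares target 1 with it and is also linked to target 2. The total resource of each drug is conserved, so each row of `F` sums to that drug's degree.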

Finally, we assess the performance of our model using the area under the ROC curve (AuROC):

AuROC training set: 0.989
AuROC testing set:  0.729
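The evaluation code is not shown on this page. As a rough illustration of what AuROC measures, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as half. The function name `auroc_sketch` and the toy labels and scores are hypothetical; this is not the package's evaluation routine.

```julia
# Pairwise (Mann-Whitney) formulation of the area under the ROC curve.
# Returns NaN when one of the two classes is empty.
function auroc_sketch(y::AbstractVector{Bool}, scores::AbstractVector{<:Real})
    pos = scores[y]      # scores of true interactions
    neg = scores[.!y]    # scores of non-interactions
    (isempty(pos) || isempty(neg)) && return NaN
    wins = 0.0
    for p in pos, n in neg
        wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return wins / (length(pos) * length(neg))
end

# Toy example: 2 positives, 3 negatives; 5 of the 6 pairs are ordered correctly.
y = [true, false, true, false, false]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]

println("AuROC: ", round(auroc_sketch(y, scores), digits=3))
```

To apply this to the tutorial's matrices, the labels and predictions would first need to be flattened (for example with `vec`), assuming both are plain arrays of equal size with Boolean labels.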

Let's visualize the predictions obtained from our model:

Training set:

Testing set:

This wraps up our tutorial. The following tutorials provide (i) a more in-depth use case of the core utilities of SimSpread.jl and (ii) a guide on how to optimize a SimSpread model and evaluate its predictions. Additionally, recipes for common ML tasks, specifically common cross-validation scenarios, are provided in the next sections.

This page was generated using Literate.jl.