API

Core

SimSpread.spread — Function

spread(G::AbstractMatrix{Float64})

Calculate the transfer matrix for the adjacency matrix of the trilayered feature-source-target network.

Arguments

G::AbstractMatrix{Float64}: Trilayered feature-source-target network adjacency matrix.

Extended help

Potential interactions between nodes can be identified with a resource diffusion process on the feature-source-target network, that is, the graph G. Each node nᵢ in the network starts with resources located in both its neighboring nodes and its features. First, each feature and each neighboring node of nᵢ spreads its resources equally among its own neighbors. Next, each node that received resources spreads them equally among its neighbors. As a result, the final resources of nᵢ are located in several nodes, suggesting that nᵢ may have potential interactions with these nodes.
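The two spreading steps described above can be sketched on a toy graph. This is only an illustration: it assumes that "equally spread" means dividing each row of the adjacency matrix by the node degree, and the helper name toy_spread is hypothetical, not the package's spread implementation.

```julia
# Hypothetical helper: two rounds of equal resource spreading on a toy graph.
function toy_spread(G::AbstractMatrix{Float64})
    degrees = vec(sum(G; dims=2))     # number of neighbors of each node
    W = G ./ max.(degrees, 1.0)       # row i: node i splits its resources equally
    return W * W                      # spread once, then spread again
end

# Path graph 1 - 2 - 3
G = Float64[0 1 0;
            1 0 1;
            0 1 0]

R = toy_spread(G)   # row i shows where the resources of node i end up
```

Each row of R sums to one, so resources are redistributed but never created or lost in this sketch.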

References

Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

SimSpread.cutoff — Function

cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform x based on SimSpread's similarity cutoff function.

Arguments

x::AbstractFloat : Value to transform
α::AbstractFloat : Similarity cutoff
weighted::Bool : Apply weighting function to outcome (default = false)

cutoff(M::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform the vector or matrix M based on SimSpread's similarity cutoff function.

Arguments

M::AbstractVecOrMat{AbstractFloat} : Matrix or Vector to transform
α::AbstractFloat : Similarity cutoff
weighted::Bool : Apply weighting function to outcome (default = false)
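As a rough illustration of what a similarity cutoff with an optional weighting can look like, the sketch below keeps values at or above α and zeroes the rest. Both the `>=` comparison and the helper name toy_cutoff are assumptions made for this example; the package's cutoff may differ in detail.

```julia
# Hypothetical re-implementation for illustration only.
# Assumes values >= α are kept; weighted = true keeps the original value,
# weighted = false replaces it with one.
toy_cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat} =
    x >= α ? (weighted ? x : one(T)) : zero(T)

S = [1.0  0.55 0.30;
     0.55 1.0  0.39;
     0.30 0.39 1.0]

binary = toy_cutoff.(S, 0.5)         # 1.0 where S >= 0.5, 0.0 elsewhere
kept   = toy_cutoff.(S, 0.5, true)   # original value where S >= 0.5, 0.0 elsewhere
```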

SimSpread.cutoff! — Function

cutoff!(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform, in place, x based on SimSpread's similarity cutoff function.

Arguments

x::AbstractFloat : Value to transform
α::AbstractFloat : Similarity cutoff
weighted::Bool : Apply weighting function to outcome (default = false)

cutoff!(X::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform, in place, the vector or matrix X based on SimSpread's similarity cutoff function.

Arguments

X::AbstractVecOrMat{AbstractFloat} : Matrix or Vector to transform
α::AbstractFloat : Similarity cutoff
weighted::Bool : Apply weighting function to outcome (default = false)

SimSpread.featurize — Function

featurize(X::NamedArray, α::AbstractFloat, weighted::Bool=true)

Transform the feature matrix X into a SimSpread feature matrix.

Arguments

X::NamedArray: Continuous feature matrix
α::AbstractFloat: Featurization cutoff
weighted::Bool : Apply weighting function to outcome (default = true)

References

Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

SimSpread.featurize! — Function

featurize!(X::NamedArray, α::AbstractFloat, weighted::Bool=true)

Transform, in place, the feature matrix X into a SimSpread feature matrix.

Arguments

X::NamedArray : Continuous feature matrix
α::AbstractFloat : Featurization cutoff
weighted::Bool : Apply weighting function to outcome (default = true)

References

Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

SimSpread.construct — Function

construct(y::NamedMatrix, X::NamedMatrix, queries::AbstractVector)

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

y::NamedMatrix: Source-target bipartite network adjacency matrix
X::NamedMatrix: Source-feature bipartite adjacency matrix
queries::AbstractVector: Source nodes to use as query

Extended help

This implementation is intended for k-fold or leave-one-out cross-validation.

construct(ys::T, Xs::T) where {T<:Tuple{NamedMatrix,NamedMatrix}}

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

ys::Tuple{NamedMatrix,NamedMatrix} : Source-target bipartite graph adjacency matrices
Xs::Tuple{NamedMatrix,NamedMatrix} : Source-feature bipartite graph adjacency matrices

Extended help

This implementation is intended for time-split cross-validation or manual construction of the query network.

construct(ytrain::T, ytest::T, Xtrain::T, Xtest::T) where {T<:NamedMatrix}

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

ytrain::NamedMatrix : Training source-target bipartite graph adjacency matrix
ytest::NamedMatrix : Test source-target bipartite graph adjacency matrix
Xtrain::NamedMatrix : Training source-feature bipartite graph adjacency matrix
Xtest::NamedMatrix : Test source-feature bipartite graph adjacency matrix

Extended help

This implementation is intended for time-split cross-validation or manual construction of the query network.

construct(y::NamedMatrix, X::NamedMatrix)

Construct the feature-source-target network for network-based inference prediction and return adjacency matrix.

Arguments

y::NamedMatrix : Source-target bipartite graph adjacency matrix
X::NamedMatrix : Source-feature bipartite graph adjacency matrix

SimSpread.predict — Function

predict(I::Tuple{T,T}, ytest::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}

Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).

Arguments

I::Tuple{NamedMatrix,NamedMatrix}: Feature-source-target trilayered adjacency matrices
ytest::NamedMatrix: Query-target bipartite adjacency matrix
GPU::Bool: Use GPU acceleration for calculation (default = false)
returnweights::Bool: Return the weighting matrix employed for prediction (default = false)

References

Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

predict(A::T, ytrain::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}

Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).

Arguments

A::NamedMatrix: Feature-source-target trilayered adjacency matrix
ytrain::NamedMatrix: Source-target bipartite adjacency matrix
GPU::Bool: Use GPU acceleration for calculation (default = false)
returnweights::Bool: Return the weighting matrix employed for prediction (default = false)

References

Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

SimSpread.save — Function

save(filepath::String, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')

Store predictions as a table in the given file path.

Arguments

filepath::String: Output file path
yhat::NamedArray: Predicted source-target bipartite adjacency matrix
y::NamedArray: Ground-truth source-target bipartite adjacency matrix
delimiter::Char: Delimiter used to write table (default = '\t')

Extended help

Table format is:

fold, source, target, score, label

save(filepath::String, fidx::Int64, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')

Store cross-validation predictions as a table in the given file path.

Arguments

filepath::String: Output file path
fidx::Int64: Numeric fold ID
yhat::NamedArray: Predicted source-target bipartite adjacency matrix
y::NamedArray: Ground-truth source-target bipartite adjacency matrix
delimiter::Char: Delimiter used to write table (default = '\t')

Extended help

Table format is:

fold, source, target, score, label
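The long table format listed above can be illustrated with a small writer that emits one row per source-target pair. The helper name toy_write, the use of plain matrices with separate name vectors, and the row order are assumptions for this sketch and are not taken from the package.

```julia
# Hypothetical writer that mirrors the column order: fold, source, target, score, label.
function toy_write(io::IO, fold::Int, sources, targets, yhat, y; delimiter::Char='\t')
    for (i, s) in enumerate(sources), (j, t) in enumerate(targets)
        println(io, join((fold, s, t, yhat[i, j], y[i, j]), delimiter))
    end
end

io = IOBuffer()
toy_write(io, 1, ["D1"], ["T1", "T2"], [0.9 0.1], [1 0])
table = String(take!(io))   # two tab-delimited rows, one per source-target pair
```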

Cross-validation

Base.split — Function

Base.split(y::NamedArray, k::Int64; seed::Int64=1)

Split source nodes in y into k groups for cross-validation.

Arguments

y::NamedArray: Drug-target rectangular adjacency matrix.
k::Int64: Number of groups to use in data splitting.
seed::Int64: Seed used for data splitting.
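A seeded split of source nodes into k groups can be sketched with the Random standard library. The helper toy_kfold is hypothetical and works on plain indices; the shuffle-then-deal strategy is an assumption and does not necessarily match how the package assigns groups.

```julia
using Random

# Hypothetical helper: shuffle n source indices with a fixed seed
# and deal them into k groups of near-equal size.
function toy_kfold(n::Int, k::Int; seed::Int=1)
    idx = shuffle(MersenneTwister(seed), 1:n)
    return [idx[i:k:end] for i in 1:k]
end

folds = toy_kfold(10, 3)   # 3 groups that together cover indices 1 to 10 once
```

Fixing the seed makes the split reproducible across runs, which is the usual reason for exposing it as an argument.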

SimSpread.clean! — Function

clean!(yhat::NamedArray, A::NamedArray, y::NamedArray)

Flag, in place, erroneous predictions from cross-validation splitting.

Arguments

yhat::NamedArray: Predicted source-target bipartite adjacency matrix
A::NamedArray: Initial resources source-target adjacency matrix
y::NamedArray: Ground-truth source-target bipartite adjacency matrix

Performance assessment

Several evaluation metrics are implemented in the package, which can be classified into three groups: (i) overall performance, (ii) early recognition, and (iii) binary prediction performance.

Overall performance

These are classical evaluation metrics that use the complete list of predictions to assess predictive performance.

SimSpread.AuPRC — Function

AuPRC(y::AbstractVector{Bool}, yhat::AbstractVector)

Area under the Precision-Recall curve using the trapezoidal rule.

Arguments

y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractArray: Prediction scores.

SimSpread.AuROC — Function

AuROC(y::AbstractVector{Bool}, yhat::AbstractVector)

Area under the Receiver Operating Characteristic curve using the trapezoidal rule.

Arguments

y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractArray: Prediction scores.
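For readers unfamiliar with the trapezoidal rule in this setting, the sketch below builds the ROC curve one prediction at a time and integrates it. The helper toy_auroc is hypothetical and assumes there are no tied scores; it is not the package's AuROC implementation.

```julia
# Hypothetical helper: area under the ROC curve by the trapezoidal rule.
# Assumes no tied scores and at least one positive and one negative label.
function toy_auroc(y::AbstractVector{Bool}, yhat::AbstractVector)
    labels = y[sortperm(yhat; rev=true)]                            # best score first
    tpr = [0.0; cumsum(Int.(labels)) ./ count(labels)]              # true positive rate
    fpr = [0.0; cumsum(Int.(.!labels)) ./ count(!, labels)]         # false positive rate
    return sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in eachindex(labels))
end

auc = toy_auroc(Bool[1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this example three of the four positive-negative pairs are ranked correctly, so the area is 0.75.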

Early recognition performance

Due to the roots of SimSpread (target prediction in drug discovery), we include evaluation metrics that aim to assess predictive performance of the best predictions obtained from a model.

In virtual screening, only the best predictions obtained from a model are selected for subsequent experimental validation. Therefore, understanding the predictive performance of a model for these predictions is essential to (1) make accurate predictions that will translate to biological activity and (2) understand the limitations of the model. The metrics discussed here are evaluated at a given cut-off rank and consider only the topmost results returned by the predictive method, so they describe the predictive performance of the model for its best predictions only.

SimSpread.recallatL — Function

recallatL(y, yhat, L)

Get recall@L as proposed by Wu, et al (2017).

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction score.
L::Integer: Length to consider to calculate metrics (default = 20).

recallatL(y, yhat, grouping, L)

Get mean recall@L per group as proposed by Wu, et al (2017).

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction score.
grouping::AbstractVector: Group labels.
L::Integer: Length to consider to calculate metrics (default = 20).

SimSpread.precisionatL — Function

precisionatL(y, yhat, L::Integer=20)

Get precision@L as proposed by Wu, et al (2017).

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction score.
L::Integer: Length to consider to calculate metrics (default = 20).

precisionatL(y, yhat, grouping, L)

Get mean precision@L per group as proposed by Wu, et al (2017).

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction score.
grouping::AbstractVector: Group labels.
L::Integer: Length to consider to calculate metrics (default = 20).
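The rank-based metrics above can be illustrated with a short sketch that looks only at the L best-scored predictions. The helper names are hypothetical, and the definitions used here (hits in the top L divided by L for precision, and by the number of positives for recall) are the common ones, assumed rather than taken from the package source.

```julia
# Hypothetical helpers; y holds Bool labels and yhat the matching scores.
function toy_hits_at(y::AbstractVector{Bool}, yhat::AbstractVector, L::Integer)
    top = sortperm(yhat; rev=true)[1:min(L, length(yhat))]   # indices of the L best scores
    return count(y[top])                                     # positives among them
end

toy_precision_at(y, yhat, L=20) = toy_hits_at(y, yhat, L) / L
toy_recall_at(y, yhat, L=20) = toy_hits_at(y, yhat, L) / count(y)

y = Bool[1, 0, 1, 0, 1, 1]
yhat = [0.9, 0.8, 0.7, 0.6, 0.1, 0.05]

p3 = toy_precision_at(y, yhat, 3)   # 2 hits in the top 3
r3 = toy_recall_at(y, yhat, 3)      # 2 of the 4 positives recovered
```

A grouped variant would apply the same helpers to the predictions of each group and average the results.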

SimSpread.BEDROC — Function

BEDROC(y::AbstractVector{Bool}, yhat::AbstractVector; rev::Bool=true, α::AbstractFloat=20.0)

The Boltzmann Enhanced Discrimination of the Receiver Operating Characteristic (BEDROC) score is a modification of the Receiver Operating Characteristic (ROC) score that allows for a factor of early recognition.

The score takes a value in the interval [0, 1] that indicates the degree to which the predictive model detects the positive class early.

Arguments

y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractArray: Prediction scores.
rev::Bool: True if high values of yhat correlate with the positive class (default = true).
α::AbstractFloat: Early recognition parameter (default = 20.0).

References

Truchon, J.-F., & Bayly, C. I. (2007). Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Journal of Chemical Information and Modeling, 47(2), 488–508. https://doi.org/10.1021/ci600426e
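As an illustration of how the early recognition parameter α enters the score, the sketch below follows the closed-form BEDROC expression from Truchon & Bayly (2007), computed from the ranks of the positive class. The helper toy_bedroc is hypothetical, assumes that high scores indicate the positive class, and is not the package's implementation.

```julia
# Hypothetical helper: BEDROC from the ranks of the positives (rank 1 = best score).
# Assumes at least one positive and one negative label.
function toy_bedroc(y::AbstractVector{Bool}, yhat::AbstractVector; α::Real=20.0)
    N = length(y)                                      # number of predictions
    n = count(y)                                       # number of positives
    Ra = n / N                                         # ratio of positives
    ranks = findall(y[sortperm(yhat; rev=true)])       # ranks of the positives
    # Robust initial enhancement: exponentially weighted ranks over their random expectation
    rie = (sum(exp.(-α .* ranks ./ N)) / n) /
          ((1 / N) * (1 - exp(-α)) / (exp(α / N) - 1))
    # Rescale so that the worst ranking maps to 0 and the best to 1
    return rie * Ra * sinh(α / 2) / (cosh(α / 2) - cosh(α / 2 - α * Ra)) +
           1 / (1 - exp(α * (1 - Ra)))
end

y = Bool[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
best  = toy_bedroc(y, collect(10.0:-1.0:1.0))   # positives ranked first
worst = toy_bedroc(y, collect(1.0:10.0))        # positives ranked last
```

Larger values of α concentrate the weight on the very top of the ranking, so the score rewards positives that appear early.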

Binary prediction performance

A common practice in predictive modelling is to apply a score or probability threshold to the predictions obtained from a model and then manually select predictions for validation. To evaluate predictive performance under this paradigm, we implement a series of metrics meant for binary classification, in which a link is predicted to exist or not based on a given threshold. Summary statistics of these metrics can then be calculated to show how predictive performance varies as the decision boundary changes.
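The thresholding step described above can be sketched as follows. The helper toy_confusion is hypothetical; it assumes that a link is predicted whenever the score is at or above the threshold and returns the counts in the (tn, fp, fn, tp) order used by the metric signatures in this section.

```julia
# Hypothetical helper: confusion counts for a single decision threshold.
function toy_confusion(y::AbstractVector{Bool}, yhat::AbstractVector, threshold::Real)
    pred = yhat .>= threshold            # predicted links
    tp = count(pred .& y)                # predicted and real
    fp = count(pred .& .!y)              # predicted but not real
    fn = count(.!pred .& y)              # real but missed
    tn = count(.!pred .& .!y)            # correctly rejected
    return (tn=tn, fp=fp, fn=fn, tp=tp)
end

y = Bool[1, 1, 0, 1, 0]
yhat = [0.9, 0.8, 0.7, 0.4, 0.2]
counts = toy_confusion(y, yhat, 0.5)
```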

SimSpread.f1score — Function

f1score(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The harmonic mean of precision and recall.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives

SimSpread.mcc — Function

mcc(a::T, b::T, ϵ::AbstractFloat = 0.0001) where {T<:Integer}

Matthews correlation coefficient using a calculus approximation for cases where FN+TN, FP+TN, TP+FN or TP+FP equals zero.

Arguments

a::Integer: Value of position a in confusion matrix
b::Integer: Value of position b in confusion matrix
ϵ::AbstractFloat: Approximation coefficient (default = floatmin(Float64))

Extended help

The confusion matrix in a binary prediction is comprised of 4 distinct positions:

                    | Predicted positive     Predicted negative
    ----------------+--------------------------------------------
    Actual positive |  True positives (TP)   False negatives (FN)
    Actual negative | False positives (FP)    True negatives (TN)

In the case a row or column of the confusion matrix equals zero, MCC is undefined. Therefore, to correctly use MCC with this approximation, arguments a and b are defined as follows:

If "Predicted positive" column is zero, a is TN and b is FN
If "Predicted negative" column is zero, a is TP and b is FP
If "Actual positive" row is zero, a is TN and b is FP
If "Actual negative" row is zero, a is TP and b is FN

Reference

Chicco, D., Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).

mcc(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

Matthews correlation coefficient, a special case of the phi coefficient. A performance metric used to overcome class imbalance issues.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives

Reference

Chicco, D., Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
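To show why an approximation is needed when a row or column of the confusion matrix is zero, the sketch below evaluates the textbook MCC formula directly. The helper toy_mcc is hypothetical and uses the (tn, fp, fn, tp) argument order of this section; it does not reproduce the ϵ-based approximation of the package.

```julia
# Hypothetical helper: textbook MCC from confusion counts.
function toy_mcc(tn::Integer, fp::Integer, fn::Integer, tp::Integer)
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator
end

regular = toy_mcc(50, 10, 5, 35)    # all rows and columns are non-zero
degenerate = toy_mcc(0, 0, 5, 35)   # "Actual negative" row is zero: 0 / 0 gives NaN
```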

SimSpread.accuracy — Function

accuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The number of correct predictions divided by the total number of predictions.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives

SimSpread.balancedaccuracy — Function

balancedaccuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The arithmetic mean of sensitivity and specificity. Useful when dealing with imbalanced data.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives

SimSpread.recall — Function

recall(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The fraction of positive samples correctly predicted as positive.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives

SimSpread.precision — Function

precision(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The fraction of positive predictions that are correct.

Arguments

tn::Integer: True negatives
fp::Integer: False positives
fn::Integer: False negatives
tp::Integer: True positives
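The confusion-matrix metrics in this section can be summarized with their textbook definitions. The toy_ helpers below are hypothetical stand-ins that take counts in the same (tn, fp, fn, tp) order; they are a sketch for comparison, not the package's code.

```julia
# Hypothetical helpers with the textbook definitions.
toy_precision(tn, fp, fn, tp) = tp / (tp + fp)             # correct among predicted positives
toy_recall(tn, fp, fn, tp) = tp / (tp + fn)                # sensitivity
toy_specificity(tn, fp, fn, tp) = tn / (tn + fp)           # correct among actual negatives
toy_accuracy(tn, fp, fn, tp) = (tp + tn) / (tn + fp + fn + tp)
toy_balanced_accuracy(c...) = (toy_recall(c...) + toy_specificity(c...)) / 2
toy_f1(c...) = 2 * toy_precision(c...) * toy_recall(c...) /
               (toy_precision(c...) + toy_recall(c...))

counts = (50, 10, 5, 35)   # (tn, fp, fn, tp)

acc = toy_accuracy(counts...)
bacc = toy_balanced_accuracy(counts...)
f1 = toy_f1(counts...)
```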

SimSpread.meanperformance — Function

meanperformance(confusion::AbstractVector{ROCNums{Int64}}, metric::Function)

Get mean performance of a given metric over a set of confusion matrices.

Arguments

confusion::AbstractVector{ROCNums{Int64}}: Confusion matrices
metric::Function: Performance metric function to use in evaluation.

meanperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get mean performance of a given metric over a pair of label-prediction vectors.

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction score.
metric::Function: Performance metric function to use in evaluation.

SimSpread.meanstdperformance — Function

meanstdperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)

Get mean and standard deviation performance of a given metric over a set of confusion matrices.

Arguments

confusion::AbstractVector{ROCNums{Real}}: Confusion matrix object from MLBase
metric::Function: Performance metric function to use in evaluation.

meanstdperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get mean and standard deviation performance of a given metric over a pair of label-prediction vectors.

Arguments

y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector: Prediction scores.
metric::Function: Performance metric function to use in evaluation.

SimSpread.maxperformance — Function

maxperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)

Get maximum performance of a given metric over a set of confusion matrices.

Arguments

confusion::AbstractVector{ROCNums{Real}}: Confusion matrices.
metric::Function: Performance metric function to use in evaluation.

maxperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get maximum performance of a given metric over a pair of label-prediction vectors.

Arguments

y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
yhat::AbstractVector{Float64}: Prediction score.
metric::Function: Performance metric function to use in evaluation.
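One plausible reading of the mean, mean and standard deviation, and maximum performance functions is that a metric is evaluated at several decision thresholds and then summarized. The sketch below does this for the F1 score, using every distinct prediction score as a threshold. Both the choice of thresholds and the helper toy_f1_at are assumptions made for illustration.

```julia
using Statistics

# Hypothetical helper: F1 score when every score >= threshold is called positive.
function toy_f1_at(y::AbstractVector{Bool}, yhat::AbstractVector, threshold::Real)
    pred = yhat .>= threshold
    tp = count(pred .& y)
    fp = count(pred .& .!y)
    fn = count(.!pred .& y)
    return 2tp / (2tp + fp + fn)
end

y = Bool[1, 1, 0, 1, 0]
yhat = [0.9, 0.8, 0.7, 0.4, 0.2]

# One F1 value per candidate threshold, then summary statistics over them.
f1s = [toy_f1_at(y, yhat, t) for t in unique(yhat)]
best, average, deviation = maximum(f1s), mean(f1s), std(f1s)
```

The maximum answers "how good can the model be at its best threshold", while the mean and standard deviation describe how sensitive the metric is to the choice of threshold.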

Other metrics

SimSpread.validity_ratio — Function

validity_ratio(yhat::AbstractVector)

Ratio of valid predictions (score > 0) to all predictions. Useful for checking whether predictive performance is reported for all predictions or only a subset of them.

Arguments

yhat::AbstractVector : Prediction scores.

Extended help

A limitation of SimSpread is that it cannot generate predictions for query nodes whose similarity to every source of the first network layer is below the similarity threshold α. In this case, the length of the feature vector is zero, resource spreading is not possible, and the predicted value is zero for all targets. Guardrails are therefore needed to correctly assess performance as a function of the threshold α.

This characteristic of SimSpread can be seen as an intrinsic notion of its application domain. For query nodes outside SimSpread’s application domain, no target predictions are generated, rather than returning likely meaningless targets.
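The ratio itself is simple to express. The one-line helper below is a hypothetical equivalent written directly from the description above (a prediction counts as valid when its score is greater than zero).

```julia
# Hypothetical helper: fraction of predictions with a score greater than zero.
toy_validity_ratio(yhat::AbstractVector) = count(>(0), yhat) / length(yhat)

ratio = toy_validity_ratio([0.0, 0.3, 0.0, 0.7])   # half of the predictions are valid
```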

References

Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666

Miscellaneous utilities

SimSpread.read_namedmatrix — Function

read_namedmatrix(filepath::String, valuetype::Type = Float64)

Load a matrix with named indices as a NamedArray.

Arguments

filepath::String : File path of matrix to load
delimiter::Char : Delimiter character between values in matrix (default = ' ')
valuetype::Type : Type of values contained in matrix (default = Float64)

SimSpread.k — Function

k(G::AbstractMatrix)

Get node degrees from an adjacency matrix.

Arguments

G::AbstractMatrix : Adjacency matrix to parse

SimSpread.getyamanishi — Function

getyamanishi(db)

Get a tuple of matrices corresponding to the drug-target adjacency matrix and drug-drug similarity matrix for a given Yamanishi (2008) dataset.

Arguments

db: Dataset ID (any of the following: "nr", "ic", "gpcr" or "e")

Example

julia> dt, dd = getyamanishi("nr");

julia> dt[1:5, 1:5]
5×5 Named Matrix{Float64}
 A ╲ B │  hsa190  hsa2099  hsa2100  hsa2101  hsa2103
───────┼────────────────────────────────────────────
D00040 │     0.0      0.0      0.0      0.0      0.0
D00066 │     0.0      1.0      0.0      0.0      0.0
D00067 │     0.0      1.0      0.0      0.0      0.0
D00075 │     0.0      0.0      0.0      0.0      0.0
D00088 │     0.0      0.0      0.0      0.0      0.0

julia> dd[1:5, 1:5]
5×5 Named Matrix{Float64}
 A ╲ B │   D00040    D00066    D00067    D00075    D00088
───────┼─────────────────────────────────────────────────
D00040 │      1.0  0.545455  0.297297   0.53125  0.459459
D00066 │ 0.545455       1.0  0.387097  0.833333  0.689655
D00067 │ 0.297297  0.387097       1.0  0.464286  0.352941
D00075 │  0.53125  0.833333  0.464286       1.0  0.678571
D00088 │ 0.459459  0.689655  0.352941  0.678571       1.0

Extended help

The provided Yamanishi (2008) [1] datasets (ID) are:

'Nuclear Receptor' (nr)
'Ion Channels' (ic)
'GPCR' (gpcr)
'Enzyme' (e)

This function returns 2 distinct adjacency matrices:

Binary drug-target interaction matrix, obtained from biological annotations
Continuous drug-drug similarity matrix, obtained from SIMCOMP

References

Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., & Kanehisa, M. (2008). Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240. https://doi.org/10.1093/bioinformatics/btn162