API

Core

SimSpread.spread - Function
spread(G::AbstractMatrix{Float64})

Calculate the transfer matrix for the adjacency matrix of the trilayered feature-source-target network.

Arguments

  • G::AbstractMatrix{Float64}: Trilayered feature-source-target network adjacency matrix.

Extended help

Potential interactions between nodes in a graph can be identified by running a resource diffusion process on the feature-source-target network, that is, the graph G. Each node nᵢ starts with resources located in its neighboring nodes and in its features. In the first step, each feature and each neighboring node of nᵢ spreads its resources equally among its own neighbors. In the second step, each of those nodes again spreads its resources equally among its neighbors. The final resources of nᵢ therefore end up in several nodes, which suggests that nᵢ may interact with them.
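The two spreading steps can be illustrated on a toy graph. This is a minimal sketch under stated assumptions, not the package's implementation: the graph, the variable names, and the choice to model one step as a row-normalized adjacency matrix are all illustrative.

```julia
# Toy undirected graph with 4 nodes and edges 1-2, 1-3, 2-4, 3-4 (illustrative data).
G = Float64[0 1 1 0;
            1 0 0 1;
            1 0 0 1;
            0 1 1 0]

# Node degrees; isolated nodes get degree 1 to avoid dividing by zero.
degrees = max.(vec(sum(G; dims=2)), 1.0)

# One spreading step: every node splits its resources equally among its neighbors.
# Broadcasting divides row i of G by degrees[i].
P = G ./ degrees

# Two consecutive spreading steps (assumed here to be a plain matrix product).
W = P * P
```

In this toy case, the resources of node 1 end up split evenly between node 1 itself and node 4, the only nodes reachable in exactly two steps.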

References

  1. Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
  2. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source
SimSpread.cutoff - Function
cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform x based on SimSpread's similarity cutoff function.

Arguments

  • x::AbstractFloat : Value to transform
  • α::AbstractFloat : Similarity cutoff
  • weighted::Bool : Apply weighting function to outcome (default = false)
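The exact transformation is not spelled out in this docstring. The sketch below shows one plausible reading, assuming that values below α become zero and values at or above α become 1, or keep their original value when weighted is true. The helper name toy_cutoff and the `>=` comparison are assumptions made for illustration.

```julia
# Hypothetical helper, not SimSpread's implementation.
# Assumption: x < α maps to 0; otherwise the result is 1, or x itself if `weighted`.
toy_cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat} =
    x >= α ? (weighted ? x : one(T)) : zero(T)

sims = [0.20, 0.55, 0.90]                    # illustrative similarity values
binary_features = toy_cutoff.(sims, 0.5)
weighted_features = toy_cutoff.(sims, 0.5, true)
```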
source
cutoff(M::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform the vector or matrix M based on SimSpread's similarity cutoff function.

Arguments

  • M::AbstractVecOrMat{<:AbstractFloat} : Matrix or Vector to transform
  • α::AbstractFloat : Similarity cutoff
  • weighted::Bool : Apply weighting function to outcome (default = false)
source
SimSpread.cutoff! - Function
cutoff!(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform, in place, x based on SimSpread's similarity cutoff function.

Arguments

  • x::AbstractFloat : Value to transform
  • α::AbstractFloat : Similarity cutoff
  • weighted::Bool : Apply weighting function to outcome (default = false)
source
cutoff!(X::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}

Transform, in place, the vector or matrix X based on SimSpread's similarity cutoff function.

Arguments

  • X::AbstractVecOrMat{<:AbstractFloat} : Matrix or Vector to transform
  • α::AbstractFloat : Similarity cutoff
  • weighted::Bool : Apply weighting function to outcome (default = false)
source
SimSpread.featurize - Function
featurize(X::NamedArray, α::AbstractFloat, weighted::Bool=true)

Transform the feature matrix X into a SimSpread feature matrix.

Arguments

  • X::NamedArray: Continuous feature matrix
  • α::AbstractFloat: Featurization cutoff
  • weighted::Bool : Apply weighting function to outcome (default = true)

References

  1. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source
SimSpread.featurize! - Function
featurize!(X::NamedArray, α::AbstractFloat, weighted::Bool=true)

Transform, in place, the feature matrix X into a SimSpread feature matrix.

Arguments

  • X::NamedArray : Continuous feature matrix
  • α::AbstractFloat : Featurization cutoff
  • weighted::Bool : Apply weighting function to outcome (default = true)

References

  1. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source
SimSpread.construct - Function
construct(y::NamedMatrix, X::NamedMatrix, queries::AbstractVector)

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

  • y::NamedMatrix: Source-target bipartite network adjacency matrix
  • X::NamedMatrix: Source-feature bipartite adjacency matrix
  • queries::AbstractVector: Source nodes to use as query

Extended help

This implementation is intended for k-fold or leave-one-out cross-validation.

source
construct(ys::T, Xs::T) where {T<:Tuple{NamedMatrix,NamedMatrix}}

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

  • ys::Tuple{NamedMatrix,NamedMatrix} : Source-target bipartite graph adjacency matrices
  • Xs::Tuple{NamedMatrix,NamedMatrix} : Source-feature bipartite graph adjacency matrices

Extended help

This implementation is intended for time-split cross-validation or manual construction of the query network.

source
construct(ytrain::T, ytest::T, Xtrain::T, Xtest::T) where {T<:NamedMatrix}

Construct the query-feature-source-target network for de novo network-based inference prediction and return adjacency matrix.

Arguments

  • ytrain::NamedMatrix : Training source-target bipartite graph adjacency matrix
  • ytest::NamedMatrix : Test source-target bipartite graph adjacency matrix
  • Xtrain::NamedMatrix : Training source-feature bipartite graph adjacency matrix
  • Xtest::NamedMatrix : Test source-feature bipartite graph adjacency matrix

Extended help

This implementation is intended for time-split cross-validation or manual construction of the query network.

source
construct(y::NamedMatrix, X::NamedMatrix)

Construct the feature-source-target network for network-based inference prediction and return adjacency matrix.

Arguments

  • y::NamedMatrix : Source-target bipartite graph adjacency matrix
  • X::NamedMatrix : Source-feature bipartite graph adjacency matrix
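One way to picture a feature-source-target adjacency matrix is as a symmetric block matrix assembled from X and y. The sketch below uses plain matrices and assumes a node order of features, then sources, then targets. That ordering, and the toy data, are assumptions for illustration and may differ from what construct returns.

```julia
# Toy data: 3 sources, 2 features, 2 targets (illustrative values).
X = Float64[1 0; 1 1; 0 1]      # source × feature
y = Float64[1 0; 0 1; 1 1]      # source × target

nsrc, nfeat = size(X)
ntgt = size(y, 2)

# Assumed node order: features, then sources, then targets.
# Features connect only to sources, and sources connect to targets.
A = [zeros(nfeat, nfeat)  X'                 zeros(nfeat, ntgt);
     X                    zeros(nsrc, nsrc)  y;
     zeros(ntgt, nfeat)   y'                 zeros(ntgt, ntgt)]
```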
source
SimSpread.predict - Function
predict(I::Tuple{T,T}, ytest::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}

Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).

Arguments

  • I::Tuple{NamedMatrix,NamedMatrix}: Feature-source-target trilayered adjacency matrices
  • ytest::NamedMatrix: Query-target bipartite adjacency matrix
  • GPU::Bool: Use GPU acceleration for calculation (default = false)
  • returnweights::Bool: Return the weighting matrix employed for prediction (default = false)

References

  1. Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
  2. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source
predict(A::T, ytrain::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}

Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).

Arguments

  • A::NamedMatrix: Feature-source-target trilayered adjacency matrix
  • ytrain::NamedMatrix: Source-target bipartite adjacency matrix
  • GPU::Bool: Use GPU acceleration for calculation (default = false)
  • returnweights::Bool: Return the weighting matrix employed for prediction (default = false)

References

  1. Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
  2. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source
SimSpread.save - Function
save(filepath::String, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')

Store predictions as a table in the given file path.

Arguments

  • filepath::String: Output file path
  • yhat::NamedArray: Predicted source-target bipartite adjacency matrix
  • y::NamedArray: Ground-truth source-target bipartite adjacency matrix
  • delimiter::Char: Delimiter used to write table (default = '\t')

Extended help

Table format is:

fold, source, target, score, label
source
save(filepath::String, fidx::Int64, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')

Store cross-validation predictions as a table in the given file path.

Arguments

  • filepath::String: Output file path
  • fidx::Int64: Numeric fold ID
  • yhat::NamedArray: Predicted source-target bipartite adjacency matrix
  • y::NamedArray: Ground-truth source-target bipartite adjacency matrix
  • delimiter::Char: Delimiter used to write table (default = '\t')

Extended help

Table format is:

fold, source, target, score, label
source

Cross-validation

Base.split - Function
Base.split(y::NamedArray, k::Int64; seed::Int64=1)

Split source nodes in y into k groups for cross-validation.

Arguments

  • y::NamedArray: Drug-target rectangular adjacency matrix.
  • k::Int64: Number of groups to use in data splitting.
  • seed::Int64: Seed used for data splitting (default = 1).
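The idea of a seeded k-group split over source nodes can be sketched with the standard library alone. The helper toy_split and the round-robin assignment are assumptions for illustration. The package's own split may group nodes differently.

```julia
using Random

# Hypothetical helper: assign each source node name to one of k groups.
function toy_split(names::AbstractVector, k::Integer; seed::Integer=1)
    rng = MersenneTwister(seed)          # seeded generator makes the split reproducible
    shuffled = shuffle(rng, names)
    # Deal the shuffled names round-robin into k groups.
    return [shuffled[i:k:end] for i in 1:k]
end

sources = ["D1", "D2", "D3", "D4", "D5"]   # illustrative node names
folds = toy_split(sources, 2)
```

Each node lands in exactly one group, and calling the helper again with the same seed reproduces the same groups.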
source
SimSpread.clean! - Function
clean!(yhat::NamedArray, A::NamedArray, y::NamedArray)

Flag, in place, erroneous predictions from cross-validation splitting.

Arguments

  • yhat::NamedArray: Predicted source-target bipartite adjacency matrix
  • A::NamedArray: Initial resources source-target adjacency matrix
  • y::NamedArray: Ground-truth source-target bipartite adjacency matrix
source

Performance assessment

Several evaluation metrics are implemented in the package. They can be classified into three groups: (i) overall performance, (ii) early recognition, and (iii) binary prediction performance.

Overall performance

These metrics are classical evaluation metrics that use the complete list of predictions to assess predictive performance.

SimSpread.AuPRC - Function
AuPRC(y::AbstractVector{Bool}, yhat::AbstractVector)

Area under the Precision-Recall curve using the trapezoidal rule.

Arguments

  • y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
source
SimSpread.AuROC - Function
AuROC(y::AbstractVector{Bool}, yhat::AbstractVector)

Area under the Receiver Operating Characteristic curve using the trapezoidal rule.

Arguments

  • y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
source
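The trapezoidal rule mentioned for AuROC can be sketched as follows. The helper toy_auroc is hypothetical and assumes both classes are present. It also makes no special provision for tied scores, which a production implementation would need to handle.

```julia
# Hypothetical helper; ties between scores are not treated specially.
function toy_auroc(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real})
    order = sortperm(yhat; rev=true)          # best-scored predictions first
    labels = y[order]
    npos = count(labels)
    nneg = length(labels) - npos
    # ROC points after each prediction, starting at the origin.
    tpr = [0.0; cumsum(labels) ./ npos]
    fpr = [0.0; cumsum(.!labels) ./ nneg]
    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for i in 1:length(labels)
        area += (fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2
    end
    return area
end

auc_perfect = toy_auroc([true, true, false, false], [0.9, 0.8, 0.3, 0.1])
auc_mixed = toy_auroc([true, false, true, false], [0.9, 0.8, 0.7, 0.1])
```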

Early recognition performance

Due to the roots of SimSpread (target prediction in drug discovery), we include evaluation metrics that aim to assess predictive performance of the best predictions obtained from a model.

In virtual screening, only the best predictions obtained from a model are selected for posterior experimental validation. Therefore, understanding the predictive performance of a model for these predictions is essential to (1) make accurate predictions that will translate to biological activity and (2) understand the limitations of the model. The metrics discussed here can be evaluated at a given cut-off rank, considering only the topmost results returned by the predictive method, hence informing of the predictive performance of the model for only the best predictions.

SimSpread.recallatL - Function
recallatL(y, yhat, L)

Get recall@L as proposed by Wu, et al (2017).

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • L::Integer: Length to consider to calculate metrics (default = 20).
source
recallatL(y, yhat, grouping, L)

Get mean recall@L per group as proposed by Wu, et al (2017).

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • grouping::AbstractVector: Group labels.
  • L::Integer: Length to consider to calculate metrics (default = 20).
source
SimSpread.precisionatL - Function
precisionatL(y, yhat, L::Integer=20)

Get precision@L as proposed by Wu, et al (2017).

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • L::Integer: Length to consider to calculate metrics (default = 20).
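For the ungrouped case, precision@L and recall@L both start from the number of true positives among the L best-scored predictions. The helpers below are hypothetical sketches. Dividing precision by L even when fewer than L predictions exist is an assumption, not something stated in this docstring.

```julia
# Hypothetical helper: number of positives among the L top-scored predictions.
function toy_topL_hits(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}, L::Integer)
    order = sortperm(yhat; rev=true)
    top = order[1:min(L, length(order))]
    return count(y[top])
end

# precision@L: hits divided by the list length L.
toy_precision_at_L(y, yhat, L=20) = toy_topL_hits(y, yhat, L) / L
# recall@L: hits divided by the total number of positives.
toy_recall_at_L(y, yhat, L=20) = toy_topL_hits(y, yhat, L) / count(y)

labels = [true, false, true, false, false]   # illustrative data
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
p2 = toy_precision_at_L(labels, scores, 2)
r2 = toy_recall_at_L(labels, scores, 2)
r3 = toy_recall_at_L(labels, scores, 3)
```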
source
precisionatL(y, yhat, grouping, L)

Get mean precision@L per group as proposed by Wu, et al (2017).

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • grouping::AbstractVector: Group labels.
  • L::Integer: Length to consider to calculate metrics (default = 20).
source
SimSpread.BEDROC - Function
BEDROC(y::AbstractVector{Bool}, yhat::AbstractVector; rev::Bool=true, α::AbstractFloat=20.0)

The Boltzmann-Enhanced Discrimination of the Receiver Operating Characteristic (BEDROC) score is a modification of the Receiver Operating Characteristic (ROC) score that gives more weight to early recognition.

The score takes a value in the interval [0, 1], indicating the degree to which the predictive model detects the positive class early.

Arguments

  • y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • rev::Bool: True if high values of yhat correlate with the positive class (default = true).
  • α::AbstractFloat: Early recognition parameter (default = 20.0).
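As a rough illustration of how α weights early ranks, the sketch below follows the closed-form expression from Truchon & Bayly (2007) as commonly implemented. The helper toy_bedroc is hypothetical. It assumes higher scores are better and that at least one positive and one negative label are present.

```julia
# Hypothetical helper following the closed form in Truchon & Bayly (2007).
function toy_bedroc(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}; α::Real=20.0)
    N = length(y)
    n = count(y)
    order = sortperm(yhat; rev=true)
    ranks = findall(y[order])                 # 1-based ranks of the positives
    Ra = n / N                                # fraction of positives
    # Robust initial enhancement (RIE): exponentially down-weights late ranks.
    rie = (sum(exp.(-α .* ranks ./ N)) / n) / ((1 / N) * (1 - exp(-α)) / (exp(α / N) - 1))
    # Rescale RIE so that the score lies in [0, 1].
    return rie * Ra * sinh(α / 2) / (cosh(α / 2) - cosh(α / 2 - α * Ra)) +
           1 / (1 - exp(α * (1 - Ra)))
end

toy_labels = [true, true, false, false, false, false, false, false, false, false]
toy_scores = collect(10.0:-1.0:1.0)
best = toy_bedroc(toy_labels, toy_scores)             # positives ranked first
worst = toy_bedroc(reverse(toy_labels), toy_scores)   # positives ranked last
```

With the positives ranked first the score reaches the top of the interval, and with the positives ranked last it drops to the bottom.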

References

  1. Truchon, J.-F., & Bayly, C. I. (2007). Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Journal of Chemical Information and Modeling, 47(2), 488–508. https://doi.org/10.1021/ci600426e

source

Binary prediction performance

A common practice in predictive modelling is to set a score or probability threshold on the predictions obtained from a model and then manually select, or cherry-pick, predictions for validation. To evaluate predictive performance under this paradigm, we implement a series of metrics meant for binary classification, in which a link either exists or not based on a given threshold. Statistical moments of these metrics can then be calculated to give a notion of predictive performance as the decision boundary changes.

SimSpread.f1score - Function
f1score(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The harmonic mean of precision and recall.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
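The harmonic-mean definition can be checked by hand with a few lines. The toy_ helpers below are illustrative stand-ins that reuse the (tn, fp, fn, tp) argument order shown here. They are not the package's functions and do not guard against empty denominators.

```julia
# Hypothetical helpers using the (tn, fp, fn, tp) argument order.
toy_precision(tn, fp, fn, tp) = tp / (tp + fp)
toy_recall(tn, fp, fn, tp) = tp / (tp + fn)

function toy_f1(tn, fp, fn, tp)
    p = toy_precision(tn, fp, fn, tp)
    r = toy_recall(tn, fp, fn, tp)
    return 2 * p * r / (p + r)      # harmonic mean of precision and recall
end

f1 = toy_f1(50, 10, 5, 35)          # illustrative counts
```

The result equals 2tp / (2tp + fp + fn), the usual count-based form of the same score.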
source
SimSpread.mcc - Function
mcc(a::T, b::T, ϵ::AbstractFloat = 0.0001) where {T<:Integer}

Matthews correlation coefficient using a calculus approximation for when FN+TN, FP+TN, TP+FN or TP+FP equals zero.

Arguments

  • a::Integer: Value of position a in confusion matrix
  • b::Integer: Value of position b in confusion matrix
  • ϵ::AbstractFloat: Approximation coefficient (default = 0.0001)

Extended help

The confusion matrix in a binary prediction is comprised of 4 distinct positions:

                    | Predicted positive     Predicted negative
    ----------------+--------------------------------------------
    Actual positive |  True positives (TP)   False negatives (FN)
    Actual negative | False positives (FP)    True negatives (TN)

If a row or column of the confusion matrix sums to zero, MCC is undefined. Therefore, to correctly use MCC with this approximation, arguments a and b are defined as follows:

  • If the "Predicted positive" column is zero, a is TN and b is FN
  • If the "Predicted negative" column is zero, a is TP and b is FP
  • If the "Actual positive" row is zero, a is TN and b is FP
  • If the "Actual negative" row is zero, a is TP and b is FN

Reference

  1. Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6.

source
mcc(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

Matthews correlation coefficient, a special case of the phi coefficient. This performance metric is used to overcome class imbalance issues.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
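The standard MCC formula for a full confusion matrix can be written in a few lines. The helper toy_mcc is a hypothetical stand-in, not the package's function. It returns NaN when any row or column of the confusion matrix sums to zero, which is the situation the ϵ-based method is documented to handle.

```julia
# Hypothetical helper using the (tn, fp, fn, tp) argument order.
function toy_mcc(tn, fp, fn, tp)
    numerator = tp * tn - fp * fn
    # Product of the four marginal sums of the confusion matrix.
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator
end

mcc_mixed = toy_mcc(50, 10, 5, 35)    # illustrative counts
mcc_perfect = toy_mcc(5, 0, 0, 5)     # no errors at all
```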

Reference

  1. Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6.

source
SimSpread.accuracy - Function
accuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The number of correct predictions divided by the total number of predictions.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
source
SimSpread.balancedaccuracy - Function
balancedaccuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The arithmetic mean of sensitivity and specificity. It is intended for imbalanced data.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
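A small worked case shows why the balanced variant matters for imbalanced data. The toy_ helpers and the counts below are illustrative assumptions, not the package's functions.

```julia
# Hypothetical helpers using the (tn, fp, fn, tp) argument order.
toy_accuracy(tn, fp, fn, tp) = (tp + tn) / (tn + fp + fn + tp)

function toy_balanced_accuracy(tn, fp, fn, tp)
    sensitivity = tp / (tp + fn)    # recall on the positive class
    specificity = tn / (tn + fp)    # recall on the negative class
    return (sensitivity + specificity) / 2
end

# Imbalanced toy case: 95 negatives, 5 positives, every sample predicted negative.
acc = toy_accuracy(95, 0, 5, 0)
bacc = toy_balanced_accuracy(95, 0, 5, 0)
```

Plain accuracy is 0.95 here even though no positive is recovered, while balanced accuracy is 0.5, the value expected from an uninformative classifier.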
source
SimSpread.recall - Function
recall(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The fraction of positive samples correctly predicted as positive.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
source
SimSpread.precision - Function
precision(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}

The fraction of positive predictions that are correct.

Arguments

  • tn::Integer: True negatives
  • fp::Integer: False positives
  • fn::Integer: False negatives
  • tp::Integer: True positives
source
SimSpread.meanperformance - Function
meanperformance(confusion::AbstractVector{ROCNums{Int64}}, metric::Function)

Get mean performance of a given metric over a set of confusion matrices.

Arguments

  • confusion::AbstractVector{ROCNums{Int64}}: Confusion matrices
  • metric::Function: Performance metric function to use in evaluation.
source
meanperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get mean performance of a given metric over a pair of label-prediction vectors.

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • metric::Function: Performance metric function to use in evaluation.
source
SimSpread.meanstdperformance - Function
meanstdperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)

Get mean and standard deviation performance of a given metric over a set of confusion matrices.

Arguments

  • confusion::AbstractVector{ROCNums{Real}}: Confusion matrix object from MLBase
  • metric::Function: Performance metric function to use in evaluation.
source
meanstdperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get mean and standard deviation performance of a given metric over a pair of label-prediction vectors.

Arguments

  • y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector: Prediction scores.
  • metric::Function: Performance metric function to use in evaluation.
source
SimSpread.maxperformance - Function
maxperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)

Get maximum performance of a given metric over a set of confusion matrices.

Arguments

  • confusion::AbstractVector{ROCNums{Real}}: Confusion matrices.
  • metric::Function: Performance metric function to use in evaluation.
source
maxperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)

Get maximum performance of a given metric over a pair of label-prediction vectors.

Arguments

  • y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
  • yhat::AbstractVector{Float64}: Prediction scores.
  • metric::Function: Performance metric function to use in evaluation.
source
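The idea of summarizing a metric over a moving decision boundary can be sketched without any dependencies. The helpers below are hypothetical. In particular, taking the thresholds from the observed scores and labelling a prediction positive when its score is at or above the threshold are assumptions, and the package may build its confusion matrices differently.

```julia
# Hypothetical helper: confusion counts at one threshold, in (tn, fp, fn, tp) order.
function toy_confusion(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}, threshold::Real)
    pred = yhat .>= threshold
    tp = count(pred .& y)
    fp = count(pred .& (.!y))
    fn = count((.!pred) .& y)
    tn = count((.!pred) .& (.!y))
    return tn, fp, fn, tp
end

# Hypothetical helper: best value of `metric` over all observed score thresholds.
function toy_maxperformance(y, yhat, metric::Function)
    return maximum(metric(toy_confusion(y, yhat, t)...) for t in unique(yhat))
end

toy_acc(tn, fp, fn, tp) = (tp + tn) / (tn + fp + fn + tp)

ytrue = [true, false, true, false]       # illustrative data
yscore = [0.9, 0.8, 0.7, 0.1]
best_acc = toy_maxperformance(ytrue, yscore, toy_acc)
```

Replacing maximum with a mean over the same thresholds gives the corresponding mean-performance summary.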

Other metrics

SimSpread.validity_ratio - Function
validity_ratio(yhat::AbstractVector)

Ratio of valid predictions (score > 0) to all predictions. Use it to check whether predictive performance is reported for all predictions or only a subset of them.

Arguments

  • yhat::AbstractVector : Prediction scores.

Extended help

A limitation of SimSpread is that it cannot generate predictions for query nodes whose similarity to every source of the first network layer is below the similarity threshold α. In this case, the feature vector has length zero, resource spreading is not possible, and the predicted value is zero for all targets. Guardrails are therefore needed to correctly assess performance as a function of the threshold α.

This characteristic of SimSpread can be seen as an intrinsic notion of its application domain. Instead of returning likely meaningless targets, SimSpread generates no target predictions for query nodes outside its application domain.
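The ratio itself follows directly from the definition above (score > 0 counts as valid). The helper name toy_validity_ratio and the scores are illustrative.

```julia
# Hypothetical helper: fraction of predictions with a strictly positive score.
toy_validity_ratio(yhat::AbstractVector{<:Real}) = count(>(0), yhat) / length(yhat)

ratio = toy_validity_ratio([0.0, 0.3, 0.0, 0.8])   # two of four scores are valid
```

A ratio well below 1 signals that the reported performance covers only part of the query set at the chosen α.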

References

  1. Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
source

Miscellaneous utilities

SimSpread.read_namedmatrix - Function
read_namedmatrix(filepath::String, valuetype::Type = Float64)

Load a matrix with named indices as a NamedArray.

Arguments

  • filepath::String : File path of matrix to load
  • delimiter::Char : Delimiter character between values in matrix (default = ' ')
  • valuetype::Type : Type of values contained in matrix (default = Float64)
source
SimSpread.k - Function
k(G::AbstractMatrix)

Get node degrees from an adjacency matrix.

Arguments

  • G::AbstractMatrix : Matrix to parse
source
SimSpread.getyamanishi - Function
getyamanishi(db)

Get a tuple of matrices corresponding to the drug-target adjacency matrix and drug-drug similarity matrix for a given Yamanishi (2008) dataset.

Arguments

  • db: Dataset ID (any of the following: "nr", "ic", "gpcr" or "e")

Example

julia> dt, dd = getyamanishi("nr");

julia> dt[1:5, 1:5]
5×5 Named Matrix{Float64}
 A ╲ B │  hsa190  hsa2099  hsa2100  hsa2101  hsa2103
───────┼────────────────────────────────────────────
D00040 │     0.0      0.0      0.0      0.0      0.0
D00066 │     0.0      1.0      0.0      0.0      0.0
D00067 │     0.0      1.0      0.0      0.0      0.0
D00075 │     0.0      0.0      0.0      0.0      0.0
D00088 │     0.0      0.0      0.0      0.0      0.0

julia> dd[1:5, 1:5]
5×5 Named Matrix{Float64}
 A ╲ B │   D00040    D00066    D00067    D00075    D00088
───────┼─────────────────────────────────────────────────
D00040 │      1.0  0.545455  0.297297   0.53125  0.459459
D00066 │ 0.545455       1.0  0.387097  0.833333  0.689655
D00067 │ 0.297297  0.387097       1.0  0.464286  0.352941
D00075 │  0.53125  0.833333  0.464286       1.0  0.678571
D00088 │ 0.459459  0.689655  0.352941  0.678571       1.0

Extended help

The provided Yamanishi (2008) [1] datasets (ID) are:

  • 'Nuclear Receptor' (nr)
  • 'Ion Channels' (ic)
  • 'GPCR' (gpcr)
  • 'Enzyme' (e)

This function returns 2 distinct adjacency matrices:

  • Binary drug-target interaction matrix, obtained from biological annotations
  • Continuous drug-drug similarity matrix, obtained from SIMCOMP

References

  1. Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., & Kanehisa, M. (2008). Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240. https://doi.org/10.1093/bioinformatics/btn162
source