API
Core
SimSpread.spread
— Function
spread(G::AbstractMatrix{Float64})
Calculate the transfer matrix for the adjacency matrix of the trilayered feature-source-target network.
Arguments
- G::AbstractMatrix{Float64}: Trilayered feature-source-target network adjacency matrix.
Extended help
Potential interactions between nodes in a graph can be identified by running a resource diffusion process on the feature-source-target network, that is, the graph G described above. Each node nᵢ in the network has initial resources located in both its neighboring nodes and its features. First, each feature and each neighboring node of nᵢ spreads its resources equally among its own neighboring nodes. Then, each of those nodes in turn spreads its resources equally among its neighbors. As a result, nᵢ obtains final resources located in several neighboring nodes, suggesting that nᵢ may have potential interactions with these nodes.
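The two spreading steps described above can be illustrated on a toy graph. This is a minimal sketch in plain Julia, not the package's implementation of spread: the 4-node adjacency matrix and the names degrees, P, r0, r1 and r2 are assumptions made for illustration only.

```julia
# Toy undirected graph with 4 nodes arranged in a cycle (hypothetical example data).
G = Float64[0 1 1 0;
            1 0 0 1;
            1 0 0 1;
            0 1 1 0]

# Degree of each node (every node in this toy graph has at least one neighbor).
degrees = vec(sum(G, dims=2))

# P[i, j] is the share of node i's resources that is sent to node j.
P = G ./ degrees

# Initial resources for node 1: one unit located on each of its neighbors.
r0 = G[1, :]

# Two spreading steps: every node splits its resources equally among its neighbors.
r1 = P' * r0
r2 = P' * r1

println(r2)
```

In this sketch the total amount of resources is conserved at every step; only its distribution over the nodes changes.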
References
- Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
SimSpread.cutoff
— Function
cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}
Transform x based on SimSpread's similarity cutoff function.
Arguments
- x::AbstractFloat: Value to transform
- α::AbstractFloat: Similarity cutoff
- weighted::Bool: Apply weighting function to outcome (default = false)
cutoff(M::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}
Transform the vector or matrix M based on SimSpread's similarity cutoff function.
Arguments
- M::AbstractVecOrMat{AbstractFloat}: Matrix or vector to transform
- α::AbstractFloat: Similarity cutoff
- weighted::Bool: Apply weighting function to outcome (default = false)
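As an illustration, the sketch below shows one way such a cutoff could behave. It is not the package's code: the helper name toy_cutoff is hypothetical, and it assumes that values greater than or equal to α are kept (as 1, or as the original value when weighted is true) while all other values are set to zero.

```julia
# Hypothetical helper; assumes values ≥ α are kept and all other values are zeroed.
toy_cutoff(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat} =
    x ≥ α ? (weighted ? x : one(T)) : zero(T)

similarities = [0.2, 0.5, 0.9]

# Broadcasting applies the scalar helper to every element.
kept_binary = toy_cutoff.(similarities, 0.5)          # [0.0, 1.0, 1.0]
kept_weighted = toy_cutoff.(similarities, 0.5, true)  # [0.0, 0.5, 0.9]
```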
SimSpread.cutoff!
— Function
cutoff!(x::T, α::T, weighted::Bool=false) where {T<:AbstractFloat}
Transform, in place, x based on SimSpread's similarity cutoff function.
Arguments
- x::AbstractFloat: Value to transform
- α::AbstractFloat: Similarity cutoff
- weighted::Bool: Apply weighting function to outcome (default = false)
cutoff!(X::AbstractVecOrMat{T}, α::T, weighted::Bool=false) where {T<:AbstractFloat}
Transform, in place, the vector or matrix X based on SimSpread's similarity cutoff function.
Arguments
- X::AbstractVecOrMat{AbstractFloat}: Matrix or vector to transform
- α::AbstractFloat: Similarity cutoff
- weighted::Bool: Apply weighting function to outcome (default = false)
SimSpread.featurize
— Function
featurize(X::NamedArray, α::AbstractFloat, weighted::Bool=true)
Transform the feature matrix X into a SimSpread feature matrix.
Arguments
- X::NamedArray: Continuous feature matrix
- α::AbstractFloat: Featurization cutoff
- weighted::Bool: Apply weighting function to outcome (default = true)
References
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
SimSpread.featurize!
— Function
featurize!(X::NamedArray, α::AbstractFloat, weighted::Bool=true)
Transform, in place, the feature matrix X into a SimSpread feature matrix.
Arguments
- X::NamedArray: Continuous feature matrix
- α::AbstractFloat: Featurization cutoff
- weighted::Bool: Apply weighting function to outcome (default = true)
References
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
SimSpread.construct
— Function
construct(y::NamedMatrix, X::NamedMatrix, queries::AbstractVector)
Construct the query-feature-source-target network for de novo network-based inference prediction and return its adjacency matrix.
Arguments
- y::NamedMatrix: Source-target bipartite network adjacency matrix
- X::NamedMatrix: Source-feature bipartite adjacency matrix
- queries::AbstractVector: Source nodes to use as query
Extended help
This implementation is intended for k-fold or leave-one-out cross-validation.
construct(ys::T, Xs::T) where {T<:Tuple{NamedMatrix,NamedMatrix}}
Construct the query-feature-source-target network for de novo network-based inference prediction and return its adjacency matrix.
Arguments
- ys::Tuple{NamedMatrix,NamedMatrix}: Source-target bipartite graph adjacency matrices
- Xs::Tuple{NamedMatrix,NamedMatrix}: Source-feature bipartite graph adjacency matrices
Extended help
This implementation is intended for time-split cross-validation or manual construction of the query network.
construct(ytrain::T, ytest::T, Xtrain::T, Xtest::T) where {T<:NamedMatrix}
Construct the query-feature-source-target network for de novo network-based inference prediction and return its adjacency matrix.
Arguments
- ytrain::NamedMatrix: Training source-target bipartite graph adjacency matrix
- ytest::NamedMatrix: Test source-target bipartite graph adjacency matrix
- Xtrain::NamedMatrix: Training source-feature bipartite graph adjacency matrix
- Xtest::NamedMatrix: Test source-feature bipartite graph adjacency matrix
Extended help
This implementation is intended for time-split cross-validation or manual construction of the query network.
construct(y::NamedMatrix, X::NamedMatrix)
Construct the feature-source-target network for network-based inference prediction and return its adjacency matrix.
Arguments
- y::NamedMatrix: Source-target bipartite graph adjacency matrix
- X::NamedMatrix: Source-feature bipartite graph adjacency matrix
SimSpread.predict
— Function
predict(I::Tuple{T,T}, ytest::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}
Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).
Arguments
- I::Tuple{NamedMatrix,NamedMatrix}: Feature-source-target trilayered adjacency matrices
- ytest::NamedMatrix: Query-target bipartite adjacency matrix
- GPU::Bool: Use GPU acceleration for calculation (default = false)
- returnweights::Bool: Return the weighting matrix employed for prediction (default = false)
References
- Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
predict(A::T, ytrain::T; GPU::Bool=false, returnweights::Bool=false) where {T<:NamedMatrix}
Predict interactions between query and target nodes using the de novo network-based inference model proposed by Wu, et al (2016).
Arguments
- A::NamedMatrix: Feature-source-target trilayered adjacency matrix
- ytrain::NamedMatrix: Source-target bipartite adjacency matrix
- GPU::Bool: Use GPU acceleration for calculation (default = false)
- returnweights::Bool: Return the weighting matrix employed for prediction (default = false)
References
- Wu, et al (2016). SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in Bioinformatics, bbw012. https://doi.org/10.1093/bib/bbw012
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
SimSpread.save
— Function
save(filepath::String, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')
Store predictions as a table in the given file path.
Arguments
- filepath::String: Output file path
- yhat::NamedArray: Predicted source-target bipartite adjacency matrix
- y::NamedArray: Ground-truth source-target bipartite adjacency matrix
- delimiter::Char: Delimiter used to write table (default = '\t')
Extended help
Table format is:
fold, source, target, score, label
save(filepath::String, fidx::Int64, yhat::NamedMatrix, y::NamedMatrix; delimiter::Char='\t')
Store cross-validation predictions as a table in the given file path.
Arguments
- filepath::String: Output file path
- fidx::Int64: Numeric fold ID
- yhat::NamedArray: Predicted source-target bipartite adjacency matrix
- y::NamedArray: Ground-truth source-target bipartite adjacency matrix
- delimiter::Char: Delimiter used to write table (default = '\t')
Extended help
Table format is:
fold, source, target, score, label
Cross-validation
Base.split
— Function
Base.split(y::NamedArray, k::Int64; seed::Int64=1)
Split source nodes in y into k groups for cross-validation.
Arguments
- y::AbstractMatrix: Drug-target rectangular adjacency matrix.
- k::Int64: Number of groups to use in data splitting.
- seed::Int64: Seed used for data splitting.
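A seeded split of source nodes into k groups can be sketched as follows. This is an illustration of the general idea rather than the package's algorithm: the helper toy_split, the node labels, and the round-robin assignment of shuffled nodes to groups are all assumptions.

```julia
using Random

# Hypothetical helper: shuffle the source node labels with a seeded RNG
# and deal them into k groups in round-robin fashion.
function toy_split(sources::AbstractVector, k::Integer; seed::Integer=1)
    shuffled = shuffle(MersenneTwister(seed), sources)
    return [shuffled[i:k:end] for i in 1:k]
end

sources = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
folds = toy_split(sources, 3)

# Group sizes differ by at most one, and every source node lands in exactly one group.
println(length.(folds))
```

Fixing the seed makes the split reproducible, which matters when predictions from different runs are compared fold by fold.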
SimSpread.clean!
— Function
clean!(yhat::NamedArray, A::NamedArray, y::NamedArray)
Flag, in place, erroneous predictions from cross-validation splitting.
Arguments
- yhat::NamedArray: Predicted source-target bipartite adjacency matrix
- A::NamedArray: Initial-resource source-target adjacency matrix
- y::NamedArray: Ground-truth source-target bipartite adjacency matrix
Performance assessment
Several evaluation metrics are implemented in the package. They can be classified into three groups: (i) overall performance, (ii) early recognition, and (iii) binary prediction performance.
Overall performance
These are classical evaluation metrics that use the complete list of predictions to assess predictive performance.
SimSpread.AuPRC
— Function
AuPRC(y::AbstractVector{Bool}, yhat::AbstractVector)
Area under the Precision-Recall curve using the trapezoidal rule.
Arguments
- y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractArray: Prediction scores.
SimSpread.AuROC
— Function
AuROC(y::AbstractVector{Bool}, yhat::AbstractVector)
Area under the Receiver Operating Characteristic curve using the trapezoidal rule.
Arguments
- y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractArray: Prediction scores.
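To make the trapezoidal rule concrete, the sketch below computes the area under a ROC curve from scratch. It is not the package's AuROC: the helper toy_auroc is hypothetical, and it assumes that higher scores indicate the positive class, that both classes are present, and that there are no tied scores.

```julia
# Hypothetical helper: area under the ROC curve via the trapezoidal rule.
function toy_auroc(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real})
    order = sortperm(yhat, rev=true)   # best-scored predictions first
    labels = y[order]
    npos = count(labels)
    nneg = length(labels) - npos

    # ROC points after each prediction, starting from (0, 0).
    tpr = [0.0; cumsum(labels) ./ npos]
    fpr = [0.0; cumsum(.!labels) ./ nneg]

    # Sum the trapezoids between consecutive ROC points.
    return sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in 1:length(labels))
end

y = [true, false, true, false]
yhat = [0.9, 0.8, 0.7, 0.1]

toy_auroc(y, yhat)  # 0.75: three of the four positive-negative pairs are ranked correctly
```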
Early recognition performance
Given the roots of SimSpread (target prediction in drug discovery), we include evaluation metrics that assess the predictive performance of the best predictions obtained from a model.
In virtual screening, only the best predictions obtained from a model are selected for subsequent experimental validation. Therefore, understanding the predictive performance of a model for these predictions is essential to (1) make accurate predictions that will translate to biological activity and (2) understand the limitations of the model. The metrics discussed here are evaluated at a given cut-off rank and consider only the topmost results returned by the predictive method, so they describe the predictive performance of the model for the best predictions only.
SimSpread.recallatL
— Function
recallatL(y, yhat, L)
Get recall@L as proposed by Wu, et al (2017).
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction score.
- L::Integer: Length of the top-ranked list to consider when calculating the metric (default = 20).
recallatL(y, yhat, grouping, L)
Get mean recall@L per group as proposed by Wu, et al (2017).
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction score.
- grouping::AbstractVector: Group labels.
- L::Integer: Length of the top-ranked list to consider when calculating the metric (default = 20).
SimSpread.precisionatL
— Function
precisionatL(y, yhat, L::Integer=20)
Get precision@L as proposed by Wu, et al (2017).
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction score.
- L::Integer: Length of the top-ranked list to consider when calculating the metric (default = 20).
precisionatL(y, yhat, grouping, L)
Get mean precision@L per group as proposed by Wu, et al (2017).
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction score.
- grouping::AbstractVector: Group labels.
- L::Integer: Length of the top-ranked list to consider when calculating the metric (default = 20).
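The two top-L metrics of this section can be sketched with their common textbook definitions: precision@L as the fraction of the L best-scored predictions that are positives, and recall@L as the fraction of all positives found among them. The helpers below are hypothetical and may differ in detail from the package's functions and from the formulation of Wu, et al (2017).

```julia
# Hypothetical helper: number of positives among the L best-scored predictions.
function toy_top_hits(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}, L::Integer)
    order = sortperm(yhat, rev=true)
    top = order[1:min(L, length(order))]
    return count(y[top])
end

# Assumed definitions: hits divided by L, and hits divided by the number of positives.
toy_precision_at(y, yhat, L) = toy_top_hits(y, yhat, L) / L
toy_recall_at(y, yhat, L) = toy_top_hits(y, yhat, L) / count(y)

y = [true, false, true, false, true]
yhat = [0.9, 0.8, 0.7, 0.2, 0.1]

toy_precision_at(y, yhat, 2)  # 0.5: one of the top 2 predictions is a positive
toy_recall_at(y, yhat, 2)     # 1/3: one of the 3 positives is in the top 2
```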
SimSpread.BEDROC
— Function
BEDROC(y::AbstractVector{Bool}, yhat::AbstractVector; rev::Bool=true, α::AbstractFloat=20.0)
The Boltzmann-Enhanced Discrimination of the Receiver Operating Characteristic (BEDROC) score is a modification of the Receiver Operating Characteristic (ROC) score that accounts for early recognition.
The score takes a value in the interval [0, 1], indicating the degree to which the predictive model detects the positive class early.
Arguments
- y::AbstractArray: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractArray: Prediction scores.
- rev::Bool: True if high values of yhat correlate with the positive class (default = true).
- α::AbstractFloat: Early recognition parameter (default = 20.0).
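For intuition, BEDROC can be written as the robust initial enhancement (RIE) rescaled between its values for the worst and best possible rankings, following Truchon & Bayly (2007). The sketch below uses that form. The helper toy_bedroc is hypothetical and is not the package's implementation; it assumes that higher scores rank first and that both classes are present.

```julia
# Hypothetical helper: BEDROC as the RIE rescaled to the interval [0, 1].
function toy_bedroc(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}; α::Real=20.0)
    N = length(y)
    n = count(y)
    ranks = findall(y[sortperm(yhat, rev=true)])  # 1-based ranks of the positives
    Ra = n / N                                    # fraction of positives

    # RIE: exponentially weighted sum over the ranks of the positives, normalized.
    rie = sum(exp(-α * r / N) for r in ranks) / (n * (1 - exp(-α)) / (N * (exp(α / N) - 1)))

    # RIE values for the best and the worst possible ranking.
    rie_max = (1 - exp(-α * Ra)) / (Ra * (1 - exp(-α)))
    rie_min = (1 - exp(α * Ra)) / (Ra * (1 - exp(α)))

    return (rie - rie_min) / (rie_max - rie_min)
end

y = [true, true, false, false, false, false]
yhat = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

toy_bedroc(y, yhat)  # close to 1, since all positives are ranked first
```

Larger values of α concentrate the weight on the very top of the ranked list, so the same ranking can score differently under different α.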
References
- Truchon, J.-F., & Bayly, C. I. (2007). Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Journal of Chemical Information and Modeling, 47(2), 488–508. https://doi.org/10.1021/ci600426e
Binary prediction performance
A common practice in predictive modelling is to set a score or probability threshold for the predictions obtained from a model and to manually select (cherry-pick) predictions for validation. To evaluate predictive performance under this paradigm, we implement a series of metrics meant for binary classification, in which a link either exists or not based on a given threshold. From these metrics, statistical moments can be calculated to get a notion of how predictive performance varies as the decision boundary changes.
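The thresholding step can be sketched as follows: scores are turned into binary predictions, the four confusion-matrix counts are tallied, and metrics are derived from them. The helper toy_confusion and the rule that a link is predicted when its score is greater than or equal to the threshold are assumptions for illustration, not the package's behavior.

```julia
# Hypothetical helper: confusion-matrix counts at a given score threshold.
function toy_confusion(y::AbstractVector{Bool}, yhat::AbstractVector{<:Real}, threshold::Real)
    predicted = yhat .≥ threshold
    tp = count(predicted .& y)
    fp = count(predicted .& .!y)
    fn = count(.!predicted .& y)
    tn = count(.!predicted .& .!y)
    return (tn=tn, fp=fp, fn=fn, tp=tp)
end

y = [true, true, false, false, true]
yhat = [0.9, 0.4, 0.6, 0.1, 0.7]

c = toy_confusion(y, yhat, 0.65)

prec = c.tp / (c.tp + c.fp)           # fraction of predicted links that are real
rec = c.tp / (c.tp + c.fn)            # fraction of real links that are recovered
f1 = 2 * prec * rec / (prec + rec)    # harmonic mean of precision and recall
```

Repeating the calculation over a range of thresholds gives one value of each metric per decision boundary.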
SimSpread.f1score
— Function
f1score(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
The harmonic mean of precision and recall.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
SimSpread.mcc
— Function
mcc(a::T, b::T, ϵ::AbstractFloat = 0.0001) where {T<:Integer}
Matthews correlation coefficient, using a calculus approximation for the cases where FN+TN, FP+TN, TP+FN or TP+FP equals zero.
Arguments
- a::Integer: Value of position a in the confusion matrix
- b::Integer: Value of position b in the confusion matrix
- ϵ::AbstractFloat: Approximation coefficient (default = floatmin(Float64))
Extended help
The confusion matrix in a binary prediction is comprised of 4 distinct positions:
                | Predicted positive    Predicted negative
----------------+--------------------------------------------
Actual positive | True positives (TP)   False negatives (FN)
Actual negative | False positives (FP)  True negatives (TN)
If a row or column of the confusion matrix sums to zero, MCC is undefined. Therefore, to correctly use MCC with this approximation, arguments a and b are defined as follows:
- If the "Predicted positive" column is zero, a is TN and b is FN
- If the "Predicted negative" column is zero, a is TP and b is FP
- If the "Actual positive" row is zero, a is TN and b is FP
- If the "Actual negative" row is zero, a is TP and b is FN
Reference
- Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6.
mcc(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
Matthews correlation coefficient, a special case of the phi coefficient. Performance metric used to overcome class imbalance issues.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
Reference
- Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6.
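For reference, the standard MCC formula for the four confusion-matrix counts can be written as a short function. The helper toy_mcc is hypothetical and is only a sketch of the textbook definition: unlike the approximation described above, it does not handle the cases in which a row or column of the confusion matrix sums to zero.

```julia
# Hypothetical helper: textbook Matthews correlation coefficient.
# Returns NaN when a row or column of the confusion matrix sums to zero.
function toy_mcc(tn::Integer, fp::Integer, fn::Integer, tp::Integer)
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator
end

toy_mcc(2, 0, 1, 2)  # 4 / sqrt(2 * 3 * 2 * 3) = 2/3
toy_mcc(3, 0, 0, 2)  # 1.0 for a perfect prediction
```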
SimSpread.accuracy
— Function
accuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
The number of correct predictions divided by the total number of predictions.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
SimSpread.balancedaccuracy
— Function
balancedaccuracy(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
The arithmetic mean of sensitivity and specificity. Useful when dealing with imbalanced data.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
SimSpread.recall
— Function
recall(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
The fraction of positive samples correctly predicted as positive.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
SimSpread.precision
— Function
precision(tn::T, fp::T, fn::T, tp::T) where {T<:Integer}
The fraction of positive predictions that are correct.
Arguments
- tn::Integer: True negatives
- fp::Integer: False positives
- fn::Integer: False negatives
- tp::Integer: True positives
SimSpread.meanperformance
— Function
meanperformance(confusion::AbstractVector{ROCNums{Int64}}, metric::Function)
Get mean performance of a given metric over a set of confusion matrices.
Arguments
- confusion::AbstractVector{ROCNums{Int64}}: Confusion matrices.
- metric::Function: Performance metric function to use in evaluation.
meanperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)
Get mean performance of a given metric over a pair of label-prediction vectors.
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction score.
- metric::Function: Performance metric function to use in evaluation.
SimSpread.meanstdperformance
— Function
meanstdperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)
Get mean and standard deviation performance of a given metric over a set of confusion matrices.
Arguments
- confusion::AbstractVector{ROCNums{Real}}: Confusion matrix object from MLBase.
- metric::Function: Performance metric function to use in evaluation.
meanstdperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)
Get mean and standard deviation performance of a given metric over a pair of label-prediction vectors.
Arguments
- y::AbstractVector: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector: Prediction scores.
- metric::Function: Performance metric function to use in evaluation.
SimSpread.maxperformance
— Function
maxperformance(confusion::AbstractVector{ROCNums{Real}}, metric::Function)
Get maximum performance of a given metric over a set of confusion matrices.
Arguments
- confusion::AbstractVector{ROCNums{Real}}: Confusion matrices.
- metric::Function: Performance metric function to use in evaluation.
maxperformance(y::AbstractVector{Bool}, yhat::AbstractVector{Float64}, metric::Function)
Get maximum performance of a given metric over a pair of label-prediction vectors.
Arguments
- y::AbstractVector{Bool}: Binary class labels. 1 for positive class, 0 otherwise.
- yhat::AbstractVector{Float64}: Prediction score.
- metric::Function: Performance metric function to use in evaluation.
Other metrics
SimSpread.validity_ratio
— Function
validity_ratio(yhat::AbstractVector)
Ratio of valid predictions (score > 0) to all predictions. Allows checking whether predictive performance is reported for all predictions or only for a subset of them.
Arguments
- yhat::AbstractVector: Prediction scores.
Extended help
A limitation of SimSpread is that it cannot generate predictions for query nodes whose similarity to every source of the first network layer is below the similarity threshold α. In this case, the length of the feature vector is zero, resource spreading is not possible, and the predicted value is zero for all targets. Guardrails are therefore needed to correctly assess performance as a function of the threshold α.
This characteristic of SimSpread can be seen as an intrinsic notion of its application domain: for query nodes outside SimSpread's application domain, no target predictions are generated, instead of returning likely meaningless targets.
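The ratio itself is simple to compute. A minimal sketch, using the score > 0 criterion stated above, is given below. The helper name toy_validity_ratio and the example scores are hypothetical.

```julia
# Hypothetical helper: fraction of predictions with a score greater than zero.
toy_validity_ratio(yhat::AbstractVector{<:Real}) = count(>(0), yhat) / length(yhat)

# Two of the four predictions received a non-zero score.
yhat = [0.0, 0.3, 0.0, 0.8]

toy_validity_ratio(yhat)  # 0.5
```

A ratio well below 1 signals that any performance figure was computed on only part of the query set.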
References
- Vigil-Vásquez & Schüller (2022). De Novo Prediction of Drug Targets and Candidates by Chemical Similarity-Guided Network-Based Inference. International Journal of Molecular Sciences, 23(17), 9666. https://doi.org/10.3390/ijms23179666
Miscellaneous utilities
SimSpread.read_namedmatrix
— Function
read_namedmatrix(filepath::String, valuetype::Type = Float64)
Load a matrix with named indices as a NamedArray.
Arguments
- filepath::String: File path of matrix to load
- delimiter::Char: Delimiter character between values in matrix (default = ' ')
- valuetype::Type: Type of values contained in matrix (default = Float64)
SimSpread.k
— Function
k(G::AbstractMatrix)
Get node degrees from an adjacency matrix.
Arguments
- G::AbstractMatrix: Matrix to parse
SimSpread.getyamanishi
— Function
getyamanishi(db)
Get a tuple of matrices corresponding to the drug-target adjacency matrix and drug-drug similarity matrix for a given Yamanishi (2008) dataset.
Arguments
- db: Dataset ID (any of the following: "nr", "ic", "gpcr" or "e")
Example
julia> dt, dd = getyamanishi("nr");
julia> dt[1:5, 1:5]
5×5 Named Matrix{Float64}
A ╲ B  │  hsa190  hsa2099  hsa2100  hsa2101  hsa2103
───────┼────────────────────────────────────────────
D00040 │     0.0      0.0      0.0      0.0      0.0
D00066 │     0.0      1.0      0.0      0.0      0.0
D00067 │     0.0      1.0      0.0      0.0      0.0
D00075 │     0.0      0.0      0.0      0.0      0.0
D00088 │     0.0      0.0      0.0      0.0      0.0
julia> dd[1:5, 1:5]
5×5 Named Matrix{Float64}
A ╲ B  │    D00040    D00066    D00067    D00075    D00088
───────┼──────────────────────────────────────────────────
D00040 │       1.0  0.545455  0.297297   0.53125  0.459459
D00066 │  0.545455       1.0  0.387097  0.833333  0.689655
D00067 │  0.297297  0.387097       1.0  0.464286  0.352941
D00075 │   0.53125  0.833333  0.464286       1.0  0.678571
D00088 │  0.459459  0.689655  0.352941  0.678571       1.0
Extended help
The provided Yamanishi (2008) [1] datasets (ID) are:
- 'Nuclear Receptor' (nr)
- 'Ion Channels' (ic)
- 'GPCR' (gpcr)
- 'Enzyme' (e)
This function returns 2 distinct adjacency matrices:
- Binary drug-target interaction matrix, obtained from biological annotations
- Continuous drug-drug similarity matrix, obtained from SIMCOMP
References
- Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., & Kanehisa, M. (2008). Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240. https://doi.org/10.1093/bioinformatics/btn162