Biweight midcorrelation

What is biweight midcorrelation

In statistics, biweight midcorrelation (also called bicor) is a measure of similarity between samples. It is median-based rather than mean-based, so it is less sensitive to outliers and can serve as a robust alternative to other similarity metrics such as Pearson correlation or mutual information.

This correlation was suggested by the authors of Weighted Gene Coexpression Network Analysis (WGCNA).

Derivation

Let $\underline{x}, \underline{y}\in\R^{1\times m}$ be two samples of $m$ observations each, and define

$u_i={{x_i-med(x)}\over{9\,mad(x)}}$

$v_i={{y_i-med(y)}\over{9\,mad(y)}}$

where,

$med(x)$ : median of $x$

$mad(x)$ : median absolute deviation of $x$

About the constant factor 9…
Mosteller and Tukey suggest utilizing the MAD or interquartile range for preliminary analysis where moderate efficiency in diverse circumstances is satisfactory.
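
As a minimal sketch of the scaling step above, the following pure-Python snippet computes $u_i$ for a small sample. The function names `mad` and `scaled_deviations` and the parameter `c` are illustrative choices, not part of the source. Raising an error when the MAD is zero is also an assumption, since the text does not say how to handle that case.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation: median of |x_i - med(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def scaled_deviations(values, c=9.0):
    """Return u_i = (x_i - med(x)) / (c * mad(x)) for every element."""
    center = median(values)
    scale = c * mad(values)
    if scale == 0:
        # Assumption: the source does not define u_i when mad(x) is 0.
        raise ValueError("MAD is zero; u_i is undefined for this sample")
    return [(v - center) / scale for v in values]


x = [1.0, 2.0, 3.0, 4.0, 100.0]
u = scaled_deviations(x)
# med(x) = 3 and mad(x) = 1, so the last value (100) gets |u| > 1
# while the other four values stay well inside (-1, 1).
```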

Define weights

$w_i^{(x)}=(1-u_i^2)^2I(1-|u_i|)$

$w_i^{(y)}=(1-v_i^2)^2I(1-|v_i|)$

About $w_i$
The weight decreases as the deviation from the median grows.
The weight $w^{(x)}_i$ goes to 1 if $x_i$ is near $med(x)$.
The weight $w^{(x)}_i$ is 0 if $x_i$ differs from $med(x)$ by $9\,mad(x)$ or more.
Element $i$ is treated as an outlier when $w_i=0$.

where,

$I(x)=1$ if $x>0$ else $0$
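
The weight definition can be sketched as a small function of the scaled deviation. The name `biweight_weight` is illustrative. The input is assumed to be an already-computed $u_i$ or $v_i$.

```python
def biweight_weight(u):
    """w = (1 - u^2)^2 * I(1 - |u|), with I(t) = 1 if t > 0 else 0."""
    if 1 - abs(u) > 0:
        return (1 - u * u) ** 2
    return 0.0


# Weights for a point at the median, a moderate deviation,
# a point exactly at the cutoff, and a point beyond it.
weights = [biweight_weight(u) for u in (0.0, 0.5, 1.0, 1.5)]
# weights == [1.0, 0.5625, 0.0, 0.0]
```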

Then normalize the weighted, median-centered vectors so that each has unit Euclidean norm (the squared entries sum to 1):

$\tilde{x}_i={{(x_i-med(x))w_i^{(x)}}\over{\sqrt{\sum_{j=1}^m{[(x_j-med(x))w_j^{(x)}]^2}}}}$

$\tilde{y}_i={{(y_i-med(y))w_i^{(y)}}\over{\sqrt{\sum_{j=1}^m{[(y_j-med(y))w_j^{(y)}]^2}}}}$

→ $bicor(x, y)=\sum_{i=1}^m{\tilde{x}_i\tilde{y}_i}$
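
Putting the steps together, here is a hedged, standard-library-only sketch of the whole computation. The names `bicor` and `_weighted_centered` and the parameter `c` are illustrative. The error raised when the MAD is zero is an assumption, because the derivation above does not cover that case. This is not the WGCNA implementation.

```python
from math import sqrt
from statistics import median


def _weighted_centered(values, c=9.0):
    """Return (x_i - med(x)) * w_i for each element of one sample."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Assumption: the derivation does not define this case.
        raise ValueError("MAD is zero; bicor is undefined for this sample")
    out = []
    for v in values:
        u = (v - center) / (c * mad)
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        out.append((v - center) * w)
    return out


def bicor(x, y, c=9.0):
    """Biweight midcorrelation of two equal-length samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _weighted_centered(x, c)
    b = _weighted_centered(y, c)
    norm_a = sqrt(sum(t * t for t in a))
    norm_b = sqrt(sum(t * t for t in b))
    # Dividing by both norms is the same as summing x~_i * y~_i.
    return sum(s * t for s, t in zip(a, b)) / (norm_a * norm_b)


x = [1, 2, 3, 4, 5, 6, 7]
y_clean = [2, 4, 6, 8, 10, 12, 14]
y_outlier = [2, 4, 6, 8, 10, 12, -100]

r_clean = bicor(x, y_clean)      # exact linear relation, so close to 1
r_outlier = bicor(x, y_outlier)  # the -100 gets weight 0; r stays positive
```

In the second call the value -100 lies more than $9\,mad(y)$ from the median, so its weight is 0 and it contributes nothing to the sum.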

Because the biweight midcovariance estimator is both resistant and robust of efficiency, the biweight midcorrelation built on it is a robust statistic.

Half-thresholding method (BMHT)

The soft-thresholding method in WGCNA is good at preserving the continuity of connectivity, but it is not a good approach when there are many noise values among the connections. Ultimately, we want to examine two data sets, normal data and disease data, so that we can use the information in both.

Calculate the bicor values separately for the two data sets, one under the normal condition and one under the disease condition. After calculating all pairs in each data set, apply the threshold to each pair twice, once per condition. If neither of the two correlation coefficients is greater than the threshold, the pair is a non-informative correlation pair.
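
The filtering rule can be sketched as follows. The function name `informative_pairs`, the dictionary layout keyed by gene pair, the gene labels, and the threshold value 0.5 are all hypothetical. Comparing absolute values against the threshold is an assumption, since the text does not say whether strong negative correlations count.

```python
def informative_pairs(normal, disease, threshold):
    """Keep pairs whose correlation passes the threshold in at least one condition.

    `normal` and `disease` map a gene pair to its bicor value in that condition.
    """
    kept = {}
    for pair, r_normal in normal.items():
        r_disease = disease[pair]
        # Assumption: compare magnitudes, so strong negative values also pass.
        if max(abs(r_normal), abs(r_disease)) > threshold:
            kept[pair] = (r_normal, r_disease)
    return kept


# Hypothetical bicor values for three gene pairs in each condition.
normal = {("g1", "g2"): 0.82, ("g1", "g3"): 0.10, ("g2", "g3"): -0.05}
disease = {("g1", "g2"): 0.15, ("g1", "g3"): 0.20, ("g2", "g3"): -0.71}

kept = informative_pairs(normal, disease, threshold=0.5)
# ("g1", "g3") is dropped: neither value passes the threshold.
```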

After filtering out the non-informative correlation pairs, we can calculate the differential coexpression (dc) value between the two conditions using the following equation.

$dc_i(BMHT)= \sqrt{{{(x_{i1}-y_{i1})^2+(x_{i2}-y_{i2})^2+\cdots+(x_{in}-y_{in})^2}\over{n}}}$

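Here $x_{ij}$ and $y_{ij}$ are the correlation values between gene $i$ and its $j$-th informative coexpression gene under the two conditions, and $n$ is the number of such genes remaining after the non-informative correlation pairs are filtered out. This calculates the average coexpression change between a gene and its informative coexpression genes. Then we can use the dc values to rank genes.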
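
A short sketch of the dc formula and the ranking step follows. The function name `dc_bmht`, the gene labels, and the correlation values are made up for illustration. Each list is assumed to hold a gene's bicor values with its informative partners, in the same order for both conditions.

```python
from math import sqrt


def dc_bmht(x_i, y_i):
    """Root mean squared difference between one gene's correlations
    in the two conditions, over its n informative partners."""
    if len(x_i) != len(y_i) or not x_i:
        raise ValueError("need two non-empty lists of equal length")
    n = len(x_i)
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_i, y_i)) / n)


# Hypothetical per-gene correlations: (normal condition, disease condition).
genes = {
    "g1": ([0.9, 0.8, -0.1], [0.1, 0.8, 0.7]),
    "g2": ([0.6, 0.7], [0.6, 0.6]),
}

dc_values = {gene: dc_bmht(x_i, y_i) for gene, (x_i, y_i) in genes.items()}

# Rank genes from largest to smallest coexpression change.
ranked = sorted(dc_values, key=dc_values.get, reverse=True)
```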

References

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271563/

Biweight Midcovariance

https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/biwmidc.htm

Biweight Scale

https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/biwscale.htm