Outlier detection¶
Functions:

Compute the leverage score of the given factor matrix. 

Compute the leverage score and (normalised) slabwise SSE along one axis. 

Compute the (normalised) slabwise SSE along the given mode(s). 

Compute threshold for detecting possible outliers based on leverage. 

Compute rule-of-thumb threshold values for suspicious residuals. 
 tlviz.outliers.compute_leverage(factor_matrix)[source]¶
Compute the leverage score of the given factor matrix.
The leverage score is a measure of how much “influence” a slab (often representing a sample) has on a tensor factorisation model. To compute the leverage score for the different slabs, we only need the factor matrix for the selected mode. If the selected mode is represented by \(\mathbf{A}\), then the leverage score is defined as
\[h_i = \left[\mathbf{A} \left(\mathbf{A}^T \mathbf{A}\right)^{-1} \mathbf{A}^T\right]_{ii},\]that is, the \(i\)-th diagonal entry of the matrix \(\mathbf{A} \left(\mathbf{A}^T \mathbf{A}\right)^{-1} \mathbf{A}^T\). If a given slab, \(i\), has a high leverage score, then it likely has a strong influence on the model. A good overview of the leverage score is [VW81].
The leverage scores sum to the number of components in the model, and each score is always between 0 and 1. Moreover, if a data point has a leverage score equal to 1, then one component is solely “devoted” to modelling that data point, and removing the corresponding row from \(A\) will reduce the rank of \(A\) by 1 [BKW80].
A way of interpreting the leverage score is as a measure of how “similar” a data point is to the rest. If a row of \(A\) is equal to the average row of \(A\), then its leverage score would be equal to \(\frac{1}{I}\), where \(I\) is the number of rows of \(A\). Likewise, if a data point has a leverage of 1, then no other data points have a similar model representation. If a data point has a leverage of 0.5, then there is one other data point (in some weighted sense) with a similar model representation, and a leverage of 0.2 means that there are five other data points with a similar model representation [HR09].
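As a quick sanity check of these properties, the hat-matrix diagonal from the definition above can be computed directly with NumPy (a standalone sketch, not tlviz code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))  # 6 samples, 2 components

# Leverage: diagonal of the hat matrix A (A^T A)^{-1} A^T
hat = A @ np.linalg.inv(A.T @ A) @ A.T
leverage = np.diag(hat)

# Each score lies in [0, 1], and the scores sum to the number of components
assert np.all((leverage >= 0) & (leverage <= 1))
assert np.isclose(leverage.sum(), A.shape[1])
```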
If the factor matrix is a dataframe, then the output is also a dataframe with that index. Otherwise, the output is a NumPy array.
 Parameters:
 factor_matrixDataFrame or numpy array
The factor matrix whose leverage we compute
 Returns:
 leverageDataFrame or numpy array
The leverage scores, if the input is a dataframe, then the index is preserved.
Note
The leverage score is related to the Hotelling T2-statistic (or D-statistic), which is equal to a scaled version of the leverage computed from centered factor matrices.
Examples
In this example, we compute the leverage of a random factor matrix
>>> import numpy as np
>>> from tlviz.outliers import compute_leverage
>>> rng = np.random.default_rng(0)
>>> A = rng.standard_normal(size=(5, 2))
>>> leverage_scores = compute_leverage(A)
>>> for index, leverage in enumerate(leverage_scores):
...     print(f"Sample {index} has leverage score {leverage:.2f}")
Sample 0 has leverage score 0.04
Sample 1 has leverage score 0.23
Sample 2 has leverage score 0.50
Sample 3 has leverage score 0.59
Sample 4 has leverage score 0.64
 tlviz.outliers.compute_outlier_info(cp_tensor, true_tensor, normalise_sse=True, mode=0, axis=None)[source]¶
Compute the leverage score and (normalised) slabwise SSE along one axis.
These metrics are often plotted against each other to discover outliers.
 Parameters:
 cp_tensorCPTensor or tuple
TensorLy-style CPTensor object or tuple with weights as first argument and a tuple of components as second argument
 true_tensorxarray or numpy array
Dataset that cp_tensor is fitted against.
 normalise_ssebool
If True, the slabwise SSE is scaled so it sums to one.
 modeint
The mode to compute the outlier info across.
 axisint (optional)
Alias for mode. If this is set, then no value for mode can be given.
 Returns:
 DataFrame
Dataframe with two columns, “Leverage score” and “Slabwise SSE”.
See also
compute_leverage
More information about the leverage score is given in this docstring
compute_slabwise_sse
More information about the slabwise SSE is given in this docstring
get_leverage_outlier_threshold
Cutoff for selecting potential outliers based on the leverage
get_slabwise_sse_outlier_threshold
Cutoff for selecting potential outliers based on the slabwise SSE
 tlviz.outliers.compute_slabwise_sse(estimated, true, normalise=True, mode=0, axis=None)[source]¶
Compute the (normalised) slabwise SSE along the given mode(s).
For a tensor, \(\mathcal{X}\), and an estimated tensor \(\hat{\mathcal{X}}\), we compute the \(i\)th normalised slabwise residual as
\[r_i = \frac{\sum_{jk} \left(x_{ijk} - \hat{x}_{ijk}\right)^2} {\sum_{ijk} \left(x_{ijk} - \hat{x}_{ijk}\right)^2}.\]The residuals can measure how well our decomposition fits the different samples. If a sample, \(i\), has a high residual, then that indicates that the model is not able to describe its behaviour.
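For reference, the formula above can be reproduced directly in NumPy (a minimal standalone sketch, independent of tlviz):

```python
import numpy as np

rng = np.random.default_rng(0)
true = rng.standard_normal((4, 3, 2))
estimated = true + 0.1 * rng.standard_normal((4, 3, 2))

# Sum the squared residuals over every mode except mode 0
slab_sse = ((true - estimated) ** 2).sum(axis=(1, 2))

# Normalised variant: scale so the slabwise SSEs sum to one
normalised_sse = slab_sse / slab_sse.sum()
assert np.isclose(normalised_sse.sum(), 1.0)
```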
 Parameters:
 estimatedxarray or numpy array
Estimated dataset, if this is an xarray, then the output is too.
 truexarray or numpy array
True dataset, if this is an xarray, then the output is too.
 normalisebool
Whether the SSE should be scaled so the vector sums to one.
 modeint or iterable of ints
Mode (or modes) that the SSE is computed across (i.e. these are not the ones summed over). The output will still have these axes.
 axisint or iterable of ints (optional)
Alias for mode. If this is set, then no value for mode can be given
 Returns:
 slab_ssexarray or numpy array
The (normalised) slabwise SSE, if true tensor input is an xarray array, then the returned tensor is too.
 tlviz.outliers.get_leverage_outlier_threshold(leverage_scores, method='p-value', p_value=0.05)[source]¶
Compute threshold for detecting possible outliers based on leverage.
Huber’s heuristic for selecting outliers
In Robust Statistics, Huber [HR09] shows that if the leverage score, \(h_i\), of a sample is equal to \(1/r\) and we duplicate that sample, then its leverage score will become \(1/(1+r)\). We can therefore think of the reciprocal of the leverage score, \(1/h_i\), as the number of similar samples in the dataset. Following this logic, Huber recommends two thresholds for selecting outliers: 0.2 (which we name "huber lower") and 0.5 (which we name "huber higher").
Hoaglin and Welsch’s heuristic for selecting outliers
In [HW78], Hoaglin and Welsch state that \(2r/n\) is a good cutoff for selecting samples that may be outliers. This choice is elaborated in [BKW80] (page 17), where Belsley, Kuh, and Welsch also propose \(3r/n\) as a cutoff when \(r < 6\) and \(n - r > 12\). They defend these cutoffs by proving that if the factor matrices are normally distributed, then \((n - r)[h_i - (1/n)]/[(1 - h_i)(r - 1)]\) follows a Fisher distribution with \((r - 1)\) and \((n - r)\) degrees of freedom. While a factor matrix seldom follows a normal distribution, Belsley, Kuh, and Welsch argue that this can still be a good starting point for cutoff values for suspicious data points. Based on reasonable choices for \(n\) and \(r\), they arrive at the heuristics above.
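Huber's duplication argument above can be checked numerically: duplicating a sample with leverage \(h\) changes its leverage to \(h/(1+h)\), so a sample with \(h = 1/r\) ends up with leverage \(1/(1+r)\). A small sketch (the `leverage` helper is hypothetical, not tlviz code):

```python
import numpy as np

def leverage(M):
    # Diagonal of the hat matrix M (M^T M)^{-1} M^T
    return np.diag(M @ np.linalg.inv(M.T @ M) @ M.T)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))

h = leverage(A)[0]             # leverage of the first sample
A_dup = np.vstack([A, A[:1]])  # append a duplicate of the first sample
h_dup = leverage(A_dup)[0]

# Duplication maps leverage h to h / (1 + h)
assert np.isclose(h_dup, h / (1 + h))
```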
Leverage p-value
Another way to select outliers is also based on the findings of Belsley, Kuh, and Welsch. We can use the transformation into a Fisher-distributed variable (assuming that the factor elements are drawn from a normal distribution) to compute cutoff values based on a p-value. The elements of the factor matrices are seldom normally distributed, so this is also just a rule-of-thumb.
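Under this normality assumption, the cutoff follows by inverting the Fisher statistic above for \(h_i\). The following sketch shows that computation (my reconstruction, not necessarily tlviz's exact implementation), where `n` is the number of samples and `r` the number of components:

```python
from scipy.stats import f

def leverage_pvalue_cutoff(n, r, p_value=0.05):
    # Solve (n - r)[h - 1/n] / [(1 - h)(r - 1)] = F_crit for h, where
    # F_crit is the (1 - p_value) quantile of F(r - 1, n - r)
    f_crit = f.ppf(1 - p_value, r - 1, n - r)
    return (f_crit * (r - 1) + (n - r) / n) / ((n - r) + f_crit * (r - 1))

threshold = leverage_pvalue_cutoff(n=100, r=5, p_value=0.05)
assert 5 / 100 < threshold < 1  # above the average leverage r/n
```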
Note
Note also that, with bootstrap estimation, we have found that this p-value is only valid for a large number of components. For smaller numbers of components, the false positive rate will be higher than the specified p-value, even if the components follow a standard normal distribution (see example below).
Hotelling’s T2 statistic
Yet another way to estimate a p-value is via Hotelling’s T-squared statistic [Jac80] (see also [NM95]). The key is to notice that if the factor matrices are normally distributed with zero mean, then the leverage is equivalent to a scaled version of Hotelling’s T-squared statistic. This statistic is commonly used in PCA, where the data is often centered beforehand, which leads to components with zero mean (in the mode the data is centered across). Again, note that the elements of the factor matrices are seldom normally distributed, so this is also just a rule-of-thumb.
Note
Note also that, with bootstrap estimation, we have found that this p-value is not valid for large numbers of components. In that case, the false positive rate will be higher than the specified p-value, even if the components follow a standard normal distribution (see example below).
 Parameters:
 leverage_scoresnp.ndarray or pd.DataFrame
 method{“huber lower”, “huber higher”, “hw lower”, “hw higher”, “p-value”, “hotelling”}
 p_valuefloat (optional, default=0.05)
If method="p-value", then this is the p-value used for the cutoff.
 Returns:
 thresholdfloat
Threshold value, data points with a leverage score larger than the threshold are suspicious and may be outliers.
Examples
The leverage p-value is only accurate with many components: Here, we use Monte Carlo estimation to demonstrate that the p-value derived in [BKW80] is valid only for a large number of components.
We start by importing some utilities
>>> import numpy as np
>>> from scipy.stats import bootstrap
>>> from tlviz.outliers import compute_leverage, get_leverage_outlier_threshold
Here, we create a function that computes the false positive rate
>>> def compute_false_positive_rate(n, d, p_value):
...     X = np.random.standard_normal((n, d))
...     h = compute_leverage(X)
...     th = get_leverage_outlier_threshold(h, method="p-value", p_value=p_value)
...     return (h > th).mean()
>>> np.random.seed(0)
>>> n_samples = 2_000
>>> leverages = ([compute_false_positive_rate(10, 2, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(leverages, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.083, 0.089]
We see that the false positive rate is almost twice what we prescribe (0.05). However, if we increase the number of components, then the false positive rate improves
>>> leverages = ([compute_false_positive_rate(10, 9, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(leverages, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.049, 0.056]
This indicates that the false positive rate is most accurate when the number of components is equal to the number of samples minus one. We can increase the number of samples to assess this conjecture
>>> leverages = ([compute_false_positive_rate(100, 9, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(leverages, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.055, 0.056]
The increase in the false positive rate supports the conjecture that Belsley et al.’s method for computing the p-value is accurate only when the number of components is high. Still, it is important to remember that the original assumption (normally distributed components) is seldom satisfied, so this method is only a rule-of-thumb, though it can still be useful.
Hotelling’s T-squared statistic requires few components or many samples: Here, we use Monte Carlo estimation to demonstrate that the Hotelling T-squared statistic is only valid with few components or many samples.
>>> def compute_hotelling_false_positive_rate(n, d, p_value):
...     X = np.random.standard_normal((n, d))
...     h = compute_leverage(X)
...     th = get_leverage_outlier_threshold(h, method="hotelling", p_value=p_value)
...     return (h > th).mean()
We set the simulation parameters and the seed
>>> np.random.seed(0)
>>> n_samples = 2_000
>>> fprs = ([compute_hotelling_false_positive_rate(10, 2, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(fprs, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.052, 0.058]
However, if we increase the number of components, then the false positive rate becomes too large
>>> fprs = ([compute_hotelling_false_positive_rate(10, 5, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(fprs, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.078, 0.084]
But if we increase the number of samples, then the estimate is good again
>>> fprs = ([compute_hotelling_false_positive_rate(100, 5, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(fprs, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.049, 0.051]
 tlviz.outliers.get_slabwise_sse_outlier_threshold(slab_sse, method='p-value', p_value=0.05, ddof=1)[source]¶
Compute ruleofthumb threshold values for suspicious residuals.
One way to determine possible outliers is to examine how well the model describes the different data points. A standard way of measuring this is the slabwise sum of squared errors (slabwise SSE), which is the sum of squared errors for each data point.
There is, unfortunately, no guaranteed way to detect outliers automatically based on the residuals. However, if the noise is normally distributed, then the residuals follow a scaled chi-squared distribution. Specifically, we have that \(\text{SSE}_i \sim g\chi^2_h\), where \(g = \frac{\sigma^2}{2\mu}\), \(h = \frac{\mu}{g} = \frac{2\mu^2}{\sigma^2}\), \(\mu\) is the average slabwise SSE, and \(\sigma^2\) is the variance of the slabwise SSE [Box54].
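This moment-matching argument translates directly into code. The sketch below is my reconstruction under the stated chi-squared assumption, not necessarily identical to tlviz's implementation:

```python
import numpy as np
from scipy.stats import chi2

def sse_pvalue_cutoff(slab_sse, p_value=0.05, ddof=1):
    # Match the first two moments of SSE_i to g * chi2_h
    mu = slab_sse.mean()
    sigma_sq = slab_sse.var(ddof=ddof)
    g = sigma_sq / (2 * mu)
    h = 2 * mu**2 / sigma_sq
    return g * chi2.ppf(1 - p_value, h)

rng = np.random.default_rng(0)
slab_sse = rng.chisquare(df=20, size=500)
threshold = sse_pvalue_cutoff(slab_sse)
# Roughly 5% of slabs should exceed the cutoff for well-behaved noise
assert 0.01 < (slab_sse > threshold).mean() < 0.12
```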
Another rule-of-thumb follows from [NaesIFD02] (p. 187), which states that two times the standard deviation of the slabwise SSE can be used for determining data points with a suspiciously high residual.
 Parameters:
 slab_ssenp.ndarray or pd.DataFrame
 method{“two_sigma”, “p-value”}
 p_valuefloat (optional, default=0.05)
If method="p-value", then this is the p-value used for the cutoff.
 Returns:
 thresholdfloat
Threshold value, data points with a higher SSE than the threshold are suspicious and may be outliers.
Examples
Here, we see that the p-value gives a good cutoff if the noise is normally distributed
We start by importing the tools we’ll need
>>> import numpy as np
>>> from scipy.stats import bootstrap
>>> from tlviz.outliers import compute_slabwise_sse, get_slabwise_sse_outlier_threshold
>>> from tlviz.utils import cp_to_tensor
Then, we create a function to compute the false positive rate. This will be useful for our bootstrap estimate for the true false positive rate.
>>> def compute_false_positive_rate(shape, num_components, p_value):
...     A = np.random.standard_normal((shape[0], num_components))
...     B = np.random.standard_normal((shape[1], num_components))
...     C = np.random.standard_normal((shape[2], num_components))
...     X = cp_to_tensor((None, [A, B, C]))
...     noisy_X = X + np.random.standard_normal(shape) * 5
...     sse = compute_slabwise_sse(X, noisy_X)
...     th = get_slabwise_sse_outlier_threshold(sse, method="p-value", p_value=p_value)
...     return (sse > th).mean()
Finally, we estimate the 95% confidence interval of the false positive rate to validate that it is approximately correct.
>>> np.random.seed(0)
>>> n_samples = 2_000
>>> slab_sse = ([compute_false_positive_rate((20, 20, 10), 5, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(slab_sse, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.044, 0.047]
We see that the 95% confidence interval lies just below our goal of 0.05! Let’s also try with a false positive rate of 0.1
>>> slab_sse = ([compute_false_positive_rate((20, 20, 10), 5, 0.1) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(slab_sse, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.097, 0.100]
Here we see that the false positive rate is estimated sufficiently well. It may have been too low above because we did not have enough samples in the first mode (the mode we compute the false positive rate for). With only 20 samples, it is difficult to accurately estimate a false positive rate of 0.05. If we increase the number of samples to 200 instead, we see that the false positive rate is within our expected bounds.
>>> slab_sse = ([compute_false_positive_rate((200, 20, 10), 5, 0.05) for _ in range(n_samples)],)
>>> fpr_low, fpr_high = bootstrap(slab_sse, np.mean).confidence_interval
>>> print(f"95% confidence interval for the false positive rate: [{fpr_low:.3f}, {fpr_high:.3f}]")
95% confidence interval for the false positive rate: [0.049, 0.050]