{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# Outlier detection with PARAFAC\n\nThere are two metrics that are commonly used for detecting outliers in PARAFAC models: *leverage* and *residuals*.\n\n## Residuals\n\nThe residuals measure how well the PARAFAC model represents the data. For detecting outliers, it is common\nto compute the sum of squared error for each sample. Since each sample corresponds to a slab in the data tensor,\nwe use the name *slabwise SSE* in TLViz. For a third-order tensor, $\\mathcal{X}$, and its PARAFAC\nestimate, $\\hat{\\mathcal{X}}$, we compute the $i$-th residual in the first mode as\n\n\\begin{align}r_i = \\sum_{jk} \\left(x_{ijk} - \\hat{x}_{ijk}\\right)^2.\\end{align}\n\nThese residuals measure how well our decomposition fits the different samples. If a sample, $i$, has a high\nresidual, then that indicates that our PARAFAC model cannot describe its behaviour.\n\n## Leverage\nThe leverage score measures how \"influential\" the different samples are for the PARAFAC model. There\nare several interpretations of the leverage score :cite:p:`velleman1981efficient`, one of them is the\nnumber of components we devote to a given data point. To compute the leverage score for the different samples,\nwe only need the factor matrix for the sample mode. If the sample mode is represented by $\\mathbf{A}$,\nthen the leverage score is defined as\n\n\\begin{align}h_i = \\left[\\mathbf{A} \\left(\\mathbf{A}^T \\mathbf{A}\\right)^{-1} \\mathbf{A}^T\\right]_{ii},\\end{align}\n\nthat is, the $i$-th diagonal entry of the matrix\n$\\mathbf{A} \\left(\\mathbf{A}^T \\mathbf{A}\\right)^{-1} \\mathbf{A}^T$.\n\n## Examples with code\nBelow, we show some examples with simulated data to illustrate how we can use the leverage and residuals to\nfind and remove the outliers in a dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom tensorly.decomposition import parafac\n\nimport tlviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Utility for fitting PARAFAC models and sampling CP tensors\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def fit_parafac(X, num_components, num_inits=5):\n model_candidates = [\n parafac(\n X.data,\n num_components,\n n_iter_max=1000,\n tol=1e-8,\n init=\"random\",\n orthogonalise=True,\n linesearch=True,\n random_state=i,\n )\n for i in range(num_inits)\n ]\n model = tlviz.multimodel_evaluation.get_model_with_lowest_error(model_candidates, X)\n return tlviz.postprocessing.postprocess(model, X)\n\n\ndef index_cp_tensor(cp_tensor, indices, mode):\n weights, factors = cp_tensor\n indices = set(indices)\n new_factors = []\n for i, fm in enumerate(factors):\n if i == mode:\n new_factors.append(fm[fm.index.isin(indices)])\n else:\n new_factors.append(fm)\n\n return weights, new_factors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate simulated dataset\n\nWe start by generating a random CP Tensor where the first mode represents the sample mode,\nand the other two modes represent some signal we have measured.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "I, J, K = 20, 30, 40\nrank = 5\noffset = np.arange(1, rank + 1)[np.newaxis]\n\nrng = np.random.default_rng(0)\nA = pd.DataFrame(rng.standard_normal((I, rank)))\nA.index.name = \"Mode 0\"\nB = pd.DataFrame(np.sin(offset * np.linspace(0, 2 * np.pi, J)[:, np.newaxis] / rank + offset))\nB.index.name = \"Mode 1\"\nC = pd.DataFrame(np.cos(offset * np.linspace(0, 2 * np.pi, K)[:, np.newaxis] / rank + offset))\nC.index.name = \"Mode 2\"\n\n\ntrue_model = (None, [A, B, C])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the components\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tlviz.visualisation.components_plot(true_model)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding artificial noise\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = tlviz.utils.cp_to_tensor(true_model)\nnoise = rng.standard_normal(X.shape)\nX += np.linalg.norm(X.data) * 0.1 * noise / np.linalg.norm(noise)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding artificial outliers\nWe need three outliers to find; let's make three of the first four samples into outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X[0] = np.random.standard_normal(X[0].shape)\nX[2] *= 10\nX[3] = X[3] * 3 + 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting a PARAFAC model to this data and plotting the components\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "first_attempt = fit_parafac(X, rank, num_inits=5)\ntlviz.visualisation.component_comparison_plot({\"True\": true_model, \"First attempt\": first_attempt})\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that some of the components are wrong, and to quantify this, we can look at the factor match score:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fms = tlviz.factor_tools.factor_match_score(true_model, first_attempt)\nprint(f\"The FMS is {fms:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can create a scatter plot where we plot the leverage and residuals for each sample.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tlviz.visualisation.outlier_plot(first_attempt, X)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this plot, we see that there are three potential outliers.\nFirst, sample 2 has a high leverage and a high residual.\nThe leverage is close to one, so we almost devote a whole component to modelling it.\nAnd still, it is poorly described by the model.\nWe also see that sample 3 has a high leverage and low error,\nand sample 0 has a low leverage and a high error.\nLet's start by removing the most problematic sample: Sample 2.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "samples_to_remove = {2}\nselected_samples = [i for i in range(I) if i not in samples_to_remove]\nsampled_X = X.loc[{\"Mode 0\": selected_samples}]\nsecond_attempt = fit_parafac(sampled_X, rank, num_inits=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to compare the true and estimated components.\nHowever, since we have removed a slab from the dataset,\nwe also need to remove the corresponding row from $\\mathbf{A}$\nbefore we can compare the components.\nOtherwise, we could not align the components.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sampled_true_cp = index_cp_tensor(true_model, selected_samples, mode=0)\ntlviz.visualisation.component_comparison_plot({\"True\": sampled_true_cp, \"Second attempt\": second_attempt})\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still, we have not successfully recovered the correct components,\nwhich is apparent from the factor match score.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fms = tlviz.factor_tools.factor_match_score(sampled_true_cp, second_attempt)\nprint(f\"The FMS is {fms:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create the outlier plot again to see which samples are suspicious.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tlviz.visualisation.outlier_plot(second_attempt, sampled_X)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that sample 0 and 3 still show the same behaviour as before. Sample 0 has a high error, but does not affect\nthe model much, while sample 3 has a low error, but does affect the model. Let us therefore remove sample 3 now.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "samples_to_remove = {2, 3}\nselected_samples = [i for i in range(I) if i not in samples_to_remove]\nsampled_X = X.loc[{\"Mode 0\": selected_samples}]\nthird_attempt = fit_parafac(sampled_X, rank, num_inits=5)\n\nsampled_true_cp = index_cp_tensor(true_model, selected_samples, mode=0)\ntlviz.visualisation.component_comparison_plot({\"True\": sampled_true_cp, \"Third attempt\": third_attempt})\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we see that we finally uncovered the true components. This is also apparent from the FMS\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fms = tlviz.factor_tools.factor_match_score(sampled_true_cp, third_attempt)\nprint(f\"The FMS is {fms:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Here, we remove the outlier with a very high leverage.\n However, sometimes, it is better to include the samples with a high leverage\n and low residual and increase the number of components.\n Then the additional components can model the samples with the high leverage scores.\n In this example, we could have increased the number of components by one\n and recovered the correct components without removing this sample.