{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Outlier detection with PARAFAC\n\nThere are two metrics that are commonly used for detecting outliers in PARAFAC models: *leverage* and *residuals*.\n\n## Residuals\n\nThe residuals measure how well the PARAFAC model represents the data. For detecting outliers, it is common\nto compute the sum of squared error for each sample. Since each sample corresponds to a slab in the data tensor,\nwe use the name *slabwise SSE* in TLViz. For a third-order tensor, $\\mathcal{X}$, and its PARAFAC\nestimate, $\\hat{\\mathcal{X}}$, we compute the $i$-th residual in the first mode as\n\n\\begin{align}r_i = \\sum_{jk} \\left(x_{ijk} - \\hat{x}_{ijk}\\right)^2.\\end{align}\n\nThese residuals measure how well our decomposition fits the different samples. If a sample, $i$, has a high\nresidual, then that indicates that our PARAFAC model cannot describe its behaviour.\n\n## Leverage\nThe leverage score measures how \"influential\" the different samples are for the PARAFAC model. There\nare several interpretations of the leverage score :cite:p:`velleman1981efficient`, one of them is the\nnumber of components we devote to a given data point. To compute the leverage score for the different samples,\nwe only need the factor matrix for the sample mode. If the sample mode is represented by $\\mathbf{A}$,\nthen the leverage score is defined as\n\n\\begin{align}h_i = \\left[\\mathbf{A} \\left(\\mathbf{A}^T \\mathbf{A}\\right)^{-1} \\mathbf{A}^T\\right]_{ii},\\end{align}\n\nthat is, the $i$-th diagonal entry of the matrix\n$\\mathbf{A} \\left(\\mathbf{A}^T \\mathbf{A}\\right)^{-1} \\mathbf{A}^T$.\n\n## Examples with code\nBelow, we show some examples with simulated data to illustrate how we can use the leverage and residuals to\nfind and remove the outliers in a dataset.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom tensorly.decomposition import parafac\n\nimport tlviz"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Utility for fitting PARAFAC models and sampling CP tensors\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def fit_parafac(X, num_components, num_inits=5):\n    model_candidates = [\n        parafac(\n            X.data,\n            num_components,\n            n_iter_max=1000,\n            tol=1e-8,\n            init=\"random\",\n            orthogonalise=True,\n            linesearch=True,\n            random_state=i,\n        )\n        for i in range(num_inits)\n    ]\n    model = tlviz.multimodel_evaluation.get_model_with_lowest_error(model_candidates, X)\n    return tlviz.postprocessing.postprocess(model, X)\n\n\ndef index_cp_tensor(cp_tensor, indices, mode):\n    weights, factors = cp_tensor\n    indices = set(indices)\n    new_factors = []\n    for i, fm in enumerate(factors):\n        if i == mode:\n            new_factors.append(fm[fm.index.isin(indices)])\n        else:\n            new_factors.append(fm)\n\n    return weights, new_factors"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Generate simulated dataset\n\nWe start by generating a random CP Tensor where the first mode represents the sample mode,\nand the other two modes represent some signal we have measured.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "I, J, K = 20, 30, 40\nrank = 5\noffset = np.arange(1, rank + 1)[np.newaxis]\n\nrng = np.random.default_rng(0)\nA = pd.DataFrame(rng.standard_normal((I, rank)))\nA.index.name = \"Mode 0\"\nB = pd.DataFrame(np.sin(offset * np.linspace(0, 2 * np.pi, J)[:, np.newaxis] / rank + offset))\nB.index.name = \"Mode 1\"\nC = pd.DataFrame(np.cos(offset * np.linspace(0, 2 * np.pi, K)[:, np.newaxis] / rank + offset))\nC.index.name = \"Mode 2\"\n\n\ntrue_model = (None, [A, B, C])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Plotting the components\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tlviz.visualisation.components_plot(true_model)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Adding artificial noise\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X = tlviz.utils.cp_to_tensor(true_model)\nnoise = rng.standard_normal(X.shape)\nX += np.linalg.norm(X.data) * 0.1 * noise / np.linalg.norm(noise)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Adding artificial outliers\nWe need three outliers to find; let's make three of the first four samples into outliers.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X[0] = np.random.standard_normal(X[0].shape)\nX[2] *= 10\nX[3] = X[3] * 3 + 5"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Fitting a PARAFAC model to this data and plotting the components\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "first_attempt = fit_parafac(X, rank, num_inits=5)\ntlviz.visualisation.component_comparison_plot({\"True\": true_model, \"First attempt\": first_attempt})\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that some of the components are wrong, and to quantify this, we can look at the factor match score:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fms = tlviz.factor_tools.factor_match_score(true_model, first_attempt)\nprint(f\"The FMS is {fms:.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we can create a scatter plot where we plot the leverage and residuals for each sample.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tlviz.visualisation.outlier_plot(first_attempt, X)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From this plot, we see that there are three potential outliers.\nFirst, sample 2 has a high leverage and a high residual.\nThe leverage is close to one, so we almost devote a whole component to modelling it.\nAnd still, it is poorly described by the model.\nWe also see that sample 3 has a high leverage and low error,\nand sample 0 has a low leverage and a high error.\nLet's start by removing the most problematic sample: Sample 2.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "samples_to_remove = {2}\nselected_samples = [i for i in range(I) if i not in samples_to_remove]\nsampled_X = X.loc[{\"Mode 0\": selected_samples}]\nsecond_attempt = fit_parafac(sampled_X, rank, num_inits=5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we want to compare the true and estimated components.\nHowever, since we have removed a slab from the dataset,\nwe also need to remove the corresponding row from $\\mathbf{A}$\nbefore we can compare the components.\nOtherwise, we could not align the components.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sampled_true_cp = index_cp_tensor(true_model, selected_samples, mode=0)\ntlviz.visualisation.component_comparison_plot({\"True\": sampled_true_cp, \"Second attempt\": second_attempt})\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Still, we have not successfully recovered the correct components,\nwhich is apparent from the factor match score.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fms = tlviz.factor_tools.factor_match_score(sampled_true_cp, second_attempt)\nprint(f\"The FMS is {fms:.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also create the outlier plot again to see which samples are suspicious.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tlviz.visualisation.outlier_plot(second_attempt, sampled_X)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that sample 0 and 3 still show the same behaviour as before. Sample 0 has a high error, but does not affect\nthe model much, while sample 3 has a low error, but does affect the model. Let us therefore remove sample 3 now.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "samples_to_remove = {2, 3}\nselected_samples = [i for i in range(I) if i not in samples_to_remove]\nsampled_X = X.loc[{\"Mode 0\": selected_samples}]\nthird_attempt = fit_parafac(sampled_X, rank, num_inits=5)\n\nsampled_true_cp = index_cp_tensor(true_model, selected_samples, mode=0)\ntlviz.visualisation.component_comparison_plot({\"True\": sampled_true_cp, \"Third attempt\": third_attempt})\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here, we see that we finally uncovered the true components. This is also apparent from the FMS\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fms = tlviz.factor_tools.factor_match_score(sampled_true_cp, third_attempt)\nprint(f\"The FMS is {fms:.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>Here, we remove the outlier with a very high leverage.\n    However, sometimes, it is better to include the samples with a high leverage\n    and low residual and increase the number of components.\n    Then the additional components can model the samples with the high leverage scores.\n    In this example, we could have increased the number of components by one\n    and recovered the correct components without removing this sample.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also create the outlier plot again to see if there still are any suspicious samples\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tlviz.visualisation.outlier_plot(third_attempt, sampled_X)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's see what happens when we also remove the final suspicious sample\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "samples_to_remove = {0, 2, 3}\nselected_samples = [i for i in range(I) if i not in samples_to_remove]\nsampled_X = X.loc[{\"Mode 0\": selected_samples}]\nfourth_attempt = fit_parafac(sampled_X, rank, num_inits=5)\n\nsampled_true_cp = index_cp_tensor(true_model, selected_samples, mode=0)\ntlviz.visualisation.component_comparison_plot({\"True\": sampled_true_cp, \"Fourth attempt\": fourth_attempt})\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The components didn't change much.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We didn't need to remove the final suspicious outlier in this case,\nand after removing it, the components didn't change much.\nThis behaviour makes sense because the sample had a very low leverage,\nso it didn't affect the model very much.\nSo, in cases where a data point has high residual but not high leverage,\nit might be sensible not to remove it.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Rules-of-thumb for selecting the samples that may be outliers\n\nUnfortunately, there are no hard rules for selecting which samples to remove as outliers.\nInstead, we generally have to consider each suspected outlier individually.\nHowever, there are some useful rules-of-thumb discussed in the literature.\nThese methods are detailed in :meth:`tlviz.outliers.get_slabwise_sse_outlier_threshold`\nand :meth:`tlviz.outliers.get_slabwise_sse_outlier_threshold`.\nThere are two ways to select outliers based on the leverage score.\nIf we interpret the leverage score as the number of components devoted to each sample,\nwhich Huber does in :cite:p:`huber2009robust`,\nthen it makes sense to use a constant cut-off equal to 0.5\n(we devote more than half a component to model a single sample) or 0.2\n(we devote 20% of a component for a single sample).\nAlternatively, we can assume that the components are normally distributed\nand use probability theory to estimate the cut-off values.\nThis approach is what Hoaglin and Welch use in :cite:p:`belsley1980regression`\nto obtain their easy-to-compute thresholds.\nFinally, we can compute the thresholds using p-values by assuming that the data\neither comes from a normal distribution :cite:p:`belsley1980regression`,\nor from a zero-mean normal distribution  :cite:p:`nomikos1995multivariate`\n(Hotelling T2 p-values).\nFor selecting the outliers based on the residuals,\nwe can either assume that the noise is normally distributed\nand compute the cut-off values with p-values via a chi-squared distribution\n:cite:p:`nomikos1995multivariate`,\nor we can set the cut-off equal to twice the standard deviation of the SSE :cite:p:`naes2002user`.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Visualising the rules-of-thumb on our first attempted PARAFAC model\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tlviz.visualisation.outlier_plot(\n    first_attempt,\n    X,\n    leverage_rules_of_thumb=[\"p-value\", \"hotelling\", \"huber higher\", \"huber lower\", \"hw lower\", \"hw higher\"],\n    residual_rules_of_thumb=[\"p-value\", \"two sigma\"],\n)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that in this case,\nall thresholds except ``\"huber lower\"`` successfully detected the outliers.\nHowever, with real data, it is usually not this clear.\nDifferent cut-off values can give different outliers.\nTherefore, it is crucial to look at the data and evaluate which data points\nto remove and not on a case-by-case basis.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.16"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}