Tools: Time Series Analysis 53/73753/1
authorSridhar K. N. Rao <srao@linuxfoundation.org>
Mon, 2 Jan 2023 05:34:44 +0000 (11:04 +0530)
committerSridhar K. N. Rao <srao@linuxfoundation.org>
Mon, 2 Jan 2023 05:40:35 +0000 (11:10 +0530)
This patch adds time series analysis

Signed-off-by: Sridhar K. N. Rao <srao@linuxfoundation.org>
Change-Id: I162ab447167aa21c19b9056883748a42328db43f

tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb [new file with mode: 0644]

diff --git a/tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb b/tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb
new file mode 100644 (file)
index 0000000..81d960b
--- /dev/null
@@ -0,0 +1,825 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "10f11159",
+   "metadata": {},
+   "source": [
+    "Contributors: Sridhar K. N. Rao\n",
+    "\n",
+    "Copyright 2022-23 The Linux Foundation\n",
+    "\n",
+    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
+    "\n",
+    "http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02ef34c2",
+   "metadata": {},
+   "source": [
+    "# Time-Series Analysis\n",
+    "\n",
+    "This Notebook includes time-series analysis of:\n",
+    "\n",
+    "1. CPU\n",
+    "2. Memory\n",
+    "3. System Load\n",
+    "\n",
+    "The analysis includes both individual and comparative analysis "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f524f5e",
+   "metadata": {},
+   "source": [
+    "## Data Preparation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "88b1a616",
+   "metadata": {},
+   "source": [
+    "### CPU"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f4b9f76",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "\n",
+    "# If there are separate cpu-<n> files for each core, then we will have combine them.\n",
+    "# Here combining is done by taking the average value.\n",
+    "\n",
+    "paths_one = []\n",
+    "for path, currentDirectory, files in os.walk(\"/data/pod18-node4/\"):\n",
+    "    for file in files:\n",
+    "        if file.startswith(\"percent-idle\"):\n",
+    "            #print(file)\n",
+    "            paths_one.append(os.path.join(path, file))\n",
+    "\n",
+    "dfs_one = (pd.read_csv(f, index_col=False) for f in paths_one)\n",
+    "data_one = pd.concat(dfs_one).groupby(level=0).mean()\n",
+    "\n",
+    "# if the data is already in a single file\n",
+    "data_two = pd.read_csv('/data/node5/percent-idle.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc4f81f7",
+   "metadata": {},
+   "source": [
+    "### Memory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b89e54f2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "data_one = pd.read_csv('/data/pod18-node4/memory/memory-used-2022-06-19', index_col=False)\n",
+    "data_two = pd.read_csv('/data/pod18-node5/memory/memory-used.csv', index_col=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be90260a",
+   "metadata": {},
+   "source": [
+    "### System Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "334a1799",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "data_one = pd.read_csv('/data/pod18-node4/load/load-2022-06-19', index_col=False)\n",
+    "data_two = pd.read_csv('/data/pod18-node5/load/load.csv', index_col=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9fc3fac7",
+   "metadata": {},
+   "source": [
+    "### Convert Epoch, Set Index make the size of two dataset equal"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "775d6e5a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_one['epoch'] = pd.to_datetime(data_one['epoch'],unit='s')\n",
+    "data_one.set_index('epoch', inplace=True)\n",
+    "data_two['epoch'] = pd.to_datetime(data_two['epoch'],unit='s')\n",
+    "data_two.set_index('epoch', inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ad5fd126",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if data_one.shape[0] > data_two.shape[0]:\n",
+    "    diff = data_one.shape[0] - data_two.shape[0]\n",
+    "    data_one = data_one.drop(data_one.index[(data_one.shape[0] - diff):])\n",
+    "else:\n",
+    "    diff2 = data_two.shape[0] - data_one.shape[0]\n",
+    "    data_two = data_two.drop(data_two.index[(data_two.shape[0] - diff2):])\n",
+    "\n",
+    "print(data_one.shape[0])\n",
+    "print(data_two.shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8857cb43",
+   "metadata": {},
+   "source": [
+    "## Autocorrelation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "03624971",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pandas.plotting import autocorrelation_plot\n",
+    "from statsmodels.tsa.arima.model import ARIMA"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbb2c76b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "autocorrelation_plot(data_one)\n",
+    "pyplot.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "99fb7a81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "autocorrelation_plot(data_two)\n",
+    "pyplot.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fadd1df4",
+   "metadata": {},
+   "source": [
+    "## ARIMA"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "95a925f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c1_copy = data_one.copy(deep=True)\n",
+    "c2_copy = data_two.copy(deep=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7b197a9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c1_copy.index = c1_copy.index.to_period('s')\n",
+    "c1_copy.index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1eed816",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#concat_two.index = concat_two.index.to_timestamp(freq='s')\n",
+    "c2_copy.index = c2_copy.index.to_period('s')\n",
+    "c2_copy.index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "95124eb2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = ARIMA(c2_copy, order=(1,1,0))\n",
+    "model_fit = model.fit()\n",
+    "print(model_fit.summary())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd8314cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = ARIMA(c1_copy, order=(1,1,0))\n",
+    "model_fit = model.fit()\n",
+    "print(model_fit.summary())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b5771d3",
+   "metadata": {},
+   "source": [
+    "## Histograms"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ef9daa03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_two.hist(column=\"value\", bins=20)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c4d30477",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_one.hist(column=\"value\", bins=20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b15015e",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb0adbc7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_one.hist(column=\"longterm\", bins=20)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e19191b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_two.hist(column=\"longterm\", bins=20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "700d9d01",
+   "metadata": {},
+   "source": [
+    "# Compute Probabilities"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2b906c19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import seaborn as sns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "740fa874",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_one['value'] = data_one['value'].round(decimals=0)\n",
+    "probabilities_one = data_one['value'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4fecd994",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_two['value'] = data_two['value'].round(decimals=0)\n",
+    "probabilities_two = data_two['value'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "916cb16d",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9ab7583",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_one['longterm'] = data_one['longterm'].round(decimals=0)\n",
+    "probabilities_one = data_one['longterm'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dfe0f350",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_two['longterm'] = data_two['longterm'].round(decimals=0)\n",
+    "probabilities_two = data_two['longterm'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4887cdcc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.barplot(x=probabilities_one.index, y=probabilities_one.values, color='blue')\n",
+    "sns.barplot(x=probabilities_two.index, y=probabilities_two.values, color='blue')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4fcfe0f0",
+   "metadata": {},
+   "source": [
+    "# Comparative Analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae4a8fe7",
+   "metadata": {},
+   "source": [
+    "## Dynamic Time Warping"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7765d1c",
+   "metadata": {},
+   "source": [
+    "## Interpretation\n",
+    "Dynamic time warping is a seminal time series comparison technique.The objective of time series comparison methods is to produce a distance metric between two input time series. The similarity or dissimilarity of two-time series is typically calculated by converting the data into vectors and calculating the Euclidean distance between those points in vector space."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2a9001d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np \n",
+    "from scipy.spatial.distance import euclidean\n",
+    "from fastdtw import fastdtw\n",
+    "x = data_one.to_numpy()\n",
+    "y = data_two.to_numpy()\n",
+    "distance, path = fastdtw(x,y, dist=euclidean)\n",
+    "print(distance)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2beb0134",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.spatial.distance import chebyshev, cityblock\n",
+    "values_one = data_one[['value']].to_numpy()\n",
+    "values_two = data_two[['value']].to_numpy()\n",
+    "dist, path = fastdtw(values_one, values_two, dist=cityblock)\n",
+    "print(dist)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91943e69",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e79422c2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.spatial.distance import chebyshev, cityblock\n",
+    "values_one = data_one[['longterm']].to_numpy()\n",
+    "values_two = data_two[['longterm']].to_numpy()\n",
+    "dist, path = fastdtw(values_one, values_two, dist=cityblock)\n",
+    "print(dist)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1a6cccb1",
+   "metadata": {},
+   "source": [
+    "# Wasserstein Distance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2fb0612c",
+   "metadata": {},
+   "source": [
+    "## Interpretation\n",
+    "Wasserstein distance provide a meaningful and smooth representation of the distance between distributions. They measure the minimal effort required to reconfigure the probability mass of one distribution in order to recover the other distribution.\n",
+    "Expectation: Less than 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5e9c598",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.stats import wasserstein_distance\n",
+    "wd = wasserstein_distance (data_one['value'], data_two['value'])\n",
+    "wd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9991c19",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1270dbc7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.stats import wasserstein_distance\n",
+    "wd = wasserstein_distance (data_one['longterm'], data_two['longterm'])\n",
+    "wd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2fff2b3",
+   "metadata": {},
+   "source": [
+    "# Maximum Mean Discrepancy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b533aed",
+   "metadata": {},
+   "source": [
+    "## Reference\n",
+    "https://www.kaggle.com/code/onurtunali/maximum-mean-discrepancy/notebook"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2539bbc4",
+   "metadata": {},
+   "source": [
+    "## Interpretation\n",
+    "MMD is defined by the idea of representing distances between distributions as distances between mean embeddings of features. Two distributions are similar if their moments are similar. By applying a kernel, we can transform the variable such that all moments (first, second, third etc.) are computed. In the latent space we can compute the difference between the moments and average it. This gives a measure of the similarity/dissimilarity between the datasets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b58e4b2c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "def MMD(x, y, kernel):\n",
+    "    \"\"\"Emprical maximum mean discrepancy. The lower the result\n",
+    "       the more evidence that distributions are the same.\n",
+    "\n",
+    "    Args:\n",
+    "        x: first sample, distribution P\n",
+    "        y: second sample, distribution Q\n",
+    "        kernel: kernel type such as \"multiscale\" or \"rbf\"\n",
+    "    \"\"\"\n",
+    "    xx, yy, zz = torch.mm(x, x.t()), torch.mm(y, y.t()), torch.mm(x, y.t())\n",
+    "    rx = (xx.diag().unsqueeze(0).expand_as(xx))\n",
+    "    ry = (yy.diag().unsqueeze(0).expand_as(yy))\n",
+    "    \n",
+    "    dxx = rx.t() + rx - 2. * xx # Used for A in (1)\n",
+    "    dyy = ry.t() + ry - 2. * yy # Used for B in (1)\n",
+    "    dxy = rx.t() + ry - 2. * zz # Used for C in (1)\n",
+    "    \n",
+    "    XX, YY, XY = (torch.zeros(xx.shape).to(device),\n",
+    "                  torch.zeros(xx.shape).to(device),\n",
+    "                  torch.zeros(xx.shape).to(device))\n",
+    "    \n",
+    "    if kernel == \"multiscale\":\n",
+    "        \n",
+    "        bandwidth_range = [0.2, 0.5, 0.9, 1.3]\n",
+    "        for a in bandwidth_range:\n",
+    "            XX += a**2 * (a**2 + dxx)**-1\n",
+    "            YY += a**2 * (a**2 + dyy)**-1\n",
+    "            XY += a**2 * (a**2 + dxy)**-1\n",
+    "            \n",
+    "    if kernel == \"rbf\":\n",
+    "      \n",
+    "        bandwidth_range = [10, 15, 20, 50]\n",
+    "        for a in bandwidth_range:\n",
+    "            XX += torch.exp(-0.5*dxx/a)\n",
+    "            YY += torch.exp(-0.5*dyy/a)\n",
+    "            XY += torch.exp(-0.5*dxy/a)\n",
+    "      \n",
+    "      \n",
+    "\n",
+    "    return torch.mean(XX + YY - 2. * XY)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9af1425",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from scipy.stats import multivariate_normal\n",
+    "from scipy.stats import dirichlet \n",
+    "from torch.distributions.multivariate_normal import MultivariateNormal\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5bd9f09",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = \"cpu\"\n",
+    "\n",
+    "m = 20 # sample size\n",
+    "x_mean = torch.zeros(2)+1\n",
+    "y_mean = torch.zeros(2)\n",
+    "x_cov = 2*torch.eye(2) # IMPORTANT: Covariance matrices must be positive definite\n",
+    "y_cov = 3*torch.eye(2) - 1\n",
+    "\n",
+    "px = MultivariateNormal(x_mean, x_cov)\n",
+    "qy = MultivariateNormal(y_mean, y_cov)\n",
+    "#x = px.sample([m]).to(device)\n",
+    "#y = qy.sample([m]).to(device)\n",
+    "\n",
+    "x = torch.from_numpy(data_one.values).float().to(device)\n",
+    "y = torch.from_numpy(data_two.values).float().to(device)\n",
+    "print(x.t())\n",
+    "print(y.t())\n",
+    "print(type(x))\n",
+    "result = MMD(x, y, kernel=\"multiscale\")\n",
+    "\n",
+    "print(f\"MMD result of X and Y is {result.item()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70d7b830",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "52827106",
+   "metadata": {},
+   "source": [
+    "May have to change the device value and then test."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d6ea3ae",
+   "metadata": {},
+   "source": [
+    "# Root mean square difference"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "24a16d7c",
+   "metadata": {},
+   "source": [
+    "## Interpretation\n",
+    "The closer RMSE is to 0, the more similar the generated data is to the real-data. But RMSE is returned on the same scale as the target we are emulating for and therefore there isn’t a general rule for how to interpret ranges of values.\n",
+    "For CPU usage,  If it is more than 30, then we may have to relook at it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de532612",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import mean_squared_error\n",
+    "rmse = mean_squared_error(data_one['value'], data_two['value'], squared = False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "377e78e9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(rmse)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8ec2a78d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rmse2 = mean_squared_error(data_one['value'], data_two['value'], squared = False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1430a84e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(rmse2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a8a91b1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rmse3 = mean_squared_error(data_one.sort_values(by='value')['value'], data_two.sort_values(by='value')['value'], squared = False)\n",
+    "print(rmse3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c59358b1",
+   "metadata": {},
+   "source": [
+    "# Mutual Information"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "82a7aab2",
+   "metadata": {},
+   "source": [
+    "## Interpretation\n",
+    "Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent. The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other. Conversely, in theory there's no upper bound to what MI can be. In practice though values above 2.0 or so are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)\n",
+    "\n",
+    "Should be 1.5+"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20e1bf93",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.feature_selection import mutual_info_regression\n",
+    "\n",
+    "def make_mi_scores(X, y, discrete_features):\n",
+    "    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)\n",
+    "    mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
+    "    mi_scores = mi_scores.sort_values(ascending=False)\n",
+    "    return mi_scores"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "030f953f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "discrete_features = data_one.dtypes == int"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b13c38bc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mi_scores = make_mi_scores(data_one, data_two['value'], discrete_features)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7b5ca42b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(mi_scores)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}