Add meaningful multivariate subsequence search tutorial #1138
Found 1 changed notebook. Review the changes at https://app.gitnotebooks.com/stumpy-dev/stumpy/pull/1138
@omkar-334 Thank you for your PR. Please allow me some time to review and provide comments.
seanlaw
left a comment
@omkar-334 Thank you for taking the time to submit this PR. The goal of the original issue is to reproduce the Whale example from Keogh's tutorial exactly, as it is the clearest and most illustrative example (rather than using a synthetic dataset). If that is not possible (e.g., if the data is not publicly available), then we should reach out to the author for the data or, otherwise, close this issue.
"import stumpy\n",
"import matplotlib.pyplot as plt\n",
"\n",
"for style in (\"./stumpy.mplstyle\", \"docs/stumpy.mplstyle\"):\n",
This is inconsistent with all other notebooks. Please use the same one-liner as the other notebooks.
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Flight-like Sensor Data\n",
For the overwhelming majority of our tutorials, we aim to reproduce the exact published work from the original authors rather than use a synthetic dataset. The primary goal is to reproduce the published figures exactly; demonstrating how to achieve this with STUMPY is secondary, maybe tertiary. Based on Keogh's tutorial, I would target the Whale example as it is more realistic and truly multidimensional. If that example is not available, then we should conclude that it is not possible to reproduce the work and close this issue.
"shade_known_windows(axs, known_events, m)\n",
"axs[-1].set_xlabel(\"Time\")\n",
"axs[0].set_title(\"Synthetic Multivariate Sensor Data\")\n",
"plt.show()"
Have you actually executed all cells of this notebook? Without the plots, the tutorial is meaningless and difficult (impossible) to review.
"\n",
"Suppose that we are searching through a multivariate time series collected from a flight. The channels have different physical units: altitude, airspeed, outside temperature, and hydraulic pressure. Three short maneuver windows share the same shape in the first three channels. The pressure channel, however, contains a large pressure pulse in the query window and in one unrelated window.\n",
"\n",
"This is intentionally constructed so that raw distance can prefer the pressure-only distractor over the true repeated maneuvers."
Generating synthetic data makes the tutorial too complicated for our average user to follow.
"\n",
"axs[-1].set_xlabel(\"Relative time\")\n",
"axs[0].set_title(\"Multivariate Query Window\")\n",
"plt.show()"
This PR adds a new tutorial inspired by Eamonn Keogh's multivariate top-k subsequence search example. The tutorial uses a self-contained synthetic sensor dataset to show why raw multivariate distances can be misleading, then demonstrates z-normalized stumpy.match and query-time channel selection for meaningful top-k matches.
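The contrast described above (raw distance favoring a pressure-only distractor, versus a z-normalized search restricted to selected channels) can be sketched in plain Python. This is only an illustrative toy under stated assumptions: it does not call STUMPY and is not the notebook's code. The helper names (`znorm`, `distance_profile`, `best_match`), the two-channel layout, and all data values are hypothetical and chosen just to make the effect visible.

```python
import math

def znorm(x):
    """Z-normalize a window; a constant window maps to all zeros."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [0.0] * len(x) if sd == 0 else [(v - mu) / sd for v in x]

def distance_profile(Q, T, channels, normalize):
    """Mean per-channel Euclidean distance between Q and every window of T."""
    m, n = len(Q[0]), len(T[0])
    profile = []
    for i in range(n - m + 1):
        total = 0.0
        for c in channels:
            q, w = Q[c], T[c][i:i + m]
            if normalize:
                q, w = znorm(q), znorm(w)
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        profile.append(total / len(channels))
    return profile

def best_match(profile, query_idx, m):
    """Index of the smallest distance, skipping windows that overlap the query."""
    candidates = [i for i in range(len(profile)) if abs(i - query_idx) >= m]
    return min(candidates, key=lambda i: profile[i])

# Hypothetical toy data: channel 0 is altitude-like, channel 1 is pressure-like.
shape = [0.0, 1.0, 2.0, 1.0, 0.0]
m, n, query_idx = len(shape), 40, 5
altitude = [100.0] * n
pressure = [0.0] * n
for j, s in enumerate(shape):
    altitude[5 + j] = 100.0 + s          # maneuver inside the query window
    altitude[20 + j] = 100.0 + 3.0 * s   # same maneuver shape, larger amplitude
    pressure[5 + j] = 50.0 * s           # pressure pulse inside the query window
    pressure[30 + j] = 50.0 * s          # pressure-only distractor
T = [altitude, pressure]
Q = [channel[query_idx:query_idx + m] for channel in T]

# Raw distance over all channels is dominated by the large pressure pulse.
raw_best = best_match(distance_profile(Q, T, [0, 1], normalize=False), query_idx, m)
# Z-normalized distance on the selected channel recovers the repeated maneuver.
znorm_best = best_match(distance_profile(Q, T, [0], normalize=True), query_idx, m)
print(raw_best, znorm_best)  # 30 20
```

In this toy setup the raw search lands on the distractor at index 30, while the z-normalized, channel-restricted search lands on the repeated maneuver at index 20. Z-normalizing both channels without selecting any would score the two candidates about equally here, which is why the sketch combines normalization with channel selection.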
Validation:
Fixes #1137