Lossy compression of time-series sensor data using cubic spline interpolation.
Instead of storing every sample, Splinter fits a cubic spline through a small set of knot points and stores only the knot coordinates. Decompression evaluates the spline at the original timestamps. You control the quality-size tradeoff by choosing how many knots to use.
On a smooth 20,000-sample synthetic signal (multi-harmonic sine wave):
| Knots | Compression vs raw | Compression vs gzip | RMSE |
|---|---|---|---|
| 10 | 800x | 725x | 0.825 |
| 25 | 364x | 330x | 0.499 |
| 50 | 190x | 173x | 0.316 |
| 100 | 98x | 88x | 0.177 |
| 200 | 49x | 45x | 0.010 |
| 400 | 25x | 23x | 0.0004 |
On real datasets:
| Dataset | Samples | Knots | Ratio vs raw | RMSE |
|---|---|---|---|---|
| UCI Household Power | 100,000 | 100 | 488x | 1.53 W |
| UCI Household Power | 100,000 | 500 | 100x | 1.17 W |
| Beijing PM2.5 | 40,000 | 100 | 195x | 119 ug/m3 |
| Synthetic ECG | 50,000 | 500 | 50x | 0.22 |
| Synthetic ECG | 50,000 | 1000 | 25x | 0.08 |
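The ratios in both tables can be reproduced with a few lines of arithmetic. The sketch below assumes the raw baseline is one float64 (8 bytes) per sample, which is an inference from the numbers rather than something the tables state. The .spl file size of 40 + 16K bytes comes from the file-format description later in this document. The function names are illustrative, not part of Splinter.

```python
HEADER_BYTES = 40          # fixed .spl header size
BYTES_PER_KNOT = 16        # one float64 timestamp + one float64 value per knot
BYTES_PER_RAW_SAMPLE = 8   # assumption: raw baseline is one float64 per sample


def spl_size(n_knots: int) -> int:
    """Size in bytes of a .spl file holding n_knots knots."""
    return HEADER_BYTES + BYTES_PER_KNOT * n_knots


def ratio_vs_raw(n_samples: int, n_knots: int) -> float:
    """Raw value-array size divided by .spl file size."""
    return BYTES_PER_RAW_SAMPLE * n_samples / spl_size(n_knots)


for n_samples, n_knots in [(20_000, 10), (20_000, 400), (100_000, 100), (50_000, 1000)]:
    print(f"N={n_samples:>7,}  K={n_knots:>5,}  ratio={ratio_vs_raw(n_samples, n_knots):.0f}x")
```

Under the same assumption, compression only pays off once 8N exceeds 40 + 16K, that is, once N is larger than 2K + 5.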
pip install -e .

# Compress a CSV to .spl format (100 knots; about 488x on a 100,000-sample signal)
splinter compress sensor_log.csv compressed.spl --knots 100 --column value --time-column timestamp
# Decompress back to CSV
splinter decompress compressed.spl reconstructed.csv

import numpy as np
from splinter import SplinterCompressor
t = np.linspace(0, 100, 50_000)
y = np.sin(t * 0.3) + 0.5 * np.sin(t * 1.1)
comp = SplinterCompressor(n_knots=200, strategy="uniform")
reconstructed = comp.fit_transform(t, y)
print(f"RMSE: {np.sqrt(np.mean((y - reconstructed) ** 2)):.6f}")
print(f"Compression ratio: {comp.compression_ratio():.1f}x")

Splinter is a good fit when all three of these are true:
- Your signal is smooth-ish (temperature, power draw, heart rate, audio envelopes, financial prices). Cubic splines model gradual change well. Sudden step changes or random noise are a poor fit.
- You can tolerate small reconstruction error. Splinter is lossy by design. The original samples are not recoverable exactly. If you need bit-perfect storage, use a lossless format.
- You need significant size reduction beyond what gzip alone offers. In the synthetic benchmark above, Splinter's output is 23x to 725x smaller than gzip's, because it exploits domain structure (smoothness) rather than byte-level redundancy.
| Domain | Example signals | Why Splinter helps |
|---|---|---|
| IoT and edge telemetry | Temperature, humidity, voltage, current | Sensors sample at high rates but physical quantities change slowly. 100 knots can represent hours of data. Reduces cellular upload cost and flash wear. |
| Industrial monitoring | Motor vibration envelopes, pressure traces, flow rates | Engineers care about trends and anomalies, not every individual sample. Smooth reconstruction preserves both. |
| Biomedical signals | ECG envelopes, respiration, blood pressure | Long-duration recordings (24-hour Holter monitors) generate gigabytes. Splinter compresses clean segments aggressively while the adaptive strategy allocates more knots to beat-by-beat variation. |
| Financial data archival | Intraday price curves, implied volatility surfaces | Tick data is expensive to store at full resolution for years. Splinter archives the shape of the curve at a fraction of the cost, suitable for backtesting and research workflows that do not require tick-exact replay. |
| Weather and climate logging | Station temperature, wind speed, solar irradiance | Measurements are taken every minute but physical processes evolve over hours. Seasonal archives compress extremely well. |
| ML preprocessing pipelines | Feature smoothing, dataset size reduction | Many ML models benefit from smoothed inputs. Splinter can replace a smoothing step and a compression step in one pass, reducing pipeline complexity. |
Splinter is a poor fit when any of these is true:
- Exact recovery is required. Financial audit trails, medical records, lossless audio.
- The signal is fundamentally noisy. Raw PM2.5 particle counts, high-frequency trading order books, or shot noise from a photodetector have no smooth structure to exploit.
- Very short signals. The overhead of storing knot coordinates only pays off when N (samples) is much larger than K (knots). Below a few hundred samples, just store the raw data.
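A natural cubic spline is a piecewise cubic polynomial that passes exactly through a set of control points (knots), has continuous first and second derivatives, and has zero second derivative at both endpoints. Among all twice-differentiable curves through the same points, it minimizes the integrated squared second derivative, a standard measure of overall curvature. Given K knot positions and the signal values at those positions, scipy.interpolate.CubicSpline computes the polynomial coefficients for each segment. Note that CubicSpline defaults to the not-a-knot boundary condition and produces a natural spline only when called with bc_type="natural". Splinter stores only the K (timestamp, value) knot pairs in a compact binary format. Decompression reconstructs the spline and evaluates it at any desired timestamps. Each knot costs 16 bytes (timestamp plus value), so a signal of N samples compresses to 40 + 16K bytes. Measured against 8 bytes per raw float64 sample, which is what the benchmark ratios above are consistent with, the compression ratio is 8N / (40 + 16K), or roughly N / (2K) when K is large enough that the 40-byte header is negligible.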
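The fit-and-evaluate cycle described above can be sketched directly with NumPy and SciPy. This is a minimal illustration with evenly spaced knots, not Splinter's own code. It passes bc_type="natural" so the spline matches the natural-spline description (SciPy's default boundary condition is not-a-knot), and all variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Synthetic smooth signal: two superimposed sine waves.
t = np.linspace(0, 100, 50_000)
y = np.sin(t * 0.3) + 0.5 * np.sin(t * 1.1)

# Pick K evenly spaced sample indices, always including both endpoints.
n_knots = 200
idx = np.linspace(0, len(t) - 1, n_knots).round().astype(int)
knots_t, knots_y = t[idx], y[idx]

# "Compression" keeps only the K (timestamp, value) pairs.
# "Decompression" rebuilds the spline and evaluates it at the original timestamps.
spline = CubicSpline(knots_t, knots_y, bc_type="natural")
reconstructed = spline(t)

rmse = float(np.sqrt(np.mean((y - reconstructed) ** 2)))
print(f"knots={n_knots}  RMSE={rmse:.2e}")
```

Because the spline interpolates, the reconstruction is exact at the knots; all of the error sits between them, which is why knot placement matters for signals with localized detail.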
git clone https://github.com/ike10/splinter.git
cd splinter
pip install -e ".[dev]"

Requirements: Python 3.10+, numpy, scipy, pandas, matplotlib, click.
# --strategy accepts uniform, adaptive, or error_bounded; --gzip adds an optional gzip layer
splinter compress INPUT.csv OUTPUT.spl \
  --knots 100 \
  --column value \
  --time-column timestamp \
  --strategy uniform \
  --gzip

splinter decompress INPUT.spl OUTPUT.csv

splinter benchmark INPUT.csv \
  --knot-sweep 10,50,100,500,1000 \
  --strategy uniform \
  --output report.md

splinter plot INPUT.spl \
  --original ORIGINAL.csv \
  --save preview.png

from splinter import SplinterCompressor, write_spl, read_spl
from splinter.io import SplHeader
import numpy as np

# Compress. timestamps and values are 1-D arrays of equal length.
comp = SplinterCompressor(n_knots=100, strategy="adaptive")
comp.fit(timestamps, values)
header = SplHeader(
    n_knots=len(comp.knot_timestamps),
    n_samples=comp.original_n,
    t_start=comp.t_start,
    t_end=comp.t_end,
    strategy_flag=comp.strategy_flag,
)
write_spl("signal.spl", header, comp.knot_timestamps, comp.knot_values)

# Decompress. The np.linspace grid below assumes the original samples were uniformly spaced.
from scipy.interpolate import CubicSpline
header, knots_t, knots_y = read_spl("signal.spl")
cs = CubicSpline(knots_t, knots_y)
reconstructed = cs(np.linspace(header.t_start, header.t_end, header.n_samples))

| Strategy | Description | Best for |
|---|---|---|
| uniform | Evenly spaced knots (default) | General use, smooth signals |
| adaptive | Curvature-weighted placement | Signals with localized sharp features |
| error_bounded | Iterative greedy insertion | When you need a max-error guarantee |
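To make "iterative greedy insertion" concrete, here is one plausible way such a strategy could work: start from a handful of knots, find the sample with the largest reconstruction error, promote it to a knot, and repeat until the error bound holds. This is a sketch under that assumption, not Splinter's actual error_bounded implementation. The helper name greedy_knots and its parameters are hypothetical, and it uses SciPy's default (not-a-knot) boundary condition.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def greedy_knots(t, y, max_error, start_knots=4, max_knots=2000):
    """Hypothetical helper: add a knot at the worst-fit sample until max_error is met.

    Returns the sorted knot indices and the maximum absolute error achieved.
    """
    # Start from a few evenly spaced samples, including both endpoints.
    idx = set(np.linspace(0, len(t) - 1, start_knots).round().astype(int).tolist())
    while True:
        order = np.array(sorted(idx))
        residual = np.abs(y - CubicSpline(t[order], y[order])(t))
        worst = int(np.argmax(residual))
        # Stop when the bound holds, the knot budget is spent, or no new knot can be added.
        if residual[worst] <= max_error or len(order) >= max_knots or worst in idx:
            return order, float(residual[worst])
        idx.add(worst)


t = np.linspace(0, 100, 5_000)
y = np.sin(t * 0.3) + 0.5 * np.sin(t * 1.1)
order, achieved = greedy_knots(t, y, max_error=0.01)
print(f"{len(order)} knots, max error {achieved:.4f}")
```

Refitting the whole spline on every iteration is simple but costs one fit per added knot; a production version would likely batch insertions or update locally.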
from splinter.benchmark import knot_sweep, plot_sweep, rmse, mae, max_error
results = knot_sweep(timestamps, values, knot_counts=[10, 50, 100, 500])
print(results)
plot_sweep(results, title="My signal", save_path="sweep.png")

Custom binary format designed for compact storage and fast parsing.
Bytes 0-7 : Magic "SPLNTR01"
Bytes 8-11 : Number of knots K (uint32, little-endian)
Bytes 12-19 : Original sample count N (uint64)
Bytes 20-27 : Time range start (float64)
Bytes 28-35 : Time range end (float64)
Byte 36 : Knot strategy flag (0=uniform, 1=adaptive, 2=error_bounded)
Bytes 37-39 : Reserved
Bytes 40+ : K knot timestamps (K * 8 bytes, float64)
Then : K knot values (K * 8 bytes, float64)
Total size: 40 + 16 * K bytes. A 100-knot file is 1,640 bytes.
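The layout above maps onto a single struct format string. The sketch below uses only the standard library and is independent of Splinter's own write_spl and read_spl. The names pack_spl and unpack_spl are hypothetical, and it assumes every multi-byte field is little-endian (the layout states this explicitly only for K).

```python
import struct

# Assumption: all multi-byte fields are little-endian.
# Fields: magic, K (uint32), N (uint64), t_start, t_end, strategy flag, 3 reserved bytes.
HEADER = struct.Struct("<8sIQddB3x")
MAGIC = b"SPLNTR01"


def pack_spl(n_samples, t_start, t_end, strategy, knots_t, knots_y):
    """Serialize the header, then K knot timestamps, then K knot values."""
    k = len(knots_t)
    body = struct.pack(f"<{2 * k}d", *knots_t, *knots_y)
    return HEADER.pack(MAGIC, k, n_samples, t_start, t_end, strategy) + body


def unpack_spl(blob):
    """Parse bytes produced by pack_spl back into header fields and knot lists."""
    magic, k, n_samples, t_start, t_end, strategy = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a .spl file")
    values = struct.unpack_from(f"<{2 * k}d", blob, HEADER.size)
    return (k, n_samples, t_start, t_end, strategy), list(values[:k]), list(values[k:])


knots_t = [float(i) for i in range(100)]
knots_y = [0.5 * v for v in knots_t]
blob = pack_spl(20_000, 0.0, 99.0, 0, knots_t, knots_y)
print(HEADER.size, len(blob))
```

The `<` prefix disables struct's native alignment padding, so the header is exactly 40 bytes and a 100-knot payload comes to 1,640 bytes, matching the size formula.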
Known limitations:
- Signals with abrupt discontinuities compress poorly. Step changes, impulse spikes (like raw PM2.5 data), and clipped signals violate the smoothness assumption of cubic splines and produce large reconstruction errors.
- Lossy, not lossless. Splinter cannot reconstruct the exact original signal. If exact recovery is required, use a lossless format.
- Python-speed. The current implementation is pure Python on top of NumPy and SciPy. Fitting and evaluating a 100,000-sample signal takes tens of milliseconds. A C extension could be 10-100x faster.
- Single-channel only. Multi-channel compression requires calling the compressor once per channel.
Planned improvements:
- C extension for the hot path (knot selection and spline evaluation).
- SIMD-accelerated batch evaluation for decompression of many channels.
- Learned knot selection using a small neural network trained on signal statistics.
- 2D bicubic spline compression for image data.
- Streaming compression for live sensor feeds without buffering the full signal.
pytest tests/ --cov=splinter

All 33 tests pass with over 90% coverage on core modules.
python examples/01_household_power.py # UCI household power, downloads ~20 MB
python examples/02_air_quality.py # Beijing PM2.5, downloads ~500 KB
python examples/03_ecg_signal.py # Synthetic ECG, no download needed

Or run all benchmarks:
python benchmarks/run_all.py

MIT. See LICENSE.
