Splinter

Lossy compression of time-series sensor data using cubic spline interpolation.

Instead of storing every sample, Splinter fits a cubic spline through a small set of knot points and stores only the knot coordinates. Decompression rebuilds the spline and evaluates it on the original time grid, which the file records as a start time, an end time, and a sample count. You control the quality-size tradeoff by choosing how many knots to use.

Benchmark Results

On a smooth 20,000-sample synthetic signal (multi-harmonic sine wave):

Knots	Compression vs raw	Compression vs gzip	RMSE
10	800x	725x	0.825
25	364x	330x	0.499
50	190x	173x	0.316
100	98x	88x	0.177
200	49x	45x	0.010
400	25x	23x	0.0004
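
The gap between the "vs raw" and "vs gzip" columns is small because byte-level compression gains little on float64 samples. The sketch below approximates that baseline with the standard library only. It is not the project's benchmark script: the test signal is an assumed multi-harmonic sine, and the Splinter size comes from the documented 40 + 16 * K byte file layout rather than from a measured file, so the printed numbers will not match the table exactly.

```python
import gzip
import math
import struct

N = 20_000  # sample count used in the table above

# Assumed test signal: the exact benchmark waveform is not given in this README.
samples = [
    math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t)
    for t in (100.0 * i / (N - 1) for i in range(N))
]

raw = struct.pack(f"<{N}d", *samples)  # float64 values, 8 bytes each
gz = gzip.compress(raw)                # byte-level baseline


def spl_size(n_knots: int) -> int:
    """Size implied by the documented layout: 40-byte header + 16 bytes per knot."""
    return 40 + 16 * n_knots


for k in (10, 100, 400):
    size = spl_size(k)
    print(f"K={k:4d}  vs raw: {len(raw) / size:6.1f}x  vs gzip: {len(gz) / size:6.1f}x")
```

The "vs raw" figures depend only on N and K, which is why they can be reproduced without running Splinter at all.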

On real datasets:

Dataset	Samples	Knots	Ratio vs raw	RMSE
UCI Household Power	100,000	100	488x	1.53 W
UCI Household Power	100,000	500	100x	1.17 W
Beijing PM2.5	40,000	100	195x	119 µg/m³
Synthetic ECG	50,000	500	50x	0.22
Synthetic ECG	50,000	1000	25x	0.08

Quick Start

pip install -e .

# Compress a CSV to .spl format (100 knots; the ratio scales with sample count, e.g. ~488x for 100,000 samples)
splinter compress sensor_log.csv compressed.spl --knots 100 --column value --time-column timestamp

# Decompress back to CSV
splinter decompress compressed.spl reconstructed.csv

The same round trip from Python:

import numpy as np
from splinter import SplinterCompressor

t = np.linspace(0, 100, 50_000)
y = np.sin(t * 0.3) + 0.5 * np.sin(t * 1.1)

comp = SplinterCompressor(n_knots=200, strategy="uniform")
reconstructed = comp.fit_transform(t, y)
print(f"RMSE: {np.sqrt(np.mean((y - reconstructed) ** 2)):.6f}")
print(f"Compression ratio: {comp.compression_ratio():.1f}x")

When to Use Splinter

Splinter is a good fit when all three of these are true:

Your signal is smooth-ish (temperature, power draw, heart rate, audio envelopes, financial prices). Cubic splines model gradual change well. Sudden step changes or random noise are a poor fit.
You can tolerate small reconstruction error. Splinter is lossy by design. The original samples are not recoverable exactly. If you need bit-perfect storage, use a lossless format.
You need significant size reduction beyond what gzip alone offers. In the synthetic benchmark above, Splinter output is 23x to 725x smaller than gzip output, because it exploits domain structure (smoothness) rather than byte-level entropy.

Use case breakdown

Domain	Example signals	Why Splinter helps
IoT and edge telemetry	Temperature, humidity, voltage, current	Sensors sample at high rates but physical quantities change slowly. 100 knots can represent hours of data. Reduces cellular upload cost and flash wear.
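Industrial monitoring	Motor vibration envelopes, pressure traces, flow rates	Engineers care about trends and anomalies, not every individual sample. Smooth reconstruction preserves trends and slowly developing anomalies; sharp, localized features are better served by the adaptive strategy.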
Biomedical signals	ECG envelopes, respiration, blood pressure	Long-duration recordings (24-hour Holter monitors) generate gigabytes. Splinter compresses clean segments aggressively while the adaptive strategy allocates more knots to beat-by-beat variation.
Financial data archival	Intraday price curves, implied volatility surfaces	Tick data is expensive to store at full resolution for years. Splinter archives the shape of the curve at a fraction of the cost, suitable for backtesting and research workflows that do not require tick-exact replay.
Weather and climate logging	Station temperature, wind speed, solar irradiance	Measurements are taken every minute but physical processes evolve over hours. Seasonal archives compress extremely well.
ML preprocessing pipelines	Feature smoothing, dataset size reduction	Many ML models benefit from smoothed inputs. Splinter can replace a smoothing step and a compression step in one pass, reducing pipeline complexity.

When NOT to use Splinter

Exact recovery is required. Financial audit trails, medical records, lossless audio.
The signal is fundamentally noisy. Raw PM2.5 particle counts, high-frequency trading order books, or shot noise from a photodetector have no smooth structure to exploit.
Very short signals. The overhead of storing knot coordinates only pays off when N (samples) is much larger than K (knots). Below a few hundred samples, just store the raw data.
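
The break-even point for short signals can be worked out by hand. The sketch below assumes raw storage of one float64 per sample (8 bytes) and the 40 + 16 * K byte .spl size given in the File Format section; the function names are illustrative and not part of the Splinter API.

```python
def raw_bytes(n_samples: int) -> int:
    """Raw value array: one float64 per sample."""
    return 8 * n_samples


def spl_bytes(n_knots: int) -> int:
    """Documented .spl size: 40-byte header plus 16 bytes per knot."""
    return 40 + 16 * n_knots


def ratio(n_samples: int, n_knots: int) -> float:
    return raw_bytes(n_samples) / spl_bytes(n_knots)


def min_samples_to_break_even(n_knots: int) -> int:
    """Smallest N for which the .spl file is strictly smaller than the raw values."""
    return spl_bytes(n_knots) // 8 + 1


for n, k in [(200, 50), (1_000, 50), (100_000, 100)]:
    print(f"N={n:7d}  K={k:4d}  ratio={ratio(n, k):7.1f}x")

print("break-even N for K=50:", min_samples_to_break_even(50))
```

With 50 knots, a 200-sample signal shrinks by less than 2x, which is rarely worth the reconstruction error.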

The Math

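A cubic spline is a piecewise cubic polynomial that passes exactly through a set of control points (knots) with continuous first and second derivatives. The natural variant also sets the second derivative to zero at both ends and, among all twice-differentiable interpolants, minimizes the integral of the squared second derivative. Given K knot positions and the signal values at those positions, scipy.interpolate.CubicSpline computes the polynomial coefficients for each segment. Its default boundary condition is not-a-knot, so a natural spline requires bc_type="natural". Splinter stores only the K (timestamp, value) knot pairs in a compact binary format. Decompression reconstructs the spline and evaluates it at any desired timestamps. Each knot costs 16 bytes (two float64 values) against 8 bytes per raw float64 sample, so for N samples and K knots the compression ratio on the value array is 8N / (40 + 16K), which approaches N / (2K) once the 40-byte header is negligible.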
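
The round trip described above can be sketched with NumPy and SciPy alone. This is a minimal illustration, not Splinter's implementation: the helper names are hypothetical, the test signal is assumed, and the boundary condition is SciPy's default (not-a-knot) unless bc_type="natural" is passed. The printed RMSE values will not match the benchmark table, which uses a different signal.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def pick_uniform_knots(t, y, n_knots):
    """Keep n_knots evenly spaced samples, always including both endpoints."""
    idx = np.round(np.linspace(0, len(t) - 1, n_knots)).astype(int)
    return t[idx], y[idx]


def reconstruct(knot_t, knot_y, t_eval, bc_type="not-a-knot"):
    """Rebuild the interpolating spline from the knots and evaluate it at t_eval."""
    return CubicSpline(knot_t, knot_y, bc_type=bc_type)(t_eval)


def rmse_for(t, y, n_knots, bc_type="not-a-knot"):
    """Compress to n_knots, reconstruct on the full grid, and return the RMSE."""
    knot_t, knot_y = pick_uniform_knots(t, y, n_knots)
    y_hat = reconstruct(knot_t, knot_y, t, bc_type)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


t = np.linspace(0.0, 100.0, 20_000)
y = np.sin(0.3 * t) + 0.5 * np.sin(1.1 * t)  # assumed smooth test signal

for k in (25, 100, 400):
    size_ratio = 8 * len(t) / (40 + 16 * k)  # raw float64 values vs .spl bytes
    print(f"K={k:4d}  ratio={size_ratio:6.1f}x  RMSE={rmse_for(t, y, k):.6f}")
```

Because the spline interpolates, the reconstruction is exact at the knots themselves; all of the error lives between them.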

Installation

git clone https://github.com/ike10/splinter.git
cd splinter
pip install -e ".[dev]"

Requirements: Python 3.10+, numpy, scipy, pandas, matplotlib, click.

CLI Reference

compress

# --strategy accepts uniform, adaptive, or error_bounded.
# --gzip adds an optional gzip layer on top of the .spl output.
splinter compress INPUT.csv OUTPUT.spl \
    --knots 100 \
    --column value \
    --time-column timestamp \
    --strategy uniform \
    --gzip

decompress

splinter decompress INPUT.spl OUTPUT.csv

benchmark

splinter benchmark INPUT.csv \
    --knot-sweep 10,50,100,500,1000 \
    --strategy uniform \
    --output report.md

plot

splinter plot INPUT.spl \
    --original ORIGINAL.csv \
    --save preview.png

Python API

from splinter import SplinterCompressor, write_spl, read_spl
from splinter.io import SplHeader
import numpy as np

# Compress
# timestamps and values are 1-D NumPy arrays of equal length.
comp = SplinterCompressor(n_knots=100, strategy="adaptive")
comp.fit(timestamps, values)

header = SplHeader(
    n_knots=len(comp.knot_timestamps),
    n_samples=comp.original_n,
    t_start=comp.t_start,
    t_end=comp.t_end,
    strategy_flag=comp.strategy_flag,
)
write_spl("signal.spl", header, comp.knot_timestamps, comp.knot_values)

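# Decompress
from scipy.interpolate import CubicSpline

header, knots_t, knots_y = read_spl("signal.spl")
cs = CubicSpline(knots_t, knots_y)  # SciPy default boundary condition: not-a-knot
# The file stores only the time range and sample count, so rebuild an evenly spaced grid.
reconstructed = cs(np.linspace(header.t_start, header.t_end, header.n_samples))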

Knot selection strategies

Strategy	Description	Best for
`uniform`	Evenly spaced knots (default)	General use, smooth signals
`adaptive`	Curvature-weighted placement	Signals with localized sharp features
`error_bounded`	Iterative greedy insertion	When you need a max-error guarantee
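
This README does not spell out how the adaptive strategy places its knots. The sketch below shows one plausible reading of "curvature-weighted placement" using NumPy only: estimate |y''|, treat it as a density, and put knots at equal quantiles of that density. The function name and the floor parameter are hypothetical and are not taken from Splinter.

```python
import numpy as np


def curvature_weighted_knots(t, y, n_knots, floor=0.02):
    """Return sorted sample indices that are denser where |y''| is large.

    floor (hypothetical tuning parameter) mixes in a uniform density so that
    flat regions still receive some knots.
    """
    d2 = np.abs(np.gradient(np.gradient(y, t), t))  # finite-difference |y''|
    density = d2 / (d2.max() + 1e-12) + floor       # strictly positive
    cdf = np.cumsum(density)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])       # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, n_knots)        # equal shares of the density
    idx = np.interp(targets, cdf, np.arange(len(t)))
    # Rounding can produce duplicates, so fewer than n_knots may be returned.
    return np.unique(np.round(idx).astype(int))


# Flat signal with one narrow bump in the middle.
t = np.linspace(0.0, 100.0, 5_000)
y = np.exp(-((t - 50.0) ** 2))

idx = curvature_weighted_knots(t, y, n_knots=40)
in_bump = int(np.sum((t[idx] > 45.0) & (t[idx] < 55.0)))
print(f"{len(idx)} knots, {in_bump} of them within 5 units of the bump")
```

Uniform placement would put about 4 of 40 knots in that window, so the weighting concentrates knots where the signal actually bends.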

Benchmarking utilities

from splinter.benchmark import knot_sweep, plot_sweep, rmse, mae, max_error

results = knot_sweep(timestamps, values, knot_counts=[10, 50, 100, 500])
print(results)

plot_sweep(results, title="My signal", save_path="sweep.png")
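
The three error metrics behave differently on spiky residuals, which matters when reading a sweep. The definitions below are conventional stand-ins written for illustration; they share names with the functions imported from splinter.benchmark, but the library's signatures and return types may differ.

```python
import numpy as np


def rmse(y, y_hat):
    """Root-mean-square error: penalizes large deviations more than small ones."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y, y_hat):
    """Mean absolute error: average deviation, less sensitive to outliers."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(diff)))


def max_error(y, y_hat):
    """Worst-case deviation at any single sample."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.max(np.abs(diff)))


original = [0.0, 0.0, 0.0, 0.0]
reconstructed = [1.0, -1.0, 1.0, -3.0]  # one sample is off by much more than the rest
print(rmse(original, reconstructed), mae(original, reconstructed), max_error(original, reconstructed))
```

A low RMSE can coexist with a large max error when a single spike is missed, so it is worth checking both before settling on a knot count.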

File Format (.spl)

Custom binary format designed for compact storage and fast parsing.

Bytes 0-7    : Magic "SPLNTR01"
Bytes 8-11   : Number of knots K  (uint32, little-endian)
Bytes 12-19  : Original sample count N  (uint64)
Bytes 20-27  : Time range start  (float64)
Bytes 28-35  : Time range end  (float64)
Byte 36      : Knot strategy flag  (0=uniform, 1=adaptive, 2=error_bounded)
Bytes 37-39  : Reserved
Bytes 40+    : K knot timestamps  (K * 8 bytes, float64)
Then         : K knot values  (K * 8 bytes, float64)

Total size: 40 + 16 * K bytes. A 100-knot file is 1,640 bytes.
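
The layout above maps directly onto Python's struct module. The sketch below is an independent reading of the table, not the library's write_spl/read_spl code: pack_spl and unpack_spl are hypothetical names, and every multi-byte field is assumed to be little-endian (the table states this only for K).

```python
import struct

MAGIC = b"SPLNTR01"
# Assumption: all fields little-endian. 8s magic, uint32 K, uint64 N,
# float64 t_start, float64 t_end, uint8 strategy flag, 3 reserved bytes.
HEADER = struct.Struct("<8sIQddB3x")  # 8 + 4 + 8 + 8 + 8 + 1 + 3 = 40 bytes


def pack_spl(n_samples, t_start, t_end, strategy_flag, knot_t, knot_y) -> bytes:
    """Serialize a header followed by K timestamps and then K values."""
    k = len(knot_t)
    if len(knot_y) != k:
        raise ValueError("knot_t and knot_y must have the same length")
    head = HEADER.pack(MAGIC, k, n_samples, t_start, t_end, strategy_flag)
    body = struct.pack(f"<{2 * k}d", *knot_t, *knot_y)
    return head + body


def unpack_spl(blob: bytes):
    """Parse bytes produced by pack_spl back into header fields and knot lists."""
    magic, k, n_samples, t_start, t_end, flag = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a .spl file")
    vals = struct.unpack_from(f"<{2 * k}d", blob, HEADER.size)
    return (n_samples, t_start, t_end, flag), list(vals[:k]), list(vals[k:])


knot_t = [float(i) for i in range(100)]
knot_y = [0.5 * i for i in range(100)]
blob = pack_spl(100_000, 0.0, 99.0, 0, knot_t, knot_y)
print(len(blob))  # 40 + 16 * 100
```

Keeping timestamps and values in two contiguous float64 runs means each can be read in a single call once K is known from the header.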

Limitations

Signals with abrupt discontinuities compress poorly. Step changes, impulse spikes (like raw PM2.5 data), and clipped signals violate the smoothness assumption of cubic splines and produce large reconstruction errors.
Lossy, not lossless. Splinter cannot reconstruct the exact original signal. If exact recovery is required, use a lossless format.
Python-speed. The current implementation is pure Python on top of NumPy and SciPy. Fitting and evaluating 100,000-sample signals takes tens of milliseconds. A C extension for the hot path is expected to be 10-100x faster.
Single-channel only. Multi-channel compression requires calling the compressor once per channel.

Future Work

C extension for the hot path (knot selection and spline evaluation).
SIMD-accelerated batch evaluation for decompression of many channels.
Learned knot selection using a small neural network trained on signal statistics.
2D bicubic spline compression for image data.
Streaming compression for live sensor feeds without buffering the full signal.

Running Tests

pytest tests/ --cov=splinter

All 33 tests pass with over 90% coverage on core modules.

Running Examples

python examples/01_household_power.py  # UCI household power, downloads ~20 MB
python examples/02_air_quality.py       # Beijing PM2.5, downloads ~500 KB
python examples/03_ecg_signal.py        # Synthetic ECG, no download needed

Or run all benchmarks:

python benchmarks/run_all.py

License

MIT. See LICENSE.
