ike10/Splinter

Splinter

Lossy compression of time-series sensor data using cubic spline interpolation.

Instead of storing every sample, Splinter fits a cubic spline through a small set of knot points and stores only the knot coordinates. Decompression evaluates the spline at the original timestamps. You control the quality-size tradeoff by choosing how many knots to use.


Benchmark Results

[Figure: headline benchmark, compression ratio vs RMSE]

On a smooth 20,000-sample synthetic signal (multi-harmonic sine wave):

| Knots | Compression vs raw | Compression vs gzip | RMSE |
|---|---|---|---|
| 10 | 800x | 725x | 0.825 |
| 25 | 364x | 330x | 0.499 |
| 50 | 190x | 173x | 0.316 |
| 100 | 98x | 88x | 0.177 |
| 200 | 49x | 45x | 0.010 |
| 400 | 25x | 23x | 0.0004 |

On real datasets:

| Dataset | Samples | Knots | Ratio vs raw | RMSE |
|---|---|---|---|---|
| UCI Household Power | 100,000 | 100 | 488x | 1.53 W |
| UCI Household Power | 100,000 | 500 | 100x | 1.17 W |
| Beijing PM2.5 | 40,000 | 100 | 195x | 119 µg/m³ |
| Synthetic ECG | 50,000 | 500 | 50x | 0.22 |
| Synthetic ECG | 50,000 | 1000 | 25x | 0.08 |

Quick Start

pip install -e .

# Compress a CSV to .spl format with 100 knots
# (the ratio grows with signal length: ~98x at 20,000 samples, ~488x at 100,000)
splinter compress sensor_log.csv compressed.spl --knots 100 --column value --time-column timestamp

# Decompress back to CSV
splinter decompress compressed.spl reconstructed.csv

From Python:

import numpy as np
from splinter import SplinterCompressor

t = np.linspace(0, 100, 50_000)
y = np.sin(t * 0.3) + 0.5 * np.sin(t * 1.1)

comp = SplinterCompressor(n_knots=200, strategy="uniform")
reconstructed = comp.fit_transform(t, y)
print(f"RMSE: {np.sqrt(np.mean((y - reconstructed) ** 2)):.6f}")
print(f"Compression ratio: {comp.compression_ratio():.1f}x")

When to Use Splinter

Splinter is a good fit when all three of these are true:

  1. Your signal is smooth-ish (temperature, power draw, heart rate, audio envelopes, financial prices). Cubic splines model gradual change well. Sudden step changes or random noise are a poor fit.
  2. You can tolerate small reconstruction error. Splinter is lossy by design. The original samples are not recoverable exactly. If you need bit-perfect storage, use a lossless format.
  3. You need significant size reduction beyond what gzip alone offers. In the synthetic benchmark above, Splinter beats gzip by roughly 20-700x (23x at 400 knots, 725x at 10 knots) because it exploits domain structure (smoothness) rather than byte-level entropy.

Use case breakdown

| Domain | Example signals | Why Splinter helps |
|---|---|---|
| IoT and edge telemetry | Temperature, humidity, voltage, current | Sensors sample at high rates but physical quantities change slowly. 100 knots can represent hours of data. Reduces cellular upload cost and flash wear. |
| Industrial monitoring | Motor vibration envelopes, pressure traces, flow rates | Engineers care about trends and anomalies, not every individual sample. Smooth reconstruction preserves both. |
| Biomedical signals | ECG envelopes, respiration, blood pressure | Long-duration recordings (24-hour Holter monitors) generate gigabytes. Splinter compresses clean segments aggressively while the adaptive strategy allocates more knots to beat-by-beat variation. |
| Financial data archival | Intraday price curves, implied volatility surfaces | Tick data is expensive to store at full resolution for years. Splinter archives the shape of the curve at a fraction of the cost, suitable for backtesting and research workflows that do not require tick-exact replay. |
| Weather and climate logging | Station temperature, wind speed, solar irradiance | Measurements are taken every minute but physical processes evolve over hours. Seasonal archives compress extremely well. |
| ML preprocessing pipelines | Feature smoothing, dataset size reduction | Many ML models benefit from smoothed inputs. Splinter can replace a smoothing step and a compression step in one pass, reducing pipeline complexity. |

When NOT to use Splinter

  • Exact recovery is required. Financial audit trails, medical records, lossless audio.
  • The signal is fundamentally noisy. Raw PM2.5 particle counts, high-frequency trading order books, or shot noise from a photodetector have no smooth structure to exploit.
  • Very short signals. The overhead of storing knot coordinates only pays off when N (samples) is much larger than K (knots). Below a few hundred samples, just store the raw data.
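
To make the break-even point concrete, here is a minimal sketch of the size arithmetic. It assumes the 40 + 16 * K byte file size given under File Format, and it assumes that "raw" means the N values stored as float64 (8 * N bytes), which reproduces the "vs raw" ratios in the benchmark tables. The function names are illustrative and not part of the Splinter API.

```python
HEADER_BYTES = 40          # fixed .spl header
BYTES_PER_KNOT = 16        # one float64 timestamp + one float64 value
BYTES_PER_RAW_SAMPLE = 8   # assumption: raw signal stored as float64 values only


def spl_size_bytes(n_knots: int) -> int:
    """Size of a .spl file holding n_knots knots."""
    return HEADER_BYTES + BYTES_PER_KNOT * n_knots


def ratio_vs_raw(n_samples: int, n_knots: int) -> float:
    """Raw value-array bytes divided by .spl bytes."""
    return BYTES_PER_RAW_SAMPLE * n_samples / spl_size_bytes(n_knots)


def min_samples_to_break_even(n_knots: int) -> int:
    """Smallest N for which the .spl file is strictly smaller than the raw values."""
    return spl_size_bytes(n_knots) // BYTES_PER_RAW_SAMPLE + 1


for n_samples, n_knots in [(20_000, 100), (100_000, 100), (300, 100)]:
    print(n_samples, n_knots, f"{ratio_vs_raw(n_samples, n_knots):.1f}x")
```

With 100 knots, a 300-sample signal shrinks by less than 1.5x under these assumptions, which is why short signals are better stored raw.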

The Math

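A cubic spline is a piecewise cubic polynomial that passes exactly through a set of control points (knots) with continuous first and second derivatives. A natural cubic spline additionally sets the second derivative to zero at both ends, which makes it the interpolant that minimizes the integrated squared second derivative. Given K knot positions and the signal values at those positions, scipy.interpolate.CubicSpline computes the polynomial coefficients for each segment. Note that CubicSpline defaults to not-a-knot boundary conditions; a natural spline requires bc_type="natural". Splinter stores only the K (timestamp, value) knot pairs in a compact binary format. Decompression reconstructs the spline and evaluates it at any desired timestamps.

For a signal with N float64 samples, the value array takes 8N bytes and the compressed file takes 40 + 16K bytes, so the compression ratio is 8N / (40 + 16K), or roughly N/(2K) once the 40-byte header is negligible. The factor of 2 arises because each knot stores both a timestamp and a value. This matches the benchmark tables: N = 20,000 and K = 100 gives 160,000 / 1,640, about 98x.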
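
The round trip described above can be sketched with NumPy and SciPy alone. This is an illustration of the idea, not Splinter's implementation: the helper names compress_uniform and decompress are made up for this example, knots are simply taken at evenly spaced sample indices, and the boundary condition is left as a parameter because the source does not state which one Splinter uses internally.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def compress_uniform(t, y, n_knots):
    """Keep n_knots evenly spaced samples, always including both endpoints."""
    idx = np.unique(np.linspace(0, len(t) - 1, n_knots).round().astype(int))
    return t[idx], y[idx]


def decompress(knot_t, knot_y, t_out, bc_type="not-a-knot"):
    """Rebuild the spline from the knots and evaluate it at t_out."""
    return CubicSpline(knot_t, knot_y, bc_type=bc_type)(t_out)


# Smooth two-harmonic test signal (illustrative, not one of the benchmark datasets)
t = np.linspace(0.0, 100.0, 20_000)
y = np.sin(0.3 * t) + 0.5 * np.sin(1.1 * t)

knot_t, knot_y = compress_uniform(t, y, n_knots=200)
y_hat = decompress(knot_t, knot_y, t)

rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print(f"knots={len(knot_t)}  RMSE={rmse:.6f}")
```

Passing bc_type="natural" to decompress switches to the natural spline; the two choices differ mainly near the ends of the signal.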


Installation

git clone https://github.com/ike10/splinter.git
cd splinter
pip install -e ".[dev]"

Requirements: Python 3.10+, numpy, scipy, pandas, matplotlib, click.


CLI Reference

compress

# --strategy accepts uniform, adaptive, or error_bounded
# --gzip adds an optional extra gzip layer
splinter compress INPUT.csv OUTPUT.spl \
    --knots 100 \
    --column value \
    --time-column timestamp \
    --strategy uniform \
    --gzip

decompress

splinter decompress INPUT.spl OUTPUT.csv

benchmark

splinter benchmark INPUT.csv \
    --knot-sweep 10,50,100,500,1000 \
    --strategy uniform \
    --output report.md

plot

splinter plot INPUT.spl \
    --original ORIGINAL.csv \
    --save preview.png

Python API

import numpy as np
from scipy.interpolate import CubicSpline

from splinter import SplinterCompressor, write_spl, read_spl
from splinter.io import SplHeader

# Compress (timestamps and values are 1-D arrays of equal length)
comp = SplinterCompressor(n_knots=100, strategy="adaptive")
comp.fit(timestamps, values)

header = SplHeader(
    n_knots=len(comp.knot_timestamps),
    n_samples=comp.original_n,
    t_start=comp.t_start,
    t_end=comp.t_end,
    strategy_flag=comp.strategy_flag,
)
write_spl("signal.spl", header, comp.knot_timestamps, comp.knot_values)

# Decompress
header, knots_t, knots_y = read_spl("signal.spl")
cs = CubicSpline(knots_t, knots_y)
# Rebuild the time axis; this assumes the original samples were evenly spaced
t_out = np.linspace(header.t_start, header.t_end, header.n_samples)
reconstructed = cs(t_out)

Knot selection strategies

| Strategy | Description | Best for |
|---|---|---|
| uniform | Evenly spaced knots (default) | General use, smooth signals |
| adaptive | Curvature-weighted placement | Signals with localized sharp features |
| error_bounded | Iterative greedy insertion | When you need a max-error guarantee |
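
To give a feel for what "iterative greedy insertion" can look like, here is a minimal sketch of one generic scheme: fit a spline, find the sample with the largest absolute error, promote it to a knot, and repeat. It is not taken from Splinter's source, and greedy_knots, its parameters, and the seed-knot choice are all assumptions made for this example.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def greedy_knots(t, y, max_error, max_knots=500):
    """Insert the worst-fit sample as a new knot until max |error| <= max_error.

    Returns (sorted knot indices, achieved max absolute error).
    """
    n = len(t)
    knots = {0, n // 3, (2 * n) // 3, n - 1}  # seed knots, endpoints included
    while True:
        idx = np.array(sorted(knots))
        err = np.abs(y - CubicSpline(t[idx], y[idx])(t))
        worst = int(np.argmax(err))
        done = err[worst] <= max_error or len(knots) >= max_knots
        if done or worst in knots:  # second check guards against stalling
            return idx, float(err[worst])
        knots.add(worst)


t = np.linspace(0.0, 10.0, 2_000)
y = np.sin(t) + 0.3 * np.sin(3.0 * t)
idx, achieved = greedy_knots(t, y, max_error=1e-2)
print(f"{len(idx)} knots, max error {achieved:.4f}")
```

In this sketch the error bound holds only at the original sample timestamps, and only if the loop stops before reaching max_knots. Each iteration refits and re-evaluates the whole spline, so the cost grows with both the knot count and the sample count.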

Benchmarking utilities

from splinter.benchmark import knot_sweep, plot_sweep, rmse, mae, max_error

results = knot_sweep(timestamps, values, knot_counts=[10, 50, 100, 500])
print(results)

plot_sweep(results, title="My signal", save_path="sweep.png")
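
For reference, the three error metrics have standard definitions, sketched below in plain NumPy. This assumes the helpers of the same names in splinter.benchmark follow the usual definitions, which the text above does not spell out; the functions here are stand-ins written for this example.

```python
import numpy as np


def _diff(y_true, y_pred):
    return np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large deviations more heavily."""
    return float(np.sqrt(np.mean(_diff(y_true, y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: average deviation, less sensitive to outliers."""
    return float(np.mean(np.abs(_diff(y_true, y_pred))))


def max_error(y_true, y_pred):
    """Largest absolute deviation at any single sample."""
    return float(np.max(np.abs(_diff(y_true, y_pred))))


original = [1.0, 2.0, 3.0]
reconstructed = [1.0, 2.0, 5.0]
print(rmse(original, reconstructed), mae(original, reconstructed), max_error(original, reconstructed))
```

RMSE and MAE summarize the overall fit, while the maximum error is the one to watch when a single badly reconstructed spike matters.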

File Format (.spl)

Custom binary format designed for compact storage and fast parsing.

Bytes 0-7    : Magic "SPLNTR01"
Bytes 8-11   : Number of knots K  (uint32, little-endian)
Bytes 12-19  : Original sample count N  (uint64)
Bytes 20-27  : Time range start  (float64)
Bytes 28-35  : Time range end  (float64)
Byte 36      : Knot strategy flag  (0=uniform, 1=adaptive, 2=error_bounded)
Bytes 37-39  : Reserved
Bytes 40+    : K knot timestamps  (K * 8 bytes, float64)
Then         : K knot values  (K * 8 bytes, float64)

Total size: 40 + 16 * K bytes. A 100-knot file is 1,640 bytes.
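
The layout above maps directly onto Python's struct module. The sketch below is an independent reading of the table, not the library's write_spl/read_spl code: pack_spl and unpack_spl are hypothetical names, every field is assumed to be little-endian (the table states this only for K), and the reserved bytes are assumed to be zero.

```python
import struct

MAGIC = b"SPLNTR01"
# "<" = little-endian, no padding:
# 8s magic | I uint32 K | Q uint64 N | d t_start | d t_end | B strategy flag | 3x reserved
HEADER = struct.Struct("<8sIQddB3x")


def pack_spl(n_samples, t_start, t_end, strategy_flag, knot_t, knot_y):
    """Serialize header + K timestamps + K values into bytes."""
    k = len(knot_t)
    if len(knot_y) != k:
        raise ValueError("knot_t and knot_y must have the same length")
    header = HEADER.pack(MAGIC, k, n_samples, t_start, t_end, strategy_flag)
    return header + struct.pack(f"<{k}d", *knot_t) + struct.pack(f"<{k}d", *knot_y)


def unpack_spl(blob):
    """Parse bytes produced by pack_spl back into their fields."""
    magic, k, n_samples, t_start, t_end, flag = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a .spl file")
    knot_t = list(struct.unpack_from(f"<{k}d", blob, HEADER.size))
    knot_y = list(struct.unpack_from(f"<{k}d", blob, HEADER.size + 8 * k))
    return n_samples, t_start, t_end, flag, knot_t, knot_y


blob = pack_spl(1000, 0.0, 10.0, 1, [0.0, 5.0, 10.0], [1.0, 2.0, 3.0])
print(len(blob), unpack_spl(blob))
```

A fixed-size header followed by two contiguous float64 arrays means a reader can compute every offset from K alone, without scanning the file.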


Limitations

  • Signals with abrupt discontinuities compress poorly. Step changes, impulse spikes (like raw PM2.5 data), and clipped signals violate the smoothness assumption of cubic splines and produce large reconstruction errors.
  • Lossy, not lossless. Splinter cannot reconstruct the exact original signal. If exact recovery is required, use a lossless format.
  • Python-speed. The current implementation is pure Python on top of NumPy and SciPy. Fitting and evaluating a 100,000-sample signal takes tens of milliseconds. A C extension is estimated to be 10-100x faster.
  • Single-channel only. Multi-channel compression requires calling the compressor once per channel.

Future Work

  • C extension for the hot path (knot selection and spline evaluation).
  • SIMD-accelerated batch evaluation for decompression of many channels.
  • Learned knot selection using a small neural network trained on signal statistics.
  • 2D bicubic spline compression for image data.
  • Streaming compression for live sensor feeds without buffering the full signal.

Running Tests

pytest tests/ --cov=splinter

All 33 tests pass with over 90% coverage on core modules.


Running Examples

python examples/01_household_power.py  # UCI household power, downloads ~20 MB
python examples/02_air_quality.py       # Beijing PM2.5, downloads ~500 KB
python examples/03_ecg_signal.py        # Synthetic ECG, no download needed

Or run all benchmarks:

python benchmarks/run_all.py

License

MIT. See LICENSE.
