Towards Optimal Symbolisation of Time Series Data
Time series are an increasingly prevalent form of mass dataset, due in no small part to the upsurge in human behavioural data that is now being recorded on an unparalleled scale. For large data sets, algorithms that are efficient in both time and space are required.
Time series symbolisation is a common pre-processing step used to speed up computation, reduce storage costs and/or enable the application of certain algorithms. In this work we show that current symbolisation techniques are sub-optimal in (at least) the broad application area of time series comparison, leading to unnecessary data corruption and potential performance loss before any real data mining takes place.
To address this, we present two novel algorithms, which are shown to be optimal under some broadly applicable assumptions.