Time series databases | Experience and Expectations

IoT assets generate large amounts of data, both in terms of the number of signals and the frequency at which those signals are sampled. Time series databases have been improving continuously, and in the last couple of years have undergone a sea change to accommodate the ever-growing demand for such databases.

Unique characteristics of time series data

The ingestion and query characteristics of time series data in business use cases are quite different. Query workloads tend to be sporadic because they are user driven, whereas ingestion workloads are continuous and grow linearly with fleet size.

1.     Ingestion characteristics

  • 100K signal inserts/second and even reaching millions of inserts/second for large fleets
  • Support for high cardinality, as a single asset can have thousands of signals
  • Support for late-arriving data on the order of weeks or months. Importing data from legacy systems can involve timestamps a year or more old.
  • No guarantee of time sequence (or order) of data, due to the distributed nature of cloud systems and concurrent ingestion of data streams from the edge
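The last two points above can be sketched as a small buffering layer in front of the database: hold incoming points in a min-heap and release them in timestamp order once a watermark (the newest timestamp seen minus an allowed-lateness window) has passed them. This is an illustrative sketch in Python, not part of any particular database; the class and parameter names are hypothetical.

```python
import heapq

class IngestBuffer:
    """Hypothetical sketch: reorder concurrently ingested points.

    Points may arrive out of order from distributed edge streams; this
    buffer releases them in timestamp order once the watermark
    (max timestamp seen minus allowed lateness) has passed them.
    """

    def __init__(self, allowed_lateness_s: int):
        self.allowed_lateness = allowed_lateness_s
        self.heap = []      # min-heap of (timestamp, signal, value)
        self.max_seen = 0   # newest timestamp observed so far

    def add(self, ts: int, signal: str, value: float) -> None:
        heapq.heappush(self.heap, (ts, signal, value))
        self.max_seen = max(self.max_seen, ts)

    def flush_ready(self):
        """Pop all points older than the watermark, in time order."""
        watermark = self.max_seen - self.allowed_lateness
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready
```

Data older than the lateness window (weeks or months, per the bullet above) would bypass such a buffer and be written directly, relying on the database to merge it into time order.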

2.     Query characteristics of machine data (data generated by the asset)

  • A common use case for monitoring and diagnostics is retrieving data for a set of signals concurrently over the recent past (days or weeks), where latency is of prime importance to end users
  • Access to long-duration data for analytical model development or for analysing trends, where wear and tear may be evident only over long periods of time
  • Ability to quickly retrieve the last received value for a specific signal to understand the current state of an asset, derive any immediate actions, or for reporting purposes
  • Ability to align data of multiple signals by time for diagnostic purposes 
  • Concurrent access to data of different assets for batch analytical algorithms
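Aligning data from multiple signals by time, as in the fourth point above, can be approximated by bucketing each series to a common resolution and joining on the bucket timestamps. A minimal illustrative sketch, with hypothetical helper names:

```python
def align_by_time(series_a, series_b, bucket_s=60):
    """Align two (timestamp, value) series onto shared time buckets
    for side-by-side diagnostics. Within a bucket the last value wins;
    real systems might instead interpolate or aggregate."""
    def bucketize(series):
        out = {}
        for ts, v in series:
            out.setdefault(ts - ts % bucket_s, []).append(v)
        return {b: vs[-1] for b, vs in out.items()}

    a, b = bucketize(series_a), bucketize(series_b)
    common = sorted(a.keys() & b.keys())   # buckets present in both
    return [(t, a[t], b[t]) for t in common]
```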

3.     Query characteristics of decorated or processed data (data generated in the cloud)

  • Due to the high-frequency nature of the data, it is better to downsample the data before storing it (while preserving the underlying pattern) so that long-duration queries can be served in a performant manner. For example, one-second resolution data for a single signal amounts to roughly 31 million points in a year. Hence it is desirable to keep an offline copy of the downsampled data, say at 10-minute resolution (about 52k points) or 1-hour resolution (about 9k points), which is sufficient for trend analysis in most cases.
  • While access to “raw data” is useful for detailed analysis, many use cases can be served using aggregates such as minimum, maximum, mean, standard deviation, count, and Holt-Winters forecasts for analysis of simple anomalous behavior
  • Another useful use case is keeping track of counters, e.g., the number of times the asset was started over its lifetime
  • Support for custom interpolation to make data time-equidistant
  • Ability to query data not just by the generation time (the time at which the data was produced) but also by the arrival time (the time at which it was stored). This is especially useful when analyzing late data
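The downsampling described in the first bullet can be sketched in a few lines: group one-second points into fixed windows and keep the min, max, and mean per window so the underlying pattern survives. This is a hypothetical helper for illustration only, not a specific database feature.

```python
from statistics import mean

def downsample(points, window_s):
    """Downsample (timestamp, value) points into fixed windows of
    window_s seconds, keeping (window_start, min, max, mean) so the
    shape of the signal is preserved at the coarser resolution."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % window_s, []).append(v)
    return [(w, min(vs), max(vs), mean(vs))
            for w, vs in sorted(buckets.items())]
```

Applied with window_s=600, a year of one-second data (~31 million points) reduces to the roughly 52k rows mentioned above.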

4.      Storage characteristics

  • Data compression and retention policy: as large volumes of data are ingested, the data needs to be compressed before storage (especially since SSDs are the typical storage medium for time series databases). Data can then be deleted based on a retention policy.
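Conceptually, a retention policy amounts to dropping points older than a cutoff. A toy sketch of the idea (real databases expire whole shards or blocks of data at once, not individual points):

```python
def apply_retention(points, retention_s, now):
    """Keep only (timestamp, value) points younger than the retention
    period; everything before the cutoff is eligible for deletion."""
    cutoff = now - retention_s
    return [(ts, v) for ts, v in points if ts >= cutoff]
```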

5.      Alarm specific characteristics

  • While signal data is usually continuous and of high volume, alarms do not exhibit the same ingestion characteristics.
  • On the query side, the ability to support fleet-wide queries, such as determining which assets have high-severity active alarms and how many, is useful for providing actionable insights.
  • Ability for a user to modify an alarm entry: for example, an alarm arrives unacknowledged, and an operator changes its status to acknowledged. This helps detect whether any unacknowledged alarms remain in the system.
  • Ability to determine the amount of time an alarm was in a particular state, especially the active state.
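The time-in-state computation from the last bullet can be derived from the alarm's state-transition events. A sketch assuming events are sorted by timestamp (the function name and event shape are hypothetical):

```python
def time_in_state(events, state, now):
    """Given (timestamp, state) transition events sorted by time,
    return the total seconds the alarm spent in `state` up to `now`.
    An interval is attributed to the state that was entered at its start."""
    total = 0
    prev_ts, prev_state = None, None
    for ts, s in events:
        if prev_state == state:
            total += ts - prev_ts
        prev_ts, prev_state = ts, s
    if prev_state == state:          # still in `state` at `now`
        total += now - prev_ts
    return total
```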

Experience with InfluxDB

InfluxDB is a purpose-built time series database and storage engine that supports many of the ingestion and retrieval characteristics required of time series data.

  1. Supports very high ingestion rates.
  2. Demonstrates very high data compression; a TB of ingested data is compressed down to a few GB.
  3. Data is grouped by time, making retrieval within a time range very performant.
  4. Support for late and historical data ingestion where time sequence is automatically maintained
  5. Retention Policy support
  6. Various time series functions such as Minimum, Maximum, Mean, Count, Last, First, Holt-Winters, and Standard Deviation.

What next!

  1. Native support for a performant solution for queries that range up to a few years for a series.
  2. Native support for the decorated-data use cases mentioned in the section above. This would remove the need for custom implementations.
  3. Fleet level queries when a single request retrieves points across thousands of tags 
  4. Ability to scale queries independent of storage. This is useful for resource intensive post processing of queries.
  5. Ability to retrieve compressed data in industry-standard formats such as Parquet, to make it easier for the data science community
  6. Ability to trigger compaction manually to optimize the backup process and additionally provide better visibility on the compaction process and policy.
  7. Ability to add a new data node to the cluster without restarts, thereby reducing user impact. Redistribution of the shards should ideally happen behind the scenes.
  8. Ability to move data to low-cost storage based on user policies, such as expiration of the retention period. The data should still be available through the same query interface, even if not at the same level of performance. This would reduce the need for a separate data lake, especially for time series data.
  9. Writes to InfluxDB currently use the line protocol (with no compression); support for a compressed version would reduce data transfer costs, especially when InfluxDB Cloud is used.
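To illustrate point 9, here is what client-side compression of a line protocol payload could look like. The line shape shown (measurement,tag=value field=value timestamp) follows InfluxDB's documented format, though this sketch simplifies field type handling (for instance, it omits the `i` suffix real integer fields require); the helper name and payload are hypothetical.

```python
import gzip

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point in a simplified InfluxDB line protocol:
    measurement,tag=val field=val timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Line protocol text is highly repetitive, so it compresses well.
lines = "\n".join(
    to_line_protocol("engine", {"asset": "a1"}, {"rpm": 1500.0 + (i % 50)},
                     1_600_000_000_000_000_000 + i * 1_000_000_000)
    for i in range(1000)
)
raw = lines.encode()
compressed = gzip.compress(raw)
```

A compressed write path would transfer payloads closer in size to `compressed` than to `raw`, which is where the data transfer savings come from.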

Many of the enhancements suggested above are part of InfluxDB Cloud and the upcoming InfluxDB IOx product.

This document captures the authors’ learning and suggestions at a specific point in time. We hope this will benefit the community and we urge other users to share their experiences as well.

Authors - Sushrutha Bankapura | Santosh Agarwal
