SQL and the Problem with Time Series
SQL isn't keeping pace with
the demand to quickly analyze large amounts of data
by Joe Celko and Arthur Whitney (July 1, 2002)
Inside Market Data Vol 17 No 39
July 1, 2002 (New York - London)
"Time is what keeps everything
from happening all at once." --George Carlin
In the real world, time is a continuum, not a point. But
in the digital world of ones and zeros, we pretend time
occurs in a series of points. It's a convenient lie, one
that enables us to apply digital technologies to all kinds
of business operations. In the analysis of market data,
for example, trends can be uncovered by looking at price
changes over time.
SQL has become the industry standard for querying relational
data. Unfortunately, however, analyzing price changes over
time is exactly what SQL and most other programming languages
don't do well. If Cobol had had temporal data types, we
probably would not have had a Y2K crisis. Today, as the
volume of trading transactions continues to grow, the limitations
of SQL in querying financial events over time is failing
to keep pace with the demand for quick analysis of large
volumes of data.
SQL's weaknesses are particularly apparent when it comes
to market data, where traders want to capture tick data
on every bid, ask, order and cancellation at as many levels
as possible. Market data providers often create added value
by supplying tools that can assist with these analyses.
Typically these tools run up against three limitations.
First, unless the streaming data is written to storage,
it can only be analyzed while it resides in memory--even
though traders would like to be able to analyze large amounts
of historical, relational data as well as streaming intraday
data.
Second, SQL-based queries of relational data are inordinately
slow, because SQL is not optimized to handle time-series
data.
Finally, as the volume of data grows--particularly on
the historical side--existing databases and analytic tools
often cannot scale to provide adequate analytic performance.
Although SQL was the first language to have temporal data
types built in, its time capabilities are limited. Standard
SQL provides only simple temporal functions--you can add
and subtract units of time and you can compare points in
time.
However, SQL is a set-oriented language, and the nature
of a set is that it has no ordering--that is, no sense
of time. It considers the data as a completed whole that
exists all at once. This approach provides a tremendous
advantage in dealing with infinite sets in mathematics;
for example, you can prove a theorem for all the integers
without having to deal with them one at a time.
The bad news is that you can't easily do sequential processing,
such as chart a variable--for example, the price of a stock--over
time. No single point on the chart can tell you if the
trend is up or down. No single point can give you the rate
of change. Nor will the entire set of points tell you these
things. The information is revealed only in the sequence
in which the points appear. In order to handle the sequencing,
SQL must use multiple joins of the data, which only expands
the amount of data involved in the query and further impedes
performance. SQL can do the job, but not well.
What's the Solution? Given the limitations of SQL, how
can market data providers offer their subscribers comprehensive
query capabilities for time series analysis? The challenge
is to avoid settling for a partial solution. Assisting
customers to expand their decision support for trading
and to develop and execute sophisticated trading models
that result in competitive advantage should be the paramount
concerns.
Here are the solution characteristics to look for:
- A solution that does not split the streaming and historical
data, enabling the same tool to be used with both.
- Extremely high (near real-time) performance, even
on enormous volumes of data.
- Consistent scalability as data volumes grow to multiple
gigabytes and into terabytes.
- Maximum flexibility in data types--the ability to
receive and store any type of historical or real-time
data.
- Support for user-defined queries.
- Support for open standards: JDBC, ODBC, XML, etc.
- Investment protection--you or your customers should
be able to continue using any legacy SQL, Java, C and
C++ code as needed.
- Easy interfacing to a variety of third-party statistical
packages.
- Support for mobile, desktop, mid-range and mainframe
systems.
- Support for Windows, UNIX and Linux operating systems.
Solutions that offer all of these capabilities in one
package do exist, and they enable market data providers
to expand and enhance their services to subscribers.
About the authors: Joe
Celko, SQL expert and author of Instant
SQL Programming, is an independent consultant
based in Austin, Texas. Arthur
Whitney is CTO of Kx Systems (kx.com),
whose kdb database and KSQL language are used by market
data providers and financial services firms worldwide.
|