# Retrieving data using Python
This tutorial covers how to retrieve data from a sMAP archiver using Python. We will use the [http://www.openbms.org/](http://www.openbms.org/) site as an example data source; feel free to run these queries yourself! Before reading this, it helps to be familiar with Key Concepts and to have followed the instructions for sMAP Library Installation.
You can download a working copy of this example, and then follow along with the explanation.
Client bindings for the sMAP archiver are available in the `smap.archiver.client` package. To set one up, all you need is the following header:
```python
from smap.archiver.client import SmapClient
c = SmapClient("http://www.openbms.org/backend")
```
Once you’ve got a client, you can start to retrieve data. In the simplest form of access, you already know the UUIDs of the streams you’re interested in; in that case you can fetch their data directly with a range query. The data retrieval method expects the range as Unix timestamps; the `smap.contrib.dtutil` module contains several convenience functions for manipulating datetimes in different time zones:
```python
from smap.contrib import dtutil
start = dtutil.dt2ts(dtutil.strptime_tz("1-1-2013", "%m-%d-%Y"))
end = dtutil.dt2ts(dtutil.strptime_tz("1-2-2013", "%m-%d-%Y"))
oat = [
    "395005af-a42c-587f-9c46-860f3061ef0d",
    "9f091650-3973-5abd-b154-cee055714e59",
    "5d8f73d5-0596-5932-b92e-b80f030a3bf7",
    "ec2b82c2-aa68-50ad-8710-12ee8ca63ca7",
    "d64e8d73-f0e9-5927-bbeb-8d45ab927ca5"
]
data = c.data_uuid(oat, start, end)
```
`data` is returned as a list of numpy matrices. Each element corresponds to the UUID at the same position in the `oat` list and has two columns: the first holds timestamps (Unix time in milliseconds) and the second holds data values.
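The units are easy to trip over: the query range is given in seconds, while the returned timestamps are in milliseconds. The following is a minimal standard-library sketch of building a range and checking returned rows against it. The helper names `utc_seconds` and `rows_in_range` are hypothetical and not part of the sMAP library, and `utc_seconds` assumes the date means midnight UTC, which may differ from the time zone `dtutil.strptime_tz` applies.

```python
import calendar
from datetime import datetime


def utc_seconds(date_string, fmt="%m-%d-%Y"):
    """Parse a date string as midnight UTC and return Unix seconds."""
    parsed = datetime.strptime(date_string, fmt)
    return calendar.timegm(parsed.timetuple())


def rows_in_range(rows, start_s, end_s):
    """Keep rows whose millisecond timestamp lies in [start_s, end_s) seconds."""
    start_ms, end_ms = start_s * 1000, end_s * 1000
    return [row for row in rows if start_ms <= row[0] < end_ms]


start = utc_seconds("1-1-2013")
end = utc_seconds("1-2-2013")

# Hypothetical rows shaped like one returned stream: [timestamp_ms, value].
sample = [[start * 1000, 10.5], [end * 1000 - 1, 11.0], [end * 1000, 11.5]]
kept = rows_in_range(sample, start, end)
```

Here the last sample row sits exactly on the end of the range and is dropped, mirroring an inclusive start and exclusive end.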
#### Query options
There are two optional query arguments to `data_uuid`: `cache` and `limit`. Using `limit`, you can cap the number of points returned for each time series; this is useful for avoiding unexpectedly large result sets.
By default, the client library caches all downloaded data in the `.cache` directory; subsequent downloads of the same time range consult this local data rather than the server. To bypass the cache, pass `cache=False` to `data_uuid`.
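As a small illustration of the two options together, the sketch below wraps `data_uuid` in a helper that always skips the cache and caps the number of points. The names `fetch_fresh`, `max_points`, and `RecordingClient` are hypothetical; `RecordingClient` is only a stand-in so the example runs without a server, and in practice you would pass a real `SmapClient`.

```python
def fetch_fresh(client, uuids, start, end, max_points=1000):
    """Query by UUID, skipping the local cache and capping points per stream."""
    # `cache` and `limit` are the two optional arguments of data_uuid.
    return client.data_uuid(uuids, start, end, cache=False, limit=max_points)


class RecordingClient(object):
    """Hypothetical stand-in for SmapClient that records the call it receives."""

    def __init__(self):
        self.calls = []

    def data_uuid(self, uuids, start, end, cache=True, limit=-1):
        self.calls.append({"uuids": uuids, "cache": cache, "limit": limit})
        return [[] for _ in uuids]  # one empty result per requested stream


fake = RecordingClient()
result = fetch_fresh(fake, ["395005af-a42c-587f-9c46-860f3061ef0d"], 0, 60)
```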
#### Plotting this data
Making a time-series plot in `matplotlib` might be the next thing you want to do. It expects a slightly different date format than sMAP uses; `matplotlib.dates` contains the right conversion utilities.
Continuing the previous example:
```python
from matplotlib import pyplot, dates
for d in data:
    # sMAP timestamps are in milliseconds; epoch2num expects seconds.
    pyplot.plot_date(dates.epoch2num(d[:, 0] / 1000), d[:, 1], '-',
                     tz='America/Los_Angeles')
pyplot.show()
```
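If you would rather not depend on matplotlib's epoch conversion helper, one alternative is to convert the millisecond timestamps to `datetime` objects yourself. This is a minimal sketch assuming Python 3 and assuming each stream iterates as rows of `[timestamp_ms, value]`, as a two-column `numpy.array` does; `split_stream` is a hypothetical helper name, not part of the sMAP library.

```python
from datetime import datetime, timezone


def split_stream(rows):
    """Split [timestamp_ms, value] rows into UTC datetimes and values."""
    times = [datetime.fromtimestamp(row[0] / 1000.0, tz=timezone.utc)
             for row in rows]
    values = [row[1] for row in rows]
    return times, values


# Hypothetical sample shaped like one element of `data`.
sample = [[1356998400000.0, 10.5], [1356998460000.0, 10.7]]
times, values = split_stream(sample)
```

The resulting lists can be handed to `pyplot.plot(times, values)`, since matplotlib accepts `datetime` objects on the x axis.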
The archiver also includes a Query Language, which allows SQL-like queries on stream metadata. Rather than hard-coding lists of time series UUIDs, you can retrieve data on the basis of tags. For instance, we could retrieve the weather data from the previous example using a tag query:
```python
uuids, data = c.data("Metadata/Extra/Type = 'oat'", start, end)
```
The first argument to `data` is a where clause, restricting the set of time series returned to those with matching tags. In this case, we know that the data we’re interested in is tagged with a `Metadata/Extra/Type` value of `oat`.
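When the tag value comes from a variable, the where clause can be assembled as a plain string. The sketch below builds the same single-tag equality clause used in the example; `tag_equals` is a hypothetical helper, not part of the sMAP library. Because quoting and escaping rules for the query language are not covered here, it refuses values containing a single quote rather than guessing at them.

```python
def tag_equals(tag_path, value):
    """Build a where clause of the form  <tag path> = '<value>'."""
    if "'" in value:
        # Escaping rules are not covered here, so refuse rather than guess.
        raise ValueError("value must not contain a single quote")
    return "%s = '%s'" % (tag_path, value)


where = tag_equals("Metadata/Extra/Type", "oat")
```

The resulting string can be passed as the first argument to `c.data` in place of the literal clause.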
To figure out which feed is which, we might also want to retrieve the metadata for these streams. We can do this using the `tags` method:
```python
tags = c.tags("Metadata/Extra/Type = 'oat'")
```
The metadata is returned as a list of dicts of tags, which you can inspect and match up with the returned data using the `uuids`. A fully worked example puts this all together.
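Matching metadata to data might look like the sketch below. It assumes each tag dict carries its stream's UUID under a `"uuid"` key; that key name is an assumption, so check the dicts your archiver returns. `index_by_uuid` is a hypothetical helper, and the sample values only stand in for the results of `c.tags(...)` and `c.data(...)`.

```python
def index_by_uuid(tags, uuids, data, uuid_key="uuid"):
    """Return {uuid: (tag dict, data)} for streams present in both results."""
    tags_by_uuid = {t[uuid_key]: t for t in tags if uuid_key in t}
    return {u: (tags_by_uuid[u], d)
            for u, d in zip(uuids, data) if u in tags_by_uuid}


# Hypothetical values standing in for the results of c.tags(...) and c.data(...).
tags = [{"uuid": "a", "Metadata/Extra/Type": "oat"},
        {"uuid": "b", "Metadata/Extra/Type": "oat"}]
uuids = ["b", "a"]
data = [[[1000, 2.0]], [[1000, 1.0]]]

streams = index_by_uuid(tags, uuids, data)
```

Keying on the UUID means the pairing does not depend on the two queries returning streams in the same order.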
In order to explore what tags and values are available, you can try the stream status interface. This lets you explore the set of allowable tags and tag values using a graphical interface, and see some example data. Once you’ve located the data you’re interested in, you can either hard-code the UUIDs or encode that tag query directly into your application.
The client library contains several other methods for accessing data efficiently; for instance, you can get the latest data or access data relative to a reference timestamp.
class `smap.archiver.client.SmapClient(base='http://ar1.openbms.org:8079', key=None, private=False, timeout=50.0)`
`latest(where, limit=1, streamlimit=10)`
Load the last data in a time-series. See `prev` for args.
`prev(where, ref, limit=1, streamlimit=10)`
Load data before a reference timestamp. For instance, to locate the last reading whose timestamp is less than the current time, you can use `prev(where_clause, int(time.time()))`.
Parameters:
- `where` (str) – a selector identifying the streams to query
- `ref` (int) – reference timestamp
- `limit` (int) – the maximum number of points to retrieve per stream
- `streamlimit` (int) – the maximum number of streams to query
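A minimal sketch of the "last reading before now" case follows, using only the `prev` signature shown above. `last_before_now` and `FakeClient` are hypothetical names; `FakeClient` exists only so the example runs without an archiver, and the format of the real return value is not shown here.

```python
import time


def last_before_now(client, where, limit=1):
    """Ask the archiver for the newest reading(s) before the current time."""
    now = int(time.time())  # reference timestamp in Unix seconds
    return client.prev(where, now, limit=limit)


class FakeClient(object):
    """Hypothetical stand-in for SmapClient that records prev() calls."""

    def __init__(self):
        self.calls = []

    def prev(self, where, ref, limit=1, streamlimit=10):
        self.calls.append((where, ref, limit, streamlimit))
        return []


fake = FakeClient()
last_before_now(fake, "Metadata/Extra/Type = 'oat'")
```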
`next(where, ref, limit=1, streamlimit=10)`
Load data after a reference time. See `prev` for args.
`data(where, start, end, limit=10000, cache=True)`
Load data for streams matching a particular query.
Parameters:
- `where` (str) – the ArchiverQuery selector for finding time series
- `start` (int) – query start time in UTC seconds (inclusive)
- `end` (int) – query end time in UTC seconds (exclusive)

Returns: a tuple of `(uuids, data)`. `uuids` is a list of UUIDs matching the selector, and `data` is a list of numpy matrices with the data corresponding to each UUID.
`data_uuid(uuids, start, end, cache=True, limit=-1)`
Low-level interface for loading a time range of data from a list of UUIDs. Attempts to use cached data and load missing data in parallel.
Parameters:
- `uuids` (list) – a list of stringified UUIDs
- `start` (int) – the timestamp of the first record in seconds, inclusive
- `end` (int) – the timestamp of the last record, exclusive
- `cache` (bool) – if true, try to save/read data from an on-disk cache. Sometimes useful if the same data is frequently accessed.

Returns: a list of data vectors. Each element is a `numpy.array` of data, in the same order as the input list of UUIDs.
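Because the results come back in the same order as the requested UUIDs, they can be paired positionally. The sketch below does that and flags streams that returned no points; `by_uuid` and `empty_streams` are hypothetical helpers, and the sample values merely stand in for a real `data_uuid` result.

```python
def by_uuid(uuids, results):
    """Pair each requested UUID with its result, relying on matching order."""
    if len(uuids) != len(results):
        raise ValueError("expected one result per requested UUID")
    return dict(zip(uuids, results))


def empty_streams(streams):
    """List UUIDs whose result contains no points."""
    return [u for u, rows in streams.items() if len(rows) == 0]


# Hypothetical stand-ins for the arguments and return value of data_uuid.
requested = ["stream-1", "stream-2"]
results = [[[1356998400000, 10.5]], []]

streams = by_uuid(requested, results)
missing = empty_streams(streams)
```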