TSDB tutorial - 2023/11/16
We will study here the basic usage of a Time Series Database. We will base our work on the InfluxDB database, and study how to use it from the command line and through its web interface. We will query existing data, inject custom data and build some queries against it, illustrating some features of InfluxDB.
We will need 3 terminal windows: one for running the InfluxDB database (and see its log messages), another one for running the Telegraf service, and the last one for running the command-line client influx.
Key concepts
InfluxDB stores measurements as series. A measurement is composed of a name, a timestamp, and its values. Since there may be multiple properties for the same measurement, values are stored in the form field=value, where field is an alphanumeric identifier. Field values can be strings (enclosed in double quotes), floats, integers, or booleans.
Another, similar notion associated with measurements is tags: tags are optionally defined for each measurement as tagname=value, where the value is always treated as a string.
The main and important difference between fields and tags is that tags are indexed: InfluxDB will be faster at querying data based on tags. They should contain appropriate metadata (source of measurement, sensor id…).
A series is a collection of points that share a measurement, a tag set and a field key.
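For instance, a single point written in InfluxDB's line protocol combines all of these notions (the measurement, tag, and field names below are made-up illustrations, not values from this tutorial):

```
weather,station=a1,city=Nantes temperature=21.5,humidity=0.64,status="ok" 1700000000000000000
```

Here weather is the measurement name; station and city are indexed tags; temperature, humidity, and status are fields (note the double quotes around the string value); the final number is the timestamp. All points sharing the measurement weather, the same tag set, and the field key temperature belong to the same series.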
Data is stored in buckets. A bucket combines the concept of a database and of a retention period (the duration of time that each data point persists). Buckets may (optionally) define explicit schemas that measurements must respect.
The database can be queried using a query language named Flux. Previous versions offered a SQL-inspired language with time-specific extensions, named InfluxQL, but some queries could not be expressed with it. In the context of this tutorial, we will use Flux.
IMPORTANT NOTE for Windows users
While not required by Influx, which can run in any environment, in the context of this tutorial you must use a WSL shell to execute all the commands given in the text.
Setup
We will use InfluxDB version 2.7.3.
To set things up, we will download static binaries from https://portal.influxdata.com/downloads/
Note: you do not need to register, you can just close the Register dialog popup.
Static binaries install
This unconventional approach can be used even when you do not have admin rights on the computer: all software runs in user mode and uses non-privileged ports. Obviously you should
IMPORTANT NOTE for Windows: it seems you need to locate the database in a directory located on the Windows filesystem (i.e. not in the pseudo-fs emulated for $HOME). So: make sure you are in your User directory on Windows (the prompt should display something like /mnt/c/Users/username/). Create a tsdb directory somewhere in this tree (for instance /mnt/c/Users/username/Desktop/tsdb, to find it easily) and make sure to always run the commands from there.
Download InfluxDB (v2.7.3), the Telegraf software, and the InfluxDB v2 Cloud CLI, as static binaries, for Linux 64-bit (even on Windows, since we will use WSL).
Uncompress the archives, and run the server in a terminal (again: make sure you are in the right directory):
./influxdb2-2.7.3/usr/bin/influxd --bolt-path=data/bolt --engine-path=data/engine
Then access the administrative URL at the address http://localhost:8086/
Create a new user, with a password (do not use your existing account password). Use tp for the initial organization name (it corresponds to a group), and local for the initial bucket name.
Note down in a file the API token it generates, which authenticates you as the admin user.
Then click on Quick start to set up local metric collection. It will automatically create sensors for multiple measurements linked to the InfluxDB database itself (number of users, of queries, of databases…) so that you can monitor its behaviour. The associated dashboard InfluxDB 2.0 OSS Metrics will also be created so that you can visualize the data.
[Question] Open the InfluxDB 2.0 OSS Metrics dashboard. How many buckets are defined in the system?
CLI usage
Let's configure the authentication of the command-line tool, so that we can use it more conveniently.
You should have kept the API token generated for the admin user. If not, you will need to generate a new token to grant access to the database. In the web GUI, go to Load Data / API Tokens. Click on Generate API token and select All Access API token (to simplify things in the context of this tutorial; do not do this at home). Click on the created token and copy its textual value into a file to preserve it.
Open another terminal (the first one is used to have the daemon run) and set the environment variables so that you can use them more easily:
export INFLUX_TOKEN=<token value>
export INFLUX_ORG=tp
export INFLUX_BUCKET=local
Then configure the authentication for the CLI with the config command:

./influx config create --config-name tpconfig \
  --host-url http://localhost:8086 \
  --org $INFLUX_ORG \
  --token $INFLUX_TOKEN \
  --active
You can now use influx on the command line to interact with your database. Do not use the web interface for the moment.
[Question] What is the influx command that allows you to list the available buckets?
[Question] What are the names of the available buckets?
Monitor system resources
For the moment, our InfluxDB is configured to monitor itself.
We will configure a system resource monitor, in order to monitor the computer system. In the main page, click on the Load your data button.
The telegraf component is used to collect data and send it to the InfluxDB database. In the Load data interface, select the Telegraf tab and create a new configuration. Select the System configuration, select all measurements (cpu, disk, etc.), and create the configuration.
A specific token is created so that the Telegraf component can only write its data to a specific bucket. Copy it to the same file as the first token to keep a copy. A URL endpoint for writing data is also given. They are indicated in the dialog displayed after the creation.
Open another terminal (you should already have one for the database and one for running the influx client), set the INFLUX_TOKEN environment variable to the token generated for the Telegraf component, and start the telegraf server (note that the binary is in usr/bin in the extracted telegraf directory) with the command line given by Influx. We will let it run for a while so that it collects some data before exploring it.
Exploring existing data
Database creation
We will use a sample database. Download the sample data; the grep part is there to keep only the lines that define measurements.
curl https://s3.amazonaws.com/noaa.water-database/NOAA_data.txt | grep h2o > NOAA_data.txt
Have a look at the loaded data to make sure everything is OK. Check the timestamp values and determine their unit. Hint: in Python, time.localtime returns a time structure from a number of seconds.
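Applying the hint: feed one raw timestamp to time.localtime and check whether the resulting year is plausible. The value below is an assumed ten-digit example in epoch seconds, not necessarily a value taken from NOAA_data.txt:

```python
import time

ts = 1439856000  # assumed sample value, about 10 digits long

# If the unit is seconds, this yields a plausible year.
print(time.localtime(ts).tm_year)  # 2015

# Nanosecond timestamps are about 19 digits; passed directly they raise
# OverflowError (or give an absurd date), so you would divide by 1e9 first.
```
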
Using the command-line interface, create a new bucket named noaa and import the NOAA data into it (influx binary, write command). Important: do not forget to specify the precision (time unit) used in your file.
- [Question] Give the command-line used to import the data file.
Global exploration
We will first use the command-line interface influx to start exploring the data, and make queries using the Flux language.
The Flux language often requires filtering against a time range, so we first have to determine the appropriate time range for our dataset. For this, we will run some queries with a very large time range (starting in 1900) so that we are sure to capture all the data.
The following command runs the query that outputs the first measurement for all series:
./influx query -r 'from(bucket: "noaa")
  |> range(start: 1900-01-01, stop: now())
  |> first()
  |> keep(columns: ["_time", "_measurement"])'
[Question] What is the beginning datetime of measurements? Note: you will need this information in the next questions.
Modify the above query to find the time of the last measurement of noaa.
[Question] What is the new query, and what is the end datetime of measurements?
Now that we know the appropriate time range, we can investigate our data structure. Influx provides a package (functions defined using Flux) to help with this, in the schema package. See https://docs.influxdata.com/influxdb/v2/query-data/flux/explore-schema/ for its documentation.
Note that many of the exploring functions take a start parameter expressing a duration (the default of -30d is fine for real-time data, but is too short in our case).
- [Question] What measurements are present in the noaa bucket? What is the query to list them?
- [Question] What are the fields defined in the h2o_feet measurement, and the query to find them?
- [Question] What are the tag keys defined in the h2o_feet measurement, and the query to find them?
- [Question] What are the different values for the location tag in h2o_feet measurements, and the query to find them?
Data exploration
Using the Flux syntax, answer the following questions:
- [Question] How many measurement points of water level are there in the h2o_feet measurement for each location? (hint: count). Give the query used.
- [Question] Convert the h2o_feet measurement into a new h2o_meter measurement by converting the water_level values to meters (hint: map for mapping values, to for storing the new measurement into noaa). Give the query used.
Using the Giraffe GUI
The CLI is useful for administration and some common tasks like data manipulations, but dealing with time series data is more convenient when you can display the stored data. The InfluxDB binary includes a web-based GUI named Giraffe.
The Explore component allows you to build queries and visualize results. The Dashboard component can combine multiple visualizations into a single-page dashboard.
- Build a query through the Data Explorer component (expand left column to see the names) to display the water level for both locations as a timeline (Graph). Pay attention to the time range definition.
- Click on "Script editor" to see the Flux query built by the GUI. To better understand the output structure, use the Table visualisation component (upper left corner) that displays raw data in a table.
- In the Query editor, remove the aggregateWindow call and see the difference in the rendering of the graph (when zoomed).
[Question] Display the h2o_temperature graph. Its values are in Fahrenheit degrees; do a conversion (hint: map) to get a display in Celsius degrees. Give the query used.
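Whatever the query language, the conversion itself is °C = (°F − 32) × 5⁄9. A quick Python check of the formula (the Flux map body applies the same arithmetic to the _value column):

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0

print(fahrenheit_to_celsius(32.0))   # 0.0   (freezing point of water)
print(fahrenheit_to_celsius(212.0))  # 100.0 (boiling point of water)
```
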
- Dashboard building: create a new Dashboard where you will define 2 timeline views: one for the water level, and the other for the average temperature.
[Question] Provide a screenshot of the visualisation.
Injecting data
Data injection is most commonly done using the Telegraf component, as we saw at the beginning of the session. But there are multiple bindings that offer this possibility from different languages.
InfluxDB exposes a REST API on port 8086. You can write a measurement with a curl command:

curl --request POST \
  "http://localhost:8086/api/v2/write?org=$INFLUX_ORG&bucket=$INFLUX_BUCKET&precision=ns" \
  --header "Authorization: Token $INFLUX_TOKEN" \
  --header "Content-Type: text/plain; charset=utf-8" \
  --header "Accept: application/json" \
  --data-binary '
monitoring,computer=u1 temperature=72.5,load=.7
monitoring,computer=u2 temperature=67,load=.5'
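The same write can be issued from any language that can speak HTTP. A minimal Python sketch (standard library only; the host, organization, bucket, and token below are placeholders) that builds the request without sending it:

```python
import urllib.request

def build_write_request(host, org, bucket, token, precision, lines):
    """Build (but do not send) a POST request for the /api/v2/write endpoint."""
    url = f"{host}/api/v2/write?org={org}&bucket={bucket}&precision={precision}"
    return urllib.request.Request(
        url,
        data="\n".join(lines).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )

req = build_write_request(
    "http://localhost:8086", "tp", "local", "<token value>", "ns",
    ["monitoring,computer=u1 temperature=72.5,load=0.7"],
)
print(req.full_url)
# urllib.request.urlopen(req) would actually send it (the server must be running).
```
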
Another way is to use the influx write command from the influx CLI:

influx write \
  -b bucketName \
  -o orgName \
  -p s \
  'monitoring,computer=u1 temperature=72.5,load=.7 1556896326'
But we will here use the JavaScript client library to instrument a web page in order to send measurements to the database.
Create a mouse bucket in InfluxDB. Download the provided HTML template into a directory and complete it (look for the TODO string) to connect to the Influx database and send tracking events to the mouse bucket. The HTML file displays 4 squares. The goal is to trace the activity of the mouse in the different squares, and determine attributes of the mouse trajectory in each of the squares.
[Question] Provide the relevant source code extract that you had to implement.
To access the page, launch a temporary web server (python3 -m http.server in the same directory as the HTML file will do the job) and access it through the web browser (to avoid security limitations for JS on local files).
Make sure, through the Influx GUI, that you can see the different captured values for x and y, grouped by source.
[Question] Use the Scatter plot to display mouse trajectories in a meaningful way, illustrating the captured moves. You will have to pivot the data. Give the Flux query and a screenshot.
- Determine the global distance covered by the mouse in each of the squares (hint: it involves pivot, difference, and math.sqrt).
[Question] Give the query used to obtain the result
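The arithmetic behind the distance hint can be checked outside Flux first. A minimal Python sketch with a made-up sample trajectory (not real mouse data): pivot lines up x and y per timestamp, difference yields the consecutive deltas, and math.sqrt gives each segment length.

```python
import math

# Assumed sample trajectory: (x, y) positions at successive timestamps.
points = [(0, 0), (3, 4), (3, 4), (6, 8)]

# Pairwise deltas (what difference() computes in Flux), then segment norms.
total = sum(
    math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
)
print(total)  # 5.0 + 0.0 + 5.0 = 10.0
```

The Flux version applies the same per-interval computation to the pivoted x and y columns and then sums over each square.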
- Extending the previous query, determine the mean speed (in pixels/second) of the mouse for each of the squares (hint: it involves derivative).
[Question] Give the query used to obtain the result
There are no more questions to answer. The rest of the text presents some additional features that you are free to explore at your leisure.
- Bonus activity 1: define a dashboard that displays the scatter plot and speed/distance by widget, and updates in real time.
- Bonus activity 2: implement additional sensors (mouse clicks, number of typed characters, acceleration…)
To go further
We only saw here some basic features of the InfluxDB stack. We did not address complex queries, continuous queries, using alerts… Feel free to explore.
Bonus activity: replicate the tutorial using Warp10.io instead of InfluxDB, using either a locally installed version or the https://sandbox.senx.io/ sandbox server.
For reference, some interesting approaches to test for time series analytics/prediction:
- https://github.com/uber/orbit
- https://tsfresh.readthedocs.io/
- https://ai.googleblog.com/2020/12/using-automl-for-time-series-forecasting.html : AutoML approach by Google
- TS smoothing: use Gaussian smoothing rather than a moving average; see the article "Stop aggregating away the signal in your data"
- Other possible data source: Station Nantes - Pont Anne de Bretagne (Loire)