CC-BY

Traces - issues, capture, storage, manipulation

Olivier Aubert - www.olivieraubert.net
INFO GCN course - 07/12/2018

Summary

  • Context: from physical to digital traces
  • Traces for learning analytics
  • Time-Series Databases

Context

Trace = sign of the past, inscription of a past event or process

Marginalia

Annotation - trace of a scholarly reading activity

marginalia.jpg

Page of the Codex Oxoniensis Clarkianus 39 (Clarke Plato). Dialogue Gorgias. Public Domain

Diaries

diary.jpg

CC BY 2.0

Weavings

deparleur.jpg

Output from the performance "Le Déparleur" (Patrick Bernier and Olive Martin). Personal photograph - Olivier Aubert, CC BY-SA 4.0

Physical paths

Traces_de_ski_dans_la_neige.jpg

CC BY-SA 3.0

Other paths

gpx.png

CC BY-SA 4.0

Digital traces

facebook.jpg Social network traces Source

Digital traces - aggregation

twitter.png

Aggregated social network traces Source

Digital traces - learning environments

learninglocker.jpg

Learning Analytics dashboard Source

Digital traces - other subjects

monitoring.png

Access logs (offices, servers), sensor logs (health, Linky smart meters, IoT…)

Common features

  • time dimension
  • trace of a past activity
    • there is a subject
  • some data is collected

General issues

Issues in

  • privacy/ethics
  • capture
  • storage
  • representation
  • manipulation
  • interpretation

Digital traces

  • Many things automatically trackable at a low cost
  • Possible to add higher level events
  • But beware of the semantic gap between what can be observed and what can be interpreted

Mediated activity

  • Activity carried out using tools/artefacts
  • Trace interpretation can be guided by the application knowledge
  • Tools influence interaction
  • Digital tools can be instrumented to capture traces
    • they can propose reflexivity (dashboards, history)
    • they can propose assistants

Variety of digital traces

Many types of traces. Here we will focus on:

  • activity traces for learning analytics (XAPI)
  • sensor data (TSDB)

but there are also

  • website analytics
  • server logs
  • e-mails
  • chat logs
  • revision control information

A word about ethics

Vast amount of data and processing capabilities

  • Programmer’s responsibility/ethics
  • What do you do if faced with the task of implementing illegal/immoral software/processing?
  • Good to think about it before the issue arises…
  • For a start and further reading

Regulation - GDPR / RGPD

General Data Protection Regulation (GDPR) / Règlement Général sur la Protection des Données (RGPD)

  • Applies from May 25th, 2018
  • Applies to any EU company, or any company processing data from EU citizens
  • Fine up to 20 M€ or 4% of the annual worldwide turnover

GDPR principles

  • Explicit consent required
  • Privacy by Design and by Default
  • Responsibility and Accountability
  • Right of Access, Right of Erasure
  • Data portability
  • Data Protection Officer
  • Pseudonymisation encouraged
  • Data breaches: notification within 72 hours at most
  • Research exceptions

Learning analytics

Definition

Learning analytics is the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs. Source

Interdisciplinary domain (data science, pedagogy, social sciences)

Application domains

  • analytics on interaction with the learning platforms
  • analytics around social interactions
  • analytics around learning content
  • analytics in different spaces (digital / face-to-face)

Uses

  • Real-time or asynchronous feedback
  • Reflexivity
  • Usage analysis
  • Reporting
  • Alerts
  • Recommendation
  • Document (re)design

Architecture

Issue: traces are generated on a variety of platforms

  • we need a common model/protocol
  • we need trace repositories (personal/shared), called LRS (Learning Record Stores)
  • depending on the nature of the trace data, there may be other constraints (frequency, volume)

Models/Protocols

Experience API evolution

  • SCORM (2000) shortcomings
    • need to be connected
    • content must be imported and registered into the LMS before tracking
    • LRS embedded in LMS
  • Call for research on an evolution (2010)
  • TinCanAPI project, inspired by ActivityStreams
  • v 1.0 released in 2013

Experience API data model

ExperienceAPI website - Reference spec

Activity events are recorded as Statements

Statement = (timestamp, actor, verb, object) [+ context] [+ result] [+ stored] [+ authority]

XAPI Statement example

{ "timestamp: "2018-12-07T14:02:47.598441+01:00",
  "actor": { "name": "Olivier Aubert",
             "mbox": "mailto:contact@olivieraubert.net" },
  "verb": { "id": "http://activitystrea.ms/schema/1.0/present",
            "display": { "en-US": "presented" } },
  "object": { "id": "https://olivieraubert.net/cours/gcn_stockage_traces",
              "definition": {
                  "name": "Traces - issues, capture, storage, manipulation",
                  "type": "http://adlnet.gov/expapi/activities/lesson"
              }
  "context": { "language": "fr",
               "extensions": {
                   "http://www.polytech.univ-nantes.fr/xapi/polytechRoom": "D118"
               }
             },
  "stored": "2018-12-07T14:02:47.954814+01:00",
}

Actor representation

Identified by at most ONE of mbox, mbox_sha1sum, openid, account (homePage, name)

{
 "name": "Sally Glider",
 "mbox": "mailto:sally@example.com"
}

or

{
 "name": "Sally Glider",
 "account": {
   "homePage": "http://twitter.com",
   "name": "sallyglider434"
 }
}

Actor/Group representation

An Agent, plus an objectType of "Group" and a member list.

{
   "mbox": "mailto:info@tincanapi.com",
   "name": "Info at TinCanAPI.com",
   "objectType": "Group",
   "member": [
       {
           "mbox_sha1sum": "48010dcee68e9f9f4af7ff57569550e8b506a88d"
       },
       …
   ]
}
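
The mbox_sha1sum value above is the SHA-1 hash of the member's mailto IRI, a simple form of pseudonymisation. A minimal Python sketch of how such an identifier can be computed (the e-mail address is invented for illustration):

import hashlib

# mbox_sha1sum is the SHA-1 hex digest of the full "mailto:" IRI
# (e-mail address invented for illustration)
mbox = "mailto:sally@example.com"
actor = {
    "name": "Sally Glider",
    "mbox_sha1sum": hashlib.sha1(mbox.encode("utf-8")).hexdigest()
}
print(actor)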

Verb representation

URI + display string

{
 "id": "http://adlnet.gov/expapi/verbs/experienced",
 "display": {
   "en-US": "experienced"
 }
}

Verbs come from the XAPI vocabulary & profile index (common vocabularies); many common ones originate from the ActivityStreams W3C recommendation.

Object representation

Normally an activity, but can also be a person, group or even another statement.

{ "id": "https://olivieraubert.net/cours/gcn_stockage_traces",
  "definition": {
    "name": "Traces - issues, capture, storage, manipulation",
    "type": "http://adlnet.gov/expapi/activities/lesson"
  }
}

or

{ "objectType": "Agent",
  "mbox":"mailto:test@example.com"
}

Context representation

Additional information about the activity context

"context": { "language": "fr",
             "extensions": {
                 "http://www.polytech.univ-nantes.fr/xapi/polytechRoom": "D118"
             }
           }

Result representation

Representation of a measured outcome; an illustrative example follows the field list.

  • score (Object): The score of the Agent in relation to the success or quality of the experience.
  • success (Boolean): Indicates whether or not the attempt on the Activity was successful.
  • completion (Boolean): Indicates whether or not the Activity was completed.
  • response (String): A response appropriately formatted for the given Activity.
  • duration (String): Period of time over which the Statement occurred, as an ISO 8601 duration.
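
For illustration, a possible result, written as a Python dict mirroring the JSON structure (all values are invented):

# Illustrative "result" field; values invented for illustration
result = {
    "score": {"scaled": 0.87, "raw": 87, "min": 0, "max": 100},
    "success": True,
    "completion": True,
    "response": "B",
    "duration": "PT45M"   # ISO 8601 duration: 45 minutes
}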

Extensions

  • Object, Context and Result can feature “extensions”
  • Custom vocabulary, key-value pairs where keys are URIs

XAPI protocols

4 REST APIs (last 3: Document APIs)

  • Statement API: main API for statements
  • State API: scratch space in which arbitrary information can be stored in the context of an activity, agent, and registration (per user/per activity).
  • Agent Profile API: additional data against an agent profile (group, settings…)
  • Activity Profile API: additional data against an activity not specific to a user (collaboration activities, social interaction)
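
As an illustration of the Statement API, a minimal Python sketch posting a Statement to an LRS (the endpoint URL and credentials are placeholders; most LRSs expect Basic or OAuth authentication and the X-Experience-API-Version header):

import requests

# Placeholder endpoint and credentials - adapt to the target LRS
LRS_STATEMENTS = "https://lrs.example.com/data/xAPI/statements"
AUTH = ("lrs_key", "lrs_secret")  # HTTP Basic authentication

statement = {
    "actor": {"name": "Sally Glider", "mbox": "mailto:sally@example.com"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/experienced",
             "display": {"en-US": "experienced"}},
    "object": {"id": "https://olivieraubert.net/cours/gcn_stockage_traces",
               "definition": {"type": "http://adlnet.gov/expapi/activities/lesson"}}
}

response = requests.post(LRS_STATEMENTS,
                         json=statement,
                         auth=AUTH,
                         headers={"X-Experience-API-Version": "1.0.0"})
print(response.status_code, response.text)  # the LRS answers with the statement id(s)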

REST Principles

  • GET: access document
  • PUT: create/replace document
  • POST: update existing document
  • DELETE: delete document

Example

Call (source):

POST https://v2.learninglocker.net/v1/data/xAPI/activities/state

URL Parameters:

activityId:http://www.example.com/activities/1
stateId:http://www.example.com/states/1
agent:{"objectType": "Agent", "name": "John Smith", "account":{"name": "123", "homePage": "http://www.example.com/users/"}}

Headers:

Authorization:Basic YOUR_BASIC_AUTH
X-Experience-API-Version:1.0.0
Content-Type:application/json

Body:

{
   "favourite": "It's a Wonderful Life",
   "cheesiest": "Mars Attacks"
}
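
The same call can be issued programmatically; a minimal Python sketch using the requests library (credentials and URL are those of the example above and would have to be adapted):

import json
import requests

# URL parameters of the State API call (values from the example above)
params = {
    "activityId": "http://www.example.com/activities/1",
    "stateId": "http://www.example.com/states/1",
    "agent": json.dumps({"objectType": "Agent", "name": "John Smith",
                         "account": {"name": "123",
                                     "homePage": "http://www.example.com/users/"}})
}

response = requests.post(
    "https://v2.learninglocker.net/v1/data/xAPI/activities/state",
    params=params,
    headers={"Authorization": "Basic YOUR_BASIC_AUTH",
             "X-Experience-API-Version": "1.0.0"},
    json={"favourite": "It's a Wonderful Life", "cheesiest": "Mars Attacks"})
print(response.status_code)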

Trace repositories

  • Learning Record Store
    • personal or shared or integrated with the LMS
  • Privacy issues
  • Interesting approach: MIT OpenPDS
    • Data is personal
    • Processing results can be explicitly shared

Available LRS solutions

Complete list: https://experienceapi.com/get-lrs/

Learning Locker

learninglocker.png

(c) Learning Locker

  • Open-Source LRS
  • Node.js / MongoDB based
  • REST API
  • Features
    • Trace storage
    • Query builder
    • Dashboard builder

Time-series databases

Definition

A Time Series is

  • a collection of observations or data points obtained by repeated measurements over time
  • measurements often happen at regular intervals
  • measurement is well defined (who measures what)

Use cases 1/2

  • Systems monitoring
    • Various measures (processor, network load, disk usage…)
    • Continuous monitoring
    • Prediction of possible future events (storage limits, etc.)
    • Post-mortem analysis of event causes (mostly failures)

Use cases 2/2

  • Finances
    • Observing trends of stock prices
  • IOT / Industry / Health
    • Storing/analysing sensor measures
    • Continuous measures and evaluation
    • Warning when measurements deviate from the norm

Example

monitoring.png

Definition of Time Series Data

Time series data can be defined as:

  • a sequence of numbers representing the measurements of a variable at time intervals.
  • identifiable by a source name or id and a metric name or id.
  • consisting of {timestamp, value} tuples
    • value: usually a float, but can be any datatype
  • raw data is immutable, unique and sortable
  • possible extension: Geographic information, for Geo-TimeSeries
  • possibly associated with tags
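
A minimal Python sketch of this data model (field names are illustrative, not taken from any particular TSDB):

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TimeSeries:
    """A series is identified by a source and a metric name, plus optional tags."""
    source: str                          # e.g. a sensor or host id
    metric: str                          # e.g. "sys.cpu.user"
    tags: Dict[str, str] = field(default_factory=dict)
    # raw data: immutable, sortable {timestamp, value} tuples
    points: List[Tuple[float, float]] = field(default_factory=list)

    def append(self, timestamp: float, value: float) -> None:
        """Sequential append, by far the most common write operation."""
        self.points.append((timestamp, value))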

Conventional database approach

Time Series can be stored in conventional databases

series_id  timestamp  value
s01        00:50:37   2.56
s02        00:53:53   3.12
s01        00:56:52   4.42
s02        01:00:16   3.23
s01        01:03:32   5.20
s01        01:06:24   6.20

Conventional databases issues

  • scalability issues
    • volume (# of sensors, # of measures)
    • frequency (1 measure/second -> 86400 measures/day).
    • common workload in time series: millions of entries per second
  • query/transformation
    • expressivity issues
    • query performance
    • query characteristics (large batches, downsampling)

⇒ need for specialized time series databases

Time-Series Databases

A TSDB system is

  • a container for a collection of multiple time series
  • software system optimized for storing and querying arrays of numbers indexed by time, datetime or datetime range
  • specialized for handling/processing time series data, taking into account their characteristic workload

Characteristic write workload

  • write-mostly is the norm (95% to 99% of all workload)
  • writes are almost always sequential appends
  • writes to distant past or distant future are extremely rare
  • updates are rare
  • deletes happen in bulk

Characteristic read workload

  • happen rarely
  • are usually much larger than available memory (need for server-side processing)
  • multiple reads are usually sequential ascending or descending
  • reads of multiple series and concurrent reads are common (batch reading)

TSDB Designs

Based on these characteristics

  • proper internal representation of time series
  • distributed database options allow for more scalability than monolithic solutions
  • server-side query processing is necessary
  • memory caching/optimization

Storage implementation

  • May be based on existing DBMS (Cassandra, HBase, CouchDB…) for storing data or metadata.
  • May use its own data format (Time Structured Merge Tree for InfluxDB)
  • May focus on compression to store more data in memory (Gorilla by Facebook)

TSDB Design - Wide tables

One row per time period, columns are samples

series_id  start     t+1   t+2   t+3
s01        00:00:10  2.56  3.12  4.42
s02        00:00:10  4.12  5.12  6.12
s01        00:00:20  4.23  4.44  4.76

TSDB Design - Hybrid tables

One row per time period; completed rows are stored as a compressed BLOB.

series_id  start     t+1   t+2   t+3   compressed
s01        00:00:10                    {…}
s02        00:00:10                    {…}
s01        00:00:20  4.23  4.44

TSDB Design - Direct BLOB insertion

Usually with memory cache.

series_id  start     data
s01        00:00:10  {…}
s02        00:00:10  {…}
s01        00:00:20  {…}

Some optimizations

  • Pre-aggregation: pre-compute common aggregations at common granularities - days, months, etc. (see the sketch after this list)
  • Accept custom data formats as input: JSON, protobuf, custom formats (warp10.io)…
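
A minimal Python sketch of pre-aggregation, computing a per-day average from raw {timestamp, value} tuples (the granularity and the aggregation function are illustrative):

from collections import defaultdict
from datetime import datetime, timezone

def preaggregate_daily_mean(points):
    """Pre-compute a daily mean from raw (unix_timestamp, value) tuples."""
    buckets = defaultdict(list)
    for ts, value in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        buckets[day].append(value)
    return {day: sum(values) / len(values) for day, values in buckets.items()}

# Illustrative data: two days of measurements
raw = [(1543968000, 2.5), (1543971600, 3.5), (1544054400, 4.0)]
print(preaggregate_daily_mean(raw))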

Retention policies

  • Round-robin table approaches (RRDtool, Graphite): keep only a round-robin buffer of data, using fixed-size storage.
  • InfluxDB allows configuring retention policies (duration, replication, shard group duration)

Some existing solutions

See Complete list of TSDB

OpenTSDB

  • OpenTSDB
  • Open Source
  • HBase backend (requires HBase)
  • Direct BLOB insertion
  • REST API
  • Millisecond timestamps

OpenTSDB - Writing data

Inserting values into the database:

put <metric> <timestamp> <value> <tagk1=tagv1[ tagk2=tagv2 ...tagkN=tagvN]>

For instance:

put sys.cpu.user 1356998400 42.5 host=webserver01 cpu=0
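
OpenTSDB also exposes an HTTP endpoint for writes; a minimal Python sketch sending the same data point to /api/put (the host and port are placeholders, and the exact payload should be checked against the OpenTSDB documentation):

import requests

# Same data point as above, sent through OpenTSDB's HTTP API
datapoint = {
    "metric": "sys.cpu.user",
    "timestamp": 1356998400,
    "value": 42.5,
    "tags": {"host": "webserver01", "cpu": "0"}
}
response = requests.post("http://opentsdb.example.com:4242/api/put", json=datapoint)
print(response.status_code)  # a 2xx status means the point was accepted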

OpenTSDB - Series identifier

  • A series key is a combination of metric name (sys.cpu.user) and tag values (host / cpu).
  • Every time series in OpenTSDB must have at least one tag.
  • Offers rapid aggregation through queries, e.g. sum:sys.cpu.user{host=webserver01,cpu=42} or sum:sys.cpu.user{host=webserver01}

OpenTSDB queries

  • SELECT by metric name, time, time range or values
  • GROUP BY some property
  • DOWNSAMPLE according to an aggregation function and a time interval
  • AGGREGATE with common functions (min, max, sum, average) - query /api/aggregators for the list
  • INTERPOLATE to get results at specified intervals
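
A hedged Python sketch of such a query through the /api/query HTTP endpoint (host, metric and tag values are placeholders; the payload structure should be checked against the OpenTSDB documentation):

import requests

# Sum sys.cpu.user over the last hour, downsampled to 1-minute averages
query = {
    "start": "1h-ago",
    "queries": [{
        "aggregator": "sum",
        "metric": "sys.cpu.user",
        "tags": {"host": "webserver01"},
        "downsample": "1m-avg"
    }]
}
response = requests.post("http://opentsdb.example.com:4242/api/query", json=query)
print(response.json())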

InfluxDB

  • InfluxDB
  • Open Source (monolithic version)
  • No external dependency (Go)
  • Custom storage engine (Time Structured Merge tree)
  • Nanosecond timestamps
  • REST API, CLI tool, language bindings
  • SQL-like query language

InfluxDB - Writing data

Using Line Protocol:

<measurement>[,<tag-key>=<tag-value>...] \
<field-key>=<field-value>[,<field2-key>=<field2-value>...] \
[unix-nano-timestamp]

Example:

cpu,host=serverA,region=us_west value=0.64
payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
stock,symbol=AAPL bid=127.46,ask=127.48
temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
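
A minimal Python sketch writing one of these line-protocol points through the HTTP write endpoint of an InfluxDB 1.x instance (host and database name are placeholders):

import requests

# One line-protocol point, written to an InfluxDB 1.x /write endpoint
line = "cpu,host=serverA,region=us_west value=0.64"
response = requests.post("http://influxdb.example.com:8086/write",
                         params={"db": "mydb"},
                         data=line)
print(response.status_code)  # 204 means the point was written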

InfluxQL query language

SQL-inspired - Example:

SELECT MEAN("water_level")
FROM "h2o_feet"
WHERE "location"='santa_monica'
      AND time >= '2015-09-18T21:30:00Z'
      AND time <= now() + 180w
GROUP BY time(12m) fill(none)
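
Such a query can be sent to the /query HTTP endpoint of an InfluxDB 1.x instance; a minimal Python sketch (host and database name are placeholders):

import requests

# Run an InfluxQL query against the HTTP /query endpoint
q = """SELECT MEAN("water_level") FROM "h2o_feet"
       WHERE "location" = 'santa_monica' AND time >= '2015-09-18T21:30:00Z'
       GROUP BY time(12m)"""
response = requests.get("http://influxdb.example.com:8086/query",
                        params={"db": "mydb", "q": q})
print(response.json())  # results returned as JSON series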

InfluxDB TICK stack

Platform for collection, storage, graphing, and alerting on time series data

  • Telegraf: metrics collection agent
  • InfluxDB: storage agent
  • Chronograf: UI layer (graphs and dashboards)
  • Kapacitor: metrics and events processing and alerting engine

Warp10.io

  • Warp10
  • Dedicated to high-volume GeoTime Series (GTS) handling
  • Collection, storage and analysis of GTS
  • Open source (Apache 2.0 license)
  • Server-side analysis scripts
  • Storage through LevelDB (standalone) or HBase (distributed)

Warp10.io - writing data

  • POST queries to an ingress endpoint
  • Encoding: TS/LAT:LON/ELEV NAME{LABELS} VALUE
POST /api/v0/update HTTP/1.1
Host: host
X-Warp10-Token: TOKEN
Content-Type: text/plain

1380475081000000// foo{label0=val0,label1=val1} 123
/48.0:-4.5/ bar{label0=val0} 3.14
1380475081123456/45.0:-0.01/10000000 foobar{label1=val1} T

Warp10.io - WarpScript

  • Expressive query language
  • RPN-inspired syntax (stack operations)
  • Output format: compact JSON objects
'TOKEN_READ' 'token' STORE                 // Storing token

[ $token 'consumption' {} NOW 1 h ] FETCH  // Fetch all values from now to 1 hour ago
[ SWAP bucketizer.max  0 1 m 0 ] BUCKETIZE // Get max value for each minute

[ SWAP [ 'room' ] reducer.sum ] REDUCE     // Aggregate all consumptions by room
[ SWAP mapper.rate 1 0 0 ] MAP             // Consumption being a counter, compute the rate

Visualisation interfaces

Grafana (Graphite, InfluxDB, OpenTSDB, Prometheus)

grafana.png

Dedicated: NBA Data visualisation (Source)

References