Azure Synapse Analytics – the essential Spark cheat sheet

In this article, I take the Apache Spark service for a test drive. It is the third in our Synapse series: the first article provides an overview of Azure Synapse, and in the second we took the SQL on-demand feature for a test drive and provided some resulting observations.

This article contains the Synapse Spark test drive as well as a cheat sheet that describes how to get up and running step by step, ending with some observations on performance and cost.

Synapse Spark Architecture

The Spark pool in Azure Synapse is a provisioned and fully managed Spark service, utilising an in-memory cache to significantly improve query performance over disk-based storage.

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview

As Spark pools are a provisioned service, you pay for the resources provisioned, and these can be automatically started up when the Spark pool is in use and paused when it is idle. This is controlled per Spark pool via the two configurations Auto-pause and Autoscale, which I will discuss later in this post.

Based on the Azure Pricing calculator, https://azure.microsoft.com/en-au/pricing/calculator/, the cost is based on a combination of the instance size, number of instances and hours of usage.

Why Spark pool?

  • Data scientists can run large amounts of data through ML models
  • Performance: the architecture auto-scales, so you do not have to worry about infrastructure, managing clusters and so on.
  • Data orchestration: Spark workloads developed in notebooks can be included in orchestration pipelines with minimal effort.

Who Benefits?

  • Data scientists: collaborating, sharing and operationalising workflows that use cutting edge machine learning and statistical techniques in the language that fits the job is made simple.
  • Data engineers: complex data transformations can easily be replicated across different workflows and datasets in an obvious way.
  • Business users: even without a deep technical background, the end users of the data can understand and participate broadly in the preparation of their data and advise and sense-check with the capabilities that only a subject matter expert has.
  • Your business: data science processes within Azure Synapse are visible, understandable and maintainable, dismantling the ivory silo data science often occupies in organisations.

Steps to get up and running

I have already provisioned a data lake, the Azure Synapse Analytics workspace and some raw parquet files. In this section, I will:

  1. Access my Azure Synapse Analytics workspace.
  2. Provision a new Spark pool
  3. Create a new Notebook with Python as the chosen runtime language
  4. Configure my notebook session
  5. Add a cell and create a connection to my Data Lake
  6. Run some cells to test queries on my parquet files in the Data Lake
  7. Run another runtime language in the same notebook

Step 1 – Access my Synapse workspace

Access my workspace via the URL https://web.azuresynapse.net/

I am required to specify my Azure Active Directory tenancy, my Azure Subscription, and finally my Azure Synapse Workspace.

Before users can access the data through the Workspace, their access control must first be set appropriately. This is best done through Security Groups, but in this quick test drive, I used named users.

When I created Azure Synapse Analytics, I specified the data lake I wanted to use; this is shown under Data > Linked > data lake > containers. I can, of course, link other datasets here too, for example those in other storage accounts or data lakes.

Step 2 – Create a new Apache Spark pool

In the Manage section of the Synapse workspace, I navigated to the Apache Spark pools and started the wizard to create a new Spark pool.

On the first screen of the Spark pool provisioning wizard, ‘Basics’, I had to take careful note of the multiple options available, more specifically:

Autoscale

If enabled, then depending on the current usage and load, the number of nodes used will increase or decrease. If disabled, you can set a pre-determined number of nodes to use.

Node size

This determines the size of each node. For quick reference, there are currently three sizes available: small, medium and large, with the rough cost in AUD per hour for each node being $0.99, $1.97 and $3.95 respectively.

Number of Nodes

This determines the number of nodes that will be consumed when the Spark pool is online. As I enabled the Autoscale setting above, I now get to choose a range of nodes, which sets the minimum and maximum number of nodes that can be utilised by the Spark pool.

If I were to disable the Autoscale setting, I would instead only get to select a single, fixed number of nodes for the Spark pool to use at a time.

Both options have a minimum limit of 3 nodes.

For the purpose of our tests, I selected the Medium node size, enabled Autoscale and left the default node range of 3 to 40.

Continuing on to ‘Additional settings’, I left the defaults in place.

The main configuration setting that caught my focus was the Auto-pause option, in which you can define how long the Spark pool will stay idle before it automatically pauses.

Review and create the Spark pool.

Step 3 – Create a new Notebook with Python as the runtime language

Advanced analytics (including exploration and transformation of large datasets and machine learning using Spark) is delivered through a notebooks environment that is very similar to Databricks. Like Databricks, you choose your default language, attach your notebooks to compute resources, and run through cells of code or explanatory text.

To get started, I added a new notebook in the Develop section of the Synapse workspace.

At a glance, there are some notable differences in language choices when compared to Databricks:

  • The R language is not yet available as an option in the notebook (though this appears to be on the horizon)
  • Synapse additionally allows you to write your notebook in C#

Both Synapse and Databricks notebooks allow code written in Python, Scala and SQL.

Synapse Spark notebooks also allow us to use different runtime languages within the same notebook, using magic commands to specify which language to use for a specific cell. An example of this is shown in Step 7.

More information on this can be found in the following Microsoft documentation: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#develop-notebooks

Step 4 – Configure my notebook session

Notebooks give us the option to configure the compute for the session as we develop. Compute configuration in Synapse is a dream – you specify the pool from which compute is borrowed, how much you want to borrow, and for how long. This is a really nice experience.
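
For completeness, the session can also be configured in code rather than through the dialog. The sketch below is illustrative only – Synapse notebooks support a %%configure session magic (as I understand it, inherited from Livy), but the memory and core values shown are assumptions, not the exact settings I used:

%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 2
}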

Step 5 – Add a cell and create a connection to my Data Lake

In the first instance, I added two cells: one describing the notebook, and a second creating a connection to my Data Lake files with a quick count.

Documentation cells – in Databricks, you must know Markdown syntax in order to write and format your documentation cells. In Synapse, there are nice helpers so that, for example, you don’t have to remember which brackets are which when writing hyperlinks and so on.

Connectivity – to connect to the Data Lake, I’ve used the Azure Data Lake Storage (ADLS) path to connect directly. This is a really convenient feature as it inherits my Azure Role Based Access Control (RBAC) permissions on the Data Lake for reading and writing, meaning that controlling data access can be done at the data source, without having to worry about mounting blob storage using shared access signatures as in Databricks (although access via a shared access signature can also be put in place).

The ADLS path is in the following format:

abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

I also added a quick row count over the parquet files. Running the cell initially took around 2 minutes and 15 seconds, as it needed to spin up the Spark pool and corresponding nodes (after seeing this happen a few times over the course of a few days, the spin-up time varied from around 1 minute to as much as 4 minutes). A subsequent run took just under 3 seconds to count the rows from the 19 parquet files.
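
For reference, the connection and count cell was along the following lines (a minimal sketch: the abfss path uses the placeholder format above, and sdf_kWh is simply the name I gave the resulting Spark data frame):

# Read the parquet files directly from the Data Lake using the ADLS (abfss) path;
# access is authorised via my Azure RBAC permissions, with no mount or access key required
adls_path = "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>"
sdf_kWh = spark.read.parquet(adls_path)

# Quick row count across all parquet files in the folder
print(sdf_kWh.count())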

Step 6 – Run some cells to test queries on my parquet files in the Data Lake

In the previous step, I also added a reader on the parquet files within my Data Lake container. Let’s first display the contents using the code

display(sdf_kWh)

Running the notebook cell gives us a preview of the data directly under the corresponding cell, taking just 3.5 seconds to execute and display the top 1000 records.

From the simple display, I now run a simple aggregate with ordering, which takes 9 seconds to run.
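
The aggregate cell looked something like the sketch below (illustrative only – ReadingDate and kWh are placeholder column names, not the actual schema of my files):

from pyspark.sql import functions as F

# Simple aggregate with ordering: total consumption per date, most recent first
sdf_agg = (
    sdf_kWh
    .groupBy("ReadingDate")
    .agg(F.sum("kWh").alias("Total_kWh"))
    .orderBy(F.col("ReadingDate").desc())
)
display(sdf_agg)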

Using another set of parquet files in the same Data Lake, I ran a slightly more complex query, which returns in around 11 seconds.

Step 7 – Run another runtime language in the same notebook

In Synapse, a notebook allows us to run different runtime languages in different cells, using ‘magic commands’ that can be specified at the start of the cell.

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks

I can access a data frame created in a previous cell using Spark Python and, within the same notebook, query it with a simple select statement in SQL syntax.

Or an aggregation in SQL syntax.
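
As a sketch of how this hangs together (the temporary view name and column names are placeholders of my own choosing), the Python cell first exposes the data frame to the SQL engine, and subsequent cells switch language with the %%sql magic:

# Cell 1 – PySpark: register the data frame as a temporary view
sdf_kWh.createOrReplaceTempView("vw_kWh")

%%sql
-- Cell 2 – Spark SQL via the %%sql magic: a simple select
SELECT * FROM vw_kWh LIMIT 10

%%sql
-- Cell 3 – Spark SQL: an aggregation over the same view
SELECT ReadingDate, SUM(kWh) AS Total_kWh
FROM vw_kWh
GROUP BY ReadingDate
ORDER BY ReadingDate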

Other Observations

Publishing

Once all my cells were coded and working as intended, I proceeded to publish the notebook. Comparing the publishing paradigm to Databricks, Synapse works by having users publish their changes, giving the opportunity to test, but reverting changes does not currently appear to be simple. Databricks exposes a full history of changes to notebooks, which is useful for assessing who changed what and for reverting accidental changes.

Orchestration

Orchestration is another nice experience Synapse has to offer. After developing a notebook that accomplishes a task such as training a classification model, you can immediately add your notebook to a Data Factory pipeline as a task and choose the circumstances that should trigger your notebook to be run again. This is very simple to do and very easy to understand, especially given the excellent Data Factory pipeline interfaces that Azure provides.

Flexibility

Databricks gives the user quite good visibility of the resources on which the cluster is running – it allows the user to run shell commands on the Spark driver node and provides some utility methods for controlling Databricks (such as adding input parameters to notebooks). This does not appear to be the experience that Synapse is going for, potentially to some detriment. In one engagement, it proved very useful that Databricks exposed the shell of the machine running code as it enabled us to send commands to a separate piece of specialised optimisation software that we compiled on the machine.

Performance Observation

Although the use cases in this blog are limited and the size of data is quite small, they did give an indication of basic performance with simple commands. I gave up on formal speed testing, since caching and random variation have too much of an effect; Spark works faster than a single machine on large enough data, and that’s all we really need to know.

Below is a summary of the performance we’ve seen so far, using 2 executors and 8 cores on medium-sized instances.

Task – Duration
Initial spin-up of the pool – 1 to 4 minutes
Row count of ~600k records – under 3 seconds
Display top 1000 rows – 3.5 seconds
Aggregate over dates – 9 seconds

Cost Observation

Pricing of the Spark pool is calculated on the uptime of the pool, so you only pay when there is activity (a running notebook or application) on the Spark pool, plus the idle minutes configured in the Auto-pause functionality – in my case, 15 minutes.

With a decent amount of sporadic use over two weeks, I observed a cost of nearly $100 AUD. Bear in mind that I did utilise the Medium-sized instances with auto-scaling set to a maximum of 40 nodes, which was in hindsight overkill for what it was actually used for.
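
As a rough back-of-the-envelope check rather than an exact bill breakdown: at the minimum of 3 medium nodes, the pool costs roughly 3 × $1.97 ≈ $5.91 AUD per hour of uptime, so a bill of around $100 AUD corresponds to something in the order of 17 hours of billed pool uptime (including the 15-minute idle windows) across the two weeks – fewer hours if the pool scaled beyond 3 nodes for parts of that time.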

Azure Synapse Analytics – the essential SQL on-demand cheat sheet

Our first article http://blog.exposedata.com.au/2020/05/21/azure-synapse-analytics-insights-for-all-and-breaking-down-silos/ introduced Azure Synapse Analytics and some of its core concepts. In this second article, I take the new SQL on-demand feature, currently in Preview, for a test drive.

Disclaimer: as Azure Synapse Analytics is still in Public Preview, some areas may not yet function as they will at full General Availability.

This article contains the Synapse SQL on-demand test drive as well as a cheat sheet that describes how to get up and running step-by-step. I then conclude with some observations, including performance and cost.

But first let’s look at important architecture concepts of the SQL components of Azure Synapse Analytics, the clear benefits of using the new SQL on-demand feature, and who will benefit from it. For those not interested in these background concepts, just skip to the “Steps to get up and running” section later in this article.

Synapse SQL Architecture

Azure Synapse Analytics is a “limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/

It has two analytics runtimes: Synapse SQL for T-SQL workloads and Synapse Spark for Scala, Python, R and .NET. This article focuses on Synapse SQL, and more specifically the SQL on-demand consumption model.

Synapse SQL leverages Azure Storage, or in this case, Azure Data Lake Gen 2, to store your data. This means that storage and compute charges are incurred separately.

Synapse SQL’s node-based architecture allows applications to connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-architecture

The Control node of the SQL Pool consumption model (also called provisioned) utilises a massively parallel processing (MPP) engine to optimise queries for parallel processing and then passes operations to Compute nodes to do their work in parallel. SQL Pools allow for querying files in your data lake in a read-only manner, but they also allow you to ingest data into SQL itself and shard it using a Hash, Round Robin or Replicate pattern.

As SQL Pool is a provisioned service, you pay for the resources provisioned and these can be scaled up or down to meet changes in compute demand, or even paused to save costs during periods of no usage.

The Control node of the SQL on-demand consumption model (also called serverless) on the other hand utilises a distributed query processing (DQP) engine to optimise and orchestrate the distribution of queries by splitting them into smaller queries, executed on Compute nodes. SQL on-demand allows for querying files in your data lake in a read-only manner.

SQL on-demand is, as the name suggests, an on-demand service where you pay per query. You are therefore not required to pick a particular size as is the case with SQL Pool, because the system automatically adjusts. The Azure Pricing calculator, https://azure.microsoft.com/en-us/pricing/calculator/, currently shows the cost to query 1TB of data as being A$8.92. I give my observations on cost and performance later in this article.
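
To put that rate in perspective (an illustrative calculation on my part, not a quoted price): a query that processes 100 GB of data would cost in the order of 100/1024 × A$8.92 ≈ A$0.87, which is why the exploratory queries later in this article barely registered on the bill.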

Now let’s focus on SQL on-demand more specifically.

Why SQL on-demand

I can think of several reasons why a business would want to consider Synapse SQL on-demand. Some of these might be:

  • It is very useful if you want to discover and explore the data in your data lake which could exist in various formats (Parquet, CSV and JSON), so you can plan how to extract insights from it. This might be the first step towards your logical data warehouse, or towards changes or additions to a previously created logical data warehouse.
  • You can build a logical data warehouse by creating a relational abstraction (almost like a virtual data warehouse) on top of raw or disparate data in your data lake without relocating the data.
  • You can transform your data to satisfy whichever model you want for your logical data warehouse (for example star schemas, slowly changing dimensions, conformed dimensions, etc.) upon query rather than upon load, which was the regime used in legacy data warehouses. This is done by using simple, scalable, and performant T-SQL (for example as views) against your data in your data lake, so it can be consumed by BI and other tools or even loaded into a relational data store in case there is a driver to materialise the data (for example into Synapse SQL Pool, Azure SQL Database, etc.).
  • Cost management, as you pay only for what you use.
  • Performance, the architecture auto-scales so you do not have to worry about infrastructures, managing clusters, etc.

Who will benefit from SQL on-demand?

  • Data Engineers can explore the lake, then transform the data in ad-hoc queries or build a logical data warehouse with reusable queries.
  • Data Scientists can explore the lake to build up context about the contents and structure of the data in the lake and ultimately contribute to the work of the data engineer. Features such as OPENROWSET and automatic schema inference are useful in this scenario.
  • Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favourite tools that support connection to SQL on-demand.
  • BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview

Is T-SQL used in Synapse SQL the same as normal T-SQL?

Mostly, yes. Synapse SQL on-demand offers a T-SQL querying surface area which in some areas is more extensive than the T-SQL we are already familiar with, mostly to accommodate the need to query semi-structured and unstructured data. On the other hand, some aspects of the T-SQL we are already familiar with are not supported due to the design of SQL on-demand.

High-level T-SQL language differences between consumption models of Synapse SQL are described here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features.

Let’s now look at getting up and running.

Steps to get up and running

I have already provisioned both a data lake and Azure Synapse Analytics. In this section, I will:

  1. Access my Azure Synapse Analytics workspace.
  2. Then load five raw parquet data files, each containing approx. 1,000 records, to my data lake.
  3. Then access the data lake through Synapse and do a simple query over a single file in the data lake.
    1. Part of this sees me set appropriate RBAC roles on the data lake.
  4. Then extend the query to include all relevant files.
  5. Then create the SQL on-demand database and convert the extended query into a reusable view.
  6. Then publish the changes.
  7. Then connect to the SQL on-demand database through Power BI and create a simple report.
  8. Then extend the dataset from 5,000 records to approx. 50,000.
  9. And test performance over a much larger dataset, i.e. 500,000 records, followed by a new section on performance enhancements and side by side comparisons.

Step 1 – Access my Synapse workspace

Access my workspace via the URL https://web.azuresynapse.net/

I am required to specify my Azure Active Directory tenancy, my Azure Subscription, and finally my Azure Synapse Workspace.

Before users can access the data through the Workspace, their access control must first be set appropriately. This is best done through Security Groups, but in this quick test drive, I used named users.

When I created Azure Synapse Analytics, I specified the data lake I wanted to use; this is shown under Data > Linked > data lake > containers. I can, of course, link other datasets here too, for example those in other storage accounts or data lakes.

Step 2 – load data to my data lake

I have a data lake container called “rawparquet” where I loaded 5 parquet files containing the same data structure. If I right-click on any of the Parquet files, I can see some useful starter options.

https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio

Step 3 – Initial query test (access the data lake)

I right-clicked and selected “Select TOP 100 rows”, which created the following query:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/userdata2.parquet',
        FORMAT='PARQUET'
    ) AS [r];

The first time I ran this query, I got this error:

This was because my Azure Active Directory identity does not have rights to access the file. By default, SQL on-demand tries to access the file using my Azure Active Directory identity, so to resolve the issue I needed to be granted the proper rights to access the file.

To resolve this, I granted both the ‘Storage Blob Data Contributor’ and ‘Storage Blob Data Reader’ roles on the storage account (i.e. the data lake).

https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio

and https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/access-control.

Those steps resolved the error.

Step 4 – Extend the SQL

In my theoretical use case, I have a Data Factory pipeline that loads user data from the source into my data lake in Parquet format. I currently have 5 separate Parquet files in my data lake.

The query mentioned previously obviously targeted a specific file explicitly, i.e. “userdata2.parquet”

In my scenario, my Parquet files are all delta files, and I want to query the full set.

I now simply extend the query by removing the “TOP 100” clause and opening the OPENROWSET part of the query to the whole container, not just the specific file. It now looks like this:

SELECT
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
        FORMAT='PARQUET'
    ) AS [r];

Step 5 – Now let’s create a database and views dedicated for my SQL on-demand queries

This database serves as my Logical Data Warehouse built over my Data Lake.

I firstly ensure that SQL on-demand and the master database are selected:

CREATE DATABASE SQL_on_demand_demo 

I now create a view that will expose all the data in my dedicated Container, i.e. “rawparquet” as a single SQL dataset for use by (for example) Power BI.

I firstly ensure that SQL on-demand and the new database, SQL_on_demand_demo, are selected.

I now run the create view script:

CREATE VIEW dbo.vw_UserData as 
SELECT
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
        FORMAT='PARQUET'
    ) AS [r];

I now test the view by running a simple select statement:

Select * from dbo.vw_UserData 

Step 6 – Publish changes

Select Publish to move all the changes to the live environment.

If you now refresh your Data pane, you will see the new database and view appear as an on-demand database. Here you will be able to see both Provisioned (SQL Pool) and on-demand databases:

My data volumes at this stage are still very low, only 5,000 records. But we will first hook Power BI on to Synapse, and then throw more data at it to see how it performs.

Step 7 – Query through Power BI

It is possible to build interactive Power BI reports right here in the Synapse workspace, but for now I am going to go old school and create a Direct Query report from the view we created, essentially querying the data in the data lake via the logical data warehouse, SQL_on_demand_demo.

  1. To connect:
    1. Open a new Power BI Desktop file.
    2. Select Get Data
    3. Select Azure SQL Database.
    4. Find the server name
      1. Navigate to your Synapse Workspace
      2. Copy the SQL on-demand endpoint from the overview menu
    5. Paste it into the Server field in the Power BI Get Data dialog box
    6. Leave the database name blank
    7. Remember to select Direct Query if the processing must be handed over to Synapse, and the footprint of Power BI must be kept to a minimum.
    8. Select Microsoft Account as the authentication method and sign in with your organisational account.
    9. Now select the view vw_UserData
    10. Transform, then load, or simply load the data.
  2. Create a simple report, which now runs in Direct Query mode:

Step 8 – Add more files to the data lake and see if it simply flows into the final report

I made arbitrary copies of the original Parquet files in the “rawparquet” container and increased the volume of files from 5 to 55, and as they are copies, they obviously all have the same structure.

I simply refreshed the Power BI Report and the results were instantaneous.

Step 9 – Performance over a much larger dataset

For this, I am going to publish the report to Power BI Service to eliminate any potential issues with connectivity or my local machine.

The published dataset must authenticate using OAuth2.

Once the report is published, I select the ‘Female’ pie slice and the full report renders in approx. 4 seconds. This means the query generated by Power BI is sent to Azure Synapse Analytics, which uses SQL on-demand to query the multiple Parquet files in the data lake and return the data to Power BI for rendering.

I now again arbitrarily increase the number of files from 55 to 500.

Refreshing this new dataset, now containing 498,901 records, took 17 seconds.

Selecting the same ‘Female’ pie slice initially rendered the full report in approx. 35 seconds. And then in approx. 1 second after that. The same pattern is observed for the other slices.

I am now going to try and improve this performance.

Performance enhancements and side by side comparison

The performance noted above is okay considering the record volumes and the separation of stored data in my data lake and the compute services; but I want the performance to be substantially better, and I want to compare the performance with a competitor product (* note that the competitor product is not mentioned as the purpose of this article is a test drive of Azure Synapse Analytics SQL on-demand, and not a full scale competitor analysis).

To improve performance I followed two best practice guidelines: (a) I decreased the number of Parquet files the system has to contend with and, of course, increased the record volumes within each file; and (b) I collocated the data lake and Azure Synapse Analytics in the same region.

Tests 1 and 2 show the impact of the performance enhancements, whereas tests 3 and 4 represent my observations when the two competitors, i.e. Synapse in test 3 and the competitor in test 4, are compared side by side.

Test summaries

Test 1 – large number of records and files, not collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 500
Azure Data Lake Gen 2 region – Australia Southeast
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 17 seconds
Refresh on initial visual interaction – 35 seconds
Refresh on subsequent visual interaction – 1 second

Test 2 – large number of records, decreased numbers of files, not collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia Southeast
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 9 seconds
Refresh on initial visual interaction – 4 seconds
Refresh on subsequent visual interaction – less than 1 second

Test 3 – large number of records, decreased number of files, collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia East
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 3 seconds
Refresh on initial visual interaction – 2.5 seconds
Refresh on subsequent visual interaction – less than 1 second

Test 4 – large number of records, decreased number of files, collocated, Competitor product, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia East
Competitor analytics service region – Australia East

Results:
Initial refresh – 4 seconds
Refresh on initial visual interaction – 3 seconds
Refresh on subsequent visual interaction – less than 1 second

Performance Conclusion

The results above show that Azure Synapse performed best in a side-by-side competitor analysis – see tests 3 and 4.

We describe this as a side-by-side test as both Synapse and the compared competitor analytic services are located in the same Azure region as the data lake, and the same parquet files are used for both.

Cost observation

With the SQL on-demand consumption model, you pay only for the queries you run, and Microsoft describes the service as auto-scaling to meet your requirements. Running numerous queries across the steps described, over the course of three days, seemed to incur only very nominal query charges when analysing costs for the particular resource group hosting both the data lake and Azure Synapse Analytics.

I did initially observe higher than expected storage costs, but this, it turns out, related to a provisioned SQL Pool that had no relation to this SQL on-demand use case. Once that unrelated data was deleted, we were left with only the very nominal storage charge for the large record volumes in the Parquet files in the data lake.

All in all, a very cost-effective solution!

Conclusion

  • Getting up and running with Synapse SQL on-demand once data is loaded to the data lake was a very simple task.
  • I ran a number of queries over a large dataset over the course of five days. The observed cost was negligible compared to what would be expected with a provisioned consumption model provided by SQL Pools.
  • The ability to use T-SQL to query data lake files, and the ability to create a logical data warehouse provides for a very compelling operating model.
  • Access via Power BI was simple.
  • Performance was really good after the adjustments described in the “Performance enhancements and side by side comparison” section.
  • A logical data warehouse holds huge advantages compared to materialised data, as it opens up the concept to reporting over data streams, real-time data from LOB systems, increased design responsiveness, and much more.

Exposé will continue to test drive other aspects of Azure Synapse Analytics such as the Spark Pool runtime for Data Scientists and future integration with the Data Catalog replacement.

Azure Synapse Analytics – Insights for all and breaking down silos

(And a party down the lakehouse)

Cloud databases are a way for enterprises to avoid large capital expenditure: they can be provisioned quickly, and they can provide performance at scale. But data workloads continue to change, fast, which means conventional databases alone (including those running in data warehouse configurations) can no longer cope with this fast-changing demand.

Exposé have over the past few years written and spoken extensively about why conventional data warehousing is no longer fit for purpose (see http://blog.exposedata.com.au/2017/02/09/is-the-data-warehouse-dead/ and http://blog.exposedata.com.au/2018/11/07/databricks-cheat-sheet-1-concepts-business-benefits-gettings-started/ as examples). The future data warehouse must at least:

  • Be able to cope with data ranging from relational through to unstructured.
  • Be able to host data ingested in a latent manner (e.g. daily) as well as real-time streams, and everything in between.
  • Be able to host data in its raw form, at scale, and at low cost populated by extract and load (EL) or data streams.
  • Provide the mechanisms to curate, validate and transform the data (i.e. the “T” of ELT).
  • Be able to scale up to meet increasing demand, and back down during times of low demand.
  • Be able to integrate seamlessly into modern workloads that rely on the DW; these include AI, visualisations, governance and data sharing.

What is Azure Synapse Analytics?

Say hello to Azure Synapse Analytics now in public preview – https://aka.ms/Synapse_Insights4All

Microsoft describes Azure Synapse Analytics as a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/

It is the new version of Azure SQL DW and gives Microsoft a stronger competitor platform against AWS Redshift, Google BigQuery and Snowflake. For background, a comparison between these platforms can be found at https://gigaom.com/report/data-warehouse-cloud-benchmark/ – note that it was done for Synapse’s predecessor, Azure SQL DW.

No, what is it really?

Microsoft’s description of a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics” can be translated as two siblings that historically hosted two different types of data, i.e. highly relational data (the ‘enterprise data warehousing’ or SQL workloads) and everything else, including semi-structured and unstructured data (the ‘big data’ workloads in data lakes), unified in a workspace that allows the user to query and use both SQL/relational and big data with languages they are comfortable with (SQL, Python, .NET, Java, Scala and R). It breaks down the barriers between the DW and the data lake. Now that is HUGE (don’t act like you’re not impressed). Imagine that…a “lakehouse”.

This is shown conceptually in the image below.

Okay, so it’s SQL and Spark wrapped into a clever, unified, limitless compute workspace? No, it’s a bit more than that.

Firstly, it includes Data Integration: it not only unifies the differing data types (i.e. relational and big data), but it also includes the means to ingest and orchestrate this data natively inside Synapse (a capability called Data Integration) using Azure Data Factory, which has become so pervasive in the market. Does this mean batch data loading only? No, you can of course load your real-time data streams into your data lake using some kind of IoT Hub/Event Hub/Stream Analytics configuration, or achieve low-latency data feeds into your data lake using Logic Apps or Power Automate.

Secondly, it not only integrates with Power BI, it actually includes Power BI as part of Synapse. In fact, interactive Power BI reports and semantic models can be developed within the Azure Synapse Studio. Imagine the ability to quickly ingest both structured and unstructured data into your data lake, either move the data into SQL (the data warehouse) or leave it in raw form in the lake. Then you have the ability to explore the data using a serverless SQL environment whether the data resides in the data lake or in the data warehouse, and potentially do this all in a Direct Query mode. This not only reduces the Power BI model footprint and hands the grunt over to Synapse, but also allows for much more real time reporting over your data. https://azure.microsoft.com/en-au/resources/power-bi-professionals-guide-to-azure-synapse-analytics/

Thirdly, it integrates seamlessly with Azure Machine Learning for those who need to use data from the unified platform for predictive analytics or deep learning and share results back into the platform for wider reuse. It also integrates with Azure Data Share for a seamless and secure data sharing environment with other users of Azure.

Ah okay, so…

It unifies the DW and the data lake (real-time data, latent data, and data of any type) and it also brings Data Integration and Data Visualisation into that unified platform. It then seamlessly integrates with Machine Learning and Data Share. So, it’s SQL, Spark, ADF and Power BI all at the same party, or ahem…lakehouse 😊 where you can ingest, explore, prepare, train, manage, and visualise data through a single pane of glass. Yes, we are bursting with excitement too!

Let’s get technical

Let’s look at some of the technical aspects of Synapse:

  • Users can query data using either serverless on-demand compute or provisioned resources.
    • Serverless on-demand compute (technically this is called SQL on demand) allows you to pay per query and use T-SQL to query data from your data lake in Azure rather than provision resources ahead of time. The cost for this is noted as approximately $8.90 per TB processed – https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is still in Preview.
    • Provisioned resources, in line with the incumbent SQL DW data warehouse unit (DWU) regime, allow the user to provision resources based on workload estimates, but can be scaled up or down, or paused, within minutes (technically this is called SQL Pools) – see https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is in General Availability.
  • Azure Synapse Studio supports a user’s ability to ingest, explore, analyse and visualise data using a single sleek user interface, which is sort of a mix between the Azure Data Factory and Databricks UIs.
  • Users can explore the data using a refreshed version of Azure Data Explorer.
  • Users can transform data using both T-SQL (data engineers) and Spark Notebooks (data scientists).
  • On the security front, there is threat detection, transparent data encryption, always-on encryption, fine-grained access control via column-level and native row-level security, as well as dynamic data masking to automatically protect sensitive data in real-time.

Please see this important fact sheet and a list of capabilities in General Availability vs those in Preview – https://azure.microsoft.com/en-us/services/synapse-analytics/#overview

Also, please see our essential cheat sheet for the Synapse SQL on-demand test drive, where we put that exciting new service through its paces and help you get up and running quickly – http://blog.exposedata.com.au/2020/06/01/azure-synapse-analytics-the-essential-sql-on-demand-cheat-sheet/

What are the business benefits?

They are numerous, but in our humble opinion, and it must be noted that we do have extensive experience in data warehouses and modern data platforms, these are:

  • The lakehouse that unifies the Spark and SQL engines, PLUS the ability to query them through a single pane of glass, is something the industry has been asking for for a long time, as it breaks down data silos. As a result, it also breaks down skill silos: those familiar with SQL can continue using SQL, and those who prefer Python, Scala, Spark SQL or .NET can do so as well…all from the same analytics service.
  • The new serverless on-demand compute model allows users to use T-SQL to execute serverless queries over their data lake and pay for what they use. Coupled with the Provisioned Resources model, this gives customers multiple ways to analyse data, so they can choose the most cost-effective option for each use case.
  • Security, including column-level security, native row-level security, dynamic data masking, and data discovery and classification, is all included at no additional cost to customers.

How can Exposé help?

We are Australia’s premium data analytics company and have, since our inception, made sure we fully understand changes in the data analytics market so that we can continue to tailor our best-of-breed architectures and solutions to our customers’ benefit. Below are some of the highlights of our journey:

  • We were the first consultancy to coin the phrase “friends don’t let friends build old school and expensive data warehouses” – we were passionate about finding solutions that were truly modern and delivered the best ROI.
  • We were one of the first consultancies in Australia to understand the value that Databricks could play in modern data workloads, championed it, and facilitated one of the most high-profile solutions, which has won our client multiple awards.
  • We were selected as the runner-up for the Power BI Global Partner of the Year 2019 for the big data smart analytics solution we created for our customer, which embraced many leading-edge Azure services, culminating in a Power BI analytical and monitoring solution.
  • We went on to create a modular and industry agnostic Digital Twin product, built on Azure big data services, and bringing together Power BI & gaming engines in an immersive user experience that seamlessly ties into existing customer Azure investments.

It is this passion, our focus on R&D and you the customer, which makes us a good partner in your Synapse journey.

Digital Twin – Topical use cases

Being left behind in an increasingly digital world is a scary thought.  Gartner predicts that this year alone, organisations with digital trustworthiness will have a substantial advantage (of up to 20% increased revenue in online channels) compared to their competitors.  In our previous article, Digital Twins – Why all the fuss, we introduced the concept of digital twins and their benefits. This article describes five potential use cases of digital twin technologies, focused on providing real value to organisations and consumers.  Digital integrations, specifically with digital twin solutions, are predicted to be one of this decade’s biggest disruptors – let’s explore some practical use cases.

Improve utility service delivery by understanding, predicting and executing usage strategies in real time with a digital twin

Electricity is the most fundamental of utilities. It is a remarkably new concept in the context of human history (I know – this sounds strange – but the earliest evidence of humans harnessing electricity dates from the early common era, only about 5% of our existence as a species)! Despite its relative brevity, we have a fundamental reliance on power! Utility management is very important, particularly in South Australia, where we have a reliance on other states for our energy. So, how can service providers achieve better outcomes using a digital twin? Using our new-found definition, let’s explore this in the context of a digital model of a space – an electrical substation.

What type of data is available?

Without providing contextual value, all a digital twin is, is a digital model, so we need to understand the type of data available to our use case in order to understand its potential value. A substation contains sensors which measure electrical throughput. Subject Matter Experts (SMEs) also understand the thresholds associated with throughput for specific assets within the substation. All this information, along with historical supply and usage information, should be modelled and organised in a manner that provides contextual insight. Further to this, AI predictive models, trained on available historical data, can be integrated to provide recommended actions, warnings and decision support.

What’s the benefit?

Imagine you’re a service technician, responsible for maintaining service delivery to a region within South Australia.  Using your industry-customised digital twin such as the exposé Digital Twin, you can explore the substation digitally, in a fraction of the time, compared to actual exploration.  (Imagine not needing to travel to the site when remote working)!  Your digital twin ensures you’re provided with context sensitive information immediately and in real-time. 

As you are traversing the digital space, you notice that the electrical throughput on one of the transformers is slightly elevated (via visual and auditory cues).  Upon clicking on the resource, a map of available resistors denotes that a resistor has malfunctioned and is offline; the digital twin recommends that a service ticket is generated to resolve the issue.  You accept the recommendation, and the ticket is created.  Later that afternoon, the resistor is replaced, and no service loss is experienced. To extend this, composite digital twins allow you to understand and visualise the entire, or local network, rather than only providing contextual information for one distribution centre or substation.

Pathfinding in 2020: understand University library movements with digital twin technologies

Have you ever found yourself wandering the aisles of a library, supermarket, or department store aimlessly looking in “the obvious places” for something, only to find it twenty minutes later, in a location you never thought to look?  If you answered no, you’re either a liar, or a genius!  Improving customer experience is core to any business, so it is no surprise that organisations and educational facilities are exploring digital twin technologies to increase customer engagement.  Traditionally, it is next to impossible to understand customer movement patterns with a high degree of certainty or fidelity, but with advances in artificial intelligence, underpinned with digital twin visualisation techniques, the paradigm is shifting, and shifting quickly.

Gamifying User Experience

How a customer moves from Point A to Point B is important.  Let’s say, a customer, Janice, is researching the history of computer science, and needs to navigate to the reference section of the library from the front desk.  The time spent looking for the section she needs is directly proportional to her user experience. As such, you want to make her trip as easy and prompt as possible.  The better her experience, the better your score.  Realistically, Janice’s goal is arbitrary, and does not represent the goal of all customers, but each consumer indeed has a goal. 

A digital twin representation of your Library can help you visualise consumer movements in real time but should not stop there.  Fundamentally, the library should be designed in a way that it improves overall experience, and to achieve this, the digital twin provides context sensitive information on paths traversed by all consumers, as well as time spent in each section.

Setting the Score

By allowing Subject Matter Experts (SMEs) to define rules based on how customers are supposed to interact with areas, or assets within the Library, a digital twin can measure the effectiveness of specific areas and assets.  Resource footprints should then be rearranged, to improve user experience, based on the actionable insights recognised from the digital twin.

Bridging the Age Gap – Digital twin, an Aged Care Story

Retirement and Aged Care facilities are information rich ecosystems.  Physical and digital record keeping does not stop at pen, paper, or database; no doubt, these facilities do indeed have Big Data.  Internet of Things (IoT) devices measure and record many things: Temperature maintenance (both medical refrigeration and facility temperature), site and resident security (ins and outs) and medical administration, amongst many others.  The success of the facility depends heavily on proper record management and stringent policy adherence.

Human-proofing care

Humans are in no way perfect, and we make mistakes. That is what makes us human; we accept it, and in some ways we celebrate it. However, our mistakes can be problematic, and in a care setting, errors can literally be fatal. Taking a digital twin approach to Aged Care ensures that context-sensitive information is delivered at the right time, to the right people. Proper information delivery could mean the difference between an incident occurring and an incident averted.

Imagine, as an orderly completes their rounds, they navigate the facility on their tablet as they walk the halls.  The digital twin of the facility displays information pertaining to refrigerator and room temperatures, alleviating the requirement for manual investigation.  Interventions are also possible from within the application, improving service delivery and ensuring a safe environment for all, in a fraction of the regular time.

Front-foot tactics for Local Government Infrastructure with digital twin

Local Governments are the conduit of communities.  They provide services and infrastructure which improve our lives and societies as Australians.  In order to continue to develop in an efficient and cost-effective manner, digital twin implementations can provide tremendous value in assisting councils to understand their infrastructure, and the citizens they serve.  But why should councils use digital twins over traditional means?

Why digital twins?

Local Government budgets are not only tightly constrained, they also have a wide scope. The ability to generate actionable insights on infrastructure and citizens is invaluable. Imagine having the ability to visualise, in real time, the community’s use of new infrastructure, or a map view identifying houses whose rate payments are due, or being able to explore arterial and local roads with context-sensitive information around maintenance schedules. This sounds great, but the real benefit comes from being able to do all of this, and more, in a matter of minutes.

One Stop Shop – Actionable Insights via the exposé Digital Twin

Councils, utilities, care providers, retailers, manufacturers, property management companies, construction companies, engineering firms, universities, etc. are adopting and embracing digital twins right now!  Click here to find out how organisations are exploring digital twin technology to manage infrastructure, digitally, and how Exposé can assist your business in bridging the digital gap.

Digital Twins – Why all the fuss?

What do Twins, NASA, Pokémon Go, the Internet of Things, Big Data and business value have in common?

As our landscape changes to embrace digital integration, the physical and virtual worlds are moving closer together than ever before.

As flagged by Gartner, digital twins are pegged to be a key strategic technology trend for 2020 , and well into the following years too.  Since their inception in 2017, digital twins have enormous potential to create significant opportunity and also cause major disruption.

So, what is a digital twin?

Simply put, a digital twin is a virtual representation, a “twin”, of any physical-world object, space, asset, model or system, onto which the operations of that physical twin are projected.

It is immensely useful to anyone who needs to understand their physical world by performing analysis, gaining insights and performing simulation and modelling on a platform that acts as a replica twin of the physical twin.

It is so much more than a redundant copy of the physical twin—to use an analogy, in 2016, the world was introduced to Pokémon Go, an Augmented Reality video gaming experience; far different to anything most had previously witnessed. What Pokémon Go meant to gaming, is pretty much what digital twin means for methods of analysis, insights and modelling. In both Pokémon Go and digital twin, immersive augmented and virtual reality literally blends the physical world with the digital world, with the latter helping us gain a new understanding of the former in a way never possible before. In the case of digital twins, this immersive experience allows the user to be immersed as if in the physical space to conduct required analysis, test hypothesis, monitor, correct, etc. – all remotely.

Variants of digital twins are:

  • Composite digital twin – where data from multiple digital twins are aggregated for a composite view across a number of physical world entities such as a power plant or a city; and
  • Predictive digital twin, where machine learning puts our insights and understanding of the physical world on steroids!

Think:

There are endless use cases for digital twins across most industries, including local, state and federal government, utilities, universities, retail, manufacturing, defence, healthcare, aged care, construction, and so on. Our subsequent article, Digital twin – Topical use cases, will delve much further into some topical use cases, but here are a few:

  • A domain scientist who needs to understand the acoustics in the pipes of a water network so that pipe bursts can be predicted. This will provide huge cost savings and avoid reputational damage;
  • An urban planner who needs to maximise the amount of residential, business and recreational space with consideration of both pedestrians and vehicle access, movement and connectivity. This will help understand the integration of land use and transport needs; something all cities battle with;
  • Rostering analysts, for example, a residential care organisation that needs to understand the location of field staff and their tasks and skills in order to do more effective rostering and save time. Digital twin can assist in this way to save time and costs and ensure the right skills are at the right place, at the right time;
  • The environmental analyst who needs to monitor and decrease the organisation’s carbon footprint by monitoring CO2 emissions and power generated. This will help achieve carbon offset and ultimately reduce CO2 emissions;
  • An asset planner needs to analyse the performance of an asset, especially through the lens of past history such as servicing, faults, outputs, etc., as well as predicted performance, together with real-time monitoring of said asset. This will help optimise all aspects of the asset and ultimately extend its life, reduce life-cycle costs and ensure availability. Assets in this sense are by no means just machinery in a manufacturing plant but range from a pool pump in a leisure centre, through to an advanced diagnostic machine in a hospital, or the crane on a construction site;
  • Head of security at a large stadium needs to understand crowd volume and sudden negative sentiment changes where larger groups congregate. This will help proactively deal with crowd security issues immediately, before they get out of hand by moving security personnel around where they are mostly required;
  • An engineer conducting building information modelling (BIM) needs to simulate construction, logistics and fabrication sequences with the supply chain, and ensure the design takes people flow and emergency evacuations into account. This will help ensure an optimal and safe building emerges from construction;
  • A university librarian needs to understand the movement of students through the large university library. This will help achieve a better use and mix of space.

Who benefits?

We all do. As shown in the examples above, those benefiting from the superior insights gained from digital twins are not only the people who own, manage and operate the physical twins, but also us: the consumer (e.g. more targeted aged care), the citizen (a better and cleaner city) and the patient (e.g. more accurate diagnostics).

Why is it disruptive (‘Houston, we have a problem’)?

The concept of a twin created to understand another is certainly not new. NASA, in the 60s, used twinning ideas to create physically duplicated systems here on earth to match the systems in space, which allowed engineers on the ground to model and test possible solutions, simulating the conditions in space.

When Apollo 13’s lunar module ran into serious problems, such as rising carbon-dioxide that approached life-threatening levels, the engineers on the ground used the duplicates here on the ground to model and test theories and simulations so that they could instruct the astronauts, and eventually get the ill-fated crew of Apollo 13 back to earth alive.

The value of NASA’s replicas, and of the many replicas since then (motor vehicle design wind tunnels, mini wave and tidal pools, and so on), is undeniable. But of course, the NASA replicas, and those that followed, were physical, not digital.

With the advance of computing capacity and the Internet of Things (IoT), digital twins are now gaining traction across many industries. The physical mirrors can now be replaced with digital ones, and the pervasiveness and reduced cost of IoT means we can monitor what is happening with the physical twin in real time. Throw in machine learning, and all manner of additional insights and modelling are possible.

So IoT and artificial intelligence (AI) are the miracle mirror, right? Not really. AI augments human capabilities, but it does not replace them. As Henk van Houten, Executive Vice President and Chief Technology Officer, Royal Philips, states, “…it was human ingenuity that helped to bring the crew of Apollo 13 home – not technology alone” (https://www.philips.com/a-w/about/news/archive/blogs/innovation-matters/20180830-the-rise-of-the-digital-twin-how-healthcare-can-benefit.html). This means that a digital twin is not meant to be an unsupervised, fully intelligent expert system, but rather a platform where a human can analyse and model in order to gain the insight and understanding required. Even when predictive models through machine learning are included, domain subject matter experts must still form part of the analysis process due to their understanding of the physical twin.

In conclusion

Digital twins, as described here, enable users to analyse the physical world, with context sensitive information, without having to traverse the particular physical space (twin). The benefits of this are:

  • A location can be explored in a fraction of the time required for actual exploration;
  • Context-sensitive information is available immediately, and in real time;
  • Users can react immediately to their experience.

In our subsequent article, Digital Twins – Topical use cases, we delve much further into some topical use cases and show why organisations should seriously consider how digital twins could benefit them.

Our unique product, the exposé Digital Twin, is a quick-to-market, cost-effective version of this disruptive technology and provides a truly 360-degree view of your physical world through our highly interactive visual experience, revolutionising the way you interact with your world.

The STEM gender gap and how we can lead the way for the next generation

Authored by Emma Girvan and Sandra Raznjevic

In our continued partnership with St. Peter’s Collegiate Girls’ School, we were honoured to be invited to attend the inaugural Women in STEM Breakfast on Thursday 23 May, which was hosted by Year 10-12 students.

This fantastic event included over 30 female industry professionals who were invited as mentors. They were only too happy to share their own personal journeys and experiences as successful women working in STEM-related fields. We were all lucky enough to hear from three invited guest speakers: Sarah Brown (State Director, Code Like A Girl), Dr Kristin Alford (Director, MOD) and Dr Bronwyn Hajek (Lecturer/Researcher, University of South Australia). It was incredibly inspiring for all of us to hear how they have been able to navigate their respective careers.

We heard accounts of how women are still so dramatically underrepresented across all STEM studies and careers and as mentors for these young, impressionable and highly motivated students, we all felt a sense of responsibility to share and encourage them on their paths to success.

Whilst enjoying some of the culinary delights prepared by some of the Food Technology students (which they were being assessed on!), the girls had the opportunity to practise their networking skills and gain as many insights as they could from the industry mentors around the table. It was fascinating to hear the diversity of each of the girls’ passions and interests and what their career aspirations were. Amongst the girls, we met a budding Commonwealth Games Australian archer (maybe an Olympian one day if they change the rules), a software developer, a forensic scientist, a designer, an organ transplant surgeon and some professions which, to be honest, we had never heard of! For us, it was a great opportunity to dispel some of the misconceptions of what a career in IT looks like and how some of their interests and skills could be applied in the least likely of ways.

With research indicating fewer than one in five students enrolled in degrees in engineering, physics, mathematical sciences or information and communications technology (ICT) in Australia are women[1], and at a time when technology continues to transform the way we live, work and learn, the need to close the STEM gender gap is more critical than ever.

Women are lost at every stage of the professional ladder in STEM fields, due to a range of factors including stereotypes, discrimination, and workplace culture and structure[1], some of which manifest from the early school years.

Studies also indicate that role models can be used to both attract and retain women in STEM. Using women as role models has been found to be more effective in retaining women in STEM[2].

At Exposé, we can proudly boast that 40% of our staff are women, many of us with young daughters, including our own General Manager. We are very passionate about continuing our partnerships with colleges and universities, particularly in Adelaide, working with young women to provide mentorship and guidance to help steer them on a path which has traditionally seen girls gradually drop off the radar. So much so that we will be working with St. Peter’s Collegiate Girls’ School again this year to provide our special data analytics project, which encourages young women to think beyond the “nerdy” coding and help desk stigma associated with the IT industry.

After listening to the panel of speakers and chatting to a number of young girls who were keen to pick our brains on ‘a day in the life of a girl working in STEM’, the message at the end of the morning was clear – find something you love doing, then find a way to do it every day, and if you’re lucky enough you might even get paid well for doing it.

For all of us, invited guests included, it was a good reminder to keep giving new things a go, even if it puts you outside of your comfort zone, because you never know where it could lead! Moreover, it dawned on us that anyone raising young women is also (potentially) raising future mums. So, in the back of our minds, there will most likely be a period of our daughters’ lives where they will need to take some time out to create another human being, and potentially manage and nurture that human into adulthood. We need to support this in our industry, not only to entice young females to step over into the (not so) dark side, but to show them that they are supported. This is extremely relevant to Exposé currently, as our General Manager raises her new child whilst running the business.

[1] International Labour Organization. ABC of women workers’ right and gender equality. (International Labour Organization, Geneva, 2007)

[2] Drury, B. J., Siy, J. O. & Cheryan, S. When do female role models benefit women? The importance of differentiating recruitment from retention in STEM. Psychological Inquiry 22, 265 – 269, doi:10.1080/1047840X.2011.620935 (2011)

Exposé – the 2019 Microsoft Worldwide Partner of the Year, runner up in the category of Power BI

We are delighted to announce that Exposé has been named as the runner up for the 2019 Microsoft Global Partner of the year award in the category Power BI. We’re proud to add this global award to our two previous Microsoft Australian Partner Awards in 2016 and 2017. This achievement is no small feat given how young we are and is truly a testament to our talented and committed team; both in Adelaide and Melbourne.

The SA Water solution we submitted for this global award, followed our tried and tested best practice approach, ensuring a thorough understanding and delivery of business outcomes first, with the technology simply being the enabler. We constantly bend and push the envelope on technology, in this instance Power BI, to deliver the required outcomes, rather than make outcomes bend to technology.

“The modularity and scalability provided by the Power BI and the larger Azure platform allowed us to tailor something pretty unique to our customer and their challenging requirements. It allowed us to create a truly scalable, responsive and extendable IOT based analytical ecosystem that can be scaled out to thousands of devices, leveraging complex alarm rules controlled by users, visual remote monitoring and responsive actions, visual analysis, and now deep learning over the data.” Etienne Oosthuysen, National Manager, Technology and Solutions

“I am incredibly proud of our team for delivering a solution for SA Water which has not only over delivered on our customer’s requirements, but has now been recognised globally as a best of breed solution. Thank you to SA Water for trusting us with your data and allowing us to develop a forward thinking solution.” Kelly Drewett, General Manager

See the nominated SA Water solution case study here.

See a short video of the solution here.

Say that Again? Power BI Commentary extends to Reports

Power BI recently announced the extension of its commentary capability to Power BI reports. Yes, you can now add comments to both report pages and specific visuals to improve your data discussions!

These conversations are automatically bookmarked, so the report context is retained exactly as it was when the comment was written, complete with the original filters. Reporting by exception is embraced: anyone tagged with an @mention receives a push notification on their mobile device to alert them.

Whilst commentary is nothing new in BI tools – Power BI is a bit late to the game – it’s here now, and we’ve put it through its paces to see how it stacks up!

Backstory

The following exposé samples show the analysis for a retail organisation. The data, which updates hourly, is sourced from 3 different on-premises systems and modelled into a user-friendly sales model with a specific focus on Products, Customers and Suppliers & Export. The Head of Sales noticed an unusual spike in sales (in $ terms) back in April and created a comment for his sales managers to see. One of his sales managers picked up the comment and conducted the visual analysis, finding the reason for the spike. By retaining the conversation, anyone with access to the sales analysis can play back what was said and see the visual context of the discussion.

This saves staff time – they don’t need to rediscover the reason for what may well be a very common question.

In the sections below, we step through these events, culminating in our conclusions on this new functionality in Power BI.

Let’s have a look

The first set of images shows the 4 relevant visuals the Head of Sales would have initially looked at, either on his laptop or on his mobile phone. They analyse sales through the lenses of Product, Customer Country, Export (Supplier Country) and Sales (over time) respectively.

The Head of Sales picks up the unusual spike in April in the 4th visual, Product Sales. And he posts his first comment.

This comment is then picked up by one of the Sales Managers, who conducts some interactive analysis and subsequently responds to the Head of Sales. The Head of Sales is notified and clicks on the comment to see the full visual context – see how selecting the comment plays the visual back as it would have appeared when the comment was made, and spotlights the specific played-back visual, clearly showing the 4 products.

The Head of Sales now posts a further comment, asking for clarification as to where these 4 specific products are sold.

This specific Sales Manager (note I simply use one of our guest accounts to represent him) is notified of the comment and does further interactive analysis, and responds.

The Head of Sales is notified of the new comment and clicks on it to see the full visual context – selecting the comment again plays the visual back to what it would have looked like when the comment was made, and spotlights the specific played-back visual, clearly showing the 2 countries.

This now gives the Head of Sales enough context to understand what led to the spike. He or his delegate can now jump into Power BI and create a new visual from the user-friendly sales model that will continue to track and trend these 4 specific ‘focus’ products within the Germany and US ‘high volume’ markets. This shows them that the products are becoming popular and that they should invest in some additional marketing around those 4 products.

How this works

Using commentary requires no update or reinstall. Simply navigate to your report in the Power BI service and create comments. This can be done on the visuals themselves, after analysis has been done, so that the context is retained.

Or on the report page in totality.

In my sales example here, I used a combination of report-page and specific contextual visual commentary in my discussion. The comments pane shows all relevant comments, and selecting any one of them plays the report and its context back to the time the comment was made.

Conclusion

The new commentary capabilities are still object based, and not intimately linked to the data as commentary was in, for example, Business Objects, where a comment is written back to the solution based on the actual intersection of data (for example, a Sales Value of Product X for the 1st of January 2019, in Vancouver, Canada, by Mary Jackson). The difference, however, could be quite subtle, as Power BI does allow a comment on a visual in which the Sales Value has been filtered to Product X for the 1st of January 2019, in Vancouver, Canada, by Mary Jackson.

One of the main downsides of this object-based approach is that the commentary data itself remains inaccessible if you, for example, wanted to use it as raw, contextual, time-based data in its own right. Disclaimer: I say this data is inaccessible because I am unaware of where it would be stored or accessed; happy to be advised to the contrary.

The ability to play the report and visuals back to what they looked like when the comment was made is, however, a very nice feature: the reader can, as it were, “step back in time” and see what happened when the comment was made. This seems to be the case even as more data is appended to the model (in this case) on an hourly basis.

There is no workflow attached to the commentary; such workflow is quite common in financial reporting, where commentary and narrative undergo review and approval.

This feature is not available to public facing reports using the “Embed to Web” functionality. But if you’re interested in looking at the sample reports I used for this user story, they can be viewed and interacted with here.

Databricks: distilling Information from Data

In the first of this series of articles on Databricks, we looked at how Databricks works and the general benefits it brings to organisations ready to do more with their data assets. In this post we build upon this theme in the advanced analytics space. We also walk through an interesting example – biometric data generated by an Apple Watch – of how you might use Databricks to distill useful information from complex data.

But first, let’s consider four common reasons why the value of data is not fully realised in an organisation:

Separation between data analysis and data context.

Those who have deep data analytic skills – data engineers, statisticians, data scientists – often sit in their own specialised area within a business. This area is separated from those who own and understand the data assets. Such a separation is reasonable: most BAU data collection streams don’t have a constant demand for advanced analytical work, and advanced analytical projects often require data sourced from a variety of business functions. Unfortunately, success requires strong engagement between those who deeply understand the data and those who deeply understand the analysis, and this sort of strong engagement is difficult to moderate in practice.

We’ve seen cases where because of the short-term criticality of BAU work or underappreciation of R&D work, business data owners are unable to appropriately contribute to a project, leaving advanced analytics team members to make do. We’ve seen cases where all issues requiring data owner clarification are expected to be resolved at the start, and continued issues are taken as a sign that the project is failing. We’ve seen cases where business data knowledge resides solely in the minds of a few experts.

Data analysis requires data context. It’s often said “garbage in, garbage out”, but it’s just as true to say “meaningless data in, meaningless insights out”. Databricks improves this picture by encouraging collaboration between data knowledge holders and data analysts, through its shared notebook-style platform.

Difficulty translating analytical work to production workloads.

Investigation and implementation are two different worlds. Investigation requires flexibility, testing different approaches, and putting “what” before “how”. Implementation requires standards, stability, security and integration into systems that have a wider purpose.

A good example of this difficulty is the (still somewhat) ongoing conflict between the use of Python 2 and Python 3. Python 3 has now almost entirely subsumed Python 2 in functionality, speed, consistency and support. However, due to legacy code, platforms and standards within organisations, there are still inducements to use Python 2, even if a problem is better addressed with Python 3. This same gap can also be found in individual Python modules and R packages. A similar gap can be found in organisational support for Power BI Desktop versions. A more profound gap can be seen if entirely different technologies are used by different areas.

This could either lead to substantial overhead for IT infrastructure sections or substantial barriers to adoption of valuable data science projects. PaaS providers offer to maintain the data analysis platform for organisations, enabling emerging algorithms and data analysis techniques to be utilised without additional infrastructure considerations. Additionally, Databricks supports Python, R, SQL and Scala, which cover the major non-proprietary data analysis languages.

Long advanced analysis iterations.

The two previous issues contribute to a third: good advanced analyses take time to yield useful results for the business. By the time a problem is scoped, understood, investigated, confronted and solved, it may have changed shape, or been patched by business rules and process changes, enough that full solution implementation is no longer worth it. Improving the communication between data knowledge holders and data analysts, and shortening the distance between investigation and implementation, mean that the time between problem and solution is shortened.

What this means for your organisation is that the business will begin to see more of the benefits of data science. As confidence and acceptance grow, so does the potential impact of data science. After all, more ambitious projects require more support from the business.

Data science accepted as a black box.

Data science is difficult, uncertain and broad. This has three implications. Firstly, a certain number of unsatisfying results must be expected and accepted. Secondly, there is no single defensible pathway for addressing any given problem. Thirdly, no one person or group can understand every possible pathway for generating solutions. Unfortunately, these implications mean that data science practitioners can be in a precarious position when justifying their work. Many decision makers can only judge data science by its immediate results, regardless of the unseen value of the work performed. Unseen value may be the recognition of data quality issues or the appreciation of better opportunities for data value generation.

We don’t believe in this black box view of data science. Data science can be complicated, but its principles and the justifications within a project should be understood by more than just nominal data scientists. This understanding gap is a problem for an organisation’s maturity in the data science space.

Over recent years wide in-roads have been made into this problem with the rise in usage of notebook-style reports. These reports contain blocks of explanatory text, executable code, code results and mathematical formulas. This mix of functions allows data scientists to better expose the narrative behind their investigation of data. Notable examples of this style are Jupyter Notebooks, R Markdown, or Databricks.

Databricks enables collaboration, platform standardisation and process documentation within an advanced analytics project. Ultimately this means a decreased time between problem identification and solution implementation.

Databricks Example: Biometric Data

For demonstrating Databricks, we have an interesting, real data source: the biometrics collected by our watches and smartphones. You probably also have access to this kind of data; we encourage you to test it out for yourself. For Apple products it can be extracted as an XML file and mounted to the Databricks file system. Not sure how to do this? See our previous article.

Specifically, the data we have is from the watch and smartphone of Etienne, our national manager for technology. Our aim is to extract useful insights from this data. The process we will follow (discussed in the subsequent sections) is:

  1. Rationalise the recorded data into an appropriate data structure.
  2. Transform the data to be useful for its intended purpose.
  3. Visualise and understand relationships within the data.
  4. Model these relationships to describe the structure of the data.

Typically, advanced analytics in the business context should not proceed this way. There, a problem or opportunity should be identified first and the model should be in service of this. However here we have the flexibility to decide how we can use the data as we analyse it. This is a blessing and a curse (as we shall see).

Rationalisation

The process of converting the XML data into a dataframe could be overlooked. It’s not terribly exciting. But it does demonstrate the simplicity of parallelisation when using Databricks. Databricks is built on Apache Spark, an engine designed for in-memory parallel data processing. The user doesn’t need to concern themselves with how work is parallelised*, just with what they need done. Work can be described using Scala, Python, R or SQL. In this case study we’ll be using Python, which interacts with Spark via the PySpark API.

Since we’ve previously mounted our XML biometrics summary, we can simply read it in as a text file. Note that there are ways to parse XML files directly, but reading the file as text is an easier way to see what we’re working with.

We’ve asked Spark (via sc, a representation of “Spark Context”) to create a Resilient Distributed Dataset (RDD) out of our biometrics text file export.xml. Think of RDDs as Spark’s standard data storage structure, allowing parallel operations across a cluster of machines. In our case our RDD contains 2.25 million lines from export.xml. But what do these lines look like?
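
The original code was shown as a screenshot; as an illustration only, a minimal sketch of this step might look like the following, assuming the export was mounted at a hypothetical path such as /mnt/biometrics/export.xml:

```python
# Hypothetical mount path for the Apple Health export
xml_path = "/mnt/biometrics/export.xml"

# Create an RDD in which each element is one line of the XML file
lines = sc.textFile(xml_path)

# Reported roughly 2.25 million lines for this export
print(lines.count())
```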

A simple random sample of 10 lines shows that each biometric observation is stored within the attributes of a separate Record tag in the XML. This means that extracting the data into a tabular format should be quite straightforward: all we need to do is identify Record tags and extract their attributes. However, we should probably first check that all of our Record tags are complete.

We’ve imported re, a Python module for regular expression matching. Using this we can filter our RDD to find records that begin with “<Record” but are not terminated with “>”. Fortunately, it appears that this is not the case. We can also test for the case where there are multiple records in the same line, but we’ll skip this here. Next we just need to filter our RDD to Record tags.
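
A sketch of these checks, reusing the hypothetical lines RDD from the earlier sketch, might look like this:

```python
import re

# Lines that open a Record tag but are not terminated with ">" on the same line
unterminated = lines.filter(
    lambda l: re.match(r"\s*<Record", l) and not re.search(r">\s*$", l))
print(unterminated.count())  # 0 here, so each Record sits neatly on one line

# Keep only the Record tags for further processing
records = lines.filter(lambda l: re.match(r"\s*<Record", l))
```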

In both of these regular expression checks, I haven’t had to consider how Spark parallelises the operations. I haven’t had to think any differently from how I would solve this problem in standard Python. I want to check each record has a particular form, so I just import the module I would use normally and apply it in the PySpark filter method.

*Okay, not entirely true. Just like in your favourite RDBMS, there are times when the operation of the query engine is important to understand. Also like your favourite RDBMS, you can get away with ignoring the engine most of the time.

Transformation

We already have our records, but each record is represented as a string. We need to extract features: atomic attributes that can be used to compare similar aspects of different records. A record tag includes features as tag attributes. For example, a record may say unit=”cm”. Extracting the individual features from the record strings in our RDD using regular expressions is fairly straightforward. All we need to do is convert each record string into a dictionary (Python’s standard data structure for key-value pairs) with keys representing the feature names and values representing the feature values. I do this in one (long) line by mapping each record to an appropriate dictionary comprehension:
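
The original one-liner was a screenshot; a sketch of the idea, assuming the records RDD above and a regular expression over the tag attributes, could look like this (the subsequent conversion to a dataframe is included for completeness):

```python
import re
from pyspark.sql import Row

# Pull out key="value" pairs from each Record tag as a dictionary
attr_pattern = re.compile(r'(\w+)="([^"]*)"')
dicts = records.map(lambda r: {k: v for k, v in attr_pattern.findall(r)})

# The attributes we care about; anything else is simply dropped in this sketch
cols = ["creationDate", "startDate", "endDate", "sourceName", "type", "unit", "value"]
df = dicts.map(lambda d: Row(**{c: d.get(c) for c in cols})).toDF()

display(df)
```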

This has converted our RDD into a dataframe – a table-like data structure, composed of columns of fixed datatypes. By and large, the dataframe is the fundamental data structure for data science investigations, inherited from statistical programming. Much of data science is about quantifying associations between features or predictor variables and variables of interest. Modelling such a relationship is typically done by comparing many examples of these variables, and rows of a dataframe are convenient places to store these examples.

The final call to the display function in the above code block is important. This is the default (and powerful) way to view and visualise your data in Databricks. We’ll come back to this later on.

So we have our raw data converted into a dataframe, but we still need to understand the data that actually comprises this table. Databricks is a great platform for this kind of work. It allows iterative, traceable investigations to be performed, shared and modified. This is perfect for understanding data – a process which must be done step-by-step and is often frustrating to document or interpret after the fact.

Firstly in our step-by-step process, all of our data are currently strings. Clearly this is not suitable for some items, but it’s easily fixed.
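
A sketch of the casting, assuming the df dataframe from the earlier sketch and Apple Health’s timestamp layout (the exact format string is an assumption):

```python
from pyspark.sql import functions as F

# Apple Health timestamps look like "2017-06-01 07:15:00 +0930" (assumed format)
ts_format = "yyyy-MM-dd HH:mm:ss Z"

df_typed = (df
    .withColumn("creationDate", F.to_timestamp("creationDate", ts_format))
    .withColumn("startDate", F.to_timestamp("startDate", ts_format))
    .withColumn("endDate", F.to_timestamp("endDate", ts_format))
    .withColumn("value", F.col("value").cast("double")))

df_typed.printSchema()
```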

The printSchema method indicates that our dataframe now contains time stamps and decimal values where appropriate. This dataframe has, for each row:

  • creationDate: the time the record was written
  • startDate: the time the observation began
  • endDate: the time the observation ended
  • sourceName: the device with which the observation was made
  • type: the kind of biometric data observed
  • unit: the units in which the observation was measured
  • value: the biometric observation itself

Visualisation

So we have a structure for the data, but we haven’t really looked into the substance of the data yet. Questions that we should probably first ask are “what are the kinds of biometric data observed?”, and “how many observations do we have to work with?”. We can answer these with a quick summary. Below we find how many observations exist of each type, and between which dates they were recorded.
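
Such a summary could be produced along these lines (a sketch, reusing df_typed from above):

```python
from pyspark.sql import functions as F

summary = (df_typed
    .groupBy("type")
    .agg(F.count("*").alias("observations"),
         F.min("startDate").alias("firstObserved"),
         F.max("startDate").alias("lastObserved"))
    .orderBy(F.desc("observations")))

display(summary)
```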

We see that some of the measures of energy burned have the most observations:

  1. Active Energy Burned has over 650,000 observations between December 2015 and November 2018
  2. Basal Energy Burned has over 450,000 observations between July 2016 and November 2018
  3. Distance Walking/Running has over 200,000 observations between December 2015 and November 2018
  4. Step Count has about 140,000 observations between December 2015 and November 2018
  5. Heart Rate has about 40,000 observations between December 2015 and November 2017
  6. Other kinds of observations have less than 30,000 observations

This tells us that the richest insights are likely to be found by studying distance travelled, step count, heart rate and energy burned. We might prefer to consider observations that are measured (like step count) rather than derived (like energy burned), although it might be an interesting analysis in itself to work out how these derivations are made.

Let’s begin by looking into how step count might relate to heart rate. Presumably, higher step rates should cause higher heart rates, so let’s see whether this is borne out in the data.

I’ve chosen to convert the data from a Spark dataframe to a Pandas dataframe to take advantage of some of the datetime manipulations available. This is an easy point of confusion for a newcomer to PySpark: Spark and Pandas dataframes share a name but operate differently. Primarily, Spark dataframes are distributed, so they operate faster on larger datasets; Pandas dataframes, on the other hand, are generally more flexible. In this case we’ve restricted our analysis to a subset of the original data that is small enough for a Pandas dataframe to handle comfortably.
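
A sketch of pulling the two observation types into Pandas, assuming df_typed from earlier and Apple Health’s type identifiers (the identifier strings are assumptions):

```python
from pyspark.sql import functions as F

# Assumed Apple Health type identifiers for heart rate and step count
hr = (df_typed
      .filter(F.col("type") == "HKQuantityTypeIdentifierHeartRate")
      .select("startDate", "value")
      .toPandas())

steps = (df_typed
         .filter(F.col("type") == "HKQuantityTypeIdentifierStepCount")
         .select("startDate", "endDate", "value")
         .toPandas())
```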

Actually looking at the data now, one problem appears: the data are not coherent. That is, the two kinds of observations are difficult to compare. This manifests in two ways:

  1. Heart rate is a point-in-time measurement, while step count is measured across a period of time. This is a similar incoherence to the one in economics surrounding stock and flow variables. To make the two variables comparable we can assume that the step rate is constant across the period of time the step count is measured. As long as the period of time is fairly short this assumption is probably quite reasonable.
  2. Heart rate and step count appear to be sampled independently. This means that comparing them is difficult: at times where heart rate is known, step count is not always known, and vice versa. In this case we can restrict our comparisons to observations of heart rate and step rate that were recorded reasonably close together (see the sketch below).
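
One hedged way to reconcile the two, using the hr and steps Pandas dataframes from the sketch above and a nearest-timestamp join with an assumed two-minute tolerance:

```python
import pandas as pd

# Convert step counts to a step rate, assuming a constant rate over each period
duration_s = (steps["endDate"] - steps["startDate"]).dt.total_seconds()
steps["stepRate"] = steps["value"] / duration_s.clip(lower=1)  # steps per second

# Match each heart rate reading to the nearest step observation within 2 minutes
hr = hr.sort_values("startDate")
steps = steps.sort_values("startDate")
paired = pd.merge_asof(hr, steps[["startDate", "stepRate"]],
                       on="startDate", direction="nearest",
                       tolerance=pd.Timedelta("2min"))
paired = paired.dropna(subset=["stepRate"])
```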

Once we have some observations of heart rate and step rate, we can compare them:

On the vertical axis we have heart rate in beats per minute and on the horizontal axis we have pace in steps per second. Points are coloured so that older points are lighter, which allows us to see if there is an obvious change over time. The graph shows that Etienne’s usual heart rate is about 80 bpm, but when running it increases to between 120 and 180. It’s easy to notice an imbalance between usual heart rate observations and elevated heart rate observations – the former are much more prevalent.

There appears to be at least one clear outlier – the point where heart rate is under 40 bpm. There are also a small number of observations with a normal heart rate but an elevated pace, or vice versa – these may be artifacts of our imperfect reconciliation of step count and heart rate. We could feed this back to improve the reconciliation process or reassess the assumptions we made, which would be particularly useful with subject matter expert input.

The graph above shows the observations of step rate over time, with black indicating observations that have elevated heart rates. There are a few interesting characteristics – most obviously, observations are far more dense after July 2016. Also, rather alarmingly, there are only a small number of clusters of observations with elevated heart rates, which means that we cannot treat observations as independent. This is often the case for time series data, and it complicates analysis.

We could instead compare the progression of heart rate changes with pace by treating each cluster of elevated heart rate records as representative of a single exercise event. However, we would be left with very few events. Rather than attempt to clean up the data further, let’s pivot.

Transformation (Iteration 2)

Data doesn’t reveal all of its secrets immediately. Often this means our analyses need to be done in cycles. In our first cycle we’ve learned:

  1. Data have been collected more completely since mid-2016. Perhaps we should limit our analysis to only the most recent year, which also means we should perhaps not attempt to identify long-term changes in the data.
  2. Heart rate and step rate are difficult to reconcile because they often make observations at different times. It would be better to focus on a single type of biometric.
  3. There are only a small number of reconcilable recorded periods of elevated heart rate and step rate. Our focus should be on observations where we have more examples to compare.

Instead of step count and heart rate, let’s look at patterns in distance travelled by day since 2017. This pivot addresses each of the above issues: it is limited to more recent data, it focuses on a single type of biometric data, and it allows us to compare on a daily basis. Mercifully, distance travelled is also one of the most prevalent observation types in our dataset.

You’d be right to say that this is a 180-degree pivot: we’re now looking in an entirely different direction. This is an artifact of our lack of a driving business problem, and it’s something you should prepare yourself for if you commission the analysis of data purely for the sake of exploration. You may find interesting insights, or you may find problems. But without a guiding issue to address, there’s a lot of uncertainty about where your analysis may go.

Stepping down from my soapbox, let’s transform our data. What I want to do is to record the distance travelled in every hourly period from 8am to 10pm since 2017. Into a dataframe “df_x”, I’ve placed all distance travelled data for 2017:
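
The code itself was shown as a screenshot; a hedged sketch of it, reusing df_typed, Apple Health’s assumed identifier for walking/running distance, and simplifying the proration so that each record is allocated to the whole hour it starts in, might look like this:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Assumed Apple Health type identifier for walking/running distance
df_x = (df_typed
    .filter(F.col("type") == "HKQuantityTypeIdentifierDistanceWalkingRunning")
    .filter(F.year("startDate") == 2017))

# 1. A udf that keeps a value if positive, and returns zero otherwise
pos = F.udf(lambda x: float(x) if x is not None and x > 0 else 0.0, DoubleType())

# 2. Allocate each distance record to the whole hour (8am-10pm) it starts in,
#    creating columns hourTo9 up to hourTo22 (a simplification of the original proration)
hourly = df_x.withColumn("hr", F.hour("startDate"))
for h in range(9, 23):
    hourly = hourly.withColumn(
        "hourTo{}".format(h),
        pos(F.when(F.col("hr") == h - 1, F.col("value")).otherwise(F.lit(0.0))))

# 3. Aggregate all distances travelled into the calendar day they occurred
daily = (hourly
    .withColumn("day", F.to_date("startDate"))
    .groupBy("day")
    .agg(*[F.sum("hourTo{}".format(h)).alias("hourTo{}".format(h)) for h in range(9, 23)]))

display(daily)
```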

In the above we tackle this in three steps:

  1. Define a udf (user defined function) which returns the input number if positive or zero otherwise
  2. Use our udf to iteratively prorate distance travelled biometrics into the whole hour between 8am and 10pm that they fell into, naming these columns “hourTo9”, up to “hourTo22”.
  3. Aggregate all distances travelled into the day they occurred

This leaves us with rows representing individual calendar days and 14 new columns representing the distance travelled during each hour of the day.

Visualisation (Iteration 2)

This section is not just an exploration of the data, but also an exploration of Databricks’ display tool, which allows users to change the output of a code step without re-running that step. At the bottom of every output generated by the display command is a small menu:

This allows us to view the data contained in the displayed table in a graphical form. For example, choosing “Scatter” gives us a scatterplot of the data, which we can refine using the “Plot Options” dialogue:

We can use these plot options to explore the relationship between the hourly distance travelled variables we’ve created. For example, given a selection of hours (8am to 9am, 11am to 12pm, 2pm to 3pm, 5pm to 6pm, and 8pm to 9pm), we observe the following relationships:

Notice that long distances travelled in one hour of a day makes it less likely that long distances are travelled in other hours. Notice also that there is a fair skew in distances travelled, which is to be expected since the longest distances travelled can’t be balanced by negative distances travelled. We can make a log(1+x) transformation, which compresses large values to hopefully leave us with less skew:
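
The transformation applied to the hourly columns might look like this sketch, reusing the daily dataframe from above:

```python
from pyspark.sql import functions as F

# Compress large values with log(1+x) to reduce skew
logged = daily
for h in range(9, 23):
    col = "hourTo{}".format(h)
    logged = logged.withColumn(col, F.log1p(F.col(col)))

display(logged)
```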

The features we have are in 14 dimensions, so it’s hard to visualise how they might all interact. Instead, let’s use a clustering algorithm to classify the kinds of days in our dataset. Maybe some days are very sedentary, maybe some days involve walking to work, maybe some days include a run – are we able to classify these days?

There are a lot of clustering algorithms at our disposal: hierarchical, nearest-neighbour, various model-based approaches, etc. These perform differently on different kinds of data. I expect that there are certain routines within days that are captured by the data with some random variation: a set jogging route that occurs at roughly the same time on days of exercise, a regular stroll at lunchtime, a fixed route to the local shops to pick up supplies after work. I think it’s reasonable to expect on days where a particular routine is followed, we’ll see some approximately normal error around the average case for that routine. Because of this, we’ll look at using a Gaussian Mixture model to determine our clusters:
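
A sketch of such a clustering, using pyspark.ml against the (log-transformed) daily dataframe from the previous sketches:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import GaussianMixture

hour_cols = ["hourTo{}".format(h) for h in range(9, 23)]

# Assemble the 14 hourly features into a single vector column
assembler = VectorAssembler(inputCols=hour_cols, outputCol="features")
features_df = assembler.transform(logged.na.fill(0.0))

# Fit a Gaussian Mixture model with 4 components and label each day with its cluster
gmm = GaussianMixture(k=4, featuresCol="features", seed=42)
model = gmm.fit(features_df)
clustered = model.transform(features_df)  # adds a "prediction" column (0-3)
```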

I’ve arbitrarily chosen to cluster into 4 clusters, but we could choose this more rigorously. 4 is enough to show differences between different routines, but not too many for the purpose of demonstration.

The graph above shows the 4 types of routine (labelled as “prediction” 0-3) and their relative frequency for each day of the week. Notably, type 1 is much more prevalent on Saturday than on other days, as is type 3 for Sunday. Type 2 is much more typical of weekdays, appearing far less on weekends. This indicates that there is perhaps some detectable difference in routine between days of the week. Shocking? Not particularly. But it is affirming to see that the features we’ve derived from the data may capture some of these differences. Let’s look closer.

Above we have the actual profiles of the types of daily routines, hour-by-hour. Each routine has different peaks of activity:

  • Type 0 has sustained activity throughout the day, with a peak around lunchtime (12pm – 2pm).
  • Type 1 has sustained activity during the day with a local minimum around lunchtime, and less activity in the evening.
  • Type 2 has little activity during core business hours, and more activity in the morning (8am – 10am) and evening (5pm-7pm)
  • Type 3 has a notable afternoon peak (3pm – 6pm) after a less active morning, with another smaller spike around lunchtime.

If you were doing a full analysis you would also be concerned about the variability within and between each of these routine types. This could indicate that more routines are required to describe the data, or that some of the smaller peaks are just attributable to random variation rather than actual characteristics of the routine.

Finally, the visualisation above shows the composition of the daily routines over the course of a year, labelled by week number. The main apparent change through the course of the year is for routine type 2, which is more frequent during cooler months. This concords with what we might suspect: less activity during business hours in cooler, wetter months.

Taken together, perhaps we can use the hourly distance features to predict whether a day is more likely a weekday or a weekend. This model might not seem that useful at first, but it could be interesting to see which weekdays are most like weekends – perhaps these correspond with public holidays or annual leave?

Modelling

Let’s build a quick model to prove that weekends can be classified using hourly movement data alone. There are a lot of possible ways to approach this, and a lot of decisions to make and justify. As a demonstration we’ll create a single model here, but won’t refine it or delve too deeply into it.

Based on the types of routines identified in our cluster analysis, it’s fair to suspect that there may not be a monotonic relationship between the distance travelled in any particular hour and weekend/weekday membership. So rather than using the simplest classification model, logistic regression*, let’s fit a random forest classifier. First, we need to include a label for weekends and weekdays. I choose to call this “label” because, by default, this is the column name that PySpark’s machine learning module expects for classification.
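
Creating the label might look like this sketch, identifying weekends from the day of week (exactly how the original labelling was done is an assumption on our part):

```python
from pyspark.sql import functions as F

# dayofweek returns 1 for Sunday and 7 for Saturday in Spark SQL
labelled = (clustered
    .select("day", "features")  # drop the clustering outputs to avoid column clashes later
    .withColumn("label", F.when(F.dayofweek("day").isin(1, 7), 1.0).otherwise(0.0)))
```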

As usual, to allow us to check for overfitting, let’s separate the data into a training set and a test set. In this case we have unbalanced classes, so some might want to ensure we train on equal numbers of weekdays and weekends. However, if our training data has the same relative class sizes as the data the model will be generalised to, and overall accuracy is what matters, then unbalanced classes in the training data aren’t necessarily a problem.
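
The split itself is a one-liner; the 70/30 proportion below is an assumption:

```python
# Hold out roughly 30% of days as a test set
train, test = labelled.randomSplit([0.7, 0.3], seed=42)
```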

Now let’s prepare for model training. We’ll try a range of different parametrisations of our model, with different numbers of trees and different minimum numbers of instances per node. Cross-validation is used to identify the best model, where “best” is based on the BinaryClassificationEvaluator, which uses the area under the ROC curve by default.
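
A sketch of this setup, using pyspark.ml’s tuning utilities; the parameter ranges are illustrative assumptions:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Try different numbers of trees and minimum instances per node
grid = (ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])
    .addGrid(rf.minInstancesPerNode, [1, 5, 10])
    .build())

evaluator = BinaryClassificationEvaluator()  # area under the ROC curve by default

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)
```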

Fitting the model is then simply a matter of applying the cross-validation to our training set:
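
Along the lines of:

```python
cv_model = cv.fit(train)  # cross-validated fit over the parameter grid
```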

Finally, we can evaluate how successful our model is:
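
An evaluation along these lines reports the area under the test ROC curve and an overall accuracy (a sketch, reusing the evaluator above):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = cv_model.transform(test)

auc = evaluator.evaluate(predictions)  # area under the ROC curve on the test set
accuracy = (MulticlassClassificationEvaluator(metricName="accuracy")
            .evaluate(predictions))

print("Test AUC: {:.2f}, accuracy: {:.2f}".format(auc, accuracy))
```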

So our model performs reasonably on the test data, with an area under the test ROC curve of 0.86 and an overall accuracy of 0.82. This compares favourably to a null model that classifies every observation as a weekday, which would have an accuracy of 0.71. There are many more possible avenues to investigate, even within the narrow path we’ve taken here. This is a curse of exploratory analysis.

*To be fair, logistic regression can capture non-monotonicity as well, but this requires modifying the features (perhaps adding polynomial functions of the features).

Wrapping Up

Databricks gives us a flexible, collaborative and powerful platform for data science, both exploratory and directed. Here we’ve only managed to scratch the surface, but we have shown some of the features that it offers. We also hope we’ve shown some of the ways it addresses common problems businesses face bringing advanced analysis into their way-of-working. Databricks is proving to be an important tool for advanced data analysis.