Databricks: beyond the guff, business benefits and why businesses should care. Here's a cheat sheet to get you started

Search for info on Azure Databricks and you’ll likely hear it described along the lines of “a managed Apache Spark platform that brings together data science, data engineering, and data analysis on the Azure platform”. The finer nuances and, importantly, information about the business benefits of this platform can be trickier to come by. This is where our ‘cheat sheet’ comes in. It is the first of a series designed to assist you in deciphering this potentially complicated platform. Feel free to also read the second article in the series, on distilling information from data.

What is it?

Databricks is a managed platform in Azure for running Apache Spark. Apache Spark, for those wondering, is a distributed, general-purpose, cluster-computing framework. It provides in-memory data processing capabilities and development APIs that allow data workers to execute streaming, machine learning or SQL workloads—tasks requiring fast, iterative access to datasets.

There are three common data worker personas: the Data Scientist, the Data Engineer, and the Data Analyst. Through Databricks, they’re able to collaborate on big data projects and acquire, engineer and analyse data, wherever it exists, in parallel. The bigger picture is that they are therefore all able to contribute to a final solution which is then brought to production.

  • Databricks is not a single technology but rather a platform that can, thanks to all its moving parts, personas, languages, etc., appear quite daunting. With the aim of simplifying things, our cheat sheet starts with a high-level snapshot of the workloads performed on Databricks by our Data Scientist, Data Engineer and Data Analyst personas.
  • We’ll then look at some real business benefits and why we think businesses should be paying attention. Lastly, we’ll delve into two related workloads:
    • Data transformation, and
    • Queries for visual analysis.

Our subsequent cheat sheets will start to unpick the remaining workloads.

The image below shows a high-level snapshot of the workloads performed by our three data worker personas. The workloads in the coloured sections form (to varying degrees) the basis for the contents of our cheat sheet.

  • Data engineering forms, in our opinion, the largest cohort of workloads:
    • Data acquisition – i.e. how data is acquired for transformation, data analysis and data science using Databricks. This could potentially fall beyond the realms of Databricks because data can be leveraged from wherever it exists (for example Azure Blob or Azure Data Lake stores, Amazon S3, etc.), and data may already be hosted in those stores as a result of some preceding ETL process. Databricks can, of course, also acquire data.
    • Data transformation – discussed later in this article, focussing on the ETL processes within Databricks (ETL within).
  • Data analysis takes on two flavours:
    • Queries – these could overlap heavily with the world of the Data Scientist, especially if the languages used are Python or R and if the intent is machine learning and predictive analytics. But Data Analysts could, of course, also perform queries for ‘on the fly’ data analysis.
    • Queries for visual analysis – queries are also performed to ready data for visual analysis. This is discussed later in this article; however, it must be noted that the lines between this kind of query and the data transformation performed by the Data Engineer can become very blurred. This in itself illustrates the collaborative and parallel way of working that Databricks allows.
  • Data science has machine learning and associated algorithms, with predictive and explanatory analytics as the end goal. Here too, queries are performed, and the lines are similarly blurred with the queries performed by the Data Analyst and the Data Engineer.
  • Underpinning all of this are the workloads involved in moving the solutions to production states.

These workloads are logical groupings only, aimed at clearing what could otherwise be muddy waters to the untrained eye. Queries may, for example, be performed, then used for transformations, data science and visual analysis.

So, without any further ado, let’s look at why businesses should be watching Databricks very closely!

Why Databricks? – Beyond the guff, business benefits and importantly, why businesses should care

If you search on Google for ‘Apache Spark’ you’ll find loads of buzzwords – “open-source”, “distributed”, “big data”, etc. At first glance, this can look like marketing babble, completely removed from a business’s actual data challenges. So let’s dispense with the buzzwords and focus on the business challenges.

Note also: although Apache Spark (and therefore Databricks too) is positioned in the big data camp, its application is not limited to big data workloads. So, if some of the challenges we list below apply to your data landscape (big data or not), read on.

Time to market

Challenge – data warehouses take too long to deliver business benefit

Benefit – Databricks is naturally geared towards agility through its ability to support parallel collaboration, which, in turn, leads to improved responsiveness to change. This means that the time it takes to deliver data workloads is reduced

Parallel collaboration rather than seriality

Challenge – participants in the data solution process are too dependent on each other completing their tasks before they can contribute their own. These challenges are the result of serial workloads

Benefit – parallel collaboration delivers maximum agility. It means that the three main data personas, i.e. the Engineer, the Scientist and the Analyst, can collaborate in parallel on the data elements that will form part of a final data deliverable. As the Engineer acquires the data, the Analyst, the Scientist and indeed the Engineer all start contributing, in parallel, to the logic that transforms and manipulates the data. This, in turn, reduces the time it takes for solutions to get to market

Responsiveness and nimbleness

Challenge – companies change, requirements change, and business may not know exactly what they want or need from data that is stored in a variety of formats in different locations

Benefit – companies frequently generate thousands of data files, hosted in diverse formats such as CSV, JSON and XML, from which analysts need to extract insights

The classic approach to querying this data is to load it into a central data warehouse. But this involves the design and development of databases and ETL. This works well but requires a great deal of upfront effort, and the data warehouse can only host data that fits the designed schema. This is costly, time-consuming and difficult to change.

With the data warehouse approach, insights can only be extracted after the data is transformed upon load.

Databricks presents a different approach and allows insights to be extracted and transformed upon query from vast amounts of data stored cheaply in its native format (such as XML, JSON, CSV, Parquet, and even relational database and live transactional data) in Blob Stores. With Databricks, data is read directly from the raw files, and by using SQL queries, data is cleansed, joined and aggregated – hence the term transform upon query.

Transforming the data each time a query runs means this approach is much more geared towards quick turnaround and is more responsive to change. But it requires superior performance.
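The principle can be sketched outside Databricks with nothing more than the Python standard library (a stand-in illustration only; in Databricks the same idea would be expressed in Spark SQL against files in Blob storage, and the file names and values below are hypothetical):

```python
import csv
import glob
import os
import tempfile
from collections import defaultdict

# Two raw CSV drops, as they might land in a blob store from a device feed
raw_dir = tempfile.mkdtemp()
with open(os.path.join(raw_dir, "day1.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["user", "steps"], ["june", "4200"], ["june", "3800"]])
with open(os.path.join(raw_dir, "day2.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["user", "steps"], ["june", "5100"]])

def max_steps_per_user(path):
    """Transform upon query: read the raw files and cleanse and
    aggregate on the fly, rather than loading into a fixed schema first."""
    result = defaultdict(int)
    for name in glob.glob(os.path.join(path, "*.csv")):
        with open(name, newline="") as f:
            for row in csv.DictReader(f):
                steps = int(row["steps"])  # cast (cleanse) at query time
                result[row["user"]] = max(result[row["user"]], steps)
    return dict(result)

print(max_steps_per_user(raw_dir))  # {'june': 5100}
```

If a new file with extra columns arrives tomorrow, only the query changes; there is no warehouse schema to redesign.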

Performance

Challenge – workloads (such as queries) serving analytics and data science are run often and transform the data each time the query runs (transform upon query). Logic dictates that this will not perform as well as data that is transformed once upon load and then materialised for reuse.

Benefit – Databricks provides a performant environment that handles the transform upon query paradigm. This is done by utilising a variety of mechanisms, such as:

  • Databricks includes a Spark engine that performs better thanks to various optimisations at the I/O layer and processing layer:
    • For example, Spark clusters are configured to support many concurrent queries and can be scaled to handle increased demand.
  • It includes high-speed connectors to Azure storage (i.e. Azure Blob and Azure Data Lake stores).
  • It uses the latest generation of Azure hardware (Dv3 VMs), with NVMe SSDs capable of even faster I/O performance.

A managed big data (or in our opinion, all data) platform

Challenge – The data landscape is becoming increasingly complex and fragmented and costly to maintain.

Benefit – “Databricks is a managed platform (in Azure) for running Apache Spark – that means that you neither have to learn complex cluster management concepts nor perform tedious maintenance tasks to take advantage of Spark. Databricks also provides a host of features to help its users to be more productive with Spark. It’s a point and click platform for those that prefer a user interface, such as data scientists or data analysts.” – https://docs.databricks.com/_static/notebooks/gentle-introduction-to-apache-spark.html

Not just Azure Blob Storage – access data where it lives

Challenge – Data is not necessarily stored in Azure Blobs

Benefit – Databricks connections are not limited to Azure Blob or Azure Data Lake stores; it also connects to Amazon S3 and other data stores such as PostgreSQL, Hive and MySQL, Azure SQL Database, Azure Event Hubs, etc. via JDBC (Java Database Connectivity). So, you can immediately start to benefit from the cost, flexibility and performance benefits offered by Databricks for your existing data.

Cost of the cluster

Challenge – Big data solutions tend to cost a lot of money

Benefit – The Databricks File System (DBFS) is a layer over your data (where it lives) that allows you to mount the data, making it available to other users in your workspace and persisting the data after a cluster is shut down. Data is mounted, not synced, which means you do not pay twice for storage.

When a Databricks cluster is shut down (which also happens automatically after an interval of inactivity that you specify), it stops costing you money, so you only pay for what you use.

Furthermore, Azure Databricks leverages the economies of scale provided by Azure. Analysis workloads (interactive workloads for analysing data collaboratively with notebooks) on a Premium F4 instance (4 virtual CPUs and 8 GB RAM) running 24 x 7 will, for example, only cost you $380 per month. And data engineering workloads (automated workloads for running fast and robust jobs via API or UI) on the same tier will, for example, only cost you $307 per month.

*Note that the pricing above is in AUD and is an estimate only as per the Azure Pricing Calculator.

Australian region

Challenge – some big data solutions, such as the first generation of Azure Data Lake, are not available in the Australian regions as at the date of first publication of this article

Benefit – Databricks can be provisioned in the following Australian regions:

  • Australia Central
  • Australia Central 2
  • Australia East
  • Australia Southeast

Like everything, there are some downsides/realities to consider

SQL, R, Python, Scala – can be daunting

SQL has become the “lingua franca” for most Data Engineers and Data Analysts, whereas the same applies to R and Python for Data Scientists. These personas collaborate on Databricks using notebooks as interfaces to the data, which allows them to create runnable code, visualisations and narrative.

Suddenly these personas gain visibility over the code from other personas in the same notebook. As notebooks can consist of multiple languages, this can seem quite daunting to personas unfamiliar with languages they have not previously used, especially considering that the languages used in Databricks, i.e. R, Python, Scala and SQL, each have their peculiarities.

Obviously this is only an issue if you are unfamiliar with such an environment. For those with good coverage of SQL, R, Python and Scala, this is a benefit, as they can easily work with multiple languages in the same Databricks notebook; i.e. each persona can use their preferred language irrespective of the choices of the other personas. All that needs to be done is to prepend the cell with the appropriate magic command, such as %python, %r or %sql.

From another viewpoint, however, this diversity of languages can be a strength in the right business environment: the shared workflow naturally dissipates technical debt and encourages capability sharing.

Learning curve

There will often be a requirement for personas to become more familiar with a broader set of languages and with the notebook environment, to make it easier to follow what is happening across the whole notebook. This makes for easier collaboration and is in line with a move from purely serial to more parallel workloads.

Case study – Data Transformation and Visual Analysis

The use case described in this section is used as a vehicle for a more technical deep dive into the workloads shown in the coloured sections of the Databricks Workflow image above (i.e. Data Transformations and ETL within Databricks, and Queries for visual analysis).

Our use case – IoT and wearable devices, such as Apple Watches, are currently under a substantial spotlight, as there is a lot of interest in what can be gleaned from the data they produce (see our article, June’s story, as an example – http://blog.exposedata.com.au/2018/09/03/artificial-intelligence-in-aged-care-junes-story/). In our use case, Apple Watch data is brought into Azure, the datasets are mounted to Databricks, ETL processes then transform and load the data, and finally queries are performed.

An Apple Watch is used to generate data we will use in this user story. An app on the watch integrates with Azure and streams some data into Azure Blob Storage (this app and stream are not within the scope of this article as Data Acquisition will be discussed in a subsequent article).

The data manifests itself as CSV files in Azure Blob store > Container:

Data Engineering > Data Transformation > ETL within


This section assumes that data is already available in an appropriate store for mounting (in this case Azure Blob store). We notionally call the next steps “ETL within Databricks”, as they represent a logical ETL that extracts and validates the data, applies a schema, then loads the data ready for use (for example, for analytical querying). ETL within Databricks should not be confused with the ETL used to get data into Azure in the first place (which will be discussed in a subsequent article).

ETL within Databricks is conceptually the same as the ETL concept we know from conventional BI workloads, in that you first extract the data, then transform it, and then load it. But it is done in a much nimbler fashion, and it adheres to the notion of transforming data upon query, rather than upon load.

The common steps associated with our two workloads, i.e. ETL within Databricks and queries to ready the data for visual analysis, are shown visually in the image below:

Remember that Spark is the engine used by Databricks, and SQL, Scala, Python, R and Java use that engine to perform the various workload tasks.

In the sections below, we will first mount our Apple Watch data (the extract step). We will then transform the data and load it into a table using SQL (the amber route shown above), and create a data frame and load it as a Parquet file (the green route above). Later we will deal with analysis of the loaded data, readying it for, for example, visual analysis. For now, let’s focus on the ETL.

The queries shown in each step below are examples of what could be done and should give the reader a starting point from where to build more complicated ETL within Databricks and subsequent queries. Databricks is a massively flexible platform, so the sample queries may be made much more complex or approached in an entirely different way.

Extract

In the first step we mount the data held in our Azure Blob store to the Databricks File System (DBFS). This represents the “Mounted Stores in DBFS” step in the image above (we are not focussing on the JDBC step in this use case).

We first generated a SAS URL for the Azure Blob store to use as a variable, then used it in the query.
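As a hedged sketch, the mount cell might look like the one below (this only runs inside a Databricks notebook, where `dbutils` is available; the storage account “expostore”, container “applewatch” and token value are illustrative placeholders, not the names used in our environment):

```python
# SAS token generated for the Azure Blob store (placeholder value)
sas_token = "?sv=..."

dbutils.fs.mount(
  source = "wasbs://applewatch@expostore.blob.core.windows.net",
  mount_point = "/mnt/applewatch",
  extra_configs = {
    "fs.azure.sas.applewatch.expostore.blob.core.windows.net": sas_token
  })
```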

Mounting means creating a pointer to the store, which means that the data never actually syncs. The mount point is simply a path representing where the Blob Storage container, or a folder inside the container, is mounted in DBFS.

Optional – We may quickly validate the mount by running the following query to see the contents of the mount point.

Optional – We lastly validate the data in any of the files within our mount by looking at the content of any of the files within our mount point.
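Both optional validation steps can be done with `dbutils` in the notebook (the mount path and file name below are hypothetical examples):

```python
# See the contents of the mount point
display(dbutils.fs.ls("/mnt/applewatch"))

# Look at the content of one of the files within the mount
print(dbutils.fs.head("/mnt/applewatch/applewatch_20180901.csv"))
```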

Transform

As per the Transform steps, there are two options: a SQL path (shown in amber) and a Scala/Python/R/Java path (shown in green). The reader can jump to the Scala/Python/R/Java path if wanting to bypass the SQL sections, which may already be familiar to many.

Transform and Load using SQL (Option A)

We use SQL to create a table in DBFS which will “host the data” via metadata, then infer the schema from the files in our Azure Blob store container. Note that the schema can be explicit rather than inferred. In our use case all our files have the same structure, so the schema can be inferred; in cases where structures differ, standardisation queries will precede this step.
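A cell along the following lines would create such a table (a sketch only; the table name and mount path are illustrative, and the same statement can be run directly in a %sql cell rather than via spark.sql):

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS apple_watch_raw
  USING CSV
  OPTIONS (
    path "/mnt/applewatch/",
    header "true",
    inferSchema "true"
  )
""")
```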

It is worth noting that in Databricks a table is a collection of structured data. Tables in Databricks are equivalent to Data Frames in Apache Spark.

Optional – We can now perform all manner of familiar SQL queries. It is also worth noting that data can be visualised on the fly using the options in the bottom left corner. In the first example, we review the data we have just loaded; in the second, we do a simple record count.
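For example, assuming a table named apple_watch_raw over the mounted files (the name is illustrative):

```python
# Review the data we just loaded
display(spark.sql("SELECT * FROM apple_watch_raw LIMIT 10"))

# A simple record count
display(spark.sql("SELECT COUNT(*) AS record_count FROM apple_watch_raw"))
```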

Transform and Load using Scala (Option B)

Tables are familiar to any conventional database operator. Let’s now extend this concept to include Data Frames. A Data Frame is essentially the core transformation layer in this alternative ETL path – it is a dataset organised into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimisations under the hood. Data Frame code follows a “spark.read.option” pattern.

In the next query, we read the data from the mount and infer the headers (we know that all our files have the same format, so no preceding column standardisation is required). We select only certain columns of value to us, and we rename the columns as a subsequent step, as Parquet does not allow restricted characters such as “(” and “,” in column names.

We lastly load the data into a Parquet file in DBFS. Whilst blob stores like AWS S3 and Azure Blob are the data storage options of choice for Databricks, Parquet is the storage format of choice. Parquet files are highly efficient, column-oriented files that show massive performance increases over other options such as CSV. For example, Parquet compresses data repeated in a given column and preserves the schema on write.
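These steps can be sketched as follows, shown here in PySpark rather than Scala (the pattern is the same across languages, and the column names are hypothetical stand-ins for the Apple Watch export):

```python
# Read from the mount, inferring headers and types
df = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/mnt/applewatch/"))

# Keep only the columns of value and rename away characters
# that Parquet restricts, such as "(" and ","
curated = (df.select("Start", "Steps (count)", "Heart Rate (count/min)")
             .withColumnRenamed("Steps (count)", "Steps")
             .withColumnRenamed("Heart Rate (count/min)", "HeartRate"))

# Load into a Parquet file in DBFS
curated.write.mode("overwrite").parquet("/mnt/applewatch/curated/applewatch.parquet")
```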

Queries for Visual Analysis

Once we have Extracted, Transformed and Loaded the data we can now perform any manner of query-based analysis. We can for example query the Parquet file directly, or we can create a table from the Parquet file and then query that, or we can bake the final query into the Table create.

Let’s first query the Parquet directly:

Now let’s create a table from its metadata, which can then be used by BI tools such as Power BI.

In the final query, we query the table and prepare the data for visual analysis in something like Power BI. We select the maximum number of steps taken by our Apple Watch wearer per day (we only loaded two days’ worth of data).
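The three variants could be sketched as follows (paths, table and column names are illustrative):

```python
# 1. Query the Parquet file directly
display(spark.read.parquet("/mnt/applewatch/curated/applewatch.parquet"))

# 2. Create a table over the Parquet file, usable by BI tools such as Power BI
spark.sql("""
  CREATE TABLE IF NOT EXISTS apple_watch
  USING PARQUET
  LOCATION "/mnt/applewatch/curated/applewatch.parquet"
""")

# 3. Maximum steps per day, ready for visual analysis
display(spark.sql("""
  SELECT to_date(Start) AS Day, MAX(Steps) AS MaxSteps
  FROM apple_watch
  GROUP BY to_date(Start)
  ORDER BY Day
"""))
```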

We will, in subsequent articles, introduce many of the other workloads associated with Databricks, building on the concepts we used in this article.

Author: Etienne Oosthuysen; Contributor: Rajesh Kotian

Young, female and paving the way for technology in South Australia

We recently joined forces with St. Peter’s Girls’ Collegiate School, facilitating an 8-week data and analytics project for Year 11 students.  This exercise provided the girls with real-world skills development but most importantly, opened their eyes to what a career in IT can look like, dispelling some of the misconceptions in the process.

View the article here: St Peters Girls DA Project

Exposé team members working with the St Peter’s Girls were Andrew Exley, Etienne Oosthuysen, Kelly Drewett and Trevene Leonard.

Gov Hack 2018 – Our Winning Emergency Response Exposed

This year we got a team together and entered the 2018 Gov Hack competition.  Over the course of 46 hours, we built a solution that brings together fragmented datasets, some of which are listed below, in an Emergency Response solution, adhering to the spirit of Gov Hack by showing the power of Open Data.

The team consisted of Andrew Exley, Cameron Wells, Etienne Oosthuysen, Jake Deed and Jean-Noel Seneque.

See a short summary of our journey and a condensed version of our video submission here:

The solution contains:

  • The architecture and data platform that allows the datasets to be ingested in a periodic and in a real time manner, stored, and blended to serve a variety of emergency related user stories. 
  • A user interface that can be accessed from anywhere (PC or mobile phone) and allows for real time tracking of emergency events vis-a-vis points of interest (such as your home), the nearest point of safety, rolling social media coverage of the event, other points of interest to assist emergency services respond (such as bodies of water for water bomb runs, traffic and congestion, helipads or airports, etc.)
  • A platform by which data can be analysed for trends by analysts working for the emergency services.

Some examples of the datasets:

  • G-NAF (Geocoded National Address File), which is one of the most ubiquitous and powerful spatial datasets. It contains a full geo-spatial description of each address (including the state, suburb, street, number and coordinate reference (or “geocode”) for all street addresses in Australia). This forms the basis of the location of people or places, and the distance of people to places, such as your home to a point of safety during an emergency.
  • Twitter and sentiment, especially during emergency events. This helps determine sentiment during an event, such as the inherent urgency during an emergency.
  • Dams (Angus Catchment) by the Department for Environment and Water in South Australia. The dataset contains polygon data outlining the physical extent of dams and estimated dam capacity (volume range in megalitres). This forms the basis of water bomb runs in the case of a bushfire emergency.
  • Statistical Area Level 1 (SA1) by the Australian Bureau of Statistics. This is used in combination with incidents and statistical population data to estimate the number of people affected, or likely to be affected, by an incident.
  • Country Fire Service of South Australia live incident feed. This forms the basis of identifying when emergencies occur.

Artificial Intelligence in Aged Care – June’s story

Meet June; long time Adelaidean, keen gardener and grandmother of twelve!  At 86 years ‘young’, June moved from her own home into a local aged care facility following a series of falls that saw her hospitalised over the summer.  June was diagnosed with Parkinson’s disease 18 months ago and following an increasing number of falls, June and her family made the decision to move her into residential care.

As symptoms of Parkinson’s disease progress at different rates for different people, getting June’s treatment plan right has been tricky, complicated by the fact that like many aged care residents, she requires several different medications to manage her health.  June and her carers have noticed that her tremors appear to be triggered by stress or emotional experiences and lessen when she is relaxed.  It also appears that regular exercise and engagement in leisure activities aid in keeping June’s tremors at bay.  As tremors often lead to lack of balance, which is likely to result in a fall, June’s care team have put together a robust healthcare plan which includes regular activity and time spent outdoors on top of her medication and occupational therapy.

The aged care facility where June lives recently embarked upon an initiative with the goal of improving the overall response to incidents such as falls, ensuring that responses are timely and that any incidents are attended to by the correct staff.  CCTV cameras have been installed in the corridors on the higher dependency floors, such as the one June lives on.  The CCTV is used to track residents’ movements via location tracking as well as emotions via facial recognition.  Residents of these sections have also been given smart devices to wear that track real-time data such as number of steps taken, standing vs walking rate and heart rate.

When dealing with personal data, it is of paramount importance to ensure its security.  Additional precautionary measures will be taken to ensure the security of June’s personal data so that it will be accessed for authorised purposes only.  Steps need to be taken so that June’s personal data is not shared or used for any commercial gain, for example, as a way to categorise June, possibly affecting her insurance premiums based on her risk as a patient.

Given the knowledge we have around the impact of stress on the incidence of tremors, the data from the CCTV coupled with June’s smart device will trigger an alert to the team lead in charge of her zone, should the variables compute to show an increased likelihood of stress.  The team lead is then able to ensure not only that there are sufficient carers positioned in high risk zones, that they are also equipped to deal with a possible fall. Furthermore, the wearable device shows the care team when June is outside and how much sunlight – linked to positive mental health – June is getting.  The data also enables the team to see links between steps and heart rate.  If it is found, as an example, that steps are going down and heart rate increasing, this could be a sign of a potential health issue, which would enable the appropriate medical intervention to happen proactively.

This scenario illustrates a proactive solution that benefits June and other residents in terms of the level of care they receive, not only through better response to incidents but in helping to prevent incidents happening in the first place.  At an organisational level, management also get insights that assist them in planning and resourcing more effectively as well as the ongoing process improvements brought about by machine learning.

Stay tuned for a follow up instalment as we explore the technical aspects of the business case!

Author: Sophia Siegele; Contributor: Shishir Sarfare

Artificial Intelligence and Occupational Health and Safety – AI an enabler or a threat

We increasingly hear statements like, “machines are smarter than us” and “they will take over our jobs”. The fact of the matter is that computers can simply compute faster, and more accurately, than humans can. So, in the short video below, we instead focus on how machines can be used to assist us in doing our jobs better, rather than viewing AI as an imminent threat. It shows how AI can assist in better occupational health and safety in the hospitality industry. It does, however, apply to many use cases across many industries, and positions AI as an enabler. Also see an extended description of the solution after the video demo.

Image and video recognition – a new dimension of data analytics

With the introduction of video, image and video streaming analytics, the realm of advanced data analytics and artificial intelligence just stepped up a notch.

All the big players are currently competing to provide the best and most powerful versions: Microsoft with Azure Cognitive Services APIs, Amazon with AWS Rekognition, Google with Cloud Video Intelligence, and IBM with Intelligent Video Analytics.

Not only can we analyse textual or numerical data, historically or in real time, we’re now able to extend this to videos and images. Currently, there are APIs available to carry out these conceptual tasks:

  • Face Detection
    • Identify a person from a repository/collection of faces
    • Celebrity recognition
  • Facial Analysis
    • Identify emotion, age and other demographics within individual faces
  • Object, Scene and Activity Detection
    • Return objects the algorithm has identified within specific frames, e.g. cars, hats, animals
    • Return location settings, e.g. kitchen, beach, mountain
    • Return activities from video frames, e.g. riding, cycling, swimming
  • Tracking
    • Track the movement/path of people within a video
  • Unsafe Content Detection
    • Auto-moderate inappropriate content, e.g. adult-only content
  • Text Detection
    • Recognise text from images

The business benefits

Thanks to cloud computing, this complex and resource-demanding functionality can be used with relative ease by businesses. Instead of having to develop complex systems and processes to accomplish such tasks, a business can now leverage the intelligence and immense processing power of cloud products, freeing them up to focus on how best to apply the output.

In a nutshell, vendors offering video and image services are essentially providing users with APIs that interact with the cloud hosts they maintain globally. All the user needs to do, therefore, is provide the input and manage the responses returned by the many calls that can be made using the provided APIs. The exposé team currently have the required skills and capability to ‘plug and play’ with these APIs, with many use cases already outlined.

Potential use cases

As capable as these functions already are, improvements are happening all the time. While the potential scope is staggering, the following cases are based on what is currently available. There are potentially many, many more – the sky really is the limit.

Cardless, pinless entry using facial recognition only

A camera is used to view a person’s face, and the image is passed to the facial recognition APIs. The response can then be used to either open the entry or leave it shut. Not only does this improve security, preventing the use of someone else’s card or PIN, but if someone were to follow another person through the entry, security can be immediately alerted. Additional cameras can be placed throughout the secure location to ensure that only authorised people are within the specified area.

Our own test drive use case

As an extension of the cardless, pinless entry use case above, additional APIs can be used not only to determine if a person is authorised to enter a secure area, but also to check if they are wearing the correct safety equipment. The value this brings to various occupational health and safety functions is evident.

We have performed the following scenario ourselves, using a selection of APIs to provide the alert. The video above demonstrates a chef whom the API recognises using face detection. Another API is then used to determine that he is wearing the required head wear (a chef’s hat). As soon as the chef is seen in the kitchen not wearing the appropriate attire, an alert is sent to his manager to report the incident.

Technical jargon

To provide some understanding of how this scenario plays out architecturally, here is the conceptual architecture used in the solution showcased in the referenced video.

Architecture Pre-requisite:

• Face Repository / Collection

Images of the faces of people in the organisation. The vendor’s solution maps facial features, e.g. the distance between the eyes, and stores this information against a specific face. This is required by the subsequent video analytics, which need to be able to recognise a face from various angles, distances and scenes. Associated with the faces is other metadata such as name, date range for permission to be on site, and even extra information such as work hours.

Architecture of the AI Process:

• Video or Image storage

Store the video to be processed in the vendor’s cloud storage location, so it is accessible to the APIs that will subsequently be used to analyse the video/image.

• Face Detection and Recognition APIs

Run the video/images through the Face Detection and Recognition APIs to determine where a face is detected and whether a particular face matches one in the Face Repository / Collection. This returns the timestamp and bounding box of each identified face as output.

• Frame splitting

Use the face detection output and a third-party video library to extract the relevant frames from the video to be sent to additional APIs for further analysis. Within each frame’s timestamp, create a subset of images from the detected faces’ bounding boxes; there could be one or more faces detected in a frame. The bounding box extract is expanded to encompass the face and the area above the head, ready for the next step.
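The bounding-box expansion can be sketched in a few lines. This is a minimal illustration only: the `expand_bbox` helper, its normalised-coordinate convention and the 20% margin are our own assumptions, not part of any vendor API.

```python
def expand_bbox(left, top, width, height, margin=0.2):
    """Expand a normalised face bounding box (all values in [0, 1]) so the
    crop also covers the area above the head (e.g. a chef's hat), clamped
    to the frame edges. Returns (left, top, width, height)."""
    new_left = max(0.0, left - width * margin)
    # Extend further upwards than sideways, since head wear sits above the face.
    new_top = max(0.0, top - height * margin * 2)
    new_right = min(1.0, left + width * (1 + margin))
    new_bottom = min(1.0, top + height * (1 + margin))
    return new_left, new_top, new_right - new_left, new_bottom - new_top
```

Each expanded box is then cropped from the frame and passed to the object detection step below.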

• Object Detection APIs

Run object detection over the subset of images extracted from the frame. In our scenario, we’re looking to detect whether the person is wearing the required kitchen attire (a chef’s hat) or not. We can use this output, in combination with the person detected, to send an appropriate alert.
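The alerting decision itself is simple once the API outputs are available. Here is a hedged sketch; the dictionary shapes are our own assumption, loosely modelled on typical vision-API responses, not any specific vendor’s schema.

```python
def needs_alert(face_matches, labels, required_item="Chef Hat", min_confidence=80.0):
    """Return the names of recognised people in a frame who are NOT wearing
    the required attire. `face_matches` and `labels` mimic (hypothetical)
    face-recognition and object-detection responses for one frame."""
    wearing = any(label["Name"] == required_item and label["Confidence"] >= min_confidence
                  for label in labels)
    return [] if wearing else [match["Name"] for match in face_matches]
```

For the chef scenario, a frame with a recognised face but no confident “Chef Hat” label is what triggers the messaging service.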

• Messaging Service

Once it has been detected that a person is not wearing the appropriate attire within the kitchen, an alert mechanism can be triggered to notify management or other persons via e-mail, SMS or other mediums. In our video, the manager receives the alert via SMS on his phone.

Below we have highlighted the components of the Architecture in a diagram:

Conclusion

These are just a couple of examples of how we can interact with such powerful functionality, all available in the cloud. It really does open the door to a plethora of ways we can interact with videos and images and automate responses. Moreover, it illustrates how we can analyse what is occurring in our data, extracted from a new medium – which adds an exciting new dynamic!

Video and image analytics opens up immense possibilities to not only further analyse but to automate tasks within your organisation. Leveraging this capability, the exposé team can apply our experience to your organisation, enabling you to harness some of the most advanced cloud services being produced by the big vendors. As we mentioned earlier, this is a space that will only continue to evolve and improve with more possibilities in the near future.

Do not hesitate to call us to see how we may be able to help.

 

Contributors to this solution and blog entry:

Jake Deed – https://www.linkedin.com/in/jakedeed/

Cameron Wells – https://www.linkedin.com/in/camerongwells/

Etienne Oosthuysen – https://www.linkedin.com/in/etienneo/

Chris Antonello – https://www.linkedin.com/in/christopher-antonello-51a0b592/

 

Tableau Prep – we test drive this new user-centred data preparation tool

Data preparation is undoubtedly one of the most competency-reliant and time-consuming parts of report generation. For this reason, it is fast becoming the new focal area for further development and we are seeing a large uptick in the number of options being made available to help alleviate these issues.

One recent entrant to this space is Tableau, with the announcement of their new tool, Tableau Prep. This tool brings a new user experience to the artform of data preparation and follows a similar user-centred design to their reporting tool.

Tableau Prep concentrates on providing a ‘no-code-required’ solution for data preparation, with a view to making it accessible to a greater number of users and giving organisations a quicker turnaround in wrangling datasets.

Every step within Tableau Prep is visual and shows the immediate effect of any transforms on the data. Its strength is in hiding the complex smart algorithms that carry out the data manipulation and surfacing them as one-click operations, which greatly simplifies the data preparation process.

The preparation paradigm concentrates on having the user set up a pathway from the dataset through to the output, introducing the required transformations along the way.

Preparation Workflow

Clicking on an element in the workflow brings up a secondary pane showing more details relevant to the selected step.

Dataset input step

Adding steps within Tableau Prep is as simple as clicking on the “+” icon and choosing the appropriate method.

Add prep step menu

Interacting with data within Tableau Prep is similarly a visual experience. In the example below, when performing a group and replace operation, Tableau Prep has recognised that, as a result of joining datasets together, some records used the full state name of “California” while others used the contraction “CA”. It groups these together using a fuzzy matching algorithm and presents the options, allowing the user to choose which representation of the data points should be used.

Group and Replace
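Tableau Prep’s matching algorithm is proprietary, but the idea can be illustrated with a toy heuristic that treats a short value as an abbreviation of a longer one when its letters appear in order. This is our own simplification; real fuzzy matching is considerably smarter.

```python
def is_abbreviation(short, long):
    """True if every character of `short` appears in `long`, in order,
    case-insensitively -- e.g. "CA" abbreviates "California"."""
    remaining = iter(long.lower())          # `in` consumes the iterator,
    return all(ch in remaining for ch in short.lower())  # enforcing order

def group_values(values):
    """Map each value to the longest other value it abbreviates, else itself."""
    by_length = sorted(values, key=len, reverse=True)
    return {v: next((c for c in by_length
                     if len(c) > len(v) and is_abbreviation(v, c)), v)
            for v in values}
```

For example, `group_values(["California", "CA", "Oregon"])` maps “CA” to “California” and leaves “Oregon” untouched.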

Tableau Prep provides a visual summary of all changes that have been made within each step. As changes to data are made, it updates the preview in real time, allowing the user to see the effect this has had.

Example of field change descriptions

After creating the transformation pathway, generating an output file is the final step. Currently, Tableau Prep is very focussed on its integration with the Tableau product set. It publishes automatically to Tableau Desktop, Server and Online, but it also offers output in CSV file format.

Conclusion

In summary, Tableau Prep is going to enhance the ability of analysts who are used to working with the Tableau product suite. Whilst it won’t replace other more mature and prevalent data preparation products on the market such as Alteryx, Trifacta or Knime, it does offer a significant productivity opportunity to Tableau focussed organisations.

New Subscriptions

With the release of Tableau Prep, Tableau has also introduced new subscription offerings.

Subscription offerings

The new subscription levels – Tableau Creator, Explorer, and Viewer – have been packaged around expected usage within an organisation. Tableau Prep has been included in the Creator package along with Tableau Desktop.

Also see our video test drive here.

Common Data Service (CDS) – A Common Data Model. Accelerate your ability to provide insights into your business data

If you’ve been following Microsoft’s recent press releases, chances are you’ll have been exposed to the term “Common Data Service” (CDS). Read on as we shed light on the exact nature of CDS and what it can mean to your business.

Back in November 2016, Microsoft released their Common Data Service to general availability. In a nutshell, CDS is Microsoft’s attempt at providing a solution to counter the time and effort customers are spending to bring together disparate apps, services and solutions. At its most basic level, it provides a way to connect disparate systems around a focal point of your data. The intention is that Microsoft will provide the “heavy lifting” required to ensure the data flows back and forth as required.

To achieve this, Microsoft has defined a set of core business entities and then built them into what is known as the Common Data Model (CDM). For example, they have exposed entities for managing data around Accounts, Employees, Products, Opportunities and Sales Orders (for a full list see: https://docs.microsoft.com/en-us/powerapps/developer/common-data-service/reference/about-entity-reference). Where there isn’t an existing entity to suit a business requirement, Microsoft has made the CDM extensible, which allows you to add to your organisation’s instance of the CDM to meet your needs. As your organisation adapts and changes your CDS instance, Microsoft will then monitor this and look for common patterns amongst the business community that it will use to modify and extend the standard CDS.

Microsoft is committed to making their applications CDS aware and is working with their partners to get third party applications to interact effectively with the CDS.

When establishing CDS integration from an organisational use perspective, it should ideally be a simple configuration of a connector from a source application to the CDS, aligning its data entities with the reciprocal entities within the CDS.  This will ensure that as products are changed to meet business needs over time, the impact should be almost negligible to other systems. This negates the need for an organisation to spend an excessive amount of time ensuring the correct architecting of a solution in bringing together disparate apps and siloed information. This can now be handled through the CDS.

Since its release in 2016, CDS has evolved, with Microsoft recently announcing the release of two new services: Common Data Service for Apps (CDS for Apps) and Common Data Service for Analytics (CDS for Analytics).

CDS for Apps was released in January 2018, with CDS for Analytics expected for release in the second quarter of 2018. As a snapshot of how the various “pieces” fit together, Figure 1 provides a logical view of how the services will interact.

Figure 1 – Common Data Service Logical Model

Common Data Service for Apps

CDS for Apps was initially designed for businesses to engage with their data on a low-code/no-code basis through Microsoft’s PowerApps product. This allows a business to rapidly develop scalable, secure and feature-rich applications.

For organisations needing further enhancement, Microsoft offers developer extensions to engage with CDS for Apps.

Common Data Service for Analytics

CDS for Analytics was designed to function with Power BI as the visual reporting product. Similarly to the way CDS for Apps is extensible by developers, CDS for Analytics will also provide extensibility options.

Figure 2 below provides the current logic model for how CDS for Analytics will integrate.

Figure 2 – CDS for Analytics Logic Model

Business Benefits

Implementing CDS for Apps and CDS for Analytics will enable you to easily capture data and then accelerate your ability to provide insights into your business data.

To assist in this acceleration, Microsoft and exposé data, as their partners, will be building industry-specific apps that immediately surface deep insights into an organisation’s data. An initial example is currently being developed by Microsoft: Power BI for Sales Insights will address the maximisation of sales productivity by providing insights into which opportunities are at risk and where salespeople could spend their time more efficiently.

The ease of development and portability of solutions aren’t possible, however, without a standardised data model. By leveraging Microsoft’s new common data services, and with Microsoft’s suite of products being CDS aware, utilisation of tools such as Azure Machine Learning and Azure Databricks for deeper analysis of your organisation’s data becomes transformational.

If you’d like to understand more about how to take advantage of the Common Data Service or for further discussion around how it can assist your business, please get in touch.

European GDPR and its impact on Australian organisations. We give you the low-down from an analytic tool perspective.

What is the European GDPR and how will it impact Australian organisations?  We give you the low-down from an analytic tool perspective.

GDPR (the General Data Protection Regulation) is the European privacy and data protection law that comes into effect on the 25th of May 2018. This surely doesn’t affect Australian companies, right? Wrong!

The thing is, whilst the new regulation governs data protection and privacy for all EU citizens, it also addresses personal data outside of the EU. The impact will be far-reaching, including for Australian businesses, as any business concerned with the gathering and analysis of consumer data could be affected.

What the law says

According to the Office of the Australian Information Commissioner (OAIC), Australian businesses of any size may need to comply. In addition, all Australian businesses must comply with the Australian Privacy Act 1988.

Are these two laws complementary? Some of the common requirements that businesses must adhere to include:

  • Implementation of a privacy-by-design approach to compliance
  • An ability to demonstrate compliance with privacy principles and obligations
  • Adoption of transparent information handling practices
  • Appropriate notification in the case of any data breach
  • Conducting privacy impact assessments

But some GDPR requirements are not part of the Australian Privacy Act, such as the “right to be forgotten”.

What now?

We would suggest that Australian businesses firstly establish whether they need to comply with the GDPR. If they do, they should take prompt steps to ensure their data practices comply. Businesses should already comply with the Australian Privacy Act, but should also consider rolling out the additional measures required under the GDPR which are not inconsistent with the Privacy Act.

Who is affected

In a nutshell, the GDPR applies to any data processing activities undertaken by an Australian business of any size that:

  • Has a presence in the EU
  • Has a website/s that targets EU customers or mentions customers or users in the EU
  • Tracks individuals in the EU to analyse (for example to predict personal preferences, behaviours and attitudes)

Refer to the following link for more information: https://www.oaic.gov.au/media-and-speeches/news/general-data-protection-regulation-guidance-for-australian-businesses

Do analytic tools comply?

Once a need for your organisation to comply has been established, it is worth ascertaining whether the actual tools you are using for analytics comply, specifically regarding the last bullet point above (tracking and analysing individuals).

In the next section of this article we look at two common players in the analytics space, Power BI and Qlik, through the lens of the GDPR (and, by extension, the Australian Privacy Act).

The scope of GDPR is intended to apply to the processing of personal data irrespective of the technology used. Because Power BI and Qlik may be used to process personal data, there are certain requirements within the GDPR that compel users of these technologies to pay close attention:

  • Article 7 states that consent must be demonstrable and “freely given” if the basis for data processing is consent.  The data subject must also have the right to withdraw consent at any time
  • Articles 15 to 17 cover the rights to access, rectification and erasure. This means that mechanisms must allow data subjects to request access to their personal data and receive information on the processing of that data. They must be able to rectify personal data if it is incorrect. Data subjects must also be able to request the erasure of their personal data (i.e. the “right to be forgotten”)
  • Articles 24 to 30 require maintenance of audit trails and documentary evidence to demonstrate accountability and compliance with the GDPR
  • Article 25 requires businesses to implement the necessary privacy controls, safeguards, and data protection principles so that privacy is by design
  • Articles 25, 29 and 32 require strict access control to personal data through, for example, role-based access and segregation of duties

Microsoft Power BI

Power BI can be viewed through the lens of the GDPR (and the Australian Privacy Act, for that matter) via four pillars in the Microsoft Trust Centre. With specific reference to the GDPR, Microsoft states, “We’ve spent a lot of time with GDPR and like to think we’ve been thoughtful about its intent and meaning”. Microsoft released a whitepaper to provide the reader with a basic understanding of the GDPR and how it relates to Power BI. Meeting GDPR compliance, however, will likely involve a variety of different tools, approaches and requirements.

Security

Power BI is built using the Security Development Lifecycle. Through Azure Active Directory, Power BI is protected from unauthorised access by simplifying the management of users and groups, which enables you to assign and revoke privileges easily.

Privacy

The Microsoft Trust Centre clearly states that “you are the owner of your data” and that it is not mined for advertising.  http://servicetrust.microsoft.com/ViewPage/TrustDocuments?command=Download&downloadType=Document&downloadId=5bd4c466-277b-4726-b9e0-f816ac12872d&docTab=6d000410-c9e9-11e7-9a91-892aae8839ad_FAQ_and_White_Papers

From the Power BI white paper, “We use your data only for purposes that are consistent with providing the services to which you subscribe. If a government approaches us for access to your data, we redirect the inquiry to you, the customer, whenever possible. We have challenged, and will challenge in court, any invalid legal demand that prohibits disclosure of a government request for customer data.” https://powerbi.microsoft.com/en-us/blog/power-bi-gdpr-whitepaper-is-now-available/  

Compliance

Microsoft complies with leading data protection and privacy laws applicable to Cloud services, and this is verified by third parties.

Transparency

Microsoft provides clear explanations on:

  • location of stored data
  • the security of data
  • who can access it and under what circumstances

Qlik

The BI vendor, Qlik, released a statement that declares “With more stringent rules and significant penalties, GDPR compels businesses to use trusted vendors. Qlik is committed to our compliance responsibilities – within our organization and in delivering products and services that empower our customers and partners in their compliance efforts.” – https://www.qlik.com/us/gdpr

Qlik released an FAQ document as a GDPR compliant vendor stating that they have various measures in place to protect personal data and comply with data protection/privacy laws, including GDPR:

  • Legal measures to ensure the lawful transfer
  • Records of data processing activities (Article 30)
  • Ensuring Privacy-By-Design and Privacy-By-Default
  • Data retention and access rules
  • Data protection training and policies

For more information, please view the links below:

https://www.qlik.com/us/-/media/files/resource-library/global-us/direct/datasheets/ds-gdpr-qlik-organization-and-services-en.pdf?la=en

Conclusion

The two vendors discussed are clear in their commitment to ensuring their security arrangements can comply with the GDPR. This does not mean that other major players (Tableau, Google, etc.) do not have the same initiatives in flight; we have focused only on Microsoft and Qlik.

Whilst there is no ‘magic button’ available to ensure all regulations are miraculously met, regardless of vendor it is possible:

  • To ensure security policies can meet GDPR compliance
  • To design with privacy in mind.  Even though platforms may support “privacy by design”, your specific solution must still be proactively designed.  You cannot simply rely on the vendor
  • To conduct an appropriate solution audit aligned to the GDPR (or the Australian Privacy Act) as a good final step

GDPR can indeed be a tricky landscape to navigate – if in doubt, check it out.

We can certainly assist in guiding you through the process from a Data and Analytics perspective.

A Power BI Cheat Sheet – demystifying its concepts, variants and licencing

Power BI has truly evolved over the past few years: from an add-on in Excel to a true organisation-wide BI platform, capable of scaling to meet the demands of large organisations, both in terms of data volumes and the number of users. Power BI now has multiple flavours and a much more complicated licencing model. So, in this article, we demystify this complexity by describing each flavour of Power BI and its associated pricing. We summarise it all at the end with some scenarios and a single cheat sheet for you to use.

Desktop, Cloud, On-premise, Pro, Premium, Embedded – what does all of this mean?

I thought it best to separate the “why” (i.e. why you use Power BI – Development or Consumption), the “what” (i.e. what you can do given your licence variant), and the “how much” (i.e. how much it is going to cost you), as combining these concepts often leads to confusion: there isn’t necessarily an easy map between why, what and how much.

Let’s first look at the “why”

“Why” deals with the workload performed with Power BI based on its deployment – i.e. why do you use Power BI? Is it for Development or for Consumption? This is very much related to the deployment platform (i.e. Desktop, Cloud, On-Premise or Embedded).

The term “consumption” for the purpose of this article could range from a narrow meaning (i.e. the consumption of Power BI content only) to a broad meaning (i.e. consumption of, collaboration over, and management of Power BI content – I refer to the latter group as “self-serve creators”).

Why – workload/ deployment matrix

Now let’s overlay the “why” with “what”

In the table above, I not only dealt with the “why”, but also introduced the variants of Power BI, namely Desktop, Free, Pro, On-Premise and Embedded. Variants relate to the licence under which the user operates, and they determine what a user can do.

Confused? Stay with me…all will become clearer.

What – deployment/ licence variant matrix

Lastly let’s look at the “how much”

The Power BI journey (mostly) starts with development in Desktop, then proceeds to a deployed environment where it is consumed (with or without self-serve). Let’s close the loop on understanding the flavours of Power BI by looking at what this means from a licencing cost perspective.

Disclaimer: The pricing supplied in the following table is based on US-, Australian-, New Zealand- and Hong Kong Dollars. These $ values are by no means quotes but merely taken from the various calculators and pricing references supplied by Microsoft as at the date of first publication of this article.

How much – licence variant/ cost matrix

https://www.microsoft.com/en-Us/sql-server/sql-server-2017-pricing

https://powerbi.microsoft.com/en-us/calculator/

https://azure.microsoft.com/en-us/pricing/calculator/

**Other ways to embed Power BI content are via REST APIs (authenticated), SharePoint Online (via Pro licencing) and Publish to Web (unauthenticated), but that is a level of detail for another day. For the purpose of this article, we focus on Power BI Embedded as the only embedded option.

Pro is pervasive

Even if you deploy to the Cloud and intend to make content available only to pure consumers (non-self-serve users), whether in PowerBI.com or as embedded visuals, you will still need at least one Pro licence to manage your content. The more visual content creators (self-serve creators) you have, the more Pro licences you will need. It is, however, worth considering the mix between Pro and Premium licences: both Pro and Premium users can consume shared content, but only Pro users can create shared content (via self-service), so the mix must be determined by a cost vs capacity ratio (as discussed below).

A little bit more about Premium

Premium allows users to consume shared content only; it does not allow for any self-service capabilities. Premium licences are not per user but are instead based on planned capacity, so you pay for a dedicated node to serve your users. Consider Premium licencing for organisations with large numbers of consumers (non-self-serve) that also require dedicated compute to handle capacity. The organisation would still require one or more Pro licences for content management and any self-serve workload.

Premium licencing is scaled as Premium 1, 2 or 3 depending on the number of users and required capacity. You can scale out your capacity by adding more nodes as P1, P2 or P3, or scale up from P1 to P2, and from P2 to P3.

Premium capacity levels

The mix between Pro and Premium

Given that Pro users can do more than Premium users, and given that you will need to buy one or more Pro licences anyway, why would you not simply use Pro rather than Premium? There are two reasons:

  • There is a tipping point where Pro becomes more expensive than Premium, and
  • With Pro licences you use a shared pool of Azure resources, so Pro is not as performant as Premium, which uses dedicated resources; there is therefore a second tipping point where your capacity requirements won’t be sufficiently served by Pro.

The diagram below shows the user and capacity tipping points (discussed further in scenario 1 below):

Capacity planning Premium 1 vs Pro: Users/ Cost/ Capacity
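Using the article’s A$ figures, the first (cost) tipping point falls at roughly 500 pure consumers. A back-of-envelope sketch follows; `cheaper_option` is our own helper and ignores the Pro licences needed for content management, which are required in either case.

```python
# Article pricing (A$ per month): Pro per user vs a dedicated P1 node.
PRO_PER_USER = 12.70
P1_NODE = 6350.0

def cheaper_option(consumers):
    """Which licence mix is cheaper for a given number of pure consumers?
    (Our own helper; excludes the Pro licences needed for management.)"""
    return "Pro" if consumers * PRO_PER_USER < P1_NODE else "Premium P1"

tipping_point = P1_NODE / PRO_PER_USER  # roughly 500 consumers
```

Below the tipping point, per-user Pro licences win; above it, a dedicated P1 node is cheaper (and more performant, since it is not a shared pool).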

Put this all together

Right, you now understand the “why”, “what” and “how much” – let’s put it all together through examples (I will use Australian $ only, for illustrative purposes). Please note that there are various ways to achieve the scenarios below; this is not a comprehensive discussion of all the options.

Scenario 1

A large organisation has 10 Power BI developers. Their Power BI rollout planning suggests that they will grow to 50 self-serve creators and 1,450 additional high-activity consumers in 12 months, and to 125 self-serve creators and 5,000 high-activity consumers in 48 months:

Initially, they will require

10 x Power BI Desktop licences = $0 x 10 = $0

500 x Power BI Pro licences to cover both self-serve users and consumers = $12.70 x 500 = $6,350

Total – A$6,350.00pm

Once they exceed 500 users, they can revert to

50 x Power BI Pro licences to cover self-serve users = $12.70 x 50 = $635

1 x P1 node to cover the next tranche of high activity consumers = $6,350

Total – A$6,985.00pm

Thereafter

Add Power BI Pro licences as required up to their planned 125 = $12.70 x 125 = $1,588

Add 1 additional P1 node at 1,450 users, and again at 2,900 users, and again at 4,250 users = $25,400 for 4 x P1 nodes

Total after 4 years at 5000 high activity consumers and 125 self-serve creators – A$26,988.00pm
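The arithmetic above can be checked in a few lines. This is a sketch using the article’s A$ figures; `monthly_cost` is our own helper, and note that the article rounds the per-line and four-year totals to the nearest dollar.

```python
# Article pricing (A$ per month); Power BI Desktop licences are free.
def monthly_cost(pro_users, p1_nodes, pro=12.70, p1=6350.0):
    """Our own helper: total monthly spend for a given licence mix."""
    return pro_users * pro + p1_nodes * p1

initial = monthly_cost(500, 0)     # all on Pro:          ~A$6,350.00pm
at_500 = monthly_cost(50, 1)       # Pro + one P1 node:   ~A$6,985.00pm
year_four = monthly_cost(125, 4)   # Pro + four P1 nodes: ~A$26,987.50pm
```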

Scenario 2

A small organisation with 1 Power BI developer, 5 additional self-service creators and 10 additional consumers of visual content, with no custom applications/ websites.

1 x Free version of Power BI Desktop: 1 x $0

15 x Pro licences as both visual creators and mere consumers will take part in shared content: 15 x $12.70

Total – A$190.50pm

Scenario 3

A small ISV organisation with 3 Power BI developers wants to embed Power BI content in an application that they sell. The application must be up 24 x 7 and does not require a very high volume of concurrent users, but licencing cannot be on a per-user basis.

3 x Free version of Power BI Desktop: 3 x $0

1 x Pro licence acting as the master of the shared content: 1 x $12.70

A1 Node pricing: 1 x $937

Total – A$950.00pm

Scenario 4

A medium-sized organisation with 5 Power BI developers wants to embed Power BI content in an internal portal such as SharePoint, used by potentially 250 users. They also have 10 self-service creators and 25 consumers of Power BI content through the Power BI portal.

5 x Free version of Power BI Desktop: 5 x $0

26 x Pro licences, acting as 1 master of the shared content and 25 consumers: 26 x $12.70 = $330.20

A1 Node pricing: 1 x $937

Total – A$1,267.20pm

Power BI – licence variant, workload, deployment & cost cheat sheet

All pricing is shown in Australian $

Disclaimer: The pricing supplied in the following table are by no means quotes, but merely taken from the various calculators and pricing references supplied by Microsoft as at the date of first publication of this article.

Licence variant, workload, deployment & cost cheat sheet

Blockchain in bits – A technical insight


In our previous two articles, we articulated several real-life use cases for Blockchain implementations, and we have also elaborated conceptually how Blockchain differs from current/previous data storage architecture as well as other conceptual benefits of Blockchain as a platform.

In this article, we touch upon the technical components of Blockchain networks and Smart Contracts, and we walk through a technical implementation of a viable Blockchain application using the Microsoft Azure platform.

What is Blockchain?

Blockchain is a shared ledger which stores data differently to typical database platforms, solving several challenges by avoiding double spending and the need for trusted authorities or centralised computing servers. Furthermore, Blockchain as a technology has evolved since the introduction of the Bitcoin Blockchain in 2008 (invented by Satoshi Nakamoto) and is now solving more recognisable business problems beyond cryptocurrencies.

In addition to the concepts discussed in the previous article, below are some additional descriptions of Blockchain components before we dive into the technical walk-through:

Blocks – A block is a valid record/transaction in the Blockchain that can’t be altered or destroyed. It is a digital footprint based on a cryptographic hash, which remains in the system as long as the system is alive. Since the Blockchain is decentralised, the blocks are replicated across the network nodes, making them immutable and secure.

Cryptographic hash – Cryptographic hash functions are cryptographic algorithms that generate hash values for a given piece of data. They ensure the authenticity, integrity and security of the data.
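For example, using Python’s standard library, any change to the input, however small, produces a completely different fixed-length digest:

```python
import hashlib

# A block's digital footprint: a fixed-length digest of its contents.
block_data = b"payment: Alice -> Bob, 10 units"
digest = hashlib.sha256(block_data).hexdigest()

# Changing a single character yields an entirely different digest:
tampered = hashlib.sha256(b"payment: Alice -> Bob, 11 units").hexdigest()
assert digest != tampered
```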

Nodes – A node is a computer/server/virtual machine that participates in a Blockchain network. Nodes store all the blocks and transactions generated in the system. A peer-to-peer (P2P) architecture connects the nodes of a Blockchain. When a device is attached to the network as a node, all blocks are downloaded and synchronised. Even if one node goes down, the network is not impacted.

Miner Node – Miner nodes create the blocks for processing the transactions. They validate new transactions and add blocks to the Blockchain. Any node can be a miner node; since all the blocks in the network are replicated across each node, including the miner node, the failure of any miner node is not a single point of failure. It is advisable to use high-powered computing machines as miner nodes, since mining consumes a lot of power and resources.

How a Blockchain transaction works

A Blockchain transaction must complete a set of precursory activities to ensure integrity and security. These steps make the Blockchain network a unique proposition for a trusted computing paradigm.

Let’s look at the Blockchain transaction lifecycle.

  1. A user initiates a transaction on the Blockchain through a “wallet” or a web3 interface.
  2. The transaction is validated by a set of computing nodes called miners, using cryptographic hash functions.
  3. Miner nodes create blocks based on the transaction, using crypto-economic schemes such as Proof of Work (PoW) or Proof of Stake (PoS).
  4. The block is synchronised with the other nodes within the Blockchain network.

Blockchain transaction lifecycle
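The chaining at the heart of this lifecycle can be sketched as a toy chain in which each block commits to the hash of its predecessor. This is a minimal illustration only: no mining, consensus or networking.

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 digest over a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def add_block(chain, transactions):
    """Append a block whose `prev_hash` commits to the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev,
                  "transactions": transactions})
    return chain

def is_valid(chain):
    """Re-derive every link; tampering with a block breaks all later links."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = add_block([], ["genesis"])
chain = add_block(chain, ["Alice pays Bob 5"])
```

Altering the transactions in an earlier block invalidates every later block’s `prev_hash`, which is exactly how replicated nodes detect tampering.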

Types of Blockchain networks

Before setting up a Blockchain, one must determine the type of network required. There are three types of Blockchain network applications.

Public Blockchain:

  • An open (public) network ready for use at any given point in time. Anyone can read the transactions and deploy decentralised apps that use the underlying blocks. No central authority controls the network.
  • These Blockchain networks are “fully decentralised”.
  • Use case: the public Ethereum Blockchain can be used to manage payments or run Blockchain apps globally.

Consortium Blockchain:

  • A group of pre-selected nodes controls the consensus process. The right to read may be public, but participation in the Blockchain can be limited to consortium members by using API calls to restrict access to the contents of the Blockchain.
  • For example, a statutory body or an organisation may implement a regulatory Blockchain application that allows selected organisations to participate in validating the process.
  • These Blockchain networks are “Partially decentralised”.
  • Use case: Reserve Bank of Australia (RBA) can set up a Blockchain network for processing and controlling specific banking transactions across banks based on statutory compliance requirements. Participating banks implement Blockchain nodes to authenticate transactions in the network.

Private Blockchain:

  • Similar to any other centralised database application controlled and governed by a single company or organisation, which has complete write and read permissions, although the public may be allowed to see specific transactions at the Blockchain network administrator’s discretion.
  • These Blockchain networks are “Centralised”.
  • Use case: A company can automate its supply chain management using Blockchain technology.
Types of Blockchains

Implementing Blockchain on Azure

Blockchain on Azure is Blockchain as a Service (BaaS): an open, flexible and scalable platform. Organisations can opt for BaaS to implement solutions on a federated network that meets their security, performance and operational requirements without investing in physical infrastructure.

Azure BaaS provides a complete ecosystem to design, develop and deploy cloud-based Blockchain applications. Rather than spending hours building out and configuring infrastructure across organisations, Azure automates these time-consuming pieces, allowing you to focus on building out your scenarios and applications. Through the administrator web page, you can configure additional Ethereum accounts to get started with smart contracts and, eventually, application development.

Consortium Blockchains can be deployed using:

Ethereum Consortium Leader

  • To start a new multi-node Ethereum Consortium network, deploy the Ethereum Consortium Leader.
  • It creates the primary network for the other multi-node members to join.

Ethereum Consortium Member

  • To join an existing Ethereum Consortium network, deploy the Ethereum Consortium Member.

Private Blockchains can be deployed using

Ethereum Consortium Blockchain

  • To create a private network, use the Ethereum Consortium Blockchain.
  • The template builds a private network within minutes on the Azure cloud.

The links below provide a step-by-step approach to deploying a Blockchain network on the Azure cloud.

Once deployed you will receive the following details:

  • Admin Site: A website you can navigate to showing the status of the nodes on your Ethereum network.
  • Ethereum-RPC-Endpoint: An endpoint for connecting to your Ethereum network from a library or framework such as web3.js or Truffle.
  • Ssh-to-first-tx-node: To interact with your Blockchain, log in using a Secure Shell (SSH) client. I’m currently working on Windows, so I’ll be using PuTTY (https://www.putty.org/), but you can use any SSH client to connect to the console. On a Mac, you can simply copy and paste the “ssh” line into your terminal.

Interacting with Your Azure Blockchain Using Geth

Geth is a multipurpose command-line tool that runs a full Ethereum node, implemented in Go. It offers three interfaces: command-line subcommands and options, a JSON-RPC server, and an interactive console.

Steps to connect the Blockchain instance:

  • SSH into the Azure server using PuTTY or another command-line SSH client.
  • Connect to the Blockchain console using the geth attach command.
  • The console loads the available modules (typically eth, personal, net and web3) and presents an interactive command prompt.
  • From the prompt you can run geth commands, for example eth.accounts or eth.blockNumber.

You can access the network using the Mist Ethereum wallet or any other Ethereum-compatible wallet.

Mist Ethereum wallet

Smart Contracts in action

“Smart Contracts: Building Blocks for Digital Markets” – Nick Szabo

Smart contracts are sets of terms and conditions that must be met for something to happen between parties. A smart contract is simply code, stored in blocks, and is immutable. Smart contracts:

  • Are anonymous.
  • Are secured using encryption, which keeps them safe.
  • Can’t be lost, since they are replicated across the other Blockchain nodes.
  • Speed up business processes.
  • Save money, since no third party is needed to validate and go through the contract terms.
  • Are accurate, since they avoid the errors that occur during manual execution of contracts.
Example of how a smart contract works

In the above example, the following are the actions captured:

  1. Mark uses the healthcare consortium network to record his details. The details are persisted in the blockchain through a smart contract. A smart contract can hold all the required variables and attributes.
  2. Once the smart contract has acquired all the mandatory information and requirements, it is then deployed into the healthcare consortium network. A transaction is initiated for further consultation.
  3. The healthcare consortium network validates the transaction based on the logic defined in the smart contract. When Mark is found to have a health issue, the contract/health record is automatically sent to Dr John for further analysis and consultation.
  4. Dr John accesses the record and recommends Dr Anne for specialised treatment. The contract is automatically executed and sent to Dr Anne for further action.
  5. Dr Anne provides necessary treatment to Mark. The details of the treatment are persisted in the smart contract.

There are various tools to write and deploy smart contracts; the common ones are:

  • Languages: Solidity.
  • IDEs: Solidity Browser, Ethereum Studio.
  • Clients: geth, eth, Ethereum Wallet.
  • APIs & frameworks: Embark, Truffle, Dapple, Meteor, web3.js, EthereumJ, BlockApps.
  • Testing: TestRPC, a testnet or a private network.
  • Storage: IPFS, Swarm, Storj.
  • Dapp browsers: MetaMask, Mist.

An example of a Solidity script can be found below.

Solidity script

Blockchain and Data Analytics

Perhaps the most significant recent development in information technology is the growth of data analytics platforms in the Big Data, Machine Learning and Data Visualisation space. Analytics platforms and data lakes can source Blockchain data through federated APIs built on top of the Blockchain. Since the provenance and lineage of the data are well established, data from the Blockchain can underpin a productive platform for data analytics, machine learning or AI development.

The following diagram is a simplistic view of integrating data analytics with Blockchain.

Blockchain Data Analytics
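As a minimal sketch of the integration, assuming a simplified block structure like one a federated API might return (the field names here are illustrative), flattening block data into one row per transaction makes it easy to land in a data lake or analytics table:

```python
def flatten_blocks(blocks):
    """Flatten nested block/transaction data into one row per
    transaction, ready for a data lake or analytics table."""
    rows = []
    for block in blocks:
        for tx in block["transactions"]:
            rows.append({
                "block_number": block["number"],
                "timestamp": block["timestamp"],
                "from": tx["from"],
                "to": tx["to"],
                "value": tx["value"],
            })
    return rows

# Simplified blocks as they might be returned by such an API.
blocks = [
    {"number": 1, "timestamp": "2018-06-01T09:00:00Z",
     "transactions": [{"from": "0xabc", "to": "0xdef", "value": 5}]},
    {"number": 2, "timestamp": "2018-06-01T09:01:00Z",
     "transactions": [{"from": "0xdef", "to": "0x123", "value": 2},
                      {"from": "0xabc", "to": "0x123", "value": 1}]},
]

rows = flatten_blocks(blocks)
```

Once flattened into rows like these, the data can be loaded into any analytics or machine-learning platform with its provenance intact.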

Conclusion

Before an organisation starts any technology assessment or implementation of a Blockchain, even if just for R&D, it should consider what a Blockchain would mean for the organisation through potential use cases and process improvement opportunities. Moreover, ensure the basic concepts described here and in the second article in the series are understood vis-à-vis your identified use cases.

Only then proceed to the technology side of things.

Blockchain has the potential to be a fantastic technology through its federated computing paradigm, but do not lose sight of the process and people aspects associated with it.