Bringing Australian Wine to the World – our Wine Australia case study

See how we used modern methodology, cloud analytical technologies and thought leadership to architect and create this public-facing, interactive export analytics solution that empowers Australian wine exporters to make informed, data-driven decisions.

See our case study here.

Have a look at the solution in the link below. Use any of the “Get Started” questions to start your journey. Market Explorer Tool

See our short video here.

Smart Wastewater network – our SA Water case study

This solution enables better customer service through improved management of wastewater assets, delivered via an industry-leading smart wastewater network management solution. It incorporates a leading-edge architecture built on the Azure platform using key technologies including Databricks, IoT, Stream Analytics, App Service and more.

Read our case study here: Smart Wastewater network – our SA Water case study

Databricks: distilling Information from Data

In the first of this series of articles on Databricks, we looked at how Databricks works and the general benefits it brings to organisations ready to do more with their data assets. In this post we build upon this theme in the advanced analytics space. We will also walk through an interesting example (biometric data generated by an Apple Watch) of how you might use Databricks to distil useful information from complex data.

But first, let’s consider four common reasons why the value of data is not fully realised in an organisation:

Separation between data analysis and data context.

Those who have deep data analytic skills – data engineers, statisticians, data scientists – are often in their own specialised area within a business. This area is separated from those who own and understand data assets. Such a separation is reasonable: most BAU data collection streams don’t have a constant demand for advanced analytical work, and advanced analytical projects often require data sourced from a variety of business functions. Unfortunately, success requires strong engagement between those that deeply understand the data and those that deeply understand the analysis. This sort of strong engagement is difficult to moderate in practice.

We’ve seen cases where because of the short-term criticality of BAU work or underappreciation of R&D work, business data owners are unable to appropriately contribute to a project, leaving advanced analytics team members to make do. We’ve seen cases where all issues requiring data owner clarification are expected to be resolved at the start, and continued issues are taken as a sign that the project is failing. We’ve seen cases where business data knowledge resides solely in the minds of a few experts.

Data analysis requires data context. It’s often said “garbage in, garbage out”, but it’s just as true to say “meaningless data in, meaningless insights out”. Databricks improves this picture by encouraging collaboration between data knowledge holders and data analysts, through its shared notebook-style platform.

Difficulty translating analytical work to production workloads.

Investigation and implementation are two different worlds. Investigation requires flexibility, testing different approaches, and putting “what” before “how”. Implementation requires standards, stability, security and integration into systems that have a wider purpose.

A good example of this difficulty is the (still somewhat) ongoing conflict between the use of Python 2 and Python 3. Python 3 has now almost entirely subsumed Python 2 in functionality, speed, consistency and support. However, due to legacy code, platforms and standards within organisations, there are still inducements to use Python 2, even if a problem is better addressed with Python 3. This same gap can also be found in individual Python modules and R packages. A similar gap can be found in organisational support for Power BI Desktop versions. A more profound gap can be seen if entirely different technologies are used by different areas.

This can lead either to substantial overhead for IT infrastructure teams or to substantial barriers to the adoption of valuable data science projects. PaaS providers offer to maintain the data analysis platform for organisations, enabling emerging algorithms and data analysis techniques to be utilised without additional infrastructure considerations. Additionally, Databricks supports Python, R, SQL and Scala, which cover the major non-proprietary data analysis languages.

Long advanced analysis iterations.

The two previous issues contribute to a third issue: good advanced analyses take time to yield useful results for the business. By the time a problem is scoped, understood, investigated, confronted and solved, the problem may have changed shape, or been patched by business rules and process changes, enough that the full solution implementation is no longer worth it. Improving the communication between data knowledge holders and data analysts, and shortening the distance between investigation and implementation, reduces the time between problem and solution.

What this means for your organisation is that the business will begin to see more of the benefits of data science. As confidence and acceptance grow, so does the potential impact of data science. After all, more ambitious projects require more support from the business.

Data science accepted as a black box.

Data science is difficult, uncertain and broad. This has three implications. Firstly, a certain number of unsatisfying results must be expected and accepted. Secondly, there is no single defensible pathway for addressing any given problem. Thirdly, no one person or group can understand every possible pathway for generating solutions. Unfortunately, these implications mean that data science practitioners can be in a precarious position justifying their work. Many decision makers can only judge data science by its immediate results, regardless of the unseen value of the work performed. Unseen value may be recognition of data quality issues or appreciation of better opportunities for data value generation.

We don’t believe in this black box view of data science. Data science can be complicated, but its principles and the justifications within a project should be understood by more than just nominal data scientists. This understanding gap is a problem for an organisation’s maturity in the data science space.

Over recent years wide in-roads have been made into this problem with the rise in usage of notebook-style reports. These reports contain blocks of explanatory text, executable code, code results and mathematical formulas. This mix of functions allows data scientists to better expose the narrative behind their investigation of data. Notable examples of this style are Jupyter Notebooks, R Markdown, or Databricks.

Databricks enables collaboration, platform standardisation and process documentation within an advanced analytics project. Ultimately this means a decreased time between problem identification and solution implementation.

Databricks Example: Biometric Data

For demonstrating Databricks, we have an interesting, real data source: the biometrics collected by our watches and smartphones. You probably also have access to this kind of data; we encourage you to test it out for yourself. For Apple products it can be extracted as an XML file and mounted to the Databricks file system. Not sure how to do this? See our previous article.

Specifically, the data we have comes from the watch and smartphone of our national manager for technology, Etienne. Our aim is to extract useful insights from this data. The process we will follow (discussed in the subsequent sections) is:

  1. Rationalise the recorded data into an appropriate data structure.
  2. Transform the data to be useful for its intended purpose.
  3. Visualise and understand relationships within the data.
  4. Model these relationships to describe the structure of the data.

Typically, advanced analytics in the business context should not proceed this way. There, a problem or opportunity should be identified first and the model should be in service of this. However here we have the flexibility to decide how we can use the data as we analyse it. This is a blessing and a curse (as we shall see).

Rationalisation

The process of converting the XML data into a dataframe could be overlooked. It’s not terribly exciting. But it does demonstrate the simplicity of parallelisation when using Databricks. Databricks is built over Apache Spark, an engine designed for in-memory parallel data processing. The user doesn’t need to concern themselves with how work is parallelised*, just focus on what they need done. Work can be described using Scala, Python, R or SQL. In this case study we’ll be using Python, which interacts with Spark using the PySpark API.

Since we’ve previously mounted our XML biometrics summary, we can simply read it in as a text file. Note that there are ways to parse XML files directly, but to see what we’re working with, reading it as plain text is a bit easier.
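The original notebook code isn’t reproduced here, but a minimal sketch of this step might look like the following (the mount path is illustrative only; sc is the Spark context Databricks provides in every notebook):

```python
# Read the mounted Apple Health export as plain text (path is an assumption)
raw_rdd = sc.textFile("/mnt/health/export.xml")

print(raw_rdd.count())                        # roughly 2.25 million lines in our case
for line in raw_rdd.takeSample(False, 10):    # a quick random sample of 10 lines
    print(line)
```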

We’ve asked Spark (via sc, a representation of “Spark Context”) to create a Resilient Distributed Dataset (RDD) out of our biometrics text file export.xml. Think of RDDs as Spark’s standard data storage structure, allowing parallel operations across a cluster of machines. In our case our RDD contains 2.25 million lines from export.xml. But what do these lines look like?

A simple random sample of 10 lines shows that each biometric observation is stored within the attributes of a separate Record tag in the XML. This means that extracting this into a tabular format can be quite straightforward. All we need to do is identify Record tags and extract their attributes. However, we should probably check that all of our Record tags are complete first.

We’ve imported re, a Python module for regular expression matching. Using this we can filter our RDD to find records that begin with “<Record” but are not terminated with “>”. Fortunately, it appears that this is not the case. We can also test for the case where there are multiple records in the same line, but we’ll skip this here. Next we just need to filter our RDD to Record tags.
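A hedged sketch of what these two checks might look like, continuing from the raw_rdd created above:

```python
import re

# Lines that open a Record tag but are not terminated with ">" on the same line
unterminated = raw_rdd.filter(
    lambda line: re.match(r"\s*<Record", line) and not re.search(r">\s*$", line)
)
print(unterminated.count())   # 0 here, so every Record tag appears to be complete

# Keep only the lines containing Record tags
records_rdd = raw_rdd.filter(lambda line: re.match(r"\s*<Record", line))
```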

In both of these regular expression checks, I haven’t had to consider how Spark is parallelising these operations. I haven’t had to think any differently from how I would solve this problem in standard Python. I want to check each record has a particular form – so I just import the module I would use normally, and apply it in the PySpark filter method.

*Okay, not entirely true. Just like in your favourite RDBMS, there are times when the operation of the query engine is important to understand. Also like your favourite RDBMS, you can get away with ignoring the engine most of the time.

Transformation

We already have our records, but each record is represented as a string. We need to extract features: atomic attributes that can be used to compare similar aspects of different records. A record tag includes features as tag attributes. For example, a record may say unit=”cm”. Extracting the individual features from the record strings in our RDD using regular expressions is fairly straightforward. All we need to do is convert each record string into a dictionary (Python’s standard data structure for key-value pairs) with keys representing the feature names and values representing the feature values. I do this in one (long) line by mapping each record to an appropriate dictionary comprehension:
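A sketch of how that mapping and the subsequent conversion to a dataframe might look (it assumes every Record carries the same set of attributes; in practice missing attributes would need to be filled with None before the schema can be inferred):

```python
import re
from pyspark.sql import Row

attr_pattern = re.compile(r'(\w+)="([^"]*)"')   # matches key="value" pairs

# One (long) line: map each Record string to a dictionary of its attributes
records_dict_rdd = records_rdd.map(lambda line: {k: v for k, v in attr_pattern.findall(line)})

# Convert the dictionaries to Rows so Spark can infer column names, then to a dataframe
df = spark.createDataFrame(records_dict_rdd.map(lambda d: Row(**d)))

display(df)
```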

This has converted our RDD into a dataframe – a table-like data structure, composed of columns of fixed datatypes. By and large, the dataframe is the fundamental data structure for data science investigations, inherited from statistical programming. Much of data science is about quantifying associations between features or predictor variables and variables of interest. Modelling such a relationship is typically done by comparing many examples of these variables, and rows of a dataframe are convenient places to store these examples.

The final call to the display function in the above code block is important. This is the default (and powerful) way to view and visualise your data in Databricks. We’ll come back to this later on.

So we have our raw data converted into a dataframe, but we still need to understand the data that actually comprises this table. Databricks is a great platform for this kind of work. It allows iterative, traceable investigations to be performed, shared and modified. This is perfect for understanding data – a process which must be done step-by-step and is often frustrating to document or interpret after the fact.

Firstly in our step-by-step process, all of our data are currently strings. Clearly this is not suitable for some items, but it’s easily fixed.
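For instance, something along these lines (depending on how the dates are formatted in the export, to_timestamp with an explicit pattern may be needed instead of a plain cast):

```python
from pyspark.sql.functions import col

df = (df
      .withColumn("creationDate", col("creationDate").cast("timestamp"))
      .withColumn("startDate", col("startDate").cast("timestamp"))
      .withColumn("endDate", col("endDate").cast("timestamp"))
      .withColumn("value", col("value").cast("double")))

df.printSchema()
```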

The printSchema method indicates that our dataframe now contains time stamps and decimal values where appropriate. This dataframe has, for each row:

  • creationDate: the time the record was written
  • startDate: the time the observation began
  • endDate: the time the observation ended
  • sourceName: the device with which the observation was made
  • type: the kind of biometric data observed
  • unit: the units in which the observation was measured
  • value: the biometric observation itself

Visualisation

So we have a structure for the data, but we haven’t really looked into the substance of the data yet. Questions that we should probably first ask are “what are the kinds of biometric data observed?”, and “how many observations do we have to work with?”. We can answer these with a quick summary. Below we find how many observations exist of each type, and between which dates they were recorded.
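A quick summary of this kind might be produced as follows:

```python
from pyspark.sql import functions as F

display(
    df.groupBy("type")
      .agg(F.count("*").alias("observations"),
           F.min("startDate").alias("earliest"),
           F.max("startDate").alias("latest"))
      .orderBy("observations", ascending=False)
)
```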

We see that some of the measures of energy burned have the most observations:

  1. Active Energy Burned has over 650,000 observations between December 2015 and November 2018
  2. Basal Energy Burned has over 450,000 observations between July 2016 and November 2018
  3. Distance Walking/Running has over 200,000 observations between December 2015 and November 2018
  4. Step Count has about 140,000 observations between December 2015 and November 2018
  5. Heart Rate has about 40,000 observations between December 2015 and November 2017
  6. Other kinds of observations have less than 30,000 observations

This tells us that the richest insights are likely to be found by studying distance travelled, step count, heart rate and energy burned. We might prefer to consider observations that are measured (like step count) rather than derived (like energy burned), although it might be an interesting analysis in itself to try to find how these derivations are made.

Let’s begin by looking into how step count might relate to heart rate. Presumably, higher step rates should cause higher heart rates, so let’s see whether this is borne out in the data.

I’ve chosen to convert the data from a Spark dataframe to a Pandas dataframe to take advantage of some of the datetime manipulations available. This is an easy point of confusion for a starter in PySpark: Spark and Pandas dataframes share a name but operate differently. Primarily, Spark dataframes are distributed, so they operate faster on larger datasets. On the other hand, Pandas dataframes are generally more flexible. In this case, since we’ve restricted our analysis to a subset of the original data, it’s small enough to work with comfortably as a Pandas dataframe.
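A sketch of that conversion, assuming the type names used in an Apple Health export:

```python
from pyspark.sql.functions import col

# Type identifiers as they appear in an Apple Health export
heart_pd = df.filter(col("type") == "HKQuantityTypeIdentifierHeartRate") \
             .select("startDate", "endDate", "value").toPandas()
steps_pd = df.filter(col("type") == "HKQuantityTypeIdentifierStepCount") \
             .select("startDate", "endDate", "value").toPandas()
```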

Actually looking at the data now, one problem appears: the data are not coherent. That is, the two kinds of observations are difficult to compare. This manifests in two ways:

  1. Heart rate is a point-in-time measurement, while step count is measured across a period of time. This is a similar incoherence to the one in economics surrounding stock and flow variables. To make the two variables comparable we can assume that the step rate is constant across the period of time the step count is measured. As long as the period of time is fairly short this assumption is probably quite reasonable.
  2. Heart rate and step count appear to be sampled independently. This means that comparing them is difficult because at times where heart rate is known, step count is not always known, and vice versa. In this case we can restrict our comparisons to observations of heart rate and step rate that are reasonably close in time.

Once we have some observations of heart rate and step rate, we can compare them:
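One way the reconciliation and comparison described above might be sketched, using pandas and matplotlib (the two-minute matching tolerance is an arbitrary choice, and the column names follow the dataframes built earlier):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Turn step counts into a step rate (steps per second) over each recording interval,
# assuming the rate is constant within the interval
steps_pd = steps_pd.sort_values("startDate")
duration = (steps_pd["endDate"] - steps_pd["startDate"]).dt.total_seconds()
steps_pd["stepRate"] = steps_pd["value"] / duration.clip(lower=1)

# Pair each heart rate reading with the nearest step observation in time,
# discarding pairs that are more than a couple of minutes apart
heart_pd = heart_pd.sort_values("startDate")
paired = pd.merge_asof(
    heart_pd[["startDate", "value"]].rename(columns={"value": "heartRate"}),
    steps_pd[["startDate", "stepRate"]],
    on="startDate",
    direction="nearest",
    tolerance=pd.Timedelta("2min"),
).dropna()

# Colour by observation order so older points appear differently to recent ones
plt.scatter(paired["stepRate"], paired["heartRate"],
            c=range(len(paired)), cmap="viridis", s=10)
plt.xlabel("pace (steps per second)")
plt.ylabel("heart rate (bpm)")
display(plt.gcf())   # display() renders the figure in Databricks; plt.show() elsewhere
```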

On the vertical axis we have heart rate in beats per minute and on the horizontal axis we have pace in steps per second. Points are coloured so that older points are lighter, which allows us to see if there is an obvious change over time. The graph shows that Etienne’s usual heart rate is about 80 bpm, but when running it increases to between 120 and 180. It’s easy to notice an imbalance between usual heart rate observations and elevated heart rate observations – the former are much more prevalent.

There appears to be at least one clear outlier – the point where heart rate is under 40 bpm. There are also a small number of observations that have a normal heart rate and elevated pace or vice versa – these may be artifacts of our imperfect reconciliation of step count and heart rate. We could feed this back to improve the reconciliation process or re-assess the assumptions we made, which would be particularly useful with subject matter expert input.

The graph above shows the observations of step rate over time, with black indicating observations that have elevated heart rates. There are a few interesting characteristics – most obviously, observations are far more dense after July 2016. Also, rather alarmingly, there are only a small number of clusters of observations with elevated heart rates, which means that we cannot treat observations as independent. This is often the case for time series data, and it complicates analysis.

We could instead compare the progression of heart rate changes with pace by looking at each cluster of elevated heart rate records as representative of a single exercise event. However, we would be left with very few events. Rather than attempt to clean up the data further, let’s pivot.

Transformation (Iteration 2)

Data don’t reveal all of their secrets immediately. Often this means our analyses need to be done in cycles. In our first cycle we’ve learned:

  1. Data have been collected more completely since mid-2016. Perhaps we should limit our analysis to only the most recent year. This also means we should probably not attempt to identify long-term changes in the data.
  2. Heart rate and step rate are difficult to reconcile because they often make observations at different times. It would be better to focus on a single type of biometric.
  3. There are only a small number of reconcilable recorded periods of elevated heart rate and step rate. Our focus should be on observations where we have more examples to compare.

Instead of step count and heart rate, let’s look at patterns in distance travelled by day since 2017. This pivot answers each of the above issues: it is limited to more recent data, it focuses on a single type of biometric data, and it allows us to compare on a daily basis. Mercifully, distance travelled is also one of the most prevalent observations in our dataset.

You’d be right to say that this is a 180 degree pivot. We’re now looking in an entirely different direction. This is an artifact of our lack of a driving business problem, and it’s something you should prepare yourself for too if you commission the analysis of data for the sake of exploration. You may find interesting insights, or you may find problems. But without a guiding issue to address there’s a lot of uncertainty about where your analysis may go.

Stepping down from my soapbox, let’s transform our data. What I want to do is to record the distance travelled in every hourly period from 8am to 10pm since 2017. Into a dataframe “df_x”, I’ve placed all distance travelled data for 2017:

In the above we tackle this in three steps:

  1. Define a udf (user defined function) which returns the input number if positive or zero otherwise
  2. Use our udf to iteratively prorate distance travelled biometrics into the whole hour between 8am and 10pm that they fell into, naming these columns “hourTo9”, up to “hourTo22”.
  3. Aggregate all distances travelled into the day they occurred

This leaves us with rows representing individual calendar days and 14 new columns representing the distance travelled during each hour of the day.
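A sketch of one way the three steps above might look in PySpark. Here df_x is assumed to hold the 2017 distance-travelled records with startDate, endDate and value columns, timezone handling is glossed over, and the column names follow the convention described above:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Step 1: a udf that returns the input number if positive, or zero otherwise
pos = F.udf(lambda x: float(x) if x is not None and x > 0 else 0.0, DoubleType())

seconds_start = F.unix_timestamp("startDate")
seconds_end = F.unix_timestamp("endDate")
duration = F.greatest(seconds_end - seconds_start, F.lit(1))

# Step 2: prorate each observation's distance into the whole hours it overlaps
df_hours = df_x.withColumn("obsDate", F.to_date("startDate"))
for h in range(9, 23):                                   # hours ending 9am ... 10pm
    hour_start = F.unix_timestamp(
        F.concat_ws(" ", F.col("obsDate").cast("string"), F.lit("%02d:00:00" % (h - 1)))
    )
    overlap = pos(F.least(seconds_end, hour_start + 3600) - F.greatest(seconds_start, hour_start))
    df_hours = df_hours.withColumn("hourTo%d" % h, F.col("value") * overlap / duration)

# Step 3: aggregate all prorated distances into the day they occurred
df_daily = df_hours.groupBy("obsDate").agg(
    *[F.sum("hourTo%d" % h).alias("hourTo%d" % h) for h in range(9, 23)]
)
```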

Visualisation (Iteration 2)

This section is not just an exploration of the data, but an exploration of Databricks’ display tool, which allows users to change the output from a code step without re-running the code step. Found at the bottom of every output generated by the display command is a small menu:

This allows us to view the data contained in the displayed table in a graphical form. For example, choosing “Scatter” gives us a scatterplot of the data, which we can refine using the “Plot Options” dialogue:

We can use these plot options to explore the relationship between the hourly distance travelled variables we’ve created. For example, given a selection of hours (8am to 9am, 11am to 12pm, 2pm to 3pm, 5pm to 6pm, and 8pm to 9pm), we observe the following relationships:

Notice that long distances travelled in one hour of a day make it less likely that long distances are travelled in other hours. Notice also that there is a fair skew in distances travelled, which is to be expected since the longest distances travelled can’t be balanced by negative distances travelled. We can make a log(1+x) transformation, which compresses large values to hopefully leave us with less skew:
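A sketch of the transformation, applied to each of the hourly columns:

```python
from pyspark.sql import functions as F

hour_cols = ["hourTo%d" % h for h in range(9, 23)]
df_log = df_daily.select("obsDate", *[F.log1p(F.col(c)).alias(c) for c in hour_cols])
display(df_log)
```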

The features we have are in 14 dimensions, so it’s hard to visualise how they might all interact. Instead, let’s use a clustering algorithm to classify the kinds of days in our dataset. Maybe some days are very sedentary, maybe some days involve walking to work, maybe some days include a run – are we able to classify these days?

There are a lot of clustering algorithms at our disposal: hierarchical, nearest-neighbour, various model-based approaches, etc. These perform differently on different kinds of data. I expect that there are certain routines within days that are captured by the data with some random variation: a set jogging route that occurs at roughly the same time on days of exercise, a regular stroll at lunchtime, a fixed route to the local shops to pick up supplies after work. I think it’s reasonable to expect that on days where a particular routine is followed, we’ll see some approximately normal error around the average case for that routine. Because of this, we’ll look at using a Gaussian Mixture model to determine our clusters:

I’ve arbitrarily chosen to cluster into 4 clusters, but we could choose this more rigorously. 4 is enough to show differences between different routines, but not too many for the purpose of demonstration.
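A sketch of the clustering step, using PySpark’s GaussianMixture (the seed and the day-of-week summary at the end are illustrative; in Databricks the grouped counts can be rendered as the bar chart discussed below using the plot options):

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import GaussianMixture

# Assemble the 14 hourly features into a single vector column
assembler = VectorAssembler(inputCols=hour_cols, outputCol="features")
features_df = assembler.transform(df_log.na.fill(0))

# Fit a Gaussian Mixture model with 4 components; transform() adds a "prediction" column
gmm = GaussianMixture(k=4, featuresCol="features", seed=42)
clustered = gmm.fit(features_df).transform(features_df)

# Relative frequency of each routine type by day of week
display(
    clustered.withColumn("dayOfWeek", F.date_format("obsDate", "E"))
             .groupBy("dayOfWeek", "prediction")
             .count()
)
```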

The graph above shows the 4 types of routine (labelled as “prediction” 0-3), and their relative frequency for each day of the week. Notably type 1 is much more prevalent on Saturday than other days – as is type 3 for Sunday. Type 2 is much more typical a routine for weekdays, appearing much less on weekends. This indicates that perhaps there is some detectable routine difference between different days of the week. Shocking? Not particularly. But it is affirming to see that the features we’ve derived from the data may capture some of these differences. Let’s look closer.

Above we have the actual profiles of the types of daily routines, hour-by-hour. Each routine has different peaks of activity:

  • Type 0 has sustained activity throughout the day, with a peak around lunchtime (12pm – 2pm).
  • Type 1 has sustained activity during the day with a local minimum around lunchtime, and less activity in the evening.
  • Type 2 has little activity during core business hours, and more activity in the morning (8am – 10am) and evening (5pm – 7pm).
  • Type 3 has a notable afternoon peak (3pm – 6pm) after a less active morning, with another smaller spike around lunchtime.

If you were doing a full analysis you would also be concerned about the variability within and between each of these routine types. This could indicate that more routines are required to describe the data, or that some of the smaller peaks are just attributable to random variation rather than actual characteristics of the routine.

Finally, the visualisation above shows the composition of the daily routines over the course of a year, labelled by week number. The main apparent change through the course of the year is for routine type 2, which is more frequent during cooler months. This concords with what we might suspect: less activity during business hours in cooler, wetter months.

Taken together, perhaps we can use the hourly distance features to predict whether a day is more likely a weekday or a weekend. This model might not seem that useful at first, but it could be interesting to see which weekdays are most like weekends – perhaps these correspond with public holidays or annual leave?

Modelling

Let’s do a quick model to prove that weekends can be classified just with hourly movement data. There are a lot of possible ways to approach this, and a lot of decisions to make and justify. As a demonstrator here we’ll create a single model, but won’t refine it or delve too deeply into it.

Based on the types of routines identified in our cluster analysis, it’s fair to suspect that there may not be a monotonic relationship between the distance travelled in any particular hour and weekend/weekday membership. So rather than using the simplest classification model, logistic regression*, let’s fit a random forest classifier. First, we need to include a label for weekends and weekdays. I choose to call this “label” because by default this is the column name that PySpark’s machine learning module will expect for classification.
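A sketch of the labelling step (dayofweek returns 1 for Sunday and 7 for Saturday in Spark 2.3+):

```python
from pyspark.sql import functions as F

labelled = clustered.withColumn(
    "label",
    F.when(F.dayofweek("obsDate").isin(1, 7), 1.0).otherwise(0.0)   # 1.0 = weekend
)
```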

As usual, to allow us to check for overfitting, let’s separate the data into a training set and a test set. In this case we have unbalanced classes, so some might want to ensure we’re training on equal numbers of both weekdays and weekends. However, if our training data has the same relative class sizes as the data our model will be generalised to, and overall accuracy is important, then there isn’t necessarily a problem with unbalanced classes in our training data.
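A sketch of the split (the proportions and seed are arbitrary):

```python
train_df, test_df = labelled.randomSplit([0.7, 0.3], seed=42)
display(train_df.groupBy("label").count())   # check the class balance of the training set
```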

Now let’s prepare for model training. We’ll try a range of different parametrisations of our model, with different numbers of trees, and different numbers of minimum instances per node. Cross-validation is used to identify the best model (where best is based on the BinaryClassificationEvaluator, which uses area under ROC curve by default).
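A sketch of that setup; the particular grid values are assumptions for illustration:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [20, 50, 100])
              .addGrid(rf.minInstancesPerNode, [1, 5, 10])
              .build())

evaluator = BinaryClassificationEvaluator()   # area under the ROC curve by default

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
```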

Fitting the model is then simply a matter of applying the cross-validation to our training set:
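Something like the following, continuing from the cross-validator defined above:

```python
cv_model = cv.fit(train_df)          # trains and cross-validates every parameter combination
best_rf = cv_model.bestModel         # the best random forest found by the search
```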

Finally, we can evaluate how successful our model is:
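For example, something along these lines:

```python
from pyspark.sql import functions as F

predictions = cv_model.transform(test_df)

test_auc = evaluator.evaluate(predictions)     # area under the ROC curve on the test set
test_acc = (predictions.filter(F.col("prediction") == F.col("label")).count()
            / float(predictions.count()))

print("Test AUC: %.2f, test accuracy: %.2f" % (test_auc, test_acc))
```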

So our model is reasonable on our test data, with an area under the test ROC curve of 0.86 and an overall accuracy of 0.82, which compares favourably to our null model, which would classify all observations as weekdays and achieve an accuracy of 0.71. There are many more possible avenues to investigate, even within the narrow path we’ve taken here. This is a curse of exploratory analysis.

*To be fair, logistic regression can capture non-monotonicity as well, but this requires modifying features (perhaps adding polynomial functions of features).

Wrapping Up

Databricks gives us a flexible, collaborative and powerful platform for data science, both exploratory and directed. Here we’ve only managed to scratch the surface, but we have shown some of the features that it offers. We also hope we’ve shown some of the ways it addresses common problems businesses face bringing advanced analysis into their way-of-working. Databricks is proving to be an important tool for advanced data analysis.

Branch locations and success criteria using predictive analytics – our Agribusiness case study

Exposé designed and developed a solution that uncovered whether there were strong relationships between known characteristics of branches and their success, in order to determine new locations to consider and the services they should offer.

See our case study here: exposé case study – Agribusiness – Branch locations and success criteria using predictive analytics

Blockchain in bits – A technical insight


In our previous two articles, we articulated several real-life use cases for Blockchain implementations, and we elaborated conceptually on how Blockchain differs from current and previous data storage architectures, as well as on other conceptual benefits of Blockchain as a platform.

In this article, we touch upon the technical components of Blockchain networks and Smart Contracts, and we walk through a technical implementation of a viable Blockchain application using the Microsoft Azure platform.

What is Blockchain?

The blockchain is a shared ledger which stores data differently to typical database platforms and solves several challenges by avoiding double spending and the need for trusted authorities or centralised computing servers. Furthermore, Blockchain as a technology has evolved since the introduction of the Bitcoin Blockchain in 2008 (invented by Satoshi Nakamoto), and is now solving recognisable business problems beyond cryptocurrencies.

In addition to the concepts discussed in the previous article, below are some additional descriptions of Blockchain components before we dive into the technical walk-through:

Blocks – A block is a valid record/transaction in the Blockchain that can’t be altered or destroyed. It is a digital footprint based on a cryptographic hash which remains in the system as long as the system is alive. Since the Blockchain is decentralised, the blocks are replicated across the network nodes, thus making them immutable and secure.

Cryptographic hash – Cryptographic hash functions are cryptography algorithms that generate hash values for a given piece of data. They ensure the authenticity, integrity and security of the data.

Nodes –  A node is a computer/server/virtual machine that participates in a Blockchain network. Nodes store all the blocks and transactions generated in the system. A peer-to-peer (P2P) architecture connects nodes of a Blockchain. When a device is attached to the network as a node, all blocks are downloaded and synchronised. Even if one node goes down, the network is not impacted.

Miner Node – Miner nodes create the blocks for processing the transactions. They validate new transactions and add blocks to the Blockchain. Any node can be a miner node since all the blocks in the network are replicated across each node, including the miner node; hence the failure of any miner node is not a single point of failure. It is advisable to use high-powered machines as miner nodes since mining consumes a lot of power and resources.

How a Blockchain transaction works

A Blockchain transaction must complete a set of precursory activities to ensure integrity and security. These steps make a Blockchain network a unique proposition for a trusted computing paradigm.

Let’s look at the Blockchain transaction lifecycle.

  1. A user initiates a transaction on Blockchain through a “wallet” or on a web3 interface.
  2. The transaction is validated by the set of computing nodes called miners using Cryptographic hash functions.
  3. Miner nodes create blocks based on the transaction using crypto economic options like Proof of Work (PoW) or Proof of Stake (PoS)
  4. The block is synchronised within the other nodes within the Blockchain network.

Blockchain transaction lifecycle

Types of Blockchain networks

Before setting up a Blockchain, one must determine the type of network required. There are three types of Blockchain Network applications.

Public Blockchain:

  • An open (public) network ready for use at any given point in time. Anyone can read the transactions and deploy decentralised apps that use the underlying blocks. No central authority controls the network.
  • These Blockchain networks are “fully decentralised”.
  • Use case: Ethereum Cryptocurrency Blockchain can be used efficiently for managing payments or running Blockchain apps globally.

Consortium Blockchain:

  • A group of nodes controls the consensus process. The right to read may be public, but participation within the Blockchain can be limited to consortium members by using API calls to limit access to, and the contents of, the Blockchain.
  • For example, a statutory body or an organisation may implement a regulatory Blockchain application that allows selected organisations to participate in validating the process.
  • These Blockchain networks are “Partially decentralised”.
  • Use case: Reserve Bank of Australia (RBA) can set up a Blockchain network for processing and controlling specific banking transactions across banks based on statutory compliance requirements. Participating banks implement Blockchain nodes to authenticate transactions in the network.

Private Blockchain:

  • Similar to any other centralised database application that is controlled and governed by a company or organisation. The organisation has complete write access and read permissions, although the public may be allowed to see specific transactions at the Blockchain network administrator’s discretion.
  • These Blockchain networks are “Centralised”.
  • Use case: A company can automate its supply chain management using Blockchain technology.

Types of Blockchains

Implementing Blockchain on Azure

Blockchain on Azure is Blockchain as a Service (BaaS): an open, flexible and scalable platform. Organisations can opt for BaaS to implement solutions on a federated network based on security, performance and operational processes without investing in physical infrastructure.

Azure BaaS provides a perfect ecosystem to design, develop and deploy cloud-based Blockchain applications. Rather than spending hours building out and configuring the infrastructure across organisations, Azure automates these time-consuming pieces, allowing you to focus on building out your scenarios and applications. Through the administrator web page, you can configure additional Ethereum accounts to get started with smart contracts, and eventually application development.

Consortium Blockchains can be deployed using:

Ethereum Consortium Leader

  • To start a new multi-node Ethereum Consortium network, deploy the Ethereum Consortium Leader.
  • This creates the primary network for the other multi-node members to join.

Ethereum Consortium Member

  • To join an existing Ethereum Consortium network, deploy the Ethereum Consortium Member.

Private Blockchains can be deployed using:

Ethereum Consortium Blockchain

  • To create a private network, use the Ethereum Consortium Blockchain.
  • This template builds a private network within minutes on the Azure cloud.

Below are links that provide a step-by-step approach to deploying a Blockchain network on the Azure cloud.

Once deployed you will receive the following details:

  • Admin Site: A website you can navigate to showing the status of the nodes on your Ethereum network.
  • Ethereum-RPC-Endpoint: An endpoint for connecting to your Ethereum network via an API like Truffle or web3.js (a short Python sketch using the web3.py library follows this list).
  • Ssh-to-first-tx-node: To interact with your Blockchain, log in using your Secure Shell (SSH) client. I’m currently working on Windows, so I’ll be using PuTTY (https://www.putty.org/) to log in, but you can use any SSH client to connect to the console. On Mac, you can just copy and paste the “ssh” line into your terminal.
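As a hedged illustration, here is a minimal Python sketch using the web3.py library to talk to that RPC endpoint (the URL is a placeholder, and some method names differ slightly between web3.py releases):

```python
from web3 import Web3   # pip install web3

# Placeholder for the Ethereum-RPC-Endpoint reported by the Azure deployment
w3 = Web3(Web3.HTTPProvider("http://<your-ethereum-rpc-endpoint>:8545"))

print(w3.is_connected())            # isConnected() in older web3.py releases
print(w3.eth.block_number)          # current block height (blockNumber in older releases)
print(w3.eth.get_block("latest"))   # details of the latest block (getBlock in older releases)
```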

Interacting with Your Azure Blockchain Using Geth

Geth is a multipurpose command line tool that runs a full Ethereum node implemented in Go. It offers three interfaces: the command line subcommands and options, a JSON-RPC server and an interactive console.

Steps to connect the Blockchain instance:

  • SSH into the Azure server using PuTTY or a command-line interface
  • Use the following command to connect to the Blockchain console

  • This loads all the modules, and the command prompt becomes available

  • Examples of geth commands

You can access the network using the Mist Ethereum wallet or any other Ethereum compatible wallet.

Mist Ethereum wallet

Smart Contracts in action

“Smart Contracts: Building Blocks for Digital Free Markets” – Nick Szabo

Smart contracts are sets of terms and conditions that must be met to allow something to happen between parties. A smart contract is just code, stored in blocks, and is immutable. Smart contracts:

  • Are anonymous.
  • Are secured using encryption so that they are safe.
  • Can’t be lost since they are duplicated into other Blockchain nodes.
  • Speed up the business process.
  • Save money since there is no need for any third party to validate and go through the contract terms.
  • Are accurate since they avoid errors that happen during manual execution of any contracts.

Example of how a smart contract works

In the above example, the following are the actions captured:

  1. Mark uses the healthcare consortium network to record his details. The details are persisted in the blockchain through a smart contract. A smart contract can hold all the required variables and attributes.
  2. Once the smart contract has acquired all the mandatory information and requirements, it is then deployed into the healthcare consortium network. A transaction is initiated for further consultation.
  3. The healthcare consortium network validates the transaction based on the logic defined in the smart contract. Mark is found to have some health issues and the contract/health record is automatically sent to Dr John for further analysis and consultation.
  4. Dr John accesses the record and recommends Dr Anne for specialised treatment. The contract is automatically executed and sent to Dr Anne for further action.
  5. Dr Anne provides necessary treatment to Mark. The details of the treatment are persisted in the smart contract.

There are various tools to write/deploy a smart contract, however, common tools used are:

  • Languages: Solidity.
  • IDEs: Solidity Browser, Ethereum Studio.
  • Clients: geth, eth, Ethereum Wallet.
  • APIs & frameworks: Embark, Truffle, DAPPLE, Meteor, web3.js, EthereumJ, BlockApps.
  • Testing: TestRPC, a testnet or a private network.
  • Storage: IPFS, Swarm or Storj.
  • Dapp browsers: MetaMask, Mist.

An example of solidity script can be found below.

Solidity script

Blockchain and Data Analytics

Perhaps the most critical development in information technology is the growth of data analytics and of platforms in the Big Data, Machine Learning and Data Visualisation space. Analytics platforms and data lakes can source Blockchain data using federated APIs built on top of the Blockchain. Since the provenance and lineage of the data are well established, data from the Blockchain can be helpful in developing a productive data platform for data analytics, machine learning or AI development.

The following diagram gives a simplified view of integrating data analytics with Blockchain.

Blockchain Data Analytics

Conclusion

Before an organisation starts any of the technology assessments and implementation of a Blockchain, even if just for R&D, consider what a Blockchain would mean for your organisation through potential use cases and process improvement opportunities. Moreover, ensure some of the basic concepts described here and in the second article in the series are understood vis-à-vis your identified use cases.

Only then proceed to the technology side of things.

Blockchain has the potential to be a fantastic technology through its federated computing paradigm. But do not lose sight of the process and people aspects associated with it.

Blockchain, lifting the lid on confusing concepts

In the previous article, An Internet of Value – Blockchain, beyond the hype and why CxO’s must take note, we articulated why blockchain is more than just hype and provided some real-life use cases as to where and how blockchain is already helping businesses. In this article, the second in a series of three, we want to dig deeper to explain how blockchain differs architecturally from its traditional database predecessors so it can do a better job at solving these real-world problems. We also describe some of the terminology used so that readers can better understand Blockchain.

This article demystifies two important concepts: Blockchain and Smart Contracts in relation to Blockchain. It then brings the concepts of Blockchain and Smart Contracts together through the prism of the Internet of Things (IoT) and finally concludes with Analytics.

But first, let’s explain why the evolution away from traditional databases towards Blockchain has been necessary to more readily share immutable data.

Blockchain Revisited

To restate what blockchain is – in its simplest form, it is a distributed database system, where there is no one master (primary) database, but many databases that are all considered primary, and where all parties participate in populating entries into the respective databases and receive the entries of the other participants. Blockchain started life as the technology that underpins cryptocurrencies. Whether cryptocurrencies have a long and bright future is very much an open question, but the technology that evolved because of them does indeed have a long and bright future.  Because of this, most businesses will have to start considering what this means for them and take steps to ready themselves for this new and important part of the technology landscape.

 

A precursor to Blockchain – a Centralised approach to data

Just about any organisation that uses digital infrastructure would be using one or more databases to store meaningful and up-to-date information. These are truly silos of data; secure from ‘outsiders’, private, and often customised for a business’ need to operate efficiently.

The use of databases has expanded over the last several decades, as businesses need to open up the data they have securely stored in their databases to other businesses or users. Quite often this is done by using APIs (Application Programming Interfaces – a controlled opening to a database), for other parties to read from and/or write to the database securely. To ensure data is not lost and always available, replication occurs so that there is redundant data ready to replace any data loss that might occur. Yet, no matter how well this is designed, it is vulnerable to single points of failure. Integrity can also be compromised if unrestricted access is granted to a user who does harm to the underlying data.

As brilliant as database environments are, have been, and will be in the future, there are weaknesses.


A typical database environment has a Primary database which stores and manages changes to data, whether by Inserting, Updating or Deleting. The ‘Replica’ database symbolises a conceptual backup database which can be used if the Primary fails, to avoid loss of data but also a loss of up-time. These databases have evolved over time to be able to handle millions of concurrent transactions, servicing thousands of users at any one time. Cost is, and will likely continue to be, the biggest contributing factor in how powerful the database environment becomes, as it is the hardware of the database environment that limits how many concurrent transactions or users it can feasibly handle. The point is, there is ‘nearly’ no conceptual restriction on how powerful a realistic database environment can be, as long as the user is willing to pay for it.

As time has gone on, having isolated database environments is not always what the 21st century requires. More and more, integration between environments is required to ensure an efficient and fast movement of data. This comes in many forms, such as the movement of money from a payer on one system to a payee on another, but also for other reasons, such as opening up a subset of data from a database so it can be interacted with by 3rd parties. An example of this is allowing a customer to log into their account and insert updated contact details, but also to read other data such as their Health Insurance Claims History. It is insecure to allow a 3rd party direct access to a database, so an API (Application Programming Interface) is typically developed.

The development of the API is managed by the owners of the database, who create it for a specific interaction between a 3rd party and the underlying database. The API will typically contain very specific rules about what an end user is able to do within the database. These APIs can then be shared with 3rd party developers so that code can be added to websites, mobile apps or other applications not controlled by the database owner, allowing indirect interaction with the data under a controlled process.

Although APIs (which are used extensively) do enable more interactivity between systems, there are still issues with this database architecture, such as:

–         There is still a single point of failure

–         Lack of transparency

–         Data can be modified without any governance (governance is owned by the database owners – therefore the strength of this governance will be vastly different between different database environments).

–         Reliance on the database owners to develop APIs or other mechanisms so 3rd parties can access the data

–         Reliance on 3rd parties to handle contracts between transacting parties (more on this in Smart Contracts)


Blockchain – a Decentralised approach to data

Conceptually, the architecture of Blockchain goes a long way towards solving many of the problems highlighted above within a typical database environment. An implementation of Blockchain can be considered a set of shared databases, all sharing the same schema and all acting like primary databases. There is no single point of failure; all databases participate by populating data which is replicated throughout all the databases within the Blockchain, albeit with a slight lag time.

Blockchain also gets its name from the way collections of transactions are confirmed: as blocks stored in the database, which effectively creates a chain of data where each block, or collection of transactions, is immutably coupled to the previous and next block. This makes it near impossible to change or remove a transaction within the blockchain, as it would cause the chain to be broken, and subsequently the nodes processing the blockchain will never let that happen. Once data is inserted into the blockchain, it is locked in forever.
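As a simple illustration of that coupling, here is a minimal Python sketch (not any particular Blockchain implementation) showing how each block’s hash incorporates the previous block’s hash, so that tampering with an earlier block invalidates everything after it:

```python
import hashlib
import json

def block_hash(transactions, previous_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps({"tx": transactions, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# A tiny three-block chain
block_1 = block_hash(["Alice pays Bob 5"], previous_hash="0" * 64)
block_2 = block_hash(["Bob pays Carol 2"], previous_hash=block_1)
block_3 = block_hash(["Carol pays Dave 1"], previous_hash=block_2)

# Tampering with block 1 changes its hash, so block 2's stored "prev" no longer matches
tampered_block_1 = block_hash(["Alice pays Bob 500"], previous_hash="0" * 64)
print(tampered_block_1 != block_1)   # True: every later block would now fail validation
```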

For an organisation that has no need to share data or be more transparent (even internally), there is no real benefit of Blockchain. But such organisations will become rare as commercial trends are to share, collaborate and to open key data assets for mutual benefit. Therefore, Blockchain presents a huge opportunity to make this process possible and secure.


In a Blockchain architecture, transactions are written to blocks, which are replicated to the other participants in the blockchain network/consortium. All participants then validate the transactions, and only then are the blocks added to the chain to create the tamper-proof audit log.


Smart Contracts

In some of the use cases described in our previous article, An Internet of Value – Blockchain, beyond the hype and why CxO’s must take note, we mentioned “Smart Contracts”. But what are they, and how can they assist?

Smart Contracts are one of the key opportunities provided by Blockchain – first introduced by Nick Szabo all the way back in 1994.

What are Smart Contracts?

Contracts that set out the terms and conditions that must be met to allow something to happen are commonplace, and in centralised models these contracts between the two transacting parties, affirming the terms and conditions of the contract, are facilitated by 3rd parties (for example a bank, regulatory body, government, etc).

A Smart Contract is, like any other contract, language (actually a tiny piece of code stored inside a Blockchain) that describes a set of terms and conditions that must be met to allow for something to happen. It automatically verifies fulfilment and then executes the agreed terms. Smart Contracts, therefore, remove the need for 3rd parties as it allows parties to have an agreement to transact directly with each other.

Smart Contracts and Blockchains are immutable, in other words, they cannot be tampered with, and they are distributed, meaning the outcome of the contract is validated by everyone in the network.

For those interested in seeing what a smart contract actually looks like, here is an example. The third and final article in this series will also walk through an implementation of Blockchain and Smart Contracts.

IoT, using Blockchain and Smart Contracts (Blockchain IoT, or BIoT)

There were 20 billion connected IoT devices in 2017. This is projected to grow to 75 billion by 2025.

Centralised models for storing data, as described at the beginning of this article simply won’t be able to cope with security and volume demands. Plus, any transactional contracts will have to rely on 3rd parties.

The decentralised insertion of data into a Blockchain eliminates a single point of failure and is a tamper-proof way to store data. As a result, security against rogue participants is increased (in this case device spoofing and impersonation).

Regarding the IoT, each legitimate device can write to a Blockchain, and they can easily identify and authenticate to each other without the need for central brokers or certification authorities. The eco-system can scale massively as it can support billions of devices without the need for additional resources.

Smart Contracts extend the Blockchain functionality to include the contract between transacting parties. This removes any IoT eco system’s reliance on 3rd parties to handle the transactional contractual arrangements.

Analytics

Blockchain now also presents huge opportunities for Advanced Analytics.

Data analytics will be crucial in tracking Blockchain activities to help organisations that use Blockchain make more informed decisions. Every use case of Blockchain mentioned (sport, health, insurance, asset management and delivery) brings data analytics opportunities, especially when coupled with machine learning to find those nuggets deep inside the trillions of transactions.

Whether in a traditional database, unstructured text with big data or this new concept of the blockchain, data analytics can use this data to help users and IoT devices.  Moreover, AI, as well as predictive analytics, can help users make informed decisions, making the human ecosystem ever more efficient.

Conclusion

Blockchain and Smart Contracts, along with their combination with IoT, have plentiful use cases. The data technology sector has the ability to guide this new epoch of technology forward, ironing out the creases along the way for this technology to become as common as the internet is today. The idea of blockchain is very similar to the internet: it is networked and decentralised, requiring protocols (shared schemas) to be followed to allow it to communicate effectively. Everyone can see how life-changing the internet has become to our day-to-day lives. Organisations who visualise and implement these benefits sooner will be the leaders of the next technical revolution. This is absolutely where Exposé can help.

In the next article, we will delve a lot deeper into a technical implementation of Blockchain and Smart Contracts.

Joint Authors: Etienne Oosthuysen (Exposé, Head of Technology and Solutions) and Cameron Wells (Exposé, Technical Lead)

An Internet of Value – Blockchain, beyond the hype and why CxO’s must take note

A Blockchain, in its simplest form, is a distributed database system where there is no one master (primary) database, but many databases that are all considered primary. All parties participate in populating entries into the respective databases and receive the entries of the other participants.

But how does this apply to your business, and is this profoundly going to change how the world works? Let’s look at an analogy: Imagine I create a song and generate a file of my recording in mp3 format on a USB stick. I can give two of my friends a copy of this; they can do the same, and so on. With thousands of eventual copies going around, it will be impossible to establish which was the real version I own and which I ideally wanted to use in exchange for royalties. By the way, if I ever did create a song and record it, I doubt very much that I would garner thousands of fans. I am just not Dave Grohl 😊

This is where Blockchain comes in. It is a shared ledger that is used to record any transaction and track the movement of any asset whether tangible, intangible or digital (such as my mp3). It is immutable, so participants cannot tamper with entries, and it is distributed, so all participants share and validate all the entries.

Blockchain will allow “my fans” 😊 to enter into a contract with me directly. As they stream the song, payment goes directly from their wallet into mine. The information about what was listened to and what I was paid, is verified by all the databases in the network and cannot be changed. There are no middlemen (like a central streaming service, or a record label), so the contract (a digital smart contract) is between those that listen to my song and me directly.

It is at this point important to mention that Blockchain is not Bitcoin, or any other cryptocurrency, although it did start life as the technology that underpins cryptocurrencies. This article, the first in a series of three articles, looks beyond its use in cryptocurrencies, and rather highlights use cases to show CxO’s why it is so important to take note of Blockchain and to start controlled proof of concepts (POC’s) and research and development (R&D) in this technology now. We look at some examples across a wide range of industries and use a Courier based use case to delve deeper into what Blockchain could mean for organisations using the Internet of Things (IoT).

Sport

Dope testing and cheating have been quite topical lately, with large portions of the Russian contingent banned from the Rio Olympics in 2016, and again from the Winter Games in South Korea in 2018, for systemic manipulation of tests. Blockchain will make the test results immutable and open the results up to all that participate in the data cycle. Even if the athlete changes sports, that data will be available to participating sporting organisations. http://www.vocaleurope.eu/how-technology-can-transform-the-sports-domain-blockchain-2-0-will-boost-anti-doping-fight-sports-investments-and-e-sports/

Health

Some countries are planning health data exchanges with the aim of addressing a lack of transparency and improving trust in patient privacy as well as fostering better collaboration. Blockchain will provide practitioners and providers with better access to health, patient and research information. Adoption of Blockchain will lead to closer collaboration and better treatment and therapies, sooner.

Blockchain in healthcare is real and imminent. This study from IBM shows how pervasive Blockchain is expected to become with 16% of 200 responding health executives aiming to implement a Blockchain solution shortly. https://www.ibm.com/blogs/think/2017/02/Blockchain-healthcare/

Banking

 Australia’s Commonwealth Bank collaborated with Brighann Cotton and Wells Fargo to undertake the world’s first global trade transaction on Blockchain between independent banks – an unbroken digital thread that ran between a product origin and its final destination, capturing efficiencies by digitising the process and automating actions based on data. https://www.commbank.com.au/guidance/business/why-blockchain-could-revolutionise-the-shipping-industry-201708.html

CommBank is taking this a few steps further with an appointed head of Blockchain and a whopping 25 proofs of concept over the past five years, including the ability to transfer funds peer-to-peer offshore within minutes rather than days, and the issuing of smart contracts. http://www.innovationaus.com/2017/12/CBA-outlines-a-blockchain-future

Insurance

Customers and insurers will be able to manage claims better, transparently and securely. Claim records, which are tamper proof once written to the chain, will streamline the claim process and minimise claimant fraud such as multiple claims for the same incident.

With smart contracts, payments can be triggered as soon as certain minimum conditions are met. Smart contract rules could also determine when a claim is fraudulent and automatically deny it. https://www2.deloitte.com/content/dam/Deloitte/ch/Documents/innovation/ch-en-innovation-deloitte-Blockchain-app-in-insurance.pdf
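To make the smart contract idea a little more concrete, the sketch below expresses a simple parametric payout rule in plain Python rather than in an actual smart contract language. The flight-delay scenario, the threshold and the payout amount are all hypothetical; a real contract would also take its inputs from an agreed data feed recorded on the chain.

```python
from dataclasses import dataclass

# Hypothetical parametric policy: pay out automatically once a minimum
# flight delay is reached, and deny duplicate claims for the same incident.
PAYOUT_THRESHOLD_MINUTES = 180
PAYOUT_AMOUNT = 500.00

@dataclass
class Claim:
    policy_id: str
    incident_id: str
    delay_minutes: int   # in practice this would come from an agreed data feed

def evaluate_claim(claim: Claim, paid_incidents: set) -> dict:
    """Apply the contract rules and return the outcome."""
    if claim.incident_id in paid_incidents:
        return {"policy": claim.policy_id, "status": "denied",
                "reason": "duplicate claim for the same incident"}
    if claim.delay_minutes < PAYOUT_THRESHOLD_MINUTES:
        return {"policy": claim.policy_id, "status": "denied",
                "reason": "minimum conditions not met"}
    paid_incidents.add(claim.incident_id)
    return {"policy": claim.policy_id, "status": "paid", "amount": PAYOUT_AMOUNT}

paid = set()
print(evaluate_claim(Claim("POL-001", "FLT-QF1-20180115", 240), paid))  # paid
print(evaluate_claim(Claim("POL-001", "FLT-QF1-20180115", 240), paid))  # denied as duplicate
```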

Courier Delivery

Couriers deliver millions of items each day, often across vast geographical distances and multiple sovereign borders, each with unique laws and processes.

These businesses, which often make heavy use of IoT devices, will benefit hugely from Blockchain through an improved ability to track every aspect of a package’s delivery cycle and to minimise fraud.

There were around 20 billion connected IoT devices in 2017, a number projected to grow to 75 billion by 2025. https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/

The current centralised approach to the insertion and storage of IoT data (see the image below) simply won’t be able to cope with volume demands, and transactional contracts will have to rely on multiple third parties. Managing data security can also be very complex, because data will flow across many administrative boundaries with different policies and intents.

In contrast, Blockchain’s decentralised, peer-to-peer approach to the insertion and storage of IoT data eliminates the volume problem, because the data is stored across a potentially unlimited number of databases. There is no single point of failure that can bring the whole IoT network to a halt, since computation and storage are shared and there is no single primary node. It also supports tamper-proofing: all participating databases validate a transaction, which is then shared and becomes immutable. This means increased security against rogue participants such as IoT device spoofers and impersonators. (Spoofing can occur when security is breached through a poorly secured device on a shared IoT network; in a centralised model, compromising that one weak device can compromise the whole network, because the network trusts the intruder once they are inside through the easily hacked device.)
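As a simplified illustration of that validation step, the Python sketch below accepts an IoT reading onto the shared ledger only when a majority of participant nodes endorse it. The node names and the validation rule are made up, and the keyed hash stands in for a proper digital signature; a production network would use real cryptographic signatures and a real consensus protocol rather than this toy majority check.

```python
import hashlib
import json

# Hypothetical participants that each hold a copy of the shared ledger.
PARTICIPANTS = ["utility-node", "council-node", "contractor-node",
                "regulator-node", "insurer-node"]

def endorse(node: str, payload: str) -> str:
    """Stand-in for a digital signature: a keyed hash of the payload."""
    return hashlib.sha256(f"{node}:{payload}".encode("utf-8")).hexdigest()

def is_valid(transaction: dict) -> bool:
    """Each node checks the transaction independently before endorsing it.
    Here we only check that the required fields are present; a real node
    would also verify device identity, certificates and business rules."""
    return all(key in transaction for key in ("device_id", "reading", "timestamp"))

def commit(transaction: dict, ledger: list) -> bool:
    payload = json.dumps(transaction, sort_keys=True)
    endorsements = {}
    for node in PARTICIPANTS:
        if is_valid(transaction):          # every node re-checks independently
            endorsements[node] = endorse(node, payload)
    # Only written to every copy of the ledger if a majority endorses it,
    # so a single spoofed or compromised device cannot insert data on its own.
    if len(endorsements) > len(PARTICIPANTS) / 2:
        ledger.append({"transaction": transaction, "endorsements": endorsements})
        return True
    return False

shared_ledger = []
print(commit({"device_id": "flow-sensor-07", "reading": 4.2,
              "timestamp": "2018-01-15T03:00:00Z"}, shared_ledger))  # True
print(commit({"device_id": "spoofed-device"}, shared_ledger))        # False: rejected
```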

Delving deeper into our Courier Delivery use case – Blockchain and IoT, creating an Internet of Value

In a courier parcel delivery ecosystem, the movement of a parcel is tracked at every step of the delivery process by IoT devices that read a barcode, or another form of identification the sensor can pick up: from the original warehouse to a vehicle, a plane, another warehouse, and finally your home.

By using Blockchain, each sensor participates in the chain and records “possession” of the delivery item (and therefore its location). Each time the item is read by a new sensor, the new location is broadcast, inserted, then shared and agreed on by the remaining participants on the Blockchain. Every block is therefore a transaction that cannot be changed once inserted into the blockchain.

Each Blockchain entry (i.e. the barcode, the location of the package and a date-time stamp) is encrypted into the Blockchain. The “possession” steps are tracked no matter who is involved in the package delivery process (from the origin, which could be delivery at an Aus Post outlet, to an Aus Post vehicle to the airport, to QANTAS en route to the US, to a DHL distribution centre at a US airport, and finally to a DHL delivery vehicle en route to the destination address). This enhances trust in the system, as there is no need to adhere and interface to a single primary system, and package tracking is put on steroids. If you have ever sent anything abroad, you will know that granular tracking effectively ends at the border. That won’t be the case with Blockchain. https://www.draglet.com/Blockchain-applications/smart-contracts/use-cases
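A minimal, self-contained Python sketch of the idea follows: every scan appends a “possession” entry that is hash-linked to the previous entry, so the chain of custody can be replayed end to end and any retrospective edit is detectable. The barcode, custodians and locations below are made up, and a real implementation would add encryption, digital signatures and multi-party validation on top of this.

```python
import hashlib
import json
from datetime import datetime, timezone

def add_scan(chain: list, barcode: str, custodian: str, location: str) -> None:
    """Append a 'possession' entry, linked to the previous entry by its hash."""
    entry = {
        "barcode": barcode,
        "custodian": custodian,
        "location": location,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "previous_hash": chain[-1]["hash"] if chain else "0" * 64,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode("utf-8")).hexdigest()
    chain.append(entry)

def chain_is_intact(chain: list) -> bool:
    """Any participant can recompute the hashes to detect tampering."""
    for i, entry in enumerate(chain):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["hash"] != hashlib.sha256(
                json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest():
            return False
        if i > 0 and entry["previous_hash"] != chain[i - 1]["hash"]:
            return False
    return True

parcel = []
for custodian, location in [
    ("AusPost outlet",          "Adelaide, AU"),
    ("AusPost vehicle",         "Adelaide Airport, AU"),
    ("QANTAS freight",          "In transit AU -> US"),
    ("DHL distribution centre", "Los Angeles, US"),
    ("DHL delivery vehicle",    "Destination address, US"),
]:
    add_scan(parcel, "PKG-0001", custodian, location)

print(chain_is_intact(parcel))                # True: full custody trail agrees
parcel[1]["location"] = "tampered location"   # any edit breaks the hash links
print(chain_is_intact(parcel))                # False: tampering is detected
```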

Conclusion

It must be noted that Blockchain technology has not been around for very long and is rapidly evolving. Widespread commercialisation beyond cryptocurrencies is still in its infancy, but all indications are that it will be a hugely disruptive technology.

The many examples of important players taking this technology seriously move Blockchain beyond hype.

CxOs may ask why they should invest in something they cannot yet fully understand, but a very similar question was probably asked about the internet in the 90s. The learning curve will no doubt be steep, but that makes investing in targeted R&D and POCs early all the more important, so that they are not caught off guard once commercialisation starts to accelerate.

In the next article, Blockchain, lifting the lid on confusing concepts, we will delve a little deeper into the underlying concepts.

ETL as a Service – Lifting & Shifting your existing SSIS ETL to Azure

Why a trend towards PaaS over IaaS ETL and how this relates to SSIS

Businesses are moving towards environments where infrastructure is invisible to the business: Platforms as a Service (PaaS) are developed, deployed and managed in the cloud to reduce operational infrastructure overhead costs and to increase the scalability and availability of services.

But why is PaaS more attractive compared to on-premise or Infrastructure as a Service (IaaS) solutions?

  • Savings – PaaS services often use a Pay as You Go (PAYG) billing model, meaning you only pay for what you use and when you use it (for example, there are no costs due to idle server times).
  • Scalability – PaaS provides huge scalability and availability on demand.
  • Less overhead – PaaS requires no patching or maintenance of application infrastructure – it’s all taken care of. This means less staff overhead from a maintenance perspective, freeing staff for activities that deliver a definite and measurable business benefit, such as new solution development, improving existing solutions, or data analytics.


This did not apply to SSIS

Microsoft’s incumbent data integration technology for the extract, transform and load (ETL) of large volumes of data, SQL Server Integration Services (SSIS), has played a key role in many organisations’ data platforms, but unfortunately has not been available as a service (PaaS).

So businesses with existing investments in SSIS, wishing to move their data analytics solutions into the cloud and leverage the advantages of PaaS over IaaS, have not been able to do so with their existing SSIS workloads. They would have had to embark on a full ETL redevelopment using an alternative data integration technology such as Azure Data Factory (ADF), or host their SSIS packages on Azure VMs running SSIS.

The good news

The good news is that Microsoft (MS) recently announced the public preview of the Azure-SSIS Integration Runtime (SSIS-IR), as well as publicly announcing Azure SQL Database Managed Instance (SQLDBMI). Together these two new PaaS services allow businesses to leverage the advantages of PaaS for their ETL workloads without a major redevelopment.

In this article

  • We discuss what this means for businesses.
  • We discuss some capabilities still missing in public preview but in the pipeline of work for General Availability of the product.
  • We discuss how the recent announcement of SQLDBMI goes hand in hand with shifting your SSIS solutions into PaaS.
  • We touch on some of the other benefits of SQLDBMI.
  • We walk the reader through creating an SSIS-IR instance as well as deploying, running and monitoring package executions.

What are the business benefits?

Savings, Scalability and Less overhead

SSIS workloads can now leverage the attractive cost advantages of a PaaS PAYG model – this includes instantaneous scale-up and scale-down to meet varying workloads, paying only for what you use, and no hidden costs from idle servers.

By moving your SSIS workloads to PaaS, you minimise the overhead costs, time and skillsets required to manage supporting infrastructure. Infrastructure often involves costly hardware (including depreciation, physical space and upgrades), plus backup, network, redundancy and failover costs. Most of these costs simply do not apply to PaaS solutions.

Your staff can also focus on where there is maximum business benefit – i.e. the solutions themselves, rather than infrastructure setup, configuration, maintenance and licensing.

In addition, use what you previously developed

SSIS-IR allows businesses to easily lift and shift existing ETL workloads into the cloud, so there is no need for redevelopment in ADF. Developers familiar with the extensive functionality and components available in SSIS can continue to develop using proven tools such as Visual Studio, and then deploy to their SSIS-IR instance in the same fashion they would for on-premise solutions.

What does this mean in $ terms?


A cost analysis of the three options (running SSIS on an on-premise server, on a cloud-based VM, or as a PaaS service), based on an ETL process that runs twice a day and takes around 1 hour per run (~60 hours a month), shows a clear cost advantage when using PaaS.

Please note that the table below is NOT an official quote and any cost estimates must be obtained from your licensed software provider. Costs taken from the following sources:

CostMatrix
Indicative cost comparison of the three options


This option also leaves the business with the flexibility to easily scale compute power up or down as its needs change. Doubling the compute power would take the annual cost from roughly $550 to $870.
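As a rough illustration of the PAYG arithmetic (not a quote), the snippet below works out an annual cost from runs per day, hours per run and an hourly rate. The two hourly rates are hypothetical placeholders, chosen only so the outputs land near the ~$550 and ~$870 figures mentioned above; they are not Azure pricing.

```python
# Hypothetical PAYG arithmetic only – the hourly rates below are placeholders,
# not Azure pricing; obtain real estimates from your licensed software provider.
RUNS_PER_DAY = 2
HOURS_PER_RUN = 1
DAYS_PER_MONTH = 30

hours_per_month = RUNS_PER_DAY * HOURS_PER_RUN * DAYS_PER_MONTH   # ~60 hours

hypothetical_hourly_rates = {
    "baseline compute": 0.76,      # placeholder rate
    "double the compute": 1.21,    # placeholder rate
}

for option, rate in hypothetical_hourly_rates.items():
    annual_cost = hours_per_month * 12 * rate
    print(f"{option}: ~{hours_per_month} hours/month, ~${annual_cost:,.0f} per year")
```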


Regarding FTE overhead and maintenance: even if a conservative 30% of DBA time is saved by adopting PaaS, this represents a major cost saving and productivity increase for the business.

What is currently missing? And what is coming?

SSIS-IR is still in public preview. Therefore, some features are not yet available at the time of writing (January 2018). These features are all expected to be available at General Availability of the product:

  • 3rd party assemblies and APIs are not supported in SSIS-IR in the preview; only the stock SSIS components are supported.
  • Only East US, East US 2, West Europe and North Europe host SSIS-IR in the preview.
  • There is no portal GUI available for provisioning your SSIS-IR instance; PowerShell scripts must be used to provision, stop/start and scale up/down.
  • You cannot schedule the execution of SSIS packages from within SSIS-IR in the preview; packages must be run manually.
    • This is where another upcoming release by Microsoft, Azure SQL Database Managed Instance (SQLDBMI), comes into play.

SQLDBMI

The PaaS database service Azure SQL Database Managed Instance (SQLDBMI) was announced at the recent MS Ignite conference and is set to add some key features, creating a more powerful version of Azure SQL Database. It will be much closer to what you would normally expect from the on-premise version of SQL Server, and it makes a very compelling case for migrating existing databases to the cloud. One of the key features in SQLDBMI is SQL Agent, which will enable the scheduling, automation and execution of SSIS-IR packages.

Also in SQLDBMI

In addition to SQL Agent, SQLDBMI brings many other great features. The public preview of SQLDBMI is set to be available around Q1 2018. So, stay tuned for the following:

  • DB Mail
    • An enterprise solution for sending email messages through the SQL Server database engine
  • R / Machine Learning Services
    • Develop and deploy data science solutions that uncover new insights as an in-database service
  • Service Broker
    • Native support for messaging and queuing applications.
  • CDC
    • Change Data Capture records data manipulation operations applied to SQL Server tables in an easily consumed relational format
  • Replication
    • Enables the copying and distributing of data and database objects from one database to another and the synchronising between them to maintain consistency.
  • Resource Governor
    • Used to manage SQL Server workloads and system resource consumption by specifying limits on the resources that incoming application requests can use

Conclusion

Together, SSIS-IR and SQLDBMI will make it much easier for businesses to lift and shift existing Business Intelligence architectures (notably data storage in a fully functional Database as a Service environment, and batch-based Extract, Transform and Load solutions) straight into the cloud without the additional cost of redevelopment. The benefits of reduced operational overhead and running costs, and of increased scalability and availability, are hard to overlook when comparing on-premise or IaaS-based solutions with PaaS-based architectures.

For those interested in the more technical aspects of SSIS-IR, the remainder of this article walks through setting up SSIS-IR. Knowledge of SQL databases and SQL Server Integration Services is required for the remainder of the article.

Walkthrough of setting up Azure-SSIS Integration Runtime

An existing Azure SQL Server is required – this is where SSISDB will reside. We preloaded the server with an Azure SQL DB – AdventureWorks (not required) – which was used for some SSIS tests.

SQLDB Provisioning Screen
Resource group with Azure SQL Server and SQLDB

Missing functionality in the preview – next, because there is no GUI in the private or public preview, the SSIS-IR instance must be provisioned using PowerShell scripts. The parameters, variables and example scripts can be found here: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-deploy-ssis-packages-azure

After provisioning, the following objects should appear in your resource group.

SQLDB Provisioning Screen 2
Resource group after provisioning showing ADFv2 and SSISDB

After connecting to your Azure SQL Server through SSMS, you can see that provisioning has created an SSISDB database. To see the Integration Services Catalog, you must connect to your Azure SQL Server with the SSISDB database specified directly through the advanced connection options.

Integration Services Catalog
Integration Services Catalog menu

You can now go ahead and create folders and deploy packages to the Azure-SSIS Integration Runtime environment just as you would with a regular SSIS server. We created a few simple test cases in Visual Studio and deployed the project to the newly created Azure-SSIS IR.

SSIS Packages
SSIS Project packages

These packages can be configured in the same way as standard integration services, including parameters, environment variables and connection managers.

SSIS Package Configuration
SSIS Package configuration

Running these packages and checking the Integration Services Catalog reports provides all the logging associated with standard SSIS.
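For completeness, here is a hedged sketch of kicking off one of these deployed packages programmatically through the SSISDB catalog stored procedures, rather than from the SSMS dialog. It assumes the pyodbc Python package, an installed SQL Server ODBC driver and SQL authentication against the SSISDB database; the server, credentials, folder, project and package names are placeholders, and your connection details may differ.

```python
import pyodbc

# Placeholder connection details – replace with your own Azure SQL server,
# credentials and SSIS catalog folder/project/package names. Any installed
# SQL Server ODBC driver can be used in place of driver 17.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=SSISDB;UID=youruser;PWD=yourpassword",
    autocommit=True,   # start_execution should not run inside an open transaction
)
cursor = conn.cursor()

# Create an execution for the deployed package, start it, and return its id.
cursor.execute("""
    SET NOCOUNT ON;
    DECLARE @execution_id BIGINT;
    EXEC [catalog].[create_execution]
        @folder_name     = N'TestFolder',
        @project_name    = N'TestProject',
        @package_name    = N'Package1.dtsx',
        @use32bitruntime = 0,
        @execution_id    = @execution_id OUTPUT;
    EXEC [catalog].[start_execution] @execution_id;
    SELECT @execution_id AS execution_id;
""")
execution_id = cursor.fetchone()[0]
print(f"Started SSIS execution {execution_id}")

# The same logging surfaced in the catalog reports can be queried directly.
cursor.execute(
    "SELECT status FROM [catalog].[executions] WHERE execution_id = ?", execution_id)
print("Execution status code:", cursor.fetchone()[0])
```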

SSIS Package Execution
SSIS Package execution

SSIS Package Run Log
SSIS Package run log