The STEM gender gap and how we can lead the way for the next generation

Authored by Emma Girvan and Sandra Raznjevic

In our continued partnership with St. Peter’s Collegiate Girls’ School, we were honoured to be invited to attend the inaugural Women in STEM Breakfast on Thursday 23 May, which was hosted by Year 10-12 students.

This fantastic event included over 30 female industry professionals who were invited as mentors. They were only too happy to share their own personal journeys and experiences as successful women working in STEM-related fields. We were all lucky enough to hear from three invited guest speakers: Sarah Brown (State Director, Code Like A Girl), Dr Kristin Alford (Director, MOD) and Dr Bronwyn Hajek (Lecturer/Researcher, University of South Australia). It was incredibly inspiring for all of us to hear how they’ve been able to navigate their respective careers.

We heard accounts of how women are still so dramatically underrepresented across all STEM studies and careers and as mentors for these young, impressionable and highly motivated students, we all felt a sense of responsibility to share and encourage them on their paths to success.

Whilst enjoying some of the culinary delights prepared by the Food Technology students (which they were being assessed on!), the girls had the opportunity to practise their networking skills and gain as many insights as they could from the industry mentors around the table. It was fascinating to hear the diversity of the girls’ passions and interests and what their career aspirations were. Amongst the girls, we met a budding Commonwealth Games Australian archer (maybe an Olympian one day if they change the rules), a software developer, a forensic scientist, a designer, an organ transplant surgeon and some professions which, to be honest, we’d never heard of! For us, it was a great opportunity to dispel some of the misconceptions of what a career in IT looks like and how some of their interests and skills could be applied in the least likely of ways.

With research indicating that fewer than one in five students enrolled in degrees in engineering, physics, mathematical sciences or information and communications technology (ICT) in Australia are women[1], and at a time when technology continues to transform the way we live, work and learn, the need to close the STEM gender gap is more critical than ever.

Women are lost at every stage of the professional ladder in STEM fields, due to a range of factors including stereotypes, discrimination, and workplace culture and structure[1], some of which manifest from early school years.

Studies also indicate that role models can be used to both attract and retain women in STEM. Using women as role models has been found to be more effective in retaining women in STEM[2].

At Exposé, we can proudly boast that 40% of our staff are women, many of us with young daughters, including our own General Manager. We are very passionate about continuing our partnerships with colleges and universities, particularly in Adelaide, working with young women to provide mentorship and guidance to help steer them on a path from which girls have traditionally dropped off the radar. So much so that we will be working with St. Peter’s Collegiate Girls’ School again this year to provide our special data analytics project, which encourages young women to think beyond the “nerdy” coding and help desk stigma associated with the IT industry.

After listening to the panel of speakers and chatting to a number of young girls who were keen to pick our brains on ‘a day in the life of a girl working in STEM’, the message at the end of the morning was clear – find something you love doing, then find a way to do it every day, and if you’re lucky enough you might even get paid well for doing it.

For all of us, invited guests included, it was a good reminder to keep giving new things a go, even if it puts you outside of your comfort zone, because you never know where it could lead! Moreover, it dawned on us that anyone who is raising young women needs to be mindful that they are also (potentially) raising future mums. So we should keep in the back of our minds that there will most likely be a period of our daughters’ lives where they will need to take some time out to create another human being and also potentially manage and nurture this human into adulthood. We need to support this in our industry, not only to entice young women to step over into the (not so) dark side, but to show them that they are supported. This is extremely relevant to Exposé currently as our General Manager raises her new child whilst running the business.

[1] International Labour Organization. ABC of women workers’ rights and gender equality. (International Labour Organization, Geneva, 2007)

[2] Drury, B. J., Siy, J. O. & Cheryan, S. When do female role models benefit women? The importance of differentiating recruitment from retention in STEM. Psychological Inquiry 22, 265 – 269, doi:10.1080/1047840X.2011.620935 (2011)

Exposé – the 2019 Microsoft Worldwide Partner of the Year, runner up in the category of Power BI

We are delighted to announce that Exposé has been named as the runner up for the 2019 Microsoft Worldwide Partner of the Year award in the category of Power BI. We’re proud to add this global award to our two previous Microsoft Australian Partner Awards in 2016 and 2017. This achievement is no small feat given how young we are and is truly a testament to our talented and committed team, both in Adelaide and Melbourne.

The SA Water solution we submitted for this global award followed our tried and tested best practice approach, ensuring a thorough understanding and delivery of business outcomes first, with the technology simply being the enabler. We constantly bend and push the envelope on technology, in this instance Power BI, to deliver the required outcomes, rather than make outcomes bend to technology.

“The modularity and scalability provided by the Power BI and the larger Azure platform allowed us to tailor something pretty unique to our customer and their challenging requirements. It allowed us to create a truly scalable, responsive and extendable IOT based analytical ecosystem that can be scaled out to thousands of devices, leveraging complex alarm rules controlled by users, visual remote monitoring and responsive actions, visual analysis, and now deep learning over the data.” Etienne Oosthuysen, National Manager, Technology and Solutions

“I am incredibly proud of our team for delivering a solution for SA Water which has not only over delivered on our customer’s requirements, but has now been recognised globally as a best of breed solution. Thank you to SA Water for trusting us with your data and allowing us to develop a forward thinking solution.” Kelly Drewett, General Manager

See the nominated SA Water solution case study here.

See a short video of the solution here.

Say that Again? Power BI Commentary extends to Reports

Power BI recently announced the extension of its commentary capability to Power BI reports. Yes, you can now add comments to both report pages and specific visuals to improve your data discussions!

These conversations are automatically bookmarked, so the report context is retained exactly as it was when the comment was written, complete with the original filters. Reporting by exception is supported too: anyone tagged with an @mention receives a push notification on their mobile device to alert them.

Whilst commentary is nothing new in BI tools (Power BI is a bit late to the game), it’s here now and we’ve subsequently put it through its paces to see how it stacks up!


The following exposé samples show the analysis for a retail organisation. The data, which updates hourly, is sourced from 3 different on-premise systems and modelled into a user-friendly sales model with a specific focus on Products, Customers and Suppliers & Export. The Head of Sales noticed an unusual spike in sales (in $ terms) back in April and created a comment for his sales managers to see. His sales manager picked up the comment and conducted the visual analysis, finding the reason for the spike. By retaining the conversation, anyone with access to the sales analysis can visually play back what was said and see the context of the discussion visually.

This saves staff time –  they don’t need to rediscover the reason for what may well be a very common question.

In the sections below, we step through these events, culminating in our conclusions on this new functionality in Power BI.

Let’s have a look

The first set of images shows the 4 relevant visuals the Head of Sales would have initially looked at, either on his laptop or on his mobile phone. They analyse sales through the lenses of Product, Customer Country, Export (Supplier Country) and Sales (over time) respectively.

The Head of Sales picks up the unusual spike in April in the 4th visual, Product Sales. And he posts his first comment.

This comment is then picked up by one of the Sales Managers, who conducts some interactive analysis and subsequently responds to the Head of Sales. The Head of Sales is notified and clicks on the comment to see the full visual context. Note how selecting the comment plays back the visual as it would have appeared when the comment was made, and spotlights the specific played-back visual, clearly showing the 4 products.

The Head of Sales now has a further comment, asking for clarification as to where these 4 specific products are sold.

This specific Sales Manager (note I simply use one of our guest accounts to represent him) is notified of the comment and does further interactive analysis, and responds.

The Head of Sales is notified of the new comment and clicks on the new comment to see the full visual context – selecting the comment again plays back the visual to what it would have looked like when the comment was made, and spotlights the specific played back visual clearly showing the 2 countries.

This now gives the Head of Sales enough context to understand what led to the spike. He or his delegate can now jump into Power BI and create a new visual from the user-friendly sales model that will continue to track and trend these 4 specific ‘focus’ products within the Germany and US ‘high volume’ markets. This shows them that these products are becoming popular and that they should invest in some additional marketing around them.

How this works

Using commentary requires no update or reinstall. Simply navigate to your report in Power BI Service and create comments. This can be done on the visuals themselves after analysis has been done to retain the context.

Or on the report page in totality.

In my sales example here, I used a combination of the report page and specific contextual visual commentary in my discussion. The comments page will show all relevant comments, and selecting any one of them will play the report and its context back to the time the comment was made.


The new commentary capabilities are still object based, and not intimately linked to the data as it was, for example in Business Objects – where commentary is made and written back to the solution based on the actual intersection of data—for example, a Sales Value of Product X for 1st of January 2019, in Vancouver in Canada, by Mary Jackson. The difference, however, could be quite subtle as Power BI could allow for the comment on a visual that shows the Sales Value has been filtered to Product X for 1st of January 2019, in Vancouver in Canada, by Mary Jackson.

One of the main downsides of this object-based approach is that the commentary data itself remains inaccessible if you, for example, wanted to use it as raw contextual time-based data in its own right. Disclaimer: I say this data is inaccessible, as I am unaware of where it would be stored or accessed. Happy to be advised to the contrary.

The ability to play the report and visuals back to what it looked like when the comment was made is, however, a very nice feature—the reader can as it were, “step back in time” and see what happened when the comment was made. This seems to be the case even as more data is appended to the model (in this case) on an hourly basis.

There is no workflow attached to the commentary. Such workflows are quite common in financial reporting, where commentary and narrative undergo review and approval.

This feature is not available to public facing reports using the “Embed to Web” functionality. But if you’re interested in looking at the sample reports I used for this user story, they can be viewed and interacted with here.

Databricks: distilling Information from Data

In the first of this series of articles on Databricks, we looked at how Databricks works and the general benefits it brings to organisations ready to do more with their data assets. In this post we build upon this theme in the advanced analytics space. We will also walk through an interesting example, using biometric data generated by an Apple Watch, of how you might use Databricks to distill useful information from complex data.

But first, let’s consider four common reasons why the value of data is not fully realised in an organisation:

Separation between data analysis and data context.

Those who have deep data analytic skills (data engineers, statisticians, data scientists) are often in their own specialised area within a business. This area is separated from those who own and understand data assets. Such a separation is reasonable: most BAU data collection streams don’t have a constant demand for advanced analytical work, and often advanced analytical projects require data sourced from a variety of business functions. Unfortunately, success requires strong engagement between those that deeply understand the data and those that deeply understand the analysis. This sort of strong engagement is difficult to moderate in practice.

We’ve seen cases where because of the short-term criticality of BAU work or underappreciation of R&D work, business data owners are unable to appropriately contribute to a project, leaving advanced analytics team members to make do. We’ve seen cases where all issues requiring data owner clarification are expected to be resolved at the start, and continued issues are taken as a sign that the project is failing. We’ve seen cases where business data knowledge resides solely in the minds of a few experts.

Data analysis requires data context. It’s often said “garbage in, garbage out”, but it’s just as true to say “meaningless data in, meaningless insights out”. Databricks improves this picture by encouraging collaboration between data knowledge holders and data analysts, through its shared notebook-style platform.

Difficulty translating analytical work to production workloads.

Investigation and implementation are two different worlds. Investigation requires flexibility, testing different approaches, and putting “what” before “how”. Implementation requires standards, stability, security and integration into systems that have a wider purpose.

A good example of this difficulty is the (still somewhat) ongoing conflict between the use of Python 2 and Python 3. Python 3 has now almost entirely subsumed Python 2 in functionality, speed, consistency and support. However, due to legacy code, platforms and standards within organisations, there are still inducements to use Python 2, even if a problem is better addressed with Python 3. This same gap can also be found in individual Python modules and R packages. A similar gap can be found in organisational support for Power BI Desktop versions. A more profound gap can be seen if entirely different technologies are used by different areas.

This could either lead to substantial overhead for IT infrastructure sections or substantial barriers to adoption of valuable data science projects. PaaS providers offer to maintain the data analysis platform for organisations, enabling emerging algorithms and data analysis techniques to be utilised without additional infrastructure considerations. Additionally, Databricks supports Python, R, SQL and Scala, which cover the major non-proprietary data analysis languages.

Long advanced analysis iterations.

The two previous issues contribute to a third issue: good advanced analyses take time to yield useful results for the business. By the time a problem is scoped, understood, investigated, confronted and solved the problem may have changed shape or been patched by business rules and process changes enough that the full solution implementation is no longer worth it. Improving the communication between data knowledge holders and data analysts and shortening the distance between investigation and implementation mean that the time between problem and solution is shortened.

What this means for your organisation is that the business will begin to see more benefits of data science. As confidence and acceptance grow, so does the potential impact of data science. After all, more ambitious projects require more support from the business.

Data science accepted as a black box.

Data science is difficult, uncertain and broad. This has three implications. Firstly, a certain amount of unsatisfying results must be expected and accepted. Secondly, there is no single defensible pathway for addressing any given problem. Thirdly, no one person or group can understand every possible pathway for generating solutions. Unfortunately, these implications mean that data science practitioners can be in a precarious position justifying their work. Many decision makers can only judge data science by its immediate results, regardless of the unseen value of the work performed. Unseen value may be recognition of data quality issues or appreciation of better opportunities for data value generation.

We don’t believe in this black box view of data science. Data science can be complicated, but its principles and the justifications within a project should be understood by more than just nominal data scientists. This understanding gap is a problem for an organisation’s maturity in the data science space.

Over recent years, significant inroads have been made into this problem with the rise in usage of notebook-style reports. These reports contain blocks of explanatory text, executable code, code results and mathematical formulas. This mix of functions allows data scientists to better expose the narrative behind their investigation of data. Notable examples of this style are Jupyter Notebooks, R Markdown and Databricks.

Databricks enables collaboration, platform standardisation and process documentation within an advanced analytics project. Ultimately this means a decreased time between problem identification and solution implementation.

Databricks Example: Biometric Data

For demonstrating Databricks, we have an interesting, real data source: the biometrics collected by our watches and smartphones. You probably also have access to this kind of data; we encourage you to test it out for yourself. For Apple products it can be extracted as an XML file and mounted to the Databricks file system. Not sure how to do this? See our previous article.

Specifically, the data we have is from the watch and smartphone of our national manager for technology, Etienne. Our aim is to extract useful insights from this data. The process we will follow (discussed in the subsequent sections) is:

  1. Rationalise the recorded data into an appropriate data structure.
  2. Transform the data to be useful for its intended purpose.
  3. Visualise and understand relationships within the data.
  4. Model these relationships to describe the structure of the data.

Typically, advanced analytics in the business context should not proceed this way. There, a problem or opportunity should be identified first and the model should be in service of this. However here we have the flexibility to decide how we can use the data as we analyse it. This is a blessing and a curse (as we shall see).


The process of converting the XML data into a dataframe could be overlooked. It’s not terribly exciting. But it does demonstrate the simplicity of parallelisation when using Databricks. Databricks is built over Apache Spark, an engine designed for in-memory parallel data processing. The user doesn’t need to concern themselves with how work is parallelised*, just focus on what they need done. Work can be described using Scala, Python, R or SQL. In this case study we’ll be using Python, which interacts with Spark using the PySpark API.

Since we’ve previously mounted our XML biometrics summary, we can simply read it in as a text file. Note that there are ways to parse XML files, but to see what we’re working with a text file is a bit easier.

We’ve asked Spark (via sc, a representation of “Spark Context”) to create a Resilient Distributed Dataset (RDD) out of our biometrics text file export.xml. Think of RDDs as Spark’s standard data storage structure, allowing parallel operations across a cluster of machines. In our case our RDD contains 2.25 million lines from export.xml. But what do these lines look like?

A simple random sample of 10 lines shows that each biometric observation is stored within the attributes of a separate record tag in the XML. This means that extracting this into a tabular format can be quite straightforward. All we need to do is identify record tags and extract their attributes. However, we should probably check that all of our record tags are complete first.

We’ve imported re, a Python module for regular expression matching. Using this we can filter our RDD to find records that begin with “<Record” but are not terminated with “>”. Fortunately, it appears that this is not the case. We can also test for the case where there are multiple records in the same line, but we’ll skip this here. Next we just need to filter our RDD to Record tags.

In both of these regular expression checks, I haven’t had to consider how Spark is parallelising these operations. I haven’t had to think any differently from how I would solve this problem in standard Python. I want to check each record has a particular form, so I just import the module I would use normally, and apply it in the PySpark filter method.
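To illustrate the two checks, here is a minimal sketch in plain Python. The sample lines, pattern definitions and function names are our own assumptions for illustration, and one incomplete tag has been added deliberately so that the first check has something to find. The point is that the same predicate functions could be handed to an RDD's filter method without change.

```python
import re

# Hypothetical lines standing in for export.xml (the real file has millions).
lines = [
    '<HealthData locale="en_AU">',
    ' <Record type="StepCount" unit="count" value="12"/>',
    ' <Record type="HeartRate" unit="count/min" value="78"/>',
    ' <Record type="Height" unit="cm"',  # deliberately left unterminated
    '</HealthData>',
]

STARTS_RECORD = re.compile(r"^\s*<Record\b")
ENDS_TAG = re.compile(r">\s*$")


def is_incomplete_record(line):
    """True when a line opens a Record tag but is not terminated with '>'."""
    return bool(STARTS_RECORD.search(line)) and not ENDS_TAG.search(line)


def is_complete_record(line):
    """True when a line opens a Record tag and is terminated with '>'."""
    return bool(STARTS_RECORD.search(line)) and bool(ENDS_TAG.search(line))


# In plain Python we filter a list; with Spark the same functions
# could be passed as the predicate to the RDD's filter method.
incomplete = [line for line in lines if is_incomplete_record(line)]
records = [line for line in lines if is_complete_record(line)]
```

Because the predicates are ordinary Python functions, they can be tested on a handful of lines like this before being applied to the full dataset.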

*Okay, not entirely true. Just like in your favourite RDBMS, there are times when the operation of the query engine is important to understand. Also like your favourite RDBMS, you can get away with ignoring the engine most of the time.


We already have our records, but each record is represented as a string. We need to extract features: atomic attributes that can be used to compare similar aspects of different records. A record tag includes features as tag attributes. For example, a record may say unit=”cm”. Extracting the individual features from the record strings in our RDD using regular expressions is fairly straightforward. All we need to do is convert each record string into a dictionary (Python’s standard data structure for key-value pairs) with keys representing the feature names and values representing the feature values. I do this in one (long) line by mapping each record to an appropriate dictionary comprehension:
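As a rough sketch of this kind of mapping (not necessarily the exact expression used in the notebook), the snippet below turns one record string into a dictionary with a regular expression and a dictionary comprehension. The pattern, the function name and the sample record are assumptions for illustration, and the pattern assumes attribute values never contain an embedded double quote.

```python
import re

# Matches name="value" pairs inside a tag.
# Assumption: attribute values contain no embedded double quotes.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def record_to_dict(record):
    """Convert one Record tag string into a {feature name: feature value} dict."""
    return {name: value for name, value in ATTRIBUTE.findall(record)}


# Hypothetical record, for illustration only.
record = '<Record type="Height" sourceName="Watch" unit="cm" value="180"/>'
features = record_to_dict(record)
```

Applied through the RDD's map method, a function like this would produce one dictionary per record, ready to be turned into rows with named columns.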

This has converted our RDD into a dataframe – a table-like data structure, composed of columns of fixed datatypes. By and large, the dataframe is the fundamental data structure for data science investigations, inherited from statistical programming. Much of data science is about quantifying associations between features or predictor variables and variables of interest. Modelling such a relationship is typically done by comparing many examples of these variables, and rows of a dataframe are convenient places to store these examples.

The final call to the display function in the above code block is important. This is the default (and powerful) way to view and visualise your data in Databricks. We’ll come back to this later on.

So we have our raw data converted into a dataframe, but we still need to understand the data that actually comprises this table. Databricks is a great platform for this kind of work. It allows iterative, traceable investigations to be performed, shared and modified. This is perfect for understanding data – a process which must be done step-by-step and is often frustrating to document or interpret after the fact.

Firstly in our step-by-step process, all of our data are currently strings. Clearly this is not suitable for some items, but it’s easily fixed.
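The idea can be sketched in plain Python as follows. This is an illustration rather than the notebook's own code: the timestamp layout in DATE_FORMAT and the sample record are assumptions, so check them against your own export. On a Spark dataframe the same effect would typically be achieved with column casts.

```python
from datetime import datetime
from decimal import Decimal

# Assumed timestamp layout, e.g. "2018-11-01 08:15:00 +1030".
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"
DATE_FIELDS = ("creationDate", "startDate", "endDate")


def cast_record(raw):
    """Return a copy of a string-only record with dates and the value parsed."""
    typed = dict(raw)
    for field in DATE_FIELDS:
        typed[field] = datetime.strptime(raw[field], DATE_FORMAT)
    typed["value"] = Decimal(raw["value"])
    return typed


# Hypothetical record, for illustration only.
raw = {
    "creationDate": "2018-11-01 08:20:00 +1030",
    "startDate": "2018-11-01 08:15:00 +1030",
    "endDate": "2018-11-01 08:16:00 +1030",
    "sourceName": "Watch",
    "type": "StepCount",
    "unit": "count",
    "value": "12",
}
typed = cast_record(raw)
```

Once the dates are real timestamps, durations such as endDate minus startDate can be computed directly, which the later steps rely on.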

The printSchema method indicates that our dataframe now contains time stamps and decimal values where appropriate. This dataframe has, for each row:

  • creationDate: the time the record was written
  • startDate: the time the observation began
  • endDate: the time the observation ended
  • sourceName: the device with which the observation was made
  • type: the kind of biometric data observed
  • unit: the units in which the observation was measured
  • value: the biometric observation itself


So we have a structure for the data, but we haven’t really looked into the substance of the data yet. Questions that we should probably first ask are “what are the kinds of biometric data observed?”, and “how many observations do we have to work with?”. We can answer these with a quick summary. Below we find how many observations exist of each type, and between which dates they were recorded.
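A summary of this shape can be sketched in plain Python as below. The observations, type labels and function name are hypothetical. On a Spark dataframe the equivalent would usually be a group-by on the type column with count, minimum and maximum aggregates.

```python
from datetime import date


def summarise_by_type(observations):
    """Count observations per type and track the first and last date seen."""
    summary = {}
    for kind, day in observations:
        entry = summary.setdefault(kind, {"count": 0, "first": day, "last": day})
        entry["count"] += 1
        entry["first"] = min(entry["first"], day)
        entry["last"] = max(entry["last"], day)
    return summary


# Hypothetical (type, date) pairs, for illustration only.
observations = [
    ("StepCount", date(2015, 12, 3)),
    ("StepCount", date(2018, 11, 20)),
    ("HeartRate", date(2016, 1, 5)),
    ("StepCount", date(2017, 6, 1)),
    ("HeartRate", date(2017, 11, 2)),
]

summary = summarise_by_type(observations)

# Most frequently observed types first.
ranked = sorted(summary.items(), key=lambda item: item[1]["count"], reverse=True)
```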

We see that some of the measures of energy burned have the most observations:

  1. Active Energy Burned has over 650,000 observations between December 2015 and November 2018
  2. Basal Energy Burned has over 450,000 observations between July 2016 and November 2018
  3. Distance Walking/Running has over 200,000 observations between December 2015 and November 2018
  4. Step Count has about 140,000 observations between December 2015 and November 2018
  5. Heart Rate has about 40,000 observations between December 2015 and November 2017
  6. Other kinds of observations have fewer than 30,000 observations

This tells us that the richest insights are likely to be found by studying distance travelled, step count, heart rate and energy burned. We might prefer to consider observations that are measured (like step count) rather than derived (like energy burned), although it might be an interesting analysis in itself to try to find how these derivations are made.

Let’s begin by looking into how step count might relate to heart rate. Presumably, higher step rates should cause higher heart rates, so let’s see whether this is borne out in the data.

I’ve chosen to convert the data from a Spark dataframe to a Pandas dataframe to take advantage of some of the datetime manipulations available. This is an easy point of confusion for a starter in PySpark: Spark and Pandas dataframes share a name, but operate differently. Primarily, Spark dataframes are distributed, so they operate faster with larger datasets. On the other hand, Pandas dataframes are generally more flexible. In this case, we’ve restricted our analysis to a subset of our original data that’s small enough to handle confidently with a Pandas dataframe.

Actually looking at the data now, one problem appears: the data are not coherent. That is, the two kinds of observations are difficult to compare. This manifests in two ways:

  1. Heart rate is a point-in-time measurement, while step count is measured across a period of time. This is a similar incoherence to the one in economics surrounding stock and flow variables. To make the two variables comparable we can assume that the step rate is constant across the period of time the step count is measured. As long as the period of time is fairly short this assumption is probably quite reasonable.
  2. Heart rate and step count appear to be sampled independently. This means that comparing them is difficult because at times where heart rate is known, step count is not always known, and vice versa. In this case we can restrict our comparisons to observations of heart rate and step rate that are reasonably close in time.
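The two assumptions above can be sketched in plain Python. This is an illustration under stated assumptions rather than the reconciliation actually used: the 30-second tolerance, the use of the step period's midpoint and the sample values are all our own choices.

```python
from datetime import datetime, timedelta


def step_rate(start, end, steps):
    """Steps per second, assuming a constant rate across the observed period."""
    seconds = (end - start).total_seconds()
    return steps / seconds if seconds > 0 else None


def pair_observations(heart_rates, step_counts, tolerance=timedelta(seconds=30)):
    """Pair each heart rate reading with the nearest step period in time.

    heart_rates: list of (time, bpm)
    step_counts: list of (start, end, steps)
    Readings with no step period midpoint within the tolerance are dropped.
    """
    pairs = []
    for time, bpm in heart_rates:
        best = None
        for start, end, steps in step_counts:
            rate = step_rate(start, end, steps)
            if rate is None:
                continue
            midpoint = start + (end - start) / 2
            gap = abs(time - midpoint)
            if gap <= tolerance and (best is None or gap < best[0]):
                best = (gap, rate)
        if best is not None:
            pairs.append((bpm, best[1]))
    return pairs


# Hypothetical observations, for illustration only.
t0 = datetime(2018, 1, 1, 9, 0, 0)
step_counts = [
    (t0, t0 + timedelta(seconds=60), 120),
    (t0 + timedelta(minutes=10), t0 + timedelta(minutes=11), 60),
]
heart_rates = [
    (t0 + timedelta(seconds=20), 150),  # close to the first step period
    (t0 + timedelta(minutes=5), 80),    # not close to any step period
]

pairs = pair_observations(heart_rates, step_counts)
```

Tightening or loosening the tolerance trades the number of usable pairs against how well each pair really describes the same moment.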

Once we have some observations of heart rate and step rate, we can compare them:

On the vertical axis we have heart rate in beats per minute and on the horizontal axis we have pace in steps per second. Points are coloured so that older points are lighter, which allows us to see if there is an obvious change over time. The graph shows that Etienne’s usual heart rate is about 80 bpm, but when running it increases to between 120 and 180. It’s easy to notice an imbalance between usual heart rate observations and elevated heart rate observations – the former are much more prevalent.

There appears to be at least one clear outlier: the point where heart rate is under 40 bpm. There are also a small number of observations that have normal heart rate and elevated pace or vice versa. These may be artifacts of our imperfect reconciliation of step count and heart rate. We could feed this back to improve the reconciliation process or re-assess the assumptions we made, which would be particularly useful with subject matter expert input.

The graph above shows the observations of step rate over time, with black indicating observations that have elevated heart rates. There are a few interesting characteristics – most obviously, observations are far more dense after July 2016. Also, rather alarmingly, there are only a small number of clusters of observations with elevated heart rates, which means that we cannot treat observations as independent. This is often the case for time series data, and it complicates analysis.

We could instead compare the progression of heart rate changes with pace by looking at each cluster of elevated heart rate records as representative of a single exercise event. However, we would be left with very few events. Rather than attempt to clean up the data further, let’s pivot.

Transformation (Iteration 2)

Data don’t reveal all of their secrets immediately. Often this means our analyses need to be done in cycles. In our first cycle we’ve learned:

  1. Data have been collected more completely since mid-2016. Perhaps we should limit our analysis to only the most recent year. This means we should perhaps not attempt to identify long-term changes in the data.
  2. Heart rate and step rate are difficult to reconcile because they are often observed at different times. It would be better to focus on a single type of biometric.
  3. There are only a small number of reconcilable recorded periods of elevated heart rate and step rate. Our focus should be on observations where we have more examples to compare.

Instead of step count and heart rate, let’s look at patterns in distance travelled by day since 2017. This pivot answers each of the above issues: it is limited to more recent data, it focuses on a single type of biometric data, and it allows us to compare on a daily basis. Mercifully, distance travelled is also one of the most prevalent observations in our dataset.

You’d be right to say that this is a 180 degree pivot. We’re now looking at an entirely different direction. This is an artifact of our lack of a driving business problem, and it’s something you should prepare yourself for too if you commission the analysis of data for the sake of exploration. You may find interesting insights, or you may find problems. But without a guiding issue to address there’s a lot of uncertainty about where your analysis may go.

Stepping down from my soapbox, let’s transform our data. What I want to do is to record the distance travelled in every hourly period from 8am to 10pm since 2017. Into a dataframe “df_x”, I’ve placed all distance travelled data for 2017:

In the above we tackle this in three steps:

  1. Define a udf (user defined function) which returns the input number if positive or zero otherwise
  2. Use our udf to iteratively prorate distance travelled biometrics into the whole hour between 8am and 10pm that they fell into, naming these columns “hourTo9”, up to “hourTo22”.
  3. Aggregate all distances travelled into the day they occurred

This leaves us with rows representing individual calendar days and 14 new columns, each representing the distance travelled during one hour of the day.
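The three steps above can be sketched in plain Python rather than Spark. This is an illustration only, under stated assumptions: each record is taken to be a (start, end, distance) tuple falling within a single day, distance is spread evenly over the record's duration, and the names `positive_part` and `prorate_by_hour` are hypothetical. The column names follow the “hourTo9” to “hourTo22” convention described above.

```python
from collections import defaultdict
from datetime import datetime

def positive_part(x):
    """Step 1: return x if it is positive, otherwise zero."""
    return x if x > 0 else 0.0

def prorate_by_hour(records, first_hour=8, last_hour=22):
    """Steps 2 and 3: spread each record's distance over the hourly
    windows it overlaps, then sum the results per calendar day.

    records: iterable of (start, end, distance) with datetime start/end
    on the same day (an assumption of this sketch).
    """
    days = defaultdict(lambda: defaultdict(float))
    for start, end, distance in records:
        duration = (end - start).total_seconds()
        if duration <= 0:
            continue  # skip empty or malformed intervals
        midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
        s = (start - midnight).total_seconds()
        e = (end - midnight).total_seconds()
        for hour in range(first_hour, last_hour):
            win_start, win_end = hour * 3600, (hour + 1) * 3600
            # Overlap between the record and this hourly window, floored at zero.
            overlap = positive_part(min(e, win_end) - max(s, win_start))
            days[start.date()][f"hourTo{hour + 1}"] += distance * overlap / duration
    return {day: dict(cols) for day, cols in days.items()}

records = [
    # 1.2 km between 08:30 and 09:30: half falls in each of two hours.
    (datetime(2017, 3, 1, 8, 30), datetime(2017, 3, 1, 9, 30), 1.2),
    (datetime(2017, 3, 1, 9, 45), datetime(2017, 3, 1, 10, 0), 0.4),
]
daily = prorate_by_hour(records)
```

Flooring the overlap at zero is what lets a single expression handle records that miss a window entirely, which is presumably why a "positive or zero" helper is the first step.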

Visualisation (Iteration 2)

This section is not just an exploration of the data, but an exploration of Databricks’ display tool, which allows users to change the output from a code step without re-running the code step. Found at the bottom of every output generated by the display command is a small menu:

This allows us to view the data contained in the displayed table in a graphical form. For example, choosing “Scatter” gives us a scatterplot of the data, which we can refine using the “Plot Options” dialogue:

We can use these plot options to explore the relationship between the hourly distance travelled variables we’ve created. For example, given a selection of hours (8am to 9am, 11am to 12pm, 2pm to 3pm, 5pm to 6pm, and 8pm to 9pm), we observe the following relationships:

Notice that long distances travelled in one hour of a day make it less likely that long distances are travelled in other hours. Notice also that there is a fair skew in distances travelled, which is to be expected since the longest distances travelled can’t be balanced by negative distances travelled. We can make a log(1+x) transformation, which compresses large values to hopefully leave us with less skew:
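As a minimal sketch of the log(1+x) transformation, here it is applied to a handful of made-up hourly distances using Python's standard library. The values and the function name `log1p_transform` are illustrative assumptions, not taken from the dataset.

```python
import math

def log1p_transform(values):
    """Apply log(1 + x) to each value; zero stays zero, large values shrink."""
    return [math.log1p(v) for v in values]

# Hypothetical hourly distances (km), including one long outlier.
hourly_km = [0.0, 0.1, 0.5, 2.0, 12.0]
transformed = log1p_transform(hourly_km)
```

The transformation keeps the ordering of the values and maps zero to zero, so hours with no movement remain distinguishable while the long tail is pulled in.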

The features we have are in 14 dimensions, so it’s hard to visualise how they might all interact. Instead, let’s use a clustering algorithm to classify the kinds of days in our dataset. Maybe some days are very sedentary, maybe some days involve walking to work, maybe some days include a run – are we able to classify these days?

There are a lot of clustering algorithms at our disposal: hierarchical, nearest-neighbour, various model-based approaches, etc. These perform differently on different kinds of data. I expect that there are certain routines within days that are captured by the data with some random variation: a set jogging route that occurs at roughly the same time on days of exercise, a regular stroll at lunchtime, a fixed route to the local shops to pick up supplies after work. I think it’s reasonable to expect that on days where a particular routine is followed, we’ll see some approximately normal error around the average case for that routine. Because of this, we’ll look at using a Gaussian Mixture model to determine our clusters:

I’ve arbitrarily chosen to cluster into 4 clusters, but we could choose this more rigorously. 4 is enough to show differences between different routines, but not too many for the purpose of demonstration.
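To make the idea of a Gaussian mixture concrete, here is a deliberately small sketch: a one-dimensional, two-component mixture fitted by expectation-maximisation in plain Python. It is not the 14-dimensional model used in this analysis; the data are invented, and `fit_gmm_1d` is a hypothetical helper written only for illustration.

```python
import math

def fit_gmm_1d(xs, k=2, iters=50, var_floor=1e-3):
    """Fit a k-component (k >= 2) 1-D Gaussian mixture with EM."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    # Start the component means evenly spread across the data range.
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1.0  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor,
            )
            weights[j] = nj / n
    return weights, means, variances

# Hypothetical log(1 + km) values for one hour: five quiet days, five active days.
values = [0.05, 0.10, 0.12, 0.15, 0.20, 1.40, 1.50, 1.45, 1.60, 1.55]
weights, means, variances = fit_gmm_1d(values)
```

Each component ends up describing one "routine" with its own average and spread, which matches the assumption of approximately normal variation around a typical day.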

The graph above shows the 4 types of routine (labelled as “prediction” 0-3), and their relative frequency for each day of the week. Notably type 1 is much more prevalent on Saturday than other days – as is type 3 for Sunday. Type 2 is a much more typical routine for weekdays, appearing much less on weekends. This indicates that perhaps there is some detectable routine difference between different days of the week. Shocking? Not particularly. But it is affirming to see that the features we’ve derived from the data may capture some of these differences. Let’s look closer.

Above we have the actual profiles of the types of daily routines, hour-by-hour. Each routine has different peaks of activity:

  • Type 0 has sustained activity throughout the day, with a peak around lunchtime (12pm – 2pm).
  • Type 1 has sustained activity during the day with a local minimum around lunchtime, and less activity in the evening.
  • Type 2 has little activity during core business hours, and more activity in the morning (8am – 10am) and evening (5pm – 7pm).
  • Type 3 has a notable afternoon peak (3pm – 6pm) after a less active morning, with another smaller spike around lunchtime.

If you were doing a full analysis you would also be concerned about the variability within and between each of these routine types. This could indicate that more routines are required to describe the data, or that some of the smaller peaks are just attributable to random variation rather than actual characteristics of the routine.

Finally, the visualisation above shows the composition of the daily routines over the course of a year, labelled by week number. The main apparent change through the course of the year is for routine type 2, which is more frequent during cooler months. This concords with what we might suspect: less activity during business hours in cooler, wetter months.

Taken together, perhaps we can use the hourly distance features to predict whether a day is more likely a weekday or a weekend. This model might not seem that useful at first, but it could be interesting to see which weekdays are most like weekends – perhaps these correspond with public holidays or annual leave?


Let’s do a quick model to prove that weekends can be classified just with hourly movement data. There are a lot of possible ways to approach this, and a lot of decisions to make and justify. As a demonstrator here we’ll create a single model, but won’t refine it or delve too deeply into it.

Based on the types of routines identified in our cluster analysis, it’s fair to suspect that there may not be a monotonic relationship between the distance travelled in any particular hour and weekend/weekday membership. So rather than using the simplest classification model, logistic regression*, let’s fit a random forest classifier. First, we need to include a label for weekends and weekdays. I choose to call this “label” because by default this is the column name that PySpark’s machine learning module will expect for classification.
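A weekend/weekday label of this kind can be derived directly from the calendar date. The sketch below uses plain Python dictionaries instead of a Spark dataframe, and it assumes that weekends are the positive class (1.0); the text does not say which way round the label was encoded.

```python
from datetime import date

def weekend_label(day):
    """Return 1.0 for Saturday or Sunday and 0.0 otherwise."""
    # date.weekday() counts Monday as 0, so Saturday is 5 and Sunday is 6.
    return 1.0 if day.weekday() >= 5 else 0.0

# Hypothetical rows: 1 July 2017 was a Saturday, 3 July 2017 a Monday.
rows = [{"date": date(2017, 7, 1)}, {"date": date(2017, 7, 3)}]
for row in rows:
    row["label"] = weekend_label(row["date"])
```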

As usual to allow us to check for overfitting, let’s separate the data into a training set and a test set. In this case we have unbalanced classes, so some might want to ensure we’re training on equal numbers of both weekdays and weekends. However, if our training data has the same relative class sizes as the data our model will be generalised to and overall accuracy is important then there isn’t necessarily a problem with unbalanced classes in our training data.

Now let’s prepare for model training. We’ll try a range of different parametrisations of our model, with different numbers of trees, and different numbers of minimum instances per node. Cross-validation is used to identify the best model (where best is based on the BinaryClassificationEvaluator, which uses area under ROC curve by default).

Fitting the model is then simply a matter of applying the cross-validation to our training set:

Finally, we can evaluate how successful our model is:

So our model is reasonable on our test data, with an area under the test ROC curve of 0.86 and an overall accuracy of 0.82. This compares favourably to the accuracy of our null model, which would classify all observations as weekdays and have an accuracy of 0.71. There are many more possible avenues to investigate, even within the narrow path we’ve taken here. This is a curse of exploratory analysis.
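The null model’s accuracy of 0.71 is consistent with weekdays making up five of every seven days (5/7 ≈ 0.714). As a quick check, the sketch below computes the accuracy of an "always predict the most common class" model on a made-up label set; the helper name and the ten-week sample are assumptions for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a null model that always predicts the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Ten hypothetical weeks: five weekdays (0) and two weekend days (1) each.
labels = ([0] * 5 + [1] * 2) * 10
baseline = majority_baseline_accuracy(labels)  # 50 correct out of 70
```

Any classifier on unbalanced classes should be judged against this baseline rather than against 0.5.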

*To be fair, logistic regression can capture non-monotonicity as well, but this requires modifying features (perhaps adding polynomial functions of features).

Wrapping Up

Databricks gives us a flexible, collaborative and powerful platform for data science, both exploratory and directed. Here we’ve only managed to scratch the surface, but we have shown some of the features that it offers. We also hope we’ve shown some of the ways it addresses common problems businesses face bringing advanced analysis into their way-of-working. Databricks is proving to be an important tool for advanced data analysis.

Databricks: beyond the guff, business benefits and why businesses should care. Here’s a cheat-sheet to get you started

Search for info on Azure Databricks and you’ll likely hear it described along the lines of “a managed Apache Spark platform that brings together data science, data engineering, and data analysis on the Azure platform”. The finer nuances and, importantly, information about the business benefits of this platform can be trickier to come by. This is where our ‘cheat sheet’ comes in. This is the first of a series designed to assist you in deciphering this potentially complicated platform. Feel free to also read the second article in the series, on distilling information from data.

What is it?

Databricks is a managed platform in Azure for running Apache Spark. Apache Spark, for those wondering, is a distributed, general-purpose, cluster-computing framework. It provides in-memory data processing capabilities and development APIs that allow data workers to execute streaming, machine learning or SQL workloads—tasks requiring fast, iterative access to datasets.

There are three common data worker personas: the Data Scientist, the Data Engineer, and the Data Analyst. Through Databricks, they’re able to collaborate on big data projects and acquire, engineer and analyse data, wherever it exists, in parallel. The bigger picture is that they are therefore all able to contribute to a final solution which is then brought to production.

  • Databricks is not a single technology but rather a platform that can, thanks to all its moving parts, personas, languages, etc., appear quite daunting. With the aim of simplifying things, our cheat sheet starts with a high-level snapshot of the workloads performed on Databricks by our Data Scientist, Data Engineer and Data Analyst personas.
  • We’ll then look at some real business benefits and why we think businesses should be paying attention. Lastly, we’ll delve into two related workloads:
    • Data transformation, and
    • Queries for visual analysis.

Our subsequent cheat sheets will start to unpick the remaining workloads.

The image below shows a high-level snapshot of the workloads performed by our three data worker personas. The workloads in the coloured sections form (to varying degrees) the basis for the contents of our cheat sheet.

  • Data engineering forms, in our opinion, the largest of the cohort of workloads:
    • Data acquisition – i.e. how data is acquired for transformation, data analysis and data science using Databricks. This could potentially fall beyond the realms of Databricks due to the fact that data can be leveraged from wherever it exists (for example Azure Blob or Azure Data Lake stores, Amazon S3, etc.) and data may already be hosted in those stores as a result of some preceding ETL process. Databricks can of course also acquire data.
    • Data transformation – discussed later in this article, focussing on the ETL processes within Databricks (ETL within).
  • Data analysis takes on two flavours:
    • Queries – these could overlap heavily with the world of the Data Scientist, especially if the languages used are Python or R and if the intent is machine learning and predictive analytics. But Data Analysts could, of course, also perform queries for ‘on the fly’ data analysis.
    • Queries for visual analysis – queries are also performed to ready data for visual analysis. This is discussed later in this article; however, it must be noted that the lines between this kind of Queries and Data Transformation performed by the Data Engineer can become very blurred. This in itself proves the collaborative and parallel nature Databricks allows.
  • Data science has machine learning and associated algorithms, with predictive and explanatory analytics as the end goal. Here too, queries are performed, and the lines are similarly blurred with the queries performed by the Data Analyst and the Data Engineer.
  • Underpinning all of this are the workloads involved in moving the solutions to production states.

These workloads are logical groupings only aimed at clearing what could otherwise be muddy waters to the untrained eye. Queries may, for example, be performed, then used for transformations, data science and visual analysis.

So, without any further ado, let’s look at why businesses should be watching Databricks very closely!

Why Databricks? – Beyond the guff, business benefits and importantly, why businesses should care

If you search on Google for ‘Apache Spark’ you’ll find loads of buzzwords – “open-source”, “distributed”, “big data”, etc. On first glance, this can look like marketing babble and appear completely removed from a business’s actual data challenges. So let’s dispense with the buzzwords and focus on the business challenges.

Note also: although Apache Spark (and therefore Databricks too) is positioned in the big data camp, its application is not limited to big data workloads. So, if some of the challenges we list below apply to your data landscape (big data or not), read on.

Time to market

Challenge – data warehouses take too long to deliver business benefit

Benefit – Databricks is naturally geared towards agility via its ability to serve parallel collaboration, which, in turn, leads to improved responsiveness to change. This means that the time it takes to deliver data workloads is reduced

Parallel collaboration rather than seriality

Challenge – participants in the data solutions processes are too dependent on each other to complete their tasks before they can participate. These challenges are a result of serial workloads

Benefit – parallel collaboration delivers maximum agility. It means that the three main data personas, i.e. the Engineer, the Scientist and the Analyst can collaborate on delivering the data elements that will form part of a final data deliverable in parallel. As the Engineer acquires the data, the Analyst, the Scientist and indeed the Engineer start contributing to the logic that transforms and manipulates the data all in parallel. This, in turn, contributes to a reduction in time for solutions to get to market

Responsiveness and nimbleness

Challenge – companies change, requirements change, and business may not know exactly what they want or need from data that is stored in a variety of formats in different locations

Benefit – companies frequently generate thousands of data files, hosted in diverse formats including CSV, JSON, and XML from which analysts need to extract insights

The classic approach to querying this data is to load it into a central data warehouse. But this involves the design and development of databases and ETL. This works well but requires a great deal of upfront effort, and the data warehouse can only host data that fits the designed schema. This is costly, time-consuming and difficult to change.

With the data warehouse approach, insights can only be extracted after the data is transformed upon load.

Databricks presents a different approach and allows insights to be extracted and transformed upon query from vast amounts of data stored cheaply in its native format (such as XML, JSON, CSV, Parquet, and even relational database and live transactional data) in Blob Stores. With Databricks, data is read directly from the raw files, and by using SQL queries, data is cleansed, joined and aggregated – hence the term transform upon query.

Transforming the data each time a query runs means this approach is much more geared towards quick turnaround and becomes more responsive to change. BUT, it requires superior performance.
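The transform upon query idea can be illustrated without Spark at all. In the minimal sketch below, a raw CSV (held here as a string standing in for a file in cheap storage) is parsed, cleansed and aggregated every time the query function is called, and nothing is materialised in between. The data, column names and function name are all hypothetical.

```python
import csv
import io

# Stand-in for a raw CSV file sitting in its native format (hypothetical data).
RAW_CSV = """device,date,steps
watch-1,2018-06-01,1200
watch-1,2018-06-01,
watch-1,2018-06-02,900
watch-1,2018-06-02,300
"""

def total_steps_by_date(raw_text):
    """Parse, cleanse and aggregate the raw text each time it is called."""
    totals = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        if not row["steps"]:
            continue  # cleanse: drop rows with a missing step count
        totals[row["date"]] = totals.get(row["date"], 0) + int(row["steps"])
    return totals

result = total_steps_by_date(RAW_CSV)
```

Changing the cleansing rule here is a one-line edit with no reload of a warehouse, which is the responsiveness benefit; the cost is that the work is repeated on every call, which is why performance matters.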


Challenge – workloads (such as queries) serving analytics and data science, are run often and transform the data each time the query runs (transform upon query). Logic dictates that this will not perform as well as data transformed upon load once and the transformed data materialised for reuse.

Benefit – Databricks provides a performant environment that handles the transform upon query paradigm. This is done by utilising a variety of mechanisms, such as:

  • Databricks includes a Spark engine that is faster and performs better through various optimisations at the I/O layer and processing layer:
    • For example, Spark clusters are configured to support many concurrent queries and can be scaled to handle increased demand.
  • It includes high-speed connectors to Azure storage (i.e. Azure Blob and Azure Data Lake stores)
  • It uses the latest generation of Azure hardware (Dv3 VMs), with NVMe SSDs capable of even faster I/O performance.

A managed big data (or in our opinion, all data) platform

Challenge – The data landscape is becoming increasingly complex and fragmented and costly to maintain.

Benefit – “Databricks is a managed platform (in Azure) for running Apache Spark – that means that you neither have to learn complex cluster management concepts nor perform tedious maintenance tasks to take advantage of Spark. Databricks also provides a host of features to help its users to be more productive with Spark. It’s a point and click platform for those that prefer a user interface, such as data scientists or data analysts.” –

Not just Azure Blob Storage – access data where it lives

Challenge – Data is not necessarily stored in Azure Blobs

Benefit – Databricks connections are not limited to Azure Blob or Azure Data Lake stores, but extend to Amazon S3 and other data stores such as Postgres, Hive, MySQL, Azure SQL Database, Azure Event Hubs, etc. via JDBC (Java Database Connectivity). So, you can immediately start to benefit from the cost, flexibility and performance benefits offered by Databricks for your existing data.

Cost of the cluster

Challenge – Big data solutions tend to cost a lot of money

Benefit – The Databricks File System (DBFS) is a layer over your data (where it lives) that allows you to mount the data, making it available to other users in your workspace and persisting the data after a cluster is shut down. Data is not synced, but mounted, which means you do not double pay for storage.

When a Databricks cluster is shut down (which is also done automatically at an interval you specify when not in use), it stops costing you money, so you only pay for what you use.

Furthermore, Azure Databricks leverages the economies of scale provided by Azure. Analysis workloads (Interactive workloads for analysing data collaboratively with notebooks) on a Premium F4 instance (4 virtual CPUs and 8 GB RAM) running 24 x 7 will, for example, only cost you $380 pm. And Data Engineering workloads (Automated workloads for running fast and robust jobs via API or UI) for the same tier will, for example, only cost you $307 pm.

*Note that the pricing above is in AUD and is an estimate only as per the Azure Pricing Calculator.
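Because a stopped cluster stops costing money, the 24 x 7 figures above are an upper bound. As a rough, hedged illustration, the sketch below prorates them for a cluster that only runs during office hours. It assumes cost scales linearly with running hours and that a month has about 730 hours (8,760 / 12); storage and any other charges are ignored, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8760 / 12)

def prorated_monthly_cost(full_time_cost, hours_used, hours_per_month=HOURS_PER_MONTH):
    """Scale a 24 x 7 monthly estimate down to the hours actually used."""
    return full_time_cost * hours_used / hours_per_month

office_hours = 8 * 22  # eight hours a day, 22 working days (illustrative)
analysis_cost = prorated_monthly_cost(380, office_hours)     # about AUD 92
engineering_cost = prorated_monthly_cost(307, office_hours)  # about AUD 74
```

These are back-of-envelope numbers derived from the estimates quoted above, not quotes from the Azure Pricing Calculator.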

Australian region

Challenge – some big data solutions, such as the first generation of Azure Data Lake, are not available in the Australian region as at the date of first publication of this article

Benefit – Databricks can be provisioned in the following Australian regions:

  • Australia Central
  • Australia Central 2
  • Australia East
  • Australia South East

Like everything, there are some downsides/realities to consider

SQL, R, Python, Scala – can be daunting

SQL has become the “lingua franca” for most Data Engineers and Data Analysts, whereas the same applies to R and Python for Data Scientists. These personas collaborate on Databricks using notebooks as interfaces to the data, which allows them to create runnable code, visualisations and narrative.

Suddenly these personas gain visibility over the code from other personas in the same notebook, and as notebooks can consist of multiple languages, this can seem quite daunting to personas unfamiliar with languages they have not previously used, especially considering that the languages used in Databricks, i.e. R, Python, Scala and SQL, each have their peculiarities.

Obviously this is only an issue if you are unfamiliar with such an environment. For those with good coverage of SQL, R, Python and Scala, this is a benefit as they can work with multiple languages in the same Databricks notebook easily, i.e. personas can use their preferred language of choice irrespective of the choice of other personas. All that needs to be done is to prepend the cell with the appropriate magic command, such as %python, %r, %sql, etc.

From another viewpoint however, this diversity of languages can be a strength for the right business environment: the workflow naturally dissipates technical debt and encourages capability sharing.

Learning curve

There will often be a requirement for personas to become more familiar with a broader set of languages and the notebook environment to make following what is happening in the total notebook easier. This will make for easier collaboration and is in line with a move from pure serial to more parallel workloads.

Case study – Data Transformation and Visual Analysis

The use case described in this section is used as a vehicle for a more technical deep dive into the workloads shown in the coloured sections of the Databricks Workflow image above (i.e. Data Transformations and ETL within Databricks, and Queries for visual analysis).

Our use case – IoT and wearable devices, such as Apple Watches, are currently under a substantial spotlight as there is a lot of interest as to what can be gleaned from the data they produce (see our article June’s story as an example). In our use case, Apple Watch data is brought into Azure, from where the datasets are mounted to Databricks; ETL processes then transform and load the data, and finally queries are performed.

An Apple Watch is used to generate data we will use in this user story. An app on the watch integrates with Azure and streams some data into Azure Blob Storage (this app and stream are not within the scope of this article as Data Acquisition will be discussed in a subsequent article).

The data manifests itself as CSV files in Azure Blob store > Container:

Data Engineering > Data Transformation > ETL within

This section assumes that data is already available in an appropriate store for mounting (in this case Azure Blob store). We notionally call the next steps “ETL within Databricks” as they represent a logical ETL that will extract and validate the data, apply a schema, then load the data ready for use (for example, for analytical querying). ETL within Databricks should not be confused with ETL to get data into Azure in the first place (which will be discussed in a subsequent article).

ETL within Databricks is conceptually the same as the ETL concepts we know from conventional BI workloads, in that you first extract the data, then transform it, and then load it, but it is done in a much nimbler fashion and it adheres to the notion of the transformation of data upon query, rather than upon load.

The common steps associated with our two workloads, i.e. ETL within and queries to ready the data for visual analysis, are shown visually in the image below:

Remember that Spark is the engine used by Databricks, and SQL/ Scala/ Python/ R/ Java use that engine to perform the various workload tasks.

In the sections below, we will first mount our Apple Watch data (this is the extract step). We will then transform the data and load it into a table using SQL (the amber route shown above), or create a data frame and load it as a Parquet file (the green route above). Later we will deal with analysis of the loaded data, readying it for, for example, visual analysis. For now let’s focus on the ETL.

The queries shown in each step below are examples of what could be done and should give the reader a starting point from where to build more complicated ETL within Databricks and subsequent queries. Databricks is a massively flexible platform, so the sample queries may be made much more complex or approached in an entirely different way.


In the first step we mount the data held in our Azure Blob store to the Databricks File System (DBFS). This represents the “Mounted Stores in DBFS” step in the image above (we are not focussing on the JDBC step in this use case).

We first generated a SAS URL for the Azure Blob store to use as a variable, then used it in the query.
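As a hedged sketch of what such a mount can look like in a Python notebook cell, the helper below builds the arguments for `dbutils.fs.mount` from a storage account, container and SAS token. The account, container and mount names are placeholders, `build_mount_args` is a hypothetical helper, and the sketch assumes a SAS token rather than a full SAS URL. The `dbutils` call itself is left as a comment because `dbutils` only exists inside a Databricks notebook.

```python
def build_mount_args(storage_account, container, sas_token, mount_name):
    """Build the arguments for mounting a Blob container with a SAS token."""
    host = f"{container}@{storage_account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{host}",
        "mount_point": f"/mnt/{mount_name}",
        "extra_configs": {
            f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": sas_token
        },
    }

# Placeholder names; substitute your own account, container and token.
args = build_mount_args("examplestorage", "applewatch", "<sas-token>", "applewatch")

# Inside a Databricks notebook, where `dbutils` is predefined, the mount and
# a quick listing of its contents would then look like:
#   dbutils.fs.mount(**args)
#   display(dbutils.fs.ls(args["mount_point"]))
```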

Mounting means creating a pointer to the store, which means that the data never actually syncs. The mount point is simply a path representing where the Blob Storage container or a folder inside the container is mounted in DBFS.

Optional – We may quickly validate the mount by running the following query to see the contents of the mount point.

Optional – Lastly, we validate the data by looking at the content of any one of the files within our mount point.


As per the Transform steps, there are two options: a SQL path (shown in Amber) and a Scala/ Python/ R/ Java path (shown in Green). The reader can jump to the Scala/ Python/ R/ Java path if wanting to bypass the SQL sections, which to many may seem a bit familiar.

Transform and Load using SQL (Option A)

We use SQL to create a table in DBFS which will “host the data” via metadata, then infer the schema from the files in our Azure Blob store container. Note that the schema can be explicit rather than inferred. In our use case all our files have the same structure and the schema can therefore be inferred. But in cases where structures differ, standardisation queries will precede this step.

It is worth noting that in Databricks a table is a collection of structured data. Tables in Databricks are equivalent to Data Frames in Apache Spark.

Optional – We can now perform all manner of familiar SQL queries. It is also worth noting that data can be visualised on the fly using the options in the bottom left corner. In the first example, we review the data we had just loaded, in the second we do a simple record count.

Transform and Load using Scala (Option B)

Tables are familiar to any conventional database operator. Let’s now extend this concept to include Data Frames. A Data Frame is essentially the core Transformation layer in this alternative ETL path – it is a dataset organised into named columns. It is conceptually equivalent to a table in a relational database but with richer optimisations under the hood. Data Frame code follows a “” pattern.

In the next query, we read the data from the mount and infer the headers (we know that all our files have the same format, so no preceding column standardisation is required). We select only the columns of value to us and then rename them as a subsequent step, because loading the data to Parquet prevents us from using “restricted characters” such as “(” and “,” in column names.
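The column renaming step can be sketched independently of Spark. The helper below replaces the characters Spark rejects in Parquet column names (space, comma, semicolon, braces, parentheses, newline, tab and equals sign) with underscores. The example column names are hypothetical, and `parquet_safe` is an illustrative helper rather than the code used in this article.

```python
import re

# Characters Spark rejects in Parquet column names: space , ; { } ( ) newline tab =
_RESTRICTED = re.compile(r"[ ,;{}()\n\t=]+")

def parquet_safe(name):
    """Replace runs of restricted characters with a single underscore."""
    return _RESTRICTED.sub("_", name.strip()).strip("_")

# Hypothetical headers as they might appear in an exported CSV.
columns = ["Start", "Heart Rate (count/min)", "Distance (km), walking"]
safe_columns = [parquet_safe(c) for c in columns]
# safe_columns == ["Start", "Heart_Rate_count/min", "Distance_km_walking"]
```

In a Spark data frame the cleaned names would typically be applied with a rename per column before the write.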

We lastly load the data into a Parquet file in DBFS. Whilst blob stores like AWS S3 and Azure Blob are the data storage options of choice for Databricks, Parquet is the storage format of choice. Parquet files use a highly efficient, column-oriented data format that shows massive performance increases over other options such as CSV. For example, Parquet compresses data repeated in a given column and preserves the schema from a write.

Queries for Visual Analysis

Once we have Extracted, Transformed and Loaded the data we can now perform any manner of query-based analysis. We can for example query the Parquet file directly, or we can create a table from the Parquet file and then query that, or we can bake the final query into the Table create.

Let’s first query the Parquet directly:

Now let’s create a table from its Metadata which can then be used by BI tools such as Power BI.

In the final query, we query the table and prepare the data for visual analysis in something like Power BI. We select the maximum number of steps recorded for our Apple Watch wearer on each day (we only loaded two days’ worth of data).
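The shape of such a query can be tried out locally. The sketch below uses Python’s built-in SQLite as a stand-in for Spark SQL, with a hypothetical `watch_steps` table, made-up rows and assumed column names; the `GROUP BY` with `MAX` is the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watch_steps (recorded_at TEXT, steps INTEGER)")
conn.executemany(
    "INSERT INTO watch_steps VALUES (?, ?)",
    [
        ("2018-06-01 08:15:00", 120),
        ("2018-06-01 12:40:00", 450),
        ("2018-06-02 09:05:00", 300),
        ("2018-06-02 17:30:00", 80),
    ],
)
# One row per day, holding the largest step count recorded on that day.
max_steps_by_day = conn.execute(
    """
    SELECT date(recorded_at) AS day, MAX(steps) AS max_steps
    FROM watch_steps
    GROUP BY date(recorded_at)
    ORDER BY day
    """
).fetchall()
conn.close()
```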

In subsequent articles we will introduce many of the other workloads associated with Databricks, building on the concepts we used in this article.

Author: Etienne Oosthuysen; Contributor: Rajesh Kotian

Young, female and paving the way for technology in South Australia

We recently joined forces with St. Peter’s Collegiate Girls’ School, facilitating an 8-week data and analytics project for Year 11 students.  This exercise provided the girls with real-world skills development but most importantly, opened their eyes to what a career in IT can look like, dispelling some of the misconceptions in the process.

View the article here: St Peters Girls DA Project

Exposé team members working with the St Peter’s Girls were Andrew Exley, Etienne Oosthuysen, Kelly Drewett, Trevene Leonard

Gov Hack 2018 – Our Winning Emergency Response Exposed

This year we got a team together and entered the 2018 Gov Hack competition.  Over the course of 46 hours, we built a solution that brings together fragmented datasets, some of which are listed below, in an Emergency Response solution, adhering to the spirit of Gov Hack by showing the power of Open Data.

The team consisted of Andrew Exley, Cameron Wells, Etienne Oosthuysen, Jake Deed and Jean-Noel Seneque.

See a short summary of our journey and a condensed version of our video submission here:

The solution contains:

  • The architecture and data platform that allows the datasets to be ingested in a periodic and in a real time manner, stored, and blended to serve a variety of emergency related user stories. 
  • A user interface that can be accessed from anywhere (PC or mobile phone) and allows for real time tracking of emergency events vis-a-vis points of interest (such as your home), the nearest point of safety, rolling social media coverage of the event, other points of interest to assist emergency services respond (such as bodies of water for water bomb runs, traffic and congestion, helipads or airports, etc.)
  • A platform by which data can be analysed for trends by analysts working for the emergency services.

Some examples of the datasets:

  • G-NAF (Geocoded National Address File) which is one of the most ubiquitous and powerful spatial datasets. It contains a full geo-spatial description of each address (including the state, suburb, street, number and coordinate reference (or “geocode”) for all street addresses in Australia). This forms the basis of location of people or places, and the distance of people to places, such as your home to a point of safety during an emergency.
  • Twitter and sentiment, especially during emergency events. This helps determine sentiment during an event, such as the inherent urgency during an emergency.
  • Dams (Angus Catchment) by the Department for Environment and Water in South Australia. The dataset contains polygon data outlining the physical extent of dams and estimated dam capacity (volume range in megalitres). This forms the basis of water bomb runs in the case of a bushfire emergency.
  • Statistical Area Level 1 (SA1) by the Australian Bureau of Statistics. This is used in combination with incidents and statistical population to estimate people affected or likely to be affected by an incident.
  • Country Fire Service of South Australia live incident feed. This forms the basis of identifying when emergencies occur.
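Geocodes such as those in G-NAF make distance calculations straightforward. One common approach, shown here as a hedged sketch and not necessarily what our solution used, is the haversine great-circle formula. The coordinates below are made up, and `haversine_km` is an illustrative helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

home = (-34.9285, 138.6007)        # hypothetical geocode of a home
safe_point = (-34.9212, 138.5995)  # hypothetical point of safety
distance_km = haversine_km(*home, *safe_point)
```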

Artificial Intelligence in Aged Care – June’s story

Meet June; long time Adelaidean, keen gardener and grandmother of twelve!  At 86 years ‘young’, June moved from her own home into a local aged care facility following a series of falls that saw her hospitalised over the summer.  June was diagnosed with Parkinson’s disease 18 months ago and following an increasing number of falls, June and her family made the decision to move her into residential care.

As symptoms of Parkinson’s disease progress at different rates for different people, getting June’s treatment plan right has been tricky, complicated by the fact that like many aged care residents, she requires several different medications to manage her health.  June and her carers have noticed that her tremors appear to be triggered by stress or emotional experiences and lessen when she is relaxed.  It also appears that regular exercise and engagement in leisure activities aid in keeping June’s tremors at bay.  As tremors often lead to lack of balance, which is likely to result in a fall, June’s care team have put together a robust healthcare plan which includes regular activity and time spent outdoors on top of her medication and occupational therapy.

The aged care facility where June lives recently embarked upon an initiative with the goal of improving the overall response to incidents such as falls, ensuring that responses are timely and that any incidents are attended to by the correct staff.  CCTV cameras have been installed in the corridors on the higher dependency floors, such as the one June lives on.  The CCTV is used to track residents’ movements via location tracking as well as emotions via facial recognition.  Residents of these sections have also been given smart devices to wear that track real-time data such as number of steps taken, standing vs walking rate and heart rate.

When dealing with personal data, it is of paramount importance to ensure its security.  Additional precautionary measures will be taken to ensure the security of June’s personal data so that it will be accessed for authorised purposes only.  Steps need to be taken so that June’s personal data is not shared or used for any commercial gain, for example, as a way to categorise June, possibly affecting her insurance premiums based on her risk as a patient.

Given what we know about the impact of stress on the incidence of tremors, the data from the CCTV, coupled with the data from June’s smart device, will trigger an alert to the team lead in charge of her zone should the variables indicate an increased likelihood of stress.  The team lead is then able to ensure not only that there are sufficient carers positioned in high-risk zones, but also that they are equipped to deal with a possible fall. Furthermore, the wearable device shows the care team when June is outside and how much sunlight (linked to positive mental health) she is getting.  The data also enables the team to see links between steps and heart rate.  If, for example, steps are going down while heart rate is increasing, this could be a sign of a potential health issue, and the appropriate medical intervention can then happen proactively.
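To make the steps-versus-heart-rate idea a little more concrete, here is a minimal sketch in Python. It is purely illustrative and not the facility's actual logic: the function names, the use of a simple least-squares slope over daily readings, and the rule of "steps trending down while heart rate trends up" are all our own assumptions. A real solution would need clinically validated thresholds.

```python
from statistics import mean
from typing import Sequence


def trend(values: Sequence[float]) -> float:
    """Least-squares slope of the readings against their position (change per reading)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def flag_for_review(daily_steps: Sequence[float], daily_heart_rate: Sequence[float]) -> bool:
    """Hypothetical rule: flag a resident when steps trend down and heart rate trends up."""
    return trend(daily_steps) < 0 and trend(daily_heart_rate) > 0


if __name__ == "__main__":
    # Made-up readings for four consecutive days.
    print(flag_for_review([3000, 2800, 2500, 2100], [68, 70, 73, 75]))
```

A flag like this would only prompt a carer or clinician to take a closer look; it does not diagnose anything on its own.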

This scenario illustrates a proactive solution that benefits June and other residents in terms of the level of care they receive, not only through better response to incidents but in helping to prevent incidents happening in the first place.  At an organisational level, management also get insights that assist them in planning and resourcing more effectively as well as the ongoing process improvements brought about by machine learning.

Stay tuned for a follow-up instalment as we explore the technical aspects of the business case!

Author: Sophia Siegele; Contributor: Shishir Sarfare

Artificial Intelligence and Occupational Health and Safety – AI an enabler or a threat

We increasingly hear statements like, “machines are smarter than us” and “they will take over our jobs”. The fact of the matter is that computers can simply compute faster and more accurately than humans can. So, in the short video below, we instead focus on how machines can be used to help us do our jobs better, rather than viewing AI as an imminent threat. It shows how AI can assist in better occupational health and safety in the hospitality industry. It does, however, apply to many use cases across many industries, and positions AI as an enabler. Also see an extended description of the solution after the video demo.

Image and video recognition – a new dimension of data analytics

With the introduction of video, image and video streaming analytics, the realm of advanced data analytics and artificial intelligence just stepped up a notch.

All the big players are currently competing to provide the best and most powerful versions: Microsoft with Azure Cognitive Services APIs, Amazon with AWS Rekognition, Google with Cloud Video Intelligence and IBM with Intelligent Video Analytics.

Not only can we analyse textual or numerical data historically or in real time, we’re now able to extend this to use cases involving videos and images. Currently, there are APIs available to carry out these conceptual tasks:

  • Face Detection

o   Identify a person from a repository / collection of faces

o   Celebrity recognition

  • Facial Analysis

o   Identify emotion, age, and other demographics within individual faces

  • Object, Scene and Activity Detection

o   Return objects the algorithm has identified within specific frames, e.g. cars, hats, animals

o   Return location settings, e.g. kitchen, beach, mountain

o   Return activities from video frames, e.g. riding, cycling, swimming

  • Tracking

o   Track movement/path of people within a video

  • Unsafe Content Detection

o   Auto-moderate inappropriate content, e.g. adult-only content

  • Text Detection

o   Recognise text from images

The business benefits

Thanks to cloud computing, this complex and resource demanding functionality can be used with relative ease by businesses.  Instead of having to develop complex systems and processes to accomplish such tasks, a business can now leverage the intelligence and immense processing power of cloud products, freeing them up to focus on how best to apply the output.

In a nutshell, vendors offering video and image services provide users with APIs that interact with the cloud hosts they maintain in several locations around the globe. All the user needs to do, therefore, is provide the input and manage the responses returned by the calls made through those APIs. The exposé team currently have the skills and capability to ‘plug and play’ with these APIs, with many use cases already outlined.

Potential use cases

As capable as these functions already are, improvements are happening all the time.  While the potential scope is staggering, the following cases are based on what is currently available. There are potentially many, many more – the sky really is the limit.

Cardless, pinless entry using facial recognition only

A camera captures a person’s face, and the image is passed to the facial recognition APIs.  The response that comes back can then be used to either open the entry or leave it shut. Not only does this improve security by preventing the use of someone else’s card or PIN, but if someone were to follow another person through the entry, security can be alerted immediately. Additional cameras can be placed throughout the secure location to ensure that only authorised people are within the specified area.
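As a rough illustration of the open-or-shut decision, the sketch below takes the matches returned for a captured face and decides whether to open the entry. Everything in it is an assumption made for the example: the `(person_id, similarity)` shape of a match, the 95% similarity threshold and the function names are hypothetical and do not come from any particular vendor's API.

```python
from typing import Optional, Sequence, Set, Tuple

# A match is a hypothetical (person_id, similarity_percent) pair.
Match = Tuple[str, float]


def best_match(matches: Sequence[Match]) -> Optional[Match]:
    """Return the match with the highest similarity, or None if no face was matched."""
    return max(matches, key=lambda m: m[1], default=None)


def should_open_entry(
    matches: Sequence[Match],
    authorised_ids: Set[str],
    min_similarity: float = 95.0,  # illustrative threshold only
) -> bool:
    """Open only when the strongest match is confident enough and is an authorised person."""
    match = best_match(matches)
    if match is None:
        return False
    person_id, similarity = match
    return similarity >= min_similarity and person_id in authorised_ids


if __name__ == "__main__":
    print(should_open_entry([("emp-001", 99.1), ("emp-007", 62.0)], {"emp-001"}))
```

Keeping the entry shut whenever there is any doubt (no match, a weak match or an unknown person) is the conservative default for a security use case.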

Our own test drive use case

As an extension of the cardless, pinless entry use case above, additional APIs can be used not only to determine whether a person is authorised to enter a secure area, but also to check whether they are wearing the correct safety equipment. The value this brings to various occupational health and safety functions is evident.

We have tested the following scenario ourselves, using a selection of APIs to provide the alert. The video above shows a chef whom the API recognises using face detection.  Another API is then used to determine whether he is wearing the required headwear (a chef’s hat). As soon as the chef is seen in the kitchen without the appropriate attire, an alert is sent to his manager to report the incident.

Technical jargon

To provide some understanding of how this scenario plays out architecturally, here is the conceptual architecture used in the solution showcased in the video.

Architecture Pre-requisite:

·        Face Repository / Collection

Images of the faces of people in the organisation. The vendor’s solution maps facial features, e.g. the distance between the eyes, and stores this information against a specific face. This is required by the subsequent video analytics, which needs to be able to recognise a face from various angles, distances and scenes. Associated with each face is other metadata such as name, the date range for permission to be on site, and even extra information such as work hours.
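The metadata held against each face could be modelled along the following lines. This is only a sketch under our own assumptions: the `FaceRecord` class, its field names and the two helper functions are hypothetical, and the facial features themselves are assumed to stay inside the vendor's collection, referenced here by an identifier.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass(frozen=True)
class FaceRecord:
    """Metadata stored alongside a face in the repository (illustrative field names)."""
    face_id: str                       # identifier of the face in the vendor's collection
    name: str
    permitted_from: date               # start of the permitted date range on site
    permitted_to: date                 # end of the permitted date range on site
    work_start: Optional[time] = None  # optional extra information: work hours
    work_end: Optional[time] = None


def is_permitted_on_site(record: FaceRecord, on_date: date) -> bool:
    """True when the date falls inside the person's permitted date range (inclusive)."""
    return record.permitted_from <= on_date <= record.permitted_to


def is_within_work_hours(record: FaceRecord, at: time) -> bool:
    """True when no work hours are recorded, or the time falls inside them.

    Shifts that run past midnight are not handled in this sketch.
    """
    if record.work_start is None or record.work_end is None:
        return True
    return record.work_start <= at <= record.work_end


if __name__ == "__main__":
    chef = FaceRecord("face-123", "Example Chef", date(2024, 1, 1), date(2024, 12, 31),
                      time(6, 0), time(14, 0))
    print(is_permitted_on_site(chef, date(2024, 6, 1)), is_within_work_hours(chef, time(9, 30)))
```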

Architecture of the AI Process:

·        Video or Images storage

Store the video to be processed in the vendor’s cloud storage, so it is accessible to the APIs that will subsequently be used to analyse the video/image.

·        Face Detection and Recognition API’s

Run the video/images through the Face Detection and Recognition API to determine where a face is detected and if a particular face is matched from the Face Repository / Collection.  This will return the timestamp and bounding box of the identified faces as output.

·        Frame splitting

Use the face detection output and a third-party video library to extract the relevant frames from the video, ready to be sent to additional APIs for further analysis.  For each frame’s timestamp, create a subset of images by cropping to the bounding box of each detected face; there may be one or more faces in a frame.  Each crop is expanded to encompass both the face and the area above the head, ready for the next step.

·        Object Detection API’s

Run object detection over the subset of images extracted from the frame.  In our scenario, we’re looking to detect whether or not the person is wearing the required kitchen attire (a chef’s hat).  We can use this output in combination with the person detected to send an appropriate alert.

·        Messaging Service

Once it has been detected that a person is not wearing the appropriate attire within the kitchen, an alert can be triggered and sent to management or other persons via e-mail, SMS or other mediums. In our video, the alert is received via SMS on the manager’s phone.
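The frame splitting, object detection and messaging steps above can be sketched roughly as follows. This is not the code behind the demo, just a minimal illustration under stated assumptions: bounding boxes are taken to be fractions of the frame with the origin at the top left, the expansion ratios, the "Chef Hat" label and the 80% confidence threshold are made up, and `send` stands in for whichever e-mail or SMS service is used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box as fractions of the frame (0.0 to 1.0), origin at the top left."""
    left: float
    top: float
    width: float
    height: float


def expand_for_headwear(face: Box, above: float = 1.0, side: float = 0.25) -> Box:
    """Grow a face box upward and sideways so it also covers the area above the head.

    `above` and `side` are illustrative ratios of the face height and width;
    the result is clamped to the edges of the frame.
    """
    left = max(0.0, face.left - side * face.width)
    top = max(0.0, face.top - above * face.height)
    right = min(1.0, face.left + face.width + side * face.width)
    bottom = min(1.0, face.top + face.height)
    return Box(left, top, right - left, bottom - top)


def to_pixels(box: Box, frame_width: int, frame_height: int) -> Tuple[int, int, int, int]:
    """Convert a fractional box to (left, top, right, bottom) pixels for cropping a frame."""
    return (
        round(box.left * frame_width),
        round(box.top * frame_height),
        round((box.left + box.width) * frame_width),
        round((box.top + box.height) * frame_height),
    )


def missing_attire_alert(
    person: str,
    labels: Iterable[Tuple[str, float]],  # hypothetical (label, confidence %) pairs
    required_label: str = "Chef Hat",
    min_confidence: float = 80.0,
) -> Optional[str]:
    """Return an alert message when the required label is absent or too uncertain."""
    wearing = any(
        name.lower() == required_label.lower() and confidence >= min_confidence
        for name, confidence in labels
    )
    if wearing:
        return None
    return f"{person} was seen in the kitchen without the required attire ({required_label})."


def notify(message: Optional[str], send: Callable[[str], None]) -> bool:
    """Pass the alert to a sender (e-mail, SMS, ...) if there is one; report whether it was sent."""
    if message is None:
        return False
    send(message)
    return True


if __name__ == "__main__":
    crop = expand_for_headwear(Box(0.4, 0.3, 0.2, 0.2))
    print(to_pixels(crop, 1920, 1080))
    notify(missing_attire_alert("Example Chef", [("Apron", 91.0)]), print)
```

Passing the sender in as a function keeps the decision logic independent of the messaging service, so the same check could alert by SMS in one deployment and by e-mail in another.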

Below we have highlighted the components of the Architecture in a diagram:


These are just a couple of examples of how we can interact with such powerful functionality; all available in the cloud. It really does open the door to a plethora of different ways we can interact with videos and images and automate responses. Moreover, it’s an illustration of how we can analyse what is occurring in our data, extracted from a new medium – which adds an exciting new dynamic!

Video and image analytics opens up immense possibilities to not only further analyse but to automate tasks within your organisation. Leveraging this capability, the exposé team can apply our experience to your organisation, enabling you to harness some of the most advanced cloud services being produced by the big vendors. As we mentioned earlier, this is a space that will only continue to evolve and improve with more possibilities in the near future.

Do not hesitate to call us to see how we may be able to help.


Contributors to this solution and blog entry:

Jake Deed

Cameron Wells

Etienne Oosthuysen

Chris Antonello