Defining Analytics

Data analytics used to be a buzzword. It no longer is. It is now a standard business requirement. Over the past decade we have gone from asking “Does your business need data analytics?” to “How can data analytics transform your business?” Simply put, analytics has become essential to the growth and development of the modern organisation. Today, businesses of all sizes can employ analytics to gather insights on the health of their processes, and to identify opportunities for improvement and expansion. But before we discuss how analytics can add value to a business, let’s define the term “data analytics”.

A different definition

The word “analytics” has been in regular use for a few decades now; however, it can be traced back many centuries to the Greek “ἀναλυτικά” (Latin “analyticvs”), which means “the science of analysis”. In this age, we define analytics simply as “the method of logical analysis”.

Today, the word most commonly appears in the term “data analytics”, which refers to the process, or set of processes, for collecting, storing, processing, and interpreting data. This is the most common definition, and it has evolved over the years with the technological capabilities of the time. It is, in my opinion, quite reductive. I would prefer to expand that definition…

Data analytics is a framework for organisations to acquire, store, process, and interpret data, for the purpose of analysing incidents, gathering insights, identifying patterns, predicting events, and reporting relevant business metrics for further action.

A framework?

You read that right. There are a number of definitions for “data analytics” available on the internet. Most will define data analytics as a process, and they are all right. These definitions have mostly come from industry leaders – organisations and people – who have invested heavily and worked in the field of data science for decades, so there is no denying their “correctness”. However, I believe they reduce the modern analytics process to simply “analysing” data.

I like to think of data analytics as a framework over which businesses can logically build their analytical capability, by methodically developing systems for addressing each step of the data lifecycle, from data generation and acquisition through to data archiving.

Why redefine what isn’t wrong?

When we think of data analytics as a process, we are more likely to focus on the “analysis” portion of analytics. Data analytics, however, is a lot more than the analysis of data. It involves many steps, or rather sub-processes, leading up to analysis, and a few more following it. Each of these sub-processes requires careful consideration and implementation, and can affect the quality and outcome of every other step in the overall analytics workflow.

Not only that, thinking of data analytics as a process leads many to believe that analytics is a linear exercise: that it starts with gathering data and ends with visualising the results of analysis. This is quite far from real-world scenarios. In most projects, data moves from one sub-process to another, often branching here and converging there, returning to a prior sub-process, or skipping a few sub-processes altogether. How data moves through a data analytics system will depend on a number of business rules and requirements.

Data analytics is a framework for organisations to acquire, store, process, and interpret data, for the purpose of analysing incidents, gathering insights, identifying patterns, predicting events, and reporting relevant business metrics for further action.

Looking at data analytics as a framework encourages us to acknowledge the various sub-processes within the analytics workflow, and how they interact with each other. It also allows us to take into account how each of these sub-processes may share resources with different business systems outside of the workflow, and lets technical and business teams plan how existing resources not currently used for analytics can be integrated into the workflow.

So rather than picturing data analytics as a straight number line, you could think of it as a two-dimensional plane that overlaps with other two-dimensional planes in certain areas, and intersects with them in others.

Parts to a Whole

Moving beyond definitions and arguments, let’s look at the different sub-processes that make data analytics possible. While real-world applications and implementations of data analytics can look vastly different, they almost always include the following components:

  1. Data Acquisition & Source Integration
  2. Data Storage & Warehousing
  3. Data Cleaning & Preliminary Analysis
  4. Feature Engineering
  5. Statistical Analysis & Machine Learning
  6. Data Visualisation & Reporting
  7. Data Archiving

All of these sub-processes contribute towards the efficiency of your analytics workflow, which is why a data analytics project must include specialists with an understanding of the whole. Decisions made at each stage can either derail or simplify the end result.

Data Acquisition & Source Integration

No matter the size of your business, there are always operations that generate relevant data. There will also be operations that access and process data originating outside of your business. The goal of data analytics is to make sense of this data, understand how your business is performing, and finally improve business operations. A data analytics workflow must accommodate systems that can record data generated across different teams and activities. This includes data generated by the analytics workflow itself. An analytics workflow must also include tools that can integrate data sources which are external to your business.

Data Storage & Warehousing

Once your business has identified valid sources of data, and found ways to access these sources, it needs to find ways to store this data in a manner that is manageable and effective over the long term. A whole lot of the data generated by business processes is “raw” in nature. This data needs to be organised so that it can be easily and repeatedly accessed and processed. It needs to become meaningful. A business’ data storage strategy can affect its analytics capability. Poorly developed storage systems can slow down the analytics workflow and unnecessarily increase operational expenditure. An efficient data storage strategy will take into account the volume of data generated and the velocity at which the data is generated. It will also manage data backups for disaster management and recovery.

Within the purview of data analytics there is also data warehousing – data storage that is specifically maintained for analytics. More on this in future posts.

Data Cleaning & Preliminary Analysis

While the data storage process brings order and structure to the data ingested from multiple sources, that data is not always ready for the purpose of analytics. Data, especially textual data, can often have issues such as spelling mistakes, special characters, whitespace, etc. which can undermine the results of analytics. Data ingestion can also face errors related to date formats, rounding of numerical values, truncation of text strings, etc. All of these need to be handled either when the data is being ingested into storage, or right after it has been ingested. At this stage, it is also important to run some preliminary analysis on the data. This helps us understand how the data is distributed (particularly when the end goal involves statistical analysis), how much of the data is redundant, missing, or simply irrelevant, and whether more cleaning is required.
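As a minimal illustration of these cleaning and profiling steps, here is a hedged Python sketch using pandas; the file name and the customer_name, order_date and amount columns are hypothetical and purely for example.

import pandas as pd

# Hypothetical raw extract; column names are for illustration only.
df = pd.read_csv("raw_orders.csv")

# Trim whitespace and strip stray special characters from text fields
df["customer_name"] = (
    df["customer_name"]
    .str.strip()
    .str.replace(r"[^\w\s\-']", "", regex=True)
)

# Normalise inconsistent date formats into a single datetime type
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Preliminary analysis: distribution, missing values and duplicates
print(df["amount"].describe())   # how the values are distributed
print(df.isna().mean())          # share of missing values per column
print(df.duplicated().sum())     # redundant rows that may need removing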

It is quite common for these first three sub-processes – data acquisition & source integration, data storage & warehousing, and data cleaning & preliminary analysis – to be implemented as one unified process within the analytics workflow.

The following three – feature engineering, statistical analysis & machine learning, and data visualisation & reporting – constitute the “analysis” portion of a data analytics workflow.

Feature Engineering

In many cases data that has been cleaned is good enough for statistical analysis, reporting, and visualisation; however, it is not always optimal. Businesses might want to extrapolate more information from the cleaned data, so that this information makes it easier to filter reports or explain trends and historical events. Basically, existing data can be manipulated to create new “features” and “attributes” that have more value than the cleaned data from which they were created. In fact, data analytics and machine learning systems rely heavily on feature engineering processes. Efficient feature engineering can greatly enrich the quality of your business’ reports and forecasts.
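Continuing the same hypothetical example, a short pandas sketch of deriving new features from cleaned data might look like this (file and column names are again illustrative only):

import pandas as pd

orders = pd.read_parquet("cleaned_orders.parquet")

# Derive attributes that are easier to filter and report on
orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5

# Bucket a continuous value into business-friendly bands
orders["order_size"] = pd.cut(
    orders["amount"],
    bins=[0, 100, 1000, float("inf")],
    labels=["small", "medium", "large"],
)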

Statistical Analysis & Machine Learning

Analytical systems will frequently employ statistical methods for analysing data and identifying features that are relevant to the business questions that need answering. These methods also help determine which features in the data have strong predictive power and should be used in machine learning algorithms. A good portion of data analytics is oriented towards forecasting how business processes will perform and what preemptive steps are required to ensure business continuity. For this reason data analytics teams often include at least one data scientist and one machine learning engineer who can form and test business hypotheses using statistical methods.
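For instance, a data scientist might test a simple hypothesis over the engineered features. The sketch below, using the same hypothetical columns plus a made-up units_sold field, runs a Welch t-test and a quick correlation check to gauge which features might carry predictive power.

import pandas as pd
from scipy import stats

orders = pd.read_parquet("featured_orders.parquet")

# Hypothesis: weekend orders differ in value from weekday orders
weekend = orders.loc[orders["is_weekend"], "amount"]
weekday = orders.loc[~orders["is_weekend"], "amount"]
t_stat, p_value = stats.ttest_ind(weekend, weekday, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Features with stronger correlations to the target are candidates for a
# forecasting model
orders["is_weekend_flag"] = orders["is_weekend"].astype(int)
print(orders[["amount", "is_weekend_flag", "units_sold"]].corr())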

Data Visualisation & Reporting

The results of analysis (variances, correlations, scores, etc.) can be quite difficult to read. Business stakeholders with the power to make decisions are not always acquainted with the technical knowledge required to make sense of statistical analysis. To accommodate such an audience, it is necessary that the results are presented as reports (verbose and visual) that drive the message home. An analytics workflow will typically include at least one dashboard or report, depending on the business requirement. The development of dashboards and reporting solutions can be quite cumbersome, and is impacted by the underlying data storage (warehouse architecture, data models, etc.), so great care must be taken to ensure that your storage solution is not overcomplicated.

Data Archiving

A much-overlooked process within data analytics projects, data archiving – not to be confused with data backup – can quicken the response times of your analytics solutions. By archiving data that is no longer in frequent use, considered “offline”, we take it out of operational use, leaving behind aggregations and snapshots that do not need to be re-processed along with newer operational data. Reduced processing and archived storage often translate to reduced operational expense. Decisions about data archiving need to be made early in the analytics workflow development, and must be covered alongside the storage solution discussed earlier.
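A hedged sketch of the idea in pandas, with hypothetical file names and a made-up 13-month cutoff: rows older than the cutoff are reduced to a monthly snapshot, and only recent rows remain in the operational file.

import pandas as pd

orders = pd.read_parquet("orders.parquet")
cutoff = pd.Timestamp.today() - pd.DateOffset(months=13)

old = orders[orders["order_date"] < cutoff]
recent = orders[orders["order_date"] >= cutoff]

# Keep a monthly aggregate snapshot of the archived period instead of raw rows
snapshot = (
    old.groupby(old["order_date"].dt.to_period("M"))["amount"]
    .agg(["count", "sum"])
    .reset_index()
)

# Paths are placeholders; the archive location would sit in cheaper storage
snapshot.to_parquet("archive/orders_monthly_snapshot.parquet")
recent.to_parquet("orders.parquet")  # operational data keeps only recent rows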

Not so simple

As mentioned earlier, there is the ever-lingering temptation to regard data analytics as a simple end-to-end pipeline. It seldom works so simply. Depending on the business requirements, data analytics workflows can be quite complex. There are processes and considerations we have not covered in the preceding sections of this post, but the ones mentioned are the typical building blocks of an analytics workflow. When developing an analytics solution, teams also have to consider aspects such as data security and the data compliance standards of the local governments where servers are hosted.

Keeping all of this in mind, here is how I’d visualise a basic analytics workflow.

[Image: a basic data analytics workflow]

Don’t let that image frighten you. It’s only meant to give you an idea of how data flows in a data analytics system. An efficient analytics workflow automates data movement from one subprocess to another, and minimises human intervention, allowing analysts and business stakeholders to focus their attention on high-priority issues. It also helps individuals and teams make better decisions by shifting their focus from scouring crude data to analysing reports produced by the system.

Get in touch

At Exposé we pride ourselves in developing bespoke data analytics solutions that cater specifically to your business’ requirements. Whether you are looking to integrate reporting tools, automate your data workflows, or modernise your existing tech infrastructure by moving to the cloud, we can provide you the service you require. Get in touch with us for a confidential initial discussion.

Azure Purview, does it fill the data governance blind spot for Microsoft? (Part 2)

Welcome back to Part 2 of this blog on Azure Purview. Continuing from Part 1, in this instalment I go through a detailed review and the highlights of Purview.

Azure Purview – classifications and sensitivity:

  • Classifications – using Purview’s own 100+ prebuilt classifications, or your own (BYO) classifications – are used to mark and identify data of a specific type found within your data estate. We could easily see where all the credit card and Australian Tax File numbers were located across the data lake.
  • Sensitivity labels define how sensitive certain data is, and they are applied when one or more classifications and conditions are found together. We could clearly find the data with a sensitivity of ‘secret’ – an immediate application of this could be to support the IRAP data classifications as defined by the Australian Signals Directorate, and PCI in the financial sector.

When the scan completes, all data meeting classification rules will be discoverable, whereas Purview’s sensitivity labelling needs a couple of hours to reflect the new assets and auto-label sensitivity, after which time it too will be discoverable. It is also possible to view insight reports for classifications and sensitivity labels.

Azure Purview classification and sensitivity label examples

Azure Purview – business glossary

We could easily employ a glossary to overlay business-friendly terms over the metadata, so that the physical vocabulary is translated into a standard vocabulary the business can understand. Remember, data is the business’ asset more so than ICT’s, so a business vocabulary is important.

Purview has, as I previously mentioned, more than 100 system classifiers that it uses during scans to classify data automatically, and it can also apply your own BYO classifications to data assets and schemas. But it was easy to override these with the business glossary, and anything a human overrode was never replaced by subsequent automated scans. We, for example, overrode CC# with Credit Card Number.

It is also in the glossary where data stewards and owners are set, two core elements of effective data governance.

Terms can also be subject to workflow and approvals so that the glossary does not become a free-for-all.

Azure Purview – show lineage at various levels

We could see a bird’s-eye view of the data estate, including very impressive lineage at the asset, column and process levels, as well as the full data map:

  • At the data asset/ entity level, i.e., where the entity/ asset came from.
Azure Purview Entity level lineage
  • At the column level, i.e., where the attribute came from.
Azure Purview column level lineage
  • At the process level, i.e., how data was moved/ orchestrated.
Azure Purview process level lineage

Azure Purview – gain insights

It is also easy to see insights across many of these concepts – across the data assets, glossary, scans, classifications, sensitivity, and file extensions.


The images below show only some insight examples for the glossary, for classification and for sensitivity:

Azure Purview Glossary insights
Azure Purview Classification Insights
Azure Purview Sensitivity Insights

Conclusion – is Azure Purview a worthy data governance tool?

Does Purview hit the mark? As I said before, there were, as at the date of authoring this, some kinks Microsoft needed to sort out, and my correspondence with their product team suggests they are working on them. Looking at it from a pure data cataloguing perspective, it ticks many boxes, and at a very compelling price point.

But data governance is broader than just cataloguing, and even though Purview crosses the boundary into some aspects that would not normally sit within a data catalog (which is a good thing), other areas still require attention, notably master data, data quality and data security. BUT we all know this is just the first module, so watch this space!

Originally published by Etienne Oosthuysen at https://www.makingmeaning.info/post/azure-purview-does-it-fill-the-data-governance-blind-spot-for-microsoft

Azure Purview, does it fill the data governance blind spot for Microsoft? (Part 1)

Data enablement – can Azure Purview help

For several years now, I have been evangelising about the need for data enablement in organisations. In a nutshell, this means getting data to consumers (be it final consumers of reports, or consumers who go on to create data solutions such as data engineers, data scientists and business focussed data modellers) faster, and accepting that data transformations, data modelling and new data solutions (models, reports, dashboards, etc.) will occur as part of a highly agile, rapid development process and by multiple data workers, some of whom are external to ICT.

Technology has now reached that point in the maturity curve where this enablement will accelerate and become the norm, replacing old-school, linear data workloads performed by central BI/ICT teams only. Core to these technologies are data lakes for storage at scale (the ‘lake’ of your ‘lake house’), Synapse or Databricks for on-demand transformations and the virtual data warehouse layer (the ‘house’ of your ‘lake house’), Power BI for business models and visualisations (the ‘front door’ to the whole thing), and of course resources to move data into, and around, the ecosystem.

BUT all this enablement now demands a massive rethink of governance – both in terms of methodology as well as technology. Long and laborious, theory-heavy data governance methodologies simply won’t keep up with the rapid growth of the internal data market and the many workers across the organisation who take part in data-related activities. An alternative, much more pragmatic methodology is required, and it must be supported by technology that possesses two crucial things: (1) Artificial Intelligence to simplify and accelerate the process of data cataloguing and classification, and (2) crowd sourcing so that users across the business can quickly add to the collective knowledge of the data assets of the business. And it is in the technology space where Azure had a massive blind spot.

Introducing Azure Purview.

The word ‘purview’ simply means ‘range of vision’, and when it comes to data, the greater the range of this vision and the clearer the objects you see, the better. Will Purview live up to this definition of its more generic namesake, and will it cover the blind spot I previously mentioned?

The current generally available version is the first of multiple planned modules for Purview, i.e., the Data Catalog module. This first module supports AI based cataloguing and classification of data across your data estate, curation, and data insights. Users will in addition be able to use and maintain business glossaries, expressions to classify data based on patterns beyond the out-of-the-box classifications (let’s call these bring your own (BYO) expressions to cover additional patterns), provide visibility over ownership and custodianship, show lineage, etc.

This will have immediate benefit to anyone seeking pragmatic data governance, as it provides a heap of knowledge about the data in your data estate via 100+ out-of-the-box scanning rules – something that would otherwise have required resource-intensive and error-prone human activity – plus it enables a data worker to augment or override the AI scanning in a crowd-sourcing ecosystem, or to BYO scanning rules.


In the recent road test, we dumped a whole load of data into Azure Data Lake and set Purview scanning loose over it to do its AI and built-in classifier magic. The results looked pretty good and go a long way towards fulfilling that requirement I mentioned before for pragmatic and accelerated data governance.

In the next blog, I will go through a detailed review and highlights of Azure Purview. Stay tuned!

Reasons to avoid gradient backgrounds on Power BI reports

Gradient backgrounds are all the rage in dashboard design. They appear to be floating around everywhere.

Looks cool right?

Perhaps. I am not a fan, and for some time have found it hard to articulate to others why this is the case. My verbal utterances were never much more than “it makes me go cross-eyed and headache-y” or “it makes my stomach churn”. Not convincing enough reasons to ask the report designer to pause and reflect.

In contemplating the matter more deeply I set about putting my thoughts to paper.

Gradients are great in design because they add interest, texture, warmth and in some cases depth to an otherwise flat subject. Given that our outside world is not flat, monotone or uniformly illuminated, it’s easy to see why we might be drawn to a visual with such cues. According to colour psychology, various gradient colours and mixes will elicit various emotional and visceral responses, although the exact response desired in our audience is hard to target when there are cultural and individual differences in colour associations and preferences.

So why is it that I have such an averse response to gradient backgrounds on a dashboard? Why do I feel headache-y and even nauseous?

I doubt I’ll ever truly know the answer, but I can experiment and postulate.

My Experiment

In thinking of how gradients might impact visual perception, I am reminded of this infamous optical illusion from my undergraduate psychology classes.

https://en.wikipedia.org/wiki/Checker_shadow_illusion

Squares A and B are the same shade, however, square B is depicted in shadow tricking us into thinking it is several shades lighter. Here, our brains are trying to make sense of luminance and colour constancy.

Imagine then, what is happening with our perceptive systems when trying to discern foreground colours against a background of varying shades.

Let’s take a look…

Here I have a bar graph recycled from a previous blog post. You’ll note that as the bars move further away from each other the contrast appears to change.

Not so easy to tell apart anymore.

Now let’s put a gradient background on the page.

Not only does the gradient here affect contrast and saturation, making one block less visible than the other, but the surrounding colour variation has my brain questioning whether the blocks are in fact the same colour. One looks a slightly bluer hue than the other.

“Ah, Kerry” you say, “that’s only because it’s green on green!”

No worries, let’s change to blue.

Hmm… not much to see here except that the text has become harder to read on the left side of the report. Pain to correct, so let’s try another background.

Eewww. The imagery this colour mixture conjures up is gag-worthy.

Quick, let’s try and find some relief from that pink by lightening it up.

Oh goodness my eyes! Now I’m getting that eye-strain induced headache feeling again…

What’s happening here?

Again, I can only speculate and describe my experiences. What I do know is that everything on the page is competing for my attention. The boxes have broken up the gradient, which before showed a gradual change in hue and brightness. The boxes have now made this change more abrupt. Scanning across the page, I feel my pupils or other parts of my perceptive system are rapidly trying to adjust to the variations in contrast and brightness causing strain. The more luminous colours are triggering a sharp pain at the back of my head and are lingering even after closing my eyes.

I am drawn to the richer pink and purple colours to the left of the report. When reading left to right, these colours stay clearly in my peripheral vision vying for my attention, my eyes keep wanting to snap back. I’m therefore using a considerable amount of concentration trying to block out distraction and read the text on the screen.

The boxes also confuse, which I believe might be due to depth perception. Things that are hazy, naturally, seem further away. Gradations in brightness and saturation are also visual cues for light, shadow and distance. Due to the transparent nature of the boxes, when scanning the page left to right, up then down, my brain can’t decide whether the boxes are in the background or foreground and how far to the background or foreground they should be… as the box keeps moving on me in terms of depth perception, I am starting to feel a sensation similar to motion sickness…

Section 1 feels like the box should be sitting further into the background; Section 2 feels like the box should be further into the foreground

Back to my experiment and I decide to change from bar graphs to line graphs.

Wait, what’s that?!

Hmmm….

Question: How many shades of green are there in the below image?

Answer: Three.

I thought I’d pause at this stage to do some research. From playing around with charts on various background colours I was able to manipulate my brain into seeing different hues for the same object just by dragging it around on the screen. There must be some scientific study into this and a name for the phenomenon.

I couldn’t find a fancy-schmancy Latin name, but I did find some research on colour saturation and colour adaptation theory. I also came across Josef Albers’ book Interaction of Color, and blog posts (here and here) from which I derived the below images. These images illustrate how the same colour can be perceived to be different, and how two colours can be perceived to be the same, by merely adjusting context.

My learnings

So, at the end of my experimentation and research, what did I learn?

  1. High-energy, luminant colours irritate and overstimulate our senses, particularly for those with photosensitivity. Blue light is considered the most irritating.
  2. Rapid changes in light frequency and certain high-contrast visual patterns can trigger migraines.
  3. We are visually drawn to brighter, more saturated colours above the rest.
  4. Gradations in brightness and saturation are also visual cues for light, shadow and distance.
  5. When a coloured object is placed against a coloured background, the background may change the object’s hue, saturation and brightness.

Each of these things in turn can be designed for on a plain Power BI background. However, when using gradients (particularly multi-coloured gradients), it becomes a challenge not only for the report designer, who must try to create consistent contrast across the canvas, but for the report consumer too. We use variations in colour as visual cues to understand where we are in our natural environment. “Layering” charts on these gradients (without respect to the underlying gradient) confuses these cues. The consumer then expends a considerable amount of energy trying to eliminate background noise and create constancy before they can even begin to interpret meaning, and thus, as in my case, can quickly become overwhelmed.

When it comes to building dashboards, it’s important to remember objectives. A dashboard is intended to facilitate rapid assessment of key metrics at a glance. Gradients need to aid this rapid assessment, else they are merely decoration or distraction. Whilst decoration is important on dashboards as a means to engage users, unless it is intended to market or sell a product, it should be kept to a minimum. Any decoration that may influence a consumer’s visceral or emotional response when assessing metrics and making decisions should be very carefully considered. For business users, a neutral and minimalistic design is key.

Well, there you have it. A visual exploration of why I dislike gradient backgrounds on dashboards. Excuse me whilst I go rest my eyes for a while.

7 Favourite Power BI Summit Sessions from the Purple Peeps

The Power BI Summit was hosted by RADACAD with some 1000+ attendees and 100+ speakers – a mix of the Power BI Product team, Microsoft MVPs and Professionals.


The sessions were targeted at beginner to intermediate level users, with some more advanced topics available. By far the biggest value came from the Q&A sessions, hearing from other developers and consultants about their lessons learned from Power BI projects within various organisations.

The Power BI Summit is one of the many events that data analytics professionals around the world look forward to each year. With over 100 sessions offered in this year’s conference, check out which ones were our consultants’ favourites:

Contributors: Jean-Noel Seneque, Emma Girvan and Kerry Kolosko

Power BI Dataflows? Why you need to implement it!

By Jean-Noel

The reason why we have Power BI Dataflows

  • Dataflows provide an abstraction layer that handles all the logic, complexities and query steps
  • Dataflows is a Power Query process that runs in the cloud
  • Self-Service Data Preparation for non-technical user personas

Power BI Dataflows is appealing to power users compared to Power BI Desktop because of the added features that have not been released in the desktop yet.

Visual Query Folding Indicators – essentially, Power Query sends the transformations back to the data source to do the heavy lifting. This helps present a better view of the data to be consumed, with less work on the Power BI side.

Visual Query Folding

Diagram View provides an overview of the lineage of the data inside the dataflow. You also get a visual overview of the transformations that are occurring, and have the ability to extend the dataflow by adding more data, spreadsheets and flat files. You are not connecting to a live dataset.

Data Refresh?

  • Works the same as a dataset
  • We can trigger it manually or set a schedule
  • It does an upstream refresh where it will refresh the dataflows it depends on
  • Incremental refresh is available to Premium customers
  • The only issue right now is getting error information; the only place to see this is in the refresh history, which is not really intuitive

How does the Query Reusability work?

  • You can copy and paste the query from Power BI Desktop but not Power BI Dataflows
  • Linked/Computed Entities (available to Premium/PPU only) provide the ability to utilise the in-lake compute, which helps transform the data in the lake
  • A reference points to the existing entity and uses its data and logic, but you can build on top of it
  • You can link to other entities in other dataflows
  • It is always Import mode
  • Security is the responsibility of the dataset designer

Building great data experiences in Power BI – Tips from the Product Team

By Jean-Noel

Dissecting what makes great data experiences and the 8 essential steps to achieve that goal.

Goals of great data experiences

  • Transfer of information – insights you have found and that the users have interpreted correctly
  • Not just answering the question being asked, but understanding why they asked that question in the first place
  • Good leaders will always keep asking questions and have follow-up questions, and a great data experience will let them answer those questions themselves
  • Think of the decisions being taken as a result of that information being transferred
  • What are the users’ motivations for making those decisions, and how are they connected to the business processes that they are in charge of?

Essential #1 – who are your end-users and what do they need from these reports?

Essential #2 – design your reports with the 4 W’s in mind: Who? Where? When? What? The number one tip is to stay focussed and minimise any clutter and distractions. Check out these layouts and get inspired: https://powerbi.tips/layouts

Essential #3 – Right chart for the right purpose

Essential #4 – Allow report interactivity and exploration

Essential #5 – Translate from insights to actions

Essential #6 – This is a form of communication, make it easy to read and understand

Essential #7 – Make the reports available for the on-the-go users; anywhere, anytime

Essential #8 – Performance and optimisation

Designing for High Performance

By Emma

Here are some common steps to improve data model performance:

  • Avoid using high precision/cardinality columns as it creates a level of detail that impacts the ability for the technology to compress the data effectively (i.e., Date/time fields with seconds/milliseconds, numeric values with many decimal places)
    • Consider separating out the data into separate columns if it is required and use Power Query to join the data back together as required.
    • If you do not need to have an infinite number of decimal places in the source data, create a step to limit the number to only what is required (a quick sketch of both ideas follows after this list).
  • Consider using composite models instead of pure Direct Query data imports if some of your data does not need to be near real time.
    • A possible solution could be to bring in Dimension table data with import mode and Fact table data with Direct Query mode.
  • Turn off the auto date/time feature in the Global Options and create custom date tables instead. Power BI creates an internal date table behind the scenes for every date/time field in the dataset. This option has a default setting of ON so it is a good idea to turn this off.
  • Disable meaningless interactions between visuals on the page by turning off interactions using the Edit Interactions option.
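For readers more comfortable outside Power Query, here is a rough pandas equivalent of the cardinality and precision tips above; event_ts and price are hypothetical columns and the sketch is illustrative only.

import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["event_ts"])

# Split a high-precision timestamp into lower-cardinality columns
events["event_date"] = events["event_ts"].dt.date
events["event_time"] = events["event_ts"].dt.floor("min").dt.time

# Round away decimal places that the reports do not need
events["price"] = events["price"].round(2)

# Drop the original high-cardinality column once it is no longer required
events = events.drop(columns=["event_ts"])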

Bringing Visualisation Grammar to Power BI

By Kerry

This session involved a walk-through of a preview feature visual which, like Charticulator, allows Power BI report developers to build their own custom visuals in Power BI.

  • Unlike Charticulator, the visual is grammar-based (rather than click-and-drop), leveraging the Vega languages with far more extensive capabilities.
  • It requires light coding but no programming experience, making it accessible to most.
  • The visual runs in Desktop as you type, with no external dependencies like the R and Python extensions. The possibilities for this visual are exciting, as it allows fine-grained customisations on interactivity, faceting, and variable parameters. With this visual, developers can plot distributions in ways that are not currently available in native Desktop or AppSource visualisations, in a fraction of the time it would have taken to develop via other means.

Combining Multiple Data Sets in Power Query

By Emma

  • Did you know there is a hidden trick (and best practice) for importing multiple worksheets from the same Excel file without clicking each individual worksheet?
  • By clicking on the header name in the import table process and using Power Query steps to ‘clean up’ the dataset, you will be creating a more scalable data import process if more tables need to be added in the future.
  • You will also eliminate duplication of transformations across multiple separate tables.

Power BI Tips, Tricks & Hacks

By Emma

  • Did you know that if you want to list a measure in more than one display folder in the Power BI model, because it may make sense to the report developer to find it in various places, you can do this by separating each display folder name with a semicolon?! It does not duplicate the measure, but simply makes it available for selection from multiple places in the model.
  • Did you know by using calculation groups, you can save a stack of time creating a series of time intelligence measures where the logic is exactly the same, other than the date period? Think of all the many financial measures you have created where you have spent time copying and pasting the same thing and changing 3 little letters for each set.
    • This also allows for the use of dynamically selected measures which can then also have the appropriate field formatting applied if you have a mix of $$ and % values. But use caution as there could be a performance impact.
  • Did you know there is a way to use conditional formatting with a measure on the line chart visual, despite there not being an option in Power BI to do this?
    • Create your visual using a column chart, apply your conditional formatting and then change to a line chart visual.
  • Did you know you can apply transparency colours by adding 2 extra digits to the hex value when creating a measure for use in conditional formatting?
    For example:
    Cndf – White Trans = “#FFFFFF00”
    Tip: use 00 for 100% and 80 for 50% (e.g., FFFFFF00, FFFFFF80)

AI in Power BI

By Kerry

This session provided an introduction to AI assisted visuals and delved into ML Capabilities within Power BI Premium. 

  • This was a fantastic session walking through various use cases for the feature set allowing the audience to imagine how these could be applied to their own data.  
  • Anomaly detection, smart narratives and predictive analytic capabilities were also discussed in this session. 

Exposé at Ignite 2021 – our main takeaways

The Microsoft technology stack, specifically Azure, has over the past few years provided many best-of-breed solutions to common data problems and tackled the data analytics domain with a different approach to its competitors (AWS and GCP). And recent changes saw them consolidate and tightly couple some of the key data resources, which makes for an even more compelling story.

This is something that Exposé have been evangelising out there in the market for quite some time now, and even though we had access to various product roadmaps and were therefore fortunate enough to architect solutions for our customers with a future state in mind, it was great to see some of these roadmap areas reaffirmed and in action at Ignite 2021.

So, looking through the lens of data analytics professionals, here are our six most important takeaways from Ignite 2021. We state in each section why data driven organisations should be interested.

  • Reinventing of Data Governance – Azure Purview.
  • Ensuring performance of your new modern data warehouse – Power BI and Azure Synapse Analytics.
  • Bringing BING to data analytics – Azure Cognitive Search (Semantic Search).
  • Bringing premium capabilities to users – Power BI Premium Gen2.
  • Reining in the uncontrolled release of Excel through Power BI Connectivity – Power BI.
  • Add cores to boost your BI performance – Power BI Premium Gen2.

Exposé will also delve into some of these areas in more detail in subsequent articles.

Contributors: Andrew Exley and Etienne Oosthuysen

Data governance across your digital landscape

Technology – Azure Purview

Reinventing Data Governance.

By Etienne

Why – As we acquire and store increasing volumes of data, greater numbers of data workers will use more data to answer increasingly varied business questions at scale and within a quick turnaround. And each data activity has the potential to contribute to the collective knowledge across the business. So, with the explosion of raw and enhanced data, governing data is more important than ever. But it must be done in a pragmatic way as we no longer have the luxury of old-school data workloads such as conventional data warehousing which ensured data was as-perfect-as-it-could-be, and every single transformation was designed, understood, and documented. Those old-school workloads simply cannot deal with modern data realities and the speed with which data solutions must be stood up. This means that the governance of data also had to be reinvented to keep up.

Summary – Azure Purview allows for management of all the data in your data estate (on-premises, Azure Data Lake Gen 2, Azure SQL Database, Cosmos DB, and even AWS and GCP) through a single pane of glass, where all the infrastructure and services required are provisioned as a single PaaS platform and user experience.

Core to Purview is an active (metadata) Data Map, which automatically records where all the data assets are via built-in scans and classification.

Take away – for on-premises data, Purview allows for the hosting of scanning and classification services on-premises so that it all happens behind your own firewall.

The Data Map then drives the Data Catalog, which supports self-service data discovery, and Data Insights, which provides a bird’s eye view of where data assets are and what kind of data assets are present.

Take away – it is not only stored data that forms part of the objects that can be included in the Data Map, but also other workloads such as: Azure Data Factory (this means data lineage is automatically populated from origination through to consumption (e.g., Power BI)), and Power BI reports and datasets (this means consumption objects themselves can form part of the discoverable metadata, cataloguing and insights).

Even deeper integration between Azure Synapse Analytics and Power BI for superior performance

Technology – Azure Synapse Analytics and Power BI

Ensuring performance of your new modern data warehouse.

By Etienne

Why – Reporting over big data exposed through a modern data warehouse is no longer some obscure, unique requirement; we see the need for this more and more with our customers, especially as data volumes explode and visualisations are needed, fast, over business logic already created in the modern data warehouse.

Summary – these requirements are what the modern data warehouse (as a service – such as Synapse in Azure, Redshift in AWS, or other services such as Databricks or Snowflake) is meant to overcome. But performance, even though already hugely improved, still remained elusive in some cases when data ingestion, the data warehouse, and data visualisation are viewed as one comprehensive data pipeline. Now, however, the integration between Synapse and Power BI includes adaptive learning.

Take away – Power BI collects query patterns and determines materialised views for optimal performance, and shares them with Synapse. Synapse creates the materialised views and directs Power BI queries to run over the views, which delivers significant performance advantages. See below how the query performance improved by almost 500% between the original query, shown in red, and the query after adaptive learning, shown in blue.

In addition, Power BI, Synapse and Azure Log Analytics also now provide the ability to correlate queries so that performance bottlenecks can be identified.

Take away – It is now possible to monitor the integrated Power BI and Synapse ecosystem via Azure Log Analytics, both at the tenancy level and at the workspace level. Both Power BI and Synapse send all their performance telemetry to the Log Analytics workspace. In the past, these queries had to be analysed independently, but now the Power BI queries and the Synapse queries are correlated, so that performance analysis can be performed on the end-to-end pipeline (covering DAX, M-query and SQL) and it becomes easy to determine any bottlenecks.

Discovery of data content improved with Semantic Search and its role within data analytics

Technology – Azure Cognitive Search (Semantic Search)

Bringing BING to data analytics.

By Etienne

Why – Azure Cognitive Services, already widely used, relied on keyword search in its Cognitive Search service. This was not as accurate as the algorithms used by, for example, BING.

Semantic Search leverages the same capabilities used in BING and achieves the same search-engine kind of outcome, but over your data. This is one of the most exciting new innovations to Azure Cognitive Services in a while. Azure Semantic Search now better understands what users are looking for and achieves better search accuracy compared to the older keyword search methods.

Take away – A data worker can quickly enable OCR, Text Analytics and Object Detection without the need for ML coding, including over data in your Azure Data Lake Gen 2.

Summary – Data ingestion allows data to be pulled into the search index, or pushed into it using the push API. Content can include non-uniform data, information in files, even images. Index-appropriate information is then extracted and pre-built ML algorithms are invoked to understand latent structure in the data. Semantic relevance, captions, and answers are then added as part of Semantic Search.

With normal key word search, the ambiguity of the word “Capital” creates a clear issue when one looks at the search term. I.e., capital gains tax in France has no relevance to the original search term.

Take away – Semantic Search includes learned ranking models which provide associations in a much more accurate way compared to keyword search.

When Semantic Search is enabled, the learned ranking models provide much more accurate search results, as seen on the right below.

Take away – Captions are also shown (in bold) highlighting the context of our search, and even a suggested answer. Similar to a conventional search engine.

Take away – this seems far removed from data analytics, but it is not; it is in fact increasingly relevant. Baking this kind of intelligent search over your data into your broader data analytics ecosystem alongside your reporting solutions is important when one considers that data workers no longer consume data via reporting solutions only. Microsoft are already doing some of this via Power BI’s Q&A (natural language queries), but with Semantic Search, this is put on steroids.

Premium Per User License for Power BI

Technology – Power BI Premium

Bringing premium capabilities to users.

By Andrew

Why – Over the past few years Microsoft has been adding features to Power BI that enable data analysts to work with data in a way that was previously limited to complex systems managed by data engineers. These capabilities have allowed organisations to accelerate their time to value by simplifying these controls and pushing them further into the business. Unfortunately, a lot of these features had been out of reach of most organisations due to a cost barrier and, until now, there was no alternative way to enable these empowering features for other Power BI clients. From 2 April 2021, Microsoft will be launching the Premium Per User license at USD$20 per user per month, which will enable organisations to provide and pay for these features on a per-individual basis.

Summary – Microsoft makes features that are usually limited to high-cost analytics systems available to organisations at a price point based on a per-user model.

Take away – Utilising Premium capabilities opens further options on how an organisation can utilise its data to achieve business outcomes. Organisations now have access to features such as:

  • Increased model size limits, up to 100GB
  • Refreshes up to 48 per day (currently unlimited if done via API)
  • Paginated reports
  • Enhanced Dataflow features
  • XMLA endpoint connectivity

Technology – Power BI

Rein in the uncontrolled release of Excel from Power BI Connectivity.

By Andrew

Why – Power BI is growing to become the default service for gathering and transforming organisational data for presentation. Microsoft is making it possible for Excel Online to connect directly to a Power BI Dataset. The benefit of this is that a single point of truth is used consistently throughout the organisation, and the connected version of Excel allows people comfortable with using Excel for analysis to engage with organisationally curated data.

Summary – Excel Online is now able to integrate with organisational Power BI Datasets.

Take away – By utilising Excel Online with Power BI datasets it will help organisations rein in the uncontrolled release of Excel documents and enable the use of organisation controlled single sources of truth for their data.

Auto-scaling for Gen2 Premium

Technology – Power BI

Add cores to boost your performance.

By Andrew

Why – Previously, when buying Power BI Premium, organisations bought based on an expected requirement of compute capacity. This led them to a binary choice: purchase based on a cost-efficient model of buying for normal expected usage, or overspend to cater for known spikes. With this new announcement, Microsoft has removed that choice. Organisations can now allow extra cores to be added to boost compute capacity when needed, by autoscaling on demand. The cores will be added automatically when the demand threshold requires it. The extra cores are made available for 24 hours after they are initially added to the capacity, and are billed through the organisation’s Azure billing.

Summary – Microsoft enables organisations to control their spending while meeting user demands for fast and efficient reporting.

Take away – By allowing additional cores to come online when required, organisations can ensure effective performance for their report requirements. The price Microsoft has released for this is USD$85 per vCore per 24 hours. Organisations can control how many cores are to be made available to a capacity. The use of these extra cores is meant to cater to spikes in requirements and shouldn’t follow a daily usage pattern. If an organisation finds that the auto-scaling is being consumed on a regular basis, it’s recommended to look at the models being used within that capacity and look for ways of improving their efficiency.
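As a rough illustration of the billing model described above (the USD$85 per vCore per 24 hours comes from the announcement; the spike sizes are made up for the example):

# Back-of-the-envelope autoscale cost; rate as quoted above, spikes invented
rate_per_vcore_per_day_usd = 85

def autoscale_cost(extra_vcores: int, days: int = 1) -> int:
    return extra_vcores * days * rate_per_vcore_per_day_usd

print(autoscale_cost(8))      # one spike day with 8 extra vCores -> 680 USD
print(autoscale_cost(8, 20))  # the same spike every working day -> 13,600 USD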

Please look out for detailed articles on the topics described here in the near future.

exposé Digital Twin – an ABA100 Winner for New Product Innovation in The Australian Business Awards 2020

We are excited to announce that the exposé Digital Twin has been recognised as an ABA100 Winner in The Australian Business Awards 2020 for New Product Innovation. Now in their fifteenth year, The Australian Business Awards program recognises organisations for the successful development of new and improved products and services and for implementing business processes that improve efficiency and performance. This award affirms our commitment to continuous improvement and innovative business processes.

Being one of Australia’s most prominent data analytics companies, exposé is proudly South Australian based with a national reach. Our Digital Twin delivers its one-of-a-kind comprehensive, all-round solution that is set to revolutionise the way organisations interact with their data.

“Our platform is tightly coupled with the Internet of Things (IoT) and Artificial Intelligence (AI) and it offers a highly immersive user experience, which means users can traverse the physical twin through a virtual reality immersion.” Etienne Oosthuysen, Chief Technology Officer

For more information on The Australian Business Awards and the 2020 ABA100® winners, visit australianbusinessawards.com.au.

See our Digital Twin video here
See our Digital Twin solution here


Azure Synapse Analytics – the essential Spark cheat sheet

In this article, I take the Apache Spark service for a test drive. It is the third in our Synapse series: the first article provides an overview of Azure Synapse, and in our second, we took the SQL on-demand feature for a test drive and provided some resulting observations.

This article contains the Synapse Spark test drive as well as a cheat sheet that describes how to get up and running step by step, ending with some observations on performance and cost.

Synapse Spark Architecture

The Spark pool in Azure Synapse is a provisioned and fully managed Spark service, utilising in-memory cache to significantly improve query performance over disk storage.

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview

As Spark pools are a provisioned service, you pay for the resources provisioned, and these can be automatically started up when the Spark pool is in use and paused when it is not. This can be controlled per Spark pool via two configurations, Auto-pause and Scale, which I will discuss later in this post.

Based on the Azure Pricing calculator, https://azure.microsoft.com/en-au/pricing/calculator/, the cost is based on a combination of the instance size, number of instances and hours of usage.

Why Spark pool?

  • Data scientists can run large amounts of data through ML models
  • Performance – the architecture auto-scales, so you do not have to worry about infrastructure, managing clusters and so on
  • Data orchestration – effortless inclusion of Spark workloads in notebooks

Who Benefits?

  • Data scientists: collaborating, sharing and operationalising workflows that use cutting edge machine learning and statistical techniques in the language that fits the job is made simple.
  • Data engineers: complex data transformations can easily be replicated across different workflows and datasets in an obvious way.
  • Business users: even without a deep technical background, the end users of the data can understand and participate broadly in the preparation of their data and advise and sense-check with the capabilities that only a subject matter expert has.
  • Your business: data science processes within Azure Synapse are visible, understandable and maintainable, dismantling the ivory silo data science often occupies in organisations.

Steps to get up and running

I have already provisioned a data lake, the Azure Synapse Analytics workspace and some raw parquet files. In this section, I will:

  1. Access my Azure Synapse Analytics workspace.
  2. Provision a new Spark pool
  3. Create a new Notebook with Python as the chosen runtime language
  4. Configure my notebook session
  5. Add a cell and create a connection to my Data Lake
  6. Run some cells to test queries on my parquet files in the Data Lake
  7. Run another runtime language in the same notebook

Step 1 – Access my Synapse workspace

Access my workspace via the URL https://web.azuresynapse.net/

I am required to specify my Azure Active Directory tenancy, my Azure Subscription, and finally my Azure Synapse Workspace.

Before users can access the data through the Workspace, their access control must first be set appropriately. This is best done through Security Groups, but in this quick test drive, I used named users.

When I created Azure Synapse Analytics, I specified the data lake I want to use; this is shown under Data > Linked > data lake > containers. I can, of course, link other datasets, for example those in other storage accounts or data lakes, here too.

Step 2 – Create a new Apache Spark pool

In the Manage section of the Synapse workspace, I navigated to the Apache Spark pools and started the wizard to create a new Spark pool.

On the first screen, ‘Basics’, of the Spark pool provisioning, I had to take careful note of the multiple options available, more specifically:

Autoscale

If enabled, then depending on the current usage and load, the number of nodes used will increase or decrease. If disabled, you can set a pre-determined number of nodes to use.

Node size

This will determine the size of each node. For a quick reference, currently there are 3 sizes available, small, medium and large with the rough cost in AUD per hour for each node being $0.99, $1.97 and $3.95 respectively.

Number of Nodes

This determines the number of nodes that will be consumed when the Spark pool is online. As I chose to enable the Autoscale setting above, I now get to choose a range of nodes, which determines the minimum and maximum number of nodes that can be utilised by the Spark pool.

If I were to Disable the Autoscale setting, I would only get to select the maximum number of nodes the Spark pool can use at a time.

Both options have a minimum limit of 3 nodes.

For the purpose of our tests, I selected the medium-sized node, enabled Autoscale and left the default range of 3 to 40 nodes.
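As a rough back-of-the-envelope check on what this configuration could cost (using the approximate AUD per-node rates quoted earlier; actual pricing varies by region and over time):

# Hourly cost range for the pool above: medium nodes, autoscaling 3 to 40
medium_rate_aud_per_hour = 1.97

def hourly_cost(nodes: int, rate: float = medium_rate_aud_per_hour) -> float:
    return nodes * rate

print(hourly_cost(3))   # ~5.91 AUD/hour at the minimum node count
print(hourly_cost(40))  # ~78.80 AUD/hour if the pool scales to its maximum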

Continuing on to the ‘Additional Settings’ I left the default settings here.  

The main configuration setting that caught my focus was the Auto-pause option in which you can define how long the Spark pool will stay idle for before it automatically pauses.

Review and create the Spark pool.

Step 3 – Create a new Notebook with Python as the runtime language

Advanced analytics (including exploration and transformation of large datasets, and machine learning using Spark) is delivered through a notebook environment that is very similar to Databricks. Like Databricks, you choose your default language, attach your notebooks to compute resources, and run through cells of code or explanatory text.

To get started, I added a new notebook in the Develop section of the Synapse workspace.

At a glance, there are some notable differences in language choices when compared to Databricks:

  • The R language is not yet available as a notebook option (though this appears to be on the horizon)
  • Synapse additionally allows you to write your notebook in C#

Both Synapse and Databricks notebooks allow code written in Python, Scala and SQL.

Synapse Spark notebooks also allow us to use different runtime languages within the same notebook, using magic commands to specify which language to use for a specific cell. An example of this is shown in Step 7.

More information can be found in the Microsoft documentation: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#develop-notebooks

Step 4 – Configure my notebook session

Notebooks give us the option to configure the compute for the session as we develop. Compute configuration in Synapse is a dream – you specify the pool from which compute is borrowed, how much you want to borrow, and for how long. This is a really nice experience.

Step 5 – Add a cell and create a connection to my Data Lake

In the first instance, I added two cells, one for describing the notebook and the second one to create a connection to my Data Lake files with a quick count.

Documentation cells – in Databricks, you must know Markdown syntax in order to write and format your documentation cells. In Synapse, there are nice helpers so that, for example, you don’t have to remember which brackets are which when writing hyperlinks.

Connectivity – to connect to the Data Lake, I’ve used the Azure Data Lake Storage (ADLS) path to connect directly. This is a really convenient feature as it inherits my Azure Role Based Access Control (RBAC) permissions on the Data Lake for reading and writing, meaning that controlling data access can be done at the data source, without having to worry about mounting blob storage using shared access signatures as in Databricks (although access via a shared access signature can also be put in place).

The ADLS path is in the following format:

abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

I also added a quick row count over the parquet files. Running the cell initially took around 2 minutes and 15 seconds, as the Spark pool and its nodes needed to spin up (having watched this happen a few times over the course of a few days, the spin-up time varied from about 1 minute up to 4 minutes). A subsequent run took just under 3 seconds to count the rows across 19 parquet files.
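
For reference, this is roughly what such a cell might look like – a minimal sketch only, with the storage account, container and path shown as placeholders rather than my actual values:

# PySpark cell: read the parquet files directly from the Data Lake via the ABFSS path.
# The storage account, container and path below are placeholders, not my actual values.
adls_path = "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>"

# 'spark' is the SparkSession that Synapse provides automatically in a notebook session.
sdf_kWh = spark.read.parquet(adls_path)

# Quick sanity check: count the rows across all parquet files under the path.
print(sdf_kWh.count())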

Step 6 – Run some cells to test queries on my parquet files in the Data Lake

In the previous step, I also added a reader on the parquet files within my Data Lake container. Let’s first display the contents using the following code:

display(sdf_kWh)

Running the notebook cell gives us a preview of the data directly under the corresponding cell, taking just 3.5 seconds to execute and display the top 1000 records.

Moving on from the simple display, I ran a simple aggregate with ordering, which took 9 seconds.

Using another set of parquet files in the same Data Lake, I ran a slightly more complex query, which returned in around 11 seconds.
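
These aggregations are ordinary Spark dataframe operations. As a minimal sketch only – the column names below (reading_date, kWh) are assumptions for illustration, not my actual schema:

from pyspark.sql import functions as F

# A simple aggregate with ordering over the dataframe read earlier.
# Column names are illustrative assumptions only.
daily_totals = (
    sdf_kWh
    .groupBy("reading_date")
    .agg(F.sum("kWh").alias("total_kWh"))
    .orderBy("reading_date")
)

display(daily_totals)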

Step 7 – Run another runtime language in the same notebook

In Synapse, a notebook allows us to run different runtime languages in different cells, using ‘magic commands’ that can be specified at the start of the cell.

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks

I can access the data frames created in a previous Spark Python cell within the same notebook and query them with a simple select statement in SQL syntax.

Or an aggregation in SQL syntax.
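
As an illustrative sketch of how this hangs together (the temporary view name and column names below are my own placeholders): a PySpark cell registers the dataframe as a temporary view, and a separate cell then switches runtime language with a magic command and queries it in SQL syntax.

# PySpark cell: expose the dataframe to other languages in the notebook as a temporary view.
sdf_kWh.createOrReplaceTempView("kwh_readings")

# A separate notebook cell can then switch runtime language with a magic command and
# query the view, for example (cell contents shown here as comments):
#
#   %%sql
#   SELECT reading_date, SUM(kWh) AS total_kWh
#   FROM kwh_readings
#   GROUP BY reading_date
#   ORDER BY reading_date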

Other Observations

Publishing

Once all my cells were coded and working as intended, I proceeded to publish the notebook. Comparing the publishing paradigm to Databricks: Synapse works by having users publish their changes, which gives the opportunity to test, but reverting changes does not currently appear to be simple. Databricks exposes a full history of changes to notebooks, which is useful for assessing who changed what and for reverting accidental changes.

Orchestration

Orchestration is another nice experience Synapse has to offer. After developing a notebook that accomplishes a task, such as training a classification model, you can immediately add your notebook to a Data Factory pipeline as a task and choose the circumstances that should trigger the notebook to run again. This is very simple to do and very easy to understand, especially given the excellent Data Factory pipeline interface that Azure provides.

Flexibility

Databricks gives the user quite good visibility of the resources on which the cluster is running – it allows the user to run shell commands on the Spark driver node and provides some utility methods for controlling Databricks (such as adding input parameters to notebooks). This does not appear to be the experience that Synapse is going for, potentially to some detriment. In one engagement, it proved very useful that Databricks exposed the shell of the machine running code as it enabled us to send commands to a separate piece of specialised optimisation software that we compiled on the machine.

Performance Observation

Although the use cases in this blog are limited and the size of data is quite small, they did give an indication of basic performance with simple commands. I gave up on formal speed testing, since caching and random variation have too much of an effect. Spark works faster than a single machine on large enough data – that’s all we really need to know.

A summary of the performance we’ve seen so far, using 2 executors and 8 cores on medium-sized instances:

Task – Duration
Initial spin up of the pool – 1 to 4 minutes
Row count of ~600k records – under 3 seconds
Display top 1000 rows – 3.5 seconds
Aggregate over dates – 9 seconds

Cost Observation

Pricing of the Spark pool is calculated on the pool’s uptime, so you only pay while there is activity (a running notebook or application) on the Spark pool, plus the idle minutes configured in the Auto-pause functionality – in my case, 15 minutes.
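
As a rough back-of-the-envelope sketch only, using the indicative per-node price quoted earlier and some assumed usage figures (not official pricing, and it will vary by region, node size and autoscaling behaviour):

# Back-of-the-envelope Spark pool cost estimate: uptime (active time plus the auto-pause
# idle window) multiplied by the number of nodes and the per-node hourly rate.
medium_node_per_hour_aud = 1.97     # indicative Medium node price quoted in this article
nodes = 3                           # assume the pool sat at its 3-node minimum
active_hours = 2.0                  # hypothetical time notebooks were actually running
auto_pause_hours = 15 / 60          # the 15-minute auto-pause window also counts as uptime

estimated_cost_aud = nodes * medium_node_per_hour_aud * (active_hours + auto_pause_hours)
print(f"Estimated cost: AUD ${estimated_cost_aud:.2f}")    # ~ AUD $13.30 for this example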

With a decent amount of sporadic use over two weeks, I observed a cost of nearly $100 AUD. Bear in mind that I used the Medium node size with auto-scaling set to a maximum of 40 nodes, which was, in hindsight, overkill for what it was actually used for.

Azure Synapse Analytics – the essential SQL on-demand cheat sheet

Our first article http://blog.exposedata.com.au/2020/05/21/azure-synapse-analytics-insights-for-all-and-breaking-down-silos/ introduced Azure Synapse Analytics and some of its core concepts. In this second article, I take the new SQL on-demand feature, currently in Preview, for a test drive.

Disclaimer: as Azure Synapse Analytics is still in Public Preview, some areas may not yet function as they will at General Availability.

This article contains the Synapse SQL on-demand test drive as well as a cheat sheet that describes how to get up and running step-by-step. I then conclude with some observations, including performance and cost.

But first let’s look at important architecture concepts of the SQL components of Azure Synapse Analytics, the clear benefits of using the new SQL on-demand feature, and who will benefit from it. For those not interested in these background concepts, just skip to the “Steps to get up and running” section later in this article.

Synapse SQL Architecture

Azure Synapse Analytics is a “limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/

It has two analytics runtimes: Synapse SQL for T-SQL workloads and Synapse Spark for Scala, Python, R and .NET. This article focuses on Synapse SQL, and more specifically the SQL on-demand consumption model.

Synapse SQL leverages Azure Storage, or in this case, Azure Data Lake Gen 2, to store your data. This means that storage and compute charges are incurred separately.

Synapse SQL’s node-based architecture allows applications to connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-architecture

The Control node of the SQL pool consumption model (also called provisioned) utilises a massively parallel processing (MPP) engine to optimise queries for parallel processing and then passes operations to Compute nodes to do their work in parallel. SQL Pools allow for querying files in your data lake in a read-only manner, but they also allow you to ingest data into SQL itself and shard it using a Hash, Round Robin or Replicate distribution pattern.

As SQL Pool is a provisioned service, you pay for the resources provisioned and these can be scaled up or down to meet changes in compute demand, or even paused to save costs during periods of no usage.

The Control node of the SQL on-demand consumption model (also called serverless) on the other hand utilises a distributed query processing (DQP) engine to optimise and orchestrate the distribution of queries by splitting them into smaller queries, executed on Compute nodes. SQL on-demand allows for querying files in your data lake in a read-only manner.

SQL on-demand is, as the name suggests, an on-demand service where you pay per query. You are therefore not required to pick a particular size, as is the case with SQL Pool, because the system automatically adjusts. The Azure Pricing calculator, https://azure.microsoft.com/en-us/pricing/calculator/, currently shows the cost to query 1TB of data as A$8.92. I give my observations on cost and performance later in this article.
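
As a simple illustration of the pay-per-query model – a sketch only, using the indicative per-TB figure above and a made-up data volume:

# SQL on-demand is billed on the data processed by each query, not on provisioned compute.
price_per_tb_aud = 8.92             # indicative figure from the Azure Pricing calculator
data_processed_gb = 50              # hypothetical amount of data scanned by your queries

estimated_cost_aud = (data_processed_gb / 1024) * price_per_tb_aud
print(f"Estimated query cost: AUD ${estimated_cost_aud:.2f}")   # ~ AUD $0.44 for 50 GB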

Now let’s focus on SQL on-demand more specifically.

Why SQL on-demand

I can think of several reasons why a business would want to consider Synapse SQL on-demand. Some of these might be:

  • It is very useful if you want to discover and explore the data in your data lake which could exist in various formats (Parquet, CSV and JSON), so you can plan how to extract insights from it. This might be the first step towards your logical data warehouse, or towards changes or additions to a previously created logical data warehouse.
  • You can build a logical data warehouse by creating a relational abstraction (almost like a virtual data warehouse) on top of raw or disparate data in your data lake without relocating the data.
  • You can transform your data to satisfy whichever model you want for your logical data warehouse (for example star schemas, slowly changing dimensions, conformed dimensions, etc.) upon query rather than upon load, which was the regime used in legacy data warehouses. This is done by using simple, scalable, and performant T-SQL (for example as views) against your data in your data lake, so it can be consumed by BI and other tools or even loaded into a relational data store in case there is a driver to materialise the data (for example into Synapse SQL Pool, Azure SQL Database, etc.).
  • Cost management, as you pay only for what you use.
  • Performance, as the architecture auto-scales so you do not have to worry about infrastructure, managing clusters, etc.

Who will benefit from SQL on-demand?

  • Data Engineers can explore the lake, then transform the data in ad-hoc queries or build a logical data warehouse with reusable queries.
  • Data Scientists can explore the lake to build up context about the contents and structure of the data in the lake and ultimately contribute to the work of the data engineer. Features such as OPENROWSET and automatic schema inference are useful in this scenario.
  • Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favourite tools that support connection to SQL on-demand.
  • BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview

Is T-SQL used in Synapse SQL the same as normal T-SQL?

Mostly, yes. Synapse SQL on-demand offers a T-SQL querying surface area that is in some areas more extensive than the T-SQL we are already familiar with, mostly to accommodate the need to query semi-structured and unstructured data. On the other hand, some aspects of the T-SQL we are familiar with are not supported, due to the design of SQL on-demand.

High-level T-SQL language differences between consumption models of Synapse SQL are described here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features.

Let’s now look at getting up and running.

Steps to get up and running

I have already provisioned both a data lake and Azure Synapse Analytics. In this section, I will:

  1. Access my Azure Synapse Analytics workspace.
  2. Then load five raw parquet files, each containing approx. 1,000 records, to my data lake.
  3. Then access the data lake through Synapse and do a simple query over a single file in the data lake.
    1. Part of this sees me set appropriate RBAC roles on the data lake.
  4. Then extend the query to include all relevant files.
  5. Then create the SQL on-demand database and convert the extended query into a reusable view.
  6. Then publish the changes.
  7. Then connect to the SQL on-demand database through Power BI and create a simple report.
  8. Then extend the dataset from 5,000 records to approx. 50,000.
  9. And test performance over a much larger dataset, i.e. 500,000 records, followed by a new section on performance enhancements and side by side comparisons.

Step 1 – Access my Synapse workspace

Access my workspace via the URL https://web.azuresynapse.net/

I am required to specify my Azure Active Directory tenancy, my Azure Subscription, and finally my Azure Synapse Workspace.

Before users can access the data through the Workspace, their access control must first be set appropriately. This is best done through Security Groups, but in this quick test drive, I used named users.

When I created Azure Synapse Analytics, I specified the data lake I want to use; this is shown under Data > Linked > data lake > containers. I can, of course, link other datasets here too, for example those in other storage accounts or data lakes.

Step 2 – load data to my data lake

I have a data lake container called “rawparquet” where I loaded 5 parquet files containing the same data structure. If I right-click on any of the Parquet files, I can see some useful starter options.

https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio

Step 3 – Initial query test (access the data lake)

I right-clicked and selected “Select TOP 100 rows”, which created the following query:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/userdata2.parquet',
        FORMAT='PARQUET'
    ) AS [r];

The first time I ran this query, I received an access error.

This was because my Azure Active Directory identity did not have rights to access the file; by default, SQL on-demand tries to access the file using your Azure Active Directory identity, so that identity needs the proper rights on the file.

To resolve this, I granted myself both the ‘Storage Blob Data Contributor’ and ‘Storage Blob Data Reader’ roles on the storage account (i.e. the data lake), as described in:

https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio

and https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/access-control.

Those steps resolved the error.

Step 4 – Extend the SQL

In my theoretical use case, I have a Data Factory pipeline that loads user data from the source into my data lake in Parquet format. I currently have 5 separate Parquet files in my data lake.

The query mentioned previously obviously targeted a specific file explicitly, i.e. “userdata2.parquet”.

In my scenario, my Parquet files are all delta files, and I want to query the full set.

I now simply extend the query by removing the TOP 100 clause and opening up the OPENROWSET part of the query to the whole container, not just the specific file. It now looks like this:

SELECT
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
        FORMAT='PARQUET'
    ) AS [r];

Step 5 – Now let’s create a database and views dedicated to my SQL on-demand queries

This database serves as my Logical Data Warehouse built over my Data Lake.

I firstly ensure that SQL on-demand and the master database are selected:

CREATE DATABASE SQL_on_demand_demo 

I now create a view that will expose all the data in my dedicated Container, i.e. “rawparquet” as a single SQL dataset for use by (for example) Power BI.

I firstly ensure that SQL on-demand and the new database SQL_on_demand_demo are selected.

I now run the create view script:

CREATE VIEW dbo.vw_UserData as 
SELECT
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
        FORMAT='PARQUET'
    ) AS [r];

I now test the view by running a simple select statement:

Select * from dbo.vw_UserData 

Step 6 – Publish changes

Select Publish to move all the changes to the live environment.

If you now refresh your Data pane, you will see the new database and view appear as an on-demand database. Here you will be able to see both provisioned (SQL Pool) and on-demand databases.

My data volumes at this stage are still very low – only 5,000 records. But we will first hook Power BI on to Synapse, and then throw more data at it to see how it performs.

Step 7 – Query through Power BI

It is possible to build interactive Power BI reports right here in the Synapse workspace, but for now I am going to go old school and create a Direct Query report from the view we created, essentially querying the data in the data lake via the logical data warehouse, SQL_on_demand_demo.

  1. To connect:
    1. Open a new Power BI Desktop file.
    2. Select Get Data
    3. Select Azure SQL Database.
    4. Find the server name
      1. Navigate to your Synapse Workspace
      2. Copy the SQL on-demand endpoint from the overview menu
    5. Paste it into the Server field in the Power BI Get Data dialog box
    6. Leave the database name blank
    7. Remember to select Direct Query if the processing must be handed over to Synapse, and the footprint of Power BI must be kept to a minimum.
    8. Select Microsoft Account as the authentication method and sign in with your organisational account.
    9. Now select the view vw_UserData
    10. Transform, then load, or simply load the data.
  2. Create a simple report, which now runs in Direct Query mode.
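
If you want to sanity-check the endpoint before (or outside of) Power BI, any tool that can connect to a SQL endpoint will do. A hedged sketch using pyodbc – the workspace name, driver version and authentication mode below are assumptions you should adjust for your environment; only the database and view names come from the earlier steps:

import pyodbc

# The SQL on-demand endpoint copied from the workspace overview typically looks like
# <workspace-name>-ondemand.sql.azuresynapse.net; the values below are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "Database=SQL_on_demand_demo;"
    "Authentication=ActiveDirectoryInteractive;"   # sign in with your organisational account
)

# Query the logical data warehouse view created earlier.
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.vw_UserData")
for row in cursor.fetchall():
    print(row)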

Step 8 – Add more files to the data lake and see if it simply flows into the final report

I made arbitrary copies of the original Parquet files in the “rawparquet” container, increasing the number of files from 5 to 55; as they are copies, they obviously all have the same structure.

I simply refreshed the Power BI Report and the results were instantaneous.

Step 9 – Performance over a much larger dataset

For this, I am going to publish the report to Power BI Service to eliminate any potential issues with connectivity or my local machine.

The published dataset must authenticate using OAuth2.

Once the report is published, I select the ‘Female’ pie slice and the full report renders in approx. 4 seconds. This means the query generated by Power BI is sent to Azure Synapse Analytics, which uses SQL on-demand to query the multiple Parquet files in the data lake and return the results to Power BI to render.

I now again arbitrarily increase the number of files from 55 to 500.

Refreshing this new dataset, containing 498,901 records, took 17 seconds.

Selecting the same ‘Female’ pie slice initially rendered the full report in approx. 35 seconds. And then in approx. 1 second after that. The same pattern is observed for the other slices.

I am now going to try and improve this performance.

Performance enhancements and side by side comparison

The performance noted above is okay considering the record volumes and the separation between the stored data in my data lake and the compute services; but I want the performance to be substantially better, and I want to compare it with a competitor product (note that the competitor product is not named, as the purpose of this article is a test drive of Azure Synapse Analytics SQL on-demand, not a full-scale competitor analysis).

To improve performance I followed two best practice guidelines: (a) I decreased the number of Parquet files the system has to contend with and correspondingly increased the record volumes within each file, and (b) I collocated the data lake and Azure Synapse Analytics in the same region.
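
For the first of these, one possible way to compact many small parquet files into fewer, larger ones is with a Spark pool – a sketch only, with placeholder paths, and not necessarily how the files used in these tests were produced:

# Read the many small parquet files and rewrite them as a smaller number of larger files.
# Container and storage account names are placeholders; coalesce(20) matches the roughly
# 20 files used in the later tests without forcing a full shuffle.
source_path = "abfss://rawparquet@<storage_account_name>.dfs.core.windows.net/"
target_path = "abfss://rawparquet-compacted@<storage_account_name>.dfs.core.windows.net/"

df = spark.read.parquet(source_path)
df.coalesce(20).write.mode("overwrite").parquet(target_path)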

Tests 1 and 2 show the impact of the performance enhancements, whereas tests 3 and 4 represent my observations when the two competitors – Synapse in test 3 and the competitor product in test 4 – are compared side by side.

Test summaries

Test 1 – large number of records and files, not collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 500
Azure Data Lake Gen 2 region – Australia Southeast
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 17 seconds
Refresh on initial visual interaction – 35 seconds
Refresh on subsequent visual interaction – 1 second

Test 2 – large number of records, decreased numbers of files, not collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia Southeast
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 9 seconds
Refresh on initial visual interaction – 4 seconds
Refresh on subsequent visual interaction – less than 1 second

Test 3 – large number of records, decreased number of files, collocated, Azure Synapse, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia East
Azure Synapse Analytics – Australia East

Results:
Initial refresh – 3 seconds
Refresh on initial visual interaction – 2.5 seconds
Refresh on subsequent visual interaction – less than 1 second

Test 4 – large number of records, decreased number of files, collocated, Competitor product, Azure Data Lake, Power BI

Record volumes – 500,000
Number of Parquet Files – 20
Azure Data Lake Gen 2 region – Australia East
Competitor analytics service – Australia East

Results:
Initial refresh – 4 seconds
Refresh on initial visual interaction – 3 seconds
Refresh on subsequent visual interaction – less than 1 second

Performance Conclusion

The results of the tests above show that Azure Synapse performed best in the side-by-side competitor comparison – see tests 3 and 4.

We describe this as a side-by-side test because both Synapse and the compared competitor analytics service are located in the same Azure region as the data lake, and the same parquet files were used for both.

Cost observation

With the SQL on-demand consumption model, you pay only for the queries you run, and Microsoft describes the service as auto-scaling to meet your requirements. Running numerous queries across the steps described, over the course of three days, seemed to incur only very nominal query charges when reviewing the cost analysis for the resource group hosting both the data lake and Azure Synapse Analytics.

I did initially observe higher than expected storage costs but, it turns out, these related to a provisioned SQL Pool, which had no relation to this SQL on-demand use case. Once that unrelated data was deleted, we were left with only a very nominal storage charge across the large record volumes in the Parquet files in the data lake.

All in all, a very cost-effective solution!

Conclusion

  • Getting up and running with Synapse SQL on-demand once data is loaded to the data lake was a very simple task.
  • I ran a number of queries over a large dataset over the course of five days. The observed cost was negligible compared to what would be expected with a provisioned consumption model provided by SQL Pools.
  • The ability to use T-SQL to query data lake files, and the ability to create a logical data warehouse provides for a very compelling operating model.
  • Access via Power BI was simple.
  • Performance was really good after performance adjustments as described in the “Performance enhancements and side by side comparison” section.
  • A logical data warehouse holds huge advantages compared to materialised data, as it opens the door to reporting over data streams and real-time data from LOB systems, increased design responsiveness, and much more.

Exposé will continue to test drive other aspects of Azure Synapse Analytics such as the Spark Pool runtime for Data Scientists and future integration with the Data Catalog replacement.

Azure Synapse Analytics – Insights for all and breaking down silos

(And a party down the lakehouse)

Cloud databases are a way for enterprises to avoid large capital expenditure: they can be provisioned quickly, and they can provide performance at scale. But data workloads continue to change fast, which means conventional databases alone (including those running in data warehouse configurations) can no longer cope with the demand.

Exposé have, over the past few years, written and spoken extensively about why conventional data warehousing is no longer fit for purpose (see http://blog.exposedata.com.au/2017/02/09/is-the-data-warehouse-dead/ and http://blog.exposedata.com.au/2018/11/07/databricks-cheat-sheet-1-concepts-business-benefits-gettings-started/ as examples). The future data warehouse must at least:

  • Be able to cope with data ranging from relational through to unstructured.
  • Be able to host data ingested in a latent manner (e.g. daily) as well as real-time streams, and everything in between.
  • Be able to host data in its raw form, at scale, and at low cost populated by extract and load (EL) or data streams.
  • Provide the mechanisms to curate, validate and transform the data (i.e. the “T” of ELT).
  • Be able to scale up to meet increasing demand, and back down during times of low demand.
  • Be able to integrate seamlessly into modern workloads that rely on the DW; these include AI, visualisations, governance and data sharing.

What is Azure Synapse Analytics?

Say hello to Azure Synapse Analytics now in public preview – https://aka.ms/Synapse_Insights4All

Microsoft describes Azure Synapse Analytics as a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/

It is the new version of Azure SQL DW and gives Microsoft a stronger competitor platform against AWS Redshift, Google BigQuery and Snowflake. For background, a comparison with these competitors can be found at https://gigaom.com/report/data-warehouse-cloud-benchmark/ – note that it was done for Synapse’s predecessor, Azure SQL DW.

No, what is it really?

Microsoft’s description of a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics” can be translated as two siblings that historically hosted two different types of data – highly relational data (the ‘enterprise data warehousing’ or SQL workloads) and everything else, including semi-structured and unstructured data (the ‘big data’ workloads in data lakes) – unified in a workspace that allows the user to query and use both SQL/relational and big data with the languages they are comfortable with (SQL, Python, .NET, Java, Scala and R). It breaks down the barriers between the DW and the data lake. Now that is HUGE (don’t act like you’re not impressed). Imagine that…a “lakehouse”.

This is shown conceptually in the image below.

Okay, so it’s SQL and Spark, wrapped into a clever unified limitless compute workspace? No, it’s a bit more.

Firstly, it includes Data Integration: it not only unifies the differing data types (i.e. relational and big data), but also includes the means to ingest and orchestrate this data natively inside Synapse (where it is called Data Integration) using Azure Data Factory, which has become so pervasive in the market. Does this mean batch data loading only? No – you can of course load your real-time data streams into your data lake using some kind of IoT Hub / Event Hub / Stream Analytics configuration, or achieve low-latency data feeds into your data lake using Logic Apps or Power Automate.

Secondly, it not only integrates with Power BI, it actually includes Power BI as part of Synapse. In fact, interactive Power BI reports and semantic models can be developed within the Azure Synapse Studio. Imagine the ability to quickly ingest both structured and unstructured data into your data lake, either move the data into SQL (the data warehouse) or leave it in raw form in the lake. Then you have the ability to explore the data using a serverless SQL environment whether the data resides in the data lake or in the data warehouse, and potentially do this all in a Direct Query mode. This not only reduces the Power BI model footprint and hands the grunt over to Synapse, but also allows for much more real time reporting over your data. https://azure.microsoft.com/en-au/resources/power-bi-professionals-guide-to-azure-synapse-analytics/

Thirdly, it integrates seamlessly with Azure Machine Learning for those who need to use data from the unified platform for predictive analytics or deep learning, and to share results back into the platform for wider reuse. This includes using Azure Data Share for a seamless and secure data sharing environment with other users of Azure.

Ah okay, so…

It unifies the DW and the data lake (real-time data, latent data, and data of any type) and it also brings Data Integration and Data Visualisation into that unified platform. It then seamlessly integrates with Machine Learning and Data Share. So it’s SQL, Spark, ADF and Power BI all at the same party, or ahem…lakehouse 😊 where you can ingest, explore, prepare, train, manage, and visualise data through a single pane of glass. Yes, we are bursting with excitement too!

Let’s get technical

Let’s look at some of the technical aspects of Synapse:

  • Users can query data using either serverless on-demand compute or provisioned resources.
    • Serverless on-demand compute (technically this is called SQL on demand) allows you to pay per query and use T-SQL to query data from your data lake in Azure rather than provision resources ahead of time. The cost for this is noted as approximately $8.90 per TB processed – https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is still in Preview.
    • Provisioned resources, in line with the incumbent SQL DW data warehouse unit (DWU) regime, allow the user to provision resources based on workload estimates, but can be scaled up or down, or paused, within minutes (technically this is called SQL Pools) – see https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is in General Availability.
  • Azure Synapse Studio supports a user’s ability to ingest, explore, analyse and visualise data using a single sleek user interface, which is sort of a mix between the Azure Data Factory and Databricks UIs.
  • Users can explore the data using a refreshed version of Azure Data Explorer.
  • Users can transform data using both T-SQL (data engineers) and Spark Notebooks (data scientists).
  • On the security front, there is threat detection, transparent data encryption, always-on encryption, fine-grained access control via column-level and native row-level security, as well as dynamic data masking to automatically protect sensitive data in real-time.

Please see this important fact sheet and a list of capabilities in General Availability vs those in Preview – https://azure.microsoft.com/en-us/services/synapse-analytics/#overview

Also, please see our essential cheat sheet for the Synapse SQL on-demand test drive, where we put that exciting new service through its paces and help you get up and running quickly – http://blog.exposedata.com.au/2020/06/01/azure-synapse-analytics-the-essential-sql-on-demand-cheat-sheet/

What are the business benefits?

They are numerous, but in our humble opinion (and it must be noted that we have extensive experience in data warehouses and modern data platforms), these are:

  • The lakehouse that unifies the Spark and SQL engines, PLUS the ability to query them through a single pane of glass, is something the industry has been asking for, for a long time, as it breaks down data silos. As a result, it also breaks down skill silos: those familiar with SQL can continue using SQL, and those who prefer Python, Scala, Spark SQL or .NET can do so as well…all from the same analytics service.
  • The new serverless on-demand compute model allows users to use T-SQL to execute serverless queries over their data lake and pay only for what they use. This, coupled with the provisioned resources model, gives customers multiple ways to analyse data, so they can choose the most cost-effective option for each use case.
  • Security – including column-level security, native row-level security, dynamic data masking, and data discovery and classification – is all included at no additional cost to customers.

How can Exposé help?

We are Australia’s premium data analytics company and have, since our inception, made sure we fully understand changes in the data analytics market so that we can continue to tailor our best-of-breed architectures and solutions to our customers’ benefit. Below are some of the highlights of our journey:

  • We were the first consultancy to coin the phrase “friends don’t let friends build old school and expensive data warehouses” – we were passionate about finding solutions that were truly modern and delivered the best ROI.
  • We were one of the first consultancies in Australia to understand the value that Databricks could play in modern data workloads; we championed it and facilitated one of the most high-profile solutions, which has won our client multiple awards.
  • We were selected as the runner-up for the Power BI Global Partner of the Year 2019 for the big data smart analytics solution we created for our customer, embracing many leading-edge Azure services and culminating in a Power BI analytical and monitoring solution.
  • We went on to create a modular and industry-agnostic Digital Twin product, built on Azure big data services, which brings together Power BI and gaming engines in an immersive user experience that seamlessly ties into existing customer Azure investments.

It is this passion, and our focus on R&D and on you, the customer, that makes us a good partner in your Synapse journey.