Time Zones and Daylight Savings in Power BI Service

One of the most important and often-overlooked aspects of Business Intelligence reporting (especially in the modern age of cloud systems) is data latency. There are numerous examples of how this can be detrimental, the most common being decisions made on out-of-date information. For businesses where timing is everything, this can be a serious problem. Unfortunately, the nature of data movement and processing means that data is very rarely ‘live’ in the true sense of the word, so delays between data generation and report consumption are mostly unavoidable.

For this reason, the first two things I put in every report are a measure for the time the data was last refreshed, and one for the time the report was generated. These measures provide visibility into the latency of the data, and serve a critical role in making sure that everyone has the same understanding of what they’re looking at.

Time zones, I think you’ll now realize, are somewhat of a spanner in the delicately interlocked cogs of these measures. And as someone who always selects ‘no’ whenever my phone wants to know my location, I’m the first to recognize that a solution can’t rely on location services doing the math for us. In this blog I’m going to reveal my methodology for keeping everyone on the same page.

Time we can rely on: UTC

In the Power BI Service, there is one time to rule them all – UTC. If you’ve ever published a report from a time zone that’s not close to UTC (like we are in Australia) with a measure based on NOW() or TODAY(), you may have noticed that you get an unexpected result when the report refreshes. This is because the Power BI Service runs on UTC, and as far as I can determine there’s no automatic time zone adjustment for report consumers. The good news is that, knowing this behaviour, we can adjust accordingly.

The only caveat to the above is the unfortunate fact that in lots of time zones around the world, we have daylight savings. So not only do we have to adjust for the time zone, we also have to make sure that daylight savings are taken into account.

So how do we implement this?

The short answer is by using our friend the Date Dimension. There is plenty of information online as to why you should absolutely be using Date Dimensions in all of your models, so I won’t go over that here, but we can consider this application another really good reason for using them.

Let’s do that then!

To make a short answer much, much longer, we’re going to build our very own date dimension. So let’s get started by opening our favourite dimension creation tool: Microsoft Excel.

Before you discount the rest of this post entirely thinking I must be thick if I don’t know that you can create dynamic date dimensions using DAX in Power BI, rest assured I’m right there with you. There are a few very good reasons that we’re going to create a static table in Excel, which I’ll cover shortly.

To start with I’ve created a range of date values (I typically go from 1980-01-01 to 2049-12-31), which should be wide enough to cover most practical applications in 2022. If you’re reading this in 2065, adjust accordingly. I’ve also pre-filled some useful standard date hierarchy columns, for which I’ll post the code and eagerly await your criticism.

Pro Tip: Enter your start date in cell A2 and then navigate to Home -> Fill -> Series and set your last date as the stop value to avoid a scrolling RSI.

As promised, the code to populate these fields:
Day Overall: start at 1 and fill down
Day of Week: =WEEKDAY(A2)
Day of Week Name: =TEXT(A2, "dddd")
Day of Month: =DAY(A2)
Month: =MONTH(A2)
Month Name: =TEXT(A2, "mmmm")
Year: =YEAR(A2)
Day of Year: =A2-DATE(H2,1,1)+1
Fiscal Day: =A2-DATE(H2+(F2<7)*-1,7,1)+1
Fiscal Month: =CHOOSE(F2,7,8,9,10,11,12,1,2,3,4,5,6)
Fiscal Quarter: =FLOOR((K2-1)/3,1)+1
Fiscal Year: =YEAR(A2)+(MONTH(A2)>6)
Fortnight: =FLOOR(B2/14,1)+1
Important note: in Australia our fiscal year runs July 1 – June 30, hence the fiscal formulas above.

Are you having fun yet? Feel free to add more columns as required for your particular applications.

Now comes the slightly curly bit – adding the time zones in. If you’re in a location with no daylight savings, congratulations! You can just add the time zone offset, fill down, and away we go. Me in Adelaide? I have a bit more work to do. Our time zone changes on the first Sunday in April and the first Sunday in October. Wouldn’t want to make it a consistent date, would we? That’d be far too easy. So first we have to identify those days, which we can do fairly easily with a bulky and not at all elegant collection of IF statements:

=IF(D2="Sunday",IF(E2<=7,IF(G2="April",9.5,IF(G2="October",10.5))))

This fills our new Time Zone Offset column with a bunch of FALSEs, except on the first Sunday of April and October:

From here we can copy and paste the values (right click -> Paste Values so we don’t keep the formulas) into a new column. We should also replace the top value (P2 in this example) with the expected time offset, as we’ll be filling down. CTRL+H to find and replace all of the ‘FALSE’ values with blank, then select the blanks by highlighting the entire new column and going to Home -> Find & Select -> Go To Special -> Blanks and hitting OK. Now that we have all of the blank cells highlighted, we’re going to fill down by entering a formula that points to the cell directly above the active cell (for example =P2 if the active cell is P3), and hitting CTRL+Enter to fill the selected cells likewise:

We should now have an accurate time zone offset for every day in our table! Go ahead and save a copy as a CSV for use wherever you need.

This still doesn’t help me in Power BI

As mentioned above, there are perfectly valid ways to do this dynamically in Power BI. And I can already hear you thinking ‘the connection to a CSV is going to break immediately if it’s not hosted in SharePoint or similar’. Fear not, because we’re not actually going to use the CSV beyond being the original source – this is the reason we’ve thus far done everything outside of Power BI. If you’re still confused don’t worry, it’ll all become clear momentarily.

Static Date Dimension in Power BI

As mentioned above, a connection to a CSV in Power BI is almost always a bad play. We also don’t want to use time, energy, and resources unnecessarily to refresh something that is very unlikely to change (because you’ve already communicated with the wider business and established the scope of date filtering needs, right?). So the best case scenario would be to have this dimension just contained in the Power BI report, unchanging and useful. We can do this with small amounts of data by utilizing the ‘Enter Data’ functionality, but that’s limited to 3000 rows – we need much more than that. Luckily, as with most things tech-related, there’s a workaround! (Author’s note: the following process reminds me a bit of the Quake fast inverse square root – if you don’t know what that is and you like stories of when tech and black magic meet, it’s worth a Google.)

Step 1 – Open Power BI and import

This should be fairly basic: we’re going to open a new report in Power BI and import our CSV into the Power Query Editor. The table created should match the name of the CSV; if it doesn’t, either rename it or take note of the name, because we’re going to use it in the next step.

Step 2 – Magic M conversions

We’re now going to create a new blank query and enter the following M code into the formula bar:

= Text.From( Binary.Compress( Binary.FromText( Text.From( Json.FromValue( <TABLE_NAME_FROM_PREVIOUS_STEP_GOES_HERE> ) ), BinaryEncoding.Base64 ), Compression.GZip ) )

This is going to populate a bunch of characters that match what I would expect the output of 1000 monkeys on 1000 typewriters to be if they were properly caffeinated. Don’t panic, this is part of the process:

You will be able to copy directly from the Power Query window, so go ahead and grab that colossal wall of text.

Step 3 – Make the elephant reappear!

Once we have our gigantic wall of text we’re going to go back into the Power Query Editor and create a new blank query again, containing the following M:

= Json.Document( Text.FromBinary( Binary.Decompress( Binary.FromText( "<MASSIVE_WALL_OF_TEXT_GOES_HERE>" ), Compression.GZip ) ) )

If you’ve appropriately aligned your pentagram you should be presented with your date table back (after the appropriate expansion from a list):

This data is totally self-hosted in Power BI, independent of any connections, and 100% portable. We can now delete the first two queries (DateTable and Query1), rename our final Date Dimension Table and save.

Side note: this works for all data, and I believe it’s technically limitless. Of course practically there is a limit, but since it’s performance-independent you’re mostly going to be constrained by how unwieldy the file size gets (though for most applications the date tables are some of the smallest pieces of data). Additionally, because this is held within the Power BI file you can no longer Connect Live to Datasets in the Power BI Service, so if you’re doing composite modelling you’ll need to create these at the same time as you do your other Extract & Load. There are a heap of business applications that this technique may be applicable to – static dimensions are great candidates. Doing so can greatly reduce the amount of data needing to be consumed from source systems, and will likely reduce refresh times accordingly.

What time is it?

All of this and we still haven’t really touched on the crux of the problem, which is that the Power BI Service isn’t maintaining my time zone. We’d best address that.

In all of my reports I include two measures: the datetime the data was last refreshed, and the datetime the report was generated. The former is generated by another query in the Power Query editor.
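A minimal sketch of what that query can look like (a blank query named ‘Date Last Refreshed’ that captures the current UTC time at refresh into a column called Column1, matching the measure further down):

let
    // Capture the refresh timestamp in UTC; the [Time offset] measure adjusts it to local time later
    Source = #table(type table [Column1 = datetime], {{DateTimeZone.RemoveZone(DateTimeZone.UtcNow())}})
in
    Source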

And the latter is just a measure made in DAX. When we create these measures, if we left them unadjusted they would appear fine in our development environment, because Microsoft is pulling from our system settings to apply offsets. But as soon as we publish to the Power BI Service, that’s going to be shifted back to UTC. We now have the offset in our Date Dimension, so we can utilize that to adjust the times – all we need to do is look up what the offset should be for today.

The measures I typically create are relatively straightforward. For ease of comprehension, I typically do the lookup in a separate measure, as follows:

Time offset = LOOKUPVALUE('Date Dimension'[Time Offset (Adelaide)], 'Date Dimension'[Date], UTCTODAY()) / 24

The important thing to note here is that the value returned by the lookup needs to be divided by 24 to give an offset that can be applied to a datetime. From there it’s simple to create the accurate refresh measures:

Date Last Refreshed = "Data Last Refreshed: " & FORMAT(LASTDATE('Date Last Refreshed'[Column1]) + [Time offset], "yyyy-MM-dd h:mm AMPM")

Report Generation Date (Adjusted) = "Report Generated: " & FORMAT(UTCNOW() + [Time offset], "dd/MM/yyyy h:mm AMPM")

And after publishing to the Power BI Service we can see that the refresh times match the local time zone:

There are of course limitations to this approach. With this basic implementation, consumers in a different time zone will still see times that don’t match their local clocks – more advanced implementations with user-specific RLS can do much better. But for the simple purpose of giving your users an understanding of how fresh the information they’re viewing is, this is a very tidy way to do so.

Is visual storytelling with data the new reporting, a Power BI story (Part 2)

In the first article, I described the key concepts involved in visual storytelling and why data visualisation is not necessarily visual storytelling.

Can this be done in Power BI?

We see so many great examples of visual storytelling in digital news articles, for example this one from ABC News – How life has changed for people your age – ABC News (Australian Broadcasting Corporation), where stories and data merge and come to life.

Is this possible through a BI platform such as Power BI?

To answer that, one must look at the audience of a BI solution. The audience is normally the exec who will use the visual data story in a board meeting, a senior manager communicating the key messages to their teams, a team manager who needs to monitor the performance of her/his area, an operational worker who relies on a report for operational tasks, and so on. Those audiences typically do not work with text-heavy artefacts, as is the case with casual news article readers (for example readers of the aforementioned ABC news article), but instead rely on a visually informative reporting landscape. So the question is rather: can Power BI bake visual data stories into the day-to-day reporting landscape?

Let’s revisit the backstory and the four narrative elements (characters, setting, conflict/win, and resolution) of our fake energy provider, ACME, and see if we can craft something that conveys this narrative and leads to actionable insight.

The characters are the customers between 25 and 44 (the environmentally conscious Australian public, and therefore our fake company’s customers in this cohort), and the target audience is ACME’s management (the decision makers) and marketing departments.

The storyframe shows our 25 to 44 cohort as our second largest customer segment, with a high population growth trend and high support for phasing out fossil fuels. It is also worth noting that the future of the 25 to 44 cohort, i.e. the 00 to 24 segment, shows the highest population change, so it’s worth keeping that additional cohort in mind in this analysis.


Further story framing shows that our setting is our 25 to 44 segment, and specifically in NSW as that is where the highest percentage of customer churn is occurring.


The conflict is confirmed through key trend analysis that shows the population growth for our segments as well as the correlation between customer loss, and social media sentiment across data that monitors topics such as Fracking and ACME (our fake company).


Selecting New South Wales confirms our setting and the conflict analysed.


Our resolution is highlighted by the final piece in our Power BI visual story, but instead of drilling into the 25 to 44 segment for New South Wales, we look at the future of that segment, i.e. the 00 to 24 segment. We can see a significant population increase in this cohort (future customers), and an expectation of further customer churn in this cohort if a call to action, and marketing around it, is not taken. The call to action is clearly spelled out.


So, can Power BI be used for effective visual storytelling with data?

My view is yes, and it provides more. Unlike other methods of visual storytelling, which deliver a great user experience but are mostly static, the Power BI visuals shown are interactive, which means they can be analysed for any state, age group, date, etc., and they will change as the data changes (notably customer numbers), including the forecasts. And that is the major benefit of merging the concepts of visual storytelling with data and interactive reporting tools such as Power BI.

PowerPoint integration with Power BI was announced as part of the keynote at the Power BI Summit. Although full details are not yet available, this will enhance the visual storytelling capabilities of Power BI even further.


All of these report pages were built from underlying data (ABS, Lowy Institute, Pop sci, and dummy customer data).

Additional references

Where does Australia’s gas come from? And where can we expect new gas projects? | Climate Council

Data Storytelling: How to Tell a Story with Data (hbs.edu)

Green electricity guide – Energy | CHOICE


This article was originally published here https://www.makingmeaning.info/post/is-visual-storytelling-with-data-the-new-reporting-a-power-bi-story

Is visual storytelling with data the new reporting, a Power BI story (Part 1)

Can Power BI bake visual data stories into the day-to-day reporting landscape?

Dashboards, the data visualisation kind, have now been around for a long time and are the go-to solution that execs and managers rely on for a quick snapshot of how things are tracking in their organisation, department, team, or project.

There has been a trend where data workers or consumers use the term ‘dashboard’ to also describe highly visual and interactive reports, quite possibly due to the fact that both dashboards and highly visual and interactive reports are in fact… erm… highly visual. And both provide visual snapshots of how things are tracking in the organisation, department, team, or project. But it is, strictly speaking, not correct to confuse the two, as they are separate, though related, constructs. One way to differentiate between dashboards and interactive reports would be through the role they can play in visual storytelling.

A second question arises, and that is whether visual storytelling with data is simply the new way to describe dashboards and reports (i.e. data visualisation).

This article lifts the lid on these concepts and shows why visual storytelling is not simply the new word for data visualisation, but it also shows how data visualisation and the tools used for it (in this case Power BI) can be used for visual storytelling. In other words, not all data visualisations tell visual stories, but they can. This article describes how, with Power BI, via examples.

Dashboard

A dashboard is used to monitor performance. It must provide easy-to-understand, at-a-glance information (even real-time) to drive subsequent investigations. Power BI describes a dashboard as “a tool businesses use to help track, analyse, and display data, usually to gain deeper insight into the overall wellbeing of the organization, a department, or even a specific process…Just like in a car, dashboards indicate how far along you are on your journey and how long it may take to get where you want to go.”

Mico Yuk, a leader in the field of the power of data visualization, whom I was fortunate enough to see deliver a keynote a few years back at SQL PASS in Santa Clara, says that a dashboard must have “a current state (where am I now), a trend (how did I get here), a forecast (where will I end up at this rate) and what-if (what can I do to change my goals)” – sorta in line with the Power BI description.

Many people will tell you that dashboards tell stories. Sorry, but by themselves, they simply don’t. Mico, for example, evangelises the concept of a BI Storyboard that uses a framework whereby dashboards and reports are designed and woven together to tell an entire story. Her framework does not include only dashboards. They are merely the starting points that draw your attention to something, which is then analysed further.

The quick-glance key metrics used in dashboards (such as a result, its trends, outliers, etc.) are not storytelling but rather the framing of important information for the user. Brent Dykes calls dashboards exploratory, rather than explanatory. They frame the types of potential insights and stories that can emerge from the data. They therefore provide storyframing rather than storytelling.

Storytelling

Once storyframing occurs, i.e. once attention is drawn to something, the interactive report takes up the mantle and provides full interactive data visualisation, drill-down to deeper levels of detail, and drill-through to context so that the analyst can investigate and understand. Platforms such as Power BI support a seamless transition from dashboard (storyframing) to interactive report, because the analyst, having been triggered by the storyframing, now wants to go on a journey of discovery in as seamless a manner as possible.

Is storytelling simply data visualisation, rebranded?

So, now that we understand the difference between dashboards (storyframing) and interactive visual reports (data visualisation), we can assume that visual storytelling is just data visualisation, right? Nup!!

Data visualisation is just that: a visual representation of the substance of the metrics. It can include charts, tables, and the science behind the visuals, such as shapes, colours, sizing and placement.

Storytelling with data goes further. It pivots around the ideas of purpose and narrative. Each piece of data included must contribute to the intended purpose. Anything (a data point, metric, visual, etc.) that does not directly contribute to the purpose is superfluous. And a visual story must be presented by way of a narrative that focuses on the purpose.

And that, i.e. narrative focussed on the purpose, is the fundamental difference between data visualisation and visual storytelling. Of course the latter includes data visualisation, but data visualisation is not visual storytelling merely because it is visual. It must tell that purpose-focussed story.

How do you develop a purpose focussed story visually?

Narrative, or storylines, will typically contain characters, a setting, conflict, and a resolution. This is the same with data storytelling.

I am going to illustrate this by way of a backstory.

Imagine an energy provider has seen a spike in customer churn, specifically those under 45, the primary customer base. My dashboard may initially have drawn the attention to the downward trend in customer numbers. Deeper insights through interactive reports showed a time correlation between the start date of the downward trend and an increase in negative social media chatter about the company’s planned fracking in Australia.

Crafting the narrative to illustrate this important issue includes:

The characters are the customers between 25 and 44, the environmentally conscious Australian public, and the company’s management and marketing departments.

The setting includes the customers, with a focus on the 25 to 44 year old segment.

The conflict includes research showing how younger customers are increasingly environmentally conscious; for example, it was found that 40% of Australians would be willing to pay more for ethical and sustainable products, with this portion higher amongst younger consumers. It must also include the correlation between the upward trend in social media chatter (and the downward trend in its sentiment) and the downward trend in customer numbers. Showing the potential customer base, i.e. all home owners and renters in Australia between 25 and 44, as well as the population projections, will add important additional context.

The resolution must focus on the company’s plans regarding conventional gas (gas held in porous reservoir rock formations) versus unconventional gas (gas found in more complex geological formations that limit its extraction and require extra steps, such as fracking, which is the most environmentally unfriendly), its participation in GreenPower (a federally accredited scheme that helps customers support renewable energy generation, albeit at the expense of higher electricity prices), and the company’s related Green Electricity Guide score, which is one of the highest in Australia.

Note that the conflict could be replaced with a win: for example, if a campaign exceeds a goal, the story pivots around that win instead.

The next step is to create your narrative by clever use of data visualisations that take the audience on a journey through each of these elements (the characters, the setting, the conflict/win and the resolution). I will show how this is done in the next article.

So what’s the big deal, is data visualisation not as effective as visual storytelling?

Data visualisation can definitely lead to very effective insights (in my backstory the analyst discovered the correlation between the downward trend in customer numbers and an increase in negative social media chatter), but without the accompanying narrative, data visualisations often lack a clear call to action and a direction for that action. My view is that if data solutions lack clear calls to action, they are often meaningless or their intentions are lost.

Is visual storytelling therefore a one-off artefact created to communicate the narrative to an audience?

One could be excused for thinking that visual storytelling is a one-off affair when you look at the backstory above, but there is an increasing trend to use the concepts of visual storytelling for dashboard and visual interactive report design and development. Mico Yuk, for example, promotes the idea of storyboarding when deciding on the most appropriate dashboard/report development.

Dashboards and reports are typically not created as one-off solutions, as the metrics and trends they showcase change as the data changes through time. This means that for developers of solutions such as Power BI, it is important to think about their solutions through the lens of narrative, storyframing and storytelling. Storytelling in data solutions is therefore now expected and pretty important.

In the next article, I describe how this can be done in Power BI.


This article was originally published here https://www.makingmeaning.info/post/is-visual-storytelling-with-data-the-new-reporting-a-power-bi-story

Our Favourite Sessions from the Power BI Summit 2022

The Power BI Summit is the largest conference of its kind, run virtually across five days. It brings together the best and brightest, including Microsoft’s Power BI product group, community experts, and MVPs.

With over 100 sessions and presenters, a huge range of topics were covered across technical, analytical, and business perspectives.

Below are some favourites from our consultants, Teresa Pankhurst and Emma Girvan:

Empowering everyone to get started with Power BI

By Teresa

This session covered some very quick and simple ways to introduce new users to Power BI. My favourite was the Quick Creation feature (in Preview)…

By using the Quick Creation feature in Power BI Service, a user can create a report with charts and analysis with just a few clicks, without needing any prior knowledge or skills in report building or coding – all without even downloading PBI Desktop!

The Quick Creation feature is as simple as the user pasting data from their clipboard (or entering it in). From there Power BI will create a data model; after analysing the data to determine which columns work well together and will produce the best insights, Power BI will create a Quick Summary (see below). The Quick Summary is a pre-built report that you can either use for analysis, share as is or refine to meet your requirements.

Visit this site for more information.

Power BI Embedded

By Teresa

This session covered the end-to-end considerations of the when, why, what, and how of choosing to use Power BI Embedded, including licencing and security.

So, what is Power BI Embedded?

Firstly, it’s important to know that Power BI Embedded is not an addition or extension of Power BI Service but is a distinct offering within Azure. Power BI Embedded is a good choice when you want to share reports with users who do not have PBI accounts (such as customers) and/or you want to embed a report in a custom application.

There are two forms of embedding. Embedding for Organisations is for use internally within an organisation, requires users to be authenticated by Power BI, and requires a monthly subscription commitment. Embedding for Customers is primarily aimed at ISVs (Independent Software Vendors); it doesn’t require user authentication with Power BI, but works with RLS (Row Level Security) and has more flexible licencing (it can be switched off – pay by the hour).

Want to know more? Microsoft have some great resources including a tutorial on how to set up Embedding for Customers step by step:

Power BI Embedded

Embedding for Customers

Source: https://powerbi.microsoft.com/sv-se/blog/power-bi-developer-community-august-update/

Using What-if Parameters for Scenario Planning

By Emma and Teresa

Have you ever been in the modelling section of Power BI and wondered what the What-if parameters section is?

What-if parameters allow users to interact with a report to change the value of measures. They are particularly useful in scenario planning, allowing you to see what happens if… For example, you may want to see how many additional staff are required to increase production by x%, or what the optimal price may be for maximising profit. This is where a well-built What-if parameter can bring great benefit for little effort.

When you generate a What-if parameter, a new calculated table is created, containing a column and a measure. The What-if measure is then applied to model measures (cost, sales, etc.), and as a user adjusts the What-if slicer the model measures respond. For example, a revenue modifier will increase/decrease the revenue and cost measures, whereas a cost modifier will only increase/decrease the cost measure.
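As a rough illustration only (the table and measure names below are placeholders, not taken from the session), applying the generated What-if value to your own model measures looks something like this:

Revenue Modifier Value = SELECTEDVALUE('Revenue Modifier'[Revenue Modifier], 0)
Adjusted Revenue = [Total Revenue] * (1 + [Revenue Modifier Value])
Adjusted Cost = [Total Cost] * (1 + [Cost Modifier Value])

The first measure is generated for you when you create the parameter; the adjusted measures are ordinary model measures you write yourself and then use in your visuals.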

You can also get creative when using two What-If parameters to analyse different scenarios. This example illustrates how, by using the Cost and Revenue What-If modifiers as your row and column labels in a matrix, in conjunction with a blank measure (e.g. BlankMeasure = ""), you can apply conditional formatting to produce a Modifier Matrix that looks like a heat map. This is a great way to allow users to analyse several possible outcomes.

The above chart illustrates the Power BI matrix visualisation which incorporates two What-If parameters plus a blank measure, with conditional formatting applied.

The above image shows how to apply the conditional formatting for the desired result.

The above chart shows 8 different scenarios selected using Ctrl+Click with corresponding figures displayed in a supporting table. It is easy to see what the Effect and Markup is for different Revenue and Cost What-If modifiers.

HOT OFF THE PRESS!

Look out for a new Power BI Desktop parameter feature coming soon (in the next couple of months) which will make life much easier when it comes to creating dynamic report parameters. No longer will we need to use complex DAX functions to create selection measures manually.

Here’s a sneak preview of how it will work.

There will be a new option called ‘Fields’ under the New parameter option from the Modelling menu.

Here you can add a bunch of measures that you want to use in your slicer.

This will create a new table with a new measure in it which can then be used as the Values attribute on your visualisations and slicers.

So easy….and no DAX! (Well apart from the original measures you plan to use)

Must Have Custom Visuals

By Teresa

Users of Power BI will be familiar with the standard set of visualisations that come with Power BI and the large variety of custom visuals available from the Microsoft Marketplace. In this session, the presenters took us through two custom visuals that are actually custom visual designers, allowing you to build your own custom visualisations!

Charticulator is a Power BI Certified, low/no-code custom visual which provides a canvas and tools for you to build bespoke charts according to your requirements. If you know what you want and can drag, drop, and click, then Charticulator is for you. Charticulator is a supported Microsoft product; visit the website for inspiration and information: charticulator.com

Source: https://charticulator.com/gallery/index.html

Deneb is also a Power BI Certified custom visual and although it uses the Vega or Vega-lite languages to create custom visuals, there is a great community of templates and assistance including example visuals with associated code which you can find here. It requires more than the drag, drop, and click skill set, so is perfect for users who are keen to dip into the coding side of custom visuals without needing the skills of a web developer. Having said that, the Vega-lite language is very simple and the supporting website has extensive documentation on the properties available, including graphics on what they do.

Source: https://vega.github.io/vega/examples/

Hybrid Tables

By Emma

Hybrid tables combine the performance of VertiPaq in-memory caches with the capabilities of DirectQuery, allowing users to unlock massive datasets for real-time, interactive analysis. This feature is only available to Premium or Premium Per User customers and is currently in public preview with no date set for general availability.

This session walked and talked us through how the new hybrid table feature leverages the incremental refresh technology.

It utilises one table partitioned into segments, similar to how incremental refresh works; the difference is that the most recent segment uses the DirectQuery method while the other segments use the Import method.

It is recommended as best practice that all dimension tables that join to a hybrid table are set to Dual storage mode so that Power BI can generate efficient native SQL queries when querying the DirectQuery partition, therefore minimising performance issues.

A table configured as Dual storage mode is both Import and DirectQuery, and this setting allows the Power BI service to determine the most efficient mode to use on a query-by-query basis.

There are some limitations to be aware of though.

  • For high cardinality columns you may experience poor performance when using:
    • Distinct Count
    • Min/Max over string columns
    • Analyse in Excel
  • The function ApproximateDistinctCount is not supported
  • A hybrid table cannot have aggregation tables and cannot be an aggregation table itself
  • A hybrid table cannot have calculated columns

Read up on the Microsoft documentation site for more information on hybrid tables.

Top Tips and Tricks

By Emma

With Power BI always on the move bringing us new tools, features, functionality and all things shiny, it’s sometimes easy to forget some of the more…let’s say….less glamorous aspects of our daily BI lives. Remembering to follow best practices and good design principles from the get-go will make your life so much easier further along the development cycle.

Here’s just a few tips and tricks that may not be new to you but may have disappeared from the forefront of your mind while you’ve been building your awesome BI solutions recently.

  • When performing your data preparation tasks
    • Set up configuration parameters (e.g. the source data file path).
      Connection strings can be used in many places so if that were to change in the future, you would then have to repeat that change in all those places where it is referenced. Avoid rework and duplication at all costs.
    • Remove extra columns that are not needed. They take up valuable space and can impact performance. Listen out for the “give me everything now and I’ll work out what I want later” phrase that we’ve all heard before.
    • Set the correct column data types, particularly number of decimals if relevant
      • If you don't need more than 4 decimal places, use Fixed decimal number
    • Use business friendly column names – it will save you needing to change them in the visual layer or at the very least help the report developer decide the correct attributes to use in the visualisations.
    • Set up parameters for Test/Production data so you can easily switch between environments and help with an efficient development and deployment process (see the sketch after this list)
  • Settings
    • Turn off the global setting for date/time. Power BI creates a hidden date table for every date field in your dataset whether you need it or not. Since you should always have a Date Dimension in your dataset, there should be no need to use the auto date/time feature.
    • Turn off Auto Relationship in the Current File settings because Power BI will often create an incorrect relationship and trip you up.
    • Avoid using the ‘Both’ setting for cross filtering (when creating relationships). You need to be sure why you need a cross filter to both tables and fully understand the risks and nuances with doing so.
  • Data Model Design
    • Add a Date Table and mark it as a date table (or time intelligence won’t work)
    • Set column categorisation (Web URL, Geography, etc.)
    • Change summarisation of columns that don’t require it (set columns to Don’t Summarise). For example, you will never want to add up a post code.
    • Create hierarchies to make it more efficient to add several levels of a set of attributes (e.g. geographical attributes such as country, state, city, address)
    • Hide unwanted columns (e.g. those used in a hierarchy). These are most often the key columns used in creating relationships, but could also be the fields used in hierarchies.
  • Visualisation Design
    • Use report templates (for consistent look & feel)
    • Use explicit measures – avoid using columns from Fact table to perform aggregations
    • Create folders for measures to easily group them up together. It makes it much easier to find them when you’re building out the visualisations.
    • Hide visual headers for slicers/text boxes/buttons, etc. where the user does not need to use any of the visual header features.
    • Avoid using visual level filters where possible to allow for full transparency across commonly used slicers (use filters in the measures where possible).
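To illustrate the configuration/environment parameter tips above, here is a minimal Power Query sketch (the parameter, server, and database names are placeholders only, not from the session):

// A text parameter (as created via Manage Parameters) used to switch environments
Environment = "Test" meta [IsParameterQuery = true, Type = "Text", IsParameterQueryRequired = true]

// A source query that derives its connection from the parameter,
// so an environment change only ever needs to be made in one place
let
    Server = if Environment = "Production" then "prod-sql.example.com" else "test-sql.example.com",
    Source = Sql.Database(Server, "SalesDW")
in
    Source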

Power BI – Paginated RESTful API as a Data Source

Introduction

Power BI has an ever-expanding list of supported data connections: Databases, Online Services, Python Scripts, and more. At first glance, it appears that they have pretty much everything covered. There is one notable exception however, and it came to light in a recent engagement: RESTful API connectivity. ‘Not to worry though’ I hear you say, ‘PowerQuery M solves this problem for us!’. Well, yes, but it’s a bit more complicated than that. Luckily this post is here to provide spiritual and technical guidance through the various spiky pits that litter the pathway to successfully having a RESTful data source refreshing on a schedule in the Power BI Service. But before we get to the how, we must first understand the why.

To make a short story slightly longer (and provide context), exposé was recently engaged with a client who was looking to get more insight into the performance of their development team. As a primarily Microsoft-based organisation, the reporting platform that was most utilised (and best understood) was Power BI, and the flexibility in development that Power BI provides was something they wanted to maintain. However, as part of the movement towards more measured project management and bug tracking, the business had decided to migrate to a Jira platform, which does not natively connect to Power BI. Whilst Jira does have a (quite good in my opinion) native reporting platform, the business wished to keep all of their reporting in the Power BI Service, for both the reasons listed above and minimisation of technical debt (report sprawl through multiple platforms is a one-way ticket to an expensive amalgamation effort). So we ended up in somewhat of a square peg and a round hole situation.

The business benefits of solving a problem of this nature are plainly obvious. In any business there are going to be multiple data sources that may not necessarily play nicely with each other, and this can cause serious dilemmas when trying to gain holistic insights into how the business is running. For larger organisations needing high numbers of reports it may be prudent to engage a team of Data Engineers who can cultivate, format, and model the data to be easily imported and visualised by the visualisation specialists (in fact, this is recommended). For smaller organisations or small numbers of reports, however, this large-scale operation may not be financially viable. Luckily most web-based services have a common data output pipe that we can attach to in such circumstances: a RESTful API. Connecting this pipe to a Power BI Service-hosted dataset, however, does come with a number of challenges.

Challenge #1: Power BI has no native RESTful connection

When we create a dataset in Power BI (standalone datasets are almost always better, but that’s a topic for another post), we need to connect to data. But how do we do this when there’s no native connection? Searching for ‘API’, ‘REST’, or (in this instance) ‘Jira’ in the ‘Get Data’ connection source window gives us a frustrating 0 results. But RESTful APIs are served via Web request, so we can utilise the ‘Web’ connection and start on our journey.

When we open a Web data connection we’re presented with both a ‘Basic’ and ‘Advanced’ series of parameters. For the purpose of this we’re going to do most of our configuring under the hood, so we can start with a Basic connection. If you’re reading this I’m assuming you’re somewhat familiar with how APIs work, so you can simply plug the web URL of the service’s API into the ‘URL’ box. For Jira (to get a list of all issues), this takes the format of https://<site URL>/rest/api/2/search?jql=

Clicking ‘OK’ will then prompt us to choose our authentication method, and if we click ‘Organisational account’ we can enter our username and password and data will start to flow! Congratulations, you’ve connected to a RESTful API, simply publish and start building reports…

… not really. Unfortunately, this leads nicely into the next challenge.

Challenge #2: Avoiding Organisational Authentication

There are a number of reasons to avoid using personal organisational authentication when creating a dataset, but one is good enough: if you develop something and your account is disabled, things break. To avoid this it’s best practice to use an authentication token, which Jira is happy to provide. Ideally this is linked to a service account for the same reason.

The easiest way to inject this token into our new data connection is via the PowerQuery editor. Clicking on the source we just created you should see something in the formula bar along the lines of:

= Json.Document(Web.Contents("https://<site URL>/rest/api/2/search?jql="))

We can place our Authentication token into the headers of this request by modifying it to the following:

= Json.Document(Web.Contents("https://<site URL>/rest/api/2/search?jql=", [Headers=[Authorization="Basic <authentication token>"]]))

This should now allow us to change the Data Source credential type to ‘Anonymous’ (File -> Options and Settings -> Data source settings -> Edit Permissions -> Credentials -> Edit… -> select Anonymous and save) while still being able to refresh the data. At least, locally…

Challenge #3: Refreshing the Data Source in Power BI Service

We now have some data that we can play with, expand on, chop, change, move, and model. Most of the time when we get to this point it’s time to start building measures and calculated columns, sanitise things, publish to the Power BI Service and away we go. In this particular instance however, we’re not quite there yet.

The reason for this is that if we published the dataset as it stands, we would not be able to refresh it (nor schedule a refresh) in the Service. This is because the Power BI Service will throw an error if we try to refresh a dynamic data source. What this basically means is that the Service must be able to validate that the query runs successfully without needing authorisation before applying the authorisation token. Microsoft documentation of this issue is available here.

This means that a small manipulation is required to the M code to break the URL up, allowing for the first part of the URI to be verified as runnable by the Service before adding our direct linking to the information we’re trying to get. This is done as follows:

In our ‘source’ step, we need to modify the formula (via the formula bar) to separate the Web.Contents() string into sections – the first (main) section that can be accessed without needing authorisation, and a RelativePath section that takes us the rest of the way. The format is as follows (pseudocode):

Web.Contents("https://<publicly accessible URL>", [Headers=[Authorization="Basic <authentication token>"], RelativePath="/<path to be appended to public URL>"])

In our Jira example, our RelativePath is "/rest/api/2/search?jql=". By modifying the source step for our dataset and publishing to the Service, we are now able to refresh (and schedule refreshes on) our data! Well, at least the first 50 results.
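For reference, the fully modified source step for our Jira example looks something like this (a sketch only – substitute your own site URL and token):

= Json.Document(Web.Contents("https://<site URL>", [Headers=[Authorization="Basic <authentication token>"], RelativePath="/rest/api/2/search?jql="]))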

P.S. On first publishing and refreshing this data source you will run into an authentication issue in the Service indicating that you need to provide credentials. This is fixed by navigating to the settings of the data source, selecting ‘Anonymous’ as the authentication method, and opting to skip the test connection.

Challenge #4: Overcoming Pagination

The default behaviour of a lot of cloud services is to serve paginated results from their APIs (hence the 50 results above). This is probably wise from the point of view of the services, as it prevents poorly (or lazily) designed queries from consuming unnecessary bandwidth (and potentially cost). However, for our applications we really need more than one page of data, so we need to find some way to overcome this.

In some instances, we are able to modify the page size up to a more reasonable number (say, 100,000 results). However, the Jira developers (and those of most larger cloud-based SaaS providers) saw this coming and hard-limited the number of results returned by the service to a maximum of 100. For our application that means we have to take a bit more of a programmatic approach, as we definitely want more than that.

To best understand the next bit it is important that we have a slightly better understanding of PowerQuery M. At its core it is a functional programming language, which has a bunch of features that are rarely explored fully due to the immense drag-and-drop (or no-code) capabilities contained in the PowerQuery engine. These programmatic characteristics have a multitude of uses, and I would encourage more BI developers to dive under the hood to see what potential improvements can be made to their data wrangling. For this application we want to utilise these programmatic characteristics by writing a loop function that returns all of our results.

There are a few good blog posts that explain how we can create functions in PowerQuery M (including specific code examples, which I won’t go over here) to solve problems, including some that cover looping and pagination. The general methodology is to create a lovely query that utilises parameters to pass information to a function which loops through the pages and requests all the information from the API, which it then concatenates back into one dataset. There is however a problem with this approach, which is that refreshing functions in the Power BI Service is not currently supported.

There’s one final piece to this puzzle, which is hidden a bit in plain sight: the invocation of custom functions as a query step. Instead of creating a single almighty query that does everything we need, in order to have functionality in the Service we need to break it into two steps: firstly, generate all of the parameter information (i.e. number of pages, number of results per page, etc.), and then pass that information to a function that retrieves the data from the API. The best way (I’ve found) to do this is by use of a custom column, as sketched below. Again we can modify the M code to pass more information (parameters) into the function if we need to, which will help us jam more functionality in and supersede the ‘normal’ use of a custom column.
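As a rough sketch of that pattern (assuming Jira’s standard startAt / maxResults / total paging fields and reusing the placeholders from earlier; the query and function names are mine, so adjust everything for your own API):

// Query 1 – "Issue Pages": work out how many pages exist and build a table of offsets
let
    PageSize = 100,
    FirstPage = Json.Document(Web.Contents("https://<site URL>", [
        Headers = [Authorization = "Basic <authentication token>"],
        RelativePath = "/rest/api/2/search",
        Query = [jql = "", maxResults = "0"]
    ])),
    Total = FirstPage[total],
    Offsets = List.Numbers(0, Number.RoundUp(Total / PageSize), PageSize),
    PageTable = Table.FromColumns({Offsets}, {"StartAt"}),
    // The custom function below is invoked via a custom column – one API call per page
    WithIssues = Table.AddColumn(PageTable, "Issues", each GetIssuePage([StartAt]))
in
    WithIssues

// Query 2 – "GetIssuePage": a custom function that retrieves a single page of issues
(startAt as number) =>
let
    Response = Json.Document(Web.Contents("https://<site URL>", [
        Headers = [Authorization = "Basic <authentication token>"],
        RelativePath = "/rest/api/2/search",
        Query = [jql = "", startAt = Text.From(startAt), maxResults = "100"]
    ]))
in
    Response[issues]

From there the Issues column can be expanded back into one flat table. Note that the URL is still split with RelativePath (and Query), which keeps the source refreshable in the Service as per Challenge #3.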

Congratulations, you can now use REST APIs as a Data source in Power BI

To recap, the strategy we have employed is formed of three parts, enforcing a number of best practices whose necessity will hopefully now be obvious:

  1. Circumvent the use of personal accounts in preference of organisational tokens
  2. Modify our query to allow for successful refreshing of our data source in Power BI
  3. Use the programming functionality of PowerQuery M to creatively overcome stagnation in pagination by way of a custom function and custom column.

This strategy is an extremely effective way to simply, quickly, and cheaply move cloud-based platform data into Power BI, without necessitating a large-scale architectural solution. By no means should this be used as a replacement for such a solution if the use case dictates, but for small applications or interim measures on critical data it is extremely effective. Between the engagement for which this method was developed and the writing of this blog I have also confirmed this works for all tested RESTful APIs, so the example here utilising Jira is just an example and not a limitation.

Limitations and Considerations

There are a few limitations to the use of this technique that are noteworthy – primarily around visibility, frequency, and cost.

Visibility

This technique does require some specialised knowledge to develop, modify, and maintain. Whereas a more conventional data engineering solution may have a commonly-held language such as Python or Scala at its core, using PowerQuery M to this level may limit the pool of developers that are able to maintain the solution. This is especially important in a consulting context – some consultancies may see this as job security, but at exposé we want to empower the businesses we work with, not hamstring them.

Frequency

This technique is not really suited to microbatch or near-live refresh frequencies, which is true of nearly all RESTful API solutions. Due to the number of back-and-forth communications required to retrieve the data from the platform, and the type of retrieval going on, there are far fewer optimisations that can be made to reduce query refresh time. This is not to say it can’t be done, but there is definitely a lower limit that is largely dependent on the implementation of the API. For significant datasets this will be reflected in the refresh time.

Cost

Some platforms charge on a per-API-call basis. Due to the pagination we can be making a significant number of calls to the API, so it is important to understand the costing structure and make sure the business is aware of it; no-one wants to be stuck with an unexpectedly large cost. An important query optimisation in this respect is making sure that our page sizes are as large as they can be. In our Jira example, the default page size is 50 and the maximum is 100, meaning that we would need twice as many API calls (and potentially twice the cost) if we left the page sizing at its default. An additional optimisation is limiting the frequency of refreshes, which will impact cost similarly.

Ethics in artificial intelligence (AI)

In this article I will explore some important concepts including:

  • What AI actually is and why it’s important.
  • The different types of AI systems.
  • Some examples of ethics issues in AI that are already here, and some potential consequences of AI in society.
  • What are the cloud vendors doing?
  • What can you do?

But before we can decide what is wrong and what is right we need to understand what AI actually is.

What is Artificial Intelligence (AI)

Official definitions of AI range from “…systems or machines that mimic human intelligence to perform tasks and can iteratively improve themselves based on the information they collect” to “the ability of a digital computer…to perform tasks commonly associated with (human) characteristics, such as the ability to reason, discover meaning, generalize, or learn from past experience”. The theme in most common definitions is that AI ‘mimics’ human behaviour. That is pretty important; I will come back to it later.

I won’t go into loads of detail describing deep learning and machine learning as subsets of AI – it’s just important to understand that these layers are the mechanism at our disposal when we create AI solutions, and so the ethics discussed here must be considered when we create these solutions.

https://www.geospatialworld.net/blogs/difference-between-ai%EF%BB%BF-machine-learning-and-deep-learning/

Why is AI and its ethics important?

The boom in data and the maturity of cloud computing (innovative technologies as well as computing power) are driving increasing adoption of AI. A PwC Global CEO Survey projects a potential contribution to the global economy from AI in excess of $15 trillion by 2030. This significant contribution to economic growth means we need to start thinking about how we think and talk about AI and data.

As AI becomes a larger part of our lives, increasingly it will replace functions that humans perform. This is true by definition: AI is designed to mimic human behaviour. However, while AI may replace human behaviours, should it replace the humanity in these behaviours? This is the question that we all must contend with. Do we inevitably lose nuance, fairness, empathy and understanding when we lose the human? This is where AI ethics comes in, and consideration and continued development of ethics in AI is how we ensure that our brave new world is not Brave New World.

AI Ethics is concerned with the principles around the development and use of AI that ensure that:

  • The use of AI improves the lives of individuals and societies
  • AI systems are fair and respect the dignity and autonomy of people throughout the systems’ lifecycles
  • Decisions made and actions taken by or due to an AI system are reasonable, able to be understood and challenged, and have an accountable party

When we talk about AI ethics, it is worth understanding more about AI and its interface with ethics. Starting with AI, we often classify it into narrow and general intelligence:

Artificial Narrow Intelligence (ANI)

(Also called weak AI) is focused on performing a specific task and represents most of the AI that surrounds us today. ANI is not conscious, it has no sentience, and it is not driven by emotion. It operates within a pre-determined, pre-defined range of behaviours and cannot operate beyond those parameters. A recent example of ANI would be the new Azure Purview data catalog engine that classifies data, but only according to patterns and expressions available to it, for example 16 character numerical data fields are likely credit card numbers.

But ANI is not ‘intelligent’. For example, the data catalog engine will never ponder the meaning of life (unless it’s contained within a credit card number) – and it certainly won’t participate in the complicated realm of philosophy. But within the domain of identifying and classifying data types, Azure Purview appears intelligent. Likewise, asking a virtual assistant what the time in New York is right now will provide an accurate and maybe even conversational answer, because the virtual assistant is an ANI within the domain of simple conversation. There is an appearance of intelligence, but this is because the parameters of the question are clear, calculable and unambiguous – we’ve all experienced how a chat bot behaves when we stray outside the conversations it is trained to have.


ANI can therefore only do what it is designed to do; for example, Purview can process my request, enter it into its own index catalog, and return the location of credit card numbers in my datasets. It is not truly intelligent, even though it can definitely be very sophisticated. VERY sophisticated in fact: think of IBM’s Deep Blue, which beat world chess champion Garry Kasparov in 1997, or Google DeepMind’s AlphaZero, which is able to teach itself board games to a superhuman level in a matter of hours. Even so, ANI is restricted to the space of problems it is designed to produce solutions to.

Sophisticated ANI has made our lives easier by relieving us of mundane, repetitive, frustrating or unsafe tasks. Look at Spot, a complicated AI system (and a robot dog) that will be able to inspect powerlines for South Australia Power Networks.

But some would argue that the fact that ANI can replace the need for humans to perform repetitive tasks does not mean it does not pose a perceived threat. ANI can power robots, like Spot, to replicate tasks in an assembly line, or to inspect powerlines based on variables, but what does that mean for the livelihoods of the thousands of workers who used to perform those tasks? This creates an ethical dilemma.

Artificial General Intelligence (AGI)

(Also called strong AI or singularity) is focused on performing and adapting to general intelligent tasks, much like a human can. As humans we can think abstractly, we can strategise and tap into our consciousness and memories when we decide on a course of action, and we can acquire new skills ourselves. AGI will require systems to be able to comprehend and learn in a general way (not just restricted to board games, for example) when deciding on a course of action. This type of AI has not been realised yet. However, the race to AGI has led to many machine learning innovations that underpin current AI, including deep learning algorithms – https://www.google.com.au/amp/s/fortune.com/2021/09/07/deepmind-agi-eye-on-ai/amp/.

Still, these complex algorithms only know what they have been shown. For example, it takes thousands of labelled photos to train an image recognition model. So there is still a long way to go before we see real AGI.

Some experts predict that the development of true AI could be achieved as early as 2030, with the emergence of AGI by the year 2060 – https://www.google.com.au/amp/s/www.forbes.com/sites/cognitiveworld/2019/06/10/how-far-are-we-from-achieving-artificial-general-intelligence/amp/.

Others argue that AGI is not possible at all, as we don’t even fully understand our own brains. Diego Klabjan, a professor at Northwestern University, for example states: “currently, computers can handle a little more than 10,000 words. So, a few million neurons. But human brains have billions of neurons that are connected in a very intriguing and complex way, and the current state-of-the-art [technology] is just straightforward connections following very easy patterns. So going from a few million neurons to billions of neurons with current hardware and software technologies — I don’t see that happening” – https://www.analyticsinsight.net/experts-foresee-future-ai-agi/

Some view AI through the prism of the threats it poses. Stephen Hawking warned that AI could wipe us out once it becomes too clever: “If AI itself begins designing better AI than human programmers, the result could be machines whose intelligence exceeds ours by more than ours exceeds that of snails.” And Elon Musk stated that “AGI is humanity’s biggest existential threat”. Efforts to bring it about, he has said, are like “summoning the demon.”

Whether you believe AGI is possible or not, or even believe it is going to wipe us out, this article will not provide a running commentary on the validity of the opposing arguments. The focus here is rather on those systems that mimic human behaviour, and AGI is not required for this. ANI can mimic human behaviour in a limited way, and that is more than enough to begin considering the ethical implications. This is especially true for “AGI-like” systems such as virtual assistants, which are intended to appear generally intelligent but are in fact only extremely sophisticated collections of multiple ANI systems working in synergy. We typically call these tools Siri, Google Assistant or Cortana, and they have entered our homes and talk to our children.

However, virtual assistants are not the only examples of current complex applications of AI systems.

Optimised, personalised healthcare treatment recommendations

Even though there are many opportunities for AI in healthcare – think of the scenario where AI can diagnose skin cancer more accurately, faster and more efficiently than a dermatologist can (https://www.nature.com/articles/nature21056) – a World Health Organisation (WHO) report raises concerns about AI in healthcare, including “unethical collection and use of health data, biases encoded in algorithms, and risks to patient safety, cybersecurity, and the environment”. The report also raises concerns regarding the subordination of the rights and interests of patients in favour of powerful companies with commercial interests – https://www.healthcareitnews.com/news/emea/who-warns-about-risks-ai-healthcare

Even though AI has come leaps and bounds in medical care, healthcare policy and ethical guidelines are lagging behind the progress AI has made in this area. One major theme to be addressed is how to balance the benefits of AI with the risks of the technology; a second is how to interpret and deal with the legal conflicts that arise with the use of AI in health care (for example, can a machine be liable for medical malpractice?) – https://journalofethics.ama-assn.org/article/ethical-dimensions-using-artificial-intelligence-health-care/2019-02

Driverless delivery trucks, drones and autonomous cars

Another example involves scenarios where human judgement means we can sometimes act illegally, because the judgement is not only informed by what is legal and what is not – the decision is also wrapped in morals and sentiment, sprinkled with experience, and lathered with emotion.

Imagine a busy road with what appears to be a box lying on it. At that point the road has a solid no-overtaking line, and there is no oncoming traffic. To avoid the obstructing box, we would simply drift a little into the opposite lane and drive around it. But an automated car might come to a full stop, as it dutifully observes traffic laws that prohibit crossing a solid line. This unexpected move would avoid driving over the box, but would likely cause a crash with the human drivers behind it.

Pushing this example a little further, if the automated car were forced into a scenario where it has to make difficult decisions, say between colliding with different pedestrians, how should it resolve this? Do AI agents need to have an ethical (or even moral) framework built into their behaviour?

Algorithmic bias

AI is already being used in healthcare, but it is not yet hugely pervasive, and autonomous cars are still in their infancy. Data scientists all over the world, however, have been creating machine learning algorithms for prediction and decision support for years. So when it comes to ethics, this is, in my humble opinion, the area in most urgent need of ethical consideration.

Algorithms are created and trained by humans, and as humans we are knowingly or unknowingly biased, which may be reflected in the algorithms. Or rather, the data passed through the algorithms may itself contain bias due to the way in which it was collected, processed or stored.

Amazon, for example, scrapped a recruitment algorithm a few years back. It was trained on data submitted by applicants over a 10-year period, most of whom were men, so the data itself contained bias, which was reflected in results that were in essence discriminatory towards women applicants. The system did not rate candidates in a gender-neutral way because it was built on CVs accumulated mostly from men – https://www.bbc.com/news/technology-45809919

And in 2020, Microsoft’s AI news aggregator, MSN, came under fire for pairing a photo of Leigh-Anne Pinnock with an article about her Little Mix bandmate, Jade Thirlwall. Both women are mixed race. In the same year, Twitter had to remove its automatic image-cropping algorithm because it appeared to favour certain races – https://www.bbc.com/news/technology-57192898.

Some research is starting to reveal issues with decision making that relies heavily on algorithms as these algorithms often replicate, or in some cases even amplify, human biases, particularly those affecting protected groups.

These examples (and there are many more), the explosion of data, and the increasing adoption of machine learning and AI show why it is important to start thinking about frameworks around ethics in your AI sooner rather than later. But let’s first look at what the main vendors – Microsoft, Google, Amazon and IBM – are doing in this space.

What are the main cloud data and AI platform vendors doing about ethics in AI?

A few years back, IBM launched a tool that can scan for signs of bias and recommend adjustments to the data or algorithm. The open-source nature of the tool, and the ability to see via a visual dashboard how algorithms are making decisions and which factors are being used in the final recommendations, also mean that transparency is improved.

Similarly, Google launched the What-If Tool to analyse ML models, allowing users to manually edit examples from a dataset and see the effect of those changes. It can detect misclassifications and unfairness in binary classification models, and surface model performance across subgroups.

Microsoft, too, launched the Fairlearn SDK, which can detect and alert people to AI algorithms that may be treating them differently based on race, gender or other attributes.
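Tools such as these make group-level disparity checks fairly approachable in code as well. Purely as an illustration (not a walkthrough of any vendor’s product), here is a minimal sketch using the open-source Fairlearn library to compare a model’s accuracy and selection rates across groups; the data and the sensitive feature are hypothetical placeholders.

```python
# A minimal sketch of a group-fairness check with the open-source Fairlearn library.
# The data below is made up; in practice y_true and y_pred come from your own model.
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # observed outcomes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # model predictions
sensitive = ["F", "F", "M", "F", "M", "M", "M", "F"]  # hypothetical sensitive feature

# Accuracy broken down per group highlights whether the model treats groups differently
frame = MetricFrame(metrics=accuracy_score,
                    y_true=y_true,
                    y_pred=y_pred,
                    sensitive_features=sensitive)
print(frame.by_group)

# Difference in selection rates between groups (0 would indicate demographic parity)
print(demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive))
```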

But beyond tools, what about supporting frameworks?

Microsoft (Azure) and Google Cloud Platform (GCP) are tackling this explicitly through frameworks and principles:

Microsoft, for example, state that they are “committed to the advancement of AI driven by ethical principles that put people first” and have created principles and toolkits to help with the ethical implementation of AI – https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6

Google state that “AI is transforming industries and solving important, real-world challenges at scale. This vast opportunity carries with it a deep responsibility to build AI that works for everyone”, and provide their own principles and resources to help with responsible AI – https://cloud.google.com/responsible-ai

While its rivals Microsoft and Google have published frameworks for ethical AI that guide their work, AWS instead puts the onus on its customers to use the technologies appropriately: “AWS offers some best practice advice relating to its customers’ use of data, but has stopped short of laying out its own guiding principles. It is up to clients to decide whether their use of AWS tools is ethical, said the company’s head of solution architecture in ANZ, Dr Peter Stanski.” – https://www2.computerworld.com.au/article/661203/aws-ethical-about-ai-we-just-don-t-talk-about-it-say-apac-execs/. But they are not complacent by any means: in 2020 they announced a one-year moratorium on police use of their facial recognition software until there are stronger regulations to govern the ethical use of face recognition software – https://www.aboutamazon.com/news/policy-news-views/we-are-implementing-a-one-year-moratorium-on-police-use-of-rekognition?ots=1&tag=curbedcom06-20&linkCode=w50

So what should you do?

Depending on your AI platform of choice, there may be tools and technologies in place to help you analyse your models, highlighting issues you may not be aware of. You must develop and embed your own official policy and operating framework regarding the ethical use and creation of AI, and there may be toolkits and other resources you can use to inform such a framework (i.e., leverage the legwork and research already done by others as a starting point – for example, the AI Ethics Framework and principles developed by the Australian Department of Industry, Science, Energy and Resources). Such an operating framework must be embedded into your organisation and its data workers (people).

Operational framework – create a framework that changes the paradigm from simply creating great AI solutions to creating great AI solutions with ethics as a core consideration in their creation and ongoing management:

  • Determine the challenges that may be unique to your organisation.
  • Create an official policy and commitment towards ethics in AI.
  • Develop procedures on how you will manage the initiation, creation, testing and ongoing management of AI solutions – including those provided by technology.

People – ensure your AI-related data workers are familiar with the ethical challenges of AI broadly, as well as those that may be unique to your organisation. Ensure your operational framework is embedded into your organisation via appropriate change management.

  • Make research available for all to consume.
  • Ensure your data workers are across the approaches and tools available on your platform of choice.
  • Embed your operational framework into your data workers’ workloads (similar to how you would embed data security policies and regimes into your organisation).
  • Embrace collaboration from your data workers and use their experiences, as well as new information in the broader data domain to review and refresh your operational framework on a regular basis.

Technology – use the technologies at your disposal to augment your operational framework.

  • Use technology to highlight issues with AI and your models (for example, the Google What-If Tool or Microsoft’s Fairlearn SDK).
  • Understand and deploy the appropriate toolkits to augment technology and processes.

Data – ethically collect, store and use data

  • Collect and store data with the consent of the party from which the data is being collected.
  • Analyse data during collection to identify and eliminate potential sources of bias (see the sketch after this list).
  • Use good governance of data, ensuring appropriate controls exist for confidentiality, privacy, security and quality.
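As a concrete illustration of the point above about analysing data during collection, here is a minimal sketch (hypothetical column names, arbitrary threshold) that checks group representation in a collected dataset and flags groups that are under-represented:

```python
import pandas as pd

# Hypothetical applicant dataset; in practice this would come from your collection pipeline
df = pd.DataFrame({
    "applicant_id": range(1, 9),
    "gender": ["M", "M", "M", "F", "M", "M", "F", "M"],
})

# Share of each group in the collected data
representation = df["gender"].value_counts(normalize=True)
print(representation)

# Flag groups that fall below a business-defined representation threshold
THRESHOLD = 0.3
under_represented = representation[representation < THRESHOLD]
if not under_represented.empty:
    print(f"Warning: under-represented groups detected: {list(under_represented.index)}")
```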

Conclusion

AI will surely become more present in all of our lives as we approach the possible advent of AGI, but even as the technical capability to deliver AGI is developed, we all have a duty to consider the ethical impact of AI. In particular, we need to consider the ethical aspects of creating and using AI, and the ongoing ramifications of biased AI. To serve this, the tools and frameworks we develop and use must have ethics built in.

In many ways the ethical aspects of AI are more fiendish problems than the purely technical aspects of AI. The developments in this space from Microsoft, Google, Amazon, IBM and others are an important investment in a fair, humane future, but the onus is on us all to develop operating frameworks that ensure AI benefits us all.

Additional References

https://ai.googleblog.com/2018/09/the-what-if-tool-code-free-probing-of.html

https://www.bbc.com/news/technology-45561955

https://www.microsoft.com/en-us/ai/responsible-ai-resources

https://www.independent.co.uk/life-style/gadgets-and-tech/news/stephen-hawking-artificial-intelligence-could-wipe-out-humanity-when-it-gets-too-clever-humans-could-become-ants-being-stepped-a6686496.html

https://www.europeanbusinessreview.com/find-what-elon-musk-said-about-real-world-ai/

https://www.aiethicist.org/frameworks-guidelines-toolkits

Acknowledgement

Thank you to Rory Tarnow-Mordi who contributed and advised.

This article was originally published here – Ethics in artificial intelligence (AI) (makingmeaning.info)

The Purple Peeps’ Seven Favourites from Ignite 2021

What a wonderful couple of days at Ignite! It’s fascinating to hear all these exciting announcements and top tips from experts globally. One of the best things about being a Microsoft Partner is the access to the technologies, products and tech community, and all of these together empower us to drive impact so we can accelerate our customers’ data journeys.

As we come to the end of this year’s Ignite, of all the sessions around data and AI, here are the ones our team liked most.

Etienne’s Best Picks

One of our focus areas over the past year has been Synapse, and more recently Purview. Exposé has in fact recently concluded a major program of work in which Synapse performed a central analytical function, unifying data across multiple sources, contending with billions of records, and making visual insights available to users within seconds.

It is therefore fantastic to see the various improvements and announcements regarding Synapse, and a closer integration with Purview. Here are some highlights:

Azure Synapse Link

Link provides near real-time integration from Dataverse, Cosmos DB and SQL. With Link, operational data can be integrated into Synapse Analytics without needing to build ETL pipelines. Link for Dataverse is now GA, Link for Cosmos DB is in private preview, and Link for SQL Server 2022 is expected soon through the upcoming CTP. The business benefit here is an immediate data analytics layer over your operational data. This again changes the whole paradigm of what a modern data warehouse is, as it brings operational data into that warehouse in near real time.

Azure Synapse Database templates

Database templates are now in preview. These templates are designed around common industry data warehouse models and will accelerate the build of your modern data warehouse. Each template includes base entities that can be used as a blueprint (or a starter) for your data warehouse, along with a code-free designer that allows users to map, transform and load data into these models. The business benefit here is a massively accelerated path towards a data warehouse design that applies to your industry (a starter model that can of course be changed to suit your specific needs), plus a code-free way to load your data into the resulting models.

Azure Synapse Data Explorer

Now forming part of Synapse, Data Explorer allows you to analyse text-heavy log or event data in near real time and contextualise that data with your other business data to assess impact and gain insights. Structured and unstructured data can easily be mashed together, in near real time and at scale. The business benefit lies in the ability to contextualise your event or log data with business data quickly and at scale, which adds another vantage point within your modern data warehouse.

Azure Purview Catalogue Search now in Synapse

Purview, only recently released, already has over 57 billion data assets catalogued. It is going to become an invaluable piece of technology for data discovery and governance. The business benefit is that a data worker, such as a data engineer building authoritative data transformations in Synapse who does not yet know what data is available, can search and discover datasets across the data ecosystem right there within the workspace – a huge productivity gain. Please feel free to see our recent video discussion on Purview and modern data governance here.


Jake’s Best Picks

Azure OpenAI Service

Azure OpenAI Service brings OpenAI’s powerful GPT-3 natural language model to the Azure platform. Already used in some of Microsoft’s existing services and in GitHub’s Copilot tool that helps developers write code, this functionality can now be leveraged by developers in their own solutions. The business benefit here is that the OpenAI Service delivers this powerful natural language model to customers within Azure’s enterprise capabilities, scale and security.

Azure Percept

Azure Percept seamlessly builds and manages intelligence at the edge with Azure AI and Machine Learning, providing a complete platform of software and hardware components to accelerate IoT development and prototyping. The business benefit here is a complete packaged solution – pre-configured IoT hardware devices, pre-trained models, and software for configuring and managing those devices – that allows customers to fast-track AI development at the edge.


Willem’s Best Picks

Attending this year’s MS Ignite left me with two primary takeaways. I am firstly excited to see the continuing investment by Microsoft into the Azure Synapse product. It is proving to be a powerhouse on the platform and has been the cornerstone of each of my Azure deliveries to clients within the past year or two. Secondly, when we consult with business users (not just IT departments) we see an increasing desire to get closer to data. It is becoming more common to see ‘citizen developers’ within businesses now. Thus, the Power Platform plays a significant role, as it helps move things along with low-code projects while still working within a framework that satisfies IT security requirements.

Power Platform

This platform has again become more powerful (pun intended) with its low- and no-code approach to common data workloads. Some additions include: Process Advisor for Power Automate; the pay-as-you-go pricing model for Power Apps; and the Power BI app for MS Teams. Process Advisor gives clients an easy, low-code starting point for a commonly asked question: analysing a specific process to look for bottlenecks. The business benefit here is that Power Apps is now more approachable for ‘proof of value’ type projects, with a lower-cost pay-as-you-go model instead of per-user licensing. Users now have access to their Power BI environment within MS Teams and are able to collaborate right there (where they are already working) without having to switch to the browser.

Honorary mention

The Microsoft and Qlik collaboration for data integration is a great example of how one can combine the forces of these two providers to solve business problems. The Qlik Replicate and Azure Synapse demo will be especially valuable to those clients working with SAP environments who are looking at moving towards data lakes or data warehousing in Azure Synapse.

Defining Analytics

Data Analytics used to be a buzzword. It no longer is. It is now a standard business requirement. Over the past decade we have gone from asking “Does your business need data analytics?”, to “How can data analytics transform your business?” Simply put, analytics has become essential to the growth and development of the modern organisation. Today, businesses of all sizes can employ analytics to gather insights on the health of processes, and to identify opportunities for improvement and expansion. But first, before we discuss how analytics can add value to a business, let’s define the term “data analytics”.

A different definition

The word “analytics” has been in regular use for a few decades now; however, it can be traced back many centuries to the Greek “ἀναλυτικά” (Latin “analyticvs”), which means “the science of analysis”. In this age, we define analytics simply as “the method of logical analysis”.

Today, the word most commonly coincides with the term “data analytics”, which refers to the process, or a set of processes, for collecting, storing, processing, and interpreting data. This is the most common definition, and it has evolved over the years, depending on the technological capability of that time. It is, in my opinion, quite reductive. I, however, would prefer to expand that definition…

Data analytics is a framework for organisations to acquire, store, process, and interpret data, for the purpose of analysing incidents, gathering insights, identifying patterns, predicting events, and reporting relevant business metrics for further action.

A framework?

You read that right. There are a number of definitions of “data analytics” available on the internet. Most will define data analytics as a process, and they are all right. These definitions have mostly come from industry leaders – organisations and people – who have invested heavily and worked in the field of data science for decades, so there is no denying their “correctness”. However, I believe they diminish the modern analytics process down to simply “analysing” data.

I like to think of data analytics as a framework over which businesses can logically build their analytical capability, by methodically developing systems for addressing each step of the data lifecycle, from data generation and acquisition through to data archiving.

Why redefine what isn’t wrong?

When we think of data analytics as a process, we are more likely to focus on the “analysis” portion of analytics. Data analytics, however, is a lot more than analysis of data. It involves many steps, or rather sub-processes, leading up to analysis, and a few more following it. Each of these sub-processes requires careful consideration and implementation, which can affect the quality and outcome of every other step in the overall analytics workflow.

Not only that, thinking of data analytics as a process leads many to believe that analytics is a linear exercise; that it starts with gathering data, and ends with visualising the results of analysis. This is quite far from real-world scenarios. In most projects, data moves from one sub-process to another, often branching here, converging there, returning to a prior sub-process, or skipping ahead a few sub-processes. How data moves through a data analytics system will depend on a number of business rules and requirements.

Data analytics is a framework for organisations to acquire, store, process, and interpret data, for the purpose of analysing incidents, gathering insights, identifying patterns, predicting events, and reporting relevant business metrics for further action.

Looking at data analytics as a framework encourages us to acknowledge the various sub-processes within the analytics workflow, and how they interact with each other. It also allows us to take into account how each of these sub-processes may share resources with different business systems outside of the workflow, and lets technical and business teams plan how existing resources that are not currently used for analytics can be integrated into the workflow.

So rather than picturing data analytics as a bidirectional number line, you could think of it as a two-dimensional plane that overlaps with other two-dimensional planes in certain areas, and intersects with them in other places.

Parts to a Whole

Moving beyond definitions and arguments, let’s look at the different sub-processes that make data analytics possible. While real-world applications and implementations of data analytics can look vastly different, they almost always include the following components:

  1. Data Acquisition & Source Integration
  2. Data Storage & Warehousing
  3. Data Cleaning & Preliminary Analysis
  4. Feature Engineering
  5. Statistical Analysis & Machine Learning
  6. Data Visualisation & Reporting
  7. Data Archiving

All of these sub-processes contribute towards the efficiency of your analytics workflow, which is why a data analytics project must include specialists with an understanding of the whole. Decisions made for and at each segment could either derail or simplify the end result.

Data Acquisition & Source Integration

No matter the size of your business, there are always operations that generate relevant data. There will also be operations that access and process data originating outside of your business. The goal of data analytics is to make sense of this data, understand how your business is performing, and finally improve business operations. A data analytics workflow must accommodate systems that can record data generated across different teams and activities. This includes data generated by the analytics workflow itself. An analytics workflow must also include tools that can integrate data sources which are external to your business.

Data Storage & Warehousing

Once your business has identified valid sources of data, and found ways to access these sources, it needs to find ways to store this data in a manner that is manageable and effective over the long term. A whole lot of the data generated by business processes is “raw” in nature. This data needs to be organised so that it can be easily and repeatedly accessed and processed. It needs to become meaningful. A business’ data storage strategy can affect its analytics capability. Poorly developed storage systems can slow down the analytics workflow and unnecessarily increase operational expenditure. An efficient data storage strategy will take into account the volume of data generated and the velocity at which the data is generated. It will also manage data backups for disaster management and recovery.

Within the purview of data analytics there is also data warehousing – data storage that is specifically maintained for analytics. More on this in future posts.

Data Cleaning & Preliminary Analysis

While the data storage process brings order and structure to the data ingested from multiple sources, it is not always ready for the purpose of analytics. Data, especially textual data, can often have issues such as spelling mistakes, special characters, whitespace, etc., which can undermine the results of analytics. Data ingestion can also face errors related to date formats, rounding of numerical values, truncation of text strings, etc. All of these need to be handled either when the data is being ingested into storage, or right after it has been ingested. At this stage, it is also important to run some preliminary analysis on the data. This helps understand how the data is distributed (particularly when the end goal involves statistical analysis), how much of the data is redundant, missing, or simply irrelevant, and whether more cleaning is required.
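To make this a little more concrete, here is a minimal pandas sketch of the kind of cleaning and preliminary profiling described above; the column names and rules are hypothetical placeholders rather than a prescription.

```python
import pandas as pd

# Hypothetical raw extract with typical quality issues
raw = pd.DataFrame({
    "customer_name": ["  jane smith ", "JOHN DOE", None],
    "order_date":    ["2021-11-03", "2021-11-04", "not a date"],
    "order_total":   ["199.99", " 45.5 ", ""],
})

clean = raw.copy()
# Trim whitespace and normalise casing on text fields
clean["customer_name"] = clean["customer_name"].str.strip().str.title()
# Coerce dates; anything unparseable becomes NaT so it can be followed up
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
# Convert numeric strings, treating blanks as missing values
clean["order_total"] = pd.to_numeric(clean["order_total"].str.strip(), errors="coerce")

# Preliminary analysis: distributions and missingness indicate whether more cleaning is needed
print(clean.describe(include="all"))
print(clean.isna().mean())  # proportion of missing values per column
```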

It is quite common for these first three sub-processes – data acquisition & source integration, data storage & warehousing, and data cleaning & preliminary analysis – to be implemented as one unified process within the analytics workflow.

The following three – feature engineering, statistical analysis & machine learning, and data visualisation & reporting – constitute the “analysis” portion of a data analytics workflow.

Feature Engineering

In many cases data that has been cleaned is good enough for statistical analysis, reporting, and visualisation; however, it is not always optimal. Businesses might want to extrapolate more information from the cleaned data, such that this information makes it easier to filter data reports or explain trends and historical events. Basically, existing data can be manipulated to create new “features” and “attributes” that have more value than the cleaned data from which they were created. In fact, data analytics and machine learning systems rely heavily on feature engineering processes. Efficient feature engineering can greatly enrich the quality of your business’ reports and forecasts.
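As a small illustration of the idea, the sketch below derives calendar features and customer-level aggregates from a hypothetical orders table; the column names are placeholders.

```python
import pandas as pd

# Hypothetical cleaned orders data
orders = pd.DataFrame({
    "order_date":  pd.to_datetime(["2021-11-03", "2021-12-24", "2022-01-05"]),
    "order_total": [199.99, 45.50, 320.00],
    "customer_id": [101, 101, 202],
})

features = orders.copy()
# Calendar features make trends and seasonality easier to filter and explain in reports
features["order_month"]   = features["order_date"].dt.month
features["order_weekday"] = features["order_date"].dt.day_name()
features["is_weekend"]    = features["order_date"].dt.dayofweek >= 5

# Customer-level aggregates are new 'features' with more analytical value than the raw rows
customer_features = (features.groupby("customer_id")
                             .agg(total_spend=("order_total", "sum"),
                                  order_count=("order_total", "count"),
                                  avg_order=("order_total", "mean")))
print(customer_features)
```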

Statistical Analysis & Machine Learning

Analytical systems will frequently employ statistical methods for analysing data and identifying features that are relevant to the business questions that need answering. They also help determine which features in the data have strong predictive power and must be used in machine learning algorithms. A good portion of data analytics is oriented towards forecasting how business processes will perform and what pre-emptive steps are required to ensure business continuity. For this reason data analytics teams often have at least one data scientist and one machine learning engineer who can formulate and test business hypotheses using statistical methods.
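As an illustration only, the sketch below runs a simple hypothesis test and fits a basic predictive model on synthetic data; the segments, labels and features are entirely made up.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical weekly spend for two customer segments
segment_a = rng.normal(loc=250, scale=40, size=200)
segment_b = rng.normal(loc=265, scale=40, size=200)

# Hypothesis test: is the difference in mean spend statistically significant?
t_stat, p_value = stats.ttest_ind(segment_a, segment_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Simple predictive model on synthetic features (spend, tenure) and a synthetic churn flag
X = np.column_stack([np.concatenate([segment_a, segment_b]),
                     rng.integers(1, 60, size=400)])
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```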

Data Visualisation & Reporting

The results of analysis (variances, correlations, scores, etc.) can be quite difficult to read. Business stakeholders with the power to make decisions are not always acquainted with the technical knowledge required to make sense of statistical analysis. To accommodate such an audience, it is necessary that the results are presented as reports (verbose and visual) that drive the message home. An analytics workflow will consist of at least one dashboard or report, depending on the business requirement. The development of dashboards and reporting solutions can be quite cumbersome, and is impacted by the underlying data storage (warehouse architecture, data models, etc.) and so great care must be taken to ensure that your storage solution is not overcomplicated.

Data Archiving

A much overlooked process within data analytics projects, data archiving – not to be mistaken for data backup – can quicken the response times of your analytics solutions. By archiving data that is no longer in frequent use, considered “offline”, we take it out of operational use, leaving behind aggregations and snapshots that do not need to be re-processed along with newer operational data. Reduced processing and archived storage often translate to reduced operational expense. Decisions about data archiving need to be made early in the analytics workflow development, and must be covered alongside the storage solution discussed earlier.
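As a simple illustration of the idea, the sketch below splits a hypothetical transaction table at a business-defined cutoff, keeps a yearly aggregate of the ‘cold’ rows for reporting, and writes the detail to cheaper archive storage; the path, cutoff and columns are placeholders.

```python
from pathlib import Path
import pandas as pd

# Hypothetical transaction history; in practice this would come from your warehouse
transactions = pd.DataFrame({
    "txn_date": pd.to_datetime(["2018-03-01", "2019-06-15", "2023-02-10", "2023-03-05"]),
    "amount":   [120.00, 80.50, 42.00, 310.00],
})

CUTOFF = pd.Timestamp("2022-01-01")  # business-defined retention boundary

cold = transactions[transactions["txn_date"] < CUTOFF]
hot = transactions[transactions["txn_date"] >= CUTOFF]

# Keep a yearly aggregate of the archived data so reports still see historical totals
cold_summary = (cold.assign(year=cold["txn_date"].dt.year)
                    .groupby("year")["amount"].agg(["sum", "count"]))

# Write the detailed cold data to cheaper archive storage (placeholder path; needs pyarrow)
Path("archive").mkdir(exist_ok=True)
cold.to_parquet("archive/transactions_pre_2022.parquet", index=False)

print(cold_summary)
print(f"{len(hot)} rows remain in the operational store")
```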

Not so simple

As mentioned earlier, there is the ever-lingering temptation to regard data analytics as a simple end-to-end pipeline. It seldom works so simply. Data analytics workflows, given the business requirement, can be quite complex. There are processes and considerations we have not covered in the preceding sections of this post, but the ones mentioned are the typical building blocks of an analytics workflow. When developing an analytics solution, teams also have to consider aspects such as data security and the data compliance standards of the local governments where servers are hosted.

Keeping all of this in mind, here is how I’d visualise a basic analytics workflow.

A basic data analytics workflow

Don’t let that image frighten you. It’s only meant to give you an idea of how data flows in a data analytics system. An efficient analytics workflow automates data movement from one subprocess to another, and minimises human intervention, allowing analysts and business stakeholders to focus their attention on high-priority issues. It also helps individuals and teams make better decisions by shifting their focus from scouring crude data to analysing reports produced by the system.

Get in touch

At Exposé we pride ourselves in developing bespoke data analytics solutions that cater specifically to your business’ requirements. Whether you are looking to integrate reporting tools, automate your data workflows, or modernise your existing tech infrastructure by moving to the cloud, we can provide you the service you require. Get in touch with us for a confidential initial discussion.

Azure Purview, does it fill the data governance blind spot for Microsoft? (Part 2)

Welcome back to Part 2 of this blog on Azure Purview. In this instalment continuing from Part 1, I will go through a detailed review and highlights of Purview.

Azure Purview – classifications and sensitivity:

  • Classifications using Purview’s own 100+ prebuilt classifications, or BYO ones, are used to mark and identify data of a specific type that is found within your data estate. We could easily see where all the credit card and Australian Tax File numbers were located across the data lake.
  • Sensitivity labels define how sensitive certain data is and they are applied when one or more classifications and conditions are found together. We could clearly find the data with a sensitivity of ‘secret’ – an immediate application of this could be to support the IRAP data classifications as defined by the Australian Signals Directorate and PCI in the financial sector.

When the scan completes, all data meeting classification rules will be discoverable, whereas Purview’s sensitivity labelling needs a couple of hours to reflect the new assets and auto-label sensitivity, after which time it too will be discoverable. It is also possible to view insight reports for classifications and sensitivity labels.

Azure Purview classification and sensitivity label examples

Azure Purview – business glossary

We could easily employ a glossary to overlay business-friendly terms over the metadata, so that the physical vocabulary is converted into a standard vocabulary the business can understand. Remember, data is the business’ asset more so than ICT’s, so a business vocabulary is important.

Purview has, as I previously mentioned, more than 100 system classifiers that it uses during scans to automatically detect and classify data, and it can also use your own BYO classifications and apply them to data assets and schemas. But it was easy to override these with the business glossary, and anything a human overrode was never replaced by subsequent automated scans. We, for example, overrode CC# with Credit Card Number.

It is also in the glossary where data stewards and owners are set, two core elements of effective data governance.

Terms can also be subject to workflow and approvals so that it does not become a free for all.

Azure Purview – show lineage at various levels

We could see a bird’s-eye view of the data estate, including very impressive lineage at the asset, column and process levels, as well as the full data map:

  • At the data asset/ entity level, i.e., where the entity/ asset came from.
Azure Purview Entity level lineage
  • At the column level, i.e., where the attribute came from.
Azure Purview column level lineage
  • At the process level, i.e., how data was moved/ orchestrated.
Azure Purview process level lineage

Azure Purview – gain insights

It is also easy to see insights across many of these concepts – across the data assets, glossary, scans, classifications, sensitivity, and file extensions.


The images below show only some insight examples for the glossary, for classification and for sensitivity:

Azure Purview Glossary insights
Azure Purview Classification Insights
Azure Purview Sensitivity Insights

Conclusion – is Azure Purview a worthy data governance tool?

Does Purview hit the mark? As I said before, there were, as at the date of authoring this, some kinks Microsoft needed to sort out, and my correspondence with their product team suggests they are working on them. So, looking at it from a pure data cataloguing perspective, it ticks many boxes, and at a very compelling price point.

But data governance is broader than just cataloguing, and even though Purview crosses the boundary into some aspects that would not normally sit within a data catalog (which is a good thing), other areas still require attention, notably master data, data quality and data security. BUT we all know this is just the first module, so watch this space!

Originally published by Etienne Oosthuysen at https://www.makingmeaning.info/post/azure-purview-does-it-fill-the-data-governance-blind-spot-for-microsoft

Azure Purview, does it fill the data governance blind spot for Microsoft? (Part 1)

Data enablement – can Azure Purview help?

For several years now, I have been evangelising about the need for data enablement in organisations. In a nutshell, this means getting data to consumers (be it final consumers of reports, or consumers who go on to create data solutions such as data engineers, data scientists and business focussed data modellers) faster, and accepting that data transformations, data modelling and new data solutions (models, reports, dashboards, etc.) will occur as part of a highly agile, rapid development process and by multiple data workers, some of whom are external to ICT.

Technology has now reached the point in the maturity curve where this enablement will accelerate and become the norm, replacing old-school, linear data workloads performed only by central BI/ ICT teams. Core to these technologies are data lakes for storage at scale (the ‘lake’ of your ‘lake house’), Synapse or Databricks for on-demand transformations and the virtual data warehouse layer (the ‘house’ of your ‘lake house’), and Power BI for business models and visualisations (the ‘front door’ to the whole thing) – and, of course, the resources to move data into and around the ecosystem.

BUT all this enablement now demands a massive rethink of governance – both in terms of methodology and technology. Long and laborious, theory-heavy data governance methodologies simply won’t keep up with the rapid growth of the internal data market and the many workers across the organisation who take part in data-related activities. An alternative, much more pragmatic methodology is required, and it must be supported by technology that possesses two crucial things: (1) artificial intelligence to simplify and accelerate the process of data cataloguing and classification, and (2) crowd sourcing so that users across the business can quickly add to the collective knowledge of the data assets of the business. And it is in the technology space where Azure had a massive blind spot.

Introducing Azure Purview.

The word ‘purview’ simply means ‘range of vision’, and when it comes to data, the greater the range of that vision and the clearer the objects you see, the better. Will Purview live up to the definition of its more generic namesake, and will it cover the blind spot I previously mentioned?

The current generally available version is the first of multiple planned modules for Purview, i.e., the Data Catalog module. This first module supports AI based cataloguing and classification of data across your data estate, curation, and data insights. Users will in addition be able to use and maintain business glossaries, expressions to classify data based on patterns beyond the out-of-the-box classifications (let’s call these bring your own (BYO) expressions to cover additional patterns), provide visibility over ownership and custodianship, show lineage, etc.

This will have immediate benefit for anyone seeking pragmatic data governance, as it provides a heap of knowledge about the data in your data estate via 100+ out-of-the-box scanning rules – something that would otherwise have required resource-intensive and error-prone human activity – plus it enables data workers to augment or override the AI scanning in a crowd-sourcing ecosystem, or to BYO scanning rules.


In a recent road test, we dumped a whole load of data into Azure Data Lake and set Purview scanning loose over it to do its AI and built-in classifier magic. The results looked pretty good and go a long way towards fulfilling the requirement I mentioned before for pragmatic and accelerated data governance.

In the next blog, I will go through a detailed review and highlights of Azure Purview. Stay tuned!