Old Faithful is renewed – welcome to Cubes as a Service

Analysis Services (as a) Service

For BI professionals, the main component used to provide a rich, business-friendly enterprise information layer to non-technical users at scale has been SQL Server Analysis Services (Power BI models achieve the same, but are limited by scale).

One major downside of Analysis Services was its requirement to be on-premises (in which I, of course, include virtual machines delivered through Infrastructure as a Service) – so no modern analytics architecture that included Analysis Services could claim to be server-less, and such modern, cutting-edge analytics architectures therefore always required a bit of the old (Reporting Services and Integration Services sit in the same camp).

This will soon no longer be the case: Microsoft recently announced Azure Analysis Services, which is basically Cubes as a Service.

 (Grinslade, 2016)

We feel that once this service is released for General Availability, it will become a very common component of any large and modern Data Analytics landscape.

The benefits:

·       Arguably the most important benefit is cost. Following the pattern Microsoft used for its Data Warehouse as a Service (SQL DW), the service can be paused, which, together with the ability to scale up or down, means you pay only for what you use. There are major cost benefits associated with this compared to maintaining servers (incl. IaaS) to host Analysis Services.

·       It makes moving the business from on-premise to the cloud much simpler.

·       It helps achieve a server-less ecosystem which means the cloud provider (in this case Microsoft) can take care of commoditized services (like infrastructure, availability, and recovery).

·       It is easy to scale (through Query Processing Units) and to pause.

·       The service is in the cloud, so connecting to cloud data sources is, as expected, very simple, and connecting to on-premises data sources is enabled through the on-premises data gateway (you may already use this in your Power BI solutions, and if not it is very easy to deploy).

 caas-sources

 

 (Duncan, 2016)

·       One of the downsides of the (underlying) data model created in Power BI (in Import mode) is the fact that it is a black box, not accessible by other BI tools. Azure Analysis Services overcomes this limitation.

caas-visualisations

 

 (Duncan, 2016)

Purely Technical benefits

·       The service is built on SQL Server 2016 Analysis Services Enterprise Edition, so for BI professionals familiar with this technology there are no new skills to learn.

·       There is a time benefit – technical professionals can now spend their time developing truly great Analysis Services solutions without having to be distracted by infrastructure challenges (like setting up and configuring servers to host the solution).

The service is still in its infancy and many things simply do not work, but we will be test driving it in due course, so watch out for a blog post once it goes into General Availability.

Let’s hope Microsoft embraces the same strategy for Integration Services (ETL) and Reporting Services (paginated reports).

References

Duncan, O. (2016, October 26). What is Azure Analysis Services? Retrieved from https://azure.microsoft.com/en-us/documentation/articles/analysis-services-overview/

Grinslade, B. (2016, October 25). Announcing Azure Analysis Services preview. Retrieved from https://azure.microsoft.com/en-us/blog/introducing-azure-analysis-services-preview/

Microsoft. (n.d.). Azure Analysis Services. Retrieved from https://azure.microsoft.com/en-us/services/analysis-services/


Colouring with numbers – Can data present a better picture? Time to bring sexy back!

With the rise of self-service reporting, is just presenting the data enough? Or is how information is presented just as impactful as what is being presented?

As people are becoming increasingly exposed to data visualization through infographics, digital dashboards, and interactive websites, there is a growing expectation that information is presented in a visually coherent and aesthetic manner. People are being trained to skim-read what is displayed on a screen, in contrast to the in-depth reading that traditionally occurred with information presented on paper.

A 2005 study by Ziming Liu of San Jose State University looked at how reading behavior had changed over the previous ten years from paper to screen and found exactly this pattern.

“The screen‐based reading behavior is characterized by more time spent on browsing and scanning, keyword spotting, one‐time reading, non‐linear reading, and reading more selectively, while less time is spent on in‐depth reading, and concentrated reading. Decreasing sustained attention is also noted.” Liu summarised.

Evidently, there is limited time for a digital report to grab a reader’s focus and impart the requisite information.

So with this in mind how do we ensure we are delivering information clearly to the report reader?

Before we start designing our report, there are three main areas we need to understand that will influence the report design: audience, function, and presentation.

  1. Audience – who is this information intended for? Without a good grasp of the audience, we risk producing a report that fails to address the need and lacks clarity of message.
  2. Function – what is the purpose of this report? Knowing how the report is expected to be used helps ensure the right information is selected and represented to the reader.
  3. Presentation – how is the report to be displayed? There is an ever-increasing number of formats in which information can be displayed: paper, mobile screens, large format displays. While one report design may work for one particular presentation method, it doesn’t mean it will automatically work for another.

Identifying answers to these key areas will help us to focus and increase the success with which our report is interpreted.

We’ve already noted that our audience is spending less and less time actually reading information and even less time interpreting it. Generally, an audience is expecting a report to provide an answer to a question.

How much of our product did we sell last month?

What were my child’s grades for each of their subjects?

What is the temperature forecast for the next week?

The audience doesn’t want to have to perform their own exploration of the data to find the answer to the question that brought them to the report in the first instance. It is our role as the report designer to perform this on their behalf and to visualize this information as clearly and succinctly as possible for them.

Once the audience has found the answer to their initial question, they will either absorb this information and move on or immediately want to apply this new information to other scenarios to assist with further decision making.

What would happen to our revenue from the sale of our product if we had raised or lowered the price?

Were my child’s grades reasonable compared to the rest of the class?

Will I need to take an umbrella this week?

These additional questions aren’t normally identified in the initial discovery phase when trying to understand the requirements of the report, as they won’t have been conceptualized until the initial answer is presented, and the context will differ depending on who’s consuming the report. It is in trying to offer the audience answers to these unrealized questions as quickly and clearly as possible that the true art of self-service report design is realized.

So now we have a seemingly impossible task: providing answers to questions that don’t yet exist while making sure information is clearly presented, easily interpreted and visually appealing.

How do we achieve this?

There are some report design principles that I have found quite useful in assisting with the design process:

  1. Storyboard the report – Try to find the “story” within the information. What is it trying to say? Where does it start and where does it finish? This will help in understanding the context of the information within a report, outline potential ways the data could be used and alternative questions that could be asked, and identify any relationships between the data being visualized.
  2. Assign visualizations – Choose relevant visuals to portray the information as coherently and succinctly as possible. It isn’t as simple as picking any visualization to display the information; you have to ensure its capabilities support the three influencers of audience, function, and presentation.
  3. Define user experience – Critical to ensuring the audience can find and interpret the information that is being presented. To do this effectively requires thought behind –
    • Navigation – Decide how the report reader will move through the report to find the information they require with as little resistance as possible, but don’t discount enabling the explorative value in your design, as this may also be how hidden information is brought to their attention, creating new and insightful knowledge.
    • Layout – Identify where visualizations and navigations are placed to enhance the user experience. Clarity and brevity are important. Remove anything that isn’t enhancing your story or offering benefit to the audience.
    • Colour – Colour plays an important role in both clarity and engagement with a report. By utilizing colour effectively, information can be enhanced and importance highlighted. It can also allow the audience to form an emotive attachment, ensuring better interpretation of the information presented. But keep in mind that colour is highly subjective and can be responded to differently by different audiences.
  4. Iteration – A report is a dynamic entity and needs to be constantly honed against needs and requirements. It can be continually improved with further input from users and changes in visualizations and datasets. Recognise the thought and effort you put into the report, but don’t become emotionally invested in it: when you release it to an audience, they will ultimately decide whether it is right for them or not. So be prepared to accept critique and take it into consideration when iterating your report.

In the coming weeks, I will be working through each of these principles with an example to help illustrate how each can be effectively utilized. I hope you enjoy the journey with me.

Ziming Liu, “Reading behavior in the digital environment: Changes in reading behavior over the past ten years”, Journal of Documentation, Vol. 61 Iss: 6, pp.700 – 712

UPDATED: ArcGIS Maps for Power BI provided by Esri (Preview)

Microsoft recently announced ArcGIS Maps for Power BI provided by Esri. This functionality is still only in preview, but we decided to give it a test drive. https://powerbi.microsoft.com/en-us/blog/announcing-arcgis-maps-for-power-bi-by-esri-preview/

We used data from our social media solution to visualize with this new feature. The results were pretty impressive as shown in the images below and a huge improvement over the existing Map visual.

arcgis-map-example-3

arcgis-map-example-2

arcgis-map-example-4

arcgis-map-example

There is a BUT! This feature is not currently available in the Power BI Service, so its application is very limited. Of course, the feature is still only in preview, so this should be addressed soon, as confirmed in this thread: http://community.powerbi.com/t5/Desktop/ArcGIS-Maps-for-Power-BI-Preview-Discussion/td-p/72947

UPDATE: Microsoft has announced that the ArcGIS mapping service is now available in Preview in the Power BI service as per this post from Microsoft on 24 November https://powerbi.microsoft.com/en-us/blog/arcgis-maps-for-powerbi-available-in-powerbi-service/.

Synergy Reporting

Logo_Synergy.png

The Synergy Solution assists organizations with accessing their people, finance, and operational data at any time, in any place and on any device in order to make informed business decisions without the need for static reports and reliance on IT.

Download the brochure here: synergy-brochure

Here is a quick demo of Synergy

The battle of the AMLs – Amazon Machine Learning Vs Azure Machine Learning

As machine learning has become more accessible to businesses and the number of products available in the market has risen, we are regularly asked, as leaders in data and analytics, to recommend, or at least provide insight into, some of those products. For this comparison, it was decided to use Amazon Web Services and Microsoft Azure, as both are comparable in the market as well as being cloud-based offerings, allowing businesses lower setup times and ongoing costs. The reason for selecting Azure and Amazon was our familiarity with Microsoft and Amazon Web Services products.

Both Microsoft and Amazon can achieve the same outcomes in different ways. Any perceived strengths and weaknesses highlighted here are based on my personal experience using the products. Do note that both Microsoft and Amazon make frequent enhancements. This document is based on research conducted in September 2016, so some things may have changed since the time of writing.

Use case – The use case is centered on a common retail-related problem. A business offers services to clients. The sales promotion activities within that business were not targeted at specific groups of clients, because the business didn’t have any in-depth knowledge of which services to push to which segment of clients. Below is a comparison of Amazon’s and Microsoft’s Machine Learning solutions based on my findings when trying to solve the above-mentioned use case.

 

Algorithms – Azure Machine Learning allows multiple ways to try and solve a given problem

Azure has a plethora of algorithms to choose from, compared to Amazon, which offers only one. A simple Google search for an Azure Machine Learning cheat sheet will provide good guidance on which algorithms would be best for a particular problem. Amazon, however, is limited to logistic regression, as confirmed on the Amazon Machine Learning Frequently Asked Questions page (https://aws.amazon.com/machine-learning/faqs/). While logistic regression has its place, it is only ideal when there is a single decision boundary. Most use cases will require trialing multiple algorithms to achieve a good outcome.

Findings – If you need to predict whether a client will make another purchase or not, then logistic regression would be ideal. However, when needing to understand which services some clients are more likely to purchase, more capability from the algorithms is needed to generate a good outcome. This is where Azure leads the way.

aml_cheatsheet
Azure Machine Learning Cheat Sheet
Source: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
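As a purely local illustration of why trialing multiple algorithms matters (this is a scikit-learn sketch on synthetic data, not either managed service, and all names and parameters are assumptions), compare a single logistic regression against a second algorithm on the same multi-class problem:

```python
# Illustrative only: synthetic multi-class data, two candidate algorithms.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(score, 3))
```

On problems with more than one decision boundary, the second algorithm will often outperform the first, which is exactly the gap the single-algorithm service cannot close.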

 

Categorise Data – Amazon Machine Learning categorizes the source data for you

Amazon automatically pre-processes the data and categorizes each field. The possible categories are Categorical, Numeric, Binary or Text. If a user wishes to, they can change the category of the data which may have an effect on the final outcome. If the data contains a row identifier, it can be declared and will be ignored when making predictions.

A binary field is one which can only have one of two values. In Amazon, binary fields can only use one of the following combinations (case-insensitive):

  • 0,1
  • Y, N
  • Yes, No
  • True, False.

It is not capable of understanding any other combination even if only two values exist in the field (e.g. paid, unpaid).

In Azure, there is no option to classify data fields. The classification is done automatically. While this is very intuitive, I would prefer to have the ability to manipulate fields as required. It is worth noting that Azure had no difficulty understanding that a field containing only two values was a binary data set. This reduced the amount of data manipulation that was required before creating a model and made the output easier to read.

aws_categorise_1
AWS automatically proposed data categories

Findings – I had a column with an indicator for purchases made in the morning or afternoon (AM/PM). Amazon wasn’t able to see this as a binary field, and forcing it to be one and then trying to predict whether a service was required in the morning or afternoon caused an error.
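A simple workaround is to recode such a two-value text field into the 0/1 form Amazon recognises before uploading the data. A minimal pandas sketch (file and column names are assumptions) might look like this:

```python
# Recode a two-value text field (AM/PM) into 0/1 so Amazon ML treats it as binary.
import pandas as pd

df = pd.read_csv("services.csv")                             # hypothetical extract
df["is_morning"] = (df["time_of_day"] == "AM").astype(int)   # AM -> 1, PM -> 0
df = df.drop(columns=["time_of_day"])
df.to_csv("services_recoded.csv", index=False)               # ready to upload to S3
```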

 

Source Data – Amazon gives good visibility of the content within the source data

Amazon grouped the fields into the categories identified above. The target field is the field that we are attempting to predict using machine learning. It was possible to view each field and understand its correlation with the target field. The distribution view of categorical data was useful, identifying the top 10 attributes and their number of occurrences. A bar chart helped in understanding the distribution of attributes in each field.

aws_source_data_visualisation
AWS source data visualization of a single field

Azure too has the ability to show each field and the distribution of attributes within it. It only showed the top 10 attributes for each field and therefore it was not easy to understand the proportion of the data that did not make the top 10. The correlation of a field to the target was not shown in the example that I worked on. This is most probably due to the availability of multiple algorithms. The correlation of a field to the target field would be different for each algorithm.

AML_Source_Data_visualisation.png
Azure source data visualization

Findings – There were more than ten types of popular services. In Azure, I was unable to tell whether the services that didn’t make the top ten were significant or not. This was crystal clear on Amazon because it grouped all services which were not in the top ten and displayed them on the chart as ‘Others’.

 

Training a Model – Amazon has a fixed 70/30 split when training a model; Azure allows you to select your desired split

In Amazon, the data split between the training and evaluation datasets was fixed at 70% for training and 30% for evaluating. The only method of changing this was to use a different data set for evaluating the model. Based on your use case you may need a different split; in Amazon, you need to do this by manually creating the training and evaluation datasets.

In Azure, it was possible to specify the desired split between the data available for training and scoring the model. I would consider this to be an essential feature as the split would need to change based on the problem that you are trying to solve.

AWS_70_30_split.png
Amazon can only do a 70/30 split

Findings – A 70/30 split didn’t cause an issue. However, the ability to trial different splits in Azure was useful for me to understand its impact.
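If a different split is needed in Amazon, the training and evaluation files have to be created manually before upload. A minimal pandas sketch of that (the file names and the 80/20 ratio are assumptions) could be:

```python
# Manually create, say, an 80/20 training/evaluation split for Amazon ML,
# since the service itself is fixed at 70/30.
import pandas as pd

df = pd.read_csv("services_recoded.csv")
train = df.sample(frac=0.8, random_state=1)   # 80% for training
evaluate = df.drop(train.index)               # remaining 20% for evaluation
train.to_csv("train.csv", index=False)
evaluate.to_csv("evaluate.csv", index=False)
```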

 

Evaluating a Model – Amazon automatically evaluates the machine learning model

Evaluation happens automatically in Amazon; it occurs immediately after the machine learning model is created. In a binary classification scenario, Amazon lets the user adjust the trade-off, false positive rate, precision, recall, and accuracy. Each of these attributes has an effect on all the others, and it is easy to visualize the outcome immediately by tweaking them.

In Azure, evaluation can be added to the experiment as necessary. In a binary classification scenario, it was only possible to change the trade-off threshold which in turn impacts other factors.

aws_advanced_metrics
Amazon lets you change all metrics

 

Findings – I would consider it to be critical that a machine learning model is evaluated while in development. Doing some manipulation in the data improved the quality of the model. Many iterations are usually required before landing on a good model.

 

Predictions – Making predictions using Machine Learning

Both Amazon and Azure have options to manually test, batch process and create endpoints for real-time predictions. Batch processing is the more common method of creating predictions from machine learning; it is usually done over a large set of records, which takes some time to process.

Testing predictions on Amazon – The easiest way to test a model is to manually enter some values and get a prediction. This can be done by either typing the values into a web form or pasting values separated by commas. The predicted label is displayed on the screen, but to see the confidence of the prediction you need to dig through the raw output. An excerpt of an example output is below:

       "PredictiveModelType": "BINARY"
  },
  "predictedLabel": "1",
  "predictedScores": {
    "1": 0.7453358769416809
  }

Amazon batch predictions – To do this, you first need to convert the data to CSV (Comma Separated Values) format and upload it to Amazon S3 (Simple Storage Service). Predictions can be created based on this file and will be output to another CSV file in S3. The output file was compressed and saved as a GZ file, which is commonly used in Unix environments. A drawback of the output received was that there was no way of finding out which row of the source data matched which row of the output. The only way that I found to get around this was to add a row-number column when creating the machine learning model; the row number then appeared in the predictions. To marry up the two, you would need to use a method such as VLOOKUP, which was a bit frustrating. Trial and error proved that the rows in the output were in the same sequence as the rows in the source file.
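Instead of VLOOKUP, the same marrying-up can be scripted. A minimal pandas sketch (file and column names are assumptions; it assumes the row-number column declared as the row identifier appears in both files as row_id):

```python
# Join Amazon's batch-prediction output back to the source rows on the
# row-number column that was added when the model was created.
import pandas as pd

source = pd.read_csv("batch_input.csv")                  # contains a row_id column
preds = pd.read_csv("batch_output.csv.gz")               # Amazon's GZ output; pandas
                                                         # decompresses .gz automatically
combined = source.merge(preds, on="row_id", how="left")  # replaces the VLOOKUP step
combined.to_csv("predictions_with_source.csv", index=False)
```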

Amazon endpoint – An endpoint creates an Amazon web service which can be accessed via an API (Application Programming Interface). This is useful when there is a requirement to get predictions on a real-time or ad-hoc basis. Amazon claims that a query will be responded to within 100 milliseconds, and that the service can process up to 200 queries per second; any extra queries will be queued up and responded to. Higher capacities can be accommodated by contacting Amazon.
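For reference, calling such an endpoint from Python typically goes through the boto3 SDK. The following is only a sketch: the model id, endpoint URL and record fields are placeholders, and the actual values come from your own Amazon ML console.

```python
# Minimal sketch of a real-time prediction request via boto3 (values are placeholders).
import boto3

client = boto3.client("machinelearning", region_name="us-east-1")
response = client.predict(
    MLModelId="ml-EXAMPLEMODELID",                             # placeholder model id
    Record={"service_type": "massage", "time_of_day": "AM"},   # field names assumed
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
prediction = response["Prediction"]
print(prediction["predictedLabel"], prediction["predictedScores"])
```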

Testing predictions in Azure – You are presented with a test option where you can type in data to a web form. There was no option to simply paste comma separated values into the webpage like on Amazon. However, the option to download a customized Excel workbook was very intuitive. This workbook consisted of parameters and predicted values side by side. It was very easy to use and responses were received within a second. This method is great if the data set has a small number of fields but may become progressively harder to use if there are many fields.

Azure batch predictions – Similar to the test option, this option also made use of an Excel workbook. However, it was more flexible allowing you to pick a cell range for the input and output data. I set up my output data on another sheet in the same workbook. I tested 300 rows of predictions and got the results back within one second.

Endpoint in Azure – The endpoint in Azure appeared much more user-friendly. There was an API help page with sample code for a request and response using the dataset you are working on. It also contained sample code for C#, Python and R, which reduces the complexity of having to write code from scratch.
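To give a feel for it, a request-response call with the Python requests library looks roughly like the sketch below. The URL, API key and body shape are placeholders here; the exact request format is given by the sample code on the service’s API help page.

```python
# Rough sketch of calling an Azure ML web service endpoint (all values are placeholders).
import json
import requests

url = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
api_key = "<your-api-key>"
body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["service_type", "time_of_day"],  # field names assumed
            "Values": [["massage", "AM"]],
        }
    },
    "GlobalParameters": {},
}
headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + api_key}

response = requests.post(url, data=json.dumps(body), headers=headers)
print(response.json())
```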

AML_Batch.png
Azure batch predictions in Excel

Findings – In my use case, I didn’t require real-time predictions. I used batch predictions and it was a hassle to have to manually marry up the original data with the predictions from Amazon.

 

Costs – Azure’s better features and user interface cost more. Up to five times more!

Please note that pricing may vary based on many factors. These include, but are not limited to, the region, the complexity of the solution, the computing tier chosen, the size/nature of your organization and any other negotiations entered into with either Microsoft or Amazon.

I performed some high-level cost calculations based on the dataset used. An Amazon solution costs approximately USD 100 per month when using 20 hours of computing time and making 890,000 predictions. Real-time predictions cost more than batch predictions; however, the difference was not significant for the model that I used (USD 104.84 for real-time vs USD 97.40 for batch predictions).

The same solution using Azure came up to an estimated total cost of just under USD 500.

The prices are my observations only and should always be confirmed with the license provider. For more information on pricing, please see the links below:

Amazon – https://aws.amazon.com/machine-learning/pricing/

Azure – https://azure.microsoft.com/en-us/pricing/details/machine-learning/

 

Audience – Amazon appears to be more focused towards technical people

Overall, Amazon appeared to be focused on users who are more technically minded and more comfortable with programming. When learning to use Amazon Machine Learning, I only came across one worked example, which I would consider very limited.

On the other hand, Azure Machine Learning appeared more suited for power users within a business, as well as the technically minded. It provides a more familiar graphical drag-and-drop interface. These components are pre-programmed and are grouped such that they are easy to understand. The examples provided are a great way of getting accustomed to the tool and there are good sample datasets available.

 

Azure would appeal to most users

If you are more into coding, have technical resources to support you, and can manage with only the logistic regression algorithm, then Amazon would be ideal for you. With its lower cost and range of other cloud services, it is definitely worth considering. For the rest of us, Azure is clearly the better choice.

It is worth remembering that the many algorithms and the ease of use come with a price tag attached.

Findings – I almost felt spoilt for choice with the number of algorithms in Azure. I spent a lot of time trying to get a good model working on Amazon. The process was much easier when I used the same dataset in Azure.

In conclusion, I was able to gain an understanding of which groups of clients are most inclined to purchase certain services using Azure. This makes it possible to target specific clients and make the best use of the marketing budget. The ability to easily view the predictions alongside the fields used for those predictions was very helpful. The predictions will be used to target clients who are more likely to purchase the services; in turn, this will create a better outcome for both the business and its clients. Overall, Azure proved to be the more mature offering for someone who wanted to solve a particular use case while avoiding deep coding.

We will keep a close eye on the Amazon offering to see when it catches up.

From water cooler discussion to corporate Data Analytics in record time

Power BI

For those who have been involved in conventional Business Intelligence projects, you will be all too familiar with the typical chain of events and the likely outcome. It usually goes something like this: the idea is incubated by someone (very often within ICT), a business case for a project is written, someone holding the purse strings approves it, suitable business and functional requirements are gathered, written and approved, the solution is architected, developed, tested and signed off, the solution is implemented, and voila, the business MIGHT use the solution.

These conventional solutions are problematic for a number of reasons:

Requirements are lukewarm at best – If the idea was not incubated by someone experiencing a real business pain point, or wanting to exploit a real opportunity, then the “requirements” gathered are always going to be skewed towards “what do you think you need” rather than “what do you know you need right now”.

Suboptimal outcome – Lukewarm requirements will lead to solutions that do not necessarily add value to the business, as they are not based on real pain points or specific opportunities that need to be exploited. So when the formal SDLC process concludes, the business may, in the spirit of trying to contribute to successful project outcomes, try to use the solution as is, or send it back for rework so that the solution at least satisfies some of their needs.

Low take-up rate – when a suboptimal solution is accepted so as to be a good team player, usage will, over time, drop off as it’s not really serving a real need.

Costly rework and an expensive project – these issues mean that the business often only gets to see a final product late in the project lifecycle, and often only then starts thinking about what it could do for them if changes are applied to meet their real needs. The solution is sent back for changes, and that is very costly.

So what has changed?

Technology has finally caught up with what business users have been doing for a long time.

For many years, given the costly and problematic outcomes of conventional BI, users have often preferred access to the data they need, rather than fancy reports or analytics. They would simply download the data into Excel and create their own really useful BI (see related blog here https://exposedata.wordpress.com/2016/07/02/power-bi-and-microsoft-azure-whats-all-the-fuss-about/ ). This and the advent of a plethora of services in the form of Platform-, Infrastructure- and Software as a Service (PaaS, IaaS and SaaS), and more recently BI specific services such as Data Warehouse as a Service (DWaaS), are proving to be highly disruptive in the Analytics market, and true game changers.

How has this changed things?

In the end, the basic outcome is still the same: getting data from some kind of source to the business in a format they can consume, by converting it into useful information.

But technology and clever architectures now allow for fast responses to red-hot requirements. This is game-changing as solutions are now nimble and responsive and can therefore respond to requirements often discussed informally when pain points and/or opportunities are highlighted in the course of the normal business day, such as in meetings, over drinks with colleagues and at the “watercooler”. The trick is to recognize these “requirements” and to relate them back to the opportunities that the new world of data and analytics provides. If that can be done, then the resulting solutions typically respond much better to these organic requirements than solutions that respond to requirements incubated and elicited in a much more formal way.

What are water cooler discussions?

I use this term to describe informal discussions around the organization about pain points or opportunities in the business. These pain points or opportunities represent organic requirements that should be responded to fast if they hold real value. This DOES NOT mean that formal requirements are no longer valid, not at all, but it does mean we need to recognize that real requirements manifest themselves in informal ways too. Here is an example of how a requirement can originate in a formal and in an informal way:

The problem

City planning realizes that there are issues with parking availability as they receive hundreds of calls each month from irate commuters stuck driving around looking for parking. It seems as if commuters are abandoning the city in favor of suburban shopping centers where ample parking is provided. This is not good for businesses in the city center, and not good for the city council.

Formal

The head of planning realizes that information (in the form of data) will be key to any of his decisions to deal with the problem, so requests reports on traffic volumes, finances and works management planning. It seems as if this data is not in the data warehouse, so a business analyst is employed to elicit the requirements around the reports required, and so the long and costly process starts.

Watercooler

The Head of Planning tells a colleague that he wishes he could expedite getting his hands on the information he needs but he has a limited budget so he cannot employ more resources to move his reports along quicker.

This is overheard by the Data Analytics consultant, who realizes that in order to best support such an important decision, the Head of Planning will have to look at the issue from multiple angles, which will likely not be provided by such formal reports. The data he needs, i.e. traffic volumes, finances and works management planning, must be blended with other contextual data such as weather, events, and date and time of day.

The consultant knows that:

  • The city already holds traffic volumes, finance data and works management planning across fragmented source systems.
  • The city already collects millions of sensor readings per day – parking, traffic flow, commuter flow.
  • There are heaps of contextual data out there which is easy to access – weather, events, city businesses financial results, employment figures, etc.
  • The city already has a cloud subscription where services such as IaaS, but especially PaaS and SaaS and DWaaS can quickly be added and configured so as to allow for the collection, blending, storage, processing of data at a fraction of the cost of achieving the same on-premise.
  • The cloud subscription also allows for data science and predictive analytical activities to complement the collection, blending, storage, and processing of data.

He calls a meeting with the Head of Planning who is intrigued with the idea and the quick return on investment (ROI) at a fraction of the cost, and commissions the consultant to provide a proof of concept (POC) on the matter.

From water cooler discussion to the solution in record time

In my example (which is based on one of our real-world examples using Microsoft Azure) the city has an existing investment in a cloud service. Also, note that I provide a high-level resource mapping of the POC and solution at the end of this blog for both Amazon Web Services (AWS)® and Microsoft Azure®.

  1. The process starts with a POC. Either in a free trial subscription or in the customer’s existing cloud provider such as AWS or Azure.
  2. The preference is to keep the solution server-less and only opt for IaaS where resources cannot be provided as PaaS or SaaS.
  3. IoT parking data forms the basis of the solution, and both real-time flows and history are required.
  4. Weather data, traffic flows into and around the city, events, and time of day help add important context when predicting parking demand through peaks and troughs.
  5. Organisational works management planning data further improves parking planning.
  6. Business financial results show the impact, and more importantly the lost opportunity cost to businesses, if people abandon the city in favor of suburban shopping centers.
  7. The processing and storage of the sheer volume of data are achieved at a fraction of the cost previously envisaged by the business.
  8. The resulting solution is not a replacement for any corporate data warehouse, but complementary to it. Any existing data repository can be viewed as additional and useful contextual data in this new data analytics landscape.

The POC architecture involves the following resources (both Azure and AWS are shown):

Table
Workloads and vendor components

Please note that the diagram below by no means implies a detailed design, but is a true representation of the high-level architecture we used to achieve the specific solution.

Watercooler arch
High-Level Architecture

Data flow patterns

  • A: IoT – sensor-originated real-time data;
    • A1: into storage;
    • A2: into predictive analytics where it is blended with B and C;
    • A3: directly into real-time visualizations;
  • B: Additional contextual data from publicly available sources such as weather, events, and business financial performance, into storage;
    • B1: into predictive analytics where it is blended with A and C;
  • C: On-premise data such as works management, into storage;
    • C1: into predictive analytics where it is blended with A and B;
  • D: Predictive analytical results into real-time visualizations, and also to storage for historical reporting;
  • E: Massively parallel processing, scalability and on-demand compute where and when required, supporting visual reporting.

Outcome

The result of such a real-world example POC was the realization by the business that very deep insights can be achieved by leveraging the appropriate data wherever it exists, and that by cleverly architecting solutions with components and services within easy reach, superior outcomes can be achieved fast.

The building blocks created in the POC were adopted and extended into a full production solution, and it set the direction for future data analytical workloads.

Strategic Segmentation

Preface

Market segmentation is a common practice in marketing and sales in order to better understand – and therefore be better able to target – customers. This same principle, though, can be applied to any business problem where the division of a diverse population into sub-populations based on similarities (or differences) would be advantageous.

Fortunately, rather than having to slice every variable an infinite number of ways, we can utilise unsupervised learning algorithms to produce groupings of samples based on the similarity of data features.

This post will discuss the application of a technique to accomplish this, using a mixture of technologies (SQL, Python and R) and algorithms (Self Organising Maps (SOM) and Hierarchical Clustering).

The Goal

The following example is taken from a pool of insurance claims, with the aim to understand sub-populations contained within the data to allow for appropriate monitoring and exception reporting. Specifically, we want to know if the mix of claims changes for the worse.

Claims by nature have a structure whereby the majority of the portfolio consists of low-value claims, reducing quickly to very few at the highest values (for the statistically minded, the value generally follows a log-normal distribution). A typical partitioning strategy is to slice by claim duration (shorter claims are in general cheaper), and this is suitable in a lot of cases when dealing with claims in aggregate. But given the lack of granularity, the challenge is that when there is a change in expected durations, a second level of analysis as to the “who” is required.

By using unsupervised learning, we will essentially encode the “who” into the cluster exemplars, so we can then focus on the more important question: “So what are we going to do about it?”

Approach

Obtaining the data was, fortunately, a trivial task, as it was all contained in an on-premises SQL Server database, and the data was as clean as it was going to get. Five years of historical records were used as the training sample.

Execution of the modelling was done using the R Kohonen package, and subsequent clustering of the SOM model by the hclust function.

Finally, to glue everything together into a processing pipeline I used Python 2.7 with:

  • sqlalchemy: to connect to the database;
  • numpy / pandas: for data massaging;
  • rpy2: to connect to R.
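A minimal sketch of this glue layer is shown below. The connection string, table and column names are assumptions, and the hand-off to the R kohonen package happens via rpy2 as described above; only the extraction and scaling steps are spelled out here.

```python
# Pull the claims training set from SQL Server and min-max scale it to [0, 1]
# (see Technical Notes) before handing the matrix to R via rpy2.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@ClaimsDSN")   # hypothetical DSN, Windows auth
claims = pd.read_sql("SELECT * FROM dbo.ClaimsTraining", engine)

features = claims.select_dtypes(include="number")     # numeric features only
scaled = (features - features.min()) / (features.max() - features.min())

# scaled.values (a numpy array) is what gets passed across to kohonen::supersom
# through rpy2 in the actual pipeline.
```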

Algorithm Overview

I will not go into detail around the technical implementations of SOM and Hierarchical Clustering; there are far better explanations out there than I could hope to provide (in Tan 2006, for example). However, to provide a simple overview, a SOM is an m × n grid of nodes, to which samples are assigned based on the similarity (or dissimilarity) measure used; commonly, and in our case, this is Euclidean distance.

mapping
Figure 1. A 10×10 Hexagonal SOM Grid

It is an iterative algorithm with random initialisation. At each step, the node codebook (a vector identical in structure to the input data, with feature values that represent the samples assigned to the node) is updated and the samples are re-evaluated as to which node they belong. This continues for a set number of iterations; however, we can check whether the codebooks are still changing to any degree using a ‘Change Plot’. Figure 2 is an example of the ‘Change Plot’ from the Kohonen package, where we observe that after 50 or so iterations there is minimal change occurring.

changes
Figure 2. SOM Change Plot

When we have a SOM model we are happy with, Hierarchical Clustering allows us to condense the grid into a smaller number of clusters for further evaluation. Clustering is performed over the node codebooks, after which a number of clusters are selected. Note that one can use the nodes in the SOM as the clustering, but generally, based on the number of samples you train the model with, a SOM grid contains too many nodes to be useful.

clustering
Figure 3. Hierarchical Clustering of the SOM Codebooks

Modelling Process

The process by which the modelling was undertaken is depicted in figure 4.

process
Figure 4. High-Level Modelling Process

As can be seen from figure 4, the Python script initiates everything and uses the results from the SQL and R calls.

This approach was taken primarily due to my own limitations in R programming; I am far more comfortable developing in Python and know my Python code will be more efficient than my R code purely due to proficiency.

My selection of R and the Kohonen package was based on my research into a suitable implementation of SOM to use. kohonen has an implementation called SuperSOM, a multi-layered SOM in which each data layer is handled separately, which I thought would be ideal for temporal features in the source data (e.g. layer one = features at t1, layer two = features at t2, etc.).

Finally, the data set was not big by any stretch so whilst training a SOM can be a compute-intensive task, in this case, anything more than a decent laptop was not required – on my i7 laptop, training of the SuperSOM took only 3 seconds to run 80 iterations against ~5,000 samples.

Model Evaluation

Both Quantisation Error (QE) and Topological Error (TE) were used to evaluate the quality of the model, where:

  • QE = the mean distance from node samples to the node codebook. A low QE ensures that the node codebook is close to the samples allocated to it.
  • TE = the mean distance to a node’s nearest node. A low TE means that similar nodes are positioned close to each other on the grid.

TE is particularly important in our case as we wish to cluster the result; non-contiguous clusters are a side effect of a relatively high TE.
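For reference, both measures are straightforward to compute from the trained map. The numpy sketch below follows the definitions above, with `data`, `codebooks`, `bmu` (each sample's best-matching node index) and `grid_xy` (node positions on the grid) as assumed inputs; it reads the TE definition as the codebook distance to each node's nearest adjacent node on the grid.

```python
# Minimal sketch of the two evaluation measures defined above (variable names assumed).
import numpy as np

def quantisation_error(data, codebooks, bmu):
    # mean distance from each sample to the codebook of the node it is assigned to
    return float(np.mean(np.linalg.norm(data - codebooks[bmu], axis=1)))

def topological_error(codebooks, grid_xy):
    # for each node, the codebook distance to its nearest adjacent node on the grid,
    # averaged over all nodes (one reading of "mean distance to a node's nearest node")
    dists = []
    for i, pos in enumerate(grid_xy):
        grid_d = np.linalg.norm(grid_xy - pos, axis=1)
        neighbours = np.where((grid_d > 0) & (grid_d < 1.5))[0]   # adjacent nodes
        dists.append(np.min(np.linalg.norm(codebooks[neighbours] - codebooks[i], axis=1)))
    return float(np.mean(dists))
```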

Result

The clustering was used to create data exemplars around which monitoring was developed to understand any changes in the population mix. Interventions could then be focused on the relevant sub-population based on the dominant features of the samples in that group (i.e. if group “A” is a high-cost group and feature “X” is particularly dominant, how can we influence this feature to move the sample into a lower-cost group?). It is for this targeting reason that marketing has historically been a large user of clustering.

New samples can be allocated to a cluster by first assigning them to the nearest node, then taking the cluster id of that node.
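Scoring a new sample is therefore a two-step lookup, sketched below with assumed inputs: `codebooks` (node codebooks) and `node_cluster` (the cluster id given to each node by hclust).

```python
# Allocate a new (already scaled) sample: nearest node first, then that node's cluster.
import numpy as np

def assign_cluster(sample, codebooks, node_cluster):
    bmu = int(np.argmin(np.linalg.norm(codebooks - sample, axis=1)))  # nearest node
    return node_cluster[bmu]                                          # its cluster id
```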

Other Use Cases

Marketing: To group customers based on buying habits, product type purchases, purchase value etc to tailor marketing activities and/or focus on higher value customers.

Customer Service: Triage call types based on historical interactions into high/medium/low priority or risk.

Government: Understanding the demographics of users of services in order to better tailor customer experience

Technical Notes

  1. SOM uses random initialisation, therefore to get repeatable results a seed needs to be set;
  2. Min-max scaling between 0 and 1 was used given the presence of binary variables;
  3. In a single SOM scenario, one can just do dist(som$codes) to get a distance matrix for clustering purposes, but for SuperSOM we need to handle the layers. So a function was created to calculate the average weighted distance across layers and output it as the distance matrix. The weighting was based on the layer weight and also on node distance (i.e. the further the nodes are from each other on the grid, the higher the weighting), to encourage contiguous clusters. A sketch of this idea follows below.
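The original function was written in R against the kohonen model object; the sketch below is a Python analogue with scipy, under the assumptions described in note 3 (per-layer codebook arrays, layer weights, and node grid positions).

```python
# Combine per-layer codebook distances into one weighted distance matrix, penalising
# nodes that sit far apart on the grid, then cluster it hierarchically.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def combined_distance(layer_codebooks, layer_weights, grid_xy):
    grid_dist = squareform(pdist(grid_xy))            # node-to-node distance on the grid
    grid_weight = 1.0 + grid_dist / grid_dist.max()   # further apart -> higher weighting
    combined = np.zeros_like(grid_dist)
    for codes, w in zip(layer_codebooks, layer_weights):
        combined += w * squareform(pdist(codes))      # weighted per-layer codebook distance
    return combined / sum(layer_weights) * grid_weight

# Example usage (inputs assumed to exist):
# dmat = combined_distance(layer_codebooks, [0.5, 0.5], grid_xy)
# node_cluster = fcluster(linkage(squareform(dmat), method="average"),
#                         t=5, criterion="maxclust")
```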

References

A good segmentation tutorial using the Kohonen package: http://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r

kohonen package official documentation: https://cran.r-project.org/web/packages/kohonen/kohonen.pdf

Wehrens and Buydens: Self- and Super-organizing Maps in R: The Kohonen Package, Journal of Statistical Software, October 2007, Volume 21, Issue 5: https://www.jstatsoft.org/article/view/v021i05/v21i05.pdf

Fusco and Perez: Spatial Analysis of the Indian Subcontinent: the Complexity Investigated through Neural Networks, Proceedings of CUPUM 2015: http://web.mit.edu/cron/project/CUPUM2015/proceedings/Content/analytics/287_fusco_h.pdf

Connecting to Big Data Clusters from Excel

Connecting to big data clusters from Excel is now relatively simple. But those who have tried would have noticed multiple ways to achieve what seems to be the same thing. This article aims to clear these muddy waters a bit.

I am using Microsoft Office 365 Pro Plus (Excel 2013) updated to 15.0.4719.1002

Power Query updated to 2.22.4007.242

All tests performed show relative and average results only and should not be taken as authoritative performance tests, but rather as observations averaged across multiple tests.

Areas from where you can access HDInsight clusters

There are three areas from which you can access big data in HDInsight clusters within Excel, with the aim of integrating it into your BI solution:

The first is the ‘old school’ Data tab:

Image13.JPG

If you use this ‘old school’ area:

  • The variety of data source options is more limited, but it does include Hadoop by way of an ODBC connection.
  • It is a more convoluted path via ODBC to Hadoop clusters (specifically compared to the Power Query method).
  • You will be limited by Excel’s row and column limits (1,048,576 rows by 16,384 columns).

The second is to use the data connection functionality inside Power Pivot:

Image14

  • The variety of data source options is more plentiful when compared to the Data tab ODBC connection, and it also includes Hadoop by way of an ODBC connection.
  • It is still a more convoluted path via ODBC to Hadoop clusters (specifically compared to the Power Query method).
  • You are not subject to the limitations posed by Excel.

The third is to use the Power Query functionality – it is this area which will be discussed in more detail in this post:

  • You have the widest variety of data source options available via Power Query, including three methods for accessing Hadoop clusters (discussed below). New data source options are constantly added.
  • Although it provides two methods for accessing Hadoop clusters beyond ODBC, it is important to understand the limitations of each (discussed below).
  • You are not subject to the limitations posed by Excel.

As you know, HDInsight is a Hadoop cluster of compute nodes that provides MapReduce compute power over data stored in HDFS, which is not co-located with the compute nodes. By separating storage from compute, HDInsight allows data stored in containers to be accessed through HDFS APIs or through Azure Blob Storage REST APIs. This means that the data exists beyond the life of the cluster, so that compute power can be added when required, yet the data remains intact and available when the cluster no longer exists. This obviously holds a cost benefit, as compute clusters are much more costly than storage. There are other benefits not discussed here, as the focus is less on HDInsight and more on its accessibility through Excel as a BI tool.

Power Query provides three methods for accessing the data stored in HDFS: two via the “From Azure” menu, and one via the “From Other Sources” menu.

From Azure:

Image15.JPG

In the From Azure menu, you have both the “From Azure HD Insight” and the “From Azure Blob Storage” methods available (as shown in the diagram above).

The Test 1 chart below shows relative and average results from multiple tests that accessed the same data set, comparing the two methods against each other.

Image16.JPG

I see no substantial performance difference between the two methods both when the cluster is running, and after it has been deleted. Also, note that the queries generated by the two methods are very similar.

From Azure Blob Storage – query

Image17.JPG

From Azure HD Insight – query

Image18.JPG

Conclusion:

  • Using From Azure Blob Storage simply means that the storage account is accessed directly, and you do not leverage the compute power of the VMs or HIVE’s MapReduce available via the cluster.
  • You would have thought that opting for the From Azure HD Insight method means you can leverage the compute power of the VMs and HIVE’s MapReduce available via the cluster whilst the cluster runs, but tests (see the Test 1 chart) do not show the substantial improvements expected with HIVE’s MapReduce. The results of From Azure HD Insight also remain largely unchanged irrespective of the existence of the HDInsight cluster.
    • This method (as is the case with From Azure Blob Storage) is therefore limited to accessing blobs in the container and provides no way to access HIVE tables, so it renders the MapReduce programming model and its parallel, distributed compute power superfluous via this connection method. How then do we leverage the power provided by MapReduce? Read on…

From Other Sources:

This menu provides the third method for accessing HDInsight, which is through an ODBC connection after installing the Microsoft HIVE ODBC driver and setting up the appropriate System DSN. Set up the initial ODBC connection only after the cluster has been provisioned. The ODBC method can therefore only be applied whilst the cluster runs, yet it is not necessary to reconfigure the ODBC connection if a deleted cluster is recreated, as long as the name (and therefore the ODBC host address), user and password are created the same as before.

Image19.JPG
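As a side note, the same System DSN can be exercised outside Excel to confirm the connection while the cluster is running. The sketch below uses pyodbc with assumed DSN, credential and table names (hivesampletable is the sample table shipped with HDInsight).

```python
# Quick connectivity check against the Hive ODBC System DSN (names are assumptions).
import pyodbc

conn = pyodbc.connect("DSN=HDInsightHive;UID=admin;PWD=<password>", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM hivesampletable")
print(cursor.fetchone()[0])
```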

Retesting shows a marked improvement in relative and average results when accessing, via the ODBC method, the HIVE table created over the blob that was tested previously in Test 1.

Image20.JPG

Also note that the query generated compared to the previous two methods is notably different.

Image21

Overall Conclusion:

Given that there are essentially five ways to access data associated with the Hadoop cluster, and the various limitations listed above, how do you decide which method is the most appropriate? I would suggest deciding based on your particular use case:

  • If you do not have Power Pivot and Power Query enabled in your Excel deployment, and your number of records does not exceed Excel’s record limitation, then you can use the ‘old school’ ODBC DSN connection.
  • If you have Power Pivot and Power Query enabled in your Excel deployment and you require uninterrupted access to your data in the HDFS without the need for the compute power brought by Hive’s MapReduce, then consider either of the two connection options in the From Azure menu of Power Query.
  • If you do have Power Pivot and Power Query enabled in your Excel deployment and require ad-hoc/intermittent access to your data in the HDFS, supported by the compute power brought by Hive’s MapReduce (for example during a monthly reporting cycle only), then consider the From ODBC connection option in the From Other Sources menu in Power Query, together with a consistent and pre-defined configuration of the cluster each time you enter a period where data needs to be accessed with superior compute power.
  • If you do have Power Pivot and Power Query enabled in your Excel deployment and require uninterrupted access to your data in the HDFS, supported by the compute power brought by Hive’s MapReduce, then consider the From ODBC connection option in the From Other Sources menu in Power Query and a constantly running cluster. But please be aware of the cost implications of this.


Power BI and Microsoft Azure – what’s all the fuss about?

This blog provides a journey starting with Excel and its promotion to a BI tool; it then moves on to Power BI version 2 and concludes with the much larger Azure analytical story.

Let’s start this journey with some topical statistics.

  • US$97bn – the projected annual dollar spend on BI services by 2016. Good.
  • 19% – the annual growth rate of the analytics market. Good.
  • 21% – the average adoption rate of BI products; the 18–24% range has remained unchanged since 2005. Bad.

http://www.thoughtspot.com/blog/15-game-changing-bi-stats-2015

The story of Excel’s promotion into a BI tool

Image01

We all have war stories of huge amounts of effort and spend on large BI projects that often included a Data Warehouse and some Data Marts, perhaps a couple of Cubes and then some Reports. Only for the business to ignore all the hard work and use one feature of the BI solution above any other: the reports’ “export” button. The business wants the data – they want to do stuff with it.

So although the general adoption rate of BI products was and remains very low, the notable exception is Excel, as this is where the data mostly ends up after report export, for further analysis and manipulation.

The Microsoft epiphany:

Excel is pervasive in most organizations, and it is already used in a quasi-self-service BI capacity – refer to the reports’ “export button” story. So why not take Excel and transform it into a proper “BI” tool: remove the “but it’s only Excel” stigma in favour of “hey, I did not know you could do that with Excel”, provide a clear distinction between solution discovery and solution production, and yet still retain the familiarity of Excel?

As a result, MS set out to develop the “Power-” components (Power Pivot, Power Map, Power Query and Power View, with a local instance of the Analysis Services engine to support Power Pivot) and provide a deployment roadmap into SharePoint and/or SSAS. This was a stroke of genius as it could even satisfy at least some of the skeptics over the fence in ICT.

The story was a simple yet powerful one. Your BI journey starts life in Excel, in the hands of some kind of power user with intimate domain knowledge, i.e. that guy or gal who does some pretty intense data stuff in Excel as part of their day-to-day job; think of the risk analysts, actuaries, etc. Let them use their domain knowledge and skills to evolve a solution (which includes a performant semantic model in the form of Power Pivot) in a discovery mode that keeps their business capability front of mind. Then hand their solution over the fence to ICT, who enable the solution for a corporate audience, i.e. deploy it to an SSAS Tabular instance, or to SharePoint or SharePoint Online for smaller models, wrap security around it, etc.

This, in turn, allows the data professionals in ICT to discover what is required in the data platform space to complement and further enable the user experience, for example data quality, master data and shared dimensions, slowly changing history tracking, live data streams, unstructured data, etc. No longer are large data warehouse deployments a prerequisite for delivering useful solutions and insights to the business – in fact the opposite becomes the norm, i.e. let the daily analytical tasks drive what is needed in the data platform domain.

Along comes Power BI V2

Image02

Step forward to today and Power BI for Excel has a new(ish) and much better twin in the form of Power BI version 2. But why create a new and better “twin” rather than just improve the existing guy?

As the adoption of Power BI in Excel increased, so too did the challenges. MS and we in the PTSP community pushed this adoption hard (“BI to a billion people” – http://blogs.seattletimes.com/microsoftpri0/2014/02/10/microsofts-goal-for-power-bi-for-office-365-reaching-a-billion-people/ ) and as a consequence we realized some of its limitations. These challenges needed to be resolved in order for Power BI to become a serious player in a relatively crowded market.

Power BI in Excel vs. Power BI V2

With the ability to easily access datasets, including very large ones, came a requirement for more compute power. 32-bit versions of Excel are limited to 2 GB of virtual address space, shared by the entire workbook, including spreadsheets, the Power- add-ins and the subsequently created models, whereas 64-bit versions impose no such limits and the workbook size is instead limited only by local memory and resources. The problem is that the 32-bit version of Office is the more common deployment in most organizations, and it is simply not reliable enough for serious data workloads.

The second limitation relates to upgrades. Excel is built on a solid and stable code platform, and changing that is no easy feat, so pushing periodic updates in the Excel “Power-” space was always going to be an issue.

Power BI V2 removes these limitations: no longer is 32-bit vs. 64-bit an issue, and MS can now push constant updates out to the user community with ease (the September 2015 update (2.27.4163.351), for example, included 44 additional features – https://support.powerbi.com/knowledgebase/articles/489224-what-s-new-in-the-latest-power-bi-desktop-update ). It also ensures Power BI is backed by the huge computational engines in MS’s data centers.

Does this mean Power BI in Excel no longer applies? No, not at all, but we just need to recognize that there are two flavours of the same product available, the Excel version and the Power BI V2 version, and that they will likely serve two different audience profiles: the power analysts in organisations who are used to getting their hands dirty with Excel, and the data-proficient business users who prefer drag-and-drop environments.

We also need to recognize that improvements to the Excel version will always lag behind the Power BI Vn version, and as a result the capabilities of the latter will most likely always trump those of the former.

The two versions will also have slightly different deployment stories: The Power Pivot part of Excel becomes the semantic layer in a deployed SSAS Tabular solution, whereas the Query and Modelling features of Power BI V2 become the semantic model out there in Azure-land.

The two versions are however very similar under the covers, as shown in this table:

Image03

* Please remember that this version will be an improvement on the Excel version due to the regular updates pushed to the user community, as discussed earlier.

The bigger Azure analytics story

Image04

Please note that this is not an in-depth discussion of Cortana Analytics or the IoT suites in Azure (follow-up blogs will cover these topics); the focus here is on Power BI’s integration with some of the key products in these suites.

I think it’s safe to assume that all the vendors, or the vendor partners representing Qlik, Tableau, Oracle, etc., will be able to make a strong case for their specific technologies, and some will fare better than others in terms of characteristics such as visualisations, breadth of data sources, cost, etc., so I will not delve into the advantages and/or disadvantages of Power BI V2 (hereafter simply referred to as “Power BI”) vs. its competitors. Also, given the manner in which Power BI receives constant pushed updates, a disadvantage today may no longer be the case tomorrow, so such comparisons become largely superfluous. What does remain relevant is the bigger landscape in which Power BI operates, and I think this is where it has a clear advantage over its competitors.

Windows Azure

In order to understand this advantage of Power BI, we need to consider some topical concepts:

Big data, live streams of data (including IoT), predictive analytics and mobility:

This could involve ingesting high-volume structured or unstructured data generated by devices (IoT) through the data platform, putting that data in the context of more structured data from internal and external sources in interactive metadata models, and exposing it to users through easy-to-use and performant self-service BI platforms.

Use case

Think of a real-world use case where a government transport agency wanted to improve traffic flow through a city, along with the public perception thereof and the costs associated with it. For monitoring and analysis they would need:

  • Real-time flow of vehicles to show what is happening right now – measured by sensors at road intersections.
  • Social media and its sentiment to monitor perception – for example, tweets monitoring keywords, @-mentions and #-tags.
  • Road maintenance operational data captured in an on-premise operational maintenance system.
  • Cost information obtained from a General Ledger system.
  • Demographics by postcode obtained from the Australian Bureau of Statistics and Australian Tax Office.
  • Access to historical views of in-flight data (real-time flow of vehicles and social media sentiment) and some key historical values from the on-premise systems to track changes over time.

Power BI has deep integration with the relevant services in the Azure ecosystem. The ability to visualize real-time data flows, to use real-time and historical views of data to predict future trends, to report on and analyse real-time and historical data augmented with prediction scores in a self-service and performant platform, and to raise proactive alerts on a range of devices when thresholds are breached is therefore a key selling point for Power BI.

The overall solution architecture is shown in the diagram below. Please note that the Azure products and services shown are not comprehensive, but the architecture shown will satisfy the requirements of the use case.

Architectures for the constituent use case parts are shown thereafter. You can see what a prominent role Power BI plays in the overall architecture as well as the architectures for the constituent parts.

Image05

  • Use case parts 1 & 2 involve a more conventional (recognisable) deployment of BI.
  • Use case part 3 includes live data feeds (into live visualizations and as a historical view).
  • Use case part 4 involves big data.
  • Use case part 5 closes the loop and provides predictive analytics.

Use case 1 – create useful analytics across sourced data

Image06.JPG

Maintenance data is sourced directly from the Operational Maintenance system, associated costs from the General Ledger system and demographics (such as population and average incomes) from web sources such as the ATO and the ABS websites.

Data is mashed together and enhanced to create business-friendly Semantic Models, which are then used to visualize the data in reports and dashboards.
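For illustration only, here is a minimal Python/pandas sketch of that kind of mash-up, using hypothetical extracts and column names (in the real solution this shaping happens in Power BI’s query editor rather than in code):

```python
import pandas as pd

# Hypothetical extracts - file names and columns are illustrative only
maintenance = pd.read_csv("maintenance_jobs.csv")   # job_id, postcode, contractor, completed_date
costs = pd.read_csv("general_ledger.csv")           # job_id, cost_amount
demographics = pd.read_csv("abs_demographics.csv")  # postcode, population, average_income

# Mash the sources together into one business-friendly table
jobs_with_costs = maintenance.merge(costs, on="job_id", how="left")
model_table = jobs_with_costs.merge(demographics, on="postcode", how="left")

# Enhance with a simple derived measure, e.g. maintenance cost per head of population
model_table["cost_per_capita"] = model_table["cost_amount"] / model_table["population"]

print(model_table.head())
```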

Visualisations are accessed by users on any device and alerts (such as when certain cost thresholds are breached) are managed at the user end via their own devices.

A note on Semantic Models – Power BI uses the SSAS Tabular/xVelocity in-memory engine under the covers, and as it is built on Azure it is backed by literally thousands of Azure servers, so performance was always going to be good.

Use case 2 – store key data in data mart structures to allow for history tracking where the source system does not.

Image07

Source systems often do not track non-transactional history (for example, in this scenario, changes to maintenance contractors’ names). Here this is achieved by way of SCD2 (slowly changing dimension, type 2) tracking using SSIS and a data store hosted in SQL DB.

The Semantic Model is now extended to include SCD2 history tracked data.
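For readers unfamiliar with the pattern, here is a minimal, purely illustrative Python/pandas sketch of the SCD2 logic the SSIS package would implement (contractor names, columns and dates are hypothetical; the real solution runs this inside SSIS against SQL DB):

```python
import pandas as pd

TODAY = "2015-11-01"     # load date (ISO-date strings keep the sketch simple)
OPEN_END = "9999-12-31"  # "still current" marker

# Existing dimension rows (hypothetical columns): one current row per contractor
dim = pd.DataFrame([
    {"contractor_id": 1, "contractor_name": "Acme Roads",
     "valid_from": "2014-01-01", "valid_to": OPEN_END, "is_current": True},
])

# Latest snapshot from the source system - the contractor has been renamed
source = pd.DataFrame([{"contractor_id": 1, "contractor_name": "Acme Road Services"}])

# Find current rows whose tracked attribute has changed
merged = dim[dim.is_current].merge(source, on="contractor_id", suffixes=("_dim", "_src"))
changed = merged[merged.contractor_name_dim != merged.contractor_name_src]

for _, row in changed.iterrows():
    # Expire the old row...
    mask = (dim.contractor_id == row.contractor_id) & dim.is_current
    dim.loc[mask, "valid_to"] = TODAY
    dim.loc[mask, "is_current"] = False
    # ...and insert a new current row carrying the new name
    dim = pd.concat([dim, pd.DataFrame([{
        "contractor_id": row.contractor_id, "contractor_name": row.contractor_name_src,
        "valid_from": TODAY, "valid_to": OPEN_END, "is_current": True,
    }])], ignore_index=True)

print(dim)
```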

Use case 3 – add live data streams to live visualizations

Image08

High-volume data from the traffic sensors and social media is ingested by Event Hubs and passed to the complex event processing engine, Stream Analytics, from where queries push the data directly into Power BI for live visualization alongside the existing visualizations.
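As a rough sketch of the sensor side of this pipeline, assuming the azure-eventhub Python SDK and a hypothetical Event Hub and connection string (the Stream Analytics job and its Power BI output are configured in the Azure portal rather than in code):

```python
import json
from datetime import datetime, timezone

from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection details - substitute your own namespace and hub
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENT_HUB_NAME = "traffic-sensors"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
)

# Send a small batch of simulated intersection readings
with producer:
    batch = producer.create_batch()
    for intersection in ("INT-001", "INT-002", "INT-003"):
        reading = {
            "intersectionId": intersection,
            "vehicleCount": 42,  # would come from the road sensor
            "eventTime": datetime.now(timezone.utc).isoformat(),
        }
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```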

Alerts also become very useful in this context, for example when thresholds such as traffic volumes are breached.
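Power BI also exposes a REST push mechanism for streaming datasets, which can be handy for testing a live dashboard tile before the full Event Hubs/Stream Analytics pipeline is in place. A minimal sketch, assuming the Python requests library and a hypothetical push URL copied from the dataset’s API info page:

```python
import requests

# Hypothetical push URL for a streaming (push) dataset in Power BI
PUSH_URL = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>"

rows = [{"intersectionId": "INT-001", "vehicleCount": 42, "eventTime": "2015-11-01T08:00:00Z"}]

response = requests.post(PUSH_URL, json=rows)
response.raise_for_status()  # a 200 means the rows appear on live tiles almost immediately
```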

Use case 4 – the same live data streams are passed to cost-effective storage for a historical capture of the data.

Image09.JPG

The same high-volume data from the traffic sensors and social media is also pushed to cost-effective Azure Storage, rather than only being pushed directly into live visualizations in Power BI, so as to accumulate history over time. This is then used to augment the existing visualizations to show not only what is happening right now, but also what has happened in the past.

The sheer volume of data accumulated in this way may require the help of a big data cluster, which is achieved through HDInsight.
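As an illustration of what the landed history might look like, here is a minimal sketch using the azure-storage-blob Python SDK with a hypothetical storage account and container (in the architecture above, Stream Analytics writes this output itself):

```python
import json
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

# Hypothetical connection details
CONNECTION_STR = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
CONTAINER = "traffic-history"

events = [
    {"intersectionId": "INT-001", "vehicleCount": 42, "eventTime": "2015-11-01T08:00:00Z"},
    {"intersectionId": "INT-002", "vehicleCount": 17, "eventTime": "2015-11-01T08:00:05Z"},
]

service = BlobServiceClient.from_connection_string(CONNECTION_STR)

# Partition blobs by date so downstream HDInsight/Spark jobs can prune what they read
now = datetime.now(timezone.utc)
blob_name = f"sensors/{now:%Y/%m/%d}/events-{now:%H%M%S}.json"
blob_client = service.get_blob_client(container=CONTAINER, blob=blob_name)
blob_client.upload_blob("\n".join(json.dumps(e) for e in events))
```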

A note on Azure Data Sources – The diagram below (taken from https://support.powerbi.com/knowledgebase/articles/568614-azure-and-power-bi ) shows some of the Azure Data Sources accessed through Power BI (not a comprehensive list). It shows both Storage and HDInsight as sources.

Image10.JPG

Notes on Big data beyond HDInsight – Big data integration with Power BI can be achieved via other means too, for example through Spark on HDInsight, and indeed through SQL Data Warehouse (which provides “massively parallel processing” power in the Azure ecosystem). A rough sketch of the Spark route follows the links below:

http://blogs.msdn.com/b/powerbi/archive/2015/07/14/visualize-big-data-with-power-bi-and-spark-for-azure-hdinsight.aspx

http://blogs.msdn.com/b/powerbi/archive/2015/06/24/exploring-azure-sql-data-warehouse-with-power-bi.aspx
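As a rough sketch of the Spark-on-HDInsight route, assuming PySpark and the hypothetical blob layout from the previous sketch (the aggregated result could then be surfaced to Power BI through its Spark connector):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("traffic-history").getOrCreate()

# Hypothetical wasb:// path matching the year/month/day blob layout used for landed sensor events
events = spark.read.json("wasb://traffic-history@<account>.blob.core.windows.net/sensors/*/*/*/*.json")

# Daily vehicle counts per intersection - a shape well suited to historical Power BI visuals
daily = (
    events
    .withColumn("eventDate", F.substring("eventTime", 1, 10))
    .groupBy("eventDate", "intersectionId")
    .agg(F.sum("vehicleCount").alias("totalVehicles"))
)

daily.show()
```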

Use case 5 – predictive analytics

Image11.JPG

Use case 2 provides for operational and finance data in a data repository.

Use case 4 provides for a historical view on large volumes of unstructured data.

Other publicly available data sources provide historical data on weather, population demographics, macroeconomic indicators such as unemployment, etc.

This data can now be brought together and passed through Machine Learning algorithms to achieve a range of predictive outcomes. In this scenario, we use a whole range of historical attributes, including planned maintenance, weather, employment and social media sentiment, to predict traffic flows based on forecasted weather, planned maintenance, and current employment and social media sentiment. Proactive alerts can become very relevant in this scenario.

In this scenario, predictive scores are stored in the cost-effective Azure Storage accounts from where they are included in the Visualizations and the Alerts.
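As an illustration only, here is a minimal Python sketch of requesting a single prediction from a published Machine Learning web service and landing the score in the same Blob storage; the endpoint URL, API key, input schema and connection string are all hypothetical:

```python
import json

import requests
from azure.storage.blob import BlobServiceClient

# Hypothetical scoring endpoint and key for the published ML web service
SCORING_URL = "https://<region>.services.azureml.net/workspaces/<ws>/services/<svc>/execute?api-version=2.0"
API_KEY = "<api-key>"

# Hypothetical input schema: the attributes the traffic-flow model was trained on
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["plannedMaintenance", "forecastRainMm", "unemploymentRate", "sentimentScore"],
            "Values": [["1", "12.5", "5.8", "0.42"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(
    SCORING_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    data=json.dumps(payload),
)
response.raise_for_status()
scores = response.json()

# Land the prediction scores in the same cost-effective storage used for history
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
service.get_blob_client(container="traffic-history", blob="scores/latest.json").upload_blob(
    json.dumps(scores), overwrite=True
)
```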

Notes on other routes for Predictive Analytics – Machine Learning can, of course, distribute scores to Stream Analytics and/or web services to achieve a real-time flow of predictions (either to Power BI or to other applications), or scores can be passed to SQL DB or SQL DW for inclusion in other datasets.

Conclusion

You can see how the tight integration between Power BI and the Azure Analytics ecosystem services places it front and center in the Microsoft Analytics story, whether it be through the Cortana Analytics suite, or even the IoT suite (http://www.informationweek.com/cloud/infrastructure-as-a-service/microsofts-azure-iot-suite-is-now-available/d/d-id/1322388 ).

You can also see how factors such as the deep integration with topical and relevant cloud services, the simplicity with which this integration can be achieved, the constant updates that keep the product cutting edge, and a low price point make Power BI a very big deal.