Financial forecasting, reporting and commentary at scale – LSA Synergy Case Study

Synergy gave LSA (Lifetime Support Authority) financial forecasting, reporting, commentary and interactive analysis at scale. This translated into substantial business benefits, such as:

  • Effective, low-latency financial reporting
  • Reduced reliance on IT to manage reporting structures
  • Significant savings in reporting and analysis effort
  • Commentary that contextualizes data and stays with the analysis for its lifetime
  • Auditing and workflow
  • And more

See our case study here: exposé case study – LSA – Synergy Solution

For more information see the Synergy brochure here: synergy-brochure

Or contact us at info@exposedata.com.au to view a video demonstration of Synergy.

 

Is the Data Warehouse Dead?


I am increasingly asked by customers – Is the Data Warehouse dead?

In technology terms, 30 years is a long time, and that is roughly how old the Data Warehouse is – it is an old timer. Should we consider it a mature yet productive worker, or one gearing up for a pension?

I come from the world of Data Warehouse architecture, and in the mid-to-late noughties (2003 to 2010), while I was working for various high-profile financial services institutions in the London market, Data Warehouses were considered all-important and companies spent a lot of money on their design, development and maintenance. The prevailing consensus was that you could not get meaningful, validated and trusted information to business users for decision support without a Data Warehouse (whether it followed an Inmon or a Kimball methodology – the pros and cons of which are not under the spotlight here). The only alternative for companies without the means to commit to the substantial investment typically associated with a Data Warehouse was to allow report writers to develop code directly against the source systems' databases (or a landed version thereof). This, of course, led to a proliferation of reports, caused a massive maintenance nightmare and went against every notion of a single trusted source of the truth.

Jump ahead to 2011, and businesses started showing a reluctance to invest in Data Warehouses – a trend that accelerated from that point onward. My observations of the reasoning for this ranged from the cost involved, the lack of quick ROI and low take-up rates, through the difficulty of aligning a Data Warehouse with ongoing business change, to, more recently, the change in the variety, volume and velocity of data that businesses are interested in.

In a previous article, “From watercooler discussion to corporate Data Analytics in record time” (https://exposedata.wordpress.com/2016/09/01/from-watercooler-discussion-to-corporate-data-analytics-in-record-time/), I stated that the recent acceleration of change in the technology space “…now allows for fast response to red-hot requirements…” and that the “…advent of a plethora of services in the form of Platform-, Infrastructure- and Software as a Service (PaaS, IaaS and SaaS)… are proving to be highly disruptive in the Analytics market, and true game changers.”

Does all of this mean the Data Warehouse is dead, or dying? Is it an old timer getting ready for a pension, or does it still have years of productive contribution to the corporate data landscape left?

My experience across the Business Intelligence and Data Analytics market, spanning multiple industries and technologies, has taught me the following:

A Data Warehouse is no longer a must-have for delivering meaningful, validated and trusted information to business users for decision support. As explained in the previous article, the PaaS, SaaS and IaaS services that focus on Data Analytics (for example the Cortana Intelligence Suite in Azure (https://www.microsoft.com/en-au/cloud-platform/cortana-intelligence-suite) or the Amazon Analytics Products (https://aws.amazon.com/products/analytics/)) allow for modular solutions that can be provisioned as required. Collectively, these answer the full range of Data Analytics challenges and ensure data gets to users – no matter where it originates, or what its format, velocity or volume – fast, validated and in a business-friendly format.

These modular Data Platforms, built from a clever mix of PaaS, SaaS and IaaS services, cannot, however, easily provide some of the fundamental services of a Data Warehouse (or, more accurately, of components typically associated with a Data Warehouse), such as:

  • Tracking history where operational systems do not track it but the analytical requirements demand it (for example, through Type 2 slowly changing dimensions) – see the sketch after this list.
  • Handling business rules and transformations so complex that it makes sense to define them through detailed analysis and hard-code them into the landscape as code and materialised data, in structures that the business understands and can reuse (for example, dimensions and facts resulting from complex business rules and transformations).
  • Providing the complex hierarchies required by the reporting and self-service layer.
  • Meeting regulatory requirements such as proven data lineage, reconciliation, and retention mandated by law (for example, Solvency II, Basel II and III, and Sarbanes-Oxley).
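
To make the first point concrete, below is a minimal pandas sketch of Type 2 slowly changing dimension handling – purely illustrative, with hypothetical table and column names, and not a reference to any particular platform.

```python
import pandas as pd

# Hypothetical customer dimension with SCD Type 2 tracking columns.
dim = pd.DataFrame([
    {"customer_id": 1, "segment": "Retail",
     "valid_from": "2016-01-01", "valid_to": None, "is_current": True},
])

# Incoming snapshot from an operational system that keeps no history.
snapshot = pd.DataFrame([{"customer_id": 1, "segment": "Corporate"}])

def apply_scd2(dim, snapshot, load_date):
    """Expire changed rows and append new versions, preserving history."""
    merged = snapshot.merge(dim[dim["is_current"]], on="customer_id",
                            how="left", suffixes=("", "_dim"))
    changed = merged[merged["segment"] != merged["segment_dim"]]

    # Close off the current version of any changed customer.
    mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[mask, ["valid_to", "is_current"]] = [load_date, False]

    # Append the new version as the current row.
    new_rows = changed[["customer_id", "segment"]].copy()
    new_rows["valid_from"] = load_date
    new_rows["valid_to"] = None
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)

dim = apply_scd2(dim, snapshot, "2016-09-01")
print(dim)  # both the Retail and the Corporate versions are retained
```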

Where these requirements exist, a Data Warehouse – or, more accurately, the components typically associated with one – is still required. Even then, though, those components will merely form part of a much larger Data Analytics landscape: they will perform the workloads described above, while the larger data story is delivered by complementary services.

In the past, Data Warehouses were key to delivering optimized analytical models that normally manifested themselves in materialized Data Mart Star Schemas (the end result of a series of layers such as ODS, staging, etc.). Such optimized analytical models are now instead handled by business-friendly metadata layers (e.g. Semantic Models) that source data from any appropriate source of information, bringing fragmented sources together in models that are quick to develop and easy for the business to consume. These sources include the objects typically associated with a Data Warehouse or Mart (for example materialized SCD2 Dimensions, materialized facts resulting from complex business rules, entities created for regulatory purposes, etc.), blended with data from a plethora of additional sources. The business user still experiences a clean and easy-to-consume Star Schema-like model. The business-friendly metadata layer becomes the Data Mart, but it is easier to develop, provides a quicker ROI and is much more responsive to business change.
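
As a loose illustration of that blending (hypothetical tables and column names; a real semantic layer would normally be built in a modelling tool rather than hand-written code), consider:

```python
import pandas as pd

# A materialised SCD2 dimension, as it might come from warehouse components.
dim_customer = pd.DataFrame([
    {"customer_id": 1, "segment": "Retail", "is_current": False},
    {"customer_id": 1, "segment": "Corporate", "is_current": True},
])

# Fresh transactions landed directly from an operational or streaming source.
transactions = pd.DataFrame([
    {"customer_id": 1, "amount": 250.0, "channel": "web"},
    {"customer_id": 1, "amount": 90.0, "channel": "branch"},
])

# The "semantic" view: blend the two into a star-like, business-friendly shape.
star_view = transactions.merge(
    dim_customer[dim_customer["is_current"]][["customer_id", "segment"]],
    on="customer_id", how="left")

print(star_view.groupby(["segment", "channel"])["amount"].sum())
```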

Conclusion

The Data Warehouse is not dead, but its primary role as we knew it is fading. It is becoming complementary to the larger Data Analytics Platforms we see evolving. Some of its components will continue to fulfil a central role, but they will be surrounded by all manner of services, and collectively these will fulfil the organisation’s data needs.

In addition, we see the evolution of the Data Warehouse as a Service (DWaaS). This is not a Data Warehouse in the typical sense used in this article, but rather a service optimized for analytical workloads. Can it serve those requirements typically associated with a Data Warehouse, such as SCD2, materialization driven by complex rules, hierarchies or regulatory requirements? Absolutely. But its existence does not change the need for modular, targeted architectures and a much larger Data Analytics landscape using a variety of PaaS (including DWaaS), IaaS and SaaS; it merely makes the hosting of typical DW workloads much simpler, better performing and more cost-effective. Examples of DWaaS are Microsoft’s Azure SQL DW and Amazon’s Redshift.

 

 

Visualizations are the new black

This is the 3rd in a series of 3 articles:

First, we looked at Colouring with Numbers – can data present a better picture

The second showed how a story is worth a thousand visuals

Now we conclude with how Visualisations are the new black

 

In my last blog, “A story is worth a thousand visuals”, we discussed how to lay out a report so that it entices the audience and leads them through to the information that matters.

Now that we have gotten the audience to this point, it would all be for nothing if they can’t effectively interpret what they are seeing. This is where the choice of visualization used to present the information is paramount. Choose the right visualization and the audience can understand and interpret the information clearly. Choose the wrong one and the information can become lost or misinterpreted.

“So how do I know that I have chosen the right visualization?”

Glad you asked.

You won’t.

No matter how you believe the information should be displayed, it is ultimately the audience you are delivering to that will determine whether what you are portraying is effective.

To help make a visualization as effective as possible, these are the five rules I use:

Rule 1 – “Always consult with your audience”

You will always be closer to the data than your audience, and you will naturally use this to form your own beliefs about the correct way to represent the information. If you consult with your audience, they will help ensure you maintain an objective view. If you can’t consult with your audience, try at least to seek an independent reviewer. If you find you are having to explain what they are looking at, that is a good sign the visualization you have created isn’t achieving its intended purpose. Remain objective and open to critique. Everyone perceives information in different ways, and you have to remember that your work needs to be received by an audience that may not see the information the same way you do.

Rule 2 – “Understand what it is you are trying to say”

If you can’t understand what it is you are trying to visualize, how can you effectively translate it into something that can be understood by others?

Before choosing any visualization, stop and take a minute to ensure you have formulated the question to which this visualization is going to provide the answer.

Always check your understanding of the true intent of the question. Interrogate further when required. Make sure the breadth and depth of what is being requested are captured ensuring the specific detail actually sought can be represented within your visualization.

Along with understanding what it is you are trying to visualize, check that the data that you are using is accurate and correctly represents the question and answer.

Rule 3 – “Choose an appropriate visualization”

The goal of data visualization is to communicate information as efficiently and clearly as possible to an audience, to enable analysis and understanding. It reinterprets complex data to make it more accessible, and as much as the interpretation of data is a science, the presentation of data is an art.

Edward Tufte, a noted leading figure in data visualization, wrote in his book “The Visual Display of Quantitative Information” the following principles for effective visualization:

“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:

  • show the data
  • induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
  • avoid distorting what the data has to say
  • present many numbers in a small space
  • make large data sets coherent
  • encourage the eye to compare different pieces of data
  • reveal the data at several levels of detail, from a broad overview to the fine structure
  • serve a reasonably clear purpose: description, exploration, tabulation or decoration
  • be closely integrated with the statistical and verbal descriptions of a data set.

Graphics reveal data. Indeed, graphics can be more precise and revealing than conventional statistical computations.” (Tufte, 1983)

So where do we start?

A good guide to determining which chart type presents which type of information was developed by Dr. Andrew Abela and is shown below in Figure 1. It is based on the four analytical models of comparison, composition, distribution and relationship.


Figure 1 (Abela)

This provides a great starting point for choosing a visual representation of the data. But remember to use this as a guide. Always assess if the visual is staying true to what you are trying to present.

Some good resources to assist further with understanding types of visualizations include:

www.datavizcatalogue.com

http://labs.juiceanalytics.com/chartchooser/index.html

http://annkemery.com/essentials/

 

Rule 4 – “Use color to enhance the visual and not detract”

Colour can be a powerful tool to draw your audience in and focus their attention. But it has to be used with care as it can just as easily detract and cause confusion. Once again Edward Tufte provides us with some guidance here:

“…avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” (Tufte, Envisioning Information, 1990)

We process color before we are even consciously aware that we are interpreting it, and we can use this to our advantage when presenting a visualization, giving the audience clarity and direction in interpreting the information.

Colour should be used sparingly and applied only when it adds meaning to the data.

For example, take the following two column charts. Both display exactly the same information.


Figure 2


Figure 3

The chart in Figure 2 is harder to interpret than Figure 3 because of the individual colors chosen for each column. When the audience sees Figure 2, they instinctively try to apply a meaning to the color scheme. It is better to remove this mental fatigue and use a single color, as in Figure 3; the audience will then interpret the columns as the same data, with the comparison occurring at the individual column level.

Try to use soft colors predominantly, reserving more intense colors for drawing attention to specific points of interest.
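
The idea behind Figures 2 to 5 – one soft colour for context and one stronger accent for the point of interest – can be sketched in a few lines of matplotlib (illustrative data and colour choices only):

```python
import matplotlib.pyplot as plt

categories = ["Q1", "Q2", "Q3", "Q4"]   # illustrative data only
values = [120, 145, 98, 170]

# One soft colour for every bar, a stronger accent only for the bar of interest.
colors = ["#2171b5" if cat == "Q4" else "#c6dbef" for cat in categories]

fig, ax = plt.subplots()
ax.bar(categories, values, color=colors)
ax.spines["top"].set_visible(False)     # mute helper elements such as borders
ax.spines["right"].set_visible(False)
ax.set_title("Sales by quarter (Q4 highlighted)")
plt.show()
```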


Figure 4


Figure 5

As shown in Figure 5, increasing the lightness of the surrounding colors allows the intended data point to be drawn into focus, compared with the same information represented in Figure 4.

In concert with trying to highlight the relevant data, helper information such as axes, data labels, background colors and borders should be muted so as not to detract from the information being presented. Figures 6 – 9 below show some examples of how this may look when not taken into consideration.


Figure 6


Figure 7


Figure 8


Figure 9

To ensure consistency and cohesiveness throughout your visuals, establish a color palette that you can use. The palette should enable you to display data of the following types: sequential, diverging and categorical.

Sequential color palettes are used to organize quantitative data from high to low using a gradient effect. You generally want to show a progression rather than a contrast, and a gradient-based color scheme lets you show exactly that.


Figure 10

Diverging palettes show information that moves outward from an identified central point of the data range. A typical diverging palette uses two different sequential palettes that diverge from a shared light color toward dark colors at each extreme, providing a natural visual order that assists the audience in interpreting the progression.


Figure 11

Categorical color palettes are used to highlight categories of data. With categorical data, you typically want to create a lot of contrast to ensure the visual distinction between each category. To do this use different hues to represent each of your data points.


Figure 12
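
Figures 10 to 12 illustrate the three palette types. As a rough code equivalent, the sketch below renders one example of each using matplotlib's built-in colormaps (the colormap names are my illustrative choices, not ones prescribed here):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.arange(10).reshape(1, -1)  # a simple 0..9 strip to colour

# One built-in colormap per palette type (illustrative choices).
palettes = {"Sequential": "Blues", "Diverging": "RdBu", "Categorical": "tab10"}

fig, axes = plt.subplots(len(palettes), 1, figsize=(6, 2))
for ax, (name, cmap) in zip(axes, palettes.items()):
    ax.imshow(data, cmap=cmap, aspect="auto")
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_ylabel(name, rotation=0, ha="right", va="center")
plt.show()
```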

After establishing your palette, ensure you include complementary colors with reduced brightness, enabling you to use the colors from your primary palette to highlight and the secondary palette to support. If we were to do this with Figure 12, it would create a palette as shown in Figure 13 below.


Figure 13

Fortunately, there are many websites that can assist you in establishing your color palettes without having to have an intimate understanding of color theory. Some recommendations I can make are:

http://paletton.com

http://colorbrewer2.org/

http://tools.medialab.sciences-po.fr/iwanthue/

http://www.colorhexa.com/

https://color.adobe.com/create/color-wheel/

A final word on the use of color wouldn’t be complete without recognizing accessibility requirements. Approximately 10% of males and 1% of females have poor colour perception, commonly referred to as colour blindness. It is recommended that, when designing your palettes, the colours you choose accommodate this. Colorhexa has a good visual tool to help you understand how a colour is perceived under the different types of colour perception.

Rule 5 – “Ensure clarity in your visualisation”

When designing your visualisation, remember that the key is to communicate information as clearly and quickly as possible. Only visualise information that is relevant and enhances what is being interpreted.

Now that you have constructed your visual, stand back and look at it. Squint. Is there anything that detracts or confuses the information you are trying to present?

An example of how too much noise can cause confusion is illustrated below in Figures 14 and 15.

In Figure 14, the Republican staff of the American Joint Economic Committee released a chart to demonstrate the complexity of the Affordable Care Act.


Figure 14

An American citizen, Robert Palmer, felt that the chart was purposely designed to exaggerate the complexity of the topic by making the chart itself difficult to read. He therefore redrew it, as shown in Figure 15, to demonstrate that, while the topic remains complex, the information could still be presented with clarity.


Figure 15

(Palmer)

TL;DR

In summary, if you have skimmed to the bottom of this looking for the quick answers, here are my rules for visualisation:

Rule 1 – “Always consult with your audience”

Rule 2 – “Understand what it is you are trying to say”

Rule 3 – “Choose an appropriate visualisation”

Rule 4 – “Use colour to enhance the visual and not detract”

Rule 5 – “Ensure clarity in your visualisation”

 

References

Abela, D. A. (n.d.). Charts. Retrieved from Extreme Presentation: https://extremepresentation.com/design/7-charts/

Palmer, R. (n.d.). Retrieved from Flickr: http://www.flickr.com/photos/robertpalmer/3743826461/

Tufte, E. (1983). The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press.

Tufte, E. (1990). Envisioning Information. Graphics Press.

 

 

 

Disrupting the banking market

This video shows a comprehensive solution geared around disruption in the Banking market, from transactions through to advanced analytics.

The viewer meets 3 different customers, their challenges and how the bank responds to them.

Test drive AWS QuickSight

As part of our commitment to deliver the best possible business outcome for our Advanced Analytics customers, we ensure that we remain across the technologies that enable us to deliver such outcomes. This test drive of AWS’ QuickSight BI tool and its underlying parallel processing engine (SPICE) is part of that commitment to our customers and to the wider Data and Analytics market.

View our video test drive:

And download the full article here.

 

A story is worth a thousand visuals

In my previous article (Colouring with numbers – Can data present a better picture?), I outlined some principles that I have found to be useful when creating data visualizations. I also promised to take you through each of the principles I had outlined in more depth to hopefully help explain the concepts over the next few weeks.

The first principle I will cover is that of storyboarding.

Any report can contain information. It’s relatively easy to just place random visualizations together to display information from a multitude of data sources. But without context or structure does this add any value or provide a better experience for the report reader?

Our job as report writers is to try and make it as easy as possible for the reader to consume the information and then apply this to make interpretive decisions. If we fail to carefully curate the requirements of a report, poor or even wrong inferences can be drawn from the information that is presented even though the data underpinning the report may be correct.

As discussed previously, people spend less and less time on in-depth reading as information moves to screen-based technology. Facts and understanding need to be imparted to the reader as quickly as possible, otherwise they won’t spend the time to synthesize the information.

Ironically, you may even be doing it now as you read through this blog, assessing the highlights to quickly evaluate whether it’s worth your time to read.

So how do we ensure we craft something that will meet the challenges of engagement and understanding with the report reader?

By following a methodology called storyboarding.

Storyboarding was originally created in the movie industry to help plan the camera work and ensure continuity of the story. It allowed film-makers to shoot various scenes out of order and then splice them back together into a coherent story. This is similar to what we need to do when creating reports: we take themes or topics and visually place them in a contextually relevant order that will lead the report reader through the information presented to them.

So how do we start this process?

  1. Gather requirements – Write out the concepts required in the report (my personal preference is to use Post-It® notes for this), one requirement to a page. I’d also recommend leaving room above each topic for sketching ideas of potential visualizations or presentations.
  2. Create themes – Once you have each of the requirements, identify themes that are contextually relevant to the report.


  3. Sequence – Order the themes and requirements together. Add, move or remove requirements and themes as required to create the “story” for the report. Think about how the story should be constructed and flow, and how report readers may traverse the information within the report.

Linear Story Sequence


Problem Analysis


Comparative Analysis / Cause and Effect


Finally, once your storyboard is complete, read through it and ensure that it meets the requirements of the three key areas:

  • Audience – who is this information intended for?
  • Function – what is the purpose of this report?
  • Presentation – how is the report to be displayed?

Once you are satisfied with the storyboard, you are ready to start identifying the visualizations that will best present the information, which I will cover in my next article.

 

 

The battle of the AMLs – Amazon Machine Learning Vs Azure Machine Learning

As machine learning has become more accessible to businesses and the number of products on the market has grown, we are regularly asked, as leaders in data and analytics, to recommend or at least provide insight into some of those products. To explore a use case, it was decided to use Amazon Web Services and Microsoft Azure, as both are comparable in the market and both are cloud-based offerings, giving businesses lower setup times and ongoing costs. Azure and Amazon were also selected because of our familiarity with Microsoft and Amazon Web Services products.

Microsoft and Amazon achieve similar outcomes in different ways, and any perceived strengths and weaknesses highlighted here are based on my personal experience using the products. Do note that both Microsoft and Amazon enhance their offerings frequently. This article is based on research conducted in September 2016, and some things may have changed since the time of writing.

Use case – The use case is centered on a common retail-related problem. A business offers services to clients, but its sales promotion activities were not targeted at specific groups of clients, because the business had no in-depth knowledge of which services to push to which segment of clients. Below is a comparison of Amazon’s and Microsoft’s Machine Learning solutions based on my findings when trying to solve this use case.

 

Algorithms – Azure Machine Learning allows multiple ways to try and solve a given problem

Azure has a plethora of algorithms to choose from, compared to Amazon which only has one. A simple Google search for an Azure Machine Learning cheat sheet will give you good guidance on which algorithms are best for a particular problem. Amazon, however, is limited to logistic regression, as confirmed on the Amazon Machine Learning Frequently Asked Questions page (https://aws.amazon.com/machine-learning/faqs/). While logistic regression has its place, it is only ideal when there is a single decision boundary, and most use cases require trialing multiple algorithms to achieve a good outcome.

Findings – If you need to predict whether a client will make another purchase or not, then logistic regression would be ideal. However, when you need to understand which services particular clients are more likely to purchase, more capability is needed from the algorithms to generate a good outcome. This is where Azure leads the way.
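
To illustrate the difference with a quick scikit-learn sketch (entirely synthetic data, and not the actual implementation used by either cloud service): logistic regression suits a binary “will they buy again?” question, while a multi-class “which service?” question benefits from being able to trial other algorithm families.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for client purchase data with three service classes.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Logistic regression (the only algorithm Amazon ML offered) vs an alternative.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```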

Azure Machine Learning Cheat Sheet
Source: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/

 

Categorise Data – Amazon Machine Learning categorizes the source data for you

Amazon automatically pre-processes the data and categorizes each field. The possible categories are Categorical, Numeric, Binary or Text. If a user wishes to, they can change the category of the data which may have an effect on the final outcome. If the data contains a row identifier, it can be declared and will be ignored when making predictions.

A binary field is one which can only have one of two values. In Amazon, binary fields can only contain one of the following combinations (case insensitive):

  • 0,1
  • Y, N
  • Yes, No
  • True, False.

It is not capable of understanding any other combination even if only two values exist in the field (e.g. paid, unpaid).
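
A simple workaround (my own preprocessing step, not a feature of Amazon Machine Learning) is to recode such a field before upload, for example with pandas:

```python
import pandas as pd

df = pd.DataFrame({"payment_status": ["paid", "unpaid", "paid"]})  # illustrative

# Recode a two-valued text field into 0/1 so Amazon ML recognises it as binary.
df["paid_flag"] = (df["payment_status"] == "paid").astype(int)

df.drop(columns="payment_status").to_csv("training_data.csv", index=False)
```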

In Azure, there is no option to classify data fields. The classification is done automatically. While this is very intuitive, I would prefer to have the ability to manipulate fields as required. It is worth noting that Azure had no difficulty understanding that a field containing only two values was a binary data set. This reduced the amount of data manipulation that was required before creating a model and made the output easier to read.

AWS automatically proposed data categories

Findings – I had a column with an indicator for purchases made in the morning or afternoon (AM/PM). Amazon wasn’t able to see this as a binary field, and forcing it to be one and then trying to predict whether a service was required in the morning or afternoon caused an error.

 

Source Data – Amazon gives good visibility of the content within the source data

Amazon grouped the fields into the categories identified above. The target field is the field that we are attempting to predict using machine learning. It was possible to view each field and understand its correlation to the target field. The distribution view of categorical data was useful, identifying the top 10 attributes and the number of occurrences of each, and a bar chart helped in understanding the distribution of attributes in each field.

AWS source data visualization of a single field

Azure, too, can show each field and the distribution of attributes within it. However, it only showed the top 10 attributes for each field, so it was not easy to understand the proportion of the data that did not make the top 10. The correlation of a field to the target was not shown in the example I worked on, most probably because multiple algorithms are available and the correlation would be different for each algorithm.

Azure source data visualization

Findings – There were more than ten types of popular services. In Azure, I was unable to tell whether the services that didn’t make the top ten were significant or not. This was crystal clear in Amazon, because it grouped all services that were not in the top ten on the chart and displayed them as ‘Others’.

 

Training a Model – Amazon has a fixed 70/30 split when training a model; Azure allows you to select your desired split

In Amazon, the data split between the training and evaluation datasets is fixed at 70% for training and 30% for evaluating. The only way to change this is to use a different dataset for evaluation; if your use case needs a different split, you have to create the training and evaluation datasets manually.

In Azure, it was possible to specify the desired split between the data available for training and scoring the model. I would consider this to be an essential feature as the split would need to change based on the problem that you are trying to solve.

Amazon can only do a 70/30 split

Findings – A 70/30 split didn’t cause an issue. However, the ability to trial different splits in Azure was useful for me to understand its impact.
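
If a different split is needed, it is straightforward to produce one yourself before loading the data – a hedged sketch with scikit-learn and a hypothetical source file, whose two outputs could then be supplied to Amazon as separate training and evaluation datasources:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("client_services.csv")   # hypothetical source file

# Choose whatever split the problem calls for, rather than a fixed 70/30.
train, evaluate = train_test_split(data, test_size=0.2, random_state=42)

train.to_csv("train.csv", index=False)
evaluate.to_csv("evaluate.csv", index=False)
```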

 

Evaluating a Model – Amazon automatically evaluates the machine learning model

Evaluation happens automatically in Amazon; it occurs immediately after the machine learning model is created. In a binary classification scenario, Amazon lets the user change the trade-off (score threshold), false positive rate, precision, recall and accuracy. Each of these attributes affects the others, and it is easy to visualize the outcome immediately by tweaking them.

In Azure, evaluation can be added to the experiment as necessary. In a binary classification scenario, it was only possible to change the trade-off threshold, which in turn impacts the other factors.

Amazon lets you change all metrics

 

Findings – I would consider it critical that a machine learning model is evaluated while in development. Some manipulation of the data improved the quality of the model, and many iterations are usually required before landing on a good one.
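
The trade-off both tools expose ultimately comes down to where the score threshold sits. Here is a small sketch of what moving it does to precision and recall, on entirely synthetic scores:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # synthetic true labels
scores = 0.3 * y_true + 0.7 * rng.random(1000)    # synthetic predicted scores

# Raising the threshold trades recall away for precision, and vice versa.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_true, y_pred), 2),
          round(recall_score(y_true, y_pred), 2))
```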

 

Predictions – Making predictions using Machine Learning

Both Amazon and Azure have options to manually test, batch process and create endpoints for real-time predictions. Batch processing is the more common method of creating predictions from machine learning; it is usually run against a large set of records and takes some time to process.

Testing predictions on Amazon – The easiest way to test a model is to manually enter some values and get a prediction. This can be done by either typing the values into a web form or pasting values separated by commas. The predicted label is displayed on the screen, but to see the confidence of the prediction you need to sift through some code. An excerpt of an example output is below:

    "PredictiveModelType": "BINARY"
  },
  "predictedLabel": "1",
  "predictedScores": {
    "1": 0.7453358769416809
  }

Amazon batch predictions – To do this, you first need to convert the data to CSV (Comma Separated Values) format and upload it to Amazon S3 (Simple Storage Service). Predictions can be created from this file and are output to another CSV file in S3. The output file was compressed and saved as a GZ file, a format commonly used in Unix environments. A drawback of the output was that there was no way of telling which row of the source data matched which row of the output. The only way I found to get around this was to add a row number column when creating the machine learning model; the row number then appeared in the predictions. To marry up the two, you would need to use a method such as VLOOKUP, which was a bit frustrating. Trial and error proved that the rows in the output were in the same sequence as the rows in the source file.
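
The marry-up can also be done in pandas rather than VLOOKUP – a sketch with hypothetical file names, relying on the observation above that the output rows come back in the same order as the input:

```python
import pandas as pd

source = pd.read_csv("batch_input.csv")  # the file that was uploaded to S3

# Amazon's batch output arrives as a GZ-compressed CSV; pandas reads it directly.
predictions = pd.read_csv("batch_output.csv.gz", compression="gzip")

# Rows come back in source order, so a positional join lines the predictions
# up alongside the original fields.
combined = pd.concat([source.reset_index(drop=True),
                      predictions.reset_index(drop=True)], axis=1)
combined.to_csv("predictions_with_source.csv", index=False)
```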

Amazon endpoint – An endpoint creates an Amazon web service which can be accessed via an API (Application Programming Interface). This is useful when there is a requirement to get predictions on a real-time or ad hoc basis. Amazon claims that a query will be responded to within 100 milliseconds, and that up to 200 queries per second can be processed; any extra queries are queued up and responded to. Higher capacities can be accommodated by contacting Amazon.

Testing predictions in Azure – You are presented with a test option where you can type in data to a web form. There was no option to simply paste comma separated values into the webpage like on Amazon. However, the option to download a customized Excel workbook was very intuitive. This workbook consisted of parameters and predicted values side by side. It was very easy to use and responses were received within a second. This method is great if the data set has a small number of fields but may become progressively harder to use if there are many fields.

Azure batch predictions – Similar to the test option, this option also made use of an Excel workbook. However, it was more flexible allowing you to pick a cell range for the input and output data. I set up my output data on another sheet in the same workbook. I tested 300 rows of predictions and got the results back within one second.

Endpoint in Azure – The endpoint in Azure appeared much more user-friendly. There was an API help page with sample code for a request and response using the dataset you are working on. It also contained sample code for C#, Python and R, which reduces the complexity of having to write code from scratch.
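
For a rough idea of what calling such an endpoint looks like from Python, here is a sketch; the URL, key and payload shape are placeholders rather than details taken from the service’s help page, which should always be treated as the authoritative source:

```python
import requests

url = "https://example.azureml.net/workspaces/WORKSPACE/services/SERVICE/execute?api-version=2.0"  # placeholder
api_key = "YOUR_API_KEY"  # placeholder

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["service_type", "purchase_count"],  # illustrative fields
            "Values": [["consulting", 3]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(url, json=payload,
                         headers={"Authorization": "Bearer " + api_key})
print(response.json())  # the predicted label and score come back as JSON
```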

Azure batch predictions in Excel

Findings – In my use case, I didn’t require real-time predictions. I used batch predictions and it was a hassle to have to manually marry up the original data with the predictions from Amazon.

 

Costs – Azure’s better features and user interface cost more. Up to five times more!

Please note that pricing may vary based on many factors. These include, but are not limited to, the region, the complexity of the solution, the computing tier chosen, the size and nature of your organization, and any other negotiations entered into with either Microsoft or Amazon.

I performed some high-level cost calculations based on the dataset used. An Amazon solution costs approximately USD 100 per month for 20 hours of computing time and 890,000 predictions. Real-time predictions cost more than batch predictions, although the difference was not significant for the model that I used (USD 104.84 for real-time vs USD 97.40 for batch predictions).

The same solution using Azure came up to an estimated total cost of just under USD 500.

The prices are my observations only and should always be confirmed with the license provider. For more information on pricing, please see the links below:

Amazon – https://aws.amazon.com/machine-learning/pricing/

Azure – https://azure.microsoft.com/en-us/pricing/details/machine-learning/

 

Audience – Amazon appears to be more focused towards technical people

Overall, Amazon appeared to be focused on users who are more technically minded and more comfortable with programming. When learning to use Amazon Machine Learning, I only came across one worked example, which I would consider very limited.

On the other hand, Azure Machine Learning appeared more suited for power users within a business, as well as the technically minded. It provides a more familiar graphical drag-and-drop interface. These components are pre-programmed and are grouped such that they are easy to understand. The examples provided are a great way of getting accustomed to the tool and there are good sample datasets available.

 

Azure would appeal to most users

If you are more into coding, have technical resources to support you and can manage with only the logistic regression algorithm, then Amazon would be ideal for you. With its lower cost and range of other cloud services, it is definitely worth considering. For the rest of us, Azure is clearly the better choice.

It is worth remembering that the many algorithms and the ease of use come with a price tag attached.

Findings – I almost felt spoilt for choice with the number of algorithms in Azure. I spent a lot of time trying to get a good model working on Amazon. The process was much easier when I used the same dataset in Azure.

In conclusion, using Azure I was able to gain an understanding of which groups of clients are most inclined to purchase certain services. This makes it possible to target specific clients and make the best use of the marketing budget. The ability to easily view the predictions alongside the fields used to make them was very helpful. The predictions will be used to target clients who are more likely to purchase the services, which in turn will create a better outcome for both the business and its clients. Overall, Azure proved to be the more mature offering for someone wanting to solve a particular use case while avoiding deep coding.

We will keep a close eye on the Amazon offering to see when it catches up.