We test drive Azure Time Series Insights

azure time series

In this video, we take Azure Time Series Insights for a test drive. It is the new “fully managed analytics, storage, and visualization service that makes it simple to explore and analyze billions of IoT events simultaneously”.

Importantly, we also look at the differences between Power BI’s real-time visualisation capabilities and Azure Time Series Insights.

See more about Azure

Augmented Reality meets Advanced Analytics – it changes the way we plan

Augmented_reality_1

Take your historical data, put it on steroids with the help of Machine Learning, then overlay it with Augmented Reality. All of a sudden you can see what impact changes in the physical space will have on your data. You can almost “experience” what the changes will look like.

Combining AR/VR with Advanced Analytics is especially relevant wherever planning in the physical space applies, from city planning through to event planning and everything in between. It’s about integrating data sources to bring contextually relevant information into your maps.

The experience is accessible via HoloLens as well as through your mobile or tablet device.

In this demo, we show a segment of the Adelaide city map. It shows how foot traffic is affected by factors such as weather, time of day, the day of the week, etc. It also peers into the future by tying the solution into Machine Learning to predict the likely effect on foot traffic of substantial infrastructure changes, such as replacing a building with a park.

Please see a more comprehensive brochure here – exposé and Cortex Interactive Virtual Planner Solution

See more about Advanced Analytics

info@exposedata.com.au

Is the Data Warehouse Dead?

data warehouse

I am increasingly asked by customers – Is the Data Warehouse dead?

In technology terms, 30 years is a long time, and that is how old the Data Warehouse is – an old-timer, in other words. Can we consider it a mature yet productive worker, or is it a worker gearing up for a pension?

I come from the world of Data Warehouse architecture, and in the mid-to-late noughties (2003 to 2010), whilst working for various high-profile financial services institutions in the London market, Data Warehouses were considered all-important and companies spent a lot of money on their design, development, and maintenance. The prevailing consensus was that you could not get meaningful, validated and trusted information to business users for decision support without a Data Warehouse (whether it followed an Inmon or a Kimball methodology – the pros and cons of which are not under the spotlight here). The only alternative for companies without the means to commit to the substantial investment typically associated with a Data Warehouse was to allow Report Writers to develop code against the source systems’ databases (or a landed version thereof), but this, of course, led to a proliferation of reports, caused a massive maintenance nightmare, and went against every notion of a single trusted source of the truth.

Jump ahead to 2011, and businesses started showing a reluctance to invest in Data Warehouses – a trend that accelerated from that point onward. In my observation, the reasons ranged from the cost involved, the lack of quick ROI, low take-up rates and the difficulty of aligning a Data Warehouse to ongoing business change, through to, more recently, a change in the variety, volume and velocity of data that businesses are interested in.

In a previous article, “From watercooler discussion to corporate Data Analytics in record time” (https://exposedata.wordpress.com/2016/09/01/from-watercooler-discussion-to-corporate-data-analytics-in-record-time/), I stated that the recent acceleration of changes in the technology space “…now allows for fast response to red-hot requirements…” and that the “…advent of a plethora of services in the form of Platform-, Infrastructure- and Software as a Service (PaaS, IaaS and SaaS)…are proving to be highly disruptive in the Analytics market, and true game changers.”

Does all of this mean the Data Warehouse is dead or dying? Is it an old-timer getting ready for a pension, or does it still have years of productive contribution to a corporate data landscape left?

My experience across the Business Intelligence and Data Analytics market, across multiple industries and technologies, has taught me that:

A Data Warehouse is no longer a must-have for delivering meaningful, validated and trusted information to business users for decision support. As explained in the previous article, the PaaS, SaaS and IaaS services that focus on Data Analytics (for example the Cortana Intelligence Suite in Azure (https://www.microsoft.com/en-au/cloud-platform/cortana-intelligence-suite) or the Amazon Analytics Products (https://aws.amazon.com/products/analytics/)) allow for modular solutions that can be provisioned as required, and that collectively answer the Data Analytics challenges and ensure that data gets to users (no matter where it originates, its format, its velocity or its volume) fast, validated and in a business-friendly format.

But this does not mean that these modular Data Platforms, which use a clever mix of PaaS, SaaS, and IaaS services, can easily provide some of the fundamental services provided by a Data Warehouse (or more accurately, components typically associated with a Data Warehouse), such as:

  • Where operational systems do not track history but the analytical requirements demand that such history be tracked, for example through slowly changing dimensions of Type 2 (SCD2 – a minimal sketch follows this list).
  • Where business rules and transformations are so complex that it makes sense to define them through detailed analysis and to hard-code them into the landscape as code and materialised data, in structures that the business can understand and that are often reused (for example dimensions and facts resulting from complex business rules and transformations).
  • Where complex hierarchies are required by the reporting and self-service layer.
  • To support regulatory requirements such as proven data lineage, reconciliation, and retention required by law (for example Solvency II, Basel II and III, and Sarbanes-Oxley).
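
To make the first of these points concrete, below is a minimal sketch of Type 2 history tracking in Python with pandas. It is purely illustrative: the dimension, column names and dates are made up, and in a real Data Warehouse this logic typically lives in the ETL layer or a SQL MERGE statement rather than in pandas.

```python
import pandas as pd

# Existing customer dimension: one "current" row per business key.
dim_customer = pd.DataFrame([
    {"customer_key": 1, "customer_id": "C001", "city": "Adelaide",
     "valid_from": "2015-01-01", "valid_to": "9999-12-31", "is_current": True},
])

# Incoming source record where an attribute has changed (the customer has moved).
incoming = {"customer_id": "C001", "city": "Sydney", "load_date": "2016-07-01"}

# Close off the current row instead of overwriting it...
mask = (dim_customer["customer_id"] == incoming["customer_id"]) & dim_customer["is_current"]
dim_customer.loc[mask, ["valid_to", "is_current"]] = [incoming["load_date"], False]

# ...and insert a new row carrying the changed attribute.
new_row = {
    "customer_key": dim_customer["customer_key"].max() + 1,
    "customer_id": incoming["customer_id"],
    "city": incoming["city"],
    "valid_from": incoming["load_date"],
    "valid_to": "9999-12-31",
    "is_current": True,
}
dim_customer = pd.concat([dim_customer, pd.DataFrame([new_row])], ignore_index=True)

# Both the historical and the current version of the customer remain queryable.
print(dim_customer)
```

The essential idea is that an attribute change never overwrites the existing row; it closes that row off and adds a new version, so facts can always be joined to the version that was current at the time.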

Where these requirements exist, a Data Warehouse (or more accurately, components typically associated with a Data Warehouse) is required. But even in those cases, it will merely form part of a much larger Data Analytics landscape: it will perform the workloads described above, while the larger data story is delivered by complementary services.

In the past, Data Warehouses were key to delivering optimised analytical models that normally manifested themselves in materialised Data Mart Star Schemas (the end result of a series of layers such as ODS, staging, etc.). Such optimised analytical models are now instead handled by business-friendly metadata layers (e.g. Semantic Models) that source data from any appropriate source of information, bringing fragmented sources together in models that are quick to develop and easy for the business to consume. These sources include the objects typically associated with a Data Warehouse/Mart (for example materialised SCD2 dimensions, materialised facts resulting from complex business rules, entities created for regulatory purposes, etc.), and they are blended with data from a plethora of additional sources. The business user still experiences that clean and easy-to-consume Star Schema-like model. The business-friendly metadata layer becomes the Data Mart, but it is easier to develop, provides a quicker ROI, is much more responsive to business change, etc.

Conclusion

The Data Warehouse is not dead, but its primary role as we knew it is fading. It is becoming complementary to the larger Data Analytics Platforms we see evolving. Some of its components will continue to fulfil a central role, but they will be surrounded by all manner of services, and collectively these will fulfil the organisation’s data needs.

In addition, we see the evolution of Data Warehouse as a Service (DWaaS). This is not a Data Warehouse in the typical sense of the word as used in this article, but rather a service optimised for analytical workloads. Can it serve those requirements typically associated with a Data Warehouse, such as SCD2, materialisation due to complex rules, hierarchies or regulatory requirements? Absolutely. But its existence does not change the need for modular, targeted architectures and a much larger Data Analytics landscape using a variety of PaaS (including DWaaS), IaaS and SaaS. It merely makes the hosting of typical Data Warehouse workloads much simpler, better performing and more cost-effective. Examples of DWaaS are Microsoft’s Azure SQL DW and Amazon’s Redshift.


Cognitive Intelligence meets Advanced Analytics

cognitive intelligence

Acquiring knowledge of anonymous customers through Cognitive Intelligence is the next generation of customer-based Business Intelligence.

Human behaviour and characteristics such as speech, demographics, and emotion can now be expressed digitally and blended with Advanced Analytics. Exposé applies this across a number of different use cases, as shown in the video.

 

Or navigate to – https://www.youtube.com/watch?v=XkeCLp7noyo

See more on Advanced Analytics

Advanced Analytics and Big Data Platform – RAA Case Study

data platform

An Exposé case study on our advanced analytics and big data platform for RAA. The platform allows for the acquisition and blending of large volumes of fragmented geospatial data, transforms it using massive processing capacity, applies predictive analytics to assess the risk of millions of properties, and provides interactive and geospatial visualisations of the blended data and results.

This video case study shows a solution summary:

See the full case study here: expose-case-study-raa

See another big data solution here

UPDATED: ArcGIS Maps for Power BI provided by Esri (Preview)

Microsoft recently announced ArcGIS Maps for Power BI provided by Esri. This functionality is still only in preview, but we decided to give it a test drive. https://powerbi.microsoft.com/en-us/blog/announcing-arcgis-maps-for-power-bi-by-esri-preview/

We used data from our social media solution to visualize with this new feature. The results were pretty impressive as shown in the images below and a huge improvement over the existing Map visual.

arcgis-map-example-3

arcgis-map-example-2

arcgis-map-example-4

arcgis-map-example

There is a BUT! This feature is not currently available in the Power BI service, so its application is very limited. Of course, the feature is still only in preview, so this should be addressed soon, as confirmed in this thread: http://community.powerbi.com/t5/Desktop/ArcGIS-Maps-for-Power-BI-Preview-Discussion/td-p/72947

UPDATE: Microsoft has announced that the ArcGIS mapping service is now available in Preview in the Power BI service as per this post from Microsoft on 24 November https://powerbi.microsoft.com/en-us/blog/arcgis-maps-for-powerbi-available-in-powerbi-service/.

Connecting to Big Data Clusters from Excel

Connecting to big data clusters from Excel is now relatively simple. But those who have tried would have noticed multiple ways to achieve what seems to be the same thing. This article aims to clear these muddy waters a bit.

I am using Microsoft Office 365 Pro Plus (Excel 2013) updated to 15.0.4719.1002

Power Query updated to 2.22.4007.242

All tests performed show relative and average results only and should not be taken as authoritative performance tests, but rather as observations averaged across multiple tests.

Areas from which you can access HDInsight clusters

There are three areas from which you can access big data in HDInsight clusters from within Excel, with the aim of integrating it into your BI solution:

The first is the ‘old school’ Data tab:

Image13.JPG

If you use this ‘old school’ area:

  • The variety of data source options is more limited, but it does include Hadoop by way of an ODBC connection.
  • It is a more convoluted path via ODBC to Hadoop clusters (specifically compared to the Power Query method).
  • You will be limited by the row and column limits in Excel (1,048,576 rows by 16,384 columns).

The second is to use the data connection functionality inside Power Pivot:

Image14

  • The variety of data source options is more plentiful when compared to the Data tab ODBC connection, and it also includes Hadoop by way of an ODBC connection.
  • It is still a more convoluted path via ODBC to Hadoop clusters (specifically compared to the Power Query method).
  • You are not subject to the limitations posed by Excel.

The third is to use the Power Query functionality – this is the area discussed in more detail in this post:

  • You have the widest variety of data source options available via Power Query, including three methods for accessing Hadoop clusters (discussed below). New data source options are constantly added.
  • Although it provides two methods for accessing Hadoop clusters beyond ODBC, it is important to understand the limitations of each (discussed below).
  • You are not subject to the limitations posed by Excel.

As you know, HDInsight is a Hadoop cluster of compute nodes that provides MapReduce compute power over data stored in HDFS, which is not co-located with the compute nodes. By separating storage from compute, HDInsight allows data stored in containers to be accessed through HDFS APIs or through the Azure Blob Storage REST APIs. This means the data exists beyond the life of the cluster, so compute power can be provisioned when required, yet the data remains intact and available when the cluster no longer exists. This obviously holds a cost benefit, as compute clusters are much more costly than storage. There are other benefits not discussed here, as the focus is less on HDInsight and more on its accessibility through Excel as a BI tool.
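
By way of a quick sketch of what this separation means in practice (not part of the original tests), the blobs that back an HDInsight cluster can be read directly with the azure-storage-blob Python SDK, whether or not the cluster itself still exists. The storage account, container and blob names below are placeholders.

```python
# Minimal sketch: read the cluster's underlying data straight from Azure Blob Storage.
from azure.storage.blob import BlobServiceClient

conn_str = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<storage-account>;AccountKey=<key>;"
    "EndpointSuffix=core.windows.net"
)
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("<hdinsight-container>")

# List what the cluster wrote into the container; this works whether or not
# the HDInsight cluster itself still exists.
for blob in container.list_blobs(name_starts_with="data/"):
    print(blob.name, blob.size)

# Download a single blob (placeholder path) for local inspection.
payload = container.get_blob_client("data/part-00000").download_blob().readall()
print(payload[:200])
```

Because the data lives in the storage account rather than on the cluster, this kind of direct access keeps working after the cluster has been deleted.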

Power Query provides three methods for accessing the data stored in HDFS: two via the “From Azure” menu, and one via the “From Other Sources” menu.

From Azure:

Image15.JPG

In the From Azure menu, you have both the “From Azure HD Insight” and the “From Azure Blob Storage” methods available (as shown in the diagram above):

The Test 1 chart below shows relative and average results from multiple tests that accessed the same data set, comparing the two methods against each other.

Image16.JPG

I see no substantial performance difference between the two methods, either while the cluster is running or after it has been deleted. Also note that the queries generated by the two methods are very similar.

From Azure Blob Storage – query

Image17.JPG

From Azure HD Insight – query

Image18.JPG

Conclusion:

  • Using From Azure Blob Storage simply means that the storage account is accessed directly; you do not leverage the compute power of the VMs or HIVE’s MapReduce available via the cluster.
  • You might expect that opting for the From Azure HD Insight method means you can leverage the compute power of the VMs and HIVE’s MapReduce available via the cluster whilst it runs, but tests (see the Test 1 chart) do not show the substantial improvements expected with HIVE’s MapReduce. The results of From Azure HD Insight also remain largely unchanged irrespective of the existence of the HDInsight cluster.
    • This method (as is the case with From Azure Blob Storage) is therefore limited to accessing blobs in the container. It provides no way to access HIVE tables, rendering the MapReduce programming model and its parallel, distributed compute power superfluous via this connection method. How then do we leverage the power provided by MapReduce? Read on…

From Other Sources:

This menu provides the third method for accessing HDInsight: through an ODBC connection, after installing the Microsoft HIVE ODBC driver and setting up the appropriate System DSN. Set up the initial ODBC connection only after the cluster has been provisioned. The ODBC method can therefore only be used whilst the cluster runs, yet it is not necessary to reconfigure the ODBC connection if a deleted cluster is recreated, as long as the name (and therefore the ODBC host address), user and password are the same as before.

Image19.JPG
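
As a rough illustration, the same System DSN can be used from any ODBC client, not just Excel; below is a short sketch in Python with pyodbc. The DSN name, credentials and HIVE table are placeholders, and the cluster must be running for the query to execute.

```python
# Minimal sketch: query a HIVE table over the Microsoft HIVE ODBC driver via the System DSN.
import pyodbc

conn = pyodbc.connect("DSN=HDInsightHive;UID=<user>;PWD=<password>", autocommit=True)
cursor = conn.cursor()

# Because this goes through HIVE, the aggregation is executed on the cluster
# rather than by pulling raw blobs down to the client.
cursor.execute("SELECT country, COUNT(*) AS events FROM web_logs GROUP BY country")
for country, events in cursor.fetchall():
    print(country, events)

conn.close()
```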

Retesting via the ODBC method shows a marked improvement in relative and average results when accessing a HIVE table created over the same blob that was tested previously in Test 1.

Image20.JPG

Also note that the query generated is notably different from that of the previous two methods.

Image21

Overall Conclusion:

Given that there are essentially five ways to access data associated with the Hadoop cluster, and the various limitations listed above, how do you decide which method is the most appropriate? I would suggest deciding based on your particular use case:

  • If you do not have Power Pivot and Power Query enabled in your Excel deployment, and your number of records does not exceed Excel’s record limitation, then you can use the ‘old school’ ODBC DSN connection.
  • If you have Power Pivot and Power Query enabled in your Excel deployment and you require uninterrupted access to your data in the HDFS without the need for the compute power brought by Hive’s MapReduce, then consider either of the two connection options in the From Azure menu of Power Query.
  • If you do have Power Pivot and Power Query enabled in your Excel deployment, and require ad-hoc/intermittent access to your data in the HDFS with the support of the compute power brought by Hive’s MapReduce (for example during a monthly reporting cycle only), then consider the From ODBC connection option in the From Other Sources menu in Power Query, together with a consistent and pre-defined configuration of the cluster each time you enter a period where the data needs to be accessed with superior compute power.
  • If you do have Power Pivot and Power Query enabled in your Excel deployment, and require uninterrupted access to your data in the HDFS supported by the compute power brought by Hive’s MapReduce, then consider the From ODBC connection option in the From Other Sources menu in Power Query and a constantly running cluster. But please be aware of the cost implications of this.