Market segmentation is a common practice in marketing and sales in order to better understand - and therefore be better able to target - customers. This same principle, though, can be applied to any business problem where the division of a diverse population into sub-populations based on similarities (or differences) would be advantageous.
Fortunately, rather than having to slice every variable an infinite number of ways, we can utilise unsupervised learning algorithms to produce groupings of samples based on the similarity of data features.
This post discusses one technique for accomplishing this, using a mixture of technologies (SQL, Python and R) and algorithms (Self Organising Maps (SOM) and Hierarchical Clustering).
The following example is taken from a pool of insurance claims, with the aim to understand sub-populations contained within the data to allow for appropriate monitoring and exception reporting. Specifically, we want to know if the mix of claims changes for the worse.
Claims by nature have a structure whereby the majority of the portfolio consists of low-value claims, reducing quickly to very few at the highest values (for the statistically minded, the value generally follows a log-normal distribution). A typical partitioning strategy is to slice by claim duration (shorter claims are in general cheaper), and this is suitable in a lot of cases when dealing with claims in aggregate. The challenge is the lack of granularity: when there is a change in expected durations, a second level of analysis is required to work out the “who”.
By using unsupervised learning, we will essentially encode the "who" into the cluster exemplars, so we can then focus on the more important question: "So what are we going to do about it?"
Obtaining the data was, fortunately, a trivial task, as it was all contained in an on-premises SQL Server database, and the data was as clean as it was going to get. Five years of historical records were used as the training sample.
The modelling was done using the R kohonen package, with subsequent clustering of the SOM model by the hclust function.
Finally, to glue everything together into a processing pipeline I used Python 2.7 with:
- sqlalchemy: to connect to the database;
- numpy / pandas: for data massaging;
- rpy2: to connect to R.
I will not go into detail around the technical implementations of SOM and Hierarchical Clustering, as there are far better explanations out there than I could hope to provide (in Tan 2006, for example). However, to provide a simple overview, a SOM is an m × n grid of nodes, to which samples are assigned based on the similarity (or dissimilarity) measure used; commonly, and in our case, this is Euclidean distance.
It is an iterative algorithm with random initialisation. At each step, the node codebook (a vector identical in structure to the input data, with feature values that represent the samples assigned to the node) is updated and the samples are re-evaluated as to which node they belong to. This continues for a set number of iterations; however, we can check whether the codebooks are still changing to any degree using a 'Change Plot'. Figure 3 is an example of the 'Change Plot' from the kohonen package, where we observe that after 50 or so iterations there is minimal change occurring.
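To make the training loop concrete, here is a minimal, dependency-free sketch in Python. It is not the kohonen implementation used for the actual modelling; the function name `train_som`, the linear decay of the learning rate and neighbourhood radius, and the default values are all assumptions chosen for illustration. The per-iteration value it records (mean distance from each sample to its best matching node) is my stand-in for the quantity shown in a 'Change Plot'.

```python
import math      # math.dist needs Python 3.8+
import random


def train_som(samples, rows, cols, iterations=50, seed=42,
              lr_start=0.05, lr_end=0.01):
    """Online SOM training sketch.

    Returns (codebooks, changes) where changes[i] is the mean distance
    from each sample to its best matching node during iteration i.
    """
    rng = random.Random(seed)              # fixed seed -> repeatable runs
    dim = len(samples[0])
    # One codebook vector per grid node, same structure as an input sample.
    codes = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    radius_start = max(rows, cols) / 2.0
    changes = []
    for it in range(iterations):
        frac = it / max(iterations - 1, 1)
        lr = lr_start + (lr_end - lr_start) * frac       # linear decay
        radius = max(radius_start * (1.0 - frac), 0.5)   # shrinking neighbourhood
        order = list(range(len(samples)))
        rng.shuffle(order)                               # present samples in random order
        total = 0.0
        for i in order:
            x = samples[i]
            # Best matching unit: node whose codebook is nearest (Euclidean).
            bmu = min(range(len(codes)), key=lambda k: math.dist(x, codes[k]))
            total += math.dist(x, codes[bmu])
            # Pull the BMU and its grid neighbours towards the sample.
            for k, pos in enumerate(coords):
                if math.dist(pos, coords[bmu]) <= radius:
                    codes[k] = [w + lr * (xi - w) for w, xi in zip(codes[k], x)]
        changes.append(total / len(samples))
    return codes, changes
```

Plotting `changes` against the iteration number gives a rough equivalent of the change plot: once the curve flattens, further iterations are adding little.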
When we have a SOM model we are happy with, Hierarchical Clustering allows us to condense the grid into a smaller number of clusters for further evaluation. Clustering is performed over the node codebooks, after which a number of clusters is selected. Note that one can use the nodes in the SOM as the clustering, but generally, depending on the number of samples you train the model with, a SOM grid contains too many nodes to be useful.
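As an illustration of clustering over the node codebooks, the sketch below implements a naive agglomerative clustering in plain Python. The real work was done with R's hclust; this is only a stand-in. The function name `cluster_codebooks` is hypothetical, and the choice of complete linkage is an assumption on my part.

```python
import math  # math.dist needs Python 3.8+


def cluster_codebooks(codes, n_clusters):
    """Agglomerative clustering (complete linkage) of SOM node codebooks.

    Returns a list giving the cluster id (0 .. n_clusters - 1) of each node.
    """
    clusters = [[i] for i in range(len(codes))]   # start: one cluster per node

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(codes[i], codes[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(codes)
    for cluster_id, members in enumerate(clusters):
        for node in members:
            labels[node] = cluster_id
    return labels
```

This brute-force version is fine for a few hundred nodes, which is all a SOM grid typically contains; it would be a poor choice for clustering the raw samples directly.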
The process by which the modelling was undertaken is depicted in figure 4.
As can be seen from figure 4, the Python script initiates everything and uses the results from the SQL and R calls.
This approach was taken primarily due to my own limitations in R programming; I am far more comfortable developing in Python and know my Python code will be more efficient than my R code purely due to proficiency.
My selection of R and the kohonen package was based on my research into a suitable implementation of SOM to use. kohonen has an implementation called SuperSOM, a multi-layered SOM that trains each layer separately, which I thought would be ideal for temporal features in the source data (e.g. layer 1 = features at t1, layer 2 = features at t2, etc.).
Finally, the data set was not big by any stretch so whilst training a SOM can be a compute-intensive task, in this case, anything more than a decent laptop was not required – on my i7 laptop, training of the SuperSOM took only 3 seconds to run 80 iterations against ~5,000 samples.
Both Quantisation Error (QE) and Topological Error (TE) were used to evaluate the quality of the model, where:
- QE = the mean distance from node samples to the node codebook. A low QE ensures that the node codebook is close to the samples allocated to it.
- TE = the mean distance on the grid between a node and its nearest node, i.e. the node with the most similar codebook. A low TE means that similar nodes are positioned close to each other on the grid.
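The two measures can be sketched in a few lines of Python. This follows the definitions as worded above rather than any particular library's implementation, and the function names are hypothetical. `coords` is assumed to hold the (row, column) grid position of each node, in the same order as `codes`.

```python
import math  # math.dist needs Python 3.8+


def quantisation_error(samples, codes):
    """Mean distance from each sample to the codebook of its nearest node."""
    return sum(min(math.dist(x, c) for c in codes) for x in samples) / len(samples)


def topological_error(codes, coords):
    """Mean grid distance between each node and the node whose codebook
    is most similar to it."""
    total = 0.0
    for i, code in enumerate(codes):
        # Most similar node in feature space, excluding the node itself.
        nearest = min((j for j in range(len(codes)) if j != i),
                      key=lambda j: math.dist(code, codes[j]))
        # How far apart those two nodes sit on the grid.
        total += math.dist(coords[i], coords[nearest])
    return total / len(codes)
```

Under this definition, a well-ordered map scores close to 1.0, because the most similar codebook always belongs to an adjacent node; larger values mean similar nodes are scattered across the grid.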
TE is particularly important in our case as we wish to cluster the result. Non-contiguous clusters are a side effect of having a relatively high TE.
The clustering was used to create data exemplars, around which monitoring was developed to understand any changes in the population mix. Interventions could be focused on the relevant sub-population based on the dominant features of the samples in that group (i.e. if group "A" is a high-cost group and feature "X" is particularly dominant, how can we influence this feature to move the sample into a lower-cost group?). It is for this targeting reason that marketing has historically been a large user of clustering.
New samples can be allocated to a cluster by first allocating them to the nearest node, then assigning them the cluster id of that node.
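A minimal sketch of that two-step lookup is shown below. The names `assign_cluster` and `node_cluster` are hypothetical; `node_cluster` is assumed to be a list mapping each node index to its cluster id, and the new sample is assumed to have been scaled in the same way as the training data.

```python
import math  # math.dist needs Python 3.8+


def assign_cluster(sample, codes, node_cluster):
    """Return (node index, cluster id) for a new, already-scaled sample."""
    # Step 1: nearest node by Euclidean distance to its codebook.
    node = min(range(len(codes)), key=lambda k: math.dist(sample, codes[k]))
    # Step 2: the sample inherits that node's cluster id.
    return node, node_cluster[node]


# Toy example: three nodes, the first two belonging to cluster 0.
codes = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.9]]
node_cluster = [0, 0, 1]
node, cluster = assign_cluster([0.85, 0.8], codes, node_cluster)
```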
Other Use Cases
Marketing: To group customers based on buying habits, product type purchases, purchase value, etc. to tailor marketing activities and/or focus on higher-value customers.
Customer Service: Triage call types based on historical interactions into high/medium/low priority or risk.
Government: Understanding the demographics of users of services in order to better tailor customer experience.
- SOM uses random initialisation, therefore to get repeatable results a seed needs to be set;
- Min-max scaling between 0 and 1 was used given the presence of binary variables;
- In a single SOM scenario, one can just do dist(som$codes) to get a distance matrix for clustering purposes, but for SuperSOM, we need to handle the layers. So a function was created to calculate the average weighted distance across layers to output as the distance matrix. The weighting was based on layer weight and also node distance (i.e. the further the nodes are from each other on the grid, the higher the weighting) to form contiguous clusters.
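The layered distance matrix from the last point could look something like the sketch below. The original function was written against the R model and is not reproduced here, so the exact formula is an assumption: I take the layer-weighted average of the codebook distances and multiply it by the grid distance between the two nodes, so that far-apart nodes are penalised. The names `layered_distance_matrix`, `layer_codes`, `layer_weights` and `coords` are hypothetical.

```python
import math  # math.dist needs Python 3.8+


def layered_distance_matrix(layer_codes, layer_weights, coords):
    """Node-to-node distance matrix for a multi-layer SOM.

    layer_codes[l][i] is the codebook of node i in layer l,
    layer_weights[l] is the weight of layer l, and
    coords[i] is the (row, col) grid position of node i.
    """
    n = len(coords)
    total_weight = sum(layer_weights)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Layer-weighted average of the codebook distances.
            feature = sum(
                w * math.dist(codes[i], codes[j])
                for codes, w in zip(layer_codes, layer_weights)
            ) / total_weight
            # Assumed penalty: scale by how far apart the nodes sit on the grid.
            grid = math.dist(coords[i], coords[j])
            dist[i][j] = dist[j][i] = feature * grid
    return dist
```

The resulting matrix is symmetric with a zero diagonal, so it can be handed to any hierarchical clustering routine that accepts precomputed distances.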
A good segmentation tutorial using the Kohonen package: http://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r
kohonen package official documentation: https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
Wehrens and Buydens: Self- and Super-organizing Maps in R: The Kohonen Package, Journal of Statistical Software, October 2007, Volume 21, Issue 5: https://www.jstatsoft.org/article/view/v021i05/v21i05.pdf
Fusco and Perez: Spatial Analysis of the Indian Subcontinent: the Complexity Investigated through Neural Networks, Proceedings of CUPUM 2015: http://web.mit.edu/cron/project/CUPUM2015/proceedings/Content/analytics/287_fusco_h.pdf