Name        : perl-Algorithm-ExpectationMaximization
Version     : 1.22
Vendor      : obs://build_opensuse_org/devel:languages:perl
Release     : 3.51
Date        : 2024-02-15 04:18:46
Group       : Development/Libraries/Perl
Source RPM  : perl-Algorithm-ExpectationMaximization-1.22-3.51.src.rpm
Size        : 0.39 MB
Packager    : (none)
Summary     : Perl module for clustering numerical data
Description :
*Algorithm::ExpectationMaximization* is a _perl5_ module for the Expectation-Maximization (EM) method of clustering numerical data that lends itself to modeling as a Gaussian mixture. Since the module is entirely in Perl (in the sense that it is not a Perl wrapper around a C library that actually does the clustering), the code in the module can easily be modified to experiment with several aspects of EM.
Gaussian Mixture Modeling (GMM) is based on the assumption that the data consists of 'K' Gaussian components, each characterized by its own mean vector and its own covariance matrix. Obviously, given observed data for clustering, we do not know which of the 'K' Gaussian components was responsible for any of the data elements. GMM also associates a prior probability with each Gaussian component. In general, these priors will also be unknown. So the problem of clustering consists of estimating the posterior class probability at each data element and also estimating the class priors. Once these posterior class probabilities and the priors are estimated with EM, we can use the naive Bayes' classifier to partition the data into disjoint clusters. Or, for "soft" clustering, we can assign to a Gaussian component all the data elements whose posterior class probabilities for that component exceed a prescribed threshold.
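To make the two uses of the posterior class probabilities concrete, here is a minimal standalone sketch, not code from this module, that uses a one-dimensional two-component mixture with made-up parameter values. It forms the posteriors with Bayes' rule and then derives both a disjoint (naive Bayes) label and a threshold-based soft labeling from them.

  use strict;
  use warnings;
  use constant PI => 4 * atan2(1, 1);

  # Made-up current estimates for a K=2 one-dimensional Gaussian mixture.
  my @prior = (0.6, 0.4);
  my @mean  = (0.0, 5.0);
  my @var   = (1.0, 2.0);

  # Gaussian density N(x; mu, sigma^2).
  sub gaussian {
      my ($x, $mu, $var) = @_;
      return exp( -($x - $mu)**2 / (2 * $var) ) / sqrt(2 * PI * $var);
  }

  # Posterior p(k|x) = prior_k * N(x; mu_k, var_k) / sum_j prior_j * N(x; mu_j, var_j)
  sub posteriors {
      my ($x) = @_;
      my @unnorm = map { $prior[$_] * gaussian($x, $mean[$_], $var[$_]) } 0 .. $#prior;
      my $total  = 0;
      $total    += $_ for @unnorm;
      return map { $_ / $total } @unnorm;
  }

  my @p = posteriors(3.9);
  my ($hard) = sort { $p[$b] <=> $p[$a] } 0 .. $#p;      # disjoint (naive Bayes) label
  my @soft   = grep { $p[$_] > 0.2 } 0 .. $#p;           # soft labels above a threshold
  printf "posteriors: %.3f %.3f  hard label: %d  soft labels: @soft\n", @p, $hard;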
If you do not mind the fact that it is possible for the EM algorithm to occasionally get stuck in a local maximum and to, therefore, produce a wrong answer even when you know the data to be perfectly multimodal Gaussian, EM is probably the most magical approach to clustering multidimensional data. Consider the case of clustering three-dimensional data. Each Gaussian cluster in 3D space is characterized by the following 10 variables: the 6 unique elements of the '3x3' covariance matrix (which must be symmetric positive-definite), the 3 elements of the mean vector, and the prior probability associated with the Gaussian. Now let's say you expect to see six Gaussians in your data. What that means is that you would want the values for 59 variables (remember the unit-summation constraint on the class priors, which reduces the overall number of variables by one) to be estimated by the algorithm that seeks to discover the clusters in your data. What's amazing is that, despite the large number of variables that must be optimized simultaneously, the EM algorithm will very likely give you a good approximation to the right answer.
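The count of 59 follows from a simple formula; the following tiny sketch, which is not module code, just reproduces that arithmetic.

  use strict;
  use warnings;

  # Free parameters of a K-component Gaussian mixture in d dimensions:
  #   K * ( d(d+1)/2 covariance entries + d mean entries + 1 prior ) - 1,
  # the final -1 accounting for the priors having to sum to one.
  sub gmm_free_parameters {
      my ($K, $d) = @_;
      return $K * ( $d * ($d + 1) / 2 + $d + 1 ) - 1;
  }

  print gmm_free_parameters(6, 3), "\n";     # prints 59, matching the text above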
At its core, EM depends on the notion of unobserved data and on averaging the log-likelihood of the data actually observed over all admissible probabilities for the unobserved data. But what is unobserved data? In some cases where EM is used, the unobserved data is literally missing data; in others, it is something that cannot be seen directly but that is nonetheless relevant to the data actually observed. For the case of clustering multidimensional numerical data that can be modeled as a Gaussian mixture, it turns out that the best way to think of the unobserved data is as a sequence of random variables, one for each observed data point, whose values dictate the selection of the Gaussian for that data point. This point is explained in great detail in my on-line tutorial at https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf.
The EM algorithm in our context reduces to an iterative invocation of the following steps: (1) Given the current guess for the means and the covariances of the different Gaussians in our mixture model, use Bayes' Rule to update the posterior class probabilities at each of the data points; (2) Using the updated posterior class probabilities, first update the class priors; (3) Using the updated class priors, update the class means and the class covariances; and go back to Step (1). Ideally, the iterations should terminate when the expected log-likelihood of the observed data has reached a maximum and does not change with any further iterations. The stopping rule used in this module is the detection of no change over three consecutive iterations in the values calculated for the priors.
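The shape of this loop, including the stopping rule on the priors, can be sketched in standalone Perl. The sketch below is not the module's implementation; it restricts itself to one-dimensional data, K = 2, and made-up starting guesses, but the order of the updates and the three-iteration test on the priors follow the description above.

  use strict;
  use warnings;
  use constant PI => 4 * atan2(1, 1);

  # One-dimensional, K=2 illustration of the E-step / M-step loop.
  my @data  = (0.1, -0.4, 0.7, 4.6, 5.3, 5.9, 0.2, 4.8);
  my @prior = (0.5, 0.5);
  my @mean  = (0.0, 1.0);              # deliberately rough initial guesses
  my @var   = (1.0, 1.0);

  sub gaussian {
      my ($x, $mu, $var) = @_;
      return exp( -($x - $mu)**2 / (2 * $var) ) / sqrt(2 * PI * $var);
  }

  my @prior_history;                   # for the stopping rule on the priors
  for my $iter (1 .. 300) {
      # Step (1): Bayes' rule gives the posterior class probabilities.
      my @post;
      for my $i (0 .. $#data) {
          my @u = map { $prior[$_] * gaussian($data[$i], $mean[$_], $var[$_]) } 0, 1;
          my $s = $u[0] + $u[1];
          $post[$i] = [ map { $_ / $s } @u ];
      }
      # Steps (2) and (3): update the priors, then the means and variances.
      for my $k (0, 1) {
          my $nk = 0;
          $nk += $post[$_][$k] for 0 .. $#data;
          $prior[$k] = $nk / @data;
          my $mu = 0;
          $mu += $post[$_][$k] * $data[$_] for 0 .. $#data;
          $mean[$k] = $mu / $nk;
          my $v = 0;
          $v += $post[$_][$k] * ($data[$_] - $mean[$k])**2 for 0 .. $#data;
          $var[$k] = $v / $nk || 1e-6;         # guard against a collapsing variance
      }
      # Stop when the priors show no change over three consecutive iterations.
      push @prior_history, [ @prior ];
      shift @prior_history if @prior_history > 4;
      if (@prior_history == 4) {
          my $changed = 0;
          for my $j (1 .. 3) {
              for my $k (0, 1) {
                  $changed = 1
                      if abs($prior_history[$j][$k] - $prior_history[$j-1][$k]) > 1e-6;
              }
          }
          last unless $changed;
      }
  }
  printf "means: %.3f %.3f   priors: %.3f %.3f\n", @mean, @prior;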
This module provides three different choices for seeding the clusters: (1) random, (2) kmeans, and (3) manual. When random seeding is chosen, the algorithm randomly selects 'K' data elements as cluster seeds. That is, the data vectors associated with these seeds are treated as initial guesses for the means of the Gaussian distributions. The covariances are then set to the values calculated from the entire dataset with respect to the means corresponding to the seeds. With kmeans seeding, on the other hand, the means and the covariances are set to whatever values are returned by the kmeans algorithm. And, when seeding is set to manual, you are allowed to choose 'K' data elements --- by specifying their tag names --- for the seeds. The rest of the EM initialization for the manual mode is the same as for the random mode. The algorithm allows for the initial priors to be specified for the manual mode of seeding.
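As an illustration of the random mode, the following standalone fragment, again not the module's own code, picks K one-dimensional data elements at random as seed means, computes each component's variance from the entire data set with respect to its seed mean, and starts with uniform priors.

  use strict;
  use warnings;
  use List::Util qw(shuffle sum);

  # Random seeding, one-dimensional illustration: K data elements become
  # the initial means; each variance is computed from the entire data set
  # with respect to that seed mean; the priors start out uniform.
  my @data = (0.1, -0.4, 0.7, 4.6, 5.3, 5.9, 0.2, 4.8);
  my $K    = 2;

  my @seed_means  = (shuffle @data)[0 .. $K - 1];
  my @seed_vars   = map {
      my $mu = $_;
      sum(map { ($_ - $mu)**2 } @data) / @data;
  } @seed_means;
  my @seed_priors = (1 / $K) x $K;       # unless priors are supplied manually

  printf "seed means: %s\n", join ' ', map { sprintf '%.2f', $_ } @seed_means;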
Much of the code for the kmeans-based seeding of EM was drawn from the 'Algorithm::KMeans' module by me. The code from that module used here corresponds to the case when the 'cluster_seeding' option in the 'Algorithm::KMeans' module is set to 'smart'. The 'smart' option for KMeans consists of subjecting the data to a principal components analysis (PCA) to discover the direction of maximum variance in the data space. The data points are then projected onto this direction and a histogram is constructed from the projections. Centers of the 'K' largest peaks in this smoothed histogram are used to seed the KMeans-based clusterer. As you'd expect, the output of KMeans, that is, the cluster means and covariances it returns, is then used to seed the EM algorithm.
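The projection-and-histogram part of this 'smart' seeding can be sketched as follows. This is a standalone illustration rather than code from either module, and it assumes the principal direction has already been found, so the PCA step itself is omitted.

  use strict;
  use warnings;

  # 2-D points are projected onto the (assumed) principal direction, a coarse
  # histogram of the projections is built, and the centers of the K most
  # populated bins act as seed positions.
  my @points    = ( [0.2, 0.1], [0.5, 0.4], [4.8, 5.1], [5.2, 4.9], [5.0, 5.3] );
  my @direction = (0.707, 0.707);          # assumed unit vector from PCA
  my $K         = 2;
  my $nbins     = 10;

  my @proj = map { $_->[0] * $direction[0] + $_->[1] * $direction[1] } @points;
  my ($min, $max) = (sort { $a <=> $b } @proj)[0, -1];
  my $width = ($max - $min) / $nbins || 1;

  my @count = (0) x $nbins;
  for my $p (@proj) {
      my $bin = int( ($p - $min) / $width );
      $bin = $nbins - 1 if $bin >= $nbins;     # clamp the topmost value
      $count[$bin]++;
  }

  # Projection values at the centers of the K tallest bins.
  my @top_bins = ( sort { $count[$b] <=> $count[$a] } 0 .. $nbins - 1 )[0 .. $K - 1];
  my @seeds    = map { $min + ($_ + 0.5) * $width } @top_bins;
  printf "seed positions along the principal direction: %s\n",
         join ' ', map { sprintf '%.2f', $_ } @seeds;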
This module uses two different criteria to measure the quality of the clustering achieved. The first is the Minimum Description Length (MDL) proposed originally by Rissanen (J. Rissanen: "Modeling by Shortest Data Description," Automatica, 1978, and "A Universal Prior for Integers and Estimation by Minimum Description Length," Annals of Statistics, 1983). The MDL criterion is the difference of a log-likelihood term for all of the observed data and a model-complexity penalty term. In general, both the log-likelihood and the model-complexity terms increase as the number of clusters increases. The form of the MDL criterion in this module uses for the penalty term the Bayesian Information Criterion (BIC) of G. Schwarz, "Estimating the Dimension of a Model," The Annals of Statistics, 1978. In general, the smaller the value of the MDL quality measure, the better the clustering of the data.
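A standalone sketch of an MDL-style score is shown below. It is not the module's own bookkeeping; it simply adds the negative log-likelihood of a one-dimensional mixture, with made-up parameter values, to a BIC penalty of (p/2)*log(N), where p is the number of free parameters, to show the trade-off between fit and model complexity.

  use strict;
  use warnings;
  use constant PI => 4 * atan2(1, 1);

  # MDL-style score = negative log-likelihood + (p/2) * log(N).
  my @data  = (0.1, -0.4, 0.7, 4.6, 5.3, 5.9, 0.2, 4.8);
  my @prior = (0.5, 0.5);
  my @mean  = (0.15, 5.15);
  my @var   = (0.2, 0.3);

  sub gaussian {
      my ($x, $mu, $var) = @_;
      return exp( -($x - $mu)**2 / (2 * $var) ) / sqrt(2 * PI * $var);
  }

  my $loglik = 0;
  for my $x (@data) {
      my $p = 0;
      $p += $prior[$_] * gaussian($x, $mean[$_], $var[$_]) for 0 .. $#prior;
      $loglik += log($p);
  }
  my $K          = scalar @prior;
  my $free_parms = $K * (1 + 1 + 1) - 1;   # mean, variance, prior per 1-D component, minus 1
  my $mdl        = -$loglik + ($free_parms / 2) * log(scalar @data);
  printf "MDL = %.3f (smaller is better)\n", $mdl;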
For our second measure of clustering quality, we use 'trace(SW^-1 . SB)', where SW is the within-class scatter matrix, more commonly denoted S_w, and SB is the between-class scatter matrix, more commonly denoted S_b (the underscore means subscript). This measure can be thought of as the normalized average distance between the clusters, the normalization being provided by the average cluster covariance through SW^-1. Therefore, the larger the value of this quality measure, the better the separation between the clusters. Since this measure has its roots in the Fisher linear discriminant function, we incorporate the word 'fisher' in the name of the quality measure. _Note that this measure is good only when the clusters are disjoint._ When the clusters exhibit significant overlap, the numbers produced by this quality measure tend to be meaningless.
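The following standalone sketch, not taken from this module, computes the measure for two well-separated clusters of 2-D points; the scatter matrices are built here as prior-weighted sums, which is one common convention, and the exact bookkeeping inside the module may differ.

  use strict;
  use warnings;

  # SW is taken as the prior-weighted sum of the within-cluster covariances and
  # SB as the prior-weighted scatter of the cluster means about the global mean;
  # the quality measure is trace(SW^-1 . SB).
  my %cluster = (
      A => [ [0.1, 0.2], [0.3, -0.1], [-0.2, 0.0] ],
      B => [ [5.0, 5.2], [4.8, 5.1],  [5.3, 4.9]  ],
  );

  sub mean2 {                          # mean of a list of 2-D points
      my @pts = @_;
      my @m = (0, 0);
      for my $p (@pts) { $m[0] += $p->[0]; $m[1] += $p->[1]; }
      return [ $m[0] / @pts, $m[1] / @pts ];
  }
  sub outer2 {                         # v * v^T for a 2-D vector v
      my ($v) = @_;
      return [ [ $v->[0] * $v->[0], $v->[0] * $v->[1] ],
               [ $v->[1] * $v->[0], $v->[1] * $v->[1] ] ];
  }
  sub add2x2 {                         # m1 + w * m2 for 2x2 matrices
      my ($m1, $m2, $w) = @_;
      $w //= 1;
      return [ map { my $i = $_;
                     [ map { $m1->[$i][$_] + $w * $m2->[$i][$_] } 0, 1 ] } 0, 1 ];
  }
  sub inv2x2 {
      my ($m) = @_;
      my $det = $m->[0][0] * $m->[1][1] - $m->[0][1] * $m->[1][0];
      return [ [  $m->[1][1] / $det, -$m->[0][1] / $det ],
               [ -$m->[1][0] / $det,  $m->[0][0] / $det ] ];
  }

  my @all         = map { @$_ } values %cluster;
  my $global_mean = mean2(@all);
  my $N           = scalar @all;

  my $SW = [ [0, 0], [0, 0] ];
  my $SB = [ [0, 0], [0, 0] ];
  for my $pts (values %cluster) {
      my $prior = @$pts / $N;
      my $mu    = mean2(@$pts);
      my $cov   = [ [0, 0], [0, 0] ];
      for my $p (@$pts) {
          my $d = [ $p->[0] - $mu->[0], $p->[1] - $mu->[1] ];
          $cov = add2x2($cov, outer2($d), 1 / @$pts);
      }
      $SW = add2x2($SW, $cov, $prior);
      my $dm = [ $mu->[0] - $global_mean->[0], $mu->[1] - $global_mean->[1] ];
      $SB = add2x2($SB, outer2($dm), $prior);
  }

  # trace(SW^-1 . SB) is the sum of the diagonal entries of the product.
  my $SWinv = inv2x2($SW);
  my $trace = 0;
  for my $i (0, 1) {
      $trace += $SWinv->[$i][0] * $SB->[0][$i] + $SWinv->[$i][1] * $SB->[1][$i];
  }
  printf "trace(SW^-1 . SB) = %.3f (larger means better-separated clusters)\n", $trace;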
RPM found in directory: /packages/linux-pbone/ftp5.gwdg.de/pub/opensuse/repositories/devel:/languages:/perl:/CPAN-A/openSUSE_Tumbleweed/noarch