perl-Algorithm-ExpectationMaximization RPM build for openSUSE Tumbleweed.

Name : perl-Algorithm-ExpectationMaximization
Version : 1.22
Release : 3.51
Date : 2024-02-15 04:18:46
Group : Development/Libraries/Perl
Size : 0.39 MB
Vendor : obs://build_opensuse_org/devel:languages:perl
Source RPM : perl-Algorithm-ExpectationMaximization-1.22-3.51.src.rpm
Packager : (none)
Summary : Perl module for clustering numerical data with the Expectation-Maximization algorithm
Description :
*Algorithm::ExpectationMaximization* is a _perl5_ module for the
Expectation-Maximization (EM) method of clustering numerical data that
lends itself to modeling as a Gaussian mixture. Since the module is
entirely in Perl (in the sense that it is not a Perl wrapper around a C
library that actually does the clustering), the code in the module can
easily be modified to experiment with several aspects of EM.

Gaussian Mixture Modeling (GMM) is based on the assumption that the data
consists of 'K' Gaussian components, each characterized by its own mean
vector and its own covariance matrix. Obviously, given observed data for
clustering, we do not know which of the 'K' Gaussian components was
responsible for any of the data elements. GMM also associates a prior
probability with each Gaussian component. In general, these priors will
also be unknown. So the problem of clustering consists of estimating the
posterior class probability at each data element and also estimating the
class priors. Once these posterior class probabilities and the priors are
estimated with EM, we can use the naive Bayes' classifier to partition the
data into disjoint clusters. Or, for "soft" clustering, we can find all the
data elements that belong to a Gaussian component on the basis of the
posterior class probabilities at the data elements exceeding a prescribed
threshold.
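
To make the Bayes' rule step concrete, here is a small self-contained Perl
sketch (not code from this module) that computes the posterior class
probabilities at a single one-dimensional data point from assumed priors,
means, and variances; the module itself does the same thing with full mean
vectors and covariance matrices:

  use strict;
  use warnings;
  use List::Util qw(sum);

  # Density of a 1-D Gaussian with the given mean and variance.
  sub gaussian_density {
      my ($x, $mean, $var) = @_;
      return exp( -($x - $mean)**2 / (2 * $var) )
             / sqrt(2 * 3.141592653589793 * $var);
  }

  # Posterior class probabilities at data point x, by Bayes' rule:
  # P(k | x) = prior_k * density_k(x) / sum over all components.
  sub posteriors {
      my ($x, $priors, $means, $vars) = @_;
      my @joint = map { $priors->[$_] * gaussian_density($x, $means->[$_], $vars->[$_]) }
                  0 .. $#$priors;
      my $evidence = sum(@joint);                 # total probability of x
      return [ map { $_ / $evidence } @joint ];
  }

  my $post = posteriors( 2.5, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0] );
  printf "P(class %d | x) = %.4f\n", $_, $post->[$_] for 0 .. $#$post;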

If you do not mind that the EM algorithm can occasionally get stuck in a
local maximum and therefore produce a wrong answer even when you know the
data to be perfectly multimodal Gaussian, EM is probably the most magical
approach to clustering multidimensional data. Consider the case of
clustering three-dimensional data. Each Gaussian cluster in 3D space is
characterized by the following 10 variables: the 6 unique elements of the
'3x3' covariance matrix (which must be symmetric positive-definite), the 3
unique elements of the mean, and the prior associated with the Gaussian.
Now let's say you expect to see six Gaussians in your data. That means you
would want the values for 59 variables (remember the unit-summation
constraint on the class priors, which reduces the overall number of
variables by one) to be estimated by the algorithm that seeks to discover
the clusters in your data. What's amazing is that, despite the large number
of variables that must be optimized simultaneously, the EM algorithm will
very likely give you a good approximation to the right answer.
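
The counting in the preceding paragraph can be captured in a few lines of
Perl; the function below is purely illustrative:

  use strict;
  use warnings;

  # Free parameters for a mixture of K Gaussians in d dimensions:
  # d*(d+1)/2 unique covariance entries and d mean entries per Gaussian,
  # plus K priors constrained to sum to 1 (losing one degree of freedom).
  sub em_free_parameters {
      my ($K, $d) = @_;
      my $per_gaussian = $d * ($d + 1) / 2 + $d;   # covariance + mean
      return $K * $per_gaussian + ($K - 1);        # plus constrained priors
  }

  print em_free_parameters(6, 3), "\n";   # prints 59, as in the text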

At its core, EM depends on the notion of unobserved data and the averaging
of the log-likelihood of the data actually observed over all admissible
probabilities for the unobserved data. But what is unobserved data? While
in some cases where EM is used, the unobserved data is literally the
missing data, in others, it is something that cannot be seen directly but
that nonetheless is relevant to the data actually observed. For the case of
clustering multidimensional numerical data that can be modeled as a
Gaussian mixture, it turns out that the best way to think of the unobserved
data is in terms of a sequence of random variables, one for each observed
data point, whose values dictate the selection of the Gaussian for that
data point. This point is explained in great detail in my on-line tutorial
at
https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf.

The EM algorithm in our context reduces to an iterative invocation of the
following steps: (1) Given the current guess for the means and the
covariances of the different Gaussians in our mixture model, use Bayes'
Rule to update the posterior class probabilities at each of the data
points; (2) Using the updated posterior class probabilities, first update
the class priors; (3) Using the updated class priors, update the class
means and the class covariances; and go back to Step (1). Ideally, the
iterations should terminate when the expected log-likelihood of the
observed data has reached a maximum and does not change with any further
iterations. The stopping rule used in this module is the detection of no
change over three consecutive iterations in the values calculated for the
priors.
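
The following bare-bones Perl sketch shows those three steps for
one-dimensional data. It is not the module's code: the module handles full
covariance matrices and uses the priors-stabilization stopping rule just
described, whereas this sketch simply runs a fixed number of iterations.

  use strict;
  use warnings;
  use List::Util qw(sum);

  sub run_em_1d {
      my ($data, $means, $vars, $priors, $iterations) = @_;
      my $K = @$means;
      for (1 .. $iterations) {
          # Step (1): posterior class probabilities at each data point
          my @post;
          for my $i (0 .. $#$data) {
              my @joint = map {
                  $priors->[$_]
                  * exp( -($data->[$i] - $means->[$_])**2 / (2 * $vars->[$_]) )
                  / sqrt(2 * 3.141592653589793 * $vars->[$_])
              } 0 .. $K - 1;
              my $z = sum(@joint);
              $post[$i] = [ map { $_ / $z } @joint ];
          }
          # Step (2): update the class priors from the posteriors
          my @resp = map { my $k = $_;
                           sum( map { $post[$_][$k] } 0 .. $#$data ) } 0 .. $K - 1;
          $priors = [ map { $_ / @$data } @resp ];
          # Step (3): update the class means and variances; then loop back
          for my $k (0 .. $K - 1) {
              $means->[$k] = sum( map { $post[$_][$k] * $data->[$_] } 0 .. $#$data )
                             / $resp[$k];
              $vars->[$k]  = sum( map { $post[$_][$k] * ($data->[$_] - $means->[$k])**2 }
                                  0 .. $#$data ) / $resp[$k];
          }
      }
      return ($means, $vars, $priors);
  }

  my ($m, $v, $p) = run_em_1d( [1.0, 1.2, 0.8, 4.9, 5.1, 5.3],
                               [0, 5], [1, 1], [0.5, 0.5], 50 );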

This module provides three different choices for seeding the clusters: (1)
random, (2) kmeans, and (3) manual. When random seeding is chosen, the
algorithm randomly selects 'K' data elements as cluster seeds. That is, the
data vectors associated with these seeds are treated as initial guesses for
the means of the Gaussian distributions. The covariances are then set to
the values calculated from the entire dataset with respect to the means
corresponding to the seeds. With kmeans seeding, on the other hand, the
means and the covariances are set to whatever values are returned by the
kmeans algorithm. And, when seeding is set to manual, you are allowed to
choose 'K' data elements --- by specifying their tag names --- for the
seeds. The rest of the EM initialization for the manual mode is the same as
for the random mode. The algorithm allows for the initial priors to be
specified for the manual mode of seeding.
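
As a usage sketch, the constructor call below shows where the seeding
choice enters. The option and method names follow the module's CPAN
documentation as best as can be reconstructed here; the canned_example*.pl
scripts listed later on this page are the authoritative reference.

  use strict;
  use warnings;
  use Algorithm::ExpectationMaximization;

  my $clusterer = Algorithm::ExpectationMaximization->new(
      datafile          => 'mydatafile1.dat',
      mask              => 'N111',        # tag column plus 3 data columns
      K                 => 3,
      seeding           => 'kmeans',      # or 'random', or 'manual'
      # seed_tags       => ['a26','b53','c49'],  # manual seeding only
      # class_priors    => [0.6, 0.2, 0.2],      # optional, manual mode
      max_em_iterations => 300,
      terminal_output   => 1,
  );
  $clusterer->read_data_from_file();
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();      # hard clusters via naive Bayes'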

Much of the code for the kmeans-based seeding of EM was drawn from the
'Algorithm::KMeans' module by me. The code from that module used here
corresponds to the case when the 'cluster_seeding' option in the
'Algorithm::KMeans' module is set to 'smart'. The 'smart' option for KMeans
consists of subjecting the data to a principal components analysis (PCA) to
discover the direction of maximum variance in the data space. The data
points are then projected onto this direction and a histogram constructed
from the projections. Centers of the 'K' largest peaks in this smoothed
histogram are used to seed the KMeans-based clusterer. As you'd expect, the
output of KMeans is then used to seed the EM algorithm.
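
The following sketch, written independently of the actual code in this
module and in 'Algorithm::KMeans', shows the idea for two-dimensional data:
find the maximum-variance direction with power iteration, project the data
onto it, histogram the projections, and take the K tallest bins as rough
seed positions (the real code also smooths the histogram before locating
the peaks):

  use strict;
  use warnings;
  use List::Util qw(sum);

  sub smart_seed_positions {
      my ($points, $K, $nbins) = @_;
      my $n = @$points;
      my @mean = map { my $j = $_;
                       sum( map { $_->[$j] } @$points ) / $n } (0, 1);
      # 2x2 covariance matrix of the data: (cxx, cxy, cyy)
      my @c = (0, 0, 0);
      for my $p (@$points) {
          my ($dx, $dy) = ($p->[0] - $mean[0], $p->[1] - $mean[1]);
          $c[0] += $dx * $dx / $n; $c[1] += $dx * $dy / $n; $c[2] += $dy * $dy / $n;
      }
      # power iteration for the dominant eigenvector (max-variance direction)
      my @v = (1, 0);
      for (1 .. 100) {
          my @w = ( $c[0] * $v[0] + $c[1] * $v[1],
                    $c[1] * $v[0] + $c[2] * $v[1] );
          my $norm = sqrt( $w[0]**2 + $w[1]**2 );
          @v = ( $w[0] / $norm, $w[1] / $norm );
      }
      # project the points on this direction and histogram the projections
      my @proj = map { ($_->[0] - $mean[0]) * $v[0]
                     + ($_->[1] - $mean[1]) * $v[1] } @$points;
      my ($lo, $hi) = ( (sort { $a <=> $b } @proj)[0, -1] );
      my @hist = (0) x $nbins;
      $hist[ $_ >= $hi ? $nbins - 1
                       : int( ($_ - $lo) / ($hi - $lo) * $nbins ) ]++ for @proj;
      # centers of the K tallest bins, as positions along the direction v
      my @top = ( sort { $hist[$b] <=> $hist[$a] } 0 .. $nbins - 1 )[0 .. $K - 1];
      return [ map { $lo + ($_ + 0.5) * ($hi - $lo) / $nbins } @top ];
  }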

This module uses two different criteria to measure the quality of the
clustering achieved. The first is the Minimum Description Length (MDL)
criterion proposed originally by Rissanen (J. Rissanen: "Modeling by
Shortest Data Description," Automatica, 1978, and "A Universal Prior for
Integers and Estimation by Minimum Description Length," Annals of
Statistics, 1983). The MDL criterion is the difference of a log-likelihood
term for all of the observed data and a model-complexity penalty term. In
general, both the log-likelihood and the model-complexity terms increase as
the number of clusters increases. For the penalty term, the form of the MDL
criterion in this module uses the Bayesian Information Criterion (BIC) of
G. Schwarz, "Estimating the Dimension of a Model," The Annals of
Statistics, 1978. In general, the smaller the value of the MDL quality
measure, the better the clustering of the data.
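
One common way to write such a BIC-penalized score is shown below. Whether
this module uses exactly this normalization would have to be checked
against its source; the point here is the shape of the trade-off, in which
adding clusters raises the likelihood term but also the penalty.

  use strict;
  use warnings;

  # Description length = negative log-likelihood of the observed data
  # plus half the number of free model parameters times log(n).
  sub mdl_score {
      my ($log_likelihood, $num_params, $n) = @_;
      return -$log_likelihood + 0.5 * $num_params * log($n);
  }

  # e.g. K = 6 Gaussians in 3-D: 59 free parameters, as computed earlier
  printf "%.2f\n", mdl_score(-1234.5, 59, 1000);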

For our second measure of clustering quality, we use trace(SW^-1 . SB),
where SW is the within-class scatter matrix, more commonly denoted S_w, and
SB is the between-class scatter matrix, more commonly denoted S_b (the
underscore means subscript). This measure can be thought of as the
normalized average distance between the clusters, the normalization being
provided by the average cluster covariance through SW^-1. Therefore, the
larger the value of this quality measure, the better the separation between
the clusters. Since this measure has its roots in the Fisher linear
discriminant function, we incorporate the word 'fisher' in the name of the
quality measure. _Note that this measure is good only when the clusters are
disjoint._ When the clusters exhibit significant overlap, the numbers
produced by this quality measure tend to be meaningless.
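
For the two-dimensional case, where the 2x2 matrix inverse can be written
out explicitly, the measure can be computed as in this illustrative sketch
(not the module's own code):

  use strict;
  use warnings;

  # trace(SW^-1 . SB) for 2x2 scatter matrices [[a,b],[c,d]];
  # larger values mean better-separated clusters.
  sub fisher_quality_2d {
      my ($SW, $SB) = @_;
      my ($a, $b, $c, $d) = ( $SW->[0][0], $SW->[0][1],
                              $SW->[1][0], $SW->[1][1] );
      my $det = $a * $d - $b * $c;
      die "SW is singular" if abs($det) < 1e-12;
      # explicit inverse of SW
      my @inv = ( [  $d / $det, -$b / $det ],
                  [ -$c / $det,  $a / $det ] );
      # trace of the product: only its diagonal entries are needed
      my $t = 0;
      $t += $inv[$_][0] * $SB->[0][$_] + $inv[$_][1] * $SB->[1][$_] for (0, 1);
      return $t;
  }

  printf "%.3f\n", fisher_quality_2d( [[2,0],[0,2]], [[8,0],[0,2]] );  # 5.000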

RPM found in directory: /packages/linux-pbone/ftp5.gwdg.de/pub/opensuse/repositories/devel:/languages:/perl:/CPAN-A/openSUSE_Tumbleweed/noarch


Download :
ftp.icm.edu.pl  perl-Algorithm-ExpectationMaximization-1.22-3.51.noarch.rpm

Provides :
perl(Algorithm::ExpectationMaximization)
perl-Algorithm-ExpectationMaximization

Requires :
perl(:MODULE_COMPAT_5.38.2)
perl(Graphics::GnuplotIF) >= 1.4
perl(Math::GSL) >= 0.26
perl(Math::Random) >= 0.71
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsZstd) <= 5.4.18-1


Content of RPM :
/usr/lib/perl5/vendor_perl/5.38.2/Algorithm
/usr/lib/perl5/vendor_perl/5.38.2/Algorithm/ExpectationMaximization.pm
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/README
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/README
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example1.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example2.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example3.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example4.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example5.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/canned_example6.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/cleanup_directory.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/cluster_plot.png
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/data_generator.pl
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/data_scatter_plot.png
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/datafile_1d.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile1.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile2.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile3.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile4.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile5.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile6.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/mydatafile7.dat
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param1.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param2.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param3.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param4.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param5.txt
/usr/share/doc/packages/perl-Algorithm-ExpectationMaximization/examples/param6.txt
There are 16 more files in this RPM.

 