Some notes about using the multivariate Gaussian (MG) and the Gaussian mixture model (GMM) in research work.
In one of our programs, we needed to provide a model assumption and fit the existing data to that model. I learned a lot from that process and recapped a lot of mathematical knowledge, mostly from statistics and machine learning. Since the code had to run well on different devices, we implemented our own linear algebra routines from scratch; the concepts are old, but there is still a lot of work involved.
Types of distributions
This describes what kind of distribution a specific variable follows. There are two principles for classifying distributions. One is the type of the distribution function, such as the uniform distribution, the normal distribution, and so on; there are many different types of distribution functions.
The other principle is the independence of multiple variables. For example, if we try to fit a model to samples from two variables, we can assume they are either independent or correlated. Both cases, multiple independent variables or one density function over multiple variables (such as the multivariate Gaussian model), belong to the category of multivariate probability analysis (be careful about this). The key difference is that under the independence assumption P(A,B) = P(A) * P(B), while in the correlated case P(A,B) = P(A|B) * P(B). The multivariate case also brings in the concept of joint probability.
Be careful about the terms used when we say a particular variable follows a distribution. We actually mean two things. One is its probability density function (pdf); for the Gaussian distribution, the pdf has that classic bell shape. The other is the cumulative distribution function (cdf), which is the integral of the pdf. For a continuous variable, the pdf value at x = a is a density, not a probability (P(x = a) itself is zero); actual probabilities come from integrals such as P(a < x < b), which is the area under the pdf between a and b.
Be careful about the definition of likelihood; see this video. L(mean = a, stdev = b | sample = c) is a likelihood: what we know is the sample, and the likelihood expresses how plausible a particular model (we can change the model) is given that sample. In other words, given the sample, how probable is it to observe this sample when the model takes a specific form? This is a kind of reverse of computing a probability (there, we know the probability density function and ask for the probability of getting a specific sample).
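Here is a minimal sketch of the likelihood idea, assuming scipy; the sample values and parameter choices are hypothetical. The sample is fixed, and we compare how plausible different (mean, stdev) choices are.

```python
# A minimal sketch of likelihood: the sample is fixed, the model varies.
from scipy.stats import norm

sample = [4.2, 5.1, 4.8, 5.5]  # hypothetical observations

def likelihood(mean, stdev):
    # L(mean, stdev | sample) = product of pdf values at each observation
    result = 1.0
    for x in sample:
        result *= norm.pdf(x, loc=mean, scale=stdev)
    return result

# The model whose parameters are closer to the sample gives a higher likelihood.
print(likelihood(5.0, 0.5))  # close to the data
print(likelihood(0.0, 0.5))  # far from the data, much smaller likelihood
```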
A Q-Q plot can be used to choose which distribution assumption should be used.
Deciding whether to use the multivariate assumption or the independence assumption. (There are still some open questions in practice: for example, when we find two variables have some kind of relation and the correlation coefficient lies somewhere between 0 and 1, neither fully independent nor fully correlated, which assumption should we use?)
There is no simple closed form for computing the cumulative distribution function in general; the Monte Carlo approach is a simple and efficient way to do it on computers (the integral is hard, but we can map the input variables to the outcome probability). A figure (omitted here) shows the convergence rate: the estimate converges quickly after about 1k samples, and we can get similar convergence figures for the different types of problems we have. The output of a Monte Carlo run can be an estimate of P(a < x < b), as in the sketch below.
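A minimal Monte Carlo sketch for P(a < x < b) under a standard normal, assuming numpy; the bounds a and b are arbitrary examples.

```python
# Estimate P(a < x < b) for N(0, 1) by sampling.
import numpy as np

rng = np.random.default_rng(0)
a, b = -1.0, 1.0

samples = rng.standard_normal(100_000)             # draw from N(0, 1)
estimate = np.mean((samples > a) & (samples < b))  # fraction falling in (a, b)

print(estimate)  # converges to ~0.6827 as the sample count grows
```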
Density estimation
The histogram is the straightforward first step based on simple descriptive statistics: just divide the range into different bins and count how many observations fall into each bin.
There are all kinds of documents about kernel density estimation; the straightforward idea is to obtain a continuous version of the histogram. We basically stack unit kernels together to construct the density function, where the unit is a standard kernel such as the Gaussian kernel. The approach is simple: for each observation, we place one standard kernel at its location. (Both pyplot's hist and seaborn can be used for this.)
A Python implementation of KDE is scipy.stats.gaussian_kde (scikit-learn provides a similar KernelDensity estimator). The input is the raw observation data, as in the sketch below.
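A minimal KDE sketch with scipy.stats.gaussian_kde; the observations below are hypothetical.

```python
# Fit a gaussian-kernel KDE to raw observations and evaluate it on a grid.
import numpy as np
from scipy.stats import gaussian_kde

observations = np.array([1.2, 1.9, 2.1, 2.4, 3.0, 3.1, 4.5])

kde = gaussian_kde(observations)   # bandwidth chosen automatically
grid = np.linspace(0.0, 6.0, 50)   # points where we evaluate the density
density = kde(grid)                # estimated pdf values on the grid

print(density.max())
```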
Pay attention to the y axis of the histogram. The label of y can be changed from the count scale to the frequency scale; the operation is simple, just divide each count by the total number of observations (and also by the bin width if you want a density that integrates to 1). Pay attention to the meaning of the y axis when drawing the figure, as in the sketch below.
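A minimal sketch contrasting the two y-axis scales, assuming matplotlib; the data are hypothetical.

```python
# Same data, two y-axis conventions: raw counts vs normalized density.
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).standard_normal(1000)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=30)                # y axis = raw counts per bin
ax2.hist(data, bins=30, density=True)  # y axis = density (area sums to 1)
ax1.set_title("counts")
ax2.set_title("density")
plt.show()
```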
Maybe the MSE is a simple metric to check whether there is a good match between the observation data and the KDE function.
Histogram, density plot, probability density function
A density plot can be viewed as an estimate of the probability density function; the probability density function itself is the theoretical object.
Approach \ Assumption | Independent | Dependent |
---|---|---|
Parametric | Uniform distribution, univariate Gaussian distribution | Multivariate Gaussian distribution |
Nonparametric | Histogram, KDE (mixture model based on different kernel functions), univariate Gaussian mixture model | Multivariate Gaussian mixture model (GMM) |
CDF of the normal distribution
This video provides a good explanation:
https://www.youtube.com/watch?v=26QbWYBCw7Y
The idea is to express it via the error function, with which we can conveniently compute the CDF of the normal distribution, namely Pr(x < v). There is a close connection between the error function and the integral of the normal density, and the CDF can be written as Phi(v) = 0.5 * (1 + erf((v - mu) / (sigma * sqrt(2)))), as in the sketch below.
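A minimal sketch of the erf-based normal CDF, using only the Python standard library.

```python
# Pr(x < v) for N(mu, sigma^2) expressed through the error function.
import math

def normal_cdf(v, mu=0.0, sigma=1.0):
    # Phi(v) = 0.5 * (1 + erf((v - mu) / (sigma * sqrt(2))))
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(1.0))  # ~0.8413 for the standard normal
```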
The good thing about a KDE with a Gaussian kernel is that we can compute the error-function-based CDF for each component and sum them up; the result is described here:
https://stats.stackexchange.com/questions/296285/find-cdf-from-an-estimated-pdf-estimated-by-kde
One simple way to think about KDE is to use the uniform distribution as the kernel. For the Gaussian kernel, the bandwidth h can be viewed as the standard deviation of each kernel component, since it determines the shape of the kernel curve. After computing (x - x_i)/h, the standardized variable follows N(0, 1). A sketch of the resulting CDF follows.
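Following the stack exchange answer above, the CDF of a gaussian-kernel KDE is just the average of the component normal CDFs, one per observation, each with standard deviation h. A minimal sketch, reusing the normal_cdf helper from the sketch above; the observations are hypothetical.

```python
# KDE CDF = average of per-observation normal CDFs with bandwidth h.
def kde_cdf(x, observations, h):
    return sum(normal_cdf(x, mu=x_i, sigma=h) for x_i in observations) / len(observations)

obs = [1.2, 1.9, 2.1, 2.4, 3.0]  # hypothetical observations
print(kde_cdf(2.5, obs, h=0.5))
```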
Distinguishing from regression models
The correlation between two variables: the naive idea is to use a scatter plot to see whether two variables are correlated; if there is noise everywhere, it suggests there is no relationship between them. In Excel, we can use the CORREL function to compute this, which is also known as the Pearson correlation coefficient.
If there is correlation between two variables (such as in the temperature and ice cream sales example), then we can derive a regression model to express this relationship; the model here can be a linear or non-linear regression model. The approach adopted is least squares. We do not dive into the details here, but a minimal sketch follows.
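A minimal sketch of checking correlation and fitting a linear least squares model, assuming numpy; the temperature and ice cream sales numbers are made up.

```python
# Pearson correlation plus a least squares line fit.
import numpy as np

temperature = np.array([20.0, 22.0, 25.0, 28.0, 30.0, 33.0])
sales       = np.array([120.0, 135.0, 160.0, 190.0, 205.0, 240.0])

r = np.corrcoef(temperature, sales)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(temperature, sales, 1)  # least squares line

print(r)                 # close to 1 here, so a linear model is reasonable
print(slope, intercept)  # sales ~= slope * temperature + intercept
```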
Sampling the multivariate gaussian based on normal distribution
It took me some time to figure out how to sample from the multivariate Gaussian distribution. The key idea is to first sample from the univariate standard normal distribution and then transform the result into the multivariate Gaussian distribution. The Wikipedia page has a straightforward explanation: when X is multivariate Gaussian, X = AZ + u, where the covariance matrix is AA^T, Z is a vector of standard normal random variables, and u is the mean vector of the multivariate Gaussian distribution.
This online paper shows lots of details here:
Weishan Dong and Xin Yao. “Unified eigen analysis on multivariate Gaussian based estimation of distribution algorithms”. In: Information Sciences 178.15 (2008), pp. 3000–3023.
Here are more explanations about the theory part
https://math.stackexchange.com/questions/1801403/decomposition-of-a-positive-semidefinite-matrix
The implementation part took me some time. In Python, it is easy to do this with various libraries:
https://stephens999.github.io/fiveMinuteStats/mvnorm_eigen.html
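A minimal sketch of X = AZ + u via eigendecomposition of the covariance, in the spirit of the fiveMinuteStats page above, assuming numpy; the mean and covariance below are hypothetical 2-d parameters.

```python
# Sample a multivariate gaussian from standard normals: X = A Z + u.
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])

# cov = V diag(w) V^T, so A = V diag(sqrt(w)) satisfies A A^T = cov.
w, V = np.linalg.eigh(cov)
A = V @ np.diag(np.sqrt(w))

Z = rng.standard_normal((2, 10_000))  # columns are standard normal vectors
X = A @ Z + mean[:, None]             # columns are multivariate gaussian samples

print(np.cov(X))  # should be close to cov
```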
The matrix decomposition can be implemented with existing libraries such as Eigen on the CPU.
However, there is no readily available code to implement the associated math functions on the GPU (in pure C) conveniently. We may use CUDA-related libraries, but it is hard to make that code portable across CPUs, NVIDIA GPUs, and AMD GPUs. Here are some other available resources:
Single node CPUs:
LAPACK: https://netlib.org/lapack/ The de facto standard library (Fortran and C APIs), developed by UTK. Vendors like Intel, IBM, and Cray provide their own proprietary versions.
PLASMA: A newer library developed by UTK (C APIs), adapted to multicore CPUs (OpenMP). https://github.com/icl-utk-edu/plasma/
Eigen: C++ library, but the performance is not so great. https://eigen.tuxfamily.org/index.php?title=Main_Page
GSL: GNU Scientific Library.
GPUs:
MAGMA: Another package from UTK. https://icl.utk.edu/magma/software/index.html It supports both NVIDIA and AMD GPUs.
cuSolver: https://docs.nvidia.com/cuda/cusolver/
It seems that MAGMA supports both NVIDIA GPUs (CUDA) and AMD GPUs (HIP); we did not know that when we did the project. Try to get familiar with the associated libraries.
Techniques used for writing an eigenvalue solver from scratch and related issues
See another blog post regarding the linear algebra part.
The idea of expectation maximization for computing the GMM
It turns out that the multivariate Gaussian is not enough. After going through some papers, we found that a more common and general way to model a variable is the Gaussian mixture model; it can fit the sample data better than a single normal distribution. The question is how to implement this GMM for independent 1-d variables and for multiple variables. The standard way is the expectation maximization (EM) algorithm, and there is a good explanation in this series of online videos.
One related paper implementing this on the GPU is
https://www.mdpi.com/1999-4893/14/10/285
The core idea can be described as follows, assuming there are two components. If we have a guess of (u1, sigma1) and (u2, sigma2), we can compute the likelihood that each observed data sample actually comes from each of these two distributions, as in the sketch below.
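A minimal sketch of the E-step for a 2-component 1-d GMM, assuming numpy and scipy; the observations, parameter guesses, and mixing weights are hypothetical.

```python
# E-step: given guessed parameters, compute the responsibility (probability)
# that each observation came from each component.
import numpy as np
from scipy.stats import norm

data = np.array([1.0, 1.5, 2.0, 7.5, 8.0, 8.4])
u1, sigma1, w1 = 1.0, 1.0, 0.5   # initial guess for component 1
u2, sigma2, w2 = 8.0, 1.0, 0.5   # initial guess for component 2

p1 = w1 * norm.pdf(data, loc=u1, scale=sigma1)
p2 = w2 * norm.pdf(data, loc=u2, scale=sigma2)

resp1 = p1 / (p1 + p2)   # responsibility of component 1 for each sample
resp2 = 1.0 - resp1

# The M-step would then re-estimate (u, sigma, w) from these responsibilities
# and the two steps iterate until convergence.
print(resp1)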
Implementing GMM on GPU
Python example
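A minimal CPU-side sketch of fitting a 2-component GMM with scikit-learn, useful as a reference result before writing a GPU version; the data below are synthetic.

```python
# Fit a 2-component GMM (EM under the hood) with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500),
                       rng.normal(5.0, 0.5, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_.ravel())        # estimated component means
print(gmm.covariances_.ravel())  # estimated component variances
```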
Least squares regression vs expectation maximization
References
A good tutorial about using the GSL library:
Likelihood concepts, good paper: