## Monday, July 23, 2007

### Math Models and Monte Carlo Simulation

It's always nice to know your odds of success before taking on any challenge. In some cases, such as card games, the odds can be computed from the information known to you. In other cases, odds are based on assumptions and/or past performance. This is true for things like sporting events and performance of financial products. Engineering design can also involve probabilities. Design engineers doing tolerance stack calculations are determining the odds that things will fit in a given space. In many cases, the required calculations are prohibitively complex. That is when Monte Carlo simulation becomes so appealing.

Monte Carlo simulation involves running a math model repeatedly using "random" numbers from assumed distributions as the inputs. Then you observe the distributions of the output variables and use them to predict the probabilities of seeing values within particular ranges.

Simulation in TK Solver has the same prerequisites as List Solving, plus a few more. The model must have input and output variables on the variable sheet and the model must solve using the direct solver or with automatic iteration (i.e. - without having to manually provide guess values).

Next, you need to determine if any of the simulation inputs are correlated. If two variables are correlated, they tend to vary together. If any input variables are correlated, the random values used as inputs must be generated with that in mind. This can have a significant impact on the simulation results, as the results will show greater variability if no correlations are assumed.

TK Solver's list fill facility can be used to generate lists from a variety of distributions but it assumes that the variables are uncorrelated. The TK Library includes additional procedures for generating independent lists of random numbers.

We recently developed a procedure (let's call it "rancorr" for now) for generating correlated sets of lists. Rancorr provides a table for specifying correlations between the input variables. The correlation matrix for the input variables defaults to 0’s off the diagonal, indicating that the variables are independent. If non-zero correlations between any or all of the variables are to be assumed, edit the correlation matrix and the simulation will account for it. Correlations can vary from -1 to 1. The correlation matrix is symmetric, so if you enter a correlation in row m and column n of the matrix, the same value is automatically entered in row n, column m. It is possible to construct a correlation matrix that is mathematically infeasible. The procedure checks and reports on that condition.

Rancorr includes another table for specifying the distribution parameters for each variable. Here is the list of distributions:

Bernoulli, Beta, Binomial, Cauchy, Erlang, Exponential, Extreme Value, Gamma, Half Normal, Hypergeometric, Kumaraswamy, Log Series, Logistic, Lognormal, Negative Binomial, Normal, Poisson, Power, Rayleigh, Rectangle, Semicircle, Skellam, Trapezoid, Triangle, Truncated Normal, Uniform, Weibull, Weibull Series, Yule Simon, and Zipf Mandelbrot.

The TK Library includes many routines for fitting various probability distributions to lists. By fitting, we mean that they determine the parameters that cause a particular distribution to match a list of values as closely as possible. It is easy to launch a second TK, load a Library routine and then copy and paste an output list from a simulation run into the fitting routine.

That method assumes that you have some notion as to which distributions might fit the data. Each library routine does its best to determine the parameters of a specific distribution but does not compare the results to determine which fit best. We recently developed a TK routine called "fitall" which does just that. Fitall can run through a series of over 30 different distributions and determine the parameters of the one which fits a list the best.

Fitall processes each of the output lists, providing a frequency histogram, probability plots and descriptive statistics for a visual and numerical summary. The histogram also includes an overlay of the best fitting theoretical distribution. The descriptive statistics for the simulated values include:

Max/Min – Maximum and minimum values
Mean – Arithmetic mean
Standard Deviation – A measure of the variability
Mean +/- 3SD – The arithmetic mean plus (or minus) three standard deviations is often used as a practical estimate for the expected range.
Skewness – Used to compare the distribution with a normal distribution, skewness is an indication of the symmetry of the results about the mean value. The skewness for a normal distribution is zero, and any symmetric data should have skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values indicate data that are skewed right. By skewed left, we mean that the left tail is heavier than the right tail.
Kurtosis - Used to compare the distribution with a normal distribution, kurtosis is an indication of the “peakedness” of the resulting distribution. Positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution.

If the data contains any values less than or equal to zero, some of the distributions are eliminated. A chi-square test of the frequencies is done. Anderson-Darling and Kolmogorov-Smirnov test statistics are computed. The correlation between the data and the theoretical data is computed. The theory is that the best fitting distribution should have the smallest Chi-Square, Anderson-Darling and Kolmogorov-Smirnov statistics and the largest correlation coefficient.

The results from each distribution are ranked and the ranks summed to form a score. The distribution with the lowest score is selected. Tables summarize the results from each distribution tested along with the ranked scores.

Plots are available for each distribution. A plot of the best-fitting cumulative distribution is shown along with the actual data. A bar chart shows the observed and expected frequencies for five or more intervals.

Once you have the distribution parameters, you can compute the probability of observing values in any particular range. In effect, you now know the odds. The process only takes a minute or two so if you don't like the odds, you can tweak your math model and re-run the simulation until you're satisfied.

We are in the process of creating a Simulaton Wizard to automate the steps involved in using rancorr and fitall. Meanwhile, we are looking for individuals interested in testing these new tools. Please contact me if you are interested (todd@uts.com). You read this far... why not give them a try?!

Labels: , ,