trendline gaussian

  • I have a data set which looks like an incompleted gaussian distribution, or bell curve truncated on the right. How can I produce a Trendline best-fit Gaussian so that I can get the mean, standard deviation, and R-squared for the fit ?


    I have written a QBASIC program (for DOS) that can generate data points for a given mean and SD, that I can then paste into a spreadsheet alongside my data, and perform a sum of squares of differences, but that is only one fit, not "best-fit".


    Surely someone has done this before ?


    I have Borland Delphi 2005 if that helps, but from the look of it, its going to take a year to master it.

  • Re: trendline gaussian


    Welcome to the OzGrid Forum!


    You can probably come close by doing an eyeball estimate of the mean parameter by taking the middle of the highest clump of data. Then I would use excel's Goal Seek to find the value for sigma that minimizes the sum of squared differences. Then with that estimate, use Goal seek to find the best mean, and iterate a few times. It won't guarantee the best two parameter fit, but then you are guessing at the truncated Gaussian shape anyway.


    (Added in Edit)
    Of course what I wrote above makes no sense. Goal Seek doesn't work that way. Solver is the way to go. (This is my second extra-stupid reply this month. I blame this one on writing it before I had my first cup of coffee.)

  • Re: trendline gaussian


    Never used it, but the Excel solver can optimize for 2 variables (mean and standard deviation) at once. Then no iteration by hand is needed, although starting values (cleverly chosen) are needed, and from my statistics background, can perhaps alter the solution for m and s you get in the end.


    Wigi

    Regards,


    Wigi


    Excel MVP 2011-2014


    For more Excel memes: visit http://www.wimgielis.com ==> English articles ==> Excel memes


    -- Topics without [CODE] tags or a meaningful topic title, will be neglected by me (and probably many others as well) --

  • Re: trendline gaussian


    Why are you doing this?


    If what you are actually trying to do is find out whether or not some observations are from a particular distribution then you should be doing the correct statistical test.


    For example. If you have 100 samples and you want to estimate the true mean and standard deviation of the population then we should be using a t-test (although if you have many more observations then we can simplify this).


    To do what you are trying to do would require either a transformation of the data or some sort of iterative solution as suggested. Although you would then be able to calculate an r-sq value you would still need to test if it was a significant value. You could start to build a worksheet to do this, but the actual test is much easier.


    If I can (time permitting) I will build a simple demo sheet.


    A.

  • Re: trendline gaussian


    Thanks for your ideas. I had never heard of Solver before and didn't even have it installed. That should keep me quiet for a while. :)


    A9192Shark asked "Why are you doing this ? [ why not a t-test ] "
    The data is oil production history, and there is no reason I can think of why it should look like a probability distribution, but it does, sort of.
    But can it be better modelled by a triangular curve, logistic curve, or whatever ?
    Well, try them all and get some measure of the various fits for comparison.


    There are lots of data sets so it needs to be somewhat automated.


    Dave
    http://www.peakoil.org.au

  • Re: trendline gaussian


    I'm so proud of my new-found Solver ability that I thought I would send you my Gaussian-fit worksheet.


    Solver can handle three variables, at least - mean, std dev, and a scaling factor to match the Standard Normal Distribution to the data's scale, and is robust enough to home in on the solution from almost any starting point. Amazing.


    Thanks again.
    Dave

  • Re: trendline gaussian


    Dave,


    I do not want to detract from your excitement, but are you using the normal distribution in a sensible manner?


    What you have done is:
    Assumed that the oil production is following a definable function f(x) where x is the year. You are trying to find that function, presumably to predict what the production would be in a given future year.


    Firstly this is a time series not a distribution. You are therefore not plotting the normal or Gaussian disribution, but plotting the curve defined by the same equation. Why is this important? A distribution is related to probability of occurrence, you are analysing a time series. To assume a distribution would be to assume that the oil production next year could be 3 000 000. Clearly this is contrary to the time series.


    I appreciate that you are trying to obtain a curve that best describes the data and would (presumably) allow a prediction of the future oil production.


    A few boundaries exist for the curve. As the year gets larger we assume production decreases (this assumes that no significant new reserves are tapped into). Also the production can not be less than zero. Several curves spring to mind as possibly describing the data. However we must understand why we think the data fits and what is the real world reason for the data fitting that curve.


    To undertake this analysis I would search for a comparitively simple curve. Have a look at the different types of trend lines that you can put on the graph. By allocating a 3rd order polynomial I think that you will probably get a much better fit. The question will then be "Why is a 3rd order poly the best fit to use?" It is not sufficient to say because it matches the data, we must understand why it matches the data.


    I suggest therefore that you takeone of the following approaches.
    1. Use the trend tool in Excel to examine the fit for many different types of curves as given in the chart tool.


    2. Transform your data such that it would appear linear. This is fraught with problems for anything but an easy transformation and I would avoid this.


    The key to understanding all of this is "Why does the data behave the way it does?" If you can not explain why production went up and then started to decrease then you have to be very careful of using a curve that may fit the data verywell now but is no good for forecasting.


    I hope that the above helps, I do want to see your analysis work, but it must be valid.


    Cheers,


    Alan.


    PS (Ithink, but need to check, that your equation for normal distribution is missing a division by the standard deviation. I have used the NormDist() function in Excel.)
    A.

  • Re: trendline gaussian


    Quote from A9192Shark

    PS (Ithink, but need to check, that your equation for normal distribution is missing a division by the standard deviation. I have used the NormDist() function in Excel.)


    It is.
    Also, why don't you put the sqrt(2*pi()) into a cell and refer to it, rather than use an approximation. OK, doesn't change things a lot, but it's easy to accomodate. Same applies to the e, it's built in in Excel, so make use of it.


    About the core of the exercise, I'll try to have a look at it this evening.


    Wigi

    Regards,


    Wigi


    Excel MVP 2011-2014


    For more Excel memes: visit http://www.wimgielis.com ==> English articles ==> Excel memes


    -- Topics without [CODE] tags or a meaningful topic title, will be neglected by me (and probably many others as well) --

  • Re: trendline gaussian


    To elaborate on Alan's remarks, I agree this is a forecasting problem that has a host of applicable techniques. I ran your data through several standard extrapolation models and, not surprisingly looking at the data, autocorrelation models did the best job. Using both a one and two year lag regression model to the data from 1970 to 1990 gave a good fit to the data when projected either forwards or backwards, with an average error of about 2%.


    A comprehensive site for information about forecasting is
    http://www-marketing.wharton.upenn.edu/forecast/welcome.html

  • Re: trendline gaussian


    I forgot to attach the workbook I used for the forecasts. It's not exactly self-explanitory. The data go in columns A and B of the third sheet, which has the formulas for calulating in the other columns. The first sheet lets you select the span oftime to use for fitting the various models and calculating standard fit statistics. There are two buttons that just load and run Solver models to optimize the exponential fit parameters. The middle sheet plots the data and allows selection of the model to use to show on the plot.

  • Re: trendline gaussian


    Alan and Wigi



    Thanks for the trouble you have obviously taken to put me on the right track.


    I am aware that what I am doing is not sensible - that is why I wrote "The data is oil production history, and there is no reason I can think of why it should look like a probability distribution, but it does, sort of."


    My aim is rather to cast doubt on 'conventional theory' which says that the curve is a reasonable model. The following page is only under construction, but shows what I am looking at :
    http://home.austarnet.com.au/davekimble/hubbert-curves.htm


    On the missing 'division by SD' - I have had a hard look at the formula and I don't see a problem with it, so it is worrying that Wigi agrees with you.
    The part ((ROW(E2)-1-$D$2)/$D$3) represents the transform to z-score :
    z = (x - mean)/sd
    the rest is : y = (1/root(2*pi))*e^(-z²/2)
    Since the y is going to be scaled up to the data anyway, the constant is actually a waste of time.


    I'll be happy to be corrected if I'm wrong.


    Dave
    http://www.peakoil.org.au

  • Re: trendline gaussian


    I'm afraid I don't understand.
    I have opened your file and looked at the various charts.
    I can see smoothing, but what have you forecasted ?


    If you read my earlier post to Alan and Wigi and look at my web page, you can see (hopefully!) what I am trying to do.


    I should perhaps add that the Peak Oil theory doesn't depend on the Hubbert curves being able to forecast anything more than the peak of production.
    And as you can see from the various national production histories, production does peak, even if you cannot predict when.


    Dave

  • Re: trendline gaussian


    The forecast is for the next year's (and following years if desired) oil production. From your references to the peak year, I see that a forecast of the future is not not what you were after, so ignore it.

  • Re: trendline gaussian


    Quote from Dave.Kimble@Liz

    On the missing 'division by SD' - I have had a hard look at the formula and I don't see a problem with it, so it is worrying that Wigi agrees with you.
    The part ((ROW(E2)-1-$D$2)/$D$3) represents the transform to z-score :
    z = (x - mean)/sd
    the rest is : y = (1/root(2*pi))*e^(-z²/2)
    Since the y is going to be scaled up to the data anyway, the constant is actually a waste of time.


    Check out http://en.wikipedia.org/wiki/Normal_distribution where the formula for the probability density function is shown. The sigma is appearing over there.

    Regards,


    Wigi


    Excel MVP 2011-2014


    For more Excel memes: visit http://www.wimgielis.com ==> English articles ==> Excel memes


    -- Topics without [CODE] tags or a meaningful topic title, will be neglected by me (and probably many others as well) --

  • Re: trendline gaussian


    Wigi


    Yes, you're right.
    I was getting mixed up between Normal and Standard Normal.
    It works the way I have done it because by the time Solver has determined a value for sd, it has become a constant for the whole set. After the Normal Distribution is calculated, Solver scales the distribution up to the data with "constant" $D$4, so the sd "constant" is immaterial.


    So either (1/sd*SQRT(2*pi)) should be all in (for completeness), or all out (for efficiency), and not as (1/2.5066).


    E2=EXP(-0.5*((ROW(E1)-$D$2)/$D$3)^2)


    Thanks.
    Dave

  • Re: trendline gaussian


    Dave,


    Unfortunately my company will not let me view your web site so I can not see the examples you have pointed us to.


    The discussion on whether or not the missing SD is needed should be resolved by simply using teh NormDist function. You should use this for two reasons.
    1. It is a clear function that others can easily see what you have done. It will not have the same limitations of approximations (even if the impact is limited on teh results).
    2. Someone who looks at your worksheet may see the missing SD and question the validity of the analysis. Whilst you may be correct that teh scaling simply gets around the mathematical issues it does not get around the quality and confidence issues.


    A.

  • Re: trendline gaussian


    Alan


    I agree that NormDist is better.
    My only whinge is that the Help on it, and its variations, is so impenetrable that it was easier to go to my old stats textbook - and I got that wrong !
    And that boolean parameter - why not 2 seperate functions with meaningful names ?


    Thinking about tidying up the solution further, is there a simple formula to get the number that identifies a cell within a range which contains the maximum value in the range ?
    For example, if the Production data is at (A3:A99) and A70 contains the maximum value, how do you pick up the number 68 ?
    This would automate setting an initial value for Mean to go into Solver.
    And half that value could be an initial value for StdDev.
    And with MAX(A3:A99) for the initial scaling factor, the Solver process wouldn't need any active set up.
    Then if somehow we could press the Solver button automatically, the problem would resolve itself into defining :
    (B3:B99) = FitGaussian(A3:A99)


    Don't lose any sleep over it.
    Dave
    PS : Why did Hubbert use a probability function to fit a times series? It was a case of trying the easiest thing first.

  • Re: trendline gaussian


    Quote from Dave.Kimble@Liz

    Thinking about tidying up the solution further, is there a simple formula to get the number that identifies a cell within a range which contains the maximum value in the range ?
    For example, if the Production data is at (A3:A99) and A70 contains the maximum value, how do you pick up the number 68 ?


    =MATCH(MAX($A$3:$A$99),$A$3:$A$99,0)

    Regards,


    Wigi


    Excel MVP 2011-2014


    For more Excel memes: visit http://www.wimgielis.com ==> English articles ==> Excel memes


    -- Topics without [CODE] tags or a meaningful topic title, will be neglected by me (and probably many others as well) --

  • Re: trendline gaussian


    Dave,


    Quote from Dave.Kimble@Liz

    Why did Hubbert use a probability function to fit a times series? It was a case of trying the easiest thing first.


    Not previously knowing anything about this science I have just done a web search. I know I am being pedantic, but Hubbert did not use a probability distribution to fit a time series. He most likely used simple arithmetic to calculate the data and then plotted the points on a graph and draw a curve through them that had a 'characteristic' shape. No formulae have been given (that I can find). To fit the curve subsequent analyses have used a Gauss or Normal curve, but it is not a distribution, it is simply a mathematical equation that 'fits' the data.


    The key words are 'probability distribution' vs 'curve fit' they are not the same.


    In fact the drawings of his original curve suggest that a very wide stdev is required to get the wide curve at the top of the curve, but there are the problems that the tails are ill defined. Can it be considered that oil production had a specific start date? I suspect that it is a long way before the data that is being analysed.


    I could keep going, because it intrigues me, but it will not help you solve your problem and it is not a primary issue for this forum (and I am sure there are many others more qualified in this subject to comment)!


    However, on the use of R2 have a look at this article it gives an example of why simply searching for a function taht fits is not necessarily the best approach.


    With respect to your problem I presume that you are doing an analysis for several data sets and will hen combine the curves to provide a revised estimate of the global or regional peak oil prediction. An interesting problem and one that does appeal to my analytical mind!


    With respect to your question on developing the solver, have a look at the attached workbook. You should only edit coloured cells. Note that the formulae used for the curves are all filled down the coloured columns that correspond to the graph and the titles of the curves.


    I have put starting values in that help the solver. It is clear that the solver has limitations, I had to do a lot of the adjustments manually, especially for the Gamma distribution.


    Note that you can not modify a cell other than the cell holding a function, therefore the solver will always need to be run as a macro.


    Good luck,


    A.

  • Re: trendline gaussian


    Wigi, thanks - Alan manages to not need that bit, which is puzzling me.


    Alan, first the technical :
    Nice example sheet.
    I didn't post the full history of US oil production before because it goes back to 1859 and the very early data is of dubious relevance to the peaking problem. But being small numbers they don't change SS much.
    The data is attached below.
    I have pasted it into your sheet, and extended the column formulae and chart > source data > series > values ranges to match.
    When I run it, the Normal curve on the chart comes out too low.
    The mean and SD are being adjusted, but not Scaling.
    R2 looks fully optimised - much better than the chart curve.
    Looking at your macros I realise just how much I don't know about macros.


    On Hubbert's method :
    If you want to move this discussion off-list, just say.
    Hubbert was very secretive about his methods. Deffeyes (now Emeritus Professor, Princeton) was a junior contemporary of his and explains various methods in his book "Hubberts Peak : the impending world oil shortage". I haven't explained here yet that as well as fitting a curve to the data, he was also fitting the area under it to an estimated "total recoverable oil" figure that he got from his knowledge of the specific oilfield reserves data.


    I wasn't trying to do that. I was hoping to demonstrate that although the 'US lower 48 states' data is a good fit to a gaussian, it is not so good for 'US all states' , and a poor fit for most other countries. This could be because the US, prior to their peak, was controlling production to achieve a stable market price. After the peak they were just pumping flat out and importing the shortfall, and the job of controlling world production for stable price was taken up by OPEC, especially Saudi Arabia (some say at US gun-point). The implications of peaking world oil production are very grim.


    H used a logistic curve - Deffeyes introduced the Gaussian because it fits better, especially the early tail. It was Deffeyes' quote that you "try the simplest solution first". I haven't seen any convincing explanation of WHY the simplest solution should work.


    Dave

Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!