
Wednesday, March 13, 2019

Simple Linear Regression

Simple linear regression is a statistical method used to summarize and visualize the association between two variables that are continuous and quantitative. Essentially it deals with two measures that describe how strong the linear relationship is in the data. Simple linear regression consists of one variable known as the predictor variable and another variable, denoted y, known as the response variable. Any discussion of simple linear regression must touch on the distinction between a deterministic relationship and a statistical relationship, and on the concept of least squares. The estimates b0 and b1 are used to interpret the estimated regression line, and a distinction is drawn between the population regression line and the estimated regression line. Linearity is measured using the correlation coefficient r, which lies between -1 and 1; the strength of the association is determined from the value of r (https://onlinecourses.science.psu.edu/stat501/node/250).

History of simple linear regression

Karl Pearson established a rigorous treatment of applied statistics with what is known as the Pearson product-moment correlation. This grew from the work of Sir Francis Galton, who originated the modern notions of correlation and statistical regression; Galton contributed to biology, psychology and applied statistics. Galton's fascination with genetics and heredity provided the initial inspiration that led to regression and the Pearson product-moment correlation. The thinking that encouraged its development began with the vexing problem of heredity: understanding how closely features of one generation of living things are exhibited in the next generation.

Galton took the approach of using the sweet pea to study similarities in characteristics (Bravais, 1846). The sweet pea was chosen because it is self-fertilizing: daughter plants show genetic differences from the mother without the contribution of a second parent, which avoids the statistical problem of assessing the genetic contribution of both parents. The first insight about regression came from a two-dimensional plot of sizes, the independent variable being the mother peas and the dependent variable the daughter peas. Galton used this representation of the data to show what statisticians today call regression: from his plot he observed that the median weight of daughter seeds from a particular size of mother seed approximately described a straight line with positive slope less than 1. He thus naturally reached a straight regression line, with constant variability across all arrays of one character for a given character of a second. "It was, perhaps, best for the progress of the correlational calculus that this simple special case should be published first; it is so simply grasped by the beginner" (Pearson 1930, p. 5). The idea was later generalized into the more complex method called multiple regression (Galton, 1894).

Importance of linear regression

Statistics usually uses the term linear regression when interpreting data associations in a particular survey, research study or investigation. The linear relationship is used in modeling: modeling one explanatory variable x against a response variable y calls for the simple linear regression approach. Simple linear regression is broadly useful both in methodology and in realistic application; it is not confined to theoretical statistics but is applied in much biological, social science and environmental research.
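The least-squares idea mentioned above can be made concrete with a short sketch. The data here are made up purely for illustration; the estimates use the standard closed forms b1 = Sxy/Sxx and b0 = ȳ - b1·x̄.

```python
import numpy as np

# Hypothetical illustrative data (not from the post): x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
# Least-squares estimates: slope b1 = Sxy / Sxx, intercept b0 = y_bar - b1 * x_bar
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

For this toy data the line works out to ŷ = 0.14 + 1.96x; the same estimates come out of any standard routine such as `np.polyfit(x, y, 1)`.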
The simple linear regression is worth attention because it indicates what is to be expected, mostly for monitoring and control purposes in various disciplines (April 20, 2011).

Description of linear regression

The simple linear regression model is described by Y = β0 + β1x + E. This is the mathematical way of writing the simple linear regression of y on x. The equation gives a clear idea of how x is associated with y; there is also an error term, written E. The term E accounts for the deviation in y that the linear part does not capture, and the fitted regression quantifies the amount of association between the two variables x and y. The parameters β0 and β1 represent the population, and the model for the mean is E(y) = β0 + β1x, with β0 the intercept and β1 the slope; E(y) is the mean value of y at a given value of x. The hypotheses are framed as follows: the null hypothesis H0 is that there is no linear relationship between the two variables, and the alternative H1 is that there is a linear relationship.

Background of simple linear regression

Galton used descriptive statistics in order to generalize his work on different heredity problems. While analysing these data he realised that if the degree of association between variables was held constant, then the slope of the regression line could be described whenever the variability of the two measures was known.
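Galton's insight, that the slope is determined by the correlation together with the two variabilities, is exactly the modern identity b1 = r·(Sy/Sx). A quick numerical check on made-up data (the variable names are mine, not the post's):

```python
import numpy as np

# Made-up paired data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 3.5, 4.5, 4.0])

r = np.corrcoef(x, y)[0, 1]            # correlation coefficient
sx, sy = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations

slope_galton = r * sy / sx             # Galton's form: r * (Sy / Sx)
# Direct least-squares slope for comparison: Sxy / Sxx
slope_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
```

The two slopes agree to machine precision; with ddof=1 in both standard deviations, the (n-1) factors cancel and r·(Sy/Sx) reduces algebraically to Sxy/Sxx.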
Galton assumed he had estimated a single heredity constant that generalized across multiple hereditary characteristics. He wondered why, if such a constant existed, the observed slopes in the parent-child plots varied so much across these characteristics. Noticing variation in variability between the generations, he arrived at the idea that the variation in the regression slopes he obtained was simply due to variation in variability between the various sets of measurements. In modern terms, this principle can be illustrated by holding the correlation coefficient constant while changing the standard deviations of the two variables involved. On his plots he worked out the correlation in each data set. Observing three data sets, he saw that in the first the standard deviation of Y is the same as that of X, in the second the standard deviation of Y is less than that of X, and in the third the standard deviation of Y is greater than that of X. The correlation remains constant for the three sets of data even though the slope of the line changes as an outcome of the differences in variability between the two variables. He used the rudimentary regression equation y = r(Sy/Sx)x to describe the relationship between his paired variables, with an estimated value of r, because he had no way of calculating it. The (Sy/Sx) term was a correction factor that adjusted the slope according to the variability of the measures. He realised that the ratio of the variabilities of the two measures was the key factor in determining the slope of the regression line.

The uses of simple linear regression

Simple linear regression is a typical statistical data-analysis strategy. It is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables. The dependent variable must be measured on a continuous scale (e.g. a 0-100 test score) while the independent variable(s) can be measured on either a categorical (e.g. male versus female) or a continuous scale. There are a few further assumptions that the data must fulfil in order to qualify for simple linear regression. Simple linear regression is like correlation in that its purpose is to measure to what degree there is a linear relationship between two variables. The real difference between the two is that correlation makes no distinction between the two variables. Specifically, the purpose of simple linear regression is to predict the value of the dependent variable from the values of one or more independent variables (https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm).

Reference

Bravais, A. (1846), "Analyse Mathematique sur les Probabilites des Erreurs de Situation d'un Point," Memoires par divers Savans, 9, 255-332.
Duke, J. D. (1978), "Tables to Help Students Grasp Size Differences in Simple Correlations," Teaching of Psychology, 5, 219-221.
FitzPatrick, P. J. (1960), "Leading British Statisticians of the Nineteenth Century," Journal of the American Statistical Association, 55, 38-70.
Galton, F. (1894), Natural Inheritance (5th ed.), New York: Macmillan and Company.
Ghiselli, E. E. (1981), Measurement Theory for the Behavioral Sciences, San Francisco: W. H. Freeman.
Goldstein, M. D., and Strube, M. J. (1995), "Understanding Correlations: Two Computer Exercises," Teaching of Psychology, 22, 205-206.
Karylowski, J. (1985), "Regression Toward the Mean Effect: No Statistical Background Required," Teaching of Psychology, 12, 229-230.
Paul, D. B. (1995), Controlling Human Heredity, 1865 to the Present, Atlantic Highlands, N.J.: Humanities Press.
Pearson, E. S. (1938), Mathematical Statistics and Data Analysis (2nd ed.), Belmont, CA: Duxbury.
Pearson, K. (1896), "Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia," Philosophical Transactions of the Royal Society of London, 187, 253-318.
Pearson, K. (1922), Francis Galton: A Centenary Appreciation, Cambridge University Press.
Pearson, K. (1930), The Life, Letters and Labors of Francis Galton, Cambridge University Press.
Williams, R. H. (1975), "A New Method for Teaching Multiple Regression to Behavioral Science Students," Teaching of Psychology, 2, 76-78.
https://onlinecourses.science.psu.edu/stat501/node/250
https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm

Simple Linear Regression: Stat 326, Introduction to Business Statistics II (Spring 2013), Review of Stat 226

Review: Inference for Regression

Example: Real Estate, Tampa Palms, Florida. Goal: predict the sale price of residential property based on the appraised value of the property. Data: sale price and total appraised value of 92 residential properties in Tampa Palms, Florida. [Scatterplot: Sale Price (in Thousands of Dollars) versus Appraised Value (in Thousands of Dollars), both axes running 0 to 1000.]

We can describe the relationship between x and y using a simple linear regression model of the form y = β0 + β1x. Response variable y: sale price; explanatory variable x: appraised value; the relationship between x and y is linear, strong and positive. We can estimate the simple linear regression model using Least Squares (LS), yielding the following LS regression line: ŷ = 20.94 + 1.069x.

Interpretation of the estimated intercept: b0 corresponds to the predicted value of y, i.e. ŷ, when x = 0. The interpretation of b0 is not always meaningful (namely when x cannot take values close or equal to zero). Here b0 = 20.94: when a property is appraised at zero value, the predicted sale price is $20,940. Meaningful? Interpretation of the estimated slope: b1 corresponds to the change in ŷ for a unit increase in x; when x increases by 1 unit, ŷ changes by the value of b1 (b1 < 0: ŷ decreases as x increases, a negative association; b1 > 0: ŷ increases as x increases, a positive association). Here b1 = 1.069: when the appraised value of a property increases by 1 unit, i.e. by $1,000, the predicted sale price increases by $1,069.

Measuring strength and adequacy of a linear relationship: the correlation coefficient r is a measure of the strength of the linear relationship, with -1 ≤ r ≤ 1; here r = 0.9723. The coefficient of determination r² is the proportion of the variation in y explained by the fitted linear model, with 0 ≤ r² ≤ 1; here r² = (0.9723)² = 0.9453, so about 94.53% of the variation in sale price can be explained through the linear relationship between the appraised value (x) and the sale price (y).

Population regression line (recall from Stat 226): the regression model that we assume to hold true for the entire population is the so-called population regression line μy = β0 + β1x, where μy is the average (mean) value of y in the population for a fixed value of x, β0 is the population intercept and β1 is the population slope. The population regression line could only be obtained if we had information on all individuals in the population. Based on the population regression line we can fully describe the relationship between x and y up to a random error term ε: y = β0 + β1x + ε, where ε ~ N(0, σ). In summary, the important notation for SLR pairs each population parameter with its sample estimate: β0 with b0, β1 with b1, μy with ŷ, and ε with the residual e.

Why fit a LS regression model? A good model allows us to make predictions about the behavior of the response variable y for different values of x. For example, to estimate the average sale price (μy) for a property appraised at $223,000: x = 223, ŷ = 20.94 + 1.069 × 223 = 259.327, so the average sale price for a property appraised at $223,000 is estimated to be about $259,327. What is a good model? The answer to this question is not straightforward. We can visually check the validity of the fitted linear model (through residual plots) as well as make use of numerical summaries such as r²; more on assessing the validity of a regression model follows below.

Validity of predictions: assuming we have a good model, predictions are only valid within the range of x-values used to fit the LS regression model. Predicting outside the range of x is called extrapolation and should be avoided at all costs, as such predictions can become unreliable.
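The fitted Tampa Palms line and the extrapolation warning can be combined into a small helper that refuses to predict outside the observed x-range. This is a sketch, not the course's code; the observed appraisal range (roughly 0 to 1000, in thousands of dollars) is assumed from the scatterplot axes.

```python
X_MIN, X_MAX = 0.0, 1000.0  # assumed observed range of appraised values ($1000s)

def predict_sale_price(appraised: float) -> float:
    """Predict sale price ($1000s) from appraised value ($1000s) via the fitted LS line."""
    if not (X_MIN <= appraised <= X_MAX):
        # Extrapolation: the fitted line is only trusted inside the observed x-range
        raise ValueError(f"appraised value {appraised} is outside the observed range")
    return 20.94 + 1.069 * appraised

print(predict_sale_price(223))  # the notes' example: about 259.327, i.e. ~$259,327
```

Calling the helper with, say, 1500 raises a ValueError rather than silently returning an unreliable extrapolated price.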
What to look for: regression assumptions. In order to do inference (confidence intervals and hypothesis tests), we need the following four assumptions to hold: (1) the data are an SRS (independence of the y-values); (2) there is a linear relationship between x and y; (3) for each value of x, the population of y-values is normally distributed (equivalently, ε ~ N); (4) for each value of x, the standard deviation of the y-values (and of ε) is σ, a constant.

The SRS assumption is the hardest to check. The linearity assumption and the constant-SD assumption are typically checked visually through a residual plot: plot x versus the residuals, where residual = y - ŷ = y - (b0 + b1x); any pattern indicates a violation. The normality assumption is checked by assessing whether the residuals are approximately normally distributed (use a normal quantile plot).

Returning to the Tampa Palms example: [residual plot of residuals versus appraised value; the same plot after a log transformation of the appraised value; and, going one step further, the plot excluding the outlier]. Note: non-constant variance can often be stabilized by transforming x, or y, or both. Outliers/influential points in general should only be excluded from an analysis if they can be explained and their exclusion can be justified, e.g. a typo or invalid measurements. Excluding outliers always means a loss of information, so treat outliers with caution; you may want to compare analyses with and without the outliers. [Normal quantile plots of the residuals for the Tampa Palms example, with and without the outlier.]

Regression inference: confidence intervals and hypothesis tests. We want to assess whether the linear relationship between x and y holds true for the entire population. This can be accomplished by testing H0: β1 = 0 vs. Ha: β1 ≠ 0, based on the estimated slope b1. For simplicity we work with the untransformed Tampa Palms data.

Confidence intervals: we can construct confidence intervals (CIs) for β1 and β0. The general form of a confidence interval is estimate ± t*·SE(estimate), where t* is the critical value corresponding to the chosen level of confidence C, based on the t-distribution with n - 2 degrees of freedom (df). Example: find a 95% CI for β1 for the Tampa Palms data set, and interpret it.

Testing for a linear relationship between x and y: if we wish to test whether there exists a significant linear relationship between x and y, we test H0: β1 = 0 versus Ha: β1 ≠ 0. Why? If we fail to reject the null hypothesis (i.e. stick with β1 = 0), the LS regression model reduces to y = β0 + 0·x = β0, a constant, implying that μy (and hence y) is not linearly dependent on x. Example (Tampa Palms data set): test at the α = 0.05 level of significance for a linear relationship between the appraised value of a property and the sale price.

Inference about prediction: why fit a LS regression model? The purpose of a LS regression model is (1) to estimate μy, the average/mean value of y for a given value of x, say x*; e.g. estimate the average sale price for all residential property in Tampa Palms appraised at x* = $223,000; and (2) to predict y, an individual/single future value of the response variable for a given value of x, say x*; e.g. predict the future sale price of an individual residential property appraised at x* = $223,000. Keep in mind that we consider predictions for only one value of x at a time, and note that these two tasks are VERY different; think carefully about the difference. To estimate μy and to predict a single future y-value at a given level x = x*, we use the LS regression line with the desired value x* substituted for x: ŷ = b0 + b1x*. In addition we need to know how much variability is associated with this point estimator.
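The interval form estimate ± t*·SE and the t-test of H0: β1 = 0 described above can be sketched end-to-end. The Tampa Palms data are not reproduced in the post, so this uses toy data with n = 5 (df = 3); the 95% critical value t* = 3.182 for df = 3 is a t-table value.

```python
import numpy as np

# Toy data, n = 5, so df = n - 2 = 3 (not the Tampa Palms data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard error
se_b1 = s / np.sqrt(Sxx)                   # standard error of the slope

t_star = 3.182                             # t-table critical value, 95%, df = 3
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)  # 95% CI for beta1

# Test H0: beta1 = 0 vs Ha: beta1 != 0; reject when |t| > t_star
t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_star
```

Since the CI for β1 excludes 0 (equivalently, |t| exceeds t*), the toy data show a significant linear relationship at the 5% level; the same recipe applies to the Tampa Palms data once its summary statistics are available.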
Taking the variability into account tells us how good and reliable the point estimator truly is: which range potentially captures the true (but unknown) parameter value? (Recall from Stat 226 the construction of confidence intervals.) Much more variability is associated with estimating a single observation than with estimating an average; individual observations always vary more than averages. We therefore distinguish a confidence interval for the average/mean response μy from a prediction interval for a single future observation y. Both intervals use a t* critical value from a t-distribution with df = n - 2, and while the point estimator for the average μy and for the future individual value y is the same (namely ŷ = b0 + b1x*), the standard error differs between the two intervals, and hence so do their widths.

Confidence interval for the average/mean response μy: the width of the interval is determined by the standard error SE from estimating the mean response, which can be obtained in JMP. Keep in mind that every confidence interval is constructed for one specific given value x*. A level C confidence interval for the mean response when x takes the value x* is ŷ ± t*·SE, where SE is the standard error for estimating a mean response.

Prediction interval for a single (future) value y: again the width is determined by a standard error, here SEy, which can also be obtained in JMP, and again the interval is constructed for one specific given value x*. A level C prediction interval for a single observation when x takes the value x* is ŷ ± t*·SEy, where SEy is the standard error for predicting a single response.

Example: an appliance store runs a 5-month experiment to determine the effect of advertising on sales revenues. There are only 5 observations. [Scatterplot: Bivariate Fit of Sales Revenues (in Dollars) by Advertising Expenditure (in Dollars). JMP can also draw the confidence intervals for the mean responses as well as for the predicted values for future observations (prediction intervals); these are called confidence bands.] Linear fit: Sales Revenues (in Dollars) = -100 + 7 × Advertising Expenditure (in Dollars).

Estimation and prediction (for the appliance store data): we wish to estimate the mean/average revenue of the subpopulation of stores that spent x* = $200 on advertising; suppose we also wish to predict the revenue in a future month when our store spends x* = $200 on advertising. The point estimate in both situations is the same: ŷ = -100 + 7 × 200 = 1300. The corresponding standard errors of the mean and of the prediction, however, are different: SE ≈ 331.663 and SEy ≈ 690.411. To obtain these in JMP: choose Fit Model; from the response icon, choose Save Columns and then choose Predicted Values, Std Error of Predicted, and Std Error of Individual.

Note that in the appliance store example SEy > SE (690.411 versus 331.663). This is always true: we can estimate a mean value of y for a given x* much more precisely than we can predict the value of a single y at x = x*. In estimating a mean y at x = x*, the only uncertainty arises because we do not know the true regression line. In predicting a single y at x = x*, we have two uncertainties: the true regression line plus the expected variability of y-values around the true line. It always holds that SE < SEy; therefore a prediction interval for a single future observation will always be wider than a confidence interval for the mean response, as there is simply more uncertainty in predicting a single value.

JMP also calculates confidence intervals for the mean response μy as well as prediction intervals for single future observations y (for instructions, follow the handout on JMP commands related to regression CIs and PIs). To construct a confidence and/or prediction interval by hand, we need SE and SEy for the value x* of interest. Let us construct one 95% CI and PI by hand and see if we can reproduce the JMP results: in the second month the appliance store spent x = $200 on advertising and observed $1,000 in sales revenue, so x = 200 and y = 1000. Using the estimated LS regression line, we predict ŷ = -100 + 7 × 200 = 1300. We then need to find t* (df = n - 2 = 3) and form the 95% CI for the mean response μy and the 95% PI for a single future observation of y, both at x* = 200. [JMP output table columns: Month, Advertising Expenditure, Sales Revenues, Predicted Sales Revenues, StdErr Pred Sales Revenues, StdErr Indiv Sales Revenues, Lower/Upper 95% Mean Sales Revenues, Lower/Upper 95% Indiv Sales Revenues.]
