What makes the least-squares line the best line, and what do its formulas and r-squared tell us?
Topic 2.8 Least Squares Regression: determine the least-squares regression line from summary statistics, and interpret the coefficient of determination r-squared and the standard deviation of the residuals.
A focused answer to AP Statistics Topic 2.8, on why the least-squares line minimizes squared residuals, computing it from means, standard deviations, and r, and interpreting r-squared and s, with full worked calculations.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this topic is asking
The College Board (Topic 2.8) wants you to find the least-squares regression line from summary statistics (means, standard deviations, and $r$), to know why it is the best-fitting line, and to interpret the coefficient of determination $r^2$ and the standard deviation of the residuals $s$.
Why "least squares"
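A residual is the vertical gap between a data point and the line: observed minus predicted, $y - \hat{y}$. The least-squares line is named for what it minimizes, the sum of the squared residuals, $\sum (y_i - \hat{y}_i)^2$. Among all possible lines, exactly one minimizes that sum, and that is the line technology reports. The choice to square (rather than, say, take absolute values) is what gives the slope and intercept clean formulas in terms of $r$ and the standard deviations, and it ties the line to the correlation you already know.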
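As a rough illustration of "exactly one line minimizes the sum", the Python sketch below fits a line to a small data set and checks that nudging the slope or intercept in any direction makes the sum of squared residuals larger. The data values and the function names are hypothetical, chosen only for this example. The slope is computed directly from the data as $\sum (x-\bar{x})(y-\bar{y}) / \sum (x-\bar{x})^2$, which is algebraically the same as $r \cdot s_y / s_x$.

```python
from statistics import mean

# Hypothetical illustrative data (not from the course materials).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed - predicted)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Try eight nearby lines: shift the intercept and/or the slope a little.
nearby = [
    sum_squared_residuals(a + da, b + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]

print("least-squares line: y-hat =", round(a, 3), "+", round(b, 3), "x")
print("sum of squared residuals:", round(best, 3))
print("every nearby line is worse:", all(s > best for s in nearby))
```

Checking a handful of nearby lines is not a proof, but it makes the minimization concrete: every alternative tried gives a larger total.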
Computing the line from summary statistics
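The least-squares line $\hat{y} = a + bx$ has slope $b = r\,\dfrac{s_y}{s_x}$ and intercept $a = \bar{y} - b\bar{x}$. The slope formula is worth reading: it scales the correlation by the ratio of the spreads, converting the unit-free $r$ into a slope in the units of $y$ per unit of $x$. Because it contains $r$ and both standard deviations are positive, the slope has the same sign as the correlation: positive correlation gives positive slope, negative correlation gives negative slope. And once you have the slope, the intercept formula forces the line through $(\bar{x}, \bar{y})$, a fact that is itself sometimes tested directly.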
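The summary-statistic route can be sketched in a few lines of Python. This assumes the usual formulas, slope $b = r \cdot s_y / s_x$ and intercept $a = \bar{y} - b\bar{x}$; the function name `line_from_summary` and the numbers passed to it are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x          # correlation scaled by the ratio of spreads
    a = y_bar - b * x_bar      # forces the line through (x_bar, y_bar)
    return a, b


# Hypothetical summary statistics, for illustration only.
x_bar, y_bar = 10.0, 50.0
s_x, s_y = 2.0, 8.0
r = 0.75

a, b = line_from_summary(x_bar, y_bar, s_x, s_y, r)
predicted_at_mean = a + b * x_bar

print("slope b =", b)                            # 0.75 * 8 / 2
print("intercept a =", a)                        # 50 - 3 * 10
print("prediction at x_bar =", predicted_at_mean)
```

With these inputs the prediction at $x = \bar{x}$ comes back as $\bar{y}$, which is the "line passes through the point of means" fact in action.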
Interpreting r-squared
The coefficient of determination $r^2$ is the single most important fit measure on the exam. It is the proportion (or percentage) of the variation in the response $y$ that is explained by the linear model with $x$. Whatever the value of $r^2$, converted to a percentage it is the share of the variability in $y$ accounted for by its linear relationship with $x$; the remaining share is due to other factors and random variation. A full-credit interpretation always contains four elements: the percentage, "of the variation in [$y$ in context]," "is explained by," and "the linear relationship with [$x$ in context]." Two errors recur: interpreting $r^2$ as the proportion of points on the line (it is about variation, not points), and confusing $r^2$ with $r$ (the correlation). Because $r^2$ is simply $r$ squared, you can move between them (taking the sign of $r$ from the slope when going back), but they answer different questions: $r$ measures the strength and direction of the linear association, while $r^2$ measures the share of variation explained.
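To see why "proportion of variation explained" and "the square of the correlation" are the same number, the sketch below computes both on a small hypothetical data set (the values are invented for illustration). It compares $1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total squared deviation of $y$ from its mean, with $r$ squared.

```python
from math import sqrt
from statistics import mean

# Hypothetical illustrative data (not from the course materials).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y (SST)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Correlation and least-squares line.
r = sxy / sqrt(sxx * syy)
b = sxy / sxx
a = y_bar - b * x_bar

# Variation left over after fitting the line (SSE).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared_from_variation = 1 - sse / syy
r_squared_from_r = r ** 2

print("r =", round(r, 4))
print("1 - SSE/SST =", round(r_squared_from_variation, 4))
print("r squared   =", round(r_squared_from_r, 4))
```

Both routes agree, which is why the interpretation is phrased in terms of variation in $y$ rather than points on the line.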
The standard deviation of the residuals
The other fit measure is $s$, the standard deviation of the residuals, which estimates the typical size of a prediction error in the units of $y$. Where $r^2$ is a unitless proportion, $s$ is a concrete "on average our predictions are off by about [units]." A smaller $s$ means tighter predictions. Reading the two together gives a rounded picture: $r^2$ says what fraction of the variation the line captures, and $s$ says, in real units, how large the leftover errors typically are. On the exam, $s$ usually appears in computer output near the regression equation, and you interpret it as the typical residual size, for example "predicted exam scores are typically off by about $s$ points." Being fluent at pulling the slope, the intercept, $r$, $r^2$, and $s$ out of standard regression output, and interpreting each in context, is exactly the skill the next layer of exam questions (and the guide on reading computer output) builds on.
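A minimal Python sketch of how $s$ comes out of the residuals, assuming the standard convention $s = \sqrt{\sum \text{residual}^2 / (n - 2)}$ used for simple linear regression. The data set is hypothetical and exists only to make the arithmetic visible.

```python
from math import sqrt
from statistics import mean

# Hypothetical illustrative data (not from the course materials).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line y-hat = a + b*x.
x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Residual = observed - predicted, one per data point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals; n - 2 because two
# quantities (slope and intercept) were estimated from the data.
n = len(xs)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print("residuals:", [round(e, 2) for e in residuals])
print("s =", round(s, 3), "(same units as y)")
```

Here $s$ is a little under 1, so predictions from this line are typically off by a bit less than one unit of $y$.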
Try this
Q1. Given the correlation $r$ for a regression, find $r^2$ and state what it means. [2 points]
- Cue. Square the correlation to get $r^2$; that proportion of the variation in $y$ is explained by the linear relationship with $x$.
Q2. Given $\bar{x}$, $\bar{y}$, and slope $b$, find the intercept. [1 point]
- Cue. $a = \bar{y} - b\bar{x}$.
Exam-style practice questions
Practice questions written in the style of College Board exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
AP 2018 (style), 1 mark, Section I (multiple choice). A regression of $y$ on $x$ has correlation $r$. What proportion of the variation in $y$ is explained by the linear relationship with $x$? (A) (B) (C) (D)
The correct answer is (B).
The coefficient of determination is $r^2$, the square of the correlation, so that proportion of the variation in $y$ is explained by the linear relationship with $x$.
(A) is $r$ itself, the correlation, not the proportion of variation explained. (C) and (D) are unrelated. The proportion of variation explained is always $r^2$, not $r$.
AP 2021 (style), 4 marks, Section II (free response). For a data set, you are given $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$. (a) Find the slope and intercept of the least-squares line. (b) Interpret $r^2$ in context, where $x$ is hours of training and $y$ is a performance score. (c) State what the least-squares line minimizes.
A 4-point computation-and-interpretation question.
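(a) (2 points) Slope $b = r\,\dfrac{s_y}{s_x}$ (1 point). Intercept $a = \bar{y} - b\bar{x}$ (1 point). So $\hat{y} = a + bx$.
(b) (1 point) $r^2$ is the square of $r$, so that proportion, written as a percentage, of the variation in performance score is explained by the linear relationship with hours of training.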
(c) (1 point) The least-squares line minimizes the sum of the squared residuals (the sum of squared vertical distances from the points to the line).
Markers reward the correct slope and intercept from the summary-statistic formulas, an interpretation in context, and the definition of what least squares minimizes.
Related dot points
- Topic 2.6 Linear Regression Models: write, interpret, and use a least-squares regression equation to predict a response, interpreting the slope and intercept in context, and recognizing the danger of extrapolation.
A focused answer to AP Statistics Topic 2.6, on the form of a regression equation, interpreting slope and intercept in context, making predictions, and the danger of extrapolation, with a worked prediction and interpretation.
- Topic 2.7 Residuals: calculate and interpret residuals, construct and read residual plots, and use them to assess whether a linear model is appropriate.
A focused answer to AP Statistics Topic 2.7, defining the residual as observed minus predicted, interpreting positive and negative residuals, and using residual plots to judge whether a linear model is appropriate, with worked calculations.
- Topic 2.5 Correlation: calculate and interpret the correlation coefficient r, understand its properties (range, unit-free, resistance), and recognize what it can and cannot tell you.
A focused answer to AP Statistics Topic 2.5, defining the correlation coefficient r, its range and properties (unit-free, symmetric, non-resistant), what it measures and misses, and the correlation-causation caution, with a worked interpretation.
- Topic 2.9 Analyzing Departures from Linearity: identify outliers, high-leverage, and influential points in regression, and use transformations to model a non-linear relationship.
A focused answer to AP Statistics Topic 2.9, on regression outliers, high-leverage and influential points, and using transformations (logs and powers) to linearise a curved relationship, with a worked transformation example.
- Topic 2.4 Representing the Relationship Between Two Quantitative Variables: construct and describe scatterplots by direction, form, strength, and unusual features, in context.
A focused answer to AP Statistics Topic 2.4, on building scatterplots and describing them by direction, form, strength, and unusual features (the DUFS framework), in context, with a worked description.
Sources & how we know this
- AP Statistics Course and Exam Description — College Board (2020)