
How do we identify outliers and influential points, and how do transformations rescue a non-linear relationship?

Topic 2.9 Analyzing Departures from Linearity: identify outliers, high-leverage, and influential points in regression, and use transformations to model a non-linear relationship.

A focused answer to AP Statistics Topic 2.9, on regression outliers, high-leverage and influential points, and using transformations (logs and powers) to linearise a curved relationship, with a worked transformation example.

Generated by Claude Opus 4.8
10 min answer

Reviewed by: AI editorial process; not yet individually human-reviewed


Jump to a section
  1. What this topic is asking
  2. Outliers, leverage, and influence
  3. How influential points distort regression
  4. Transformations to achieve linearity
  5. Try this

What this topic is asking

The College Board (Topic 2.9) wants you to identify outliers, high-leverage points, and influential points in a regression, to understand how each affects the line, and to use transformations (such as logs or powers) to model a relationship that is not linear.

Outliers, leverage, and influence

These three ideas are related but distinct, and the exam tests whether you can keep them apart. An outlier has a large residual (it is extreme in $y$ relative to the pattern). Leverage is about an extreme $x$ (far out along the horizontal axis). Influence is about effect: does removing the point change the line? A high-leverage point that lies on the trend has little influence (small residual, little change to the slope), whereas a high-leverage point off the trend is typically highly influential, because its distant $x$ lets it swing the line like a long lever arm. The cleanest test for influence is the thought experiment: imagine deleting the point; if the line moves a lot, the point was influential.
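The deletion thought experiment can be tried numerically. The following is a minimal sketch, assuming NumPy is available; the data values and the helper name `fit_line` are made up for illustration and are not part of the course material. It adds one high-leverage point to a small data set, once on the trend and once off it, and compares the fitted slopes.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line for x and y."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Hypothetical bulk data lying roughly along y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

base_slope, _ = fit_line(x, y)

# Case A: extreme x (high leverage) but on the trend, since 2 * 20 + 1 = 41.
slope_on, _ = fit_line(np.append(x, 20.0), np.append(y, 41.0))

# Case B: the same extreme x, but far below the trend.
slope_off, _ = fit_line(np.append(x, 20.0), np.append(y, 10.0))

print(f"slope without the extra point:    {base_slope:.2f}")
print(f"slope with on-trend point added:  {slope_on:.2f}")
print(f"slope with off-trend point added: {slope_off:.2f}")
```

In this made-up data set the on-trend point leaves the slope almost unchanged (high leverage, low influence), while the off-trend point pulls the slope far from its original value (high leverage, high influence).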

How influential points distort regression

Because the least-squares line minimizes the sum of squared vertical distances (residuals), an influential point can drag the slope and intercept toward itself and can inflate or deflate the correlation, giving a misleading summary of the bulk of the data. This is why Topic 2.5's warning that $r$ is not resistant matters here: a single influential point can make a weak relationship look strong, or hide a strong one. The practical advice the exam rewards is to identify such points (from the scatterplot and residual plot), to consider analyzing the data with and without them, and to report how the conclusions change, rather than silently letting one point dominate. You do not simply delete points; you flag them and assess their effect, which is honest data analysis.
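To see how one point can inflate the correlation, here is a small sketch, assuming NumPy and using invented data: five points with almost no linear relationship, plus one distant point. It reports $r$ with and without that point, which is the "with and without" comparison described above.

```python
import numpy as np

# Hypothetical bulk data with a weak linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 4.0, 2.0, 3.0])

# One made-up point that is extreme in both x and y.
x_all = np.append(x, 15.0)
y_all = np.append(y, 12.0)

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r.
r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_all, y_all)[0, 1]

print(f"r without the extreme point: {r_without:.2f}")
print(f"r with the extreme point:    {r_with:.2f}")
```

Reporting both values, rather than only the larger one, makes it clear that the apparent strength of the relationship rests on a single observation.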

Transformations to achieve linearity

When the relationship itself is non-linear, the cure is not to remove a point but to transform a variable. If the scatterplot curves and the residual plot of a linear fit shows a systematic pattern (Topic 2.7), applying a function to $y$ or $x$ can straighten the relationship so a line fits the transformed data. Common transformations are the logarithm and powers or roots (such as $\sqrt{y}$). Taking $\ln(y)$ linearises exponential growth $y = a \cdot b^x$, because $\ln(y) = \ln(a) + x\ln(b)$ is linear in $x$; taking $\ln$ of both variables linearises a power law $y = a x^b$, because $\ln(y) = \ln(a) + b\ln(x)$ is linear in $\ln(x)$. The workflow is: transform, refit the line to the transformed data, check that the new residual plot shows random scatter (confirming the transformation worked), and then back-transform to make predictions in the original units. For a log-$y$ model $\widehat{\ln(y)} = a + bx$ (where $a$ and $b$ now denote the fitted intercept and slope), a prediction comes from computing $\widehat{\ln(y)}$ and then raising $e$ to that power: $\hat{y} = e^{\,a + bx}$. The back-transformation step is where marks are most often lost, because students stop at the transformed prediction; remembering that $\ln(y)$ must be undone with $e$ to recover $y$ is essential. Transformation is the topic's payoff: it extends the entire regression toolkit (line, $r$, $r^2$, residual analysis) to curved relationships, provided you transform first and interpret in the original units at the end.
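The transform, refit, check, back-transform workflow can be sketched in a few lines. This is an illustration only, assuming NumPy; the data are invented to follow roughly $y = 2 \cdot 1.5^x$ with small made-up noise factors, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: roughly y = 2 * 1.5**x, with small invented noise factors.
x = np.arange(8, dtype=float)
noise = np.array([1.02, 0.97, 1.01, 0.99, 1.03, 0.98, 1.00, 1.01])
y = 2.0 * 1.5 ** x * noise

# A straight line fitted to the raw data leaves a curved residual pattern.
raw_slope, raw_intercept = np.polyfit(x, y, 1)
raw_resid = y - (raw_intercept + raw_slope * x)

# Transform y, then refit the line to (x, ln y).
log_y = np.log(y)
b, a = np.polyfit(x, log_y, 1)  # b is the slope, a is the intercept
log_resid = log_y - (a + b * x)

# Back-transform a prediction into the original units of y.
x_new = 3.5
log_pred = a + b * x_new
y_pred = np.exp(log_pred)

print("raw residuals:", np.round(raw_resid, 2))
print("log residuals:", np.round(log_resid, 3))
print(f"fitted model: ln(y)-hat = {a:.3f} + {b:.3f} x")
print(f"predicted ln(y) at x = {x_new}: {log_pred:.3f}")
print(f"predicted y at x = {x_new}: {y_pred:.2f}")
```

The raw residuals are positive at both ends and negative in the middle (the U-shape), while the residuals of the log fit are small with no obvious pattern. The last two printed lines separate the transformed prediction from the back-transformed one, which is the step most often skipped.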

Try this

Q1. A regression point lies far to the right (extreme $x$) but right on the trend line. Classify it (outlier, high leverage, influential?). [2 points]

  • Cue. High leverage (extreme $x$) but not an outlier (small residual) and not necessarily influential (it lies on the pattern, so removing it changes the line little).

Q2. After fitting $\widehat{\ln(y)} = 1 + 0.4x$, you compute $\widehat{\ln(y)} = 3$ at some $x$. What is the predicted $y$? [1 point]

  • Cue. Back-transform: $\hat{y} = e^{3} \approx 20.1$.

Exam-style practice questions

Practice questions written in the style of College Board exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

AP 2019 (style), 1 mark, Section I (multiple choice)
A point in a regression has an $x$-value far from the other $x$-values but lies near the overall pattern. This point is best described as:
(A) An outlier with a large residual
(B) A high-leverage point
(C) An influential point that changes the slope greatly
(D) Irrelevant to the regression

The correct answer is (B).

A point with an extreme $x$-value has high leverage (potential to influence the line), but because it lies near the pattern its residual is small and it does not necessarily change the slope much. So it is high-leverage but not necessarily influential.

(A) describes an outlier (large residual), which this point is not. (C) would require it to actually change the slope a lot. (D) is wrong: high-leverage points should always be checked. An extreme $x$-value means high leverage.

AP 2022 (style), 4 marks, Section II (free response)
A scatterplot of $y$ against $x$ is clearly curved (concave up), and the residual plot of a linear fit shows a U-shape. A statistician takes $\ln(y)$ and finds that $\ln(y)$ against $x$ is now linear, with line $\widehat{\ln(y)} = 1 + 0.3x$.
(a) Explain why the original linear model was inappropriate.
(b) Predict $y$ when $x = 5$ using the transformed model.
(c) Explain one advantage of transforming rather than forcing a straight line on the curved data.

A 4-point question on transformations.

(a) (1 point) The original linear model was inappropriate because the scatterplot was curved and the residual plot showed a U-shaped pattern, both signs that a straight line systematically misses the non-linear relationship.
(b) (2 points) At $x = 5$: $\widehat{\ln(y)} = 1 + 0.3(5) = 1 + 1.5 = 2.5$ (1 point). Back-transform: $\hat{y} = e^{2.5} \approx 12.18$ (1 point).
(c) (1 point) Advantage: after transforming, the relationship is linear, so the least-squares line, $r$, and $r^2$ are valid and predictions are reliable; forcing a line on curved data gives biased predictions and a patterned residual plot.

Markers reward citing the curve and U-shaped residual plot, a correct prediction with back-transformation via $e$, and an advantage tied to the validity of the linear model after transforming.
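The arithmetic in part (b) can be checked with a short script. This is a sketch using only the Python standard library; the function name `predict_from_log_model` is hypothetical and chosen for illustration.

```python
import math

def predict_from_log_model(intercept, slope, x):
    """Return (predicted ln(y), predicted y) for ln(y)-hat = intercept + slope * x."""
    log_pred = intercept + slope * x
    # Undo the natural log by exponentiating.
    return log_pred, math.exp(log_pred)

# Model from the free-response question: ln(y)-hat = 1 + 0.3x, evaluated at x = 5.
log_pred, y_pred = predict_from_log_model(1.0, 0.3, 5.0)
print(f"predicted ln(y) = {log_pred:.2f}, predicted y = {y_pred:.2f}")
# prints: predicted ln(y) = 2.50, predicted y = 12.18
```

Returning both values keeps the two marking steps visible: the transformed prediction first, then the back-transformed answer in the original units. With the model from Q2 in "Try this" (intercept 1, slope 0.4), an input of $x = 5$ happens to give $\widehat{\ln(y)} = 3$, so the same function returns $e^{3} \approx 20.1$.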
