How do we ask whether two variables are related, and what does an association really mean?
Topic 2.1 Introducing Statistics - Are Variables Related?: identify questions about the association between two variables, distinguish association from causation, and recognize what two-variable data can answer.
A focused answer to AP Statistics Topic 2.1, on framing questions about the association between two variables, the difference between explanatory and response variables, why association is not causation, and what two-variable data can answer, with worked examples.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this topic is asking
The College Board (Topic 2.1) wants you to frame statistical questions about the relationship between two variables, to identify the explanatory and response variables, and above all to distinguish association from causation, recognizing that observational two-variable data can show a relationship but not prove that one variable causes the other.
Explanatory and response variables
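The explanatory variable is the one used to explain or predict; the response variable is the outcome being explained or predicted. Choosing which is which is a modeling decision driven by the question, not by the data alone. "Does study time explain exam score?" makes study time explanatory and score the response. The labels matter because they fix the axes of a scatterplot (explanatory on the horizontal axis, response on the vertical) and the direction of any prediction you later make.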
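To see why the labels fix the direction of a prediction, here is a minimal Python sketch. The five data points are invented purely for illustration (they come from no real study), and `least_squares_slope` is a hypothetical helper name. It computes the least-squares slope twice, once with each variable in the explanatory role.

```python
# Made-up illustrative data (not from any real study), in arbitrary units.
study_hours = [1, 2, 3, 4, 5]
exam_score = [2, 4, 5, 4, 5]


def least_squares_slope(explanatory, response):
    """Slope of the least-squares line that predicts response from explanatory."""
    n = len(explanatory)
    mean_x = sum(explanatory) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(explanatory, response))
    sxx = sum((x - mean_x) ** 2 for x in explanatory)
    return sxy / sxx


# Same data, opposite choice of roles.
slope_score_from_hours = least_squares_slope(study_hours, exam_score)
slope_hours_from_score = least_squares_slope(exam_score, study_hours)

print("predict score from hours:", slope_score_from_hours)
print("predict hours from score:", slope_hours_from_score)
print("reciprocal of the first slope:", 1 / slope_score_from_hours)
```

With these numbers the first slope is 0.6 and the second is 1.0, which is not the reciprocal of 0.6. Swapping the roles therefore gives a different line, not the same line read backwards, so the roles should be settled by the question before any fitting is done.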
What "associated" means
Two variables are associated when knowing the value of one tells you something about the likely value of the other. The form the analysis takes depends entirely on the variable types, which is why Topic 1.2's classification returns here: two categorical variables call for two-way tables (Topics 2.2 to 2.3), while two quantitative variables call for scatterplots, correlation, and regression (Topics 2.4 to 2.9).
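The rule that variable types decide the tools can be written as a small lookup. This is only a sketch: `suggested_tools` is a hypothetical function name, and it encodes just the two pairings named in the paragraph above.

```python
def suggested_tools(type_a, type_b):
    """Map a pair of variable types to the tools named in this section."""
    types = {type_a, type_b}
    if types == {"categorical"}:
        return "two-way table (Topics 2.2 to 2.3)"
    if types == {"quantitative"}:
        return "scatterplot, correlation, regression (Topics 2.4 to 2.9)"
    # Mixed pairs are not discussed in this section, so nothing is suggested.
    return "not covered in this section"


print(suggested_tools("categorical", "categorical"))
print(suggested_tools("quantitative", "quantitative"))
```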
Association is not causation
The defining lesson of this topic, and one of the most important in the whole course, is that an association does not by itself mean one variable causes the other. There are several reasons an association can appear without a causal link. A lurking (confounding) variable may influence both: ice-cream sales and drowning deaths rise together, but hot weather drives both, with no causal link between ice cream and drowning. Reverse causation is possible: the response may actually affect the explanatory variable. And the association could even be coincidence in a small sample. Because observational data cannot rule these out, the only design that supports a causal claim is a randomized experiment, in which random assignment to treatment groups balances out lurking variables. So when a question shows an observational study, the exam expects you to describe the association ("students who slept more tended to score higher") and then explicitly decline to claim cause, naming a plausible lurking variable or noting the lack of random assignment. Writing "this proves that X causes Y" from observational data is the single error most reliably punished in Unit 2 and beyond.
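The ice-cream example can be imitated with simulated data. In the sketch below every number is an assumption chosen for illustration, not a real measurement. Temperature is generated first, then ice-cream sales and a drowning index are each built from temperature plus independent noise, so neither affects the other. `pearson_r` is a hand-written helper, not a library call.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

days = []
for _ in range(5000):
    temperature = rng.uniform(10, 35)                # lurking variable
    ice_cream = 20 * temperature + rng.gauss(0, 50)  # built from temperature only
    drownings = 0.5 * temperature + rng.gauss(0, 2)  # built from temperature only
    days.append((temperature, ice_cream, drownings))

# Association across all days, ignoring temperature.
r_all = pearson_r([d[1] for d in days], [d[2] for d in days])

# Association among days with nearly the same temperature.
similar_days = [d for d in days if 24 <= d[0] <= 26]
r_similar = pearson_r([d[1] for d in similar_days], [d[2] for d in similar_days])

print("all days:", round(r_all, 2))
print("similar-temperature days:", round(r_similar, 2))
```

Across all days the two simulated variables are strongly correlated, yet among days of similar temperature the correlation is close to zero. That is the pattern expected when a third variable drives both. In a real observational study the lurking variable is often unmeasured, so this check is usually not available.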
What two-variable data can and cannot answer
Two-variable data extend what you can ask beyond Unit 1's single distributions: you can now ask whether and how two variables move together, predict one from the other (with regression), and quantify the strength of a linear relationship (with correlation). What they still cannot do, in an observational setting, is establish cause, and they cannot generalize beyond the individuals studied unless those individuals were a random sample of a defined population. Keeping these limits in mind frames the rest of the unit honestly: correlation and regression are tools for describing and predicting an association, not for proving that manipulating one variable would change the other. The most sophisticated exam answers therefore pair a confident description of the relationship with an equally confident statement of its limits, which is exactly the balance the College Board is looking to assess from the very first topic of the unit.
Try this
Q1. In "does fertilizer amount explain crop yield?", identify the explanatory and response variables. [1 point]
- Cue. Fertilizer amount is explanatory; crop yield is the response.
Q2. A study finds taller children have larger vocabularies. Explain why this does not mean height causes vocabulary. [2 points]
- Cue. Age is a lurking variable: older children are both taller and have larger vocabularies, so age drives both and there is no direct causal link.
Exam-style practice questions
Practice questions written in the style of College Board exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
AP 2018 (style), 1 mark, Section I (multiple choice). A study finds that towns with more fire trucks at a fire tend to have more fire damage. Which conclusion is best supported?
(A) Fire trucks cause damage
(B) There is an association between number of fire trucks and damage, likely explained by a third variable, fire size
(C) Reducing fire trucks would reduce damage
(D) The two variables are unrelated
The correct answer is (B).
The data show an association, but a lurking variable (the size of the fire) plausibly drives both: bigger fires bring more trucks and cause more damage. Observational data cannot establish causation.
(A) and (C) wrongly read causation into an association. (D) ignores the clear association. This is the classic confounding example, and the correct stance is to name the association and the likely lurking variable without claiming cause.
AP 2021 (style), 3 marks, Section II (free response). An observational study records, for many students, hours of sleep and exam score, and finds students who sleep more tend to score higher.
(a) Identify the explanatory and response variables.
(b) Explain why this study cannot conclude that more sleep causes higher scores.
(c) Suggest one lurking variable that could explain the association.
A 3-point question on association versus causation.
(a) (1 point) The explanatory variable is hours of sleep; the response variable is exam score (sleep is used to explain or predict score).
(b) (1 point) The study is observational, not an experiment: students were not randomly assigned to amounts of sleep, so a lurking variable could be responsible and we cannot rule out reverse causation; therefore causation cannot be concluded.
(c) (1 point) A plausible lurking variable: overall conscientiousness or good time management, which could increase both sleep and study quality; or a less stressful course load. Any reasonable third variable affecting both earns the point.
Markers reward correct identification of explanatory and response roles, a reason grounded in the observational design, and a sensible lurking variable.
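Parts (b) and (c) turn on the lack of random assignment. The simulation below sketches why that matters, under loudly stated assumptions: the conscientiousness scores are invented, and the 0.8 and 0.2 probabilities of choosing more sleep are arbitrary illustrative values. It compares how far apart the two sleep groups are on the lurking variable when students choose for themselves and when a coin flip decides.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical lurking variable: one conscientiousness score per student.
conscientiousness = [rng.gauss(50, 10) for _ in range(4000)]


def mean(values):
    return sum(values) / len(values)


def group_gap(scores, in_more_sleep_group):
    """Mean score of the more-sleep group minus mean score of the less-sleep group."""
    more = [s for s, flag in zip(scores, in_more_sleep_group) if flag]
    less = [s for s, flag in zip(scores, in_more_sleep_group) if not flag]
    return mean(more) - mean(less)


# Observational study: students choose, and (by assumption) conscientious
# students are more likely to choose more sleep.
self_selected = [rng.random() < (0.8 if c > 50 else 0.2) for c in conscientiousness]

# Experiment: a coin flip decides, ignoring conscientiousness entirely.
randomly_assigned = [rng.random() < 0.5 for _ in conscientiousness]

gap_observational = group_gap(conscientiousness, self_selected)
gap_experiment = group_gap(conscientiousness, randomly_assigned)

print("gap when students choose:", round(gap_observational, 1))
print("gap under random assignment:", round(gap_experiment, 1))
```

When students choose, the more-sleep group is clearly more conscientious, so any score difference could come from conscientiousness rather than sleep. Under random assignment the gap is close to zero, which is what "balances out lurking variables" means in practice.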
Related dot points
- Topic 2.2 Representing Two Categorical Variables: construct and interpret two-way (contingency) tables and segmented or side-by-side bar graphs for two categorical variables.
A focused answer to AP Statistics Topic 2.2, on building and reading two-way tables and segmented or side-by-side bar graphs for two categorical variables, with marginal totals and a worked table.
- Topic 2.3 Statistics for Two Categorical Variables: calculate joint, marginal, and conditional relative frequencies from a two-way table, and use conditional distributions to judge association.
A focused answer to AP Statistics Topic 2.3, on joint, marginal, and conditional relative frequencies from two-way tables, and using conditional distributions to assess association, with full worked proportion calculations.
- Topic 2.4 Representing the Relationship Between Two Quantitative Variables: construct and describe scatterplots by direction, form, strength, and unusual features, in context.
A focused answer to AP Statistics Topic 2.4, on building scatterplots and describing them by direction, form, strength, and unusual features (the DUFS framework), in context, with a worked description.
- Topic 2.5 Correlation: calculate and interpret the correlation coefficient r, understand its properties (range, unit-free, not resistant to outliers), and recognize what it can and cannot tell you.
A focused answer to AP Statistics Topic 2.5, defining the correlation coefficient r, its range and properties (unit-free, symmetric, non-resistant), what it measures and misses, and the correlation-causation caution, with a worked interpretation.
- Topic 1.1 Introducing Statistics - What Can We Learn from Data?: identify questions to be answered, based on variation in one-variable data, and recognize what a data set can and cannot tell us.
A focused answer to AP Statistics Topic 1.1, on how variation in data raises statistical questions, what kinds of question data can answer, and the limits of what a single data set reveals, with worked examples.
Sources & how we know this
- AP Statistics Course and Exam Description — College Board (2020)