Averages, Spread & Correlation | GCSE Statistics Notes

Updated: 9 May 2026

Averages and spread allow us to summarise vast amounts of data into a few key numbers. This chapter covers standard calculations, advanced measures like standard deviation, and how to identify trends.

Topic 4.1 — Mean, Median & Mode for Discrete & Grouped Data

The three averages—mean, median, and mode—are the cornerstone measures of central tendency in statistics. The mode is the value that appears most frequently in a dataset. A dataset can have one mode, more than one mode (bimodal or multimodal), or no mode at all if all values appear with the same frequency. The median is the middle value when all the data is arranged in ascending order. If there is an even number of values, the median is the average of the two middle values. The mean is calculated by summing all the values and dividing by the number of values. For data presented in a frequency table, the formula is x̄ = Σfx / Σf, where f is the frequency and x is the value. For grouped data, you must first find the midpoint of each class interval, then calculate Σfx using these midpoints. This result is an estimated mean because you are assuming the values are evenly spread throughout each interval.

For example, in a grouped frequency table with class 10–20 and frequency 5, the midpoint is 15, and fx = 5 × 15 = 75. Summing all fx values and dividing by the total frequency gives the estimated mean. The median for grouped data is typically found using a cumulative frequency graph or by identifying the class interval in which the median lies. The mode for grouped data is called the modal class, which is the class interval with the highest frequency when the classes have equal widths (with unequal widths, compare frequency densities instead). It is crucial to remember that when working with grouped data, the mean and median are estimates.
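As a rough illustration of the midpoint method, the Python sketch below estimates the mean and picks out the modal class for a small grouped table. Only the 10–20 class with frequency 5 comes from the notes above; the other classes and the function names are made up for the example, and equal class widths are assumed.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound, frequency).
# Only the 10-20 class with frequency 5 is taken from the notes.
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 8), (30, 40, 4)]


def estimated_mean(table):
    """Estimate the mean of grouped data as sum(f * midpoint) / sum(f)."""
    total_fx = 0.0
    total_f = 0
    for lower, upper, freq in table:
        midpoint = (lower + upper) / 2  # stands in for every value in the class
        total_fx += freq * midpoint
        total_f += freq
    return total_fx / total_f


def modal_class(table):
    """Return the bounds of the class with the highest frequency (equal widths assumed)."""
    lower, upper, _ = max(table, key=lambda row: row[2])
    return (lower, upper)


print(estimated_mean(classes))  # 430 / 20 = 21.5
print(modal_class(classes))     # (20, 30)
```

The 10–20 row contributes fx = 75, exactly as in the worked step above.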

Topic 4.2 — Weighted Mean, Geometric Mean & Mean Seasonal Variation

At the Higher tier, students encounter more sophisticated measures of average that account for varying importance or multiplicative effects. The weighted mean is used when different values in a dataset have different levels of importance or weight. The formula is: Weighted Mean = Σ(value × weight) / Σ(weights). For example, if a module's coursework is worth 40% of the final grade and its exam is worth 60%, the overall grade is the weighted mean of the two marks, not their simple average. This ensures that the component with greater influence on the outcome has a proportionally greater impact on the calculated average.
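The 40%/60% example can be sketched in Python as follows. The marks of 70 and 55 are invented for illustration, and weighted_mean is a hypothetical helper name, not something from a syllabus.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weights)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Assumed marks: 70 for the coursework (weight 40), 55 for the exam (weight 60).
marks = [70, 55]
weights = [40, 60]

overall = weighted_mean(marks, weights)  # (70*40 + 55*60) / 100 = 61.0
simple = sum(marks) / len(marks)         # 62.5, which ignores the weights
print(overall, simple)
```

The weighted result sits closer to the exam mark because the exam carries the larger weight.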

The geometric mean is used when dealing with rates of change, such as investment growth or population growth, where values are multiplied together rather than added. The formula is GM = ⁿ√(x₁ × x₂ × ... × xₙ), where n is the number of values. For positive values, the geometric mean is always less than or equal to the arithmetic mean, and it is the appropriate average when the quantities being compared are meant to be multiplied, such as yearly growth factors. Finally, mean seasonal variation involves calculating the average seasonal effect from a time series: for each season, find the difference between the actual value and the trend value in each year, then average those differences over several years.
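A minimal Python sketch of the geometric mean formula is shown below. The growth factors are hypothetical (10% growth, 20% growth, then a 5% fall), and the function assumes all values are positive.

```python
from math import prod


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one value, all positive")
    return prod(values) ** (1 / len(values))


# Hypothetical yearly growth factors for an investment.
growth_factors = [1.10, 1.20, 0.95]

gm = geometric_mean(growth_factors)               # average growth factor per year
am = sum(growth_factors) / len(growth_factors)    # arithmetic mean, for comparison
print(gm <= am)                                   # True

print(geometric_mean([2, 8]))                     # square root of 16
```

Applying gm as the growth factor in each of the three years gives the same final amount as the three actual factors, which is why it is the right average for multiplied quantities.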

Topic 4.3 — Choosing the Right Average

Selecting the most appropriate average is a critical decision that depends on the nature of the data and the presence of any unusual features. The mean is the most commonly used average because it uses every piece of data in its calculation. It is best used for symmetrical distributions with no extreme outliers. However, its major weakness is that it is distorted by extreme values; a single multimillionaire in a dataset of salaries would pull the mean upwards, making it unrepresentative of the typical person.
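The salary effect is easy to see with a few made-up numbers. The sketch below uses Python's standard statistics module; the salary figures are purely illustrative.

```python
from statistics import mean, median

# Hypothetical salaries: five typical earners and one multimillionaire.
salaries = [22_000, 25_000, 27_000, 30_000, 31_000, 5_000_000]

print(mean(salaries))    # over 850,000: far above what anyone typical earns
print(median(salaries))  # 28500.0: still close to the typical salary

# Removing the extreme value barely moves the median but transforms the mean.
without_outlier = salaries[:-1]
print(mean(without_outlier), median(without_outlier))  # both 27000
```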

The median is the most appropriate average when data is skewed or contains outliers, because it is resistant to extreme values and represents the true "middle" of the dataset. It is also the preferred average for ordinal data where the values can be ranked but the intervals between them are not meaningful. The mode is the only average that can be used for categorical data. It is also useful when you want to know the most popular or common item, such as the most frequently sold shoe size. The geometric mean should be chosen for data involving rates and ratios. In an exam, if a question asks which average best represents the data, you must first inspect the data for outliers and skewness. If they exist, the median is almost always the correct answer.

Topic 4.4 — Range, Quartiles & Interquartile Range (IQR)

While averages tell us where the centre of a dataset is, measures of spread tell us how dispersed the data is around that centre. The simplest measure of spread is the range, calculated as the highest value minus the lowest value. However, the range is heavily influenced by outliers and tells us nothing about the distribution of the middle values. To get a better picture of spread, we use quartiles. The lower quartile (Q1) is the value below which 25% of the data falls, and the upper quartile (Q3) is the value below which 75% of the data falls. To find Q1 and Q3, you first find the median, then find the median of the lower and upper halves of the data, respectively.

The interquartile range (IQR) is the difference between the upper and lower quartiles (Q3 − Q1). It represents the spread of the middle 50% of the data, so extreme values at either end have little effect on it. This makes the IQR a much more robust measure of spread than the range. On a cumulative frequency graph, the quartiles are found by locating one-quarter and three-quarters of the total frequency on the vertical axis, moving across to the curve, and reading down to the horizontal axis; the IQR is the difference between the two readings. Describing the IQR as the range of the middle half of the data is correct, but a common student error is to give it as an interval (for example "12 to 20") rather than as a single value (the difference, 8).
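The "median of each half" method for a small list can be sketched in Python as below. The data values are invented, and the sketch assumes one common convention: when the number of values is odd, the median itself is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
def median_of_sorted(sorted_values):
    """Median of a list that is already in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Assumption: for an odd count, skip the middle value when forming the halves.
    upper_half = data[half + 1:] if n % 2 == 1 else data[half:]
    return (median_of_sorted(lower_half),
            median_of_sorted(data),
            median_of_sorted(upper_half))


data = [3, 5, 7, 8, 9, 11, 13, 15, 18]  # hypothetical dataset
q1, q2, q3 = quartiles(data)
iqr = q3 - q1                      # a single number, not an interval
data_range = max(data) - min(data)
print(q1, q2, q3, iqr, data_range)  # 6.0 9 14.0 8.0 15
```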

Topic 4.5 — Percentiles, Interpercentile Range & Interdecile Range

Higher-tier students must extend their understanding of spread beyond quartiles to percentiles. A percentile is the value below which a given percentage of the observations falls. For example, the 90th percentile is the value below which 90% of the data lies. The median is the 50th percentile, Q1 is the 25th percentile, and Q3 is the 75th percentile. Percentiles are read from a cumulative frequency graph by finding the relevant percentage of the total frequency on the vertical axis, moving across to the curve, and reading down to the horizontal axis.

The interpercentile range is the difference between any two percentiles. The most common choice is the 10th and 90th percentiles (P90 − P10), which is called the interdecile range because these percentiles are the first and ninth deciles. Like the IQR, these measures reduce the influence of the most extreme values by focusing on the central bulk of the data. They are particularly valuable in fields like finance and standardised testing, where understanding the distribution of the middle 80% of a population is more important than the full range.
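Reading a percentile off a cumulative frequency graph can be imitated in Python with straight-line interpolation between the plotted points. This is only a sketch: it assumes the points are joined by straight lines (a cumulative frequency polygon rather than a smooth curve), and the plotted points below are hypothetical.

```python
def percentile_from_cf(points, p):
    """Estimate the p-th percentile from (upper boundary, cumulative frequency) points.

    points must start at (lowest boundary, 0) and be in ascending order.
    """
    total = points[-1][1]
    target = p * total / 100  # height to find on the vertical axis
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= target <= cf1:
            if cf1 == cf0:
                return x0
            # Move across to the line, then read down to the horizontal axis.
            return x0 + (target - cf0) / (cf1 - cf0) * (x1 - x0)
    raise ValueError("p must be between 0 and 100")


# Hypothetical cumulative frequency points for 40 observations.
cf_points = [(0, 0), (10, 4), (20, 14), (30, 34), (40, 38), (50, 40)]

p10 = percentile_from_cf(cf_points, 10)  # 10.0
p90 = percentile_from_cf(cf_points, 90)  # 35.0
interdecile_range = p90 - p10            # 25.0
iqr = percentile_from_cf(cf_points, 75) - percentile_from_cf(cf_points, 25)  # 28.0 - 16.0
print(p10, p90, interdecile_range, iqr)
```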

Topic 4.6 — Standard Deviation

Standard deviation is the most important measure of spread for symmetrical distributions and is a Higher-tier topic. It measures how far, on average, each data point is from the mean. A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data is spread out over a wider range. There are two common formulas. The definitional formula is σ = √[ (1/N) × Σ(x − x̄)² ], which involves finding the deviation of each point from the mean, squaring it, summing these squared deviations, dividing by the number of items, and finally taking the square root.

The computational formula, which is often easier for manual calculation, is σ = √[ (Σx²/N) − (Σx/N)² ], that is, the square root of "the mean of the squares minus the square of the mean". In an exam, you might be given a small dataset and asked to calculate the standard deviation step-by-step. It is crucial to show all your working, especially the squared deviations, and to remember the final square root. A very common error is to stop at the variance (the value inside the square root). Interpreting standard deviation in context is also key: "The standard deviation of 1.79 hours indicates that most students spent close to the mean revision time, with little variation." Unlike the IQR, standard deviation uses all data points and is therefore appropriate for data without outliers where a full measure of spread is required.
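The two formulas should always agree, which can be checked with a short Python sketch. The dataset is hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
from math import sqrt


def sd_definitional(values):
    """sqrt of the mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / n
    return sqrt(variance)  # do not stop at the variance


def sd_computational(values):
    """sqrt of (mean of the squares minus the square of the mean)."""
    n = len(values)
    mean_of_squares = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return sqrt(mean_of_squares - square_of_mean)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32

print(sd_definitional(data))   # sqrt(32 / 8) = 2.0
print(sd_computational(data))  # sqrt(232 / 8 - 25) = 2.0
```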

Topic 4.7 — Outliers: Identifying by Inspection & Calculation

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Identifying outliers is crucial because a single extreme value can distort averages and measures of spread, leading to misleading conclusions. Outliers can be identified by inspection—simply looking at a dataset or a graph like a box plot and spotting a value that is separated from the main cluster. However, a more rigorous, objective method is required for calculations.

For Foundation and general use, the "1.5 × IQR rule" is applied. A value is considered an outlier if it is less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR. For Higher-tier students, another method is the "3-sigma rule," where any value lying more than three standard deviations from the mean (outside the range x̄ ± 3σ) is flagged as an outlier. Once an outlier is identified, it is not automatically discarded. A statistician must consider whether it is a genuine but rare value or a result of measurement error. In an exam, you must comment on the outlier in the context of the problem. For example: "The value of 120 minutes is an outlier... it skews the mean, so the median would be a better average to use."
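Both rules can be sketched in a few lines of Python. The revision times below are invented (only the 120-minute value echoes the example comment above), and the quartiles are assumed to have been found already by the median-of-halves method.

```python
from statistics import mean, pstdev


def outliers_by_iqr(values, q1, q3):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def outliers_by_three_sigma(values):
    """Values more than three standard deviations from the mean."""
    centre = mean(values)
    sigma = pstdev(values)  # divides by N, matching the formula in Topic 4.6
    return [v for v in values if abs(v - centre) > 3 * sigma]


# Hypothetical revision times in minutes.
times = [30, 35, 40, 42, 45, 48, 50, 120]

# Assumed quartiles: Q1 = 37.5 and Q3 = 49, so the fences are 20.25 and 66.25.
print(outliers_by_iqr(times, 37.5, 49))  # [120]
print(outliers_by_three_sigma(times))    # []
```

In this small sample the two rules disagree: the value 120 inflates the standard deviation so much that it stays inside the three-standard-deviation limits, which is one reason to state which rule you are using.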

Topic 4.8 — Moving Averages & Identifying Trends

A moving average is a technique used to smooth out short-term fluctuations in time series data and highlight longer-term trends or cycles. The process involves calculating the average of a fixed number of consecutive time periods and then moving the calculation along the dataset. The most common at GCSE is the 4-point moving average for quarterly data, calculated as (Q1 + Q2 + Q3 + Q4) / 4, where Q1 to Q4 here are the values for four consecutive quarters (not quartiles). The resulting value is plotted at the centre of the time period it covers. For example, the average of Q1, Q2, Q3, and Q4 of 2022 would be plotted at the midpoint between Q2 and Q3.

By calculating and plotting a series of moving averages, a trend line can be drawn through the points. This trend line shows the underlying direction of the data, stripped of seasonal variation. Using this trend line, predictions can be made about future values, though care must be taken not to extrapolate too far into the future. A common mistake is plotting the moving average at the end of the four quarters rather than in the middle. Note also that the moving averages do not reach the ends of the data: a 4-point moving average of n values gives only n − 3 averages, so the plotted trend starts after the first data point and stops before the last.
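The calculation and the plotting positions can be sketched in Python as follows. The quarterly sales figures are hypothetical, and quarters are numbered 1, 2, 3, ... along the time axis for the purpose of the example.

```python
def moving_averages(values, window=4):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def plot_positions(n_values, window=4):
    """Centre of each window, with the time periods numbered from 1."""
    return [i + (window + 1) / 2 for i in range(n_values - window + 1)]


# Hypothetical quarterly sales over two years (8 values).
sales = [20, 30, 50, 24, 24, 34, 54, 28]

averages = moving_averages(sales)
positions = plot_positions(len(sales))

print(averages)   # [31.0, 32.0, 33.0, 34.0, 35.0]
print(positions)  # [2.5, 3.5, 4.5, 5.5, 6.5]
```

Eight values give five averages, and the first is plotted at 2.5, halfway between the second and third quarters. The steady rise from 31 to 35 is the trend once the seasonal pattern has been smoothed out.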

Topic 4.9 — Line of Best Fit: Double Mean Point & Regression Line

At the Higher tier, the line of best fit on a scatter graph is treated with greater mathematical rigour. The line of best fit must always pass through the point (x̄, ȳ), known as the double mean point, where x̄ is the mean of the explanatory variable and ȳ is the mean of the response variable. This point acts as an anchor for the line. The regression line is the formal name for the line of best fit that minimises the sum of the squared vertical distances between itself and the data points, and it is described by the equation y = a + bx, where a is the y-intercept and b is the gradient.

In the exam, you will typically be given the double mean point and the gradient, and you may need to use these to find the equation of the line. The regression line is used to make predictions. Interpolation, predicting a value within the range of the existing data, is considered reliable. Extrapolation, predicting a value outside the range, is less reliable because the linear relationship may not hold beyond the observed data. The regression line is a tool of estimation, and its predictions are subject to the same caveats as any statistical model: it is only as good as the data from which it was derived.
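Given the gradient b and the double mean point, the intercept follows from a = ȳ − b x̄, because the line must pass through (x̄, ȳ). The Python sketch below shows this with hypothetical data. It also computes the gradient with the standard least-squares formula, which is not given in these notes and is included only to make the example self-contained.

```python
def double_mean_point(xs, ys):
    """Return (x-bar, y-bar)."""
    return sum(xs) / len(xs), sum(ys) / len(ys)


def least_squares_gradient(xs, ys):
    """Standard least-squares gradient (assumed here, not from the notes)."""
    x_bar, y_bar = double_mean_point(xs, ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def intercept_from_mean_point(x_bar, y_bar, b):
    """Solve y-bar = a + b * x-bar for a."""
    return y_bar - b * x_bar


# Hypothetical scatter graph data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = double_mean_point(xs, ys)          # (3.0, 4.0)
b = least_squares_gradient(xs, ys)                # 6 / 10 = 0.6
a = intercept_from_mean_point(x_bar, y_bar, b)    # 4 - 0.6 * 3, about 2.2

prediction = a + b * 3.5  # interpolation: 3.5 lies within the observed x values
print(a, b, prediction)
```

A prediction at x = 3.5 is interpolation; using the same equation at, say, x = 20 would be extrapolation and should be treated with much more caution.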

Frequently Asked Questions

Why is the mean from a grouped frequency table called an 'estimated' mean?

Because we have grouped the raw data into intervals, we no longer know the exact values. We use the midpoint of each interval as an estimate, assuming the data points are evenly distributed within that range.

Which average is best for data with extreme outliers?

The median is best. Because it is the middle value, it is 'resistant' to extreme outliers. The mean, however, is calculated using every value and would be pulled towards the outlier, making it unrepresentative.

What is the difference between the Range and the Interquartile Range (IQR)?

The Range is the difference between the absolute highest and lowest values, making it very sensitive to outliers. The IQR is the difference between the upper and lower quartiles (Q3 − Q1), measuring only the spread of the middle 50% of the data.

What does a high standard deviation tell you about a dataset?

A high standard deviation indicates that the data points are spread out over a wide range from the mean. A low standard deviation means the data points tend to be very close to the mean (more consistent).

Chapter 4: Statistical Measures — Averages, Spread & Correlation