How you collect your data determines the validity of your entire investigation. This chapter covers the different types of data, the importance of reliability and validity, and the various sampling methods used by statisticians.
Topic 2.1 — Types of Data: Categorical, Ordinal, Discrete, Continuous
Understanding the different types of data is the essential first step in any statistical investigation, because the nature of the data shapes many later decisions, from the averages that can be calculated to the type of chart used for presentation. Data can be broadly classified by whether it is numerical or non-numerical. Categorical data, also known as qualitative data, consists of descriptions or labels rather than numbers. Examples include eye colour, nationality, or favourite genre of music. Because these categories have no inherent order, it would be nonsensical to calculate a mean or median for them; the mode is the only average that applies. Ordinal data also consists of categories, but the categories have a clear, meaningful order or ranking. Examples include exam grades (A*, A, B, C), satisfaction ratings (very satisfied to very dissatisfied), or clothing sizes (small, medium, large). The critical distinction is that although the categories are ordered, the intervals between them may not be equal or meaningful.
Turning to numerical data, we encounter discrete and continuous types. Discrete data can take only certain separate values, most often whole-number counts. You can have 0, 1, 2 or 3 siblings, but you cannot have 2.5 siblings. Other examples include the number of cars in a car park or the number of students in a class. Discrete data is typically represented by isolated points on a graph and is often summarised using bar charts. Continuous data, on the other hand, is measured on a scale and can take any value within a given range. Height, weight, temperature, and time are all continuous variables. Because continuous data can always be measured more precisely (e.g., a person's height could be recorded as 1.72 m, 1.723 m, or 1.7231 m depending on the instrument), it is often grouped into class intervals for analysis.
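As a rough illustration of how the data type limits which averages make sense, the sketch below uses Python's standard `statistics` module. The three small datasets are invented for this example and are not taken from any real survey.

```python
from statistics import mean, median, mode

# Hypothetical responses, invented purely for illustration.
eye_colour = ["brown", "blue", "brown", "green", "brown"]   # categorical
siblings = [0, 1, 2, 1, 3, 1]                                # discrete
height_m = [1.72, 1.65, 1.80, 1.58, 1.75]                    # continuous

# Categorical data: the mode is the only average that makes sense.
print(mode(eye_colour))

# Discrete data: the values are whole-number counts,
# although the mean of those counts need not be a whole number.
print(mode(siblings), median(siblings), round(mean(siblings), 2))

# Continuous data: the mean and median are both meaningful.
print(median(height_m), round(mean(height_m), 2))
```

Trying `mean(eye_colour)` would raise an error, which mirrors the point that a mean of labels has no meaning.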
Topic 2.2 — Quantitative, Qualitative & Bivariate Data
Building on the foundational types of data, GCSE Statistics requires a clear grasp of three further terms: quantitative, qualitative, and bivariate data. Quantitative data is any data that is numerical in nature. This encompasses both discrete and continuous data, as both are recorded as numbers. The key advantage of quantitative data is that it lends itself to arithmetic; we can calculate means, standard deviations, and correlations. If you can perform meaningful calculations on the values, the data is quantitative. In contrast, qualitative data is descriptive and non-numerical. It deals with qualities and characteristics, such as the colour of a car, the breed of a dog, or a person's favourite sport.
The concept of bivariate data introduces the idea of examining two variables simultaneously on the same set of subjects. The prefix "bi-" means two, so bivariate data involves paired measurements. For example, a researcher might collect data on both the age and the resting heart rate of a group of individuals. Each person in the sample provides two values, creating a pair of coordinates that can be plotted on a scatter graph. The purpose of collecting bivariate data is usually to investigate a relationship or association between the two variables. It is important to distinguish this from univariate data, which only looks at one variable at a time. In the context of bivariate data, one variable is typically designated as the explanatory (or independent) variable, and the other as the response (or dependent) variable.
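The pairing idea can be sketched in a few lines of Python. The ages and heart rates below are made-up values, and treating age as the explanatory variable is an assumption chosen to match the example in the text.

```python
# Hypothetical paired measurements: (age in years, resting heart rate in bpm).
# The values are invented for illustration only.
people = [(15, 72), (22, 68), (34, 70), (47, 74), (58, 76)]

# Each tuple is one person's pair of values, i.e. one point on a scatter graph.
ages = [age for age, _ in people]            # explanatory variable (x-axis)
heart_rates = [bpm for _, bpm in people]     # response variable (y-axis)

# One list on its own is univariate; keeping the pairs together is bivariate.
for x, y in zip(ages, heart_rates):
    print(f"({x}, {y})")
```

Keeping each person's two values together matters: sorting one list without the other would break the pairing and destroy any relationship in the data.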
Topic 2.3 — Multivariate Data
At the Higher tier, students must extend their understanding from bivariate data to the more complex realm of multivariate data. While bivariate data involves exactly two variables measured on the same subject, multivariate data involves three or more variables measured simultaneously on the same subject. For example, a medical study might record a patient's age, blood pressure, cholesterol level, and body mass index. Each patient in the study provides a set of four values, creating a multi-dimensional dataset. The primary difference between bivariate and multivariate analysis is the complexity of the relationships being investigated. With bivariate data, we look for a simple correlation between two variables. With multivariate data, we might investigate how several variables interact to produce a particular outcome.
Multivariate data is prevalent in the real world. Schools might track student attendance, punctuality, and test scores to identify at-risk pupils. Climate scientists might analyse temperature, humidity, and wind speed together to predict weather patterns. Retail companies use multivariate data on customer age, income, and purchasing habits to target their marketing. In examinations, Higher-tier students may be presented with a table containing multiple columns of data and asked to identify that it is multivariate. They may be asked to suggest which variables could be paired to investigate a specific hypothesis, or to critique a study for failing to control for a key variable. Understanding multivariate data is also the gateway to more advanced statistical methods used at A-Level and beyond.
Topic 2.4 — Grouped vs Ungrouped Data & Choosing Class Intervals
Data in its raw, unprocessed form is known as ungrouped data. This consists of individual data points listed as they were collected, such as the exact heights of every student in a class. While ungrouped data contains the most detailed information, it can be unwieldy and difficult to interpret when the dataset is large. To manage this, statisticians group the data into class intervals, creating grouped data. For example, rather than listing every individual height h in cm, we might create intervals such as 150 ≤ h < 160, 160 ≤ h < 170, and so on, written with inequalities so that a boundary value such as 160 cm belongs to exactly one class. The primary advantage of grouping is that it simplifies complex datasets, making them easier to present in tables and draw as histograms or frequency polygons. However, a significant disadvantage is the loss of detail and precision: once the data is grouped, the individual values are no longer known, so statistics such as the mean can only be estimated.
Choosing appropriate class intervals is therefore a critical skill. Intervals should be mutually exclusive (no value fits into two classes), exhaustive (every value fits into some class), and of equal width where possible, though unequal widths can be used with the correct techniques. The number of intervals should be sufficient to show the shape of the distribution, usually between five and fifteen, but not so numerous that the data becomes sparse and loses its summarising power. For example, grouping the ages of 100 people into intervals of 1 year would not simplify the data much, whereas grouping them into a single interval of 0–100 years would obscure all patterns. A decision on class intervals is a balance between clarity and accuracy.
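Grouping raw values into equal-width classes can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `group_into_classes`, the heights, and the choice of four 10 cm classes starting at 150 cm are all assumptions made for the example.

```python
def group_into_classes(values, start, width, n_classes):
    """Count the values in each class: start + i*width <= x < start + (i+1)*width."""
    counts = [0] * n_classes
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
        else:
            # The classes must be exhaustive, so an unplaced value is an error.
            raise ValueError(f"{x} is outside the class intervals")
    return counts

# Hypothetical heights in cm, invented for illustration.
heights_cm = [151.2, 158.9, 160.0, 163.4, 167.5, 169.9, 170.0, 172.3, 175.8, 181.1]
counts = group_into_classes(heights_cm, start=150, width=10, n_classes=4)

for i, frequency in enumerate(counts):
    lower = 150 + 10 * i
    print(f"{lower} <= h < {lower + 10}: {frequency}")
```

Because each class includes its lower boundary and excludes its upper one, the values 160.0 and 170.0 are each counted once, which is what "mutually exclusive" requires.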
Topic 2.5 — Explanatory & Response Variables
In any investigation that seeks to establish a relationship between two variables, it is essential to distinguish between the explanatory variable and the response variable. The explanatory variable, also known as the independent variable, is the one that is thought to explain or cause changes in the other. It is the variable that the researcher manipulates, controls, or uses to make predictions. For example, in an investigation into whether the amount of fertiliser used affects plant growth, the amount of fertiliser is the explanatory variable. The response variable, also known as the dependent variable, is the outcome that is measured. In the fertiliser example, the height of the plant after a fixed period would be the response variable.
Identifying these variables correctly is crucial for designing the investigation and for presenting the data correctly. By convention, when plotting bivariate data on a scatter graph, the explanatory variable is placed on the horizontal x-axis and the response variable on the vertical y-axis, so that the graph is read as "how y changes as x changes." A common student error is to place the variables on the wrong axes, which can lead to confusion when drawing a line of best fit or deciding which variable is being used to predict the other. Understanding this distinction prevents fundamental errors in data collection and analysis and forms the basis for more advanced concepts like regression.
Topic 2.6 — Primary vs Secondary Data
A fundamental decision in the data collection stage is whether to use primary or secondary data, and understanding the distinction, advantages, and limitations of each is a core requirement of GCSE Statistics. Primary data is information that is collected firsthand by the investigator specifically for the purpose of the current investigation. This could be through methods such as questionnaires, interviews, experiments, or direct observation. The main advantage of primary data is its relevance and specificity; because you design the collection method yourself, you can ensure it directly addresses your hypothesis and measures exactly the variables you are interested in. However, the collection of primary data is often time-consuming and expensive.
Secondary data, in contrast, is information that has been previously collected by another person or organisation for a different purpose. Sources include government databases (such as those published by the Office for National Statistics, ONS), academic studies, company records, and published statistics. The primary advantage of secondary data is its efficiency; it is often free or low-cost and can provide access to large-scale datasets that would be impossible for an individual to collect. The major limitation is the lack of control over how the data was collected, so its accuracy, its age, and any bias in the collection method may be unknown. When using secondary data, it is essential to acknowledge the source. This allows others to verify the data and assess its credibility for themselves.
Topic 2.7 — Methods of Data Collection
The method chosen to collect data has a profound impact on the quality and character of the resulting information. The experimental method involves controlling variables to establish cause-and-effect relationships. Simulation involves creating a model of a real-world situation, often using random numbers to mimic probabilistic events. Questionnaires and interviews are common methods for gathering opinions or self-reported data, though they are susceptible to response bias. Observation involves watching and recording behaviour or events as they occur naturally, which can reduce self-reporting bias but may suffer from observer bias.
On a larger scale, a census is a data collection method that attempts to gather information from every single member of a population. While this provides complete and highly accurate data, it is often prohibitively expensive and time-consuming, which is why most research relies on sampling—the process of selecting a subset of the population to study. Each method carries its own strengths and limitations. For example, a laboratory experiment offers high validity for establishing causation but may lack ecological validity (real-world applicability). A questionnaire is cheap and fast but may yield low-quality data if questions are poorly designed.
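The simulation method mentioned above can be illustrated with a short Python sketch that uses random numbers to stand in for rolling a fair six-sided die. The function name `simulate_dice_rolls` and the choice of 600 rolls are assumptions made for this example.

```python
import random

def simulate_dice_rolls(n_rolls, seed=None):
    """Simulate rolling a fair six-sided die n_rolls times and tally the outcomes."""
    rng = random.Random(seed)          # a fixed seed makes the run repeatable
    tally = {face: 0 for face in range(1, 7)}
    for _ in range(n_rolls):
        tally[rng.randint(1, 6)] += 1  # randint includes both end points
    return tally

tally = simulate_dice_rolls(600, seed=1)
for face, count in tally.items():
    print(face, count)    # with a fair die, each count should be roughly 100
```

The counts will not be exactly equal, which is itself a useful lesson: a simulation reproduces the random variation that the real process would show.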
Topic 2.8 — Reliability & Validity
Reliability and validity are two cornerstone concepts in assessing the quality of any data collection method. Reliability refers to the consistency of a method. A reliable measurement is one that would produce the same, or very similar, results if it were repeated under the same conditions. For example, if you used a stopwatch to time how long it took each of ten people to solve a puzzle, the method would be reliable if timing the same attempt again (for instance, from a video recording) gave the same, or very nearly the same, result. If the stopwatch was faulty or the instructions for starting and stopping were ambiguous, the method would be unreliable because the recorded times would vary unpredictably even though the attempt being timed had not changed.
Validity, however, is about whether the method actually measures what it claims to measure. A method can be highly reliable but completely invalid. For instance, using a ruler to measure a person's height and treating the result as their "intelligence" would be a highly reliable method, since the ruler would give a consistent number every time, but it is utterly invalid because height is not a measure of intelligence. It is possible for a method to be neither reliable nor valid, to be reliable but not valid, or ideally, both reliable and valid. To improve reliability, researchers can standardise procedures, use precise instruments, and repeat measurements. To improve validity, they must ensure that the method is directly and appropriately linked to the concept being investigated.
Topic 2.9 — Bias: Sources, Types & How to Minimise It
Bias refers to a systematic distortion in a statistical process that results in a misrepresentation of the true population parameter. Unlike random error, which scatters results unpredictably above and below the true value, bias pulls results consistently in one direction, leading to inaccurate conclusions. Sampling bias occurs when the sample is not representative of the population. Question wording bias occurs when leading or emotionally charged language influences respondents' answers. Sensitivity bias arises when questions are too personal, causing participants to lie or refuse to answer, which can skew results.
To minimise bias, a statistician must employ proactive strategies at every stage. Random sampling gives every member of the population an equal chance of being selected, and stratified sampling goes further by ensuring every subgroup is represented proportionally. Questions should be neutral, clear, and pre-tested for understanding. Anonymity and confidentiality can reduce sensitivity bias by making respondents feel safer. For Higher-tier students, the concept of level of control is also relevant; in experiments, controlling extraneous variables prevents them from confounding the results. A common mistake among students is to suggest "using a larger sample" as the universal cure for bias. While a larger sample can reduce random variation, it does not eliminate systematic bias.
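The point that a larger sample does not cure bias can be demonstrated with a small simulation. Everything below is a hypothetical setup invented for illustration: a population of 10,000 heights drawn from a normal model, and a biased sample frame that contains only people taller than 175 cm.

```python
import random
from statistics import mean

rng = random.Random(42)   # seeded so the example is repeatable

# Hypothetical population: 10,000 heights in cm, invented for illustration.
population = [rng.gauss(170, 10) for _ in range(10_000)]
true_mean = mean(population)

# A biased sample frame: only people taller than 175 cm can be selected
# (imagine sampling only from a basketball club).
biased_frame = [h for h in population if h > 175]

errors = {}
for n in (20, 200, 2000):
    fair_error = mean(rng.sample(population, n)) - true_mean
    biased_error = mean(rng.sample(biased_frame, n)) - true_mean
    errors[n] = (fair_error, biased_error)
    print(n, round(fair_error, 2), round(biased_error, 2))
```

As the sample size grows, the error of the fair random sample tends to shrink towards zero, while the error of the biased sample settles at a large positive value: more data from the wrong frame only gives a more precise wrong answer.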
Topic 2.10 — Population, Sample Frame & Sample
These three terms form the foundational vocabulary of sampling. The population is the entire group of individuals or items that are the subject of the statistical investigation. It is crucial to define the population precisely. The sample frame (also called the sampling frame) is the list of members of the population from which the sample is actually drawn. In an ideal world, the sample frame would be identical to the population, but in practice, there are often discrepancies. For example, if your population is all households in a town but your sample frame is a list of households with a landline telephone, you have immediately excluded every household without a landline, such as those that only use mobile phones, potentially introducing bias.
The sample is the specific subset of the population that is selected for study. The primary reason for using a sample rather than a census is practicality—studying a well-chosen sample is faster, cheaper, and more efficient than studying the entire population, and when the sampling is done correctly, the results can be generalised with a high degree of confidence. However, the quality of the sample directly determines the trustworthiness of the conclusions. If the sample is unrepresentative, the findings will not reflect the reality of the wider population. A strong student will not only define the terms but also be able to critique a given scenario by identifying mismatches between the intended population and the actual sample frame.
Topic 2.11 — Sampling Methods
Selecting an appropriate sampling method is critical to obtaining a representative dataset. In a simple random sample, every member of the population has an equal chance of being selected. Systematic sampling involves selecting every nth member from an ordered list, usually beginning from a randomly chosen starting point. Quota sampling divides the population into subgroups and sets a quota for how many participants to select from each, leaving the interviewer to choose who fills each quota; it is cheap and fast but prone to interviewer bias. Stratified sampling also divides the population into strata (subgroups), but then selects participants randomly from within each stratum in proportion to their size in the population.
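Simple random and systematic sampling can be sketched in Python as below. The numbered list of 60 students is a hypothetical sample frame, and choosing a random starting point for the systematic sample is an assumption that reflects common practice rather than something every textbook requires.

```python
import random

rng = random.Random(7)   # seeded so the example is repeatable

# Hypothetical sample frame: a numbered list of 60 students.
frame = [f"Student {i}" for i in range(1, 61)]
sample_size = 10

# Simple random sample: every member has an equal chance of selection,
# and nobody can be chosen twice.
simple_random = rng.sample(frame, sample_size)

# Systematic sample: pick a random start, then take every k-th member.
k = len(frame) // sample_size        # sampling interval, here 6
start = rng.randrange(k)             # random position among the first k members
systematic = frame[start::k]

print(simple_random)
print(systematic)
```

The systematic sample is evenly spread through the list, which is convenient, but it can be unrepresentative if the list itself has a repeating pattern that matches the interval.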
The number to select from each stratum is given by the formula (stratum size ÷ population size) × sample size, rounded to a whole number, which ensures each subgroup is represented fairly. Stratified sampling is highly representative but more complex to organise. Opportunity or convenience sampling uses whoever is readily available, such as asking your friends. Judgement sampling relies on the researcher's personal choice. Both of these are quick but carry a very high risk of bias, as they almost guarantee a non-representative sample. For the exam, you must be able to calculate stratified sample sizes and justify your choice of sampling method based on the trade-off between representativeness and practicality.
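The stratified formula above can be applied to a worked case in a few lines of Python. The school of 600 pupils, the year-group sizes, and the sample of 40 are invented for this example, and `stratified_sizes` is a hypothetical helper name.

```python
def stratified_sizes(strata, sample_size):
    """Apply (stratum size / population size) * sample size to each stratum."""
    population_size = sum(strata.values())
    # Note: Python's round() sends exact halves to the nearest even number,
    # which can differ from the "round half up" rule used in exam working.
    return {name: round(size * sample_size / population_size)
            for name, size in strata.items()}

# Hypothetical school of 600 pupils; a sample of 40 is wanted.
year_groups = {"Year 7": 180, "Year 8": 156, "Year 9": 120,
               "Year 10": 84, "Year 11": 60}

sizes = stratified_sizes(year_groups, 40)
print(sizes)
print(sum(sizes.values()))   # check the rounded sizes still total 40
```

For Year 8 the calculation is (156 ÷ 600) × 40 = 10.4, which rounds to 10, and for Year 10 it is (84 ÷ 600) × 40 = 5.6, which rounds to 6. Rounding can occasionally make the total differ from the intended sample size by one, so it is worth checking the sum as the last line does.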
Topic 2.12 — Questionnaire Design
A well-designed questionnaire is one of the most powerful tools for collecting primary data, but poor design can render the resulting information useless. The wording of each question must be neutral and unambiguous. Leading questions, which suggest a particular answer, must be avoided at all costs. Another critical principle is the inclusion of a timeframe. A question like "How often do you exercise?" is too vague. A better version is, "How many hours of exercise did you do last week?" which provides a specific reference period and makes the responses comparable.
Questionnaires should employ a mix of open and closed questions. Closed questions, with tick-box responses, are quick to answer and easy to analyse quantitatively. Open questions allow for more detailed, qualitative responses but are harder to analyse and may intimidate respondents. The layout should be clear, with logical flow and adequate space for answers. Pre-testing a questionnaire on a small group can help identify confusing questions or misleading terms before the main data collection begins. In the exam, when asked to improve a questionnaire, always ensure you address bias, ambiguity, and timeframe to demonstrate a thorough understanding of sound design principles.
Frequently Asked Questions
Is height discrete or continuous data?
Height is a continuous variable because it is measured on a scale and can take any value within a range (e.g., 172.5 cm, 172.54 cm). Discrete data, by contrast, can only take specific, separate values, most often whole numbers.
What are the main disadvantages of a census?
The main disadvantages of a census are the significant cost and time required to survey every single member of a population. For large populations, it is often practically impossible.
Why use stratified sampling rather than simple random sampling?
Stratified sampling ensures that different subgroups (strata) of a population are represented proportionally in the final sample. Simple random sampling might accidentally miss or under-represent a small but important subgroup.
What is a sampling frame?
A sampling frame is a list of all members of the population from which the sample is actually drawn (e.g., a school register or the electoral roll). Bias can occur if the frame doesn't match the true population.