Glossary

Data Point (a.k.a. Example or Observation)

A single, discrete unit of information. In a typical dataset, a data point corresponds to a single row in a table.

Data Set

A collection of all data points or examples.

Outliers

An outlier is an observation that deviates markedly from other observations in the sample. An outlier may be the result of:

Bad data: The data may have been entered incorrectly or resulted from a faulty measurement. If an investigation confirms this, the outlier should be corrected or deleted.
A true, rare event: The outlier may be a valid data point that is simply unusual. In this case, it should not be deleted, but it requires careful consideration during analysis.

Construct

A concept that is difficult to measure because it can be defined in many different ways (e.g., happiness, intelligence, memory).

Operational Definition

The specific way a construct is measured. Once a construct is operationally defined, it is no longer ambiguous (e.g., defining "intelligence" as an "IQ score").

Treatment

In an experiment, a treatment is the specific condition or intervention applied to the subjects. Researchers are interested in how different treatments might yield different results.

Observational Study

In an observational study, researchers observe subjects and measure variables of interest without assigning treatments to the subjects. In contrast, an **experiment** involves applying a treatment to subjects in a controlled environment. Observational studies generally support conclusions about correlation only, whereas well-designed experiments, in which subjects are randomly assigned to treatments, can support conclusions about causation.

Treatment Group

The group in a study that receives the treatment or intervention being tested.

Control Group

The group in a study that does not receive the treatment. This group serves as a baseline for comparison.

Independent Variable (a.k.a. Feature, Predictor)

The variable that is manipulated by the experimenter to observe its effect on the dependent variable. It is the presumed _cause_ and is typically plotted on the x-axis.

Dependent Variable (a.k.a. Outcome, Target)

The variable that is measured by the experimenter to see how it is affected by the independent variable. It is the presumed _effect_ and is typically plotted on the y-axis.

Confounding Variable (a.k.a. Lurking Variable)

A third variable, often unmeasured, that influences both the independent and dependent variables. Confounding variables can distort the results of a study by suggesting a relationship between the independent and dependent variables where none exists, or by hiding a true relationship.

Control Variable

A variable that is held constant throughout an experiment to prevent it from influencing the outcome. Unlike a confounding variable, a control variable is known to the experimenter and is intentionally controlled.

Normalization (or Min-Max Scaling)

A technique used to transform numerical data to a common scale, typically between 0 and 1. The formula is:

x_new = (x - x_min) / (x_max - x_min)

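Normalization is sensitive to outliers: because x_min and x_max are themselves the most extreme values, a single outlier can squeeze all other values into a narrow part of the range. For this reason it is usually applied after outliers have been removed.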
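
As a minimal sketch of the formula above, the following Python function rescales a list of numbers to the range 0 to 1. The function name `min_max_scale` and the height values are illustrative assumptions, not part of any particular library.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # The formula divides by (x_max - x_min), which would be zero here.
        raise ValueError("all values are equal, so the range is zero")
    return [(x - x_min) / (x_max - x_min) for x in values]


heights_cm = [150, 160, 170, 180, 200]  # made-up illustrative data
scaled = min_max_scale(heights_cm)
print(scaled)  # [0.0, 0.2, 0.4, 0.6, 1.0]
```

Note how the single large value (200) takes the top of the range and pushes the other values toward the lower end.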

Standardization (or Z-Score Normalization)

A technique used to transform numerical data to have a mean of 0 and a standard deviation of 1. The formula is:

x_new = (x - mean) / std_dev

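The resulting value, x_new, is called the z-score. Unlike normalization, standardization does not produce a bounded range. It is less sensitive to outliers than min-max scaling, but it is not immune to them, because the mean and standard deviation are themselves pulled by extreme values.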
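
The z-score formula can be sketched in a few lines of Python using the standard `statistics` module. The function name `standardize` and the data are illustrative assumptions. This sketch uses the population standard deviation (`pstdev`), and a sample standard deviation could be substituted.

```python
import statistics


def standardize(values):
    """Return the z-score of each value: (x - mean) / std_dev."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)  # population standard deviation
    if std_dev == 0:
        raise ValueError("standard deviation is zero, z-scores are undefined")
    return [(x - mean) / std_dev for x in values]


values = [4, 8, 8, 8, 10, 10, 14, 18]  # made-up data: mean 10, std_dev 4
z_scores = standardize(values)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The transformed values have mean 0 and standard deviation 1, and each z-score reads as "how many standard deviations from the mean".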

Central Limit Theorem

The Central Limit Theorem states that if you repeatedly draw random samples of a sufficiently large size from a population (with finite variance) and calculate the mean of each sample, the distribution of those sample means will be approximately normal, regardless of the shape of the original population's distribution. The distribution of sample means is centered on the population mean, and its spread shrinks as the sample size grows.
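
A small simulation can make the theorem concrete. The sketch below assumes an exponential population with mean 1 (strongly right-skewed), and the sample size, number of samples, and seed are arbitrary illustrative choices. It checks that the sample means cluster around the population mean and that roughly 68% of them fall within one standard deviation of their center, as a normal distribution would predict.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs draw the same samples

SAMPLE_SIZE = 50     # observations per sample
NUM_SAMPLES = 2000   # how many sample means to collect


def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
# Share of sample means lying within one standard deviation of their center.
within_one_sd = sum(
    abs(m - mean_of_means) <= spread_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means:   {mean_of_means:.3f}")    # expected near 1.0
print(f"spread of sample means: {spread_of_means:.3f}")  # expected near 1/sqrt(50)
print(f"share within one sd:    {within_one_sd:.3f}")    # expected near 0.68
```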

Standard Error

The standard deviation of the sampling distribution of a statistic (most commonly, the mean). A smaller standard error indicates that the sample statistic is a more precise estimate of the population parameter. For the mean, it is commonly estimated as std_dev / √n, where n is the sample size.
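
For the sample mean, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. The sketch below assumes that estimator, and the function name and data are illustrative.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


small_sample = [12, 15, 11, 14, 13, 15, 12, 14]  # made-up measurements
large_sample = small_sample * 4                  # same values, four times as many

se_small = standard_error_of_mean(small_sample)
se_large = standard_error_of_mean(large_sample)
print(f"n = {len(small_sample):2d}: standard error = {se_small:.3f}")
print(f"n = {len(large_sample):2d}: standard error = {se_large:.3f}")
```

With a similar spread in the data, the larger sample yields a smaller standard error, which is why larger samples give more precise estimates.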

Variance

The average of the squared differences from the mean. The square root of the variance is the **standard deviation**.
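
The definition translates directly into code. This sketch computes the population variance, dividing by n. For a sample, the sum of squared differences is conventionally divided by n - 1 instead, which is what `statistics.variance` does. The function name and data are illustrative.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
variance = population_variance(data)
std_dev = math.sqrt(variance)
print(variance)  # 4.0
print(std_dev)   # 2.0
```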

Coefficient of Variation (CV)

The ratio of the standard deviation to the mean (std_dev / mean). It is a useful statistic for comparing the degree of variation between two different datasets, even if their means are drastically different.
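
As an illustration, the sketch below compares two made-up datasets measured on very different scales. The function name, the use of the population standard deviation, and the numbers are assumptions chosen so that both datasets have the same relative spread.

```python
import statistics


def coefficient_of_variation(values):
    """Ratio of the standard deviation to the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(values) / mean


mouse_weights_g = [18, 20, 22]              # made-up data, grams
elephant_weights_kg = [4500, 5000, 5500]    # made-up data, kilograms

cv_mice = coefficient_of_variation(mouse_weights_g)
cv_elephants = coefficient_of_variation(elephant_weights_kg)
print(f"mice:      {cv_mice:.4f}")
print(f"elephants: {cv_elephants:.4f}")
```

The raw standard deviations differ by orders of magnitude, yet the two coefficients agree, showing that both groups vary by the same proportion of their mean.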

Covariance and Correlation

**Covariance** measures how two variables change together. It can range from -∞ to +∞. A positive value means the variables tend to move in the same direction, while a negative value means they move in opposite directions. A value of 0 means there is no linear relationship.

**Correlation** is a scaled version of covariance that ranges from -1 to +1, making it easier to interpret. A value of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
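
The relationship between the two measures can be sketched as follows. The code assumes the sample covariance (dividing by n - 1) and the Pearson correlation, and the study-hours data are made up for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


hours = [1, 2, 3, 4, 5]             # made-up hours studied
scores = [52, 55, 61, 64, 68]       # made-up exam scores
scores_x10 = [10 * s for s in scores]  # same scores on a different scale

print(covariance(hours, scores))      # 10.25
print(covariance(hours, scores_x10))  # ten times larger
print(correlation(hours, scores))
print(correlation(hours, scores_x10)) # unchanged by the rescaling
```

Rescaling a variable changes the covariance but leaves the correlation untouched, which is why correlation is easier to compare across datasets.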

Confidence Level

The long-run proportion of confidence intervals, each computed from a new sample, that would contain the true population parameter. For example, a 95% confidence level means that if you were to repeat the experiment many times, about 95% of the calculated confidence intervals would contain the true population parameter.

Confidence Interval

The range of values within which the true population parameter is estimated to lie, with a certain level of confidence. For example, a 95% confidence interval for the mean age of students might be [15, 19].

The width of the confidence interval is affected by:

Sample size: Larger samples produce narrower intervals.
Variability: Greater variability in the sample produces wider intervals.
Confidence level: Higher confidence levels produce wider intervals.
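
A rough sketch of how such an interval for a mean might be computed is shown below. It assumes a normal approximation (a z critical value from `statistics.NormalDist`), and for small samples a t critical value would usually be more appropriate. The function name and the ages are illustrative.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean, using a normal critical value."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * std_err, mean + z * std_err


ages = [15, 16, 16, 17, 17, 17, 18, 18, 19, 17]  # made-up student ages

low_95, high_95 = mean_confidence_interval(ages)
low_99, high_99 = mean_confidence_interval(ages, confidence=0.99)
low_big, high_big = mean_confidence_interval(ages * 4)  # four times the data

print(f"95%, n=10: [{low_95:.2f}, {high_95:.2f}]")
print(f"99%, n=10: [{low_99:.2f}, {high_99:.2f}]")    # wider than the 95% interval
print(f"95%, n=40: [{low_big:.2f}, {high_big:.2f}]")  # narrower than n=10
```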

Discrete Variable

A variable that can take on only a countable set of distinct values, usually whole numbers (e.g., the number of students in a class).

Continuous Variable

A variable that can take on any value within a given range, so that between any two values there are infinitely many others (e.g., the weight of a person).

Frequency Distribution

A summary of the number of times each distinct value of a variable occurs. It is most often used for categorical variables; numerical variables are usually grouped into bins first.
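
For a categorical variable, a frequency distribution can be built with `collections.Counter` from the Python standard library. The color data below are made up for illustration.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red"]  # made-up observations

frequency = Counter(colors)  # maps each distinct value to its count
for value, count in frequency.most_common():
    print(value, count)
# red 3
# blue 2
# green 1
```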

Null Hypothesis (H₀)

The null hypothesis represents the status quo or the existing belief, typically stating that there is no effect, no difference, or no relationship. The goal of hypothesis testing is to determine whether there is enough evidence to reject the null hypothesis in favor of an alternative hypothesis.

p-value

The probability of obtaining a result at least as extreme as the one observed in the data, assuming the null hypothesis is true. A small p-value (typically ≤ 0.05) indicates that the observed result would be unlikely if the null hypothesis were true, providing evidence against the null hypothesis. It is not the probability that the null hypothesis itself is true.
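
A coin-flip example shows the definition at work. The sketch assumes a null hypothesis that the coin is fair and computes an exact two-sided p-value from the binomial distribution. The function name and the observed counts are illustrative.

```python
from math import comb


def fair_coin_p_value(heads, flips):
    """Probability, under a fair coin, of a head count at least as far
    from flips / 2 as the observed one (two-sided)."""
    observed_distance = abs(heads - flips / 2)
    extreme_counts = [
        k for k in range(flips + 1) if abs(k - flips / 2) >= observed_distance
    ]
    # Under the null hypothesis every sequence of flips has probability 1 / 2**flips.
    return sum(comb(flips, k) for k in extreme_counts) / 2 ** flips


p_value = fair_coin_p_value(heads=9, flips=10)
print(p_value)  # 0.021484375, i.e. 22 / 1024
```

Here the outcomes at least as extreme as 9 heads are 0, 1, 9, and 10 heads. Since the p-value is below 0.05, the result would count as evidence against the coin being fair at that threshold.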

t-test vs. ANOVA

t-test: Used to compare the means of two groups.
ANOVA (Analysis of Variance): Used to compare the means of three or more groups.
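
To make the comparison concrete, the sketch below computes the two test statistics by hand. It assumes Welch's form of the t statistic (the text does not say which variant is meant) and a one-way ANOVA F statistic. The function names and data are illustrative. Turning either statistic into a p-value requires the t or F distribution, which in practice usually comes from a statistics library.

```python
import math
import statistics


def welch_t_statistic(a, b):
    """Difference in means divided by its estimated standard error."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err


def anova_f_statistic(*groups):
    """Between-group variance divided by within-group variance."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


group_a = [5, 6, 7, 8]     # made-up measurements
group_b = [7, 8, 9, 10]
group_c = [9, 10, 11, 12]

t_stat = welch_t_statistic(group_a, group_b)          # two groups
f_stat = anova_f_statistic(group_a, group_b, group_c) # three groups
print(f"t = {t_stat:.3f}")
print(f"F = {f_stat:.3f}")
```

With exactly two groups of equal size, the F statistic equals the square of the t statistic, so ANOVA can be seen as a generalization of the t-test to more groups.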

t-test vs. Chi-Squared Test

t-test: Used for continuous data to compare the means of two groups.
Chi-Squared Test: Used for categorical data to test for a relationship between two variables or to see if the observed data fits an expected distribution.
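
For the categorical case, the chi-squared statistic for a contingency table can be sketched as below. It compares each observed count with the count expected if the two variables were unrelated. The function name and the table are illustrative, and converting the statistic to a p-value requires the chi-squared distribution with (rows - 1) × (columns - 1) degrees of freedom.

```python
def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts: rows are two groups, columns are two response categories.
observed_table = [[30, 10],
                  [20, 40]]
independent_table = [[10, 20],
                     [20, 40]]

print(chi_squared_statistic(observed_table))     # about 16.67
print(chi_squared_statistic(independent_table))  # 0.0
```

The second table has identical proportions in both rows, so the observed counts match the expected counts exactly and the statistic is zero.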