Chapter 2: Data Cleaning

Outliers

Outliers are data points that lie far outside the pattern of the rest of the dataset, that is, values that are significantly higher or lower than most others. They may occur due to measurement errors, data entry mistakes, natural variation, or rare events, and they can distort key statistics like averages and standard deviations if not handled correctly.

Example:
A column with ages: 22, 24, 23, 25, 120
Here, 120 is likely an outlier because it differs drastically from the typical age range.


Why Outliers Matter

Outliers can have a large impact on data analysis because:
• They can skew summary statistics such as the mean.
• They may distort trends, predictions, and modeling results.
• Sometimes outliers represent true, important signals (e.g., exceptional sales records), so they shouldn’t always be removed without thought.
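
To make the first point concrete, here is a minimal Python sketch using the ages from the example above. It uses only the standard library's statistics module; the variable names are illustrative and not part of any prescribed workflow.

```python
import statistics

ages = [22, 24, 23, 25, 120]  # example column from above
ages_without_outlier = [a for a in ages if a != 120]

# The mean is pulled strongly toward the extreme value.
mean_with = statistics.mean(ages)                      # 42.8
mean_without = statistics.mean(ages_without_outlier)   # 23.5

# The median barely moves.
median_with = statistics.median(ages)                      # 24
median_without = statistics.median(ages_without_outlier)   # 23.5

print(f"mean:   {mean_with} with outlier, {mean_without} without")
print(f"median: {median_with} with outlier, {median_without} without")
```

A single value of 120 nearly doubles the mean, while the median changes by only half a year.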


How Outliers Are Detected

There are several common ways to identify outliers:

📈 Visualization Methods
Box plots, histograms, and scatter plots make it easy to spot values that lie far outside normal ranges.

📊 Statistical Methods
Interquartile Range (IQR): With IQR = Q3 − Q1, values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged as outliers.
Z-Score: A z-score measures how many standard deviations a value lies from the mean. Values with a high absolute z-score (e.g., greater than 2 or 3) are flagged as outliers.
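
Both statistical rules can be sketched in a few lines of Python. The function names iqr_outliers and zscore_outliers are hypothetical, and the sketch makes two assumptions: quartiles are computed with the standard library's "inclusive" method (other quartile conventions can give different fences), and the z-score uses the sample standard deviation. The longer list more_ages is made-up illustration data.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles; other conventions may differ.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]

ages = [22, 24, 23, 25, 120]
print(iqr_outliers(ages))          # [120]
print(zscore_outliers(ages, 2.0))  # []

# Hypothetical longer column with the same extreme value.
more_ages = [22, 24, 23, 25, 21, 26, 24, 23, 22, 25, 120]
print(zscore_outliers(more_ages, 2.0))  # [120]
```

On the five-value example the z-score rule does not flag 120 even at a threshold of 2, because the extreme value itself inflates the mean and standard deviation. The rule only picks it up once the sample is larger, which is one reason the IQR rule is often preferred for small datasets.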


How Outliers Are Treated

Deciding what to do with outliers depends on context:
• Investigate them manually to check whether they are errors or true values.
• Remove outliers only if they are clearly due to mistakes or noise.
• Use robust statistics like median or IQR-based methods that are less affected by outliers.
• Transform or cap values so extreme points don’t overly drive results.
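
The removal, capping, and transformation options can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended pipeline: the helper names (iqr_fences, remove_outliers, cap_outliers) are hypothetical, the fences use the 1.5×IQR rule with "inclusive" quartiles, and the log transform is just one possible choice that only works for positive values.

```python
import math
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles from the standard library.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Drop values outside the fences (only sensible for confirmed errors)."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]

def cap_outliers(values, k=1.5):
    """Clip values to the fences instead of dropping them."""
    lower, upper = iqr_fences(values, k)
    return [min(max(v, lower), upper) for v in values]

ages = [22, 24, 23, 25, 120]

trimmed = remove_outliers(ages)        # [22, 24, 23, 25]
capped = cap_outliers(ages)            # [22, 24, 23, 25, 28.0]
logged = [math.log(v) for v in ages]   # compresses the gap to 120

print(trimmed)
print(capped)
print([round(x, 2) for x in logged])
```

Capping keeps the row in the dataset while limiting its influence, whereas removal discards it entirely, so removal is best reserved for values that investigation has shown to be errors.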