Unit 4.1: Ethical and Social Issues Around Data Collection
In the modern world, data is collected at an unprecedented scale. Programmers must understand the ethical responsibilities and social implications associated with how this data is gathered, stored, and used.
What is Personal Data?
Personal data (or private data) refers to any information that can be used to identify, contact, or locate a specific individual, or that relates to an individual's private life.
Examples of personal data include:
- Personally Identifiable Information (PII): Full name, home address, Social Security number.
- Digital Identifiers: Email addresses, IP addresses, usernames.
- Financial Information: Credit card numbers, bank account details.
- Location Data: Real-time GPS coordinates or travel history.
- Health and Fitness: Medical records or heart rate data from a smartwatch.
Personal Privacy
Whenever someone uses a computer, their personal privacy is at risk. Programmers should attempt to safeguard the personal privacy of the user.
- Example (Safeguarding): An app developer stores only salted, one-way hashes of user passwords and encrypts home addresses, so that even if the database is breached, the passwords cannot be read directly and the addresses are unreadable without the encryption key.
- Example (Risk): A free flashlight app requests access to your contact list and microphone. This is an unnecessary privacy risk because the data being collected is not required for the app's function.
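The safeguarding example above can be sketched in Python using only the standard library. This is a minimal illustration, not a production design: the function names `hash_password` and `verify_password` and the iteration count are assumptions chosen for this sketch. Encrypting addresses is not shown because it requires a third-party library.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; real systems tune this to their hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store; the password itself is never stored."""
    salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the digest from a login attempt and compare in constant time."""
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt, digest)
```

Because the hash is one-way, the program never needs to recover the original password. It only checks whether a login attempt produces the same digest.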
Algorithmic Bias
Algorithmic bias describes systemic and repeated errors in a program that create unfair outcomes for a specific group of users.
- Example (Hiring): An AI tool used to screen resumes is trained on historical data from a company where most past hires were men. The algorithm learns to prefer resumes that use "masculine" verbs or mention sports, unfairly penalizing qualified female candidates.
- Example (Credit Scoring): A loan-approval algorithm uses ZIP codes as a factor. Because certain ZIP codes are historically linked to specific ethnic groups, the algorithm may unfairly deny loans to individuals from those groups, even if they are financially stable.
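One simple way to look for the kind of unfair outcome described above is to compare how often each group receives a favorable decision. The sketch below assumes a hypothetical audit log of `(group, approved)` pairs. The function names and the sample data are invented for illustration. A large gap is a signal to investigate further, not proof of bias on its own.

```python
from collections import defaultdict


def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> {group: approval rate}."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        if was_approved:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means the rates are equal."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so no group was favored
    return min(rates.values()) / highest


# Hypothetical audit log: (group label, loan approved?)
sample = [("A", True)] * 8 + [("A", False)] * 2 + [("B", True)] * 4 + [("B", False)] * 6
rates = approval_rates(sample)
print(rates)                   # {'A': 0.8, 'B': 0.4}
print(disparity_ratio(rates))  # 0.5
```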
Data Collection and Extrapolation
Before using a data set to extrapolate new information or draw conclusions, programmers should understand how the data was collected and how that collection method could introduce bias.
- Example (Survey Bias): A political poll is conducted entirely through landline phone calls. Since younger people rarely use landlines, the results cannot be used to reliably predict how the entire country (including young voters) will vote.
- Example (Product Testing): A company tests a new "universal" skin-tone sensor for a camera but only collects calibration data from light-skinned individuals. The resulting product fails to work correctly for people with darker skin tones.
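The landline poll can be illustrated with a small made-up population. All of the numbers below are hypothetical and were chosen only to show how a collection method that reaches one group more easily can shift the result.

```python
# Hypothetical population of 100 people: (age_group, has_landline, supports_measure)
population = (
    [("under_35", False, True)] * 35
    + [("under_35", False, False)] * 10
    + [("under_35", True, True)] * 3
    + [("under_35", True, False)] * 2
    + [("35_plus", True, True)] * 15
    + [("35_plus", True, False)] * 25
    + [("35_plus", False, True)] * 4
    + [("35_plus", False, False)] * 6
)


def support_rate(people):
    """Fraction of the given people who support the measure."""
    return sum(1 for _, _, supports in people if supports) / len(people)


# A landline-only poll can reach only the people who have a landline.
landline_sample = [person for person in population if person[1]]

print(support_rate(population))       # 0.57
print(support_rate(landline_sample))  # 0.4
```

In this invented population a majority supports the measure, but the landline-only sample suggests a minority does, because most of the younger supporters were never reachable.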
Data Quality
Some data sets are incomplete or contain inaccurate data. Using flawed data can cause a program to work incorrectly or inefficiently.
- Example (Incomplete Data): A self-driving car algorithm is trained on millions of miles of highway driving but has zero data on snowy conditions. When the car encounters a blizzard, it cannot operate safely because its training data was incomplete.
- Example (Inaccurate Data): A hospital's patient database has a bug where many weights are recorded in kilograms but labeled as pounds. A medication-dosage program that trusts the labels would treat those patients as weighing less than half their true weight and could calculate dangerously wrong doses.
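A program can catch some flawed records by checking values for plausibility before using them. The sketch below assumes hypothetical adult patient records stored as dictionaries with `id`, `weight`, and `unit` fields. The 35 to 250 kg range is an illustrative assumption, not clinical guidance.

```python
LB_PER_KG = 2.20462


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms, trusting the recorded unit label."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown unit: {unit!r}")


def flag_suspicious_weights(records, low_kg=35.0, high_kg=250.0):
    """Return records whose converted weight falls outside the assumed adult range."""
    return [
        record
        for record in records
        if not low_kg <= to_kilograms(record["weight"], record["unit"]) <= high_kg
    ]


# Hypothetical adult patient records
records = [
    {"id": 1, "weight": 154.0, "unit": "lb"},  # about 69.9 kg, plausible
    {"id": 2, "weight": 70.0, "unit": "lb"},   # about 31.8 kg, possibly kg mislabeled as lb
    {"id": 3, "weight": 82.5, "unit": "kg"},   # plausible
    {"id": 4, "weight": 100.0, "unit": "lb"},  # about 45.4 kg, passes the check either way
]

flagged = flag_suspicious_weights(records)
print([record["id"] for record in flagged])  # [2]
```

Record 4 shows the limit of this approach: a mislabeled value can still land inside the plausible range, so a range check reduces the risk of bad data but does not make the data accurate.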
Suitability of Data Sets
A data set is usually collected to address a specific question or topic. It might not be appropriate for answering a different question or for extrapolating information about a different topic.
- Example (Irrelevant Topic): A scientist uses a data set of basketball player heights to estimate the average height of the general population. This is inappropriate because basketball players are much taller than average, so the estimate would be far too high.
- Example (Context Shift): A developer uses historical stock market data from the 1920s to train a trading bot for the 2020s. Because the economic "rules" and technologies have changed completely, the old data set is no longer suitable for modern predictions.
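One lightweight habit is to record what a data set describes and when it was collected, then compare that against the question being asked. The sketch below is a hypothetical illustration: the `DataSetInfo` class, its fields, the ten-year age limit, and the sample year ranges are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetInfo:
    population: str  # who or what the data describes
    first_year: int  # earliest year of collection
    last_year: int   # latest year of collection


def suitability_problems(info, target_population, target_year, max_age_years=10):
    """Return a list of reasons the data set may not fit the new question."""
    problems = []
    if info.population != target_population:
        problems.append(
            f"collected from {info.population!r}, not {target_population!r}"
        )
    if target_year - info.last_year > max_age_years:
        problems.append(
            f"newest data is from {info.last_year}, too old for {target_year}"
        )
    return problems


# Hypothetical descriptions of the two data sets from the examples above
heights = DataSetInfo("professional basketball players", 2015, 2020)
stocks = DataSetInfo("stock prices", 1920, 1929)

print(suitability_problems(heights, "general adult population", 2021))
print(suitability_problems(stocks, "stock prices", 2024))
```

An empty list does not prove that a data set is suitable. It only means these two basic checks found no mismatch.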