Lecture 2
Introduction to Machine Learning. Part 2
What is the problem owner hoping to accomplish and why?
Why am I (being asked to) solve it?
Am I the right person to solve this problem?
What are the (psychological, societal, and environmental) repercussions of building this technology?
Should this thing be built at all?
What are the metrics of success?
Know your data!
Data need to be collected → Datasets
What data is available?
What data should be available, but isn’t?
What population / system / process is your data representing?
And what properties of that population / system / process are included (or excluded)?
What biases (social, population, temporal) are present in your datasets?
Hold out dataset
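A hold-out dataset can be sketched as a random split made once, before any modelling. This is a minimal illustration, not a prescribed procedure: the function name `holdout_split`, the 20% fraction, and the toy records are assumptions chosen for the example.

```python
import random

def holdout_split(records, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the records and set aside a fraction as the hold-out set."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy records (hypothetical): ten items identified by the integers 0..9.
records = list(range(10))
train, holdout = holdout_split(records, holdout_fraction=0.2, seed=42)
print(len(train), len(holdout))  # 8 2
```

The hold-out records are used only to evaluate the final model, never to train or tune it.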
Modality | Quantity | Quality | Freshness | Cost |
---|---|---|---|---|
Structured | Number of records | Errors | Rate of collection | Acquisition |
Semi-structured | Number of features | Missing data | | Licensing |
 | | Bias | | Cleaning and integrations |
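Some of the dimensions in the table above can be measured directly. The sketch below counts records and features (quantity) and the share of missing values (one aspect of quality). The helper `profile` and the sample rows are hypothetical, and treating both `None` values and absent keys as missing is an assumption of this example.

```python
def profile(rows):
    """Summarise quantity and missing-data share for a list of dict records."""
    # A feature is any key that appears in at least one record.
    features = sorted({key for row in rows for key in row})
    n_cells = len(rows) * len(features)
    # A cell counts as missing if the key is absent or its value is None.
    n_missing = sum(1 for row in rows for f in features if row.get(f) is None)
    return {
        "n_records": len(rows),
        "n_features": len(features),
        "missing_share": n_missing / n_cells if n_cells else 0.0,
    }

# Made-up sample data, for illustration only.
rows = [
    {"age": 34, "city": "Delft", "income": 41000},
    {"age": None, "city": "Leiden", "income": 38000},
    {"age": 29, "city": None},  # "income" is absent entirely
]
summary = profile(rows)
print(summary)
```

Errors and bias are harder to quantify automatically and usually need domain knowledge about how the data was collected.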
Purposefully Collected Data | Administrative Data | Social Data | Crowdsourcing |
---|---|---|---|
Survey | Call records | Web pages | Distributed sensing |
Census | Financial transactions | Social Media | Implicit crowd work (e.g. captcha) |
Economic Indicators | Travel Data | Apps | Micro-work platforms (e.g. Amazon Mechanical Turk) |
Ad-hoc sensing | GPS Data | Search Engines | |
Purposefully Collected Data | Administrative Data | Social Data | Crowdsourcing |
---|---|---|---|
Modality: mostly structured | Modality: mostly structured | Modality: mostly semi-structured | Modality: all |
Quantity: low | Quantity: high | Quantity: low | Quantity: mid-low |
Quality: high | Quality: high | Quality: low | Quality: mid |
Freshness: low | Freshness: high | Freshness: high | Freshness: mid |
Cost: high | Cost: high | Cost: low | Cost: mid-low |
A physical, mathematical, logical, or conceptual representation of a system, entity, phenomenon, or process
Architecture plans
Maps
Sheet music
Mathematical laws of physics!
Machine Learning (statistical) Models
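To contrast the last two kinds of model in the list above, the sketch below puts a physical law next to a statistical model fitted to data. Everything here is an illustrative assumption: the free-fall law is picked as a familiar example, the measurements are simulated with a fixed seed, and the fit is a plain least-squares estimate through the origin.

```python
import random

G = 9.81  # m/s^2, gravitational acceleration used by the physical law

def law_distance(t):
    """Physical model: distance fallen from rest after t seconds, d = g * t^2 / 2."""
    return 0.5 * G * t * t

# Simulated noisy measurements (hypothetical data standing in for real observations).
rng = random.Random(0)
times = [k / 10 for k in range(1, 21)]
observed = [law_distance(t) + rng.gauss(0.0, 0.05) for t in times]

# Statistical model: d = c * t^2, with c estimated from the data by least squares.
features = [t * t for t in times]
c_hat = sum(x * d for x, d in zip(features, observed)) / sum(x * x for x in features)
g_hat = 2 * c_hat  # the fitted coefficient implies an estimate of g
print(round(g_hat, 2))
```

The law states the relationship up front, while the statistical model recovers an approximation of it from examples.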
Combination of supervised and unsupervised learning
A small amount of labeled input data is used to create noisy labeled data
With more labeled data, the machine learns how to make input-output predictions
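The points above read like a description of semi-supervised learning, and one common way to realise them is pseudo-labeling. The sketch below is a toy version under stated assumptions: the data are one-dimensional numbers, the classifier is a nearest-centroid rule, and the names `centroids` and `predict` are invented for this example.

```python
def centroids(points, labels):
    """Mean of the points for each class label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda y: abs(x - model[y]))

# Toy data (hypothetical): two labeled examples, six unlabeled ones.
labeled_x = [1.0, 9.0]
labeled_y = ["low", "high"]
unlabeled_x = [0.5, 1.5, 2.0, 8.0, 8.5, 10.0]

# Step 1: train on the few labeled examples.
model = centroids(labeled_x, labeled_y)
# Step 2: use that model to create (possibly noisy) labels for the unlabeled data.
pseudo_y = [predict(model, x) for x in unlabeled_x]
# Step 3: retrain on the labeled and pseudo-labeled data together.
model = centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print(predict(model, 4.0))  # low
```

In practice only confident pseudo-labels are usually kept, since wrong ones can reinforce the first model's mistakes.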
Often called fine-tuning
A model trained for one task is re-purposed (tuned) for a different but related task
Useful in tasks lacking abundant data
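The idea of reusing a trained model when data is scarce can be sketched with a tiny linear model. The two tasks, the learning rate, and the step counts below are assumptions made up for illustration: a source task with plenty of data (y = 2x + 1) and a related target task with only three examples (y = 2x + 1.5).

```python
def train(data, w=0.0, b=0.0, lr=0.1, steps=300):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(data, w, b):
    """Mean squared error of the model on the given (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Source task (hypothetical): abundant data.
source = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]
# Target task (hypothetical): related relationship, only three training examples.
target_train = [(x, 2 * x + 1.5) for x in (-1.0, 0.0, 1.0)]
target_test = [(x, 2 * x + 1.5) for x in (-2.0, -0.5, 0.5, 2.0)]

# Pre-train on the source task, then fine-tune briefly on the target task.
pre_w, pre_b = train(source)
ft_w, ft_b = train(target_train, w=pre_w, b=pre_b, steps=5)
# Baseline: the same short training budget, but starting from scratch.
sc_w, sc_b = train(target_train, steps=5)

finetuned_mse = mse(target_test, ft_w, ft_b)
scratch_mse = mse(target_test, sc_w, sc_b)
print(finetuned_mse < scratch_mse)  # True
```

With the same few examples and the same short training budget, the model that starts from the source-task weights ends up closer to the target relationship than the one trained from scratch.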
Data about the environment and reward function as input
The machine can perform actions influencing the environment
The machine learns behaviours that result in greater reward
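The three points above match the usual setup of reinforcement learning. As a minimal sketch, tabular Q-learning is used here as one possible algorithm. The corridor environment, its reward of 1 at the last cell, and all hyperparameters are assumptions chosen for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # step left or step right

def step(state, action):
    """Environment: move within the corridor; reward 1 only on reaching the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by acting in the environment and observing rewards."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state].values())  # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            # Move the estimate toward reward plus discounted best future value.
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
# Greedy behaviour learned for each non-terminal cell (+1 means "step right").
policy = {s: max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)}
print(policy)
```

The learned policy steps right in every cell, which is the behaviour that reaches the reward soonest.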