Training Data

Tutorial Preview

1

Introduction & Learning Objectives

Learning Objectives

Define 'training data' and explain its critical role in machine learning.

Differentiate between labeled and unlabeled data with clear examples.

Explain the importance of data quality, including quantity, diversity, and accuracy.

Identify potential sources of bias in a given training dataset.

Describe the purpose of splitting data into training, validation, and testing sets.

Provide at least two real-world examples of how training data is used to power AI applications.

How does your favorite music app seem to know exactly what new song you'll love? 🎶 It's not magic; it's all about the data it learned from! Artificial Intelligence learns from examples, just like you do when you study for a test. In this lesson, we'll explore...

2

Key Concepts & Vocabulary

Term: Training Data
Definition: A collection of examples, such as images, text, or numbers, used to teach an AI model how to make predictions or decisions.
Example: A dataset of 10,000 images, where 5,000 are labeled 'cat' and 5,000 are labeled 'dog', used to train a model to recognize cats and dogs.

Term: Labeled Data
Definition: Data where each piece of information is tagged with a correct answer or 'label'. This is used for supervised learning.
Example: An email message (the data) that is tagged with the label 'Spam'. The model learns the connection between the email's content and the 'Spam' label.

Term: Unlabeled Data
Definition: Data that has not been tagged with any labels or answers. It's often used to find hidden patterns or structures within the data.
Example: A large collection of cu...
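To make the difference between labeled and unlabeled data concrete, here is a minimal Python sketch. The example emails, the names `labeled_emails` and `unlabeled_emails`, and the `count_labels` helper are all invented for illustration; they are not part of the tutorial.

```python
from collections import Counter

# Labeled data: each example is paired with a correct answer (its label).
# These emails and labels are made up for illustration.
labeled_emails = [
    ("Win a free prize now!!!", "Spam"),
    ("Meeting moved to 3pm", "Not Spam"),
    ("Claim your reward today", "Spam"),
    ("Lunch tomorrow?", "Not Spam"),
]

# Unlabeled data: the same kind of examples, but with no answers attached.
unlabeled_emails = [
    "Your package has shipped",
    "Limited time offer, click here",
]


def count_labels(examples):
    """Count how many examples carry each label (hypothetical helper)."""
    return Counter(label for _text, label in examples)


label_counts = count_labels(labeled_emails)
print(label_counts)
```

Counting labels like this is a quick way to see whether a labeled dataset is balanced, as in the 5,000 'cat' and 5,000 'dog' example above. A supervised model could learn from `labeled_emails` directly, while `unlabeled_emails` would first need labels added.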

3

Core Syntax & Patterns

GIGO: Garbage In, Garbage Out

Model Quality ≈ Data Quality

This is the most important rule in machine learning. The performance of an AI model is fundamentally limited by the quality of its training data. If you train a model on incorrect, biased, or incomplete data, it will make incorrect, biased, or incomplete predictions.

The 80/20 Split (or 80/10/10)

Dataset = Training Set (≈80%) + Testing Set (≈20%)

You should never test your model on the same data it was trained on. We split the data so the model learns from the 'training set' and we evaluate its performance on the 'testing set', which it has never seen before. Sometimes, a third 'validation set' is used for tuning.

The Data Diversity Principle

Training Data must reflect Real-World D...
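The 80/20 and 80/10/10 splits described above can be sketched in a few lines of Python. This is one possible approach using only the standard library, not the tutorial's own code; the function name `split_dataset`, its parameters, and the stand-in dataset of 100 numbers are assumptions made for illustration.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.0, seed=0):
    """Shuffle a copy of `examples` and split it into train/validation/test lists."""
    if not 0 < train_frac < 1 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(examples)  # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over is held out for testing
    return train, val, test


data = list(range(100))  # stand-in for 100 training examples

# 80/20 split: no validation set.
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))

# 80/10/10 split: a validation set is carved out for tuning.
train, val, test = split_dataset(data, train_frac=0.8, val_frac=0.1)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters: if the data is sorted (for example, all 'cat' images first), an unshuffled split would give the model a training set that does not resemble the test set. Because the three lists are slices of one shuffled copy, no example appears in more than one of them.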

4 more steps in this tutorial


Sample Practice Questions

Challenging

A school builds an AI to predict student dropout risk using grades and attendance data. The model performs poorly. Reflecting on the 'Assuming Data is Perfect' pitfall, what crucial investigation might they have missed that could lead to a 'Garbage In, Garbage Out' problem?

A. They did not use a fast enough computer to train the model.

B. They did not split the data into training and testing sets.

C. They did not consider that the data might be too large.

D. They did not check for data entry errors, missing records, or consider if other factors (like socioeconomic status) were unfairly excluded.

Challenging

You are given a large, unlabeled dataset of customer reviews and are tasked with building a supervised AI model to classify them as 'Happy', 'Neutral', or 'Angry'. What is the most critical first step you must take with this data?

A. Manually read a large portion of the reviews and add the correct 'Happy', 'Neutral', or 'Angry' label to each one.

B. Immediately feed the data into a machine learning algorithm to see what happens.

C. Split the data into training and testing sets.

D. Count the number of words in each review to create a feature.

Challenging

A bank's loan approval AI, trained on 50 years of historical data, is found to be biased against applicants from a certain neighborhood. This is likely because the historical data reflects past discriminatory practices. What is the BEST first step to mitigate this bias?

A. Delete all data related to the affected neighborhood.

B. Use an even faster computer to retrain the model on the same data.

C. Create a new rule that automatically approves everyone from that neighborhood.

D. Audit the dataset to identify and address the sources of bias, potentially by collecting new, more representative data or using advanced techniques to re-weight the existing data.

