Computer Science
Grade 12
20 min
Data Collection
1
Introduction & Learning Objectives
Learning Objectives
Differentiate between primary and secondary data collection methods in a computational context.
Design a data collection plan for a given research question, specifying the method, tools, and data format.
Identify and mitigate potential sources of bias (e.g., sampling, selection, confirmation) in a data collection process.
Implement a simple data collection script using an API, respecting rate limits and terms of service.
Analyze the ethical implications of a data collection strategy, including privacy, consent, and data security.
Explain the purpose and structure of a `robots.txt` file in the context of automated web data collection.
How does a self-driving car collect terabytes of data every day to learn how to navigate a busy street? 🚗 Let's explo...
2
Key Concepts & Vocabulary
Term: API (Application Programming Interface)
Definition: A set of rules and protocols that allows different software applications to communicate with each other. In data collection, APIs are used to programmatically request and receive structured data from a service (e.g., Twitter API, OpenWeatherMap API).
Example: Using the GitHub API to fetch a list of all repositories for a specific user, receiving the data in a structured JSON format.

Term: Web Scraping
Definition: The process of using bots or scripts to automatically extract large amounts of data from websites. This is often done when a formal API is not available.
Example: A Python script that visits a movie review website, parses the HTML of the page, and extracts the movie title, rating, and user comments into a CSV file.

Term: Sampling Bias
Definition: A type of bias that occurs whe...
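The web scraping example above (parse a review page, extract title, rating, and comment, write CSV) can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the HTML fragment, the class names `review`, `title`, `rating`, and `comment`, and the helper `reviews_to_csv` are all assumptions invented for this example. A real site would need its own selectors, and its terms of service should be checked first.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real review site will use different markup.
SAMPLE_HTML = """
<div class="review">
  <h2 class="title">Example Movie</h2>
  <span class="rating">4.5</span>
  <p class="comment">Great pacing and a clever ending.</p>
</div>
<div class="review">
  <h2 class="title">Another Film</h2>
  <span class="rating">3.0</span>
  <p class="comment">Slow start, but worth it.</p>
</div>
"""

FIELDS = ("title", "rating", "comment")


class ReviewParser(HTMLParser):
    """Collects the text of elements whose class is one of FIELDS."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per review
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "review":
            self.rows.append({})        # start a new review record
        elif css_class in FIELDS and self.rows:
            self._field = css_class     # the next text belongs to this field

    def handle_data(self, data):
        if self._field is not None:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None              # stop capturing at any closing tag


def reviews_to_csv(html):
    """Parse the HTML and return the extracted reviews as CSV text."""
    parser = ReviewParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


csv_text = reviews_to_csv(SAMPLE_HTML)
print(csv_text)
```

The CSV writer quotes the second comment automatically because it contains a comma, which is one reason to use the `csv` module instead of joining strings by hand.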
3
Core Syntax & Patterns
API Rate Limiting Pattern
while (url != null) {
    response = make_api_request(url, params);
    process_data(response.data);
    url = response.next_page_url;  // null when there are no more pages
    if (url != null) {
        wait(time_to_respect_rate_limit);  // pause only before another request
    }
}
When collecting data from an API, services impose a 'rate limit'—a maximum number of requests allowed in a given time period. This pseudocode pattern shows a loop that makes a request, processes the data, and then intentionally pauses before making the next request to avoid being blocked by the server.
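The pattern above can be sketched as runnable Python. This is one possible shape under stated assumptions, not a specific service's client: the function `collect_all_pages`, the `fetch_page` callable, the `"data"` and `"next_page_url"` keys, and the `FAKE_PAGES` stand-in are all hypothetical names chosen for this example. The correct pause length must come from the documentation of the API being used.

```python
import time


def collect_all_pages(fetch_page, first_url, min_interval=1.0, sleep=time.sleep):
    """Follow 'next page' links, pausing between consecutive requests.

    fetch_page(url) must return a dict with "data" (a list) and
    "next_page_url" (a URL string, or None on the last page).
    """
    results = []
    url = first_url
    while url is not None:
        page = fetch_page(url)
        results.extend(page["data"])     # stands in for process_data
        url = page.get("next_page_url")
        if url is not None:
            sleep(min_interval)          # pause only if another request follows
    return results


# Stand-in for a real HTTP call so the sketch runs offline (assumed data).
FAKE_PAGES = {
    "page1": {"data": [1, 2], "next_page_url": "page2"},
    "page2": {"data": [3, 4], "next_page_url": "page3"},
    "page3": {"data": [5], "next_page_url": None},
}

pauses = []  # record requested pauses instead of actually sleeping
items = collect_all_pages(
    FAKE_PAGES.__getitem__, "page1", min_interval=0.5, sleep=pauses.append
)
print(items)   # [1, 2, 3, 4, 5]
print(pauses)  # [0.5, 0.5]
```

Passing the fetch and sleep functions as parameters keeps the loop testable without network access. In a real script, `fetch_page` would wrap an HTTP client call and the default `time.sleep` would be left in place. Note that this version skips the pause after the final page, since no further request follows.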
Robots.txt Protocol
User-agent: *
Disallow: /private/
Disallow: /search
Sitemap: https://example.com/sitemap.xml
Before scraping a website, a well-behaved script must first check the `robots.txt` file (e.g., `www.example.com/robots.txt`). This file specifies which parts of t...
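A script can perform this check with Python's standard `urllib.robotparser` module. The sketch below parses the example rules shown above from a string, so it makes no network request. The crawler name `MyCourseBot` and the three test paths are assumptions made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, held in a string for offline use.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly; no download

USER_AGENT = "MyCourseBot"  # hypothetical crawler name

for path in ("/about", "/private/report.html", "/search?q=data"):
    url = "https://example.com" + path
    allowed = parser.can_fetch(USER_AGENT, url)
    print(path, "->", "allowed" if allowed else "disallowed")

# Sitemap URLs declared in the file (available in Python 3.8 and later).
print(parser.site_maps())
```

Against a live site, the same parser can load the file itself by calling `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, after which `can_fetch` is used in the same way before each request.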
Sample Practice Questions
Challenging
You are tasked with designing a data collection plan for an A/B test of a mobile app's 'Sign Up' button color (blue vs. green) to determine which color increases user registrations. Which of the following plans is the most robust and methodologically sound?
A. Randomly assign each new user to see either the blue or green button. Log events for 'button_view', 'button_click', and 'registration_complete' as JSON objects to an analytics server. Collect data for two weeks, then compare the registration rate (registrations/views) for each group.
B. Show all users the blue button for one week and log registrations. Then, show all users the green button for the next week and log registrations. Compare the total number of registrations from each week.
C. Ask users in a pop-up survey which button color they prefer and use the majority vote to determine the winner.
D. Release the green button and only collect data on 'registration_complete' events. If the number of registrations is higher than the historical average with the blue button, conclude that green is better.
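For reference, the metric named in option A, registration rate (registrations/views), could be computed from JSON event logs roughly as follows. This is only a sketch: the field names `variant` and `event`, the function `registration_rates`, and every log line are invented for illustration and are not real experiment data.

```python
import json
from collections import Counter

# Made-up log lines, one JSON object per event (assumed schema).
EVENT_LOG = [
    '{"user_id": 1, "variant": "blue", "event": "button_view"}',
    '{"user_id": 1, "variant": "blue", "event": "registration_complete"}',
    '{"user_id": 2, "variant": "blue", "event": "button_view"}',
    '{"user_id": 3, "variant": "blue", "event": "button_view"}',
    '{"user_id": 4, "variant": "blue", "event": "button_view"}',
    '{"user_id": 5, "variant": "green", "event": "button_view"}',
    '{"user_id": 5, "variant": "green", "event": "registration_complete"}',
    '{"user_id": 6, "variant": "green", "event": "button_view"}',
    '{"user_id": 6, "variant": "green", "event": "registration_complete"}',
    '{"user_id": 7, "variant": "green", "event": "button_view"}',
    '{"user_id": 8, "variant": "green", "event": "button_view"}',
]


def registration_rates(log_lines):
    """Return {variant: registrations / views} for each variant with views."""
    views = Counter()
    registrations = Counter()
    for line in log_lines:
        record = json.loads(line)
        if record["event"] == "button_view":
            views[record["variant"]] += 1
        elif record["event"] == "registration_complete":
            registrations[record["variant"]] += 1
    return {variant: registrations[variant] / views[variant] for variant in views}


rates = registration_rates(EVENT_LOG)
print(rates)  # {'blue': 0.25, 'green': 0.5}
```

Dividing by views, instead of comparing raw registration counts, keeps the comparison fair when the two groups are not exactly the same size.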
Easy
What is the primary purpose of a `robots.txt` file in the context of automated data collection?
A. To provide a sitemap of the website for easier navigation by bots.
B. To specify rules for web crawlers, indicating which parts of the site should not be accessed.
C. To encrypt the data transmitted between a web scraper and the server.
D. To list the API endpoints available for programmatic data access.
Easy
A computer science student uses a public dataset of historical stock prices from a government website to train a prediction model. What type of data collection method is this?
A. Primary data collection
B. Secondary data collection
C. Experimental data collection
D. Observational data collection