Link Lab: https://www.skills.google/course_templates/627/labs/612995
Challenge scenario
You have started a new role as a Data Engineer for TaxiCab Inc. You are expected to import some historical data to a working BigQuery dataset, and build a basic model that predicts fares based on information available when a new ride starts. Leadership is interested in building an app and estimating for users how much a ride will cost. The source data will be provided in your project.
You are expected to have the skills and knowledge for these tasks, so don't expect step-by-step guides to be provided.
Task 1: Clean Your Training Data
We need to copy data from the raw table into a clean training table. We will filter out anomalies like trips with zero distance, zero passengers, impossible fares, and coordinates that fall into the ocean (Null Island). We also combine tolls and fares to create the target fare_amount_617 column (The lab will be specific column name also table name and the conditions).
1. Open the BigQuery SQL Workspace.
2. Paste the following query and click Run. (Don't forget to check the lab specific require for clean the data).
CREATE OR REPLACE TABLE `taxirides.taxi_training_data_243` AS
SELECT
pickup_datetime,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
passenger_count AS passengers,
(tolls_amount + fare_amount) AS fare_amount_617
FROM
`taxirides.historical_taxi_rides_raw`
WHERE
RAND() < 0.001
AND trip_distance > 0
AND fare_amount < 3
AND passenger_count > 0
AND pickup_longitude > -78 AND pickup_longitude < -70
AND dropoff_longitude > -78 AND dropoff_longitude < -70
AND pickup_latitude > 37 AND pickup_latitude < 45
AND dropoff_latitude > 37 AND dropoff_latitude < 45Task 2: Create a BigQuery ML Model
Now, we will train a Linear Regression model on clean data. Notice how we use a WITH clause to engineer the euclidean distance feature on the fly before feeding it into the model. With also need to use TRANSFORM to reduce RMSE. Don't forgot to change the fare_model to follow the lab specific.
1. Clear your SQL Editor.
2. Paste the following query and click Run.
CREATE OR REPLACE MODEL `taxirides.fare_model_251`
TRANSFORM(
fare_amount_617,
CAST(EXTRACT(DAYOFWEEK FROM pickup_datetime) AS STRING) AS dayofweek,
CAST(EXTRACT(HOUR FROM pickup_datetime) AS STRING) AS hourofday,
ST_Distance(ST_GeogPoint(pickuplon, pickuplat), ST_GeogPoint(dropofflon, dropofflat)) AS euclidean
)
OPTIONS(
model_type='linear_reg',
input_label_cols=['fare_amount_617']
) AS
SELECT * FROM `taxirides.taxi_training_data_243`;To check the RMSE, we can use this query to verify. (Don't forgot to change fare_model and taxi_training_data name)
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL `taxirides.fare_model_946`,
(SELECT * FROM `taxirides.taxi_training_data_243`)
);Task 3: Perform Batch Predictions on New Data
Finally, leadership wants we to predict fares for a new batch of 2015 data. We will use the ML.PREDICT function. Crucially, we must apply the exact same euclidean distance calculation to this prediction data so our model knows how to read it.
1. Clear your SQL Editor.
2. Paste the following query and click Run. (Don't forgot to change the fare_model to match the lab require.)
CREATE OR REPLACE TABLE `taxirides.2015_fare_amount_predictions` AS
SELECT * FROM ML.PREDICT(MODEL `taxirides.fare_model_946`,
(SELECT * FROM `taxirides.report_prediction_data`)
);Congratulations! Complete the lab.