Lab Link: https://www.skills.google/course_templates/1232/labs/598657
Challenge Scenario
You are a data scientist at Cymbal, an online retail store. You want to build a pipeline to constantly search for similar products on the market to inform a marketing comparison study. You have a few challenges:
How to handle multimodal data: The data you have collected is multimodal including text, image, and video, including some files in Cloud Storage.
How to perform a semantic similarity search instead of a keyword search: You want to find similar products across multiple dimensions (e.g., image, description, and specific features), where keyword search may not be effective.
How to use BigQuery to do it: Since most of your data is already in BigQuery, using the same tool could minimize the learning curve.
To address these challenges, you decide to implement multimodal vector search with BigQuery.
Task 1: Set up the Environment & Connection
0. Go to BigQuery, then Click on Connections.
1. Create Connection: In the BigQuery, click Create Connection.
* Connection type: Vertex AI remote models, remote functions and BigLake (Cloud Resource).
* Connection ID: vector_conn (or as specified in the lab).
* Location Type: us-west1 (or as specified in the lab).
2. Grant Permissions: Once created, click on the connection ID. Copy the Service Account ID (looks like an email).
3. IAM Policy: Go to IAM & Admin > IAM. Click Grant Access.
* New principals: Paste the Service Account ID.
* Role: BigQuery Data Owner.
* Role: Vertex AI User.
* Role: Storage Object Viewer.
Task 2: Create the Object Table
Create an object table that allows BigQuery to treat files in a GCS buckets as rows in a table. Lab will be provided the query to use. Then replace [] value in the query. Run the query in BigQuery SQL Editor.
Create or replace external table `[PROJECT_ID].[DATASET_NAME].[OBJECT_TABLE_NAME]`
with connection `[PROJECT_ID].[CONNECTION_REGION].[CONNECTION_NAME]`
options
(
object_metadata='SIMPLE',
uris=['gs://[PROJECT_ID]/*']
)Task 3: Generate Embeddings
Part 1: Connect to the multimodal embeddings model
1. Open the BigQuery SQL Editor.
2. Paste the following query.
3. Replace [] to the lab specific.
Create or replace model
`[PROJECT_ID].[DATASET_NAME].[MODEL_NAME]`
remote with connection `[PROJECT_ID].[CONNECTION_REGION].[CONNECTION_NAME]`
options(
endpoint = `[ENDPOINT_MODEL_NAME]`
);Example.
Create or replace model
`[PROJECT_ID].gcc_bqml_dataset.gcc_embedding`
remote with connection `[PROJECT_ID].us-west1.vector_conn`
options(
endpoint = 'multimodalembedding@001'
);4. Click Run.
Part 2: Generate embeddings
1. Clear BigQuery SQL Editor.
2. Paste the following query.
3. Replace [] to the lab specific.
Create or replace table `[PROJECT_ID].[DATASET_NAME].[EMBEDDINGS_TABLE_NAME]`
as select *, REGEXP_EXTRACT(uri, r'[^/]+$') as product_name
from [EMBEDDINGS_FUNCTION]
(
MODEL `[PROJECT_ID].[DATASET_NAME].[MODEL_NAME]`,
TABLE `[PROJECT_ID].[DATASET_NAME].[OBJECT_TABLE_NAME]`
)Example.
Create or replace table `[PROJECT_ID].gcc_bqml_dataset.gcc_retail_store_embeddings`
as select *, REGEXP_EXTRACT(uri, r'[^/]+$') as product_name
from ML.GENERATE_EMBEDDING
(
MODEL `[PROJECT_ID].gcc_bqml_dataset.gcc_embedding`,
TABLE `[PROJECT_ID].gcc_bqml_dataset.gcc_image_object_table`
)4. Click Run.
Task 4. Run a vector search
1. Clear BigQuery SQL Editor.
2. Paste the following query.
3. Replace [] to the lab specific.
Create or replace table `[PROJECT_ID].[DATASET_NAME].[SEARCH_RESULTS_TABLE]` AS
select base.uri,
base.product_name,
base.content_type,
distance
from
[VECTOR_SEARCH_FUNCTION](table [DATASET_NAME].[EMBEDDINGS_TABLE_NAME],'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result as embedding_col
FROM
[EMBEDDINGS_FUNCTION]
(
MODEL `[DATASET_NAME].[MODEL_NAME]`,
(select 'Men Sweaters' as content),
STRUCT(TRUE AS flatten_json_output)
)
),
[STATEMENT_TO_SELECT_TOP_2_RESULTS],
distance_type => 'COSINE'
);Example.
Create or replace table `[PROJECT_ID].gcc_bqml_dataset.gcc_vector_search_table` AS
select base.uri,
base.product_name,
base.content_type,
distance
from
VECTOR_SEARCH(table gcc_bqml_dataset.gcc_retail_store_embeddings,'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result as embedding_col
FROM
ML.GENERATE_EMBEDDING
(
MODEL `gcc_bqml_dataset.gcc_embedding`,
(select 'Men Sweaters' as content),
STRUCT(TRUE AS flatten_json_output)
)
),
top_k => 2,
distance_type => 'COSINE'
);Done the lab.