The answers to this question explain how to randomly sample from a BigQuery table. Is there an efficient way to do this with replacement?
As an example, suppose I have a table with 1M rows and I wish to select 100K independently random sampled rows.
The answers to this question explain how to randomly sample from a BigQuery table. Is there an efficient way to do this with replacement?
As an example, suppose I have a table with 1M rows and I wish to select 100K independently random sampled rows.
Found a neat solution:
Code:
# randomly sample 100K rows from `table` with replacement
with large_table as (select *, row_number() over() as rk from `table`),
num_elements as (select count(1) as n from large_table),
dummy_table as (select 1 + cast(rand() * (select n - 1 from num_elements) as int64) as i from unnest(generate_array(1, 100000)))
select * from dummy_table
inner join large_table on dummy_table.i = large_table.rk