22

I would like to be able to generate random bytea fields of arbitrary length (<1Gb) for populating test data.

What is the best way of doing this?

Evan Carroll
  • 63,051
  • 46
  • 242
  • 479
Jack Douglas
  • 39,869
  • 15
  • 101
  • 176

3 Answers3

26

Enhancing Jack Douglas's answer to avoid the need for PL/PgSQL looping and bytea concatenation, you can use:

CREATE OR REPLACE FUNCTION random_bytea(bytea_length integer)
RETURNS bytea AS $body$
    SELECT decode(string_agg(lpad(to_hex(width_bucket(random(), 0, 1, 256)-1),2,'0') ,''), 'hex')
    FROM generate_series(1, $1);
$body$
LANGUAGE 'sql'
VOLATILE
SET search_path = 'pg_catalog';

It's a simple SQL function that's cheaper to call than PL/PgSQL.

The difference in performance due to the changed aggregation method is immense for larger bytea values. Though the original function is actually up to 3x faster for sizes < 50 bytes, this one scales much better for larger values.

Or use a C extension function:

I've implemented a random bytea generator as a simple C extension function. It's in my scrapcode repository on GitHub. See the README there.

It nukes the performance of the above SQL version:

regress=# \a
regress=# \o /dev/null
regress=# \timing on
regress=# select random_bytea(2000000);
Time: 895.972 ms
regress=# drop function random_bytea(integer);
regress=# create extension random_bytea;
regress=# select random_bytea(2000000);
Time: 24.126 ms
Craig Ringer
  • 56,343
  • 5
  • 158
  • 190
  • 1
    Well, I came up with nearly the same solution, but tested only for lower values. There @Jack's solution was a clear winner. +1 for you for not stopping here :) – András Váczi Aug 16 '12 at 04:35
  • Thank you - this is excellent and thought provoking. I think FROM generate_series(0, $1); needs to be FROM generate_series(1, $1);. Have you tried recursion? My limited testing implies that this scales better: – Jack Douglas Aug 16 '12 at 05:45
  • CREATE OR REPLACE FUNCTION random_bytea(bytea_length integer) RETURNS bytea AS $body$ SELECT case when $1<1000 then decode(string_agg(lpad(to_hex(width_bucket(random(), 0, 1, 256)-1),2,'0') ,''), 'hex') else random_bytea($1/2)||random_bytea($1-$1/2) end FROM generate_series(1, case when $1<1000 then $1 else 0 end); $body$ LANGUAGE 'sql' VOLATILE SET search_path = 'pg_catalog'; – Jack Douglas Aug 16 '12 at 05:45
  • @JackDouglas Interesting idea. The original in this answer scales nearly perfectly at O(n) according to measurements. Your original is harder to measure and more variable, but somewhere in the order of O(n log n). Will check your recursive version – Craig Ringer Aug 16 '12 at 06:12
  • @JackDouglas I don't see a ton of difference between the recursive and non-recursive approaches on my 9.1 here. Both are about O(n), with the non-recursive 20-30% faster for any given n. You have to remove the search_path setting or add public to it to make the recursion work, btw. – Craig Ringer Aug 16 '12 at 06:23
  • 2
    I tried symlinking /dev/urandom into /var/lib/pgsql/data and reading it with pg_read_file() for bonus crazy points, but unfortunately pg_read_file() reads text input via an encoding conversion, so it can't read bytea. If you really want max speed, write a C extension function that uses a fast pseudo-random number generator to produce binary data and wrap a bytea datum around the buffer :-) – Craig Ringer Aug 16 '12 at 06:29
  • That had crossed my mind too :-) Thanks for your effort on this. – Jack Douglas Aug 16 '12 at 06:38
  • 1
    @JackDouglas I couldn't help it. C extension version of random_bytea. https://github.com/ringerc/scrapcode/tree/master/postgresql/random_bytea – Craig Ringer Aug 16 '12 at 08:57
  • You are still returning 1 byte too many with the SQL version ;-) – Jack Douglas Aug 16 '12 at 10:37
  • @JackDouglas Whoops, you're right. Fixed. Glad you enjoyed the C extension version. I've been doing a little with Pg's backend lately anyway so it wasn't too much of a stretch. Fun :-) – Craig Ringer Aug 16 '12 at 10:41
  • 1
    Another excellent answer! Actually one of the best I've seen so far. I haven't tested the extension, but I trust it works as advertised. – Erwin Brandstetter Aug 16 '12 at 22:42
  • I did a like-for-like test on dbfiddle (my new toy) of your PL/PgSQL versus mine. It clearly demonstrates the much better scaling of your function. – Jack Douglas Mar 02 '17 at 10:56
9

The pgcrypto extension has gen_random_bytes(count integer):

test=# create extension pgcrypto;
test=# select gen_random_bytes(16);
          gen_random_bytes
------------------------------------
 \xaeb98ae41489460c5292aafade4498ee
(1 row)

The create extension only needs to be done once.

Martin Tournoij
  • 225
  • 2
  • 9
6

I would like to be able to generate random bytea fields of arbitrary length

This function will do it, but 1Gb will take a long time because it does not scale linearly with output length:

create function random_bytea(p_length in integer) returns bytea language plpgsql as $$
declare
  o bytea := '';
begin 
  for i in 1..p_length loop
    o := o||decode(lpad(to_hex(width_bucket(random(), 0, 1, 256)-1),2,'0'), 'hex');
  end loop;
  return o;
end;$$;

output test:

select random_bytea(2);

/*
|random_bytea|
|:-----------|
|\xcf99      |
*/

select random_bytea(10);

/*
|random_bytea          |
|:---------------------|
|\x781b462c3158db229b3c|
*/

select length(random_bytea(100000))
     , clock_timestamp()-statement_timestamp() time_taken;

/*
|length|time_taken     |
|-----:|:--------------|
|100000|00:00:00.654008|
*/

dbfiddle here

Jack Douglas
  • 39,869
  • 15
  • 101
  • 176
  • Nice use of width_bucket. Handy. – Craig Ringer Aug 16 '12 at 00:22
  • 1
    I've enhanced your approach to avoid the PL/PgSQL and expensive concatenation loop; see new answer. By using string_agg over generate_series instead of a PL/PgSQL concatenation loop on bytea I'm seeing a 150-fold improvement in performance. – Craig Ringer Aug 16 '12 at 00:38