2

I have a procedure that loops through a query that returns a set of dates, recreates a set of views using the date in the WHERE clause, and the calls a series of other procedures that use these views. So, if the loop contains three dates, it runs the procedures with each date in turn coded into the WHERE clause of the views. The tables that the views select are partitioned on the date in the WHERE clause, the views exist in order to eliminate all the unnecessary partitions.

On the first iteration around the loop, everything is fine. The views get created, the procedures get called, and the data is processed.

On the second and subsequent iterations, the procedures behave as if the views still had the date from the first iteration. After the loop finishes, the views have the last of the dates in their WHERE clauses, but the procedures had clearly just been processing the first date over and over.

I have created a trivial test that just creates a view, selects the count from it, creates it again with a different date and selects the count from that, and I get the two correct counts, so redefining and using a view within a procedure clearly is possible.

Any ideas why it doesn't work in a loop?

Here is a demonstration script, the first run should process 2 records and the second should process 4, but both runs process 2:

create schema dv;

create schema rpt;

drop table dv.batch_data;
create table dv.batch_data
(
  std_date date
, dv_load_dt timestamp with time zone
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

insert into dv.batch_data values( '2016-01-02'::date, current_timestamp );
insert into dv.batch_data values( '2016-01-04'::date, current_timestamp );

drop table dv.table_1;
create table dv.table_1
(
  std_date date
, col_1 text
, col_2 text
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

insert into dv.table_1 values( '2016-01-01'::date, 'First',   'Record' );
insert into dv.table_1 values( '2016-01-02'::date, 'Second',  'Record' );
insert into dv.table_1 values( '2016-01-02'::date, 'Third',   'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Fourth',  'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Fifth',   'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Sixth',   'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Seventh', 'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Eighth',  'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Ninth',   'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Tenth',   'Record' );

create table rpt.stage_table_a
(
  std_date date
, col text
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

CREATE OR REPLACE FUNCTION rpt.custom_proc_rpt_loop()
  RETURNS integer AS
$BODY$
DECLARE
  v_count       integer;         

  rec           RECORD;
  v_days        integer;
  v_query       varchar(10000);
  v_return_msg  varchar(256);

BEGIN
  v_count := 0;
  v_days := 0;

  BEGIN

  FOR rec IN
    SELECT DISTINCT std_date
    FROM dv.batch_data
    ORDER BY std_date
  LOOP
    v_days := v_days + 1;

    v_query := 'CREATE OR REPLACE VIEW rpt.view_batch_table_1 AS ' ||
               'SELECT std_date ' ||
                    ', col_1 ' ||
                    ', col_2 ' ||
               'FROM dv.table_1 ' || 
               'WHERE std_date = ''' || rec.std_date || '''';

    RAISE NOTICE '%',v_query;
    EXECUTE v_query;

    v_count := v_count + 1;

    PERFORM rpt.update_stage_table_a();

    ANALYZE rpt.stage_table_a;

  END LOOP;

  END;

  v_return_msg := 'custom_proc_rpt_loop completed. '
    || CAST(v_days AS VARCHAR(64)) || ' loops performed. ';

  RAISE NOTICE '%',v_return_msg;

  RETURN 1;

END;
$BODY$
  LANGUAGE plpgsql VOLATILE;


CREATE OR REPLACE FUNCTION rpt.update_stage_table_a()
  RETURNS integer AS
$BODY$
DECLARE
  v_count             integer;         
  v_return_msg  varchar(256);

  BEGIN

  TRUNCATE rpt.stage_table_a;

  INSERT INTO rpt.stage_table_a
  (
    std_date
  , col
  )
  SELECT DISTINCT 
    std_date
  , COALESCE(col_1,'') || CASE WHEN length(col_1) > 1 AND length(col_2) > 1 THEN ' ' ELSE '' || COALESCE(col_1,'') END
  FROM rpt.view_batch_table_1
  ;

  GET DIAGNOSTICS v_count = row_count;

  v_return_msg := 'stage_table_a completed. '
    || CAST(v_count AS VARCHAR(64)) || ' records processed. ';

  RAISE NOTICE '%',v_return_msg;

  RETURN 1;

END;
$BODY$
  LANGUAGE plpgsql VOLATILE;

select rpt.custom_proc_rpt_loop();

select count(*) from rpt.view_batch_table_1;

In the real world, the inner procedures are generated by a data warehouse automation tool, so I don't want to change them. I can do whatever I want in the calling procedure that contains the loop.

Jack Douglas
  • 39,869
  • 15
  • 101
  • 176
PhilHibbs
  • 539
  • 1
  • 6
  • 20

1 Answers1

2

Not using Greenplum, but guessing from its PostgreSQL heritage it's probably due to the way PL/pgSQL treats SQL statements: like prepared statements. Those are planned and cached on their first use. That means, among other things, that the view is resolved to its underlying tables.

The same plan is reused in subsequent iterations, which typically saves time for re-planning. But it also foils your attempt to recreate views.

There are various ways to fix this. One quick solution is to convert the fooled SQL statements to dynamic SQL as well, to force re-planning every time. So:

...

EXECUTE '
   SELECT * 
    FROM rpt.update_stage_table_a
    ( p_sequence        
    , p_job_name        
    , p_task_name       
    , p_job_id          
    , p_task_id          
    , p_return_msg      
    , p_status
    )'
INTO v_parameters
   , v_return_msg
   , v_status
;

...

Etc.

Related:

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • That would require changing the procedures, which we are trying to avoid since they are generated by a data warehouse automation tool. In other contexts we redefine views and they work, but it only seems to be calling procedures with views redefined in between calls that is the problem. – PhilHibbs Nov 29 '16 at 16:16
  • @PhilHibbs: But that's what you do to fix a bug: you change the procedure. You have to, one way or the other. – Erwin Brandstetter Nov 29 '16 at 16:21
  • Your answer to one of those other issues says: "The query plan for prepared statements is cached for the duration of the session unless any involved objects are changed (including the function itself), which deallocates all depending prepared statements." - surely, I am changing an involved object by redefining the view. – PhilHibbs Nov 29 '16 at 16:39
  • I'm not convinced it's the plan - surely the cached plan will just say that it should do a table scan, or something like that. If the table were partitioned on std_date then the plan would cause it to only look at one partition, but it isn't partitioned. (it will be in the real environment, but that's by the by). – PhilHibbs Nov 29 '16 at 16:50
  • If it were plans, then DISCARD PLANS would be what I want, but that isn't in Greenplum (which is based on PostgreSQL 8.2) – PhilHibbs Nov 29 '16 at 17:05
  • So you reason if it's not in Greenplum then that is not the problem? Discard plans was introduced in pg 8.3. If Greenplum is a fork of 8.2, then they missed the upstream addition. That's normal. They're way behind at this point. It's still scheduled to happen though. Perhaps migrate back to postgresql, if possible. – Evan Carroll Nov 29 '16 at 17:36
  • 1
    It actually looks like Greenplum has support DISCARD PLANS since 2007. Even though it's a fork of 8.2, they backported it. – Evan Carroll Nov 29 '16 at 17:41
  • @PhilHibbs: If Greenplum does in fact have DISCARD PLANS (I doubt it: This merge is yet to be part of an official Greenplum release but it is in the master branch.) use it to determine whether cached plans are the problem. Else test with dynamic SQL to see if the problem goes away. If it's cached plans, dynamic SQL is still substantially more efficient anyway. Discarding all plans can have side effects for performance, and dynamic SQL does not try to save the plan to begin with. – Erwin Brandstetter Nov 30 '16 at 03:02
  • @PhilHibbs: About the depending prepared statements: I am not sure any actual dependency on the view is saved. Might be resolved to underlying tables (like mentioned above). I am not too sure about the behavior of Postgres 8.2 - ish plpgsql. It had several issues in this area which have been fixed in mainline Postgres plpgsql since. – Erwin Brandstetter Nov 30 '16 at 03:07
  • DISCARD PLANS is not in our version of Greenplum, and wrapping the function call in a PERFORM doesn't make any difference either. – PhilHibbs Dec 12 '16 at 11:20
  • Then I am fresh out of ideas, sorry. I would try to replace the views with actual queries ... – Erwin Brandstetter Dec 13 '16 at 15:40