
I'm running a large query that, for various reasons, I've broken into a series of smaller queries. The example below just shows how the query is broken up by id; the actual query is much more complex:

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20 AND id < 30;

etc...

I've written a function in R that generates the queries sequentially and sends them to the database using the R library RPostgreSQL:

library(RPostgreSQL)

num <- seq(0, 100, 10)

query.func <- function(num){

    # each call opens and closes its own connection
    con <- dbConnect(PostgreSQL(), dbname = "name", port = 5432,
                     user = "user_name", password = "password")

    num2 <- num + 10

    # paste()'s default sep = " " keeps the keywords and values apart
    q1 <- paste('UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >=', num, 'AND id <', num2)

    res <- dbSendQuery(con, q1)
    dbClearResult(res)

    dbDisconnect(con)
}

lapply(num, query.func)

Rather than having this long sequence of queries run one after another, I wanted to take advantage of my server's multiple CPUs and run them in parallel. Using the R library parallel, I've created a cluster and sent multiple queries to the cluster simultaneously:

library(parallel)

no_cores <- detectCores() - 1

cl <- makeCluster(no_cores)

# make the function, the data and the DB library available on every worker
clusterExport(cl, c("query.func", "num"))
clusterEvalQ(cl, library(RPostgreSQL))

parLapply(cl, num, query.func)

stopCluster(cl)

Where query.func is defined as above.

When I run this R script at the terminal I receive no errors, and if I check pg_stat_activity I see a list of active queries, each operating on a separate block of data. However, when I use top to check resource usage on my system, I still only see one postgres process, and only one CPU active at a time.
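For reference, this is roughly how I check it (a minimal sketch that reuses the connection settings from query.func; pid, state and query are the pg_stat_activity column names as of 9.3):

library(RPostgreSQL)

# open a separate monitoring connection and list the active backends
con <- dbConnect(PostgreSQL(), dbname = "name", port = 5432,
                 user = "user_name", password = "password")
dbGetQuery(con, "SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active';")
dbDisconnect(con)

Each open connection shows up there as its own backend PID.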

Are these queries really running simultaneously on separate CPUs? My understanding was that, prior to 9.6, a single postgres query could not be split across multiple cores, but each connection could utilize a separate core (related question). Does the process I've outlined above open multiple connections and send a query to the database via each connection?

PostgreSQL 9.3 / Ubuntu 14.04 / R 3.3.2

Matt

2 Answers

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20 AND id < 30;

That's insane, especially if the query is not CPU bound. Just run one query with one update:

UPDATE target_table SET col1_target = col1_source
FROM source_table
WHERE (id BETWEEN 0 AND 10)
  OR (id BETWEEN 10 AND 20)
  OR (id BETWEEN 20 AND 30);

PostgreSQL doesn't lock the table for row updates. It only locks the rows. There is no reason to break this up.

Evan Carroll
    Why do you need the OR then? – mustaccio Jan 09 '17 at 19:00
  • I'm just showing how to join the condition with multiple ranges. You don't need the OR: you can just use (id between 0 and 30). I am just saying that the only way this update can move faster with multiple threads is if the update itself is CPU bound. With 9.6 that sounds really hard to do. – Evan Carroll Jan 09 '17 at 19:02
  • Thanks but the actual query is much more complex. I just provided that as an example. I'll modify the question to make that more clear. – Matt Jan 09 '17 at 19:39
  • I think that there's something wrong here: How are the source and the target tables related? Are they supposed to have the same id? If so, the query should be `WHERE (source_table.id = target_table.id) AND (... whatever)`. – joanolo Jan 09 '17 at 21:08

In a different context, running functions in parallel worked fine for me using the future.apply package.

In your case it should look similar to this:

library(future.apply)
plan(multisession)
future_lapply(num, query.func)

The future_lapply() function will then start multiple R sessions and execute your function query.func() in parallel.
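If you want to control the number of workers explicitly, something like this should work (a sketch; availableCores() also ships with the future package):

library(future.apply)

# one background R session per worker, leaving one core free
plan(multisession, workers = availableCores() - 1)
res <- future_lapply(num, query.func)
plan(sequential)  # shut the workers down again

Since query.func() opens and closes its own connection, each worker ends up with its own database session.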

(Maybe also check the CRAN Task View on High-Performance Computing, if you haven't already.)

André