
What is a good way to insert/copy plain files into a Postgres table, preferably using the psql command line?

In my case the files are a bunch of emails from Maildir archives, so I tried to use COPY:

psql -c "COPY emails (data) FROM '/tmp/emailfile' WITH (FORMAT text);" emails

which I would use in a for-loop shell script (for file in $(ls dir); do psql ...; done).

However, I have trouble finding a good "delimiter" that couldn't possibly appear in the file, and I keep getting this error: ERROR: extra data after last expected column.
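
For illustration, a hypothetical minimal repro: with FORMAT text the default delimiter is a tab, so a single literal tab in the file already makes COPY look for a second column:

printf 'a line with a\ttab in it\n' > /tmp/emailfile
psql -c "COPY emails (data) FROM '/tmp/emailfile' WITH (FORMAT text);" emails
# ERROR:  extra data after last expected column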

So I considered using the COPY ... FORMAT binary version with a BYTEA field in the db (then converting the column to TEXT inside the db), but that requires a file header and trailer, which I have no easy way of building on the fly.

Is there an easy way to do this from the commandline, or do I need to write a python script for this?

nyov
  • I think seeing a sample of your input file and the target table definition might be helpful. – mustaccio Nov 15 '19 at 23:57
  • @mustaccio, as stated, the input files are raw emails. Just look at any email source. So in effect, utf-8 encoded text files with any possible kind of content. The table is called emails, with a column named mail_data of type TEXT or BYTEA; but that can be changed. – nyov Nov 16 '19 at 00:07
  • I'm not sure how I missed this earlier, but: Related: https://dba.stackexchange.com/q/1742 and another one: https://stackoverflow.com/a/9786524/9214854 (once you have the answer, you know what to search for ...) – nyov Nov 25 '19 at 11:41

2 Answers


Server-side files

Assuming the account has the pg_read_server_files role (or is a superuser), that the files are accessible on a filesystem mounted on the server, and that their paths have already been collected into the table, this will grab the contents efficiently:

UPDATE emails SET mail_data = pg_read_binary_file(emails.fullpath);

It is more efficient than using large objects as an intermediate storage area.
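
As a sketch of the full server-side flow (names and paths are illustrative: a fullpath column, a Maildir under /tmp/maildir, find(1) available on the client, and paths containing no tabs or backslashes):

-- hypothetical table: one row per file, path first, contents filled in later
CREATE TABLE emails (fullpath text PRIMARY KEY, mail_data bytea);

-- collect the paths client-side; \copy ... FROM PROGRAM runs the program on the client
psql -c "\copy emails (fullpath) FROM PROGRAM 'find /tmp/maildir -type f' WITH (FORMAT text)" emails

-- then pull in all file contents with a single server-side statement
psql -c "UPDATE emails SET mail_data = pg_read_binary_file(emails.fullpath);" emails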

Client-side files

When the files are not accessible server-side or the account does not have elevated permissions, a more generic solution with psql may be, for each file:

\set clientpath '/path/to/file'
-- assume clean paths (without any character that would be special to the shell)
\set contents `base64 :clientpath`

insert into emails([other columns...], mail_data)
 values ( [other column values...], decode(:'contents','base64'));

A base64 intermediate representation is used because psql doesn't support parameters in binary form. The :'contents' syntax instructs psql to interpolate the variable into the query as a quoted text literal.
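
A per-file driver loop around this could look as follows (a sketch, assuming psql 10 or later, where :'variable' is also interpolated, shell-quoted, inside backquotes; GNU base64 for the -w0 flag; and the question's emails database and mail_data column):

for f in /tmp/maildir/*; do
  psql -v clientpath="$f" emails <<'SQL'
\set contents `base64 -w0 :'clientpath'`
insert into emails (mail_data) values (decode(:'contents', 'base64'));
SQL
done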

Daniel Vérité
  • +1 Thanks. But I'm sorry I can't accept this as my solution; it looks nice but the server-side version fails with ERROR: absolute path not allowed (and I really don't want to fiddle with moving data into pg's home), and the client-side version is also slower because of the base64 overhead (data+cpu). I'd rather accept the Large Objects overhead. – nyov Nov 25 '19 at 02:00
  • 1
    @nyov: this error does not occur on recent versions of Postgres (>=11) with the pg_read_server_files built-in role. IIRC with older versions creating a soft link from inside PGDATA to an outside directory does the trick. – Daniel Vérité Nov 25 '19 at 06:50
  • Alrighty! I'll revisit that once on PG11. And now I wonder how psycopg handles the binary transfer - whether the base64 transcoding couldn't be replaced with something more efficient in psql? For now, I've scripted the \lo_import variant with psql, and written a python script after all. – nyov Nov 25 '19 at 10:25

This PDF document I found, PGOpen2018_data_loading.pdf, explains data imports using PG's Large Objects support.
It was a bit unclear and slightly cumbersome, so I scripted it to automate the process (see below).

The procedure works as follows:

-- Create table
CREATE TABLE tmp_docs(file_name text PRIMARY KEY);
-- Pull the list of filenames from the folder (this line runs in the shell, not in SQL)
psql -c "\copy tmp_docs FROM PROGRAM 'ls /file/location/data/* -R' WITH (FORMAT csv);" emails
-- Add fields to hold file link ID and blob of the files
ALTER TABLE tmp_docs ADD COLUMN doc bytea, ADD COLUMN doc_oid oid;
-- Add the documents to large object storage and return the link id
UPDATE tmp_docs SET doc_oid = lo_import(file_name);
-- Pull documents from large object storage into table
UPDATE tmp_docs SET doc = lo_get(doc_oid);
-- Delete the files from large object storage
SELECT lo_unlink(doc_oid) FROM tmp_docs;

Note: the last SELECT errored out for me at first; I had to increase shared_buffers and max_locks_per_transaction to cope with my number of files.

[...]
WARNING:  out of shared memory
WARNING:  out of shared memory
ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
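
One way to raise that setting (a sketch; ALTER SYSTEM needs superuser and writes postgresql.auto.conf, both parameters only take effect after a server restart, and 256 and the service name are placeholders):

psql -c "ALTER SYSTEM SET max_locks_per_transaction = 256;" postgres
# shared_buffers can be raised the same way; both need a restart, e.g.:
sudo systemctl restart postgresql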

I've written scripts to automate the whole process for me: a server-side variant and a client-side (remote) version using psql's client-side \lo_* commands.
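
The client-side core is roughly this per-file loop (a sketch, not the exact repo script: psql sets the LASTOID variable after \lo_import, file_name is assumed to hold the same absolute paths, and filenames must not contain quotes; only the lo_import step changes, the lo_get/lo_unlink steps stay as above):

for f in /file/location/data/*; do
  psql emails <<SQL
\lo_import '$f'
UPDATE tmp_docs SET doc_oid = :LASTOID WHERE file_name = '$f';
SQL
done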

They've become slightly too large for a SO code block, so instead they got a GitHub repo here

https://github.com/nyov/postgres-bindata-batchimport

and this is their usage:

mkdir /tmp/randomtext
for OF in `seq 101 120`; do # twenty random files, please
   curl -s http://metaphorpsum.com/paragraphs/5/5 > /tmp/randomtext/${OF};
done

DB="postgres"

# Usage: ./pg_bytea_batchimport_*.sh <dbname> <source-dir> [<tablename>]
# Run script
./pg_bytea_batchimport_cs.sh $DB /tmp/randomtext tmp_docs
# (or) Server-side version; faster, but requires a local db cluster
#./pg_bytea_batchimport_ss.sh $DB /tmp/randomtext tmp_docs

# Verify; should count 20 rows
psql -c "SELECT count(*) FROM tmp_docs" $DB
psql -c "SELECT encode(doc::bytea, 'escape')::text FROM $TABLE LIMIT 1" $DB

# clean up
rm -r /tmp/randomtext
psql -c "DROP TABLE tmp_docs" $DB
unset DB

Comparing client-side vs. server-side script run time, the client-side variant was noticeably slower (about 70% in the run shown below). Timed here on a test dataset of many smallish text files:

time ./pg_bytea_batchimport_ss.sh emails /mnt/stor/

real    2m48.045s
user    0m0.368s
sys 0m0.092s

time ./pg_bytea_batchimport_cs.sh emails /mnt/stor/

real    4m46.319s
user    0m1.944s
sys 0m4.448s
nyov