3

Question: Which column of the fortune500 table has the most missing values? To find out, I have checked each column individually, as shown below, which is very tedious.

SELECT 
    SUM(CASE WHEN ticker IS NULL THEN 1 ELSE 0 END) AS ticker_null_num, 
    SUM(CASE WHEN profits_change IS NULL THEN 1 ELSE 0 END) AS profits_change_null_num,
    SUM(CASE WHEN industry IS NULL THEN 1 ELSE 0 END) AS industry_null_num
FROM fortune500;

I have also run the following query for each column individually:

SELECT count(*) - count(ticker) AS missing
  FROM fortune500;
 etc....

My question: Is there any better/dynamic way of doing this? The current approach is very cumbersome and will take a lot of time, since I have around 50-60 columns in the table. Can somebody help me find the missing values of each column, sorted in descending order, with a good method? For example:

COLUMN_NAME    MISSING_VALUES_COUNT
col1               60
col2               50
col3               45
col4               40
etc.....  
  • Do you know there is a table or view containing the column names (and more) of each table or view in the DB? Also, when you analyze a table, PostgreSQL stores an estimate of what you're asking. – Gerard H. Pille Sep 27 '20 at 00:36
  • @GerardH.Pille Couldn't get you. Can you be more precise? Is this the right platform to ask, or should I ask this on Stack Overflow instead? – Fiona Daniel Sep 27 '20 at 01:45
  • Do it in 2 steps. First use information_schema.columns to generate your query, then execute it (a sketch of this follows below these comments). For your second question, calculate count(*) once and use the result – Lennart - Slava Ukraini Sep 27 '20 at 02:34
  • Why do you mind that it is "tedious" and "cumbersome"? You have to write it only once. I'd like to know to what use you need this information. – Gerard H. Pille Sep 27 '20 at 04:14
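
A minimal sketch of the two-step approach suggested in the comments (my own illustration, not part of the question or answers): the first statement builds the counting query as text from information_schema.columns, and the generated statement is then run as step two. The schema name public is an assumption.

SELECT 'SELECT '
       || string_agg(format('count(*) - count(%1$I) AS %1$I', column_name), ', ')
       || ' FROM public.fortune500;'
FROM   information_schema.columns
WHERE  table_schema = 'public'
AND    table_name   = 'fortune500';

This generates one wide row of per-column counts, like the first query in the question; the answers below show how to get one row per column instead.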

2 Answers

1

You can convert the rows into JSON to dynamically generate one row for each column:

select colname, 
       count(cols.value) as non_null_values,
       (select count(*) from the_table) - count(cols.value) as missing
from the_table t
  cross join jsonb_each(jsonb_strip_nulls(to_jsonb(t))) as cols(colname, value)
group by colname;

But this isn't going to be fast on large tables.
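
Applied to the table from the question, the same approach would look roughly like this (my sketch, with fortune500 substituted for the_table and an ORDER BY added for the requested descending order). Note that a column that is NULL in every row is stripped from the JSON entirely and therefore would not appear in the output at all.

select colname,
       (select count(*) from fortune500) - count(cols.value) as missing
from fortune500 f
  cross join jsonb_each(jsonb_strip_nulls(to_jsonb(f))) as cols(colname, value)
group by colname
order by missing desc;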

  • If you define "the_table" as a CTE, 2 minutes for 7 million rows of 7 columns. A beautiful solution, wish I knew what the problem was. – Gerard H. Pille Sep 27 '20 at 10:37
1

A query of this basic form only uses a single sequential scan and is as efficient as it gets:

SELECT x.*
FROM  (
   SELECT count(*) AS ct
        , count(ticker) AS ticker
        , count(profits_change) AS profits_change
        , count(industry) AS industry
     -- , more?
   FROM   fortune500
   ) t
CROSS  JOIN LATERAL (
   VALUES
      ('ticker', ct - ticker)
    , ('profits_change', ct - profits_change)
    , ('industry', ct - industry)
 -- , more?
   ) x(column_name, missing_values)
   ORDER  BY missing_values DESC, column_name DESC;

This function generates and executes the query for all columns of a given table dynamically:

CREATE OR REPLACE FUNCTION f_count_nulls(_tbl regclass)
  RETURNS TABLE (column_name text, missing_values bigint)
  LANGUAGE plpgsql STABLE PARALLEL SAFE AS
$func$
BEGIN
   RETURN QUERY EXECUTE (
   SELECT format(
   $$
   SELECT x.*
   FROM  (SELECT count(*) AS ct, %s FROM %s) t
   CROSS  JOIN LATERAL (VALUES %s) x(col, nulls)
   ORDER  BY nulls DESC, col DESC
   $$, string_agg(format('count(%1$I) AS %1$I', attname), ', ')
     , $1
     , string_agg(format('(%1$L, ct - %1$I)', attname), ', ')
      )
   FROM   pg_catalog.pg_attribute
   WHERE  attrelid = $1
   AND    attnum > 0
   AND    NOT attisdropped
   -- more filters?
   );
END
$func$;

Call:

SELECT * FROM f_count_nulls('public.fortune500');  -- optionally schema-qualified

Produces the requested result.
All identifiers are handled safely (quoted when required, no SQL injection).
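
Since the function returns an ordinary table result, it can be filtered or limited like any query. For example (my addition, not part of the original answer), to list only columns that actually contain NULLs:

SELECT *
FROM   f_count_nulls('public.fortune500')
WHERE  missing_values > 0;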

db<>fiddle here


  • I'm trying to partition the data from the first query. Where exactly should I use GROUP BY, and where should I add the values I'm grouping by? For now, I created a view and ran it for every instance of it, but it seems that this should work. (Suppose that the column is year_transfer) – Braiam Sep 20 '23 at 19:42
  • @Braiam: Please ask your question as new question. Doesn't fit into a comment. You can always link to this Q or A for reference. And you can drop a comment here to link back / get my attention. – Erwin Brandstetter Sep 20 '23 at 20:45