  1. What are the cons of using the below approach?
  2. What steps can be taken to prevent someone from incorrectly updating the count?
  3. Is the below approach an anti-pattern? If yes, what better approaches exist?

Assumptions:

The application is a public wiki/forum (like StackExchange).

  1. The application is read-heavy. Selects will account for 99% of the queries made by end-users
  2. Bulk inserts are not performed by end-users; they may only be needed for development/maintenance work
  3. Concurrent writes to the table are necessary
  4. Count estimates are good enough; exact counts would be the cherry on top

Approach

Consider two entities, Students and Courses, each with its own table. We also have a table student_courses that stores the many-to-many mapping between Students and Courses.

create table students (
  id bigserial primary key,
  name text
);

create table courses (
  id bigserial primary key,
  content text
);

create table student_courses (
  student_id bigint not null references students,
  course_id bigint not null references courses,
  primary key (student_id, course_id)
);

create index on student_courses (course_id);

To find the number of courses for a given student, we can perform

select count(1)
from student_courses
where student_id = 123;

If these count queries are frequent (say you wish to always display student name with course count), one way to optimize would be to maintain a count variable inside both entities, and then place triggers for insert and delete.

alter table students add column course_count bigint default 0;
alter table courses add column student_count bigint default 0;

create function increment_student_course_count() returns trigger as $$
begin
  update students set course_count = course_count + 1 where id = new.student_id;
  update courses set student_count = student_count + 1 where id = new.course_id;
  return new;
end;
$$ language plpgsql;

create trigger after_insert_update_counts after insert on student_courses for each row execute procedure increment_student_course_count();

create function decrement_student_course_count() returns trigger as $$
begin
  -- In an AFTER DELETE trigger, NEW is null; the deleted row is in OLD.
  update students set course_count = course_count - 1 where id = old.student_id;
  update courses set student_count = student_count - 1 where id = old.course_id;
  return old;
end;
$$ language plpgsql;

create trigger after_delete_update_counts after delete on student_courses for each row execute procedure decrement_student_course_count();

Once this is in place, finding the course count for a student is simply

select course_count
from students
where id = 123;
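A quick sanity check of the trigger pair (the ids below are illustrative assumptions, not from the original post):

```sql
-- Hypothetical seed data; suppose both rows receive id 1.
insert into students (name) values ('Ada');
insert into courses (content) values ('Databases');

insert into student_courses (student_id, course_id) values (1, 1);

-- The AFTER INSERT trigger should have bumped both counters to 1:
select course_count from students where id = 1;
select student_count from courses where id = 1;

delete from student_courses where student_id = 1 and course_id = 1;

-- The AFTER DELETE trigger should have brought both counters back to 0:
select course_count from students where id = 1;
```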
  • Have you considered the option of using functions to add/remove courses instead of triggers? – McNets Nov 23 '21 at 15:18
  • @McNets I should've mentioned - I'm using https://supabase.io/ for the backend, that is, the only backend I have is the PostgreSQL db. All authentication, authorization and server side logic is being managed by the database itself. – Pragy Agarwal Nov 23 '21 at 15:35
  • One huge con to this trigger-based approach is that it’s very slow, especially turning set-based operations into row-by-row aka slow-by-slow processing. – Colin 't Hart Nov 23 '21 at 16:38
  • Secondly, performance gets even worse when you try to insert (or delete) multiple student courses at the same time. The DBMS is forced to serialize updates of the count variables because each update statement needs to wait for the current value. – Colin 't Hart Nov 23 '21 at 16:46
  • @Colin'tHart That could be improved with statement-level triggers that use a transition table. – Laurenz Albe Nov 23 '21 at 17:30
  • Central pieces of information are missing: Roughly how many courses per student? Roughly how many inserts / upserts / deletes as compared to selects? Are concurrent writes to the table possible? Do you need exact counts or are estimates good enough? For starters, SELECT count(*) is faster than SELECT count(1) ... – Erwin Brandstetter Nov 23 '21 at 17:31
  • Great points! I've edited the post to include the additional information (see the Assumptions section above). – Pragy Agarwal Nov 23 '21 at 17:39
  • Another use case for this is maintaining counts for user-votes. For a given item, if you have to frequently retrieve the number of votes it has received, how do you optimize the queries? – Pragy Agarwal Nov 23 '21 at 17:41
  • @ErwinBrandstetter Just curious, why is SELECT COUNT(*) faster than SELECT COUNT(1) in PostgreSQL? I know it makes no difference in some other database systems (Microsoft SQL Server, Oracle, etc) but wondering if there's something different in PostgreSQL? – J.D. Nov 24 '21 at 02:28
  • @J.D. I added a paragraph to address this at the bottom of my answer. – Erwin Brandstetter Nov 24 '21 at 06:08

2 Answers


Concurrent writes to the table are necessary

In addition to what Laurenz already brought up, this increases the likelihood of deadlocks if transactions contain operations on more than a single row. Since your trigger locks a row in the students table and another one in the courses table, you better make sure that all write access that can possibly intertwine strictly follows some universal sort order. Seems especially problematic for the courses table, where many students are likely to register around the same time in separate transactions. If these transactions do more than writing a single row in student_courses, the problem may arise.

I don't see a danger of data errors due to concurrency, as each transaction has to wait for competing transactions to finish before it can update a contested row - which highlights the potential cost penalty for write operations. Keep your transactions with write activity tight.

Another issue is that the repeated writes bloat your main tables students and courses. Every UPDATE writes a new row version. The wider those rows, the bigger the impact. Especially in the face of concurrent write access, visibility may be impacted (see visibility map), and autovacuum may be hard-pressed to keep up. Index-only scans that would otherwise be possible may be disabled, and the tables may start to bloat. The whole effort aimed at faster reads may end up slowing reads!

Depending on the frequency of writes, consider "vertical partitioning" for the counts. This may also be relevant if many queries only fetch the count to begin with and don't need the rest of the (wide?) main row. Basically a MATERIALIZED VIEW focusing on just the counts, but managed manually with triggers instead of a fully fledged MATERIALIZED VIEW. The point is to write to a separate table with tiny rows to counter lock contention, visibility issues and bloat. Yet more added overhead, of course.
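A minimal sketch of such a separate counts table (table, function and trigger names are illustrative, not from the post; only the students side is shown):

```sql
-- Hypothetical narrow side table: frequent counter updates churn these
-- tiny rows instead of the wide students rows.
create table student_counts (
  student_id bigint primary key references students,
  course_count bigint not null default 0
);

create function bump_student_count() returns trigger as $$
begin
  -- Upsert so the first registration creates the counter row.
  insert into student_counts (student_id, course_count)
  values (new.student_id, 1)
  on conflict (student_id)
  do update set course_count = student_counts.course_count + 1;
  return new;
end;
$$ language plpgsql;

create trigger after_insert_bump_count
after insert on student_courses
for each row execute procedure bump_student_count();
```

Reads that need the count then join students to student_counts, leaving the main table untouched by the counter churn.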

With the PK and the additional index on (course_id) for student_courses, you already have the perfect setup for index or index-only scans for both counts. This query will be extremely fast:

SELECT count(*) FROM student_courses WHERE student_id = 123;

Fractions of a millisecond. Combined with caching on the application level, that may be all you need. All of the above really only comes into play if writes are extremely rare as compared to reads (the mentioned 99% may not warrant the effort, yet), and getting the counts quickly is of paramount importance.

In Postgres, count(*) is implemented separately from count(<value>). The former does not even begin to look at the expression or row values to be as fast as possible. (count(1) is only slightly more expensive, as the constant also does not involve row values, but still.)
Don't be fooled by the * symbol which in the context of SELECT * imposes maximum cost, fetching all columns. The effect in this context is really the opposite.

Erwin Brandstetter
  • Thanks for the detailed answer.

    Please let me know if the following question should be posted as a separate question - how do services like reddit/youtube/stackexchange efficiently maintain accurate vote-count information? Counting the votes seems to be exactly the same problem - you have a user_votes_on_posts(post_id, user_id) table, and you always need to display a post along with the corresponding vote-count.

    – Pragy Agarwal Nov 24 '21 at 07:43
  • @PragyAgarwal: Related, but different question, I'd say. I know that Stackoverflow uses SQL Server, but that's about it. – Erwin Brandstetter Nov 24 '21 at 10:56
  • @PragyAgarwal I'll also add that most sites, including Reddit and YouTube explicitly have approximated counts. Reddit even obfuscates the "actual score" per their algorithm, so they really don't need that accurate of counts to begin with then. Sites that do need accurate counts but want to maximize performance might use an eventually consistent method such as implementing caching and/or queue tables to store new writes that will eventually be updated to the main table generally throughout the normal day when possible between heavy reads. Otherwise realtime updates can be performant too. – J.D. Nov 24 '21 at 12:33

The disadvantages are:

  • reduced DML performance

    That is the price you are paying: faster SELECT against slower data modifications. If you are running statements that modify many rows at once, you may be more efficient with statement level triggers that use transition tables.

  • danger of inconsistencies through redundancy

    There is little protection you can have against manual updates of the count. The best you can do is to use a different user as table owner than you use for data modifications. Then you can use column level permissions to keep the latter user from modifying the column. The trigger function would have to be SECURITY DEFINER.
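The column-permission idea could look roughly like this (the role names table_owner and app_user are illustrative assumptions):

```sql
-- Assume the tables are owned by role table_owner, while the application
-- connects as role app_user. Revoke blanket UPDATE, then grant it back
-- column by column, omitting the counter.
revoke update on students from app_user;
grant update (name) on students to app_user;

-- The trigger function must run with the owner's rights so it can still
-- maintain the counter that app_user can no longer touch directly:
alter function increment_student_course_count() security definer;
alter function decrement_student_course_count() security definer;
```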

If the importance of fast queries is paramount, I think your approach is acceptable.
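As a sketch of the statement-level alternative mentioned under the first point (requires Postgres 10+ for transition tables; names are illustrative, and the courses counter would get an analogous trigger):

```sql
create function bulk_increment_course_counts() returns trigger as $$
begin
  -- new_rows is the transition table holding every row affected by one
  -- statement, so a 1000-row insert causes one grouped UPDATE per student
  -- instead of 1000 row-by-row updates.
  update students s
  set course_count = s.course_count + n.cnt
  from (select student_id, count(*) as cnt
        from new_rows
        group by student_id) n
  where s.id = n.student_id;
  return null;  -- the return value is ignored for statement-level AFTER triggers
end;
$$ language plpgsql;

create trigger after_bulk_insert_update_counts
after insert on student_courses
referencing new table as new_rows
for each statement
execute procedure bulk_increment_course_counts();
```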

Laurenz Albe
  • Thanks. Yes, the application is read-heavy. Can I use the same approach to calculate the number of user_votes some entity has received? (for example, how does reddit efficiently find the number of votes for a given post?) – Pragy Agarwal Nov 23 '21 at 17:44
  • I don't know what user votes are in this context... – Laurenz Albe Nov 23 '21 at 17:46
  • basically, if we have users and posts, and we wish to maintain who has voted on what post, then how do we optimize counting the number of votes a post has received? – Pragy Agarwal Nov 23 '21 at 17:48
  • That sounds like a completely different question... – Laurenz Albe Nov 23 '21 at 18:19