One of my slaves is no longer replicating. Seconds_Behind_Master continues to increase, Exec_Master_Log_Pos does not advance, and Relay_Log_Space keeps growing. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave, or it hits the 1205 error).
I've tried the solutions in this thread, which sounded similar, but haven't had any luck: Slave SQL thread got hanged. I also tried a RESET SLAVE, which still produces the same behavior.
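For reference, the sequence I used was roughly the following (my understanding is that in 5.6 RESET SLAVE discards the relay logs and replication position but keeps the connection settings in memory, while RESET SLAVE ALL would clear those as well):

STOP SLAVE;
RESET SLAVE;   -- discards relay logs and replication position, keeps connection settings until restart
START SLAVE;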
Additionally, when I run:
stop slave;
on my instance it takes 30+ seconds to execute:
Query OK, 0 rows affected (33.97 sec)
show slave status\G
returns:
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.40.203
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000779
Read_Master_Log_Pos: 881930813
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000779
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: percona.%
Replicate_Wild_Ignore_Table: test.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 771399898
Relay_Log_Space: 110531372
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2222
Master_UUID: example
Master_Info_File: /mnt/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: updating
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
I have four other slaves of the same master that are all functional, so I know the master's binary logs aren't corrupt.
If I leave replication running I eventually end up with a 1205 error:
Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
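For reference, the variable the error mentions can be checked and raised like this (the value 100 below is just an illustration, not something I've settled on):

SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';   -- default is 10
SET GLOBAL slave_transaction_retries = 100;               -- dynamic, takes effect without a restart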
UPDATE:
Running SHOW PROCESSLIST brought back:
348 | replicant | serverDNS | NULL | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated
After finding this we raised innodb_lock_wait_timeout from its default value of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would be hit on only one of the five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources.
Additionally, should I stop at 14400 or should I just set this to the maximum, 1073741824?
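For reference, the change was along these lines; note that SET GLOBAL only applies to sessions opened afterwards and does not survive a restart, so I assume it also needs to go into my.cnf under [mysqld] to persist:

SET GLOBAL innodb_lock_wait_timeout = 14400;               -- runtime change, new sessions only
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';     -- verify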
Update 2:
If I restart the mysql service, replication processes as expected for about a day, then the issue reproduces.
Additionally, this slave is also a master of another system, if that makes a difference. The slave of that master is running fine.
Currently relevant (in my eyes, at least) slave status lines:
Master_Log_File: mysql-bin.000786
Read_Master_Log_Pos: 131895019
Relay_Log_File: mysqld-relay-bin.000025
Relay_Log_Pos: 52668949
Relay_Master_Log_File: mysql-bin.000786
Exec_Master_Log_Pos: 91692081
Relay_Log_Space: 131895472
Seconds_Behind_Master: 12163
91692081 is the value it is currently stuck at.
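Since the eventual failure is a 1205 lock wait timeout, here is a minimal sketch of the queries that could show whether a local transaction is blocking the SQL thread, assuming the INFORMATION_SCHEMA InnoDB tables from MySQL 5.6/5.7 (in 8.0 the lock-wait tables moved to performance_schema):

-- Long-running or waiting transactions
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started;

-- Which transaction is blocking which (5.6/5.7 naming)
SELECT w.requesting_trx_id, w.blocking_trx_id,
       r.trx_mysql_thread_id AS waiting_thread,
       b.trx_mysql_thread_id AS blocking_thread
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;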
Update 3:
Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:
Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
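For reference, the counters above are from the FILE I/O section of the InnoDB status output, and these are roughly the checks involved (standard MySQL commands):

SHOW ENGINE INNODB STATUS\G                               -- FILE I/O section: OS file reads/writes/fsyncs, pending fsyncs
SHOW GLOBAL STATUS LIKE 'Innodb_data_%';                  -- same counters as status variables, easier to track over time
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';     -- the warning suggests this may be too small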
Comments:
… SHOW PROCESSLIST\G for some additional info. – Kondybas Jan 15 '20 at 15:29
SHOW PROCESSLIST shows replication threads and executing queries. The executing queries always process and the replication threads are always present. If a particular value is worth a look there please let me know which one(s). Thanks. – user3783243 Jan 16 '20 at 19:59
… stop slave to run (yesterday's return: Query OK, 0 rows affected (3 hours 54 min 16.67 sec)). All servers are on AWS m5.2xlarge. Nothing in the process list shows locks, and all queries process out aside from the replication threads. For the binary log format, what should I run to return that? – user3783243 Jan 22 '20 at 15:02
… FSU Clemson is? That seems to be US college sports related. – user3783243 Jan 24 '20 at 05:11
… update 3 seems to point to a slow filesystem – jerichorivera Jan 24 '20 at 05:42
… iotop after I found that, but nothing stood out to me as egregious after four hours of monitoring. Do you have any recommendations for how I could measure potential issues with this? I'll also open a ticket with AWS support to see if there's anything they can see. Generally I don't get much support with application issues from AWS, though. If I used their RDS I'm sure they'd be quicker. – user3783243 Jan 24 '20 at 05:46