One of my slaves is no longer replicating. Seconds_Behind_Master continues to increase, Exec_Master_Log_Pos does not advance, and Relay_Log_Space keeps growing. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave, or it hits the 1205 error).
I've tried the solutions in this thread, which sounded similar, but haven't had any luck: Slave SQL thread got hanged. I also tried a RESET SLAVE, which still produces the same behavior.
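For reference, the sequence I used was roughly the following (my understanding is that in 5.6 RESET SLAVE discards the relay logs and replication position but keeps the connection settings in memory, while RESET SLAVE ALL would clear those as well):

STOP SLAVE;
RESET SLAVE;   -- discards relay logs and replication position, keeps connection settings until restart
START SLAVE;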
Additionally, when I run:
stop slave;
on my instance it takes 30+ seconds to execute:
Query OK, 0 rows affected (33.97 sec)
show slave status\G
returns:
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.40.203
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000779
Read_Master_Log_Pos: 881930813
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000779
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: percona.%
Replicate_Wild_Ignore_Table: test.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 771399898
Relay_Log_Space: 110531372
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2222
Master_UUID: example
Master_Info_File: /mnt/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: updating
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
I have four other slaves of the same master that are all functional, so I know the master's binary logs aren't corrupt.
If I leave replication running I eventually end up with a 1205 error:
Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
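For reference, the variable the error mentions can be checked and raised like this (the value 100 below is just an illustration, not something I've settled on):

SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';   -- default is 10
SET GLOBAL slave_transaction_retries = 100;               -- dynamic, takes effect without a restart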
UPDATE:
Running SHOW PROCESSLIST brought back:
348 | replicant | serverDNS | NULL | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated
After finding this we raised innodb_lock_wait_timeout from its default value of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would be hit on only one of the five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources.
Additionally, should I stop at 14400 or should I just set this to the maximum, 1073741824?
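For reference, the change was along these lines; note that SET GLOBAL only applies to sessions opened afterwards and does not survive a restart, so I assume it also needs to go into my.cnf under [mysqld] to persist:

SET GLOBAL innodb_lock_wait_timeout = 14400;               -- runtime change, new sessions only
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';     -- verify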
Update 2:
If I restart the mysql service, replication processes as expected for about a day, then the issue reproduces.
Additionally, this slave is also a master of another system, if that makes a difference. The slave of that master is running fine.
Currently relevant (in my eyes, at least) slave status lines:
Master_Log_File: mysql-bin.000786
Read_Master_Log_Pos: 131895019
Relay_Log_File: mysqld-relay-bin.000025
Relay_Log_Pos: 52668949
Relay_Master_Log_File: mysql-bin.000786
Exec_Master_Log_Pos: 91692081
Relay_Log_Space: 131895472
Seconds_Behind_Master: 12163
91692081 is the value it is currently stuck at.
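Since the eventual failure is a 1205 lock wait timeout, here is a minimal sketch of the queries that could show whether a local transaction is blocking the SQL thread, assuming the INFORMATION_SCHEMA InnoDB tables from MySQL 5.6/5.7 (in 8.0 the lock-wait tables moved to performance_schema):

-- Long-running or waiting transactions
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started;

-- Which transaction is blocking which (5.6/5.7 naming)
SELECT w.requesting_trx_id, w.blocking_trx_id,
       r.trx_mysql_thread_id AS waiting_thread,
       b.trx_mysql_thread_id AS blocking_thread
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;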
Update 3:
Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:
Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
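For reference, the counters above are from the FILE I/O section of the InnoDB status output, and these are roughly the checks involved (standard MySQL commands):

SHOW ENGINE INNODB STATUS\G                               -- FILE I/O section: OS file reads/writes/fsyncs, pending fsyncs
SHOW GLOBAL STATUS LIKE 'Innodb_data_%';                  -- same counters as status variables, easier to track over time
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';     -- the warning suggests this may be too small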
Comments:
… SHOW PROCESSLIST\G for some additional info. – Kondybas Jan 15 '20 at 15:29
SHOW PROCESSLIST shows replication threads and executing queries. The executing queries always process and the replication threads are always present. If a particular value is worth a look there please let me know which one(s). Thanks. – user3783243 Jan 16 '20 at 19:59
… stop slave to run (yesterday's return: Query OK, 0 rows affected (3 hours 54 min 16.67 sec)). All servers are on AWS m5.2xlarge. Nothing in the process list shows locks, and all queries process out aside from the replication threads. For the binary log format, what should I run to return that? – user3783243 Jan 22 '20 at 15:02
… FSU Clemson is? That seems to be US college sports related. – user3783243 Jan 24 '20 at 05:11
… update 3 seems to point to a slow filesystem – jerichorivera Jan 24 '20 at 05:42
… iotop after I found that, but nothing stood out to me as egregious after four hours of monitoring. Do you have any recommendations for how I could measure potential issues with this? I'll also open a ticket with AWS support to see if there's anything they can see. Generally I don't get much support with application issues from AWS, though. If I used their RDS I'm sure they'd be quicker. – user3783243 Jan 24 '20 at 05:46