I've been having trouble with my master/slave server. Recently I was having a few repeated issues where the MySQL slave would stop due to "invalid sql syntax", but the queries executed fine on the master. I would have to dig through the logs to find the query, execute it manually on the slave, and then use sql_slave_skip_counter to resume replication, skipping the corrupted statement on the slave. I thought it might be hardware related since it was only affecting the slave, so I moved it to a different blade (both the servers are blades).

However, today I was greeted with a nagios alert that the slave had stopped again. This time, it seems like the relay log is definitely corrupt. I was able to run mysqlbinlog > /dev/null on all the master logs, and none are corrupt (including the one it had read up to on the slave). The relay log on the slave is, though. It reports:

[root@mysql02 mysql]# mysqlbinlog mysql02-relay-bin.010923 > /dev/null
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 38210134, event_type: 0
Could not read entry at offset 618730:Error in log format or read error

Nothing too much different in the logs either:

071006 11:18:52 [Note] Slave I/O thread: connected to master 'replica@10.26.3.124:3306', replication started in log 'mysql-bin.000104' at position 906124600
071008 9:07:12 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
071008 9:07:13 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000105' position 766367499
071008 9:07:13 [Note] Slave: connected to master 'replica@10.26.3.124:3306',replication resumed in log 'mysql-bin.000105' at position 766367499
071008 10:08:16 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
071008 10:08:16 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000105' position 819300906
071008 10:08:16 [Note] Slave: connected to master 'replica@10.26.3.124:3306',replication resumed in log 'mysql-bin.000105' at position 819300906
071008 12:12:40 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
071008 12:12:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000105' position 893443034
071008 12:12:40 [Note] Slave: connected to master 'replica@10.26.3.124:3306',replication resumed in log 'mysql-bin.000105' at position 893443034
071008 12:15:33 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 38210134, event_type: 0
071008 12:15:33 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
071008 12:15:33 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
071008 12:15:33 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000105' position 893425700

Any help or ideas tracking this down would be appreciated - I think we are going to have to take down the production database to resync the two and get replication going again. We mainly use the replica for backup purposes in order to avoid downtime during the backup and in the event of a hardware issue with the master.

Thanks,
Frank

--
The sender of this email subscribes to Perimeter eSecurity's email anti-virus service. This email has been scanned for malicious code and is believed to be virus free. For more information on email security please visit: http://www.perimeterusa.com/email-defense-content.html

This communication is confidential, intended only for the named recipient(s) above and may contain trade secrets or other information that is exempt from disclosure under applicable law. Any use, dissemination, distribution or copying of this communication by anyone other than the named recipient(s) is strictly prohibited. If you have received this communication in error, please delete the email and immediately notify our Command Center at 203-541-3444. Thanks
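
The error log in the first message shows the I/O thread losing its connection to the master several times before the relay log error appears. As a rough aid for spotting that pattern in a longer log, the sketch below pulls out the timestamps of entries that report server_errno=2013. The function name `lost_connections` and the assumed line layout (date, time, bracketed level, message) are illustrative only and are based on the excerpt quoted in this thread.

```python
import re

# Assumed layout, taken from the excerpt in this thread:
#   071008 9:07:12 [ERROR] Error reading packet from server: ...
ENTRY = re.compile(r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2})\s+\[(\w+)\]\s+(.*)$")


def lost_connections(log_lines):
    """Return (date, time) pairs for entries that report server_errno=2013.

    Hypothetical helper for illustration; it is not part of MySQL.
    Lines that do not match the assumed layout are skipped.
    """
    hits = []
    for line in log_lines:
        match = ENTRY.match(line.strip())
        if match and "server_errno=2013" in match.group(4):
            hits.append((match.group(1), match.group(2)))
    return hits
```

Feeding it the lines of the slave's error log gives a quick list of when the connection dropped, which can then be compared against the time of the relay log error.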

Frank,

Frank Bottone wrote:
... snip ...
> Any help or ideas tracking this down would be appreciated - I think we
> are going to have to take down the production database to resync the two
> and get replication going again. We mainly use the replica for backup
> purposes in order to avoid downtime during the backup and in the event
> of a hardware issue with the master.

No need to take down the master or re-initialize the slave, given what I've seen so far. Just tell the slave to throw away its relay logs and re-fetch from the master. From the output you showed,

CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000105', MASTER_LOG_POS=893425700;

This will discard the relay logs and re-fetch them. As long as that master log hasn't been purged on the master, you might be OK.

You might want to take a look at mysql-table-checksum. Your data could be fine, but it might also be different on the slave. But there's no need to worry about it until you prove it:

http://mysqltoolkit.sourceforge.net/

Your corruption in the relay logs could be caused by any number of things -- bad network, bad hardware, software bug... You could add your voice to an outstanding bug request:

http://bugs.mysql.com/bug.php?id=25737

Hope that helps
Baron
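
Baron's suggestion can be sketched as a small helper that reads the stop coordinates out of the slave's error log and prints the statements to run. This is an illustration under stated assumptions, not a tested procedure: the name `recovery_statements` is made up here, and wrapping the CHANGE MASTER TO in STOP SLAVE and START SLAVE is an assumption about the surrounding steps, which the message above does not spell out.

```python
import re

# Matches the coordinates in the error-log line quoted in this thread:
#   We stopped at log 'mysql-bin.000105' position 893425700
STOPPED_AT = re.compile(r"We stopped\s+at log '([^']+)' position (\d+)")


def recovery_statements(error_log_text):
    """Return the SQL statements to re-point the slave, or [] if no match.

    Hypothetical helper, not part of MySQL. The last match wins, since
    the most recent stop is the one to resume from.
    """
    matches = STOPPED_AT.findall(error_log_text)
    if not matches:
        return []
    log_file, log_pos = matches[-1]
    return [
        "STOP SLAVE;",  # assumption: stop replication before changing coordinates
        "CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d;"
        % (log_file, int(log_pos)),
        "START SLAVE;",  # assumption: restart once the coordinates are set
        "SHOW SLAVE STATUS;",
    ]
```

The statements are only printed or returned, never executed, so they can be reviewed against SHOW SLAVE STATUS on the slave before anything is run by hand.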

Baron,

Thanks for the quick response. I do have the binlogs still on the master, so I should be able to do that. However, I saw a post somewhere (lost the link at this time) saying that resetting the slave will drop any temporary tables, which could cause issues. I'm not sure at this point if that would affect me or not. It is definitely worth a shot I guess, since worst case I will still need to resync from the master.

I will try this and give the checksum tool a try as well, although I think I might have crippled myself from the earlier issues we've been having. They only occurred in a spam-related table, and we were able to prove out that the messages were clearly spam and could be left out of the slave/backup by just skipping that transaction. The issue was happening frequently enough that digging through the binlogs to get the query to manually replicate became more effort than it was worth, so the systems might be slightly out of sync. Perhaps I can just ignore the checksum differences for that particular table...

I'll let you know the results.

Thanks,
Frank
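
Frank's idea of ignoring checksum differences for one known-divergent table can be sketched as a plain dictionary comparison. The input format (table name mapped to a checksum value, for example collected by running CHECKSUM TABLE on each server) and the table name `spam_messages` are assumptions for illustration; this is not the output format of mysql-table-checksum.

```python
def differing_tables(master, slave, ignore=()):
    """Return sorted names of tables whose checksums differ.

    `master` and `slave` map table name -> checksum. A table present on
    only one side counts as different. Names in `ignore` are skipped.
    Hypothetical helper for illustration only.
    """
    ignored = set(ignore)
    names = (set(master) | set(slave)) - ignored
    return sorted(t for t in names if master.get(t) != slave.get(t))


# Example with made-up table names and checksum values.
master_sums = {"users": 111, "posts": 222, "spam_messages": 333}
slave_sums = {"users": 111, "posts": 999, "spam_messages": 444}
suspect = differing_tables(master_sums, slave_sums, ignore=["spam_messages"])
```

Here `suspect` holds only `posts`: the spam table is skipped even though its checksums differ, so any remaining entries point at drift that is not explained by the skipped statements.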

Frank-

I've had the exact same issue crop up in our prod servers - it was very frustrating, as it was intermittent and would affect some of our slaves, but not all. We had a lot of back and forth with MySQL support without really being able to consistently pin down or reproduce the issue.

Ultimately we had to perform a RESET SLAVE (back up those relay logs first, just in case) and reset the master log position (see Baron's CHANGE MASTER TO statement earlier in this thread) to the point where it failed. We also increased the max allowed packet size (max_allowed_packet) on both the master and the slave, as the issue seemed to occur most often with large transactions or large single rows. Doing those two things seemed to fix the issue.

Baron's checksum script is excellent and I can recommend using it if you haven't rolled your own script. One caveat is that if your slave is using different engine types (to use full text indexes, say), I think the checksums are different. But you may not have that problem.

I believe your assertion about temp tables is correct, according to this: http://dev.mysql.com/doc/refman/5.0/en/reset-slave.html

Good luck,
Dan

-----Original Message-----
From: Frank Bottone [mailto:fbottone@perimeterusa.com]
Sent: Monday, October 08, 2007 3:49 PM
To: Baron Schwartz
Cc: mysql@lists.mysql.com
Subject: Re: Problem with repeated replication corruption - Could not parse relay log event entry

... snip ...

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=d...tionhealth.com
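
Dan's observation about large events can be checked roughly against the error message itself. The sketch below reads the data_len value from a Log_event::read_log_event() error line and compares it with a max_allowed_packet value that you supply (for example, the number shown by SHOW VARIABLES LIKE 'max_allowed_packet'). The function name is hypothetical, and because a data_len read from a corrupt event may itself be garbage, the result is a hint rather than a diagnosis.

```python
import re

DATA_LEN = re.compile(r"data_len: (\d+)")


def exceeds_packet_limit(error_line, max_allowed_packet):
    """Compare the event's reported data_len with a packet limit in bytes.

    Returns True or False, or None when the line has no data_len field.
    Hypothetical helper for illustration; not part of MySQL.
    """
    match = DATA_LEN.search(error_line)
    if match is None:
        return None
    return int(match.group(1)) > max_allowed_packet
```

For the error quoted in this thread, the reported data_len is 38210134 bytes, so the function returns True for any limit below that value and False for a limit of, say, 64 MB.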