首页 > 资讯 > 数据库 >详细分析Redis集群故障

830

分享到

详细分析Redis集群故障

集群故障详细 2022-06-04 17:06:56 830人浏览八月长安

摘要

故障表象：业务层面显示提示查询Redis失败集群组成： 3主3从，每个节点的数据有8GB 机器分布：在同一个机架中， xx.x.xxx.199 xx.x.xxx.200 xx.x.xxx.

故障表象：

业务层面显示提示查询Redis失败

集群组成：

3主3从，每个节点的数据有8GB

机器分布：

在同一个机架中，

xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201

redis-server进程状态：

通过命令ps -eo pid,lstart | grep $pid，

发现进程已经持续运行了3个月

发生故障前集群的节点状态：

xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

---------------------------------下面是日志分析--------------------------------------

步1：
主节点8371失去和从节点8373的连接：
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

步2：
主节点8370/8375判定8371失联：
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).

步3：
从节点8372/8373/8374收到主节点8375说8371失联：
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

步4：
主节点8370/8375授权8373升级为主节点转移：
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

步5：
原主节点8371修改自己的配置，成为8373的从节点：
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

步6：
主节点8370/8375/8373明确8371失败状态：
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

步7：
新从节点8371开始从新主节点8373，第一次全量同步数据：
8373日志：：
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371日志：：
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63D3baebea1ed409e0e7243a4423:440721826993

步8：
主节点8370/8375判定8373(新主)失联：
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

步9：
主节点8370/8375判定8373(新主)恢复：
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

步10：
主节点8373完成全量同步所需要的BGSAVE操作：
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

步11：
从节点8371开始从主节点8373接收到数据：
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

步12：
主节点8373发现从节点8371对output buffer作了限制：
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.

步13：
从节点8371从主节点8373同步数据失败，连接断了，第一次全量同步失败：
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

步14：
从节点8371重新开始同步，连接失败，主节点8373的连接数满了：
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

步15：
从节点8371重新连上主节点8373，第二次开始全量同步：
8371日志：
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373日志：
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

步16：
从节点8371同步数据成功，开始加载经内存：
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

步17：
集群恢复正常：
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

步18：
从节点8371同步数据成功，耗时7分钟：
46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success

8371失联原因分析：

由于几台机器在同一个机架，不太可能发生网络中断的情况，于是通过SLOWLOG GET命令查看了慢查询日志，发现有一个KEYS命令被执行了，耗时8.3秒，再查看集群节点超时设置，发现是5s(cluster-node-timeout 5000)

出现节点失联的原因：

客户端执行了耗时1条8.3s的命令，

2016/9/9 18:57:43 开始执行KEYS命令
2016/9/9 18:57:50 8371被判断失联（redis日志）
2016/9/9 18:57:51 执行完KEYS命令

总结来说，有以下几个问题：

1.由于cluster-node-timeout设置比较短，慢查询KEYS导致了集群判断节点8371失联

2.由于8371失联，导致8373升级为主，开始主从同步

3.由于配置client-output-buffer-limit的限制，导致第一次全量同步失败了

4.又由于PHP客户端的连接池有问题，疯狂连接服务器，产生了类似SYN攻击的效果

5.第一次全量同步失败后，从节点重连主节点花了30秒（超过了最大连接数1w）

关于client-output-buffer-limit参数：


# The syntax of every client-output-buffer-limit directive is the following: 
# 
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds> 
# 
# A client is immediately disconnected once the hard limit is reached, or if 
# the soft limit is reached and remains reached for the specified number of 
# seconds (continuously). 
# So for instance if the hard limit is 32 megabytes and the soft limit is 
# 16 megabytes / 10 seconds, the client will get disconnected immediately 
# if the size of the output buffers reach 32 megabytes, but will also get 
# disconnected if the client reaches 16 megabytes and continuously overcomes 
# the limit for 10 seconds. 
# 
# By default nORMal clients are not limited because they don't receive data 
# without asking (in a push way), but just after a request, so only 
# asynchronous clients may create a scenario where data is requested faster 
# than it can read. 
# 
# Instead there is a default limit for pubsub and slave clients, since 
# subscribers and slaves receive data in a push fashion. 
# 
# Both the hard or the soft limit can be disabled by setting them to zero. 
client-output-buffer-limit normal 0 0 0 
client-output-buffer-limit slave 256mb 64mb 60 
client-output-buffer-limit pubsub 32mb 8mb 60

采取措施：

1.单实例的切割到4G以下，否则发生主从切换会耗时很长

2.调整client-output-buffer-limit参数，防止同步进行到一半失败

3.调整cluster-node-timeout，不能少于15s

4.禁止任何耗时超过cluster-node-timeout的慢查询，因为会导致主从切换

5.修复客户端类似SYN攻击的疯狂连接方式

总结

以上就是本文关于详细分析Redis集群故障的全部内容，希望对大家有所帮助。感兴趣的朋友可以参阅：spring aop实现Redis缓存 数据库查询源码、简述Redis和Mysql的区别、oracle 数据库启动阶段分析等，如有不足之处，请留言之处。小编会及时更正。感谢朋友们对编程网网站的支持！

您可能感兴趣的文档:

点击免费下载>>软考高级考试备考技巧/历年真题/备考精华资料

--结束END--

本文标题: 详细分析Redis集群故障

本文链接: https://www.lsjlt.com/news/11778.html(转载时请注明来源链接)

有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341

本篇文章演示代码以及资料文档资料下载

下载Word文档到电脑，方便收藏和打印～

下载Word文档

去做题

猜你喜欢

oracle怎么查询当前用户所有的表

要查询当前用户拥有的所有表，可以使用以下 sql 命令：select * from user_tables; 如何查询当前用户拥有的所有表要查询当前用户拥有的所有表，可以使...

99+

2024-05-14

oracle
oracle怎么备份表中数据

oracle 表数据备份的方法包括：导出数据 (exp)：将表数据导出到外部文件。导入数据 (imp)：将导出文件中的数据导入表中。用户管理的备份 (umr)：允许用户控制备份和恢复过程...

99+

2024-05-14

oracle
oracle怎么做到数据实时备份

oracle 实时备份通过持续保持数据库和事务日志的副本来实现数据保护，提供快速恢复。实现机制主要包括归档重做日志和 asm 卷管理系统。它最小化数据丢失、加快恢复时间、消除手动备份任务...

99+

2024-05-14

oracle 数据丢失
oracle怎么查询所有的表空间

要查询 oracle 中的所有表空间，可以使用 sql 语句 "select tablespace_name from dba_tablespaces"，其中 dba_tabl...

99+

2024-05-14

oracle
oracle怎么创建新用户并赋予权限设置

答案：要创建 oracle 新用户，请执行以下步骤：以具有 create user 权限的用户身份登录；在 sql*plus 窗口中输入 create user identified ...

99+

2024-05-14

oracle
oracle怎么建立新用户

在 oracle 数据库中创建用户的方法：使用 sql*plus 连接数据库；使用 create user 语法创建新用户；根据用户需要授予权限；注销并重新登录以使更改生效。如何在 ...

99+

2024-05-14

oracle
oracle怎么创建新用户并赋予权限密码

本教程详细介绍了如何使用 oracle 创建一个新用户并授予其权限：创建新用户并设置密码。授予对特定表的读写权限。授予创建序列的权限。根据需要授予其他权限。如何使用 Oracle 创...

99+

2024-05-14

oracle
oracle怎么查询时间段内的数据记录表

在 oracle 数据库中查询指定时间段内的数据记录表，可以使用 between 操作符，用于比较日期或时间的范围。语法：select * from table_name wh...

99+

2024-05-14

oracle
oracle怎么查看表的分区

问题：如何查看 oracle 表的分区？步骤：查询数据字典视图 all_tab_partitions，指定表名。结果显示分区名称、上边界值和下边界值。如何查看 Oracle 表的分区...

99+

2024-05-14

oracle
oracle怎么导入dump文件

要导入 dump 文件，请先停止 oracle 服务，然后使用 impdp 命令。步骤包括：停止 oracle 数据库服务。导航到 oracle 数据泵工具目录。使用 impdp 命令导...

99+

2024-05-14

oracle