Raft optimizations in CockroachDB

Posted by an anonymized user, 2021-12-20 01:35

I read through a translation of CockroachDB's 2016 design document and collected the Raft optimizations it describes.

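1. Election optimization. Cockroach weights its randomized election timeouts so that replicas with shorter round-trip times to their peers are more likely to start an election first, and so more likely to become leader. This shortens the time a group spends without a leader. (The design document marks this as not implemented yet.)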

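2. Compared with TiDB's multi-Raft, there is an extra heartbeat-coalescing optimization, which cuts a large amount of traffic that would otherwise be spent on heartbeats. I am not sure whether TiDB's PD module has something similar. As I understand it, the idea is to introduce a node-level lease: as long as the node-level lease is valid, the leases held by all Raft group leaders on that node are valid too. Range-level leases then do not need to be renewed frequently; renewing the node-level lease is enough.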

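3. Like TiDB, it has a policy of splitting ranges that grow too large and merging ranges that become too small, which keeps the number of multi-Raft groups under control.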

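4. Another optimization is to shut down Raft groups that are inactive. The underlying assumption is that in a database this large, much of the data is not touched for long periods, so keeping leases alive for it serves no purpose. (The design document lists quiescent ranges as a possible future optimization.)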

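5. Range leases. From the description, this looks very similar to the Lease Read optimization in TiDB: a leader is fixed for a period of time, so the step of going through Raft to confirm who the leader is can be skipped during that period.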
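To make point 4 more concrete, here is a minimal sketch of how a group might go quiet. Everything in it is hypothetical (the `RaftGroup` class, the `idle_ticks_before_quiesce` parameter, the tick-based model); it is not CockroachDB's code, only an illustration of the idea that an idle group stops generating heartbeat traffic until a new write arrives.

```python
class RaftGroup:
    """Toy model of one Raft group that quiesces after a few idle ticks."""

    def __init__(self, group_id, idle_ticks_before_quiesce=3):
        self.group_id = group_id
        # Hypothetical knob: how many idle ticks before the group goes quiet.
        self.idle_ticks_before_quiesce = idle_ticks_before_quiesce
        self.idle_ticks = 0
        self.quiescent = False
        self.heartbeats_sent = 0

    def propose(self, command):
        # Any new write wakes the group and resets the idle counter.
        self.quiescent = False
        self.idle_ticks = 0

    def tick(self):
        # A quiescent group sends nothing at all.
        if self.quiescent:
            return
        self.heartbeats_sent += 1
        self.idle_ticks += 1
        if self.idle_ticks >= self.idle_ticks_before_quiesce:
            self.quiescent = True


group = RaftGroup(group_id=1)
for _ in range(10):
    group.tick()
print(group.heartbeats_sent, group.quiescent)
```

With the default of three idle ticks, only the first three of the ten ticks produce a heartbeat; the group then stays silent until `propose` is called.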

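I only skimmed the Raft-related parts. The document does not say much about optimizations, and most of what it does mention seems to have a counterpart in TiDB's own write-ups.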

Sources:

https://lihuanghe.github.io/2016/05/06/cockroachdb-design.html

https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

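Below is a diagram of the stores on 4 nodes. Each range is replicated 3 ways using the Raft protocol, and replicas of the same range are drawn in the same color. The architecture is very similar to TiDB's multi-Raft.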

Ranges

Cockroach weights random timeouts such that the replicas with shorter round trip times to peers are more likely to hold elections first (not implemented yet).

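That is, the random timeouts are biased so that a replica with a short round trip to its peers tends to be the first to call an election.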
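The design document does not say how the weighting would work (and notes it is not implemented), so the following is only one possible sketch. The function name `election_timeout`, the `rtt_weight` factor, and the linear penalty are all my own assumptions: the usual random timeout gets an extra delay proportional to the replica's average round-trip time to its peers.

```python
import random


def election_timeout(base_ms, jitter_ms, avg_rtt_ms, rtt_weight=2.0, rng=random):
    """Randomized election timeout with a penalty that grows with peer RTT.

    base_ms + uniform jitter is the usual Raft randomized timeout; the
    rtt_weight * avg_rtt_ms term is a hypothetical weighting, not taken
    from CockroachDB.
    """
    return base_ms + rng.uniform(0, jitter_ms) + rtt_weight * avg_rtt_ms


# Compare a well-connected replica (5 ms RTT) with a distant one (50 ms RTT).
rng = random.Random(0)
trials = 1000
near_first = 0
for _ in range(trials):
    near = election_timeout(150, 150, avg_rtt_ms=5, rng=rng)
    far = election_timeout(150, 150, avg_rtt_ms=50, rng=rng)
    if near < far:
        near_first += 1
print(f"near replica times out first in {near_first} of {trials} trials")
```

The distant replica can still time out first, so randomness is preserved; the weighting only shifts the odds toward the replica with the shorter round trip.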

Our Raft implementation was developed together with CoreOS, but adds an extra layer of optimization to account for the fact that a single Node may have millions of consensus groups (one for each Range). Areas of optimization are chiefly coalesced heartbeats (so that the number of nodes dictates the number of heartbeats as opposed to the much larger number of ranges) and batch processing of requests. Future optimizations may include two-phase elections and quiescent ranges (i.e. stopping traffic completely for inactive ranges).

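In short: Cockroach's Raft implementation was developed jointly with CoreOS, with an extra optimization layer on top because a single node may host millions of consensus groups (one per range). The main optimizations are coalesced heartbeats (the number of heartbeats is determined by the number of nodes rather than the far larger number of ranges) and batched request processing. Possible future optimizations include two-phase elections and quiescent ranges, which stop traffic entirely for inactive ranges.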
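A rough sketch of what coalescing means in practice, assuming each Raft group queues its heartbeats as (destination node, range ID) pairs. The function name and data layout are hypothetical; the point is only that the number of outgoing messages ends up equal to the number of destination nodes, not the number of ranges.

```python
from collections import defaultdict


def coalesce_heartbeats(pending):
    """Group per-range heartbeats into one message per destination node.

    pending: iterable of (dest_node_id, range_id) pairs.
    Returns a dict mapping dest_node_id to a sorted list of range IDs.
    """
    by_node = defaultdict(list)
    for dest_node, range_id in pending:
        by_node[dest_node].append(range_id)
    return {node: sorted(range_ids) for node, range_ids in by_node.items()}


# Five per-range heartbeats destined for two nodes.
pending = [(2, 10), (3, 10), (2, 11), (3, 12), (2, 12)]
messages = coalesce_heartbeats(pending)
print(len(pending), "heartbeats sent as", len(messages), "messages")
```

Here five heartbeats collapse into two network messages, one for node 2 and one for node 3.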

For these reasons, Cockroach introduces the concept of Range Leases: This is a lease held for a slice of (database, i.e. hybrid logical) time. A replica establishes itself as owning the lease on a range by committing a special lease acquisition log entry through raft. The log entry contains the replica node's epoch from the node liveness table--a system table containing an epoch and an expiration time for each node. A node is responsible for continuously updating the expiration time for its entry in the liveness table. Once the lease has been committed through raft the replica becomes the lease holder as soon as it applies the lease acquisition command, guaranteeing that when it uses the lease it has already applied all prior writes on the replica and can see them locally.
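My reading of this paragraph, as a small sketch: a range lease records the holder's node ID and the epoch it saw in the liveness table, and stays usable while that node's liveness entry still carries the same epoch and has not expired. The class names, field names, and the exact validity rule below are my assumptions, not CockroachDB's API. It also shows why one node-level renewal is enough to keep every lease on that node alive.

```python
from dataclasses import dataclass


@dataclass
class Liveness:
    epoch: int         # bumped when the node is considered to have failed
    expiration: float  # renewed continuously by the node itself


@dataclass
class RangeLease:
    node_id: int
    epoch: int  # epoch copied from the liveness table at acquisition time


def lease_is_valid(lease, liveness_table, now):
    """Assumed rule: same epoch and liveness record not yet expired."""
    record = liveness_table.get(lease.node_id)
    if record is None:
        return False
    return record.epoch == lease.epoch and now < record.expiration


liveness_table = {1: Liveness(epoch=5, expiration=100.0)}
leases = [RangeLease(node_id=1, epoch=5) for _ in range(3)]

# Renewing the single node-level record extends all three range leases.
liveness_table[1].expiration = 200.0
print(all(lease_is_valid(l, liveness_table, now=150.0) for l in leases))

# Bumping the epoch invalidates every lease taken under the old epoch.
liveness_table[1].epoch = 6
print(any(lease_is_valid(l, liveness_table, now=150.0) for l in leases))
```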
