All you need to know about distributed file system

Rong
56 min readJun 28, 2023

--

All you need to know about distributed file system

Category

1. Centralized storage structure

2. Distributed storage

  • 1. The rise of distributed storage
  • 2. The importance of distributed storage
  • 3. Types and comparison of distributed storage

3. Analysis of Distributed Theory

  • 1. Consistency and availability
  • 2. Data distribution
  • 3. Copy
  • 4. Distributed protocol
  • 5. Deployment across computer rooms

4. Distributed file system

  • 1. Google File System (GFS)
  • 2. Taobao File System (TFS)
  • 3. Facebook Haystack file system
  • 4. CDN content distribution network

5. Distributed key-value system

  • 1. Amazon Dynamo
  • 2. Taobao Tiar
  • 3. ETCD
  • 4. Product selection comparison ( Etcd , Zookeeper , Consul )

1, Centralized storage structure

Speaking of distributed storage, let’s first look at what traditional storage looks like.

Traditional storage is also called centralized storage. It can be seen from the concept that it is centralized, that is, the entire storage is concentrated in one system, but centralized storage is not a single device, but centralized Multiple devices in the system, such as the EMC storage in the figure below, require several cabinets for storage.

This storage system contains many components, in addition to the controller, disk array (JBOD) and switches, and other equipment, there are also auxiliary equipment such as management equipment.

The structure includes a system, which is the core component of the storage system. Usually, there are two controllers in the machine, which serve as backups for each other, so as to avoid the unavailability of the entire storage system caused by hardware failure. The system usually includes a front-end port and a back-end port. The user of the front-end port provides storage services for the server, while the back-end port is used to expand the capacity of the storage system. More storage devices can be connected through the rear port head, thus forming a very large storage resource pool.

In the entire structure, the system is the core part of the entire storage system, and the advanced functions of the entire storage system are realized in it. The software in the controller manages the disks, abstracts the disks into storage resource pools, and divides them into LUNs for servers to use. The LUN here is actually the disk seen on the server. Of course, some centralized storage itself is also a file server, which can provide shared file services. In any case, from the above, we can see that the biggest feature of centralized storage is that there is a unified entrance through which all data must pass. This entrance is the head of the storage system. This is the most notable feature that distinguishes centralized storage from distributed storage. As shown below:

2. Distributed storage

Distributed storage was first proposed by Google. Its purpose is to use cheap servers to provide web access problems in large-scale and high-concurrency scenarios. It adopts an expandable system structure, uses multiple storage servers to share storage load, and uses location servers to locate and store information. It not only improves the reliability, availability and access efficiency of the system, but is also easy to expand.

1. The rise of distributed storage

The rise of distributed storage is inseparable from the development of the Internet. Internet companies usually use large-scale distributed storage systems due to their large data volume and low capital accumulation.

Different from traditional high-end servers, high-end memory and high-end processors, the distributed storage system of Internet companies is formed by a large number of low-cost and high-cost-effective ordinary PC servers connected through the network. The main reasons are the following three points

(1) Internet business is developing rapidly, and attention is paid to cost consumption, which prevents the storage system from relying on the traditional vertical expansion method, that is, buying minicomputers first, and then buying mid-sized or even mainframes when there is not enough. The distributed system at the back end of the Internet is required to support horizontal expansion, that is, to increase the overall processing capacity of the system by adding ordinary PC servers.

(2) Ordinary PC servers are cost-effective and have a high failure rate. Automatic fault tolerance needs to be implemented at the software level to ensure data consistency.

(3) In addition, with the continuous addition of servers, it is necessary to be able to realize automatic load balancing at the software level, so that the processing capacity of the system can be linearly expanded.

2. The importance of distributed storage

From a single-machine single-user to a single-machine multi-user, and then to the current Internet age, many changes have taken place in the application system. The distributed system is still a hot topic of discussion at present, so what does the distributed system bring us, or why do we have a distributed system?

(1) The cost performance of upgrading the processing capacity of a single machine is getting lower and lower;

Enterprises find that it is becoming less and less cost-effective to improve performance by replacing hardware for vertical expansion;

(2) There is a bottleneck in the processing capacity of a single machine;

At a certain point in time, a single processor has its own performance bottleneck, which means that even if you are willing to spend more money to buy computing power, you can’t buy it;

(3) For stability and usability considerations

If the one-click system is used, everything is OK when the machine is normal, and once something goes wrong, the system will be completely unusable. Of course, disaster recovery and backup solutions can be considered, and these solutions will allow the system to evolve into a distributed system;

(4) Inevitable requirements for the development of cloud storage and big data

Cloud storage and big data are applications built on distributed storage. Mobile terminals have limited computing power and storage space, and there is a strong need to share resources between multiple devices, which makes cloud storage applications such as network disks and photo albums popular very quickly. However, everything remains the same, the core of cloud storage is the back-end large-scale distributed storage system. Big data goes a step further, not only needing to store massive amounts of data, but also to analyze the data through appropriate computing frameworks or tools, and extract valuable parts of it. Without distributed storage, there can be no analysis of big data. Careful analysis will also reveal that distributed storage technology is an artifact of the Internet’s back-end architecture. After mastering this skill, it will become very easy to understand the essence of other technologies in the future.

3. Types and comparison of distributed storage

Distributed storage includes a wide variety, in addition to the traditional distributed file system, distributed block storage and distributed object storage, it also includes distributed database and distributed cache, etc., but the architecture is nothing more than three

A. Intermediate control node architecture

The architecture represented by HDFS (Hadoop Distribution File System) is a typical representative. In this architecture, some nodes NameNode stores management data (metadata), and another part of nodes DataNode stores business data. This type of server is responsible for managing specific data. This structure is like a company’s hierarchical organizational structure. Namenode is like a boss. It only manages the subordinate managers (datanode), and the subordinate managers manage the data on the local disk under the node.

In the figure above, if the client needs to read data from a file, it first obtains the location of the file from the NameNode (specifically which DataNode it is in), and then obtains the specific data from the NameNode. In this architecture, NameNode is usually an active and standby deployment (Secondary NameNode), while DataNode is a cluster composed of a large number of nodes. Since the access frequency and access volume of metadata are much smaller than that of data, NameNode usually does not become a performance bottleneck, while data in DataNode clusters can have copies, which can ensure high availability and disperse client requests. Therefore, through this distributed storage architecture, the carrying capacity can be increased by expanding the number of datanodes horizontally, that is, the ability of dynamic horizontal expansion is realized.

B. Completely Decentralized Architecture — Computing Mode

The architecture represented by Ceph is its typical representative. The difference from HDFS in this architecture is that there is no central node in this architecture. The client calculates the location where it writes data through a device mapping relationship, so that the client can directly communicate with the storage node, thereby avoiding the performance bottleneck of the central node.

As shown in the figure above, the core components in the Ceph storage system architecture include MON services, OSD services, and MDS services.

(1) The MON service is used to maintain the hardware logical relationship of the storage system, mainly online information such as servers and hard disks. The MON service guarantees the availability of its services through clustering.

(2) The OSD service is used to manage the disk and realize the real data reading and writing. Usually, one disk corresponds to one OSD service.

(3) MDS tracks file hierarchy and stores metadata only for the CephFS file storage system. Ceph block devices and RADOS do not require metadata and therefore do not require a Ceph MDS daemon

(4) RADOS: RADOS is a ceph storage cluster that includes the above three services. All data in Ceph exists in the form of objects, and no matter what type of data RADOS object storage will be responsible for saving these objects. The RADOS layer ensures that data is always consistent. To do this, data replication, failure detection and recovery must be performed, as well as data migration and the balancing of cluster nodes.

(5) RBD (block device): formerly known as RADOS block device, it provides reliable distributed and high-performance block storage disks to clients.

(6) CephFS: The Ceph file system provides a POSIX-compatible file system that uses the Ceph storage cluster to store user data

(7) Librados: The libRADOS library provides a convenient way to access the RADOS interface for PHP, RUBY, Java, Python, C++ and other languages

(8) RADOS GW: RGW provides object storage services, which allows applications to establish connections with Ceph object storage, and RGW provides RUSTFUL APIs compatible with Amazon S3 and openstack Swift

The general flow of the client accessing the storage is that the client will first enter through the RADOS GW after startup, pull the storage resource layout information from the MON service, and then calculate the location of the desired data based on the layout information and the name of the written data. (including specific physical server information and disk information), and then directly communicate with the location corresponding to the CephFS corresponding to the location information, and read or write data

C. Completely decentralized architecture — consistent hashing

The architecture represented by swift is its typical representative. Different from Ceph’s method of obtaining data location through calculation, another way is to obtain data location through consistent hashing. The way of consistent hashing is to make the device into a hash ring, and then map the hash value calculated according to the data name to a certain position of the hash ring, so as to realize the positioning of the data.

There are two mapping relationships in Swift. For a file, find the corresponding virtual node (one-to-one mapping relationship) through the hash algorithm (MD5), and then find the corresponding virtual node through the mapping relationship (two-dimensional array in the ring file). Devices (many-to-many mapping relationship), thus completing the mapping of a file stored on the device.

D. Comparison of distributed storage

So now the question is, if we want to choose distributed storage, which one should we choose? In fact, they each have their own advantages and usage scenarios, depending on the specific needs.

(1)HDFS

It is mainly used in big data storage scenarios and is a storage component in the Hadoop big data architecture. When HDFS was first designed, its application scenario was clearly defined, which is big data services. The main application scenarios are:

a. The storage performance of large files is relatively high, such as hundreds of megabytes and several gigabytes of large files. Because HDFS uses metadata to manage files, and metadata related directories and blocks and other information are stored in NameNode memory, the increase in the number of files will occupy a large amount of NameNode memory. If there are a large number of small files, it will take up a lot of memory space and cause the performance of the entire distributed storage to decline, so it is more appropriate to use HDFS to store large files as much as possible.

b. Suitable for low-write, multiple-read business. As far as big data analysis business is concerned, its processing mode is to write once, read multiple times, and then perform data analysis. HDFS has relatively high data transmission throughput, but the data read delay is relatively poor, which is not suitable for frequent data. write.

c. HDFS adopts a multi-copy data protection mechanism, and the reliability of data can be guaranteed by using an ordinary X86 server. It is not recommended to use it in a virtualized environment.

( 2 ) Ceph

At present, the most widely used open source distributed storage system has been supported by many manufacturers, and the distributed storage of many hyper-converged systems is deeply customized based on Ceph. Moreover, Ceph has become the “standard configuration” of LINUX system and OpenStack to support their respective storage systems. Ceph can provide object storage, block device storage and file system storage services. The feature of supporting three different types of storage services at the same time is rare in distributed storage systems.

a. Ceph does not use the metadata addressing scheme of HDFS, and uses the CRUSH algorithm, with balanced data distribution and high parallelism. Moreover, in terms of supporting block storage features, the data can have strong consistency, and the experience of traditional centralized storage can be obtained.

b. Object storage service, Ceph supports Swift and S3 API interfaces. In terms of block storage, it supports thin provisioning, snapshots, and clones. In terms of file system storage services, it supports Posix interfaces and snapshots. However, the performance of Ceph-supported files is comparable to that of other distributed storage systems. The deployment is a little more complicated and the performance is a little weaker. Ceph is generally applied to block and object storage.

c. Ceph is a decentralized distributed solution, which needs to be planned and designed in advance, and requires a relatively high level of ability for the technical team. Especially when Ceph expands, due to its balanced data distribution, it will lead to a decline in the performance of the entire storage system

( 3 )Swift

Mainly for object storage. Similar to the object storage service provided by Ceph. It is mainly used to solve the problem of unstructured data storage. The main difference between it and Ceph’s object storage service is.

a. When the client accesses the object storage system service, Swift requires the client to access the Swift gateway to obtain the data. Ceph uses an OSD (object storage device) running on each storage node to obtain data information, without a single entry point, which is more flexible than Swift.

b. In terms of data consistency, Swift’s data is eventually consistent, and it is more efficient in processing massive data, but it is mainly for object storage services that do not require high data consistency but require relatively high data processing efficiency. And Ceph is always strong consistency across clusters. The main application scenario, in OpenStack, the object storage service uses Swift instead of Ceph.

3. Analysis of Distributed Theory

1. Consistency and availability

Due to the existence of exceptions, distributed storage systems often store multiple copies of data redundantly when designing, and each copy is called a copy). In this way, when a node fails, data can be read from other replicas. It can be considered that the copy is the only means of fault-tolerant technology in the distributed storage system. Due to the existence of multiple copies, how to ensure the consistency between copies is the theoretical core of the entire distributed system.

The word data consistency can often be seen in daily development or in various articles. We often hear that something has inconsistent data, which has caused a certain loss, so fix it as soon as possible. How many kinds of consistency are there?

a. Time consistency: The data of all data components is required to be completely consistent at any time;

b. Consistency of things: transaction consistency can only exist before the transaction starts and after the transaction is completed. During the transaction, the data may be inconsistent. For example, A transfers 100 yuan to B, A deducts 100, and B adds 100. If they can ensure that their accounts are correct before the transaction starts and after the transaction is completed, then this is transaction consistency. However, during the transaction process, it may happen that A deducts 100 yuan and B does not add 100 yuan. This is the inconsistency

c. Multiple different stand-alone transactions are involved in the application program, and the data is completely consistent only before and after all stand-alone transactions are completed.

Only relying on these three types of consistency is difficult to describe clearly in some complex situations in practice. Therefore, we have introduced the consistency model. Here we briefly introduce several common consistency models from strong to weak.

A. Linear consistency

Also known as strong consistency, it can be seen as having only one single-core processor, or as having only one copy of data, and all operations are atomic.

As shown in the figure above, for events e1 and e2, if the response of event e1 is before the invoke of event e2, we say e1 happens before e2.

For the same thread, the previous event must happen before the subsequent event. But for two events on different threads, there will be a happen before relationship between them only if there is no intersection on the timeline. For those events that overlap, such as event2 and event3 in the figure below, there is no happen before relationship between them. For the legal sequential execution process we are looking for, the order of the two can be arbitrary.

B. Sequential consistency

Sequential consistency is weaker than strict consistency. The write operation to the variable does not have to be seen instantly, but the write operation to the variable by different processors must be seen in the same order on all processors, where the processor can be replaced by a different node in a distributed system .

Suppose there are two threads A and B executing concurrently. Among them, thread A is composed of 3 operations, and their order in the program is: A1->A2->A3. Thread B also has 3 operations, and their order in the program is: B1->B2->B3. Suppose if The effect in the sequentially consistent model is as shown in the previous two figures.

C. Causal consistency

Causal consistency is a consistency model that is weaker than sequential consistency. Sequential consistency requires that the order of all operations must be in the order of a single processor (node), while causal consistency only needs to satisfy causal operations. Sequential consistency is sufficient.

Simply put, if someone asks you a question, then you give the answer, and the two are causal, but if you give the answer before the question, then this violates the causal relationship. To give a simple example, if node 1 updates data A, node 2 reads data A, and updates data B, here data B may be calculated based on data A, all have causality, but if node 3 sees If B is updated first, then A is updated, then the causal consistency will be destroyed.

D, eventual consistency

In fact, except for strong consistency, other consistency can be regarded as final consistency, but many specific consistency models have been derived according to the different requirements of different consistency models. Of course, the simplest final consistency does not need to pay attention to the order of intermediate changes, but only needs to be consistent at a certain point in time. It’s just that this certain point in time needs to be measured according to different systems and different businesses. Before eventual consistency is achieved, any value may be returned, and no order guarantees are made for these values.

E. Availability

Availability refers to “Reads and writes always succeed”, that is, the service is always available and the response time is normal. For an available distributed system, every non-faulty node must respond to every request. Therefore, generally when we measure the availability of a system, we calculate it by downtime.

Availability Classification

Available level (%)

Years of Tolerable Downtime

fault-tolerant availability

99.9999

<1 min

high availability

99.999

<5 min

Availability with failover capability

99.99

<53 min

high availability

99.9

<8.8h

Product availability

99

<43.8 min

Usually when we describe the availability of a system, we say that Taobao’s system availability can reach 5 nines, which means that its availability level is 99.999%, that is, the annual downtime does not exceed (1–0.99999)36524*60 = 5.256 min , which is an extremely high requirement.

Good usability mainly means that the system can serve users well, and there will be no bad user experience such as user operation failure or access timeout. In a distributed system, many upstream and downstream systems are designed, such as load balancing, WEB servers, application codes, database servers, etc. The instability of any node can affect availability

F. Consistency of distributed systems

In July 2000, Professor Eric Brewer of the University of California, Berkeley proposed the CAP conjecture at the ACM PODC meeting. 2 years later, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology proved CAP theoretically. Afterwards, the CAP theory officially became a recognized theorem in the field of distributed computing.

Overview of CAP theory: A distributed system can only satisfy at most two of the three items of consistency ( Consistency ), availability ( Availability ) and partition tolerance ( Partition tolerance ) at the same time.

The consistency in CAP that needs to be pointed out in particular is all nodes see the same data at the same time, that is, linear consistency.

Consistency must be viewed in two dimensions:

(1) From the perspective of the client, when multiple processes access concurrently, the non-distributed database requires that the updated data can be seen by subsequent visits, and all are strongly consistent;

(2) From the perspective of the server, how to distribute the updated data to the entire system as soon as possible and reduce the time window for achieving final consistency is a very important aspect to improve system availability and user experience.

Refer to the following formula:

N — the number of copies of the data

W — the number of nodes that need to be guaranteed to be written when updating data

R — the number of nodes that need to be read when reading data

(1) If W+R>N, the writing node and the reading node overlap, which means strong consistency. For example, for a typical relational database with one primary and one standby synchronous replication, N=2, W=2, R=1, no matter whether the data in the primary database or the standby database is read, it is consistent.

(2) If W+R<=N, it is weak consistency. For example, for a relational database with one primary and one standby asynchronous replication, N=2, W=1, R=1, if you read the standby database, you may not be able to read the updated data of the main database, so it is weakly consistent sex.

For a distributed system. P is a basic requirement. Among the three CAPs, only CA can make a trade-off, and try to improve P.

Contains two systems: CP without A**, AP without C**

The Hdfs, Ceph, and Swift we mentioned above all belong to the category of CP without A, but it is not completely without A. In order to achieve a certain availability, the number of replicas is generally set to N>=3. Different combinations of N, W, and R are to strike a balance between usability and consistency to suit different application scenarios.

Then, in real life, there are also some cases of AP without C. As shown in the CAP diagram, most of them are Nosql, CoachDB, and Cassandra databases. What are the scenarios?

In fact, it is an occasion that does not require correctness, such as the scene of buying a mobile phone in a certain meter or the scene of buying a train ticket in 12306. Maybe a few seconds ago when you browsed the product, the page indicated that it was in stock. When you finished selecting the product and were ready to place an order , the system prompts you that the order failed and the product is sold out. This is actually to ensure that the system can serve normally in terms of A (availability), and then make some sacrifices in terms of data consistency.

2. Data distribution

Distributed systems are different from traditional stand-alone systems in that they can distribute data to multiple nodes and achieve . There are two main ways of data distribution, one is hash distribution, such as consistent hash, which means that the system is

Amazon’s Dynamo system, Openstack’s Swift system; another method is sequential distribution, that is, the data on each table is ordered according to the primary key as a whole, representing the system as Google’s Bigtable system. Bigtable divides a large table into ordered ranges according to the primary key, and each ordered range is a sub-table.

A. Hash distribution (Swift)

The hash function of the hash function is very good, and the hash method can distribute the data evenly in the cluster. Moreover, the meta-information that needs to be recorded in the hash method is also very simple. Each node only needs to know the calculation method of the hash function and the number of modulo servers to calculate which machine the processed data should belong to.

However, finding a hash function with good hashing properties is difficult. This is because, if hashed according to the primary key, the data under the same user id may be distributed to multiple servers, which will make it difficult to operate multiple records under the same user id at one time; if hashed according to the user id , prone to the “data skew” (data skew) problem, that is, some large users have a large amount of data, no matter how large the cluster is, these users are always processed by one server.

There are generally two ways to deal with the problem of large users. One way is to manually split, that is, mark the large users in the system offline (for example, run a MapReduce job), and split them into multiple machines according to the data volume of these large users. on the server. This is equivalent to special treatment for these large users on the basis of hash distribution;

Another method is automatic splitting, that is, the data distribution algorithm can be dynamically adjusted to automatically split the data of large users to multiple servers. It contains two algorithms.

One is the traditional hash algorithm. When accessing data, the hash value is first calculated, and then the metadata server is queried to obtain the server corresponding to the hash value. Under this algorithm, the online and offline of the server will cause a large amount of data migration, which is not suitable for production.

Another consistent hash (Distributed Hash Table, DHT) algorithm. The idea of ​​the algorithm is as follows: assign a random token to each node in the system, and these tokens form a hash ring. When performing data storage operations, first calculate the hash value of the Key (primary key), and then store it in the node where the first token greater than or equal to the hash value in the clockwise direction is located. The advantage of consistent hashing is that when a node is added/deleted, it only affects adjacent nodes in the hash ring, and has no effect on other nodes.

As shown in the figure above, the characteristics of the algorithm itself can make the disk be divided into more uniform virtual partitions. Each virtual partition is a node on the hash ring, and the entire ring is an interval from 0 to the maximum value of 32 bits. And connected end to end, when the hash value of the data (or data name) is calculated, it must fall into a certain interval of the hash ring, and then in a clockwise direction, a node must be found. Then this node is where the data is stored. It can be seen that if there is only one node, up to 32 nodes have not been found yet, then the data is on the first unique node.

The entire data location is based on the above-mentioned consistent algorithm to redirect the request to the device for processing

(1) On the object storage, a location identifier is formed by three names of account name/container name/object name, and an integer number can be calculated through the unique identifier;

(2) In terms of storage devices, Swift builds a virtual partition table. The size of the table is determined when the cluster is created (usually hundreds of thousands). This table is actually an array;

(3) The integer value and this array can determine the position of the integer in the array through the consistent hash algorithm.

In principle, the consistency algorithm can ensure the balance and monotonicity of the data, avoid the dispersion of the data, effectively ensure the consistency of the data, and make the load be mapped to a specific buffer area as much as possible.

Because the consistent hash algorithm is prone to data skew problems due to uneven distribution of nodes when there are too few service nodes. Therefore, in practical applications, the number of virtual nodes is usually set to a value greater than 32, so even a few service nodes can achieve relatively uniform data distribution.

B. Sequential distribution (Bigtable).

Hash destroys the order of data, only supports random read operations, and cannot support sequential scans. Some systems can make compromises at the application layer. For example, Internet applications often split data according to users and distribute data through the hash method. The data of the same user is distributed to the same storage node, allowing the data of the same user to be distributed to the same storage node. Sequential scans , and operations across multiple users are resolved by the application layer. In addition, this method may cause the problem that the amount of data of some users is too large . Since the user’s data is limited to one storage node, the multi-machine parallel processing capability of the distributed storage system cannot be utilized.

Sequential distribution is relatively common in distributed table systems (Bigtable). The general method is to divide large tables into continuous ranges in sequence. Each range is called a sub-table, and the master control server is responsible for allocating these sub-tables according to a certain strategy. to the storage node.

As shown in the figure, the primary key range of the user table (User table) is 1 to 7000, which is divided into multiple sub-tables in the distributed storage system, corresponding to the data ranges of 1 to 1000, 1001 to 2000, … 6001 to 7000. Among them, the Meta table is to support a larger cluster scale. It divides the original one-layer index structure into two layers, and uses the Meta table to maintain the node where the User sub-table is located, thereby reducing the burden on the Root node.

Sequential distribution is similar to the B+ tree data structure. Each subtable is equivalent to a leaf node. As data is inserted and deleted, some subtables may become large, and some may become small, and the data distribution is uneven. If sequential distribution is adopted, the splitting and merging of sub-tables needs to be considered during system design, which will greatly increase the complexity of the system.

C, CRUSH distribution

The full name of the CRUSH algorithm is Controlled, Scalable, Decentralized Placement of Replicated Data. Strictly speaking, the Crush algorithm is actually based on the Hash algorithm. It’s just that the method of mapping is different from consistent hashing. We use the process of Ceph distribution to illustrate.

The process of Ceph distributing data: first calculate the Hash value of data x and take the remainder of the result and the number of PGs to obtain the PG number corresponding to data x. Then, the PG is mapped to a group of OSDs through the CRUSH algorithm. Finally, store the data x in the OSD corresponding to PG. Note: The full name of PG is Placement Group.

This process includes two mappings, the first is the mapping of data x to PG. If PG is used as a storage node, then the traditional Hash algorithm is the same. The difference is that PG is an abstract storage node, it will not increase or decrease as physical nodes join or leave, so the mapping of data to PG is stable.

Taking Dynamo as an example, in this process, PG plays two roles: the first role is to divide data partitions. The data interval managed by each PG is the same, so the data can be evenly distributed to the PG; the second role is to act as the Token in Dynamo, that is, to determine the location of the partition. In fact, this is the same thing as the fixed number of partitions in Dynamo and the principle of maintaining the number of partitions equal to the number of virtual nodes.

Taking Ceph as an example, the CRUSH algorithm calculates the data storage location through two mappings to determine how to store and retrieve data. CRUSH enables Ceph clients to communicate directly with OSDs rather than through a centralized server or proxy.

Through algorithmically determined methods of data storage and retrieval, Ceph avoids single points of failure, performance bottlenecks, and physical limitations to its scalability. CRUSH requires a map of the cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs, and the data is evenly distributed in the cluster.

3. Copy

In order to ensure the high reliability and high availability of the distributed storage system, data is generally stored in multiple copies in the system. When the storage node where a certain replica is located fails, the distributed storage system can automatically switch the service to other replicas to achieve automatic fault tolerance. A distributed storage system synchronizes data to multiple storage nodes through a replication protocol and ensures data consistency among multiple copies.

A. Strong synchronous replication

The client sends the write request to the primary copy, and the primary copy copies the write request to other standby copies. A common practice is to synchronize the operation log (Commit Log). The primary copy first synchronizes the operation log to the standby copy, the standby copy plays back the operation log, and notifies the primary copy after completion. Then, the master copy modifies the local machine, and waits until all operations are completed before notifying the client that the write is successful. The replication protocol in the figure below requires the master and backup to be synchronized successfully before returning the client to write successfully. This protocol is called a strong synchronization protocol.

Assume that the number of all replicas is N, and N > 2, that is, the number of backup replicas is greater than 1. Then, when the strong synchronization protocol is implemented, the primary copy can concurrently send the operation log to all backup copies and wait for a reply. As long as at least one backup copy returns successfully, it can reply to the client that the operation is successful. The advantage of strong synchronization is that if the primary copy fails, at least one backup copy has complete data, and the distributed storage system can automatically switch the service to the latest backup copy without worrying about data loss.

B. Asynchronous replication

The replication method corresponding to strong synchronization is asynchronous replication. In the asynchronous mode, the primary copy does not need to wait for the response from the standby copy, but only needs the local modification to be successful to notify the client that the write operation is successful. In addition, the primary replica pushes client modifications to other replicas through asynchronous mechanisms, such as a separate replication thread. The advantage of asynchronous replication is that the system has good availability, but the consistency is poor. If the primary copy fails unrecoverably, the last part of the update operation may be lost.

C, NWR Copy

A distributed storage system may also use a replication protocol (Replicated-write protocol) based on writing to multiple storage nodes. For example, the NWR replication protocol in the Dynamo system, where N is the number of copies, W is the number of copies for write operations, and R is the number of copies for read operations.

In the NWR protocol, multiple copies no longer distinguish between primary and backup. The client writes data to the W copies and reads the R copies according to a certain strategy. As long as W+R > N, it is guaranteed that at least one of the replicas read contains the latest update. However, the problem with this protocol is that the order of operations on different replicas may not be consistent, and conflicts may occur when reading from multiple replicas. This method is relatively rare in actual systems and is not recommended.

4. Distributed protocol

There are many distributed protocols, among which two-phase commit and Paxos protocol are the most representative. Two-phase commit protocol ( 2PC ) or three-phase commit ( 3PC ) is used to ensure the atomicity of operations across multiple nodes, that is, operations across multiple nodes are either executed successfully on all nodes, or all fail. The Paxos protocol is used to ensure that multiple nodes agree on a certain vote (eg which node is master).

A. Two-phase commit

The algorithm idea of ​​the two-phase submission can be summarized as follows: Participants will notify the coordinator of the success or failure of the operation, and then the coordinator will decide whether each participant should submit the operation or abort the operation based on the feedback information of all participants.

(1) Request phase (voting):

Transaction coordinator The coordinator notifies the transaction participants to prepare to submit or cancel the transaction, and then enters the voting process. During the voting process, the participants will inform the coordinator of their decision: agree (the transaction participant’s local execution is successful) or cancel (the transaction participant’s local execution fails).

(2) Submit phase (execution):

In this stage, the coordinator will make a decision based on the voting results of the first stage: submit or cancel, if and only if all participants agree to submit the transaction, the coordinator will notify all participants to submit the transaction, otherwise the coordinator will notify All participants cancel the transaction and the participants will perform corresponding operations after receiving the message from the coordinator.

(3) Problems that cannot be solved by two-phase commit

A) If a participant does not vote for a long time, the whole stage will be in a waiting state, but this can be solved by a timeout mechanism

B) When the coordinator makes mistakes and the participants make mistakes at the same time, the two phases cannot guarantee the integrity of transaction execution.

Consider that the coordinator crashes after sending a commit message, and the only participant that received the message crashes at the same time.

Then even if the coordinator generates a new coordinator through the election agreement, the status of this transaction is uncertain, and no one knows whether the transaction has been committed.

B. Three-phase commit

The three-phase commit has three phases: CanCommit, PreCommit, and DoCommit

(1) The CanCommit phase is approximately equal to the two-phase request phase; DoCommit is approximately equal to the two-phase commit phase. In the preparation stage, PreCommit is a buffer, which ensures that the state of each participating node is consistent before the final commit stage.

(2) The three-phase commit inserts a preparation phase (PreCommit) between the first phase and the second phase of the two-phase commit, so that in the original two-phase commit, after the participants vote, the coordinator crashes or errors , and the problem of potentially considerable delays caused by participants being in an “indeterminate state” where they cannot know whether to commit or abort is resolved.

(3) Problems that cannot be solved by three-phase commit

If after entering PreCommit, the coordinator sends a commit request, assuming that only one participant receives and performs the commit operation, while other participants do not receive the network interruption, they will choose abort operation according to 3PC, and the system state is inconsistent at this time sex.

C, Paxos protocol

To talk about Paxos, we must first start with the Byzantine issue. The background of the story is this: Byzantium is located in Istanbul, Turkey, and is the capital of the Eastern Roman Empire. Due to the vast territory of the Byzantine Roman Empire at that time, for the purpose of defense, each army was separated far away, and the generals could only send messages by messengers. During the war, all the generals in the Byzantine army must reach a consensus to decide whether they have a chance of winning before attacking the enemy’s camp. However, the military may have traitors and enemy spies, and these traitorous generals can disrupt or sway the decision-making process. At this time, in the case that some members are known to rebel, how can the remaining loyal generals reach a consensus without being influenced by traitors? This is the Byzantine general problem.

We deny the assumption and give the definition of the non-Byzantine model:

(1) The behavior of the consistency module can be executed at any speed, allowing the operation to fail, and may restart and run again after failure;

(2) Consistency modules send information communication in an asynchronous manner, the communication time can be arbitrarily long, information may be lost during transmission, and the same information is also allowed to be sent repeatedly, and the order of multiple information can be arbitrary. But there is one thing: information is not allowed to be tampered with.

From this, we come to the basic two phases of Paxos: Prepare phase and Accept phase. The logic of these two phases is very complex and is the basis of the mutual trust algorithm. This article does not intend to do an in-depth interpretation. Interested readers can refer to the book “Blockchain Algorithm”.

D. Raft protocol

Raft is the abbreviation of Replicated And Fault Tolerant, a simplified version of Paxos.

In a distributed system, what is the most correct posture for us to improve the robustness, availability and data security of the system? Of course, it relies on multiple backups, multiple backups of services, multiple backups of data, and the removal of single points to ensure that even if some related components hang up, the system can still serve healthily.

It is good to remove a single point and have no fixed authority, but the problem is, whose opinion shall prevail. In an environment where information may be lost, this is a very difficult task (it can be seen how important communication is. )!

When it was proposed in 1990, few people understood it. After many times of simplification and re-interpretation by the author, including the practice of re-creation and re-interpretation by Google and other teams, more than ten years have passed before it gradually becomes the de facto standard and is understood and accepted by everyone. But until now, few people can understand the extremely abstract Paxos algorithm.

From the road to simplicity! This is an eternal truth. The goal of Raft is to build a distributed consensus protocol that is easy to understand and build, and to ensure that the theory is correct on an easy basis.

If the Raft protocol is read as it is, it still takes some time. This article uses a popular method to explain it to everyone

The general principle of Raft is an algorithm based on the idea of ​​leader selection. Each node in the cluster has three possible roles:

( 1 ) leader

The entrance to client communication, the initiator of internal data synchronization, a cluster usually has only one leader node

( 2 ) follower:

Non-leader nodes passively accept data requests from the leader

( 3 ) candidate:

A temporary role that only exists in the leader election phase. If a node wants to become a leader, it initiates a voting request and becomes a candidate at the same time. If the election is successful, it becomes candidate, otherwise it returns to follower.

The algorithm consists of two processes: leader election and log replication:

(1) Election process: (assuming there are 5 nodes, S1~S5)

a. In the initial state, everyone is an equal follower, so who is the follower, you must choose a boss. Everyone is ready to move, and each follower maintains a random timer inside;

b. If no one actively contacts it when the timer time is up, it will become a candidate and send a voting request (RequestVote) to others, assuming that S1 and S3 become candidates

c. For candidates with the same conditions, followers adopt a first-come, first-served voting strategy. If more than half of the followers think that he is suitable to be a leader, then congratulations, a new leader has been born, assuming that S3 becomes the new leader of the big brother;

d. Unfortunately for S1, no one is willing to choose this tragic candidate, so it can only honestly change back to the status of a follower;

e. Similarly, if there is no contact from the elder brother within the timer period, it is likely that the elder brother has knelt down at this time, as shown in the picture below, and all the younger brothers start to move around again, and a new round of (term) elections begins.

(2) Log replication: (assuming there are 5 nodes, S1~S5)

a. The leader acts as the coordinator in the distributed transaction, and generates a two-phase commit (two-phase commit) every time there is a data update. When the leader receives a request for data operation, it does not rush to update the local data (the data is persisted on the disk), but generates the corresponding log, and then broadcasts the request for generating the log to all followers;

b. Each follower has two choices after receiving the request: one is to obey the leader’s command, also write to the log, and then return success; Should not obey the leader’s command, return false;

c. At this time, if more than half of the followers have successfully written the log, then the leader will start the submission of the second stage: formally write the data, and then broadcast it to the followers, and the followers will also choose to write or not write according to their own conditions. The result is returned to the leader, and finally the data of all nodes reach a consensus.

d. If more than half of the followers in any of these two stages return false or do not return at all, then the distributed transaction is unsuccessful. Although there will be no rollback process at this time, since the data will not be actually submitted on most nodes, it will be overwritten in the subsequent process

The Raft protocol guarantees the strong leadership of the _leader_, and the client _reads and writes_through the _leader__, which has a high consistency, but some students will ask, what is the value of distribution? How can load balance it? In practice, we adopt the Multi Raft architecture, combined with applications, different applications elect different leader nodes for load balancing.

5. Deployment across computer rooms

In distributed systems, the problem of cross-computer rooms has always been a long-standing problem. The network delay between computer rooms is large and unstable. Cross-computer room problems mainly include two aspects: data synchronization and service switching. There are three cross-computer room deployment schemes: overall cluster switching, single cluster cross-computer room, and Paxos election master copy. Introduce respectively below.

A. Overall switching of the cluster

Overall cluster switching is the most common solution. As shown in the figure, suppose a system is deployed in two computer rooms: computer room 1 and computer room 2. The two computer rooms remain independent, and each computer room deploys a separate master control node, and each master control node has a backup node. When the master control node fails, it can automatically switch the backup node in the computer room to the master control node to continue to provide services. In addition, the two computer rooms have deployed the same number of copies. For example, the copies of data slice A stored in computer room 1 are A11 and A12, and the copies stored in computer room 2 are A21 and A22. At a certain moment, computer room 1 is the master computer room, and computer room 2 is the standby computer room.

The data synchronization mode between computer rooms may be strong synchronization or asynchronous. If the asynchronous mode is adopted, then the data in the standby computer room always lags behind the primary computer room. When the main computer room fails as a whole, there are two options: either switch the service to the backup computer room and endure the risk of data loss; or stop the service until the main computer room recovers. Therefore, if the data synchronization is asynchronous, then the switch between the main and standby computer rooms is often manual, allowing users to choose “lose data” or “stop service” according to the characteristics of the business.

If the strong synchronization mode is adopted, then the data in the standby computer room is consistent with that in the primary computer room. When the main computer room fails, in addition to manual switching, an automatic switching method can also be used, that is, the service of the main computer room is detected through the distributed lock service. When the main computer room fails, the backup computer room is automatically switched to the main computer room.

B. A single cluster across computer rooms

Deploying a single cluster to multiple computer rooms allows the master copies of different data fragments to be located in different computer rooms, as shown in Figure 3–11. Each data fragment is in computer room 1 and computer room 2, and contains 4 copies in total, among which A1, B1, and C1 are primary copies, A1 and B1 are in computer room 1, and C1 is in computer room 2. There is only one master control node in the entire cluster, which needs to maintain communication with all the working nodes in computer room 1 and computer room 2. When the master control node fails, the distributed lock service will detect it and switch the backup node in computer room 2 to the master control node.

If this deployment method is adopted, the master control node needs to consider the information of the computer room when performing data distribution, that is, try to distribute multiple copies of the same data fragment to multiple computer rooms, so as to prevent the failure of a single computer room from affecting Normal service.

C, Paxos elects the primary copy

If the Paxos protocol is used to select the primary copy, then multiple copies of each data shard constitute a Paxos replication group. As shown in the figure, B1, B2, B3, and B4 constitute a replication group. At a certain moment, B1 is the master copy of the replication group. When B1 fails, other copies will try to switch to the master copy. The Paxos protocol guarantees that only one copy will success. In this way, there is no need to maintain a lease between the master control node and the working nodes, and the failure of the master control node will not affect the working nodes. Its advantage is that it can reduce the dependence on the master control node. The disadvantage is that the engineering complexity is too high, and it is difficult to simulate all abnormal situations offline.

4. Distributed file system

There are two main functions of the distributed file system: one is to store Blob type data such as documents, images, and videos; the other is to serve as the persistence layer of the distributed table system.

1. Google File System (GFS)

GFS, Big Table, and Map Reduce are called Google’s troika and are the cornerstone of many basic services.

GFS was proposed in 2003. It is a distributed file system. It is very different from the assumptions of many previous distributed systems. It is suitable for the following scenarios

(1) Considering that component failure is a normal state, it provides a fault tolerance mechanism and automatic load balancing, so that the distributed file system can run on cheap machines;

(2) For the storage of large files, the main workload of the system is large-scale streaming reading, and the writing operation is mainly written in append mode, and there are few random writes;

(3) Write once, read many times, such as web page storage on the Internet

GFS files are divided into fixed-size data blocks (chunk), and a 64-bit globally unique chunk handle is allocated by the master server when it is created. CS stores chunks on disk in the form of ordinary Linux files. In order to ensure reliability, chunks are replicated multiple times in different machines, and the default is three.

The master server maintains system metadata, including file and chunk namespaces, mapping between files and chunks, and chunk location information. It is also responsible for the global control of the entire system, such as chunk lease management, garbage collection of useless chunks, chunk replication, etc. The master server will periodically exchange information with CS through heartbeat.

The client is the access interface provided by GFS to the application program. It is a set of dedicated interfaces that do not follow the POSIX specification and are provided in the form of library files. When the client accesses GFS, it first accesses the main control server node to obtain the information of CSs interacting with it, and then directly accesses these CSs to complete the data access work.

It should be noted that the client in GFS does not cache file data, but only caches metadata obtained from the master server, which is determined by the application characteristics of GFS. There are two main applications of GFS: MapReduce and Bigtable. For MapReduce, the GFS client uses sequential reading and writing, and there is no need to cache file data; while Bigtable, as a distributed table system, implements a set of caching mechanisms internally. In addition, how to maintain the consistency between the client cache and the actual data is an extremely complex issue.

It can be seen that Hadoop’s HDFS is actually a simplified version of GFS, which is the product of Dr. Cutting’s “cottage” GFS. It is the product of stealing fire.

2. Taobao File System (TFS)

Internet applications often need to store documents, pictures, videos, etc. uploaded by users, such as Facebook albums, Taobao pictures, Dropbox documents, etc. Documents, pictures, and videos are generally called Blob data. The characteristic of the Blob file system is that after data is written, it is basically read-only, and there are few update operations. This is the main feature of the Taobao file system (TFS).

The TFS architecture borrows from GFS, but it is very different from GFS.

(1) TFS does not maintain the file directory tree inside, and the flat data organization structure can map the file name to the physical address of the file, which simplifies the file access process;

(2) It is specially optimized for the random read and write access performance of massive small files, which meets Taobao’s demand for small file storage, and is widely used in various applications of Taobao;

(3) The HA architecture and smooth expansion are adopted to ensure the availability and scalability of the entire file system.

A TFS cluster consists of two NameServer nodes (one primary and one standby) and multiple DataServer nodes. NameServer monitors the status of DataServer through heartbeat. NameServer is equivalent to Master in GFS, and DataServer is equivalent to ChunkServer in GFS. NameServer is divided into active NameServer and standby NameServer, only the active NameServer provides services, when the active NameServer fails, it can be detected by the heartbeat daemon, and the service will be switched to the standby NameServer. Multiple dsp processes will run on each DataServer, and one dsp corresponds to one mount point, which generally corresponds to an independent disk, so as to manage multiple disks.

In TFS, a large number of small files (actual data files) are merged into one large file (this is more optimized and improved than HDFS). This large file is called a block (Block), and each Block has a unique number in the cluster (block ID), a file can be uniquely determined by <block ID, block offset>. The actual data of Block in TFS is stored in DataServer, the size is generally 64MB, and three copies are stored by default, which is equivalent to the chunk in GFS. The application client is the access interface provided by TFS to the application program. The application client does not cache file data, but only caches the metadata of the NameServer.

3. Facebook Haystack file system

By 2014, Facebook probably had more than 400 billion pictures, with a total size of 30PB. Through calculation, the average size of each photo can be found to be 30PB/260GB, which is about 100KB. The number of new photos added by users every week is 1 billion (total size is 60TB), and the average number of new photos added per second is 109/7/40000 (based on 40000s per day), about 3800 write operations per second, and the peak value of read operations can reach Millions of times per second.

NAS-based storage was used in the backend of Facebook photo albums early, and photo files in NAS were mounted through NFS to provide services. Later, due to performance and cost considerations, Facebook Haystack was independently developed to store album data.

Similar to TFS, Facebook Haystack’s new architecture mainly solves files with too many image access IO times. The main idea is that multiple logical files share the same physical file. The flow chart of Haystack architecture and read request processing is as follows

Haystack architecture has three main parts: Haystack Directory, Haystack Store and Haystack Cache. Haystack Store is a physical storage node that organizes storage space in the form of physical volumes. Each physical volume is generally large, such as 100GB, so there are only 100 physical volumes for 10TB of data. Each physical volume corresponds to a physical file, therefore, the meta information of the physical file on each storage node is very small. Physical volumes on multiple physical storage nodes form a logical volume for backup. Haystack Directory stores the corresponding relationship between logical reels and physical reels. Assuming that the size of each reel is 100GB, the number of corresponding relations is 20PB / 100GB = 0.2MB, and the occupied memory can be ignored. Haystack cache is mainly used to solve the problem of over-reliance on CDN providers, and provides cache services for recently added images.

The general flow of the Haystack image read request is: when the user visits a page, the Web Server requests the Haystack Directory to construct a URL: http://<CDN>/<Cache>/<Machine id>/<Logical volume,Photo>, and follow-up according to The information of each part visits the CDN, Cache and the back-end Haystack Store storage node in turn. Haystack Directory can omit some parts when constructing URL so that users can directly request Haystack Cache without going through CDN. The request received by Haystack cache includes two parts: the request from the user’s browser and the request from the CDN. Haystack cache only caches the request sent by the user’s browser and requires the requested Haystack Store storage node to be writable. Generally speaking, the storage nodes of Haystack Store reach the upper limit of capacity after being written for a period of time and become read-only. Therefore, the pictures of writable nodes are newly added pictures, which are hot data.

Haystack’s write request (picture upload) processing flow is: Web Server first requests Haystack Directory to obtain the id of the picture and the writable logical volume, and then writes the data to each corresponding physical volume (the number of backups is generally 3).

File systems such as Facebook Haystack and Taobao TFS are generally called Blob file systems. They both solve the problem of a large number of small image files, so the architecture is very similar, the differences include

(1) Selection of logical volume size, for example, Haystack chooses a logical volume size of 100GB, and the block size in TFS is generally 64MB;

(2) Haystack uses RAID 6, and the underlying file system uses XFS with better performance. Taobao later eliminated the RAID mechanism, and the file system uses Ext3;

(3) Haystack uses Akamai & Limelight’s CDN service, while Taobao has used its own CDN. Of course, Facebook is also considering self-built CDN.

4. CDN content distribution network

The full name of CDN is Content Delivery Network, that is, content distribution network. Its purpose is to distribute website content to the ‘edge’ of the network closest to the user by adding a new layer of network architecture to the existing Internet. To achieve the following three purposes

(1) Solve the problem of access delay caused by distribution, bandwidth, and server performance, and is applicable to scenarios such as site acceleration, on-demand, and live broadcast. It enables users to obtain the required content nearby, solves the situation of Internet network congestion, and improves the response speed and success rate of users’ access to websites.

(2) Delay control is undoubtedly an important indicator of modern information technology. The intention of CDN is to reduce resources as much as possible to ensure the continuity of information smoothly under the conditions of forwarding, transmission, and link jitter.

(3) CDN plays the role of escort and accelerator, triggering information and reaching every user faster and more accurately, bringing a more extreme user experience.

As shown in the figure below, DNS no longer returns the IP of the source server to the user when resolving the domain name, but returns the IP of an edge node selected by the smart CDN load balancing system. The user uses this IP to access the edge node, and then the node obtains the source server IP through its internal DNS resolution and sends a request to obtain the page required by the user. If the request is successful, the edge node will cache the page, and the next time the user visits, it can directly Read without having to hit the origin server each time.

Taobao’s CDN architecture is self-developed to support user shopping, especially the massive image requests during the “Double 11” Singles’ Day. The images are stored in the TFS cluster in the background, and the CDN system caches these images to the edge closest to the user node. CDN uses two levels of Cache: L1-Cache and L2-Cache. When a user accesses a picture on Taobao, it is scheduled to a certain L1-Cache node through the global scheduling system (Global Load Balancing). If the L1-Cache hits, then directly return the image data to the user; otherwise, request the L2-Cache node, and cache the returned image data to the L1-Cache node. If the L2-Cache hits, return the picture data directly to the L1-Cache node; otherwise, request the picture server cluster of the source server. Each picture server is a web server running Nginx, and it also caches pictures locally. Only when the local cache does not hit, it will request the back-end TFS cluster. The picture server cluster and TFS cluster are deployed in the same data center .

For each CDN node, its architecture is shown in Figure 4–11. It can be seen from the figure that each CDN node implements load balancing through LVS+Haproxy. Among them, LVS is a four-layer load balancing software with good performance; Haproxy is a seven-layer load balancing software that can support more flexible load balancing strategies. By combining the two, different image requests can be dispatched to different Squid servers.

The above figure shows the single-node architecture of CDN, which has the following three characteristics

(1) The Squid server constitutes the distributed cache in the CDN of Taobao’s single node. This implementation is much simpler than the distributed cache because there is no need to consider data persistence.

(2) Hierarchical cache. Due to the high locality of cached data, SSD+SAS+SATA hybrid storage is used on the Squid server, and pictures migrate as hotspots change. The most popular ones are stored in SSD, and the medium-hot ones are stored in SAS , light heat storage to SATA. In this way, the performance of SSD and the cost advantages of SAS and SATA disks can be well combined;

(3) Low-power server customization, CDN cache service is IO-intensive rather than CPU-intensive. Therefore, Intel Atom CPU is used to customize low-power servers, which greatly reduces the overall power consumption under the premise of ensuring service performance.

5. Distributed key-value system

The distributed key-value system is used to store semi-structured data with simple relationships. The semi-structured data is encapsulated into an object composed of key-value pairs, where key is a unique identifier; value is an attribute value, which can be of any type, such as Text and pictures can also be empty; timestamp is a timestamp, providing support for multiple versions of the object. The distributed key-value system is stored in key-value pairs, and its structure is not fixed. Each tuple can have different fields, and key-value pairs can be added as needed, so that it is not limited to a fixed structure. It is more applicable and scalable Sex is better.

The distributed key-value system supports adding, deleting, checking, and changing operations for a single key-value pair. It can run on a PC server cluster and realize cluster expansion on demand to process large-scale data and ensure fault tolerance through data backup. Avoid the complexity and cost of segmenting data.

Generally speaking, from the perspective of storage data structure, the distributed key-value system is similar to the traditional hash table. The difference is that the distributed key-value system supports distributing data to multiple nodes in the cluster. Storage nodes. The distributed key-value system can configure the number of data backups, and can store all copies of a piece of data on different nodes. When a node fails to provide services normally, the remaining nodes will continue to provide services.

1、Amazon Dynamo

Dynamo stores data in a very simple key-value manner and does not support complex queries. What is stored in Dynamo is the original form of the data value, and the specific content of the data is not parsed. Dynamo is mainly used for Amazon’s shopping cart and S3 cloud storage service. During the implementation process, the following problems were solved:

question

technology used

data distribution

Improved hash algorithm

copy agreement

Copy and write protocol (NRW parameter adjustable)

Data Conflict Resolution

vector clock

Temporary Troubleshooting

Data return mechanism

permanent fault handling

Merkle hash tree

Membership and error detection

Gossip-based membership and error detection protocol

Dynamo uses consistent hashing to distribute data to multiple storage nodes. In a nutshell: assign a random token to each node in the system, and these tokens form a hash ring. When performing data storage operations, first calculate the hash value of the primary key, and then store it in the node where the first token greater than or equal to the hash value in the clockwise direction is located. The advantage of consistent hashing is that node addition/deletion will only affect nodes adjacent to the hash ring, but not other nodes.

A. Dynamo Architecture

Considering the heterogeneity of nodes, the processing capabilities of different nodes vary greatly. Dynamo uses an improved consistent hash algorithm: each physical node is assigned multiple tokens according to its performance difference, and each token corresponds to a virtual node. The processing capacity of each virtual node is basically the same, and they are randomly distributed in the hash space. When storing, the data falls into the area responsible for a certain virtual node according to the hash value, and then is stored in the physical node corresponding to the virtual node.

As shown in the figure below, there are originally 3 nodes in a Dynamo cluster, and each node is assigned 3 tokens. When storing data, first calculate the hash value of the primary key, and store the data to the node where the corresponding token is located according to the hash value. Assuming that node 4 is added, the distribution of node tokens changes, which realizes automatic load balancing.

In order to find the node to which the data belongs, each node is required to maintain certain cluster information for positioning. Each node in the Dynamo system maintains the information of the entire cluster, and the client also caches the information of the entire cluster. Therefore, most requests can be located at the target node at one time.

B. Gossip protocol

Due to machine or human factors, the addition or deletion of node members in the system often occurs. In order to ensure that each node caches the latest member information in the Dynamo cluster, all nodes pass the Gossip protocol at regular intervals (such as 1s). Arbitrarily choose a node to communicate with from other nodes. If the connection is successful, both parties exchange their saved cluster information.

The Gossip protocol is used for the autonomous node coordination in the P2P system to understand the entire cluster, such as the node status and load status of the cluster. Let’s first look at how two nodes A and B exchange knowledge of the world.

(1) A tells B the versions of all nodes it manages (including nodes in Down state and Up state);

( 2 ) B tells A which versions are older and which versions it has the latest, and then sends the latest nodes to A (the nodes in the Down state will not be followed because the version has not been updated);

(3) A sends the older nodes in B to B, and at the same time updates the latest node information sent by B locally;

(4) After receiving the latest node information from A, B updates the older nodes in the local cache.

Due to the existence of seed nodes, it is relatively simple for new nodes to join. When a new node joins, it first exchanges cluster information with the seed node, so that it has an understanding of the cluster. Other original nodes in the DHT (Distributed Hash Table, also known as Consistent Hash Table) ring will also periodically exchange cluster information with seed nodes to discover the addition of new nodes.

The cluster is constantly changing, and machines may go offline at any time. Therefore, each node needs to regularly exchange cluster information with other nodes through the Gossip protocol. If it is found that the status of a node has not been updated for a long time, for example, the time interval from the last update exceeds a certain threshold, the node is considered to be offline.

2. Taobao Tiar

Tair is a distributed key/value system.

Tair has four engines: mdb, rdb, kdb and ldb. Based on four open source key/value databases: memcached, Redis, Kyoto Cabinet and leveldb. Tair allows you to use these KV databases more conveniently. For example, Redis does not provide sharding operations. If there are multiple Redis Servers, you need to write your own code to implement sharding. Tair encapsulates these for you.

Tair has the following advantages:

(1) Unified API. No matter which engine is used at the bottom layer, the API of the upper layer is the same.

(2) Tair encapsulates cluster operations and liberates developers. When Taobao uses Tair internally, it is generally fault-tolerant with two computer rooms and two clusters. The invalid server is used to ensure the consistency between the two clusters, which is transparent to developers.

A. Tair usage scenarios

(1) Non-persistent (mdb,rdb)

· Data can be stored in the form of key/value

· Data loss is acceptable

· High access speed requirements

· The size of a single data is not very large, generally at the KB level

· The amount of data is large, and there is a greater possibility of growth

· Data updates are infrequent

(2) Persistence (kdb,ldb)

· Data can be stored in the form of key/value

· Data needs to be persisted

· The size of a single data is not very large, generally at the KB level

· The amount of data is large, and there is a greater possibility of growth

· The ratio of data read and write is high

B. Tair’s architecture

As a distributed system, Tair is composed of a central control node and several service nodes.

a. config server function:

(1) Obtain the information of the surviving nodes in the cluster through maintenance and data server heartbeat;

(2) Build a data distribution table in the cluster according to the information of the surviving nodes;

(3) Query service according to the data distribution table;

(4) Scheduling data migration and replication between data servers;

b. Data server function

(1) Provide a storage engine;

(2) Accept client’s put/get/remove operations;

(3) Perform data migration, replication, etc.;

(4) Plug-ins: handle some custom functions when accepting requests;

(5) Access statistics;

c, client function

(1) Provide an interface to access the Tair cluster on the application side;

(2) Update and cache the data distribution table and invalid server address, etc.;

(3) local cache to avoid overheating data access from affecting Tair cluster services;

(4) flow control;

In the figure below, . The client first requests the Config Server to obtain the Data Server where the data resides, and then sends read and write requests to the Data Server. Tair allows data to be stored in multiple Data Servers to achieve exception fault tolerance.

C. Data distribution balance

The distribution of Tair adopts the consistent hash algorithm. For all keys, they are divided into Q buckets. The bucket is the basic unit of load balancing and data migration. The config server assigns each bucket to different data according to a certain strategy. On the server, because the data is hashed according to the key, the balance of the bucket distribution is ensured, thereby ensuring the balance of the data distribution.

D. Fault tolerance

Config Server can detect when a Data Server fails and becomes unavailable. Each hash bucket stores multiple copies in Tair. If it is a backup copy, Config Server will redesignate a Data Server for it. If it is a persistent storage, it will also copy the data to the new Data Server. If it is the primary copy, then ConfigServer first promotes a normal backup copy to the primary copy to provide external services. Then, select another Data Server to add a backup copy to ensure the number of data backups.

E. Data Migration

When increasing or decreasing the data server, the config server will find this situation, and the config server is responsible for recalculating the distribution table of a new bucket on the data server, reassigning the access of the bucket originally served by the reduced machine to other data In the server, data migration will occur at this time. For example, the bucket that was originally in charge of data server A needs to be in charge of B in the new table, but there is no data in the bucket on B, then the data will be migrated to B, and the config server will find out which buckets have fewer backups , and then increase the backup of these buckets on the data server with lower load according to the load balancing situation. When the system adds a data server, the config server coordinates the data server to migrate some of the buckets they control to the new data server according to the load, and adjusts the route after the migration is completed;

The strategy of data server to provide external services during data migration. Assume that data server A wants to migrate buckets 1, 2, and 3 to data server B. Because the routing table of the client does not change before the migration is completed, the client’s buckets 1, 2, and 3 Access requests will be routed to A. Now suppose that 1 has not been migrated, 2 is migrating, and 3 has been migrated. If access 1 is accessed, data server A will still be accessed. If access 3, A will forward the request to B and will The return result of B is returned to the client. If access to 2, it will be processed on A. At the same time, if it is a modification operation on 2, the modification log will be recorded. When the migration of bucket 2 is completed, the log will be sent to B. Apply these logs on the above, and the final AB data consistency is the real completion of the migration. If A is migrated due to downtime, the client will receive an allocation table in an intermediate temporary state, and temporarily assign the buckets responsible for the downtime data server to the data server with its backup for processing. At this time, the service is Available, the load may be unbalanced, and a new load balanced state can be reached after the migration is completed.

3、 ETCD

etcd etcd is a highly available key-value storage system, mainly used for shared configuration and service discovery.

(1) Developed and maintained by CoreOS, inspired by ZooKeeper and Doozer;

(2) It is written in the Go language and handles log replication through the Raft consensus algorithm to ensure strong consistency.

(3) Google’s container cluster management system Kubernetes, open source PaaS platform Cloud Foundry and CoreOS’ Fleet all use etcd extensively;

(4) When the cluster network is turbulent, or the current master node is abnormal, etcd can carry out the election of the master node and restore the lost data in the cluster at the same time

A. Characteristics of ETCD

(1) Simple: API based on HTTP+JSON allows you to use curl easily.

(2) Security: Optional SSL client authentication mechanism.

(3) Fast: Each instance supports one thousand write operations per second.

(4) Credible: the distribution is fully realized using the Raft algorithm.

B. The ability to provide

Etcd mainly provides the following capabilities

(1) It provides an interface for storing and obtaining data, and it guarantees the strong consistency of data of multiple nodes in the Etcd cluster through the protocol. Used to store meta information and shared configuration.

(2) Provide a monitoring mechanism, the client can monitor the changes of a certain key or certain keys. Used to listen and push changes.

(3) Provide the expiration and renewal mechanism of the key, and the client realizes the renewal through regular refresh (the realization mechanism of v2 and v3 is different). Used for cluster monitoring and service registration discovery.

(4) Provide atomic CAS (Compare-and-Swap) and CAD (Compare-and-Delete) support (v2 is implemented through interface parameters, and v3 is implemented through batch transactions). Used for distributed locks and leader elections.

C, etcd architecture

(1) Etcd v2 storage, Watch and expiration mechanism

Etcd v2 is a pure memory implementation, and does not write data to disk in real time. The persistence mechanism is very simple, which is to serialize the store integration into json and write it into a file. The data is a simple tree structure in memory.

There is a global currentIndex in the store, and every time it is changed, the index will increase by 1. Then each event will be associated with the currentIndex.

When the client calls the watch interface (the wait parameter is added to the parameter), if there is a waitIndex in the request parameter, and the waitIndex is smaller than the currentIndex, then query the event whose index is less than or equal to the waitIndex and matches the watch key from the EventHistroy table, if there is data, then return directly. If there is no waitIndex in the history table or the request does not have waitIndex, it will be put into WatchHub, and each key will be associated with a watcher list. When there is a change operation, the event generated by the change will be put into the EventHistroy table, and the watcher related to the key will be notified at the same time.

(2) Etcd v3 storage, Watch and expiration mechanism

Etcd v3 implements watch and store separately. Let’s first analyze the implementation of store. Etcd v3 store is divided into two parts, one part is the index in memory, kvindex, which is implemented based on a golang btree open sourced by Google, and the other part is the back-end storage.

According to its design, the backend can connect to various storages, such as boltdb currently used. boltdb is a stand-alone kv storage that supports transactions, etcd transactions are implemented based on boltdb transactions. The key stored by Etcd in boltdb is reversion, and the value is Etcd’s own key-value combination. That is to say, Etcd will save each version in boltdb, thus realizing a multi-version mechanism.

4. Product selection comparison (Etcd, Zookeeper, Consul comparison)

These three products are often used for selection comparison.

(1) The capabilities provided by Etcd and Zookeeper are very similar. Both are general-purpose consistent meta-information storage, both provide watch mechanisms for change notification and distribution, and are also used by distributed systems as shared information storage. In the software ecosystem The positions are almost the same and can be substituted for each other. In addition to the differences in implementation details, language, and consensus protocol, the biggest difference between the two lies in the surrounding ecosystem. Zookeeper is under apache, written in java, and provides rpc interface. It was first hatched from hadoop project and widely used in distributed systems (hadoop, solr, kafka, mesos, etc.). Etcd is an open source product of coreos company. It is relatively new. It has captured a group of users with its simple and easy-to-use rest interface and active community, and is used in some new clusters (such as kubernetes). Although v3 has also been changed to a binary rpc interface for performance, its ease of use is still better than Zookeeper.

(2) The goal of Consul is more specific. Etcd and Zookeeper provide distributed and consistent storage capabilities. Specific business scenarios need to be implemented by users themselves, such as service discovery, such as configuration changes. Consul, on the other hand, focuses on service discovery and configuration changes, and comes with kv storage. In the software ecology, the more abstract the components, the wider the scope of application, but at the same time there must be deficiencies in meeting the needs of specific business scenarios.

Original Title: Analysis of Distributed Storage Architecture

--

--

Rong
Rong

Written by Rong

Senior backend software developer, focusing on huge system and distributed system design

No responses yet