System Design - By Gaurav Sen (YouTube)
> Horizontal & Vertical Scaling
> Decoupling
> Logging and metrics calculation
> Extensibility
> Load Balancer : used to make intelligent decisions about where to route incoming requests, i.e. to distribute the load evenly across all servers so that a large number of requests spread over more than one server are handled with minimal delay and maximum accuracy.
It is commonly implemented using consistent hashing:
Hashing in a data structure is a two-step process.
- The hash function converts the item into a small integer or hash value. This integer is used as an index to store the original data.
- It stores the data in a hash table. You can use a hash key to locate data quickly.
- Hashing : a technique of mapping a large chunk of data into small tables using a hashing function. It uniquely identifies a specific item from a collection of similar items.
- Hash Function : maps data of arbitrary size to fixed-size data. It returns a small integer value (also known as the hash value), a hash code, or a hash sum.
- Hash Table : stores the key-value pairs. The hash table uses the hash function to generate an index, and this index is used to perform insert, update, and search operations.
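A minimal Python sketch of the two-step process above; the class name and the chaining strategy for collisions are illustrative, not details from the video:

```python
# Minimal sketch of hashing as a two-step process:
# 1) a hash function maps the key to a small integer index,
# 2) a fixed-size table of buckets stores the key-value pairs.
# Collisions are handled with simple chaining for brevity.

class SimpleHashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # each bucket holds (key, value) pairs

    def _index(self, key):
        # Hash function: arbitrary-sized key -> fixed-range integer index
        return hash(key) % self.size

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # update an existing key
                return
        bucket.append((key, value))        # insert a new key

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

table = SimpleHashTable()
table.put("user:42", {"name": "Alice"})
print(table.get("user:42"))   # {'name': 'Alice'}
```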
Consistent Hashing Ring :
With normal hashing, a problem arises when we try to add or remove servers from the system: the keys no longer map to the same slots, so data stored under the old scheme (as in the 10-index hash table example above) can no longer be retrieved the same way. This is where the consistent hash ring comes into the picture.
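A minimal consistent-hash-ring sketch, assuming MD5 as the ring hash and a configurable number of virtual nodes per server (both my assumptions, not details from the video); adding or removing a server only remaps the keys between it and its neighbour on the ring:

```python
# Sketch of a consistent hash ring. Servers are hashed onto a ring (here via
# MD5, with virtual nodes for smoother balancing); a key is served by the first
# server clockwise from its own hash. Adding/removing a server moves only the
# keys in that server's arc, unlike plain "hash(key) % N".

import hashlib
from bisect import bisect, insort

class ConsistentHashRing:
    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []                     # sorted list of (hash, server)
        for s in servers:
            self.add_server(s)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.vnodes):
            insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def get_server(self, key):
        if not self._ring:
            return None
        idx = bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]           # first server clockwise from the key

ring = ConsistentHashRing(["S1", "S2", "S3"])
print(ring.get_server("user:42"))           # e.g. 'S2'; adding "S4" moves only nearby keys
```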
- 1:N : in the example above the replication is 1:1, where data is copied from the current/master server to just the next server. We can also do 1:N, where data from one server is copied to more than one server in parallel, i.e. multiple copies of a single master kept on different servers.
- Chained Replication : here data is copied in a chain [Ex: S1 -> S2 -> S3], meaning if the master server is S1, its data is copied to S2, and the data held by S1 and S2 is then copied on to S3.
- Mixed Replication : a combination of 1:N and chained replication (a small sketch of the first two schemes follows below).
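A toy, in-process sketch contrasting the 1:N and chained schemes above; the Server class and the replicate functions are hypothetical stand-ins for what a real system would do over the network:

```python
# Toy illustration of replication fan-out. In reality the master would ship a
# replication log over the network; here everything happens in-process.

class Server:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

def replicate_1_to_n(master, replicas, key, value):
    # 1:N - the master applies the write, then pushes it to every replica
    master.write(key, value)
    for r in replicas:
        r.write(key, value)

def replicate_chained(chain, key, value):
    # Chained - each server applies the write and forwards it to the next one
    for server in chain:
        server.write(key, value)

s1, s2, s3 = Server("S1"), Server("S2"), Server("S3")
replicate_chained([s1, s2, s3], "user:42", {"name": "Alice"})
print(s3.data)   # the write has propagated down the chain to S3
```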
> Message Queues [MQs] : Message queuing makes it possible for applications to communicate asynchronously by sending messages to each other via a queue.
- RabbitMQ is one of the most widely used message brokers; it acts as the message broker, “the mailman”, that a microservice architecture needs.
- RabbitMQ consists of:
1. producer : the client that creates a message
2. consumer : receives a message
3. queue : stores messages
4. exchange : routes messages and sends them to queues
The system functions in the following way:
1. producer creates a message and sends it to an exchange
2. exchange receives the message and routes it to the queues bound to it
3. consumer receives messages from the queues it is subscribed to
Note that messages are filtered and routed depending on the type of exchange.
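A small producer/consumer sketch, assuming a RabbitMQ broker on localhost and the pika Python client; the exchange, queue, and routing-key names are illustrative:

```python
# Sketch of the producer -> exchange -> queue -> consumer flow, assuming a
# local RabbitMQ broker and the pika client (pip install pika).

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare a direct exchange and a queue, and bind them with a routing key
channel.exchange_declare(exchange="orders", exchange_type="direct")
channel.queue_declare(queue="order_processing")
channel.queue_bind(queue="order_processing", exchange="orders", routing_key="new_order")

# Producer: create a message and send it to the exchange
channel.basic_publish(exchange="orders", routing_key="new_order", body=b"order #1234")

# Consumer: receive messages from the queue it is subscribed to
def handle(ch, method, properties, body):
    print("received:", body)

channel.basic_consume(queue="order_processing", on_message_callback=handle, auto_ack=True)
channel.start_consuming()   # blocks and delivers messages as they arrive
```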
> Database Sharding :
Sharding is a technique for distributing a single dataset among many databases, allowing it to be stored across multiple machines.
Larger datasets are split into smaller chunks, called logical shards, which are then distributed across separate database nodes called physical shards, each of which holds multiple logical shards. This boosts the total storage capacity of the system.
1. Horizontal or Range Based Sharding
In this case, the data is split based on value ranges that are inherent in each entity. For example, if you store the contact info for your online customers, you might choose to store the info for customers whose last name starts with A-H on one shard, while storing the rest on another shard.
The disadvantage of this scheme is that the last names of the customers may not be evenly distributed. You might have a lot more customers whose names fall in the range of A-H than customers whose last name falls in the range I-Z. In that case, your first shard will be experiencing a much heavier load than the second shard and can become a system bottleneck.
Nevertheless, the benefit of this approach is that it's the simplest sharding scheme available. Each shard also has the same schema as the original database. Your application layer is relatively simple because in most scenarios, you'll not need to combine data from multiple shards to answer any query.
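A minimal sketch of the last-name range split described above; the shard names and the exact ranges are illustrative:

```python
# Range-based shard selection keyed on the first letter of the last name.

RANGES = [
    ("A", "H", "shard_1"),
    ("I", "Z", "shard_2"),
]

def shard_for_last_name(last_name):
    first = last_name[0].upper()
    for low, high, shard in RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers last name {last_name!r}")

print(shard_for_last_name("Garcia"))   # shard_1
print(shard_for_last_name("Smith"))    # shard_2
```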
2. Vertical Sharding
In this case, different features of an entity are placed in different shards on different machines. For example, in a LinkedIn-like application, a user might have a profile, a list of connections, and a set of articles they have authored. In a vertical sharding scheme, we might place the user profiles on one shard, the connections on a second shard, and the articles on a third shard.
The main benefit of this scheme is that you can handle the critical part of your data (for example, user profiles) differently from the not-so-critical part (for example, blog posts) and build different replication and consistency models around each.
The two main disadvantages of vertical sharding scheme are as follows:
- Depending on your system, your application layer might need to combine data from multiple shards to answer a query. For example, a profile view request will need to combine data from the User Profile, Connections and Articles shard. This increases the development and operational complexity of the system.
- If your site/system experiences additional growth, it may become necessary to further shard a feature-specific database across multiple servers.
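A minimal sketch of the vertical split described above, where each feature of the user entity is routed to its own shard and a profile view fans out to all of them; the shard addresses and the fetch helper are made up:

```python
# Vertical sharding: each feature lives on its own shard, so a profile view
# must combine results from several shards in the application layer.

FEATURE_SHARDS = {
    "profile":     "db://profiles-shard",
    "connections": "db://connections-shard",
    "articles":    "db://articles-shard",
}

def fetch(feature, user_id):
    # Stand-in for a real query against FEATURE_SHARDS[feature]
    return {"feature": feature, "user_id": user_id, "shard": FEATURE_SHARDS[feature]}

def profile_view(user_id):
    # The application layer combines data from multiple shards
    return {feature: fetch(feature, user_id) for feature in FEATURE_SHARDS}

print(profile_view(42))
```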
3. Key or Hash Based Sharding
In this case, an entity has a value (e.g. the IP address of a client application) which can be used as input to a hash function, and a resultant hash value is generated. This hash value determines which database server (shard) to use.
As a simple example, imagine you have 4 database servers and each request contains an application id that is incremented by 1 every time a new application is registered.
In this case, you can simply perform a modulo operation on the application id with the number 4 and take the remainder to determine which server the application data should be placed on.
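A sketch of that modulo scheme, with 4 shards and the application id deciding placement:

```python
# Hash/key based sharding with a plain modulo: application id mod 4 picks the shard.

NUM_SHARDS = 4

def shard_for_application(app_id):
    return app_id % NUM_SHARDS

for app_id in (17, 18, 19, 20):
    print(app_id, "->", f"shard_{shard_for_application(app_id)}")
# 17 -> shard_1, 18 -> shard_2, 19 -> shard_3, 20 -> shard_0
```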
The main drawback of this method is that elastic scaling (dynamically adding or removing database servers) becomes very difficult and expensive.
For example, if we wanted to add 6 more servers, a majority of the keys would need to be remapped and migrated to new servers, and the hash function would need to be changed from modulo 4 to modulo 10.
While the data migration is in progress, neither the new nor the old hash function is fully valid. In effect, a large number of requests cannot be serviced, and you'll incur downtime until the migration completes.
This problem is easily solved by Consistent hashing.
4. Directory Based Sharding
Directory based shard partitioning involves placing a lookup service in front of the sharded databases. The lookup service knows the current partitioning scheme and keeps a map of each entity and which database shard it is stored on. The lookup service is usually implemented as a webservice.
The client application first queries the lookup service to figure out the shard (database partition) on which the entity resides/should be placed. Then it queries / updates the shard returned by the lookup service.
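A minimal sketch of the lookup-service idea, assuming it keeps an in-memory map from entity id to shard; in practice it would be a web service backed by its own durable store:

```python
# Directory-based sharding: clients ask the lookup service which shard holds an
# entity before reading or writing. The in-memory dict is a stand-in for the
# real directory store.

class LookupService:
    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.directory = {}                    # entity id -> shard id

    def shard_for(self, entity_id):
        if entity_id not in self.directory:
            # Place new entities with the current scheme (modulo here); the map
            # can be rewritten later without changing client code.
            self.directory[entity_id] = entity_id % self.num_shards
        return self.directory[entity_id]

lookup = LookupService(num_shards=4)
shard = lookup.shard_for(1234)                       # client asks the lookup service first...
print(f"read/write entity 1234 on shard_{shard}")    # ...then talks to that shard
```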
Directory based sharding enables us to solve the elastic scaling problem described in the previous section without using consistent hashing.
Here's how: In the previous example, we had 4 database servers and a hash function that performed a modulo 4 operation on the application ids. Now, if we wanted to add 6 more database servers without incurring any downtime, we'll need to do the following steps:
- Keep the modulo 4 hash function in the lookup service.
- Determine the data placement based on the new hash function - modulo 10.
- Write a script to copy all the data based on step #2 into the six new shards, and possibly onto the 4 existing shards. Note that it does not delete any existing data on the 4 existing shards.
- Once the copy is complete, change the hash function to modulo 10 in the lookup service.
- Run a cleanup script to purge unnecessary data from the 4 existing shards based on step #2, since the purged data now exists on other shards.
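A toy, in-memory walk-through of those steps; the Shard dictionaries and the lookup_modulus variable are hypothetical stand-ins, and the step numbers refer to the list above:

```python
# Re-sharding from 4 to 10 shards via a directory/lookup service, without downtime.

class Shard(dict):
    """In-memory stand-in for one physical database shard."""

OLD_MOD, NEW_MOD = 4, 10
shards = {i: Shard() for i in range(NEW_MOD)}     # 4 existing + 6 new shards
lookup_modulus = OLD_MOD                          # step 1: keep modulo 4 live in the lookup service

# Seed some existing data placed with the old modulo-4 scheme
for app_id in range(20):
    shards[app_id % OLD_MOD][app_id] = f"app-{app_id}"

# Steps 2-3: copy everything to its new modulo-10 home, leaving old copies in place
for shard in list(shards.values()):
    for app_id, row in list(shard.items()):
        shards[app_id % NEW_MOD][app_id] = row

# Step 4: switch the lookup service to the new hash function
lookup_modulus = NEW_MOD

# Step 5: purge rows from the 4 old shards that now live elsewhere
for i in range(OLD_MOD):
    for app_id in [k for k in shards[i] if k % NEW_MOD != i]:
        del shards[i][app_id]

print(sorted(shards[7].keys()))   # app ids 7 and 17 now live on new shard 7
```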
There are two practical considerations which need to be solved on a per-system basis:
- While the migration is happening, the users might still be updating their data. Options include putting the system in read-only mode or placing new data in a separate server that is placed into correct shards once migration is done.
- The copy and cleanup scripts might affect system performance during the migration. This can be mitigated by using system cloning and elastic load balancing, but both are expensive.
What are the common problems with Sharding?
The above sections might make it sound like Sharding is the ultimate Silver Bullet to solve all your scaling woes. However, this is not the case and there are various issues to be considered before choosing a sharding based solution.
Database joins become more expensive, and infeasible in certain cases
When all the data is located in a single database, joins can be performed easily. Now, when you shard the database, joins have to be performed across multiple networked servers which can introduce additional latency for your service.
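A toy illustration of that point: once users and orders live on different shards, a join becomes two separate queries plus an in-application merge (the in-memory dicts stand in for the two shards):

```python
# Cross-shard "join" done in application code: query each shard, then merge.
# Each query pays its own network round trip in a real deployment.

users_shard  = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}       # shard A
orders_shard = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]   # shard B

def orders_with_user_names():
    orders = list(orders_shard)                         # query 1: orders shard
    users = {uid: users_shard[uid]                      # query 2: users shard
             for uid in {o["user_id"] for o in orders}}
    return [{**o, "name": users[o["user_id"]]["name"]} for o in orders]           # merge in app code

print(orders_with_user_names())
```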