[Notes]: AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)
Prelude
Some notes from the Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) talk by Jaso Sorenson. The talk was given at AWS re:Invent 2018. Watch on YouTube here.
Notes
GetItem and PutItem
`GetItem` and `PutItem` requests initially go through a service called the request router, which redirects the request to the appropriate storage node
Data from a `PutItem` will be sent to the leader of a Paxos group formed by 3 storage nodes
Heartbeats are sent every 1.5 seconds and missing 2-3 (?) heartbeats will trigger a leader election
The storage nodes in a group are in different availability zones (AZs)
Request routers and storage nodes are zonal services (services that run in multiple availability zones)
- Request router is a stateless service
A table is partitioned by the provided primary key and each partition is backed by a Paxos group of storage nodes
For an eventually consistent read, the request is sent to a random storage node in its group
Storage nodes are composed of 2 primary components: a B-tree and a replication log
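To tie the notes above together, here is a minimal Python sketch of the routing flow as I understand it: the router hashes the partition key to find the owning partition's replica group, sends writes to the Paxos leader, and sends eventually consistent reads to a random node in the group. All class and method names (and the dict standing in for the B-tree plus replication log) are my own illustration, not DynamoDB's actual code.

```python
import hashlib
import random


class StorageNode:
    """Toy stand-in for a storage node (real ones pair a B-tree with a
    replication log; a dict is enough for this sketch)."""
    def __init__(self, az):
        self.az = az
        self.items = {}

    def write(self, key, value):
        self.items[key] = value

    def read(self, key):
        return self.items.get(key)


class ReplicaGroup:
    """Hypothetical model of a partition's Paxos group: one leader plus
    followers spread across availability zones (3 storage nodes total)."""
    def __init__(self, leader, followers):
        self.leader = leader          # node that accepts writes
        self.followers = followers    # replicas in other AZs

    def all_nodes(self):
        return [self.leader] + self.followers


class RequestRouter:
    """Stateless router: hashes the partition key to find the owning
    partition, then forwards to a storage node in its replica group."""
    def __init__(self, partitions):
        self.partitions = partitions  # one ReplicaGroup per partition

    def _partition_for(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def put_item(self, key, value):
        # Writes always go to the partition's Paxos leader.
        group = self._partition_for(key)
        return group.leader.write(key, value)

    def get_item(self, key, consistent=False):
        group = self._partition_for(key)
        if consistent:
            # Strongly consistent reads are served by the leader.
            return group.leader.read(key)
        # Eventually consistent reads go to a random node in the group.
        return random.choice(group.all_nodes()).read(key)
```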
System management and Auto-admin
Auto-admin is a process that serves many roles for system management
It keeps the Partition Metadata System up to date with information like the current leader of a group of storage nodes
It also handles partition repairs. If a storage node backing a partition goes down, auto-admin clones a live replica and brings the new node up to date by replaying the replication log
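A rough Python sketch of that repair flow, assuming each node keeps a key/value tree plus an append-only replication log. The snapshot-then-replay structure and all names here are illustrative, not the real mechanism.

```python
from copy import deepcopy


class Replica:
    """Toy replica: a dict stands in for the B-tree, plus a replication
    log of (sequence, key, value) entries."""
    def __init__(self):
        self.tree = {}
        self.log = []
        self.applied_seq = 0

    def apply(self, seq, key, value):
        self.tree[key] = value
        self.log.append((seq, key, value))
        self.applied_seq = seq


def take_snapshot(node):
    """Copy the node's tree and remember which log entry it reflects."""
    return deepcopy(node.tree), node.applied_seq


def repair_partition(live_node, snapshot, snapshot_seq):
    """Build a replacement node from a snapshot of a healthy replica,
    then replay log entries written after the snapshot was taken."""
    replacement = Replica()
    replacement.tree = snapshot
    replacement.applied_seq = snapshot_seq
    for seq, key, value in live_node.log:
        if seq > snapshot_seq:
            replacement.apply(seq, key, value)
    return replacement


if __name__ == "__main__":
    healthy = Replica()
    healthy.apply(1, "k1", "v1")
    snapshot, snap_seq = take_snapshot(healthy)
    healthy.apply(2, "k2", "v2")          # write arrives mid-repair
    fresh = repair_partition(healthy, snapshot, snap_seq)
    print(fresh.tree)                     # {'k1': 'v1', 'k2': 'v2'}
```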
Secondary Index
A secondary index is itself partitioned by its index key, and each of its partitions is backed by a Paxos group of storage nodes. This is done completely independently of the base table.
Changes to the replication log of the base table are propagated by a separate process called the Log Propagator
Writes can cause an entire secondary index to be re-written, which means big write amplification.
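A small, hypothetical example of the kind of work the log propagator does: when an indexed attribute changes, one base-table write turns into a delete plus a put against the index (possibly on two different index partitions), which is where the write amplification comes from. The record shapes and function name are mine, not DynamoDB's.

```python
# Illustrative only: compute the secondary index writes implied by one
# base-table item change, for an index keyed on `index_key_attr`.
def index_updates(old_item, new_item, index_key_attr):
    updates = []
    old_key = old_item.get(index_key_attr) if old_item else None
    new_key = new_item.get(index_key_attr) if new_item else None

    if old_key is not None and old_key != new_key:
        # The indexed attribute changed: remove the stale index entry.
        updates.append(("delete", old_key, old_item))
    if new_key is not None:
        # Write the new index entry under the new index key.
        updates.append(("put", new_key, new_item))
    return updates


if __name__ == "__main__":
    old = {"id": "u1", "city": "Seattle", "name": "Ada"}
    new = {"id": "u1", "city": "Portland", "name": "Ada"}
    # Index keyed on "city": one base write becomes two index writes.
    print(index_updates(old, new, "city"))
```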
Provisioning and Token Bucket Algorithm
Token bucket algorithm used to measure capacity being used.
The bucket will initially be filled with tokens to represent the requested DDB capacity (i.e. RCUs and WCUs). Each operation will remove one token from the bucket. The bucket will get refilled at a rate of 100 tokens/second.
A request will be throttled if the bucket is exhausted.
A token bucket will fill to a max capacity that is a multiple of your provisioned capacity. This allows for bursty traffic.
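A minimal token bucket in Python matching the notes above. The 100 tokens/second refill rate comes from the notes; the burst multiple is just a placeholder, since I don't know the real value.

```python
import time


class TokenBucket:
    """Refill at a fixed rate, cap the bucket at a multiple of provisioned
    capacity to allow bursts, and throttle when the bucket is empty."""

    def __init__(self, provisioned_per_sec=100, burst_multiple=3):
        self.rate = provisioned_per_sec                 # tokens per second
        self.capacity = provisioned_per_sec * burst_multiple
        self.tokens = float(self.capacity)              # start full
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_consume(self, tokens=1):
        """Return True if the request is admitted, False if throttled."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```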
Global Tables
Global tables can be thought of as an external service on top of DDB. Replication still has to go through a request router to get data from one region to another.
There is a stream reader (called `RepOut`) that consumes changes from the DDB streams of the “child” (my word) tables that compose a DDB global table.
There is a challenge that, when there is a partition split, you need a `RepOut` process reading from the DDB stream shard of the new partition.
A `ReplAdmin` process will watch metadata about partitions from the DDB control plane and start a workflow that will tell a process in the `RepOut` process pool to begin processing from the new partition.
`RepOut` will send batched replication messages to a destination region’s `RepIn` service that will drive replication through request routers at the destination region. Once this replication is done, `RepIn` will notify `RepOut` of success and the stream will be checkpointed.
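To make that flow concrete, here is a very rough, self-contained sketch of a `RepOut`/`RepIn` pair. The batching, acknowledgement, and checkpointing details are my own simplification of what the talk describes, and the record shapes are made up for illustration.

```python
class RepIn:
    """Destination-region side: applies replicated writes (via request
    routers in reality; a dict stands in for the destination table here)."""
    def __init__(self):
        self.table = {}

    def apply(self, batch):
        for record in batch:
            self.table[record["key"]] = record["value"]
        return True                      # acknowledge success to RepOut


class RepOut:
    """Source-region side: reads a DDB stream shard, ships batches to the
    destination region's RepIn, and checkpoints only after an ack."""
    def __init__(self, rep_in, batch_size=25):
        self.rep_in = rep_in
        self.batch_size = batch_size
        self.checkpoint = None           # last replicated sequence number

    def process_shard(self, stream_records):
        batch = []
        for record in stream_records:    # records ordered by sequence number
            batch.append(record)
            if len(batch) >= self.batch_size:
                self._ship(batch)
                batch = []
        if batch:
            self._ship(batch)

    def _ship(self, batch):
        if self.rep_in.apply(batch):
            # Advance the checkpoint only after the destination confirms,
            # so a failure causes a re-send rather than lost changes.
            self.checkpoint = batch[-1]["seq"]


if __name__ == "__main__":
    stream = [{"seq": i, "key": f"k{i}", "value": i} for i in range(1, 6)]
    rep_in = RepIn()
    rep_out = RepOut(rep_in, batch_size=2)
    rep_out.process_shard(stream)
    print(rep_in.table, rep_out.checkpoint)   # replicated items, last seq = 5
```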
TODO
- Maybe add notes on the rest of the talk, which covers DDB-specific backup mechanisms. The parts with notes are more generic and more interesting, especially given my current side project developing RainDB.