motd

Welcome to
           __
      (___()'`;  www.nerdondon.com
      /,    /`
      \\"--\\

^ it's a huge gou. Get it? Think Chinese :).
..."gou" means dog in Chinese, and "huge gou" also sounds like Hugo, the blog platform.

Anyway, the content will generally be technical and about software engineering,
but I might throw in some random topics here and there.

Linux dne.nerdondon.com 2.0.0-1-arm64
Last login: (lol how do you do this with hugo?)
      

Notes: TiKV Sharding and Scheduling

Prelude

Notes on how TiKV does dynamic sharding with Raft. This is particularly relevant to my own efforts to add dynamic sharding and scheduling to a toy key-value store that I’m building. Incidentally, I will also be trying to use Raft, in contrast to the use of Paxos in Google’s Spanner, AWS’s DynamoDB, and Meta’s ZippyDB.

Notes

  • TiKV has a module called the Placement Driver (PD) that aggregates statistics from heartbeats and makes resource scheduling decisions based on these statistics. TiKV names the PD after a similar component in Google’s Spanner.

  • TiKV is organized in the following way:

    • Compute resources (VMs/containers) hosting the database are called nodes
    • TiKV does range-based sharding where a shard contains a contiguous block of keys. TiKV calls a shard a region (a minimal lookup sketch follows this list).
    • Multiple regions can be stored on a node.
    • A region consists of multiple replicas of the data. Each region’s replicas are grouped together in a Raft group to maintain consistency.
  • TiKV stores metrics in a Prometheus instance and scrapes this instance for metrics, in addition to the metrics received via heartbeats

  • PD will keep a cache of read/write statistics for the top N regions of each store. This is used to determine whether a region is hot and thus a candidate for migration.
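
To make the range-based sharding above concrete, here is a minimal Rust sketch (my own illustration, not TiKV code) that keeps regions in an ordered map keyed by start key and looks up the region that owns a given key. The Region struct and its field names are made up for this example.

    use std::collections::BTreeMap;

    // Hypothetical region descriptor; the fields are made up for illustration.
    struct Region {
        id: u64,
        start_key: Vec<u8>, // inclusive
        end_key: Vec<u8>,   // exclusive; empty means "unbounded"
    }

    // Regions indexed by start key so that finding a key's region is a range query.
    struct RegionMap {
        regions: BTreeMap<Vec<u8>, Region>,
    }

    impl RegionMap {
        // Find the region whose [start_key, end_key) range contains `key`.
        fn region_for(&self, key: &[u8]) -> Option<&Region> {
            self.regions
                .range(..=key.to_vec()) // regions whose start key is <= `key`
                .next_back()            // the closest such start key
                .map(|(_, region)| region)
                .filter(|region| region.end_key.is_empty() || key < region.end_key.as_slice())
        }
    }

    fn main() {
        let mut map = RegionMap { regions: BTreeMap::new() };
        for region in [
            Region { id: 1, start_key: b"".to_vec(), end_key: b"m".to_vec() },
            Region { id: 2, start_key: b"m".to_vec(), end_key: vec![] },
        ] {
            map.regions.insert(region.start_key.clone(), region);
        }

        // "apple" falls in region 1; "tikv" falls in region 2.
        assert_eq!(map.region_for(b"apple").unwrap().id, 1);
        assert_eq!(map.region_for(b"tikv").unwrap().id, 2);
    }

A real implementation would also handle splitting and merging regions and would route requests to each region’s Raft leader, which is where the PD’s scheduling decisions come in.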

Useful references

Read more →

Notes: Spanner

Prelude

Notes on Spanner internals referencing publicly available literature e.g. Alex Petrov’s Database Internals book and the Spanner paper presented at OSDI.

Notes

  • A Spanner zone is “the unit of administrative deployment” and is composed of a zonemaster, spanservers, and per-zone location proxies

    • The zonemaster assigns data to spanservers

    • spanservers are the individual units of Spanner that serve data to clients.

      • A spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet

      • A tablet is essentially a bag of key-value mappings

    • location proxies are used by clients to locate the spanservers that contain their data

  • spanservers “implement a single Paxos state machine on top of each tablet”. The set of replicas is called a Paxos group

    • The leader in each Paxos group manages 2 additional systems: a lock table and a transaction manager. The state of these systems is still replicated by the Paxos group. (A toy sketch of a lock table follows this list.)

      • “The lock table contains the state for two-phase locking: it maps ranges of keys to lock states.”

      • The transaction manager supports distributed transactions i.e. transactions that span multiple Paxos groups.

        • The transaction manager will coordinate with the leaders of the other Paxos groups participating in the transaction to perform two-phase commit. One of the participant groups will be selected as the coordinator leader.
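
As a toy illustration of the lock table idea (definitely not Spanner’s implementation), here is a Rust sketch of a readers-writer lock table keyed by key range. The types and the conflict rules are my own simplification; a real lock table would detect overlapping ranges and support waiting rather than just failing.

    use std::collections::HashMap;

    // Simplified lock states for two-phase locking.
    enum LockState {
        Shared(u32), // number of readers currently holding the lock
        Exclusive,
    }

    // Toy lock table: maps key ranges to lock states. Ranges are treated as
    // exact (start, end) entries here; a real lock table would detect overlaps.
    struct LockTable {
        locks: HashMap<(String, String), LockState>,
    }

    impl LockTable {
        fn new() -> Self {
            LockTable { locks: HashMap::new() }
        }

        // Try to take a shared (read) lock on [start, end); fails if a writer holds it.
        fn lock_shared(&mut self, start: &str, end: &str) -> bool {
            let entry = self
                .locks
                .entry((start.to_string(), end.to_string()))
                .or_insert(LockState::Shared(0));
            match entry {
                LockState::Shared(readers) => {
                    *readers += 1;
                    true
                }
                LockState::Exclusive => false,
            }
        }

        // Try to take an exclusive (write) lock on [start, end); fails if anyone holds it.
        fn lock_exclusive(&mut self, start: &str, end: &str) -> bool {
            let key = (start.to_string(), end.to_string());
            if self.locks.contains_key(&key) {
                return false;
            }
            self.locks.insert(key, LockState::Exclusive);
            true
        }
    }

    fn main() {
        let mut table = LockTable::new();
        assert!(table.lock_shared("a", "f"));      // reader 1
        assert!(table.lock_shared("a", "f"));      // reader 2 can share
        assert!(!table.lock_exclusive("a", "f"));  // a writer has to wait
    }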

Useful references

Read more →

[Notes]: flock vs fcntl

Prelude

In the course of developing RainDB, there was cause to acquire whole-file locks (as opposed to locks on byte ranges within a file). Here are some light notes on two Unix system calls for doing this: flock and fcntl. Mostly from an awesome post by @apenwarr.

Notes

  • flock locks an entire file at a time and is not standardized by POSIX

  • Both flock and fcntl have shared locks and exclusive locks, i.e. they are readers-writer locks

  • fcntl is a POSIX standardized lock that allows locking of specific byte ranges

    • These byte ranges do not have to correspond to actual bytes in the file, e.g. you can lock bytes past the end of a file

    • Per @apenwarr:

      You could lock bytes 9-12 and that might mean “records 9-12” if you want, even if records have variable length. The meaning of a fcntl() byte range is up to whoever defines the file format you’re locking.

  • Both flock and fcntl are advisory locks. This means that nothing prevents another process from reading or writing to a locked file without first acquiring a lock. A well-behaved program will attempt to acquire a lock first.

  • fcntl locks belong to a (pid, inode) pair, so, if you close any file descriptor referring to the inode, all of the locks you have on the inode will be released. This is particularly insidious because a library function that you use may open a file you have a lock on and close it as part of its implementation. This would release your locks and you wouldn’t really know it unless you dug into the implementation. Yikes.

    • This quote from @apenwarr says what we’re all thinking:

      That behaviour is certifiably insane, and there’s no possible justification for why it should work that way. But it’s how it’s always worked, and POSIX standardized it, and now you’re stuck with it.

    • Aside, what’s an inode?

      • It’s a data structure that describes a file-system object (e.g. a file or directory). It stores information like file ownership, file size, access timestamps, etc. (paraphrased from Wikipedia)
    • An API for fcntl (released in 2015) called “open file description locks” provides more sane behavior

      • libc docs:

        Open file description locks are useful in the same sorts of situations as process-associated locks. They can also be used to synchronize file access between threads within the same process by having each thread perform its own open of the file, to obtain its own open file description.

        Because open file description locks are automatically freed only upon closing the last file descriptor that refers to the open file description, this locking mechanism avoids the possibility that locks are inadvertently released due to a library routine opening and closing a file without the application being aware.
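
To make the open file description variant concrete, here is a minimal Linux-only Rust sketch using the libc crate. This is my own example (the post has no code) and it assumes your libc/target exposes F_OFD_SETLK.

    use std::fs::OpenOptions;
    use std::io;
    use std::os::unix::io::AsRawFd;

    fn main() -> io::Result<()> {
        let file = OpenOptions::new().read(true).write(true).open("some.lock")?;

        // Describe a write (exclusive) lock over the whole file: l_len == 0
        // means "until the end of the file".
        let mut lock: libc::flock = unsafe { std::mem::zeroed() };
        lock.l_type = libc::F_WRLCK as libc::c_short;
        lock.l_whence = libc::SEEK_SET as libc::c_short;
        lock.l_start = 0;
        lock.l_len = 0;

        // F_OFD_SETLK ties the lock to this open file description rather than
        // the (pid, inode) pair, so a library opening and closing the same
        // inode elsewhere in the process will not silently drop it.
        let rc = unsafe {
            libc::fcntl(file.as_raw_fd(), libc::F_OFD_SETLK, &lock as *const libc::flock)
        };
        if rc == -1 {
            return Err(io::Error::last_os_error());
        }

        // ... critical section while `file` stays open ...
        Ok(())
    }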

Thread safety

  • According to @apenwarr:

    Supposedly, flock() locks persist across fork() and (unlike fcntl locks, see below) won’t disappear if you close unrelated files. HOWEVER, you can’t depend on this, because some systems - notably earlier versions of Linux - emulated flock() using fcntl(), which has totally different semantics. If you fork() or if you open the same file more than once, you should assume the results with flock() are undefined.

    • This is particularly applicable to understanding how the fs2 crate in Rust is used to acquire locks on files

    • The fs2 crate does a whole-file lock with flock but, for systems that do not have flock (e.g. Oracle Solaris), it will emulate flock using fcntl (see the sketch after these notes)

  • With flock, upgrading from a shared lock to an exclusive lock is racy because you have to release the shared lock first

  • It seems like fcntl can atomically upgrade a shared lock to an exclusive lock
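
For reference, here is a minimal sketch of grabbing a whole-file lock with the fs2 crate mentioned above; the lock-file name and error handling are just for illustration.

    use std::fs::OpenOptions;

    use fs2::FileExt;

    fn main() -> std::io::Result<()> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open("db.lock")?; // arbitrary lock file name

        // try_lock_exclusive returns an error immediately if another process
        // holds the lock. Remember this is advisory: a misbehaving process can
        // just ignore it and touch the file anyway.
        file.try_lock_exclusive()?;

        // ... do work while holding the lock ...

        // The lock is released on unlock() or when `file` is closed/dropped.
        file.unlock()?;
        Ok(())
    }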

Useful references

Read more →

[Notes]: etcd Internals

Prelude

For now, mostly just some cool articles that were very helpful in understanding etcd internals.

Notes

  • etcd first sends transaction requests to its Raft engine, which effectively serves as a write-ahead log. Once consensus on the transaction is reached, the transaction is committed to etcd’s underlying key-value store (sketched below).

  • etcd uses bbolt as its durable store, which is based on a B+ tree variation

  • etcd will persist Raft cluster information/membership in its storage backend
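
Here is a rough Rust sketch of that write path. etcd itself is written in Go and uses bbolt, so the trait and types below are hypothetical stand-ins just to show the ordering: Raft consensus first, then the commit to the key-value store.

    use std::collections::BTreeMap;

    // Hypothetical stand-ins; etcd's real backend is bbolt, not an in-memory map.
    struct Proposal {
        key: String,
        value: String,
    }

    // propose() is assumed to return only once a quorum has committed the entry.
    trait RaftEngine {
        fn propose(&mut self, proposal: &Proposal) -> Result<(), String>;
    }

    struct KvStore {
        data: BTreeMap<String, String>,
    }

    // The request goes through Raft first (the effective write-ahead log) and is
    // only applied to the backing key-value store after consensus is reached.
    fn apply_transaction<R: RaftEngine>(
        raft: &mut R,
        store: &mut KvStore,
        proposal: Proposal,
    ) -> Result<(), String> {
        raft.propose(&proposal)?;
        store.data.insert(proposal.key, proposal.value);
        Ok(())
    }

    // Dummy Raft engine so the sketch runs; it just pretends consensus succeeded.
    struct NoopRaft;

    impl RaftEngine for NoopRaft {
        fn propose(&mut self, _proposal: &Proposal) -> Result<(), String> {
            Ok(())
        }
    }

    fn main() {
        let mut raft = NoopRaft;
        let mut store = KvStore { data: BTreeMap::new() };
        let proposal = Proposal { key: "foo".into(), value: "bar".into() };
        apply_transaction(&mut raft, &mut store, proposal).unwrap();
        assert_eq!(store.data.get("foo").map(String::as_str), Some("bar"));
    }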

Useful references

Read more →

[Notes]: AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)

Prelude

Some notes from the Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) talk by Jaso Sorenson. The talk was given at AWS re:Invent 2018. Watch on YouTube here.

Notes

GetItem and PutItem

  • GetItem and PutItem initially go through a service called the request router that will then redirect the request to the appropriate storage node

  • Data from a PutItem will be sent to the leader of a Paxos group formed by 3 storage nodes

    • Heartbeats are sent every 1.5 seconds, and missing 2-3 (?) heartbeats will trigger an election

    • The storage nodes in a group are in different availability zones (AZ’s)

  • Request routers and storage nodes are zonal services (services that run in multiple availability zones)

    • Request router is a stateless service
  • A table is partitioned by the provided primary key and each partition is backed by a Paxos group of storage nodes

  • For an eventually consistent read, the request is sent to a random storage node in its group

  • Storage nodes are composed of 2 primary components: a B-tree and a replication log
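
A toy Rust sketch of the routing described above, with entirely made-up types: the request router hashes the partition key to find the partition’s replica group, sends a PutItem to the leader, and sends an eventually consistent GetItem to an arbitrary replica.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::time::{SystemTime, UNIX_EPOCH};

    // Made-up types for illustration only.
    #[derive(Debug)]
    struct StorageNode {
        name: &'static str,
    }

    struct Partition {
        leader: StorageNode,
        replicas: Vec<StorageNode>, // the 3-node group, one node per AZ
    }

    struct RequestRouter {
        partitions: Vec<Partition>,
    }

    impl RequestRouter {
        // Hash the partition key to pick a partition (toy stand-in for the
        // partition metadata lookup).
        fn partition_for(&self, partition_key: &str) -> &Partition {
            let mut hasher = DefaultHasher::new();
            partition_key.hash(&mut hasher);
            let idx = (hasher.finish() as usize) % self.partitions.len();
            &self.partitions[idx]
        }

        // PutItem: always goes to the leader of the replica group.
        fn route_put(&self, partition_key: &str) -> &StorageNode {
            &self.partition_for(partition_key).leader
        }

        // Eventually consistent GetItem: any replica will do, so pick one
        // "randomly" (crude clock-based stand-in to avoid extra dependencies).
        fn route_eventual_get(&self, partition_key: &str) -> &StorageNode {
            let partition = self.partition_for(partition_key);
            let nanos = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap()
                .subsec_nanos() as usize;
            &partition.replicas[nanos % partition.replicas.len()]
        }
    }

    fn main() {
        let router = RequestRouter {
            partitions: vec![Partition {
                leader: StorageNode { name: "az1-node" },
                replicas: vec![
                    StorageNode { name: "az1-node" },
                    StorageNode { name: "az2-node" },
                    StorageNode { name: "az3-node" },
                ],
            }],
        };

        println!("put -> {:?}", router.route_put("user#123"));
        println!("eventual get -> {:?}", router.route_eventual_get("user#123"));
    }

Sending the read to an arbitrary replica is exactly why it is only eventually consistent; that replica may not have applied the latest write yet.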

System management and Auto-admin

  • Auto-admin is a process that serves many roles for system management

  • It keeps the Partition Metadata System up to date with information like the current leader of a group of storage nodes

  • It also handles partition repairs. If a storage node in a partition goes down, auto-admin clones a live node and brings the new node up to date by replaying the replication log

Secondary Index

  • The table is partitioned by the secondary index and the partitions are each backed by a Paxos group of storage nodes. This is done completely independently from the base table.

  • Changes to the replication log of the base table are propagated by a separate process called the Log Propagator

  • Writes can cause an entire secondary index to be re-written. Big write amplification.

Provisioning and Token Bucket Algorithm

  • A token bucket algorithm is used to measure the capacity being used.

  • The bucket will initially be filled with tokens to represent the requested DDB capacity (i.e. RCUs and WCUs). Each operation will remove one token from the bucket. The bucket will get refilled at a rate of 100 tokens/second.

  • A request will be throttled if the bucket is exhausted.

  • A token bucket will fill to a max capacity that is a multiple of your provisioned capacity. This allows for bursty traffic.
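
Here is a minimal Rust sketch of the token bucket described above. The refill rate and burst multiple are placeholders, not DynamoDB’s actual numbers.

    use std::time::Instant;

    // Toy token bucket; the rates and burst multiple are placeholders.
    struct TokenBucket {
        tokens: f64,
        max_tokens: f64,     // a multiple of provisioned capacity, to allow bursts
        refill_per_sec: f64, // the provisioned capacity units per second
        last_refill: Instant,
    }

    impl TokenBucket {
        fn new(provisioned_per_sec: f64, burst_multiple: f64) -> Self {
            TokenBucket {
                tokens: provisioned_per_sec * burst_multiple,
                max_tokens: provisioned_per_sec * burst_multiple,
                refill_per_sec: provisioned_per_sec,
                last_refill: Instant::now(),
            }
        }

        // Add tokens for the time elapsed since the last refill, capped at max_tokens.
        fn refill(&mut self) {
            let elapsed = self.last_refill.elapsed().as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.max_tokens);
            self.last_refill = Instant::now();
        }

        // Returns false (i.e. "throttle the request") if the bucket is exhausted.
        fn try_consume(&mut self, cost: f64) -> bool {
            self.refill();
            if self.tokens >= cost {
                self.tokens -= cost;
                true
            } else {
                false
            }
        }
    }

    fn main() {
        // 100 capacity units/sec with a burst bucket of 300 tokens (placeholder numbers).
        let mut bucket = TokenBucket::new(100.0, 3.0);
        for i in 0..400 {
            if !bucket.try_consume(1.0) {
                println!("request {} throttled", i);
                break;
            }
        }
    }

Setting max_tokens to a multiple of the provisioned rate is what allows short bursts above the provisioned capacity.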

Global Tables

  • Global tables can be thought of as an external service on top of DDB. It still has to go through a request router to replicate data from one region to another.

  • There is a stream reader (called RepOut) that consumes changes from DDB streams of the “child” (my word) tables that compose a DDB global table.

  • One challenge is that, when a partition splits, you need a RepOut process reading from the DDB stream shard of the new partition.

  • A ReplAdmin process will watch metadata about partitions from the DDB control plane and start a workflow that will tell a process in the RepOut process pool to begin processing from the new partition.

  • RepOut will send batched replication messages to a destination region’s RepIn service, which will drive replication through request routers at the destination region. Once this replication is done, RepIn will notify RepOut of success and the stream will be checkpointed.

TODO

  • Maybe the rest of the talk about DDB-specific backup mechanisms. The parts with notes are more generic and more interesting, especially given my current side project developing RainDB.

Useful references

Read more →

First Hugo Post

First post on the new ish! So…hi again!

Read more →