motd

Welcome to
           __
      (___()'`;  www.nerdondon.com
      /,    /`
      \\"--\\

^ it's a huge gou. get it? think Chinese :).
..."gou" means dog in Chinese but also hugo for the blog platform.

Anyway, the content will generally be technical and about software engineering
but I might throw in some random topics here and there.

Linux dne.nerdondon.com 2.0.0-1-arm64
Last login: (lol how do you do this with hugo?)
      

[Notes]: Chain Replication

Prelude

Chain replication is a replication mechanism in which nodes are arranged in a chain: writes enter at the head, propagate down the chain, and reads are served from the tail. It is a mechanism I first came into contact with when I worked at AWS. For now, this is just a bunch of references I found enlightening.

Notes

  • TODO

Useful references

Read more →

[Notes]: Bloom Filters

Prelude

These are just some of my notes and references on Bloom filters while doing research for building an LSM tree based database.

Notes

  • A Bloom filter is a probabilistic data structure that allows checks for set membership. These checks for set membership can return false positives but will never produce false negatives. More clearly: “if a negative match is returned, the element is guaranteed not to be a member of the set” (Alex Petrov - Database Internals).

  • Used to lower I/O costs incurred by the tiered structure of LSM-trees
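To make the false-positive/no-false-negative property concrete, here is a minimal, hypothetical Bloom filter sketch in Rust (my own toy names, not any real implementation): the k hash functions are simulated by seeding a single std hasher with the hash-function index.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal Bloom filter sketch. The k hash functions are simulated by
// seeding one hasher with the hash-function index.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter {
            bits: vec![false; num_bits],
            num_hashes,
        }
    }

    fn bit_index(&self, key: &str, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &str) {
        for seed in 0..self.num_hashes {
            let index = self.bit_index(key, seed);
            self.bits[index] = true;
        }
    }

    // `false` means the key is definitely absent; `true` means it is
    // *possibly* present (false positives can occur, false negatives cannot).
    fn may_contain(&self, key: &str) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.bit_index(key, seed)])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 3);
    filter.insert("some-sstable-key");
    // Inserted keys are always reported as possibly present.
    assert!(filter.may_contain("some-sstable-key"));
}
```

In the LSM-tree context, `may_contain` returning false lets a read skip an SSTable entirely without touching disk.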

Useful references

Read more →

[Notes]: Skip Lists

Prelude

These are just some of my notes and references from implementing a skip list in Rust. Creating a skip list implementation that has concurrency features is my personal intro project to Rust, so there will also be some notes on lock-free and wait-free techniques. The overarching goal is to utilize this skip list as the in-memory component of an LSM-tree based storage engine that I will maybe, hopefully, aspirationally, try to make. LevelDB in Rust if you will.

Notes

  • Used as the in-memory component of LevelDB’s log structured merge tree (LSM tree) implementation

  • Skip list vs self-balancing search trees (e.g. Red-black trees)

    • A skip list is more amenable to concurrency because fewer nodes are required to be updated during an insert

    • On update, a self-balancing tree may require multiple ancestors to be changed due to rebalancing operations (e.g. rotations in a Red-black tree)

  • LevelDB doesn’t seem to have concurrent writes? The source appears to require an external locking mechanism for writes.

  • When used as the in-memory component of an LSM tree, the skip list should support sorting by reverse chronological order. This can be done by having a synthetic key with a timestamp/revision number or via reverse iteration by back pointers. This is because the LSM tree is an append only structure.
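A sketch of the synthetic-key idea in Rust (a BTreeMap stands in for the skip list here; names are my own, though the shape resembles LevelDB's internal keys): entries sort ascending by user key and descending by sequence number, so the newest revision of a key is encountered first.

```rust
use std::cmp::Reverse;
use std::collections::BTreeMap;

// The value is `Option<String>` so that `None` can represent a deletion
// tombstone in the append-only structure.
type Table = BTreeMap<(String, Reverse<u64>), Option<String>>;

// Entries sort ascending by user key and then *descending* by sequence
// number, so the first entry found for a user key is its newest revision.
fn latest<'a>(table: &'a Table, user_key: &str) -> Option<&'a Option<String>> {
    table
        .range((user_key.to_string(), Reverse(u64::MAX))..)
        .next()
        .filter(|((key, _), _)| key.as_str() == user_key)
        .map(|(_, value)| value)
}

fn main() {
    let mut table = Table::new();
    table.insert(("a".to_string(), Reverse(1)), Some("v1".to_string()));
    table.insert(("a".to_string(), Reverse(2)), Some("v2".to_string()));
    // The newest revision (sequence number 2) shadows the older one.
    assert_eq!(latest(&table, "a"), Some(&Some("v2".to_string())));
}
```

The same ordering trick works unchanged in a skip list, since a skip list is just another sorted map.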

Some concurrency stuff

Useful references

Read more →

[Notes]: LSM Trees and Storage Engines

Notes

  • Log-structured merge-trees (LSM trees) are a data structure used for write-heavy workloads.

  • LSM trees are the basis of many NoSQL/NewSQL databases e.g. Cassandra, LevelDB, RocksDB, etc.

  • Because of their optimization towards write-heavy workloads, LSM trees also lend themselves to time series databases like InfluxDB.

  • Whereas B+ trees (used in such storage engines as InnoDB) must maintain their structure via random writes, LSM trees gain their write speed by being structured in such a way as to take advantage of sequential writes.

  • To lend itself to sequential writes, the LSM tree is an append-only structure.

  • At a high level, an LSM tree is composed of different levels/tiers, where each level is an individual data structure with its own backing storage format, and each successive level’s storage is slower than the last.

    • For example: tier 0 of an LSM tree can be backed by RAM, tier 1 by SSD, and tier 2 by spinning drives.
    • This tiered storage architecture also works nicely with the cloud-native trend, where each tier of storage can be concretely backed by a different cloud offering depending on application needs.
    • Data is first written to the first tier and asynchronously pushed down to lower tiers as the first tier fills. Reads similarly start at the first tier and fall through to lower tiers on a miss.

Structure

  • Although referred to as a data structure, an LSM tree can more accurately be described as a technique for composing multiple data structures to achieve desired performance characteristics.

  • Despite the LSM paper saying that “[a]n LSM-tree is composed of two or more tree-like component data structures”, the components that make up an LSM tree do not actually have to be tree-like at all.

    • They can be any sorted data structure, which allows you to tune performance characteristics of the LSM tree tiers to your application’s requirements.
    • As an example, LevelDB uses a skiplist in its first tier. The original paper suggests a (2,3) tree or an AVL tree in the first tier and a B tree in the second.
    • Fun aside, Redis utilizes skiplists as the backing data structure for their sorted set data type
  • The canonical example of an LSM tree is a two-tiered structure composed of an in-memory tier and an on-disk tier. The in-memory tier is usually referred to as a memtable. Modern on-disk component examples utilize a structure inspired by Google’s BigTable called an SSTable (Sorted Strings Table).

  • An interesting point from John Pradeep Vincent on runtimes with GC - “For java implementations, the memtable is usually stored off-heap (direct memory) to avoid GC load.”

Memtable

  • As mentioned above, this is the primary in-memory component of an LSM tree.

  • Since this first point of contact for writes is in-memory, a storage engine would still utilize a write ahead log (WAL) to ensure durability.

  • Once this tier is filled to a certain size threshold, its contents are flushed to the next tier of the LSM tree.
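The WAL-then-memtable write order and the flush threshold can be sketched like this in Rust (a toy with hypothetical names: a Vec stands in for the on-disk log and a BTreeMap stands in for the sorted in-memory structure):

```rust
use std::collections::BTreeMap;

// Toy memtable sketch: a Vec stands in for the on-disk WAL, and a BTreeMap
// stands in for the sorted in-memory component (e.g. a skip list).
struct Memtable {
    wal: Vec<(String, String)>,
    table: BTreeMap<String, String>,
    threshold_bytes: usize,
    approx_bytes: usize,
}

impl Memtable {
    fn new(threshold_bytes: usize) -> Self {
        Memtable {
            wal: Vec::new(),
            table: BTreeMap::new(),
            threshold_bytes,
            approx_bytes: 0,
        }
    }

    // Returns true when the memtable has crossed its size threshold and
    // should be flushed to the next tier.
    fn put(&mut self, key: &str, value: &str) -> bool {
        // Durability first: the write is logged before touching the memtable.
        self.wal.push((key.to_string(), value.to_string()));
        self.approx_bytes += key.len() + value.len();
        self.table.insert(key.to_string(), value.to_string());
        self.approx_bytes >= self.threshold_bytes
    }
}

fn main() {
    let mut memtable = Memtable::new(8);
    assert!(!memtable.put("k1", "v1")); // 4 bytes so far, below threshold
    assert!(memtable.put("k2", "v2")); // 8 bytes: time to flush
}
```

On a crash, replaying the WAL rebuilds the memtable, which is why the log append must happen before the in-memory insert.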

SSTable (Sorted Strings Table)

  • SSTables are just one implementation for the on-disk tier of an LSM tree but are pretty popular in existing literature, so here’s some SSTable specific notes.

  • SSTables, as their name implies, store data sorted by key. An SSTable is a file (or multiple files) that generally has two parts: an index block and a data block.

    • The index block stores mappings from keys to offsets within the data block
  • A new SSTable (file or set of files) is created when the contents of the memtable are flushed or when a compaction operation occurs.

  • Because these SSTables are effectively dumps of the memtable, there can be overlapping key ranges in different SSTable files. In order to improve read performance, a compaction operation is occasionally conducted.

  • As compaction operations occur and new SSTables are continually created from memtable flushes, a layered structure begins to manifest from the SSTables where each layer has more recent data and shadows the lower layers. Keeping the layers shallow is important to maintaining the performance of the LSM tree and hence compaction is a key component of the LSM tree.
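The index-block/data-block split can be sketched as follows (a hypothetical layout of my own, not any real SSTable format): the index maps each key to an offset and length inside the data block, and a lookup binary searches the index.

```rust
// Hypothetical SSTable layout sketch: the data block is a flat byte buffer of
// concatenated values, and the index block maps each key to (offset, length).
// `entries` must already be sorted by key, as in a real SSTable.
fn build_sstable(entries: &[(&str, &str)]) -> (Vec<u8>, Vec<(String, usize, usize)>) {
    let mut data_block = Vec::new();
    let mut index_block = Vec::new();
    for (key, value) in entries {
        index_block.push((key.to_string(), data_block.len(), value.len()));
        data_block.extend_from_slice(value.as_bytes());
    }
    (data_block, index_block)
}

// Binary search the index block, then read only the relevant slice of the
// data block (in a real SSTable that slice read is the only disk I/O needed).
fn lookup(data_block: &[u8], index_block: &[(String, usize, usize)], key: &str) -> Option<String> {
    let i = index_block
        .binary_search_by(|(k, _, _)| k.as_str().cmp(key))
        .ok()?;
    let (_, offset, len) = &index_block[i];
    Some(String::from_utf8_lossy(&data_block[*offset..*offset + *len]).into_owned())
}

fn main() {
    let (data, index) = build_sstable(&[("apple", "1"), ("kiwi", "22")]);
    assert_eq!(lookup(&data, &index, "kiwi"), Some("22".to_string()));
    assert_eq!(lookup(&data, &index, "zebra"), None);
}
```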

Operations

Read

  • With the above structure of a memtable and SSTables, the worst-case read path would be: memtable -> a linear scan–in reverse chronological order–of each level of SSTables

  • To augment the speed of reads, there are a couple of data structures that are commonly introduced into the read path. Two of these are an extra index on top of the SSTables (different from the index block specific to a single SSTable) and a Bloom filter.

    • John Pradeep Vincent puts it best: because false positives are possible, “the presence of a bloom filter improves the read performance for keys that are missing in the system but for the keys that are present in the system, the read is still expensive as we have to look into the memtable, index and the SSTable.”
  • In terms of general database tricks to make read and write paths faster, memory mapping the disk files or using a buffer pool are also techniques that can be utilized
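The worst-case read path above can be sketched in Rust (BTreeMaps standing in for the memtable and the SSTable levels):

```rust
use std::collections::BTreeMap;

// Read path sketch: check the memtable first, then each SSTable level from
// newest to oldest. The first hit wins because newer data shadows older data.
fn read(
    memtable: &BTreeMap<String, String>,
    levels_newest_first: &[BTreeMap<String, String>],
    key: &str,
) -> Option<String> {
    if let Some(value) = memtable.get(key) {
        return Some(value.clone());
    }
    levels_newest_first
        .iter()
        .find_map(|sstable| sstable.get(key).cloned())
}

fn main() {
    let mut memtable = BTreeMap::new();
    memtable.insert("a".to_string(), "newest".to_string());

    let mut level0 = BTreeMap::new();
    level0.insert("a".to_string(), "stale".to_string());
    level0.insert("b".to_string(), "flushed".to_string());

    let levels = vec![level0];
    assert_eq!(read(&memtable, &levels, "a"), Some("newest".to_string()));
    assert_eq!(read(&memtable, &levels, "b"), Some("flushed".to_string()));
    assert_eq!(read(&memtable, &levels, "missing"), None);
}
```

A Bloom filter would slot in right before each `sstable.get` to skip levels that definitely do not hold the key.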

Write

  • Writes always go to the in-memory component first and can trigger a flush of the memtable contents to disk.

  • With the auxiliary data structures mentioned above, writes would also need to update the WAL, the Bloom filter, and the index for SSTables.

  • Again, the LSM tree is an append-only structure, so updates will write an updated entry that will shadow the previous entry and deletions will write a tombstone indicating item deletion.
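Shadowing and tombstones can be sketched over a flat append-only log (a Vec stands in for the log; `None` is the tombstone):

```rust
// Append-only sketch: updates append a new record that shadows older ones,
// and deletes append a tombstone (`None`). Reads scan newest-to-oldest.
fn read_latest(log: &[(String, Option<String>)], key: &str) -> Option<String> {
    log.iter()
        .rev() // newest record first
        .find(|(k, _)| k.as_str() == key)
        .and_then(|(_, value)| value.clone()) // a tombstone reads as "deleted"
}

fn main() {
    let mut log: Vec<(String, Option<String>)> = Vec::new();
    log.push(("a".to_string(), Some("1".to_string())));
    log.push(("a".to_string(), Some("2".to_string()))); // update shadows "1"
    log.push(("b".to_string(), Some("x".to_string())));
    log.push(("b".to_string(), None)); // tombstone deletes "b"

    assert_eq!(read_latest(&log, "a"), Some("2".to_string()));
    assert_eq!(read_latest(&log, "b"), None);
}
```

Note that the stale `"a" -> "1"` and the deleted `"b"` records still occupy space; reclaiming them is exactly compaction's job.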

Compaction

  • Compaction for an LSM tree is often compared to the garbage collector (GC) featured in many programming languages.

  • Just as with garbage collection, there are myriad strategies that can be implemented for compaction. For example: concurrent/asynchronous vs synchronous, manually triggered vs threshold triggered.

  • Recalling that an LSM tree is an append-only structure, compaction is then critical for cleaning up stale records and deletion tombstones.
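A sketch of one simple compaction strategy, a full merge down to the bottommost level (my own toy version, not how LevelDB schedules compactions): merge runs oldest-to-newest so newer entries overwrite stale ones, then drop tombstones, which is only safe when nothing older remains below.

```rust
use std::collections::BTreeMap;

// Compaction sketch: merge sorted runs oldest-to-newest so that newer entries
// overwrite stale ones, then drop tombstones (`None`). Dropping tombstones is
// only safe here because this sketch produces the bottommost level.
fn compact(runs_oldest_first: &[Vec<(String, Option<String>)>]) -> Vec<(String, String)> {
    let mut merged: BTreeMap<String, Option<String>> = BTreeMap::new();
    for run in runs_oldest_first {
        for (key, value) in run {
            merged.insert(key.clone(), value.clone()); // newer shadows older
        }
    }
    merged
        .into_iter()
        .filter_map(|(key, value)| value.map(|v| (key, v))) // drop tombstones
        .collect()
}

fn main() {
    let older = vec![
        ("a".to_string(), Some("1".to_string())),
        ("b".to_string(), Some("x".to_string())),
    ];
    let newer = vec![
        ("a".to_string(), Some("2".to_string())), // update shadows "1"
        ("b".to_string(), None),                  // tombstone deletes "b"
    ];
    let compacted = compact(&[older, newer]);
    assert_eq!(compacted, vec![("a".to_string(), "2".to_string())]);
}
```

Since SSTable runs are sorted, a real implementation would use a streaming k-way merge instead of materializing a map, but the shadowing and tombstone-dropping logic is the same.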

Other topics for expansion

Potential topics that I would like to expand on if I revisit this post or ideas for subsequent posts.

  • Discussion on read and write amplification and comparing and contrasting those values for LSM trees vs B trees

  • Expand on Niv Dayan’s video on knobs to turn to change scaling properties of an LSM tree implementation

  • Cassandra and BigTable SSTable implementation details

  • Compare against update in place data structures like a copy-on-write (CoW) tree

Useful references

Read more →

[Notes]: Envoy Internals Deep Dive by Matt Klein

Prelude

Some notes from the Envoy Internals Deep Dive talk by Matt Klein. The talk was given at KubeCon EU 2018. Watch on YouTube here.

Actual Notes :D

Architecture

  • Envoy is designed with an out of process architecture specifically to address the polyglot nature of microservices

  • At its core, Envoy is a byte proxy. This gives it extensibility to support a multitude of protocols.

  • Created with versatility from the start so that Envoy can act as a service, middle, and edge proxy

  • Hot restart is a key feature

  • One part of Envoy is built as a pipeline with filters as building blocks. Filters are extension points where users can add functionality.

  • The other part of the architecture is where the central management functionality lives. Things like the cluster/service managers or stats engine.

  • The prevailing school of thought for proxies before the 2000s was to serve a single connection per thread. Envoy takes a different approach where it serves multiple connections per thread via an event loop.

  • The “main” thread of Envoy handles the management functionality and worker threads handle the connections and filter pipeline. The key to keeping Envoy simple is the latter part of the previous statement. It means that worker threads do not have to communicate with each other and only have to acquire locks in a small set of cases.

  • Read-Copy-Update (RCU) is used to govern parallelization within Envoy. “Designed for read heavy, write infrequent workloads”. Helps reduce the need for locking by making the read path lock free.

  • Envoy keeps a vector of pointers in thread local storage (TLS) that is mirrored across the main and worker threads. RCU is used to update slots in the vector, and messages are posted to worker threads to update information e.g. route tables.

Hot Restarts

  • “Full binary reload without dropping any connections”

  • Allow two processes of Envoy to run at the same time and utilize a shared memory region. Stats and locks are stored here.

  • The two processes communicate through an RPC protocol over Unix domain sockets (UDS) and perform socket passing. After the second Envoy process spins up, it will communicate that it is good to go and the first process will begin to drain connections.

Stats

  • There’s a two-level caching mechanism for stats: a TLS cache and a central cache on the main thread, sitting in front of the actual entries in either shared memory or process memory.

Useful references

Read more →

[Notes]: V8 internals for JavaScript developers? by Mathias Bynens

Prelude

Some notes from the V8 internals for JavaScript developers? talk by Mathias Bynens. The talk was given at JsConf AU 2018. Watch on YouTube here. This talk focuses on arrays and JS engine optimizations around array operations.

Actual Notes :D

Element kinds

  • Arrays can be described by the types of elements they hold (3 types in V8) and whether or not they are packed.

    • The three types are SMI, DOUBLE, and ELEMENTS. SMI is a subset of DOUBLE and DOUBLE is a subset of ELEMENTS. SMI here stands for small integers.
    • An array and its element kinds are described together as the following regex: (PACKED | HOLEY)(_SMI | _DOUBLE)?_ELEMENTS
    • For example, PACKED_SMI_ELEMENTS or HOLEY_ELEMENTS
    • To be clear, a “holey” array is also known as a sparse array
  • Packed arrays are more performant than holey arrays

    • This is because V8 not only has to perform bounds checks on your array index, it must also check the prototype chain for the index.
    • For example, suppose you are querying arr[8] and arr is holey with length 9 and arr[8] is undefined. V8 will check if the property 8 exists on arr with hasOwnProperty(arr, '8'). This check will return false, so V8 must go to the next level up the prototype chain and check hasOwnProperty(Array.prototype, '8'). This will return false as well. The checks go on until the prototype chain ends. This is a lot of work for the runtime to do, especially if you consider that JavaScript arrays can be subclassed.

Transitioning between element kinds

  • Delving deeper into how the different element kinds relate to each other, V8 will classify an empty array with the simplest classification i.e. SMI. If you push a floating point number onto an SMI array, the array will be reclassified as DOUBLE. The analogous re-classification will occur if you then push an object onto the array i.e. the array is re-classified as just holding ELEMENTS.

  • Array re-classifications are only done one way. Concretely, you can transition an array from SMI to DOUBLE. However, if you remove all floating point numbers from the array, V8 will not reclassify that array back to SMI. The array will retain its classification as a DOUBLE.

  • The same relation holds true when making a packed array holey. The array will be reclassified as a holey array. However, if you fill in the holes, V8 won’t go back and classify the array as packed again. Mathias says that these transitions can be visualized as a lattice (see figure below).

    Lattice screen capture from Mathias Bynens’ talk
  • Try to use arrays over array-like objects. Array-like objects are integer-indexed objects with a length property. Using array-like objects prevents V8 from fully optimizing your code even though you can use Array.prototype methods via call(). This includes using the arguments object in functions. Nowadays, you should use rest parameters instead.

Optimizing array initialization vs element kinds

  • You can initialize an array with the array constructor i.e. new Array(n). This is great if you know you will have a large array and want the JS engine to pre-allocate space. The downside is that the array will start out with n holes, which means that array operations won’t be as optimized as operations for a packed array.

  • Initializing an empty array and continually pushing onto it will help keep optimizations for array operations, but this comes at the cost of array re-allocations as the array grows. These re-allocations can get slower and slower as the array grows.

  • Which optimization should you choose? There is no right answer and it’s just a game of trade-offs. You can optimize for array creation or you can optimize for array operations. It will depend on your use case.

Key takeaway

  • “Write modern, idiomatic JavaScript, and let the JavaScript engine worry about making it fast” - Mathias Bynens

Useful references

Read more →