Focus: ♯ Kleppmann - Designing data-intensive applications

This journal is generated by AI

Systems Correctness and Randomized Testing

This week’s Readwise sync centered on the DDIA second-edition thread about formal methods and randomized testing. The papers form a practical stack: classify what breaks in production, explain why random partition tests work, and describe how AWS validates correctness at scale.

  • Systems Correctness Practices At AWS combines lightweight formal approaches—property-based testing, fuzzing, runtime monitoring—with tools like P for state-machine modeling and PObserve for post-hoc log validation against specifications. Open-source shuttle and turmoil extend build-time testing of thread ordering and failure injection.
  • Why Is Random Testing Effective for Partition Tolerance Bugs? formalizes the Jepsen pattern: a short setup sequence, a carefully chosen partition, operations in each block, heal, repeat. Separating coverage and minority isolation explain why random tests find most partition bugs with surprisingly small schedules.
  • What Bugs Live in the Cloud? catalogs 3,000+ cloud issues: data consistency, scalability, and topology bugs are distinct from classic distributed failures. Operational protocols (bootstrap, cloning, fsck) carry data inconsistency bugs because they run on cold paths. Failover-on-failover is brittle; systems must distinguish hardware failure from software logic bugs.
  • I also started exploring FizzBee as a Python-like formal modeling tool with built-in fault injection and MBT support—promising ergonomics for the same problem space.

From Data Lakes to Iceberg Lakehouses

A second cluster of highlights mapped the evolution from schema-on-write warehouses to schema-on-read lakes, and now to lakehouses that unify cheap object storage with warehouse-style analytics.

  • Data Lakes: A Survey of Functions and Systems frames the core trade-off: lakes store raw data in original formats with a common access interface, but metadata and governance determine whether ad-hoc analytics is feasible. The survey covers ingestion tiers, metadata modeling (data vault, enterprise knowledge graphs), dataset discovery (Aurum, JOSIE), and the lakehouse convergence that avoids maintaining separate lake and warehouse systems.
  • 2025 Guide to Architecting an Iceberg Lakehouse turns theory into an architecture checklist: self-audit current data placement and SLAs first, then standardize on Parquet plus Iceberg metadata, pick catalog (Nessie, Polaris, Glue), ingestion (Spark, Flink, Kafka Connect), and query engines (Dremio, Trino). Dremio’s ability to federate non-Iceberg sources alongside Iceberg tables lowers migration risk.
  • The lakehouse pattern aligns with a broader shift toward analytical systems and data engineering—an area I’m exploring as part of a career transition.

Local-First Data Ownership

The local-first essay rounds out the week’s data-systems reading with a user-centric counterpoint to cloud defaults: you should own your data even when services run in the cloud.

  • You own your data, in spite of the cloud argues that Git/GitHub is the closest existing model for local-first collaboration—data lives on your machine, sync is explicit, and the remote is optional infrastructure rather than the source of truth.
  • CRDTs and tools like Automerge offer a technical foundation for convergence without a central coordinator, which maps directly to DDIA 2nd edition’s new section on sync engines and local-first software.
  • The essay’s further-reading list (Baboulevitch’s Data Laced with History, Kleppmann’s Convergence vs Consensus, formal CRDT verification) is a good bridge from informal local-first ideals to the formal-methods thread above.