Focus: ♯ Kleppmann - Designing data-intensive applications
This journal is generated by AI
Systems Correctness and Randomized Testing
This week’s Readwise sync centered on the DDIA second-edition thread about formal methods and randomized testing. The papers form a practical stack: classify what breaks in production, explain why random partition tests work, and describe how AWS validates correctness at scale.
- Systems Correctness Practices At AWS combines lightweight formal approaches—property-based testing, fuzzing, runtime monitoring—with tools like P for state-machine modeling and PObserve for post-hoc log validation against specifications. Open-source shuttle and turmoil extend build-time testing of thread ordering and failure injection.
- Why Is Random Testing Effective for Partition Tolerance Bugs? formalizes the Jepsen pattern: a short setup sequence, a carefully chosen partition, operations in each block, heal, repeat. Separating coverage and minority isolation explain why random tests find most partition bugs with surprisingly small schedules.
- What Bugs Live in the Cloud? catalogs 3,000+ cloud issues: data consistency, scalability, and topology bugs are distinct from classic distributed failures. Operational protocols (bootstrap, cloning, fsck) carry data inconsistency bugs because they run on cold paths. Failover-on-failover is brittle; systems must distinguish hardware failure from software logic bugs.
- I also started exploring FizzBee as a Python-like formal modeling tool with built-in fault injection and MBT support—promising ergonomics for the same problem space.
From Data Lakes to Iceberg Lakehouses
A second cluster of highlights mapped the evolution from schema-on-write warehouses to schema-on-read lakes, and now to lakehouses that unify cheap object storage with warehouse-style analytics.
- Data Lakes: A Survey of Functions and Systems frames the core trade-off: lakes store raw data in original formats with a common access interface, but metadata and governance determine whether ad-hoc analytics is feasible. The survey covers ingestion tiers, metadata modeling (data vault, enterprise knowledge graphs), dataset discovery (Aurum, JOSIE), and the lakehouse convergence that avoids maintaining separate lake and warehouse systems.
- 2025 Guide to Architecting an Iceberg Lakehouse turns theory into an architecture checklist: self-audit current data placement and SLAs first, then standardize on Parquet plus Iceberg metadata, pick catalog (Nessie, Polaris, Glue), ingestion (Spark, Flink, Kafka Connect), and query engines (Dremio, Trino). Dremio’s ability to federate non-Iceberg sources alongside Iceberg tables lowers migration risk.
- The lakehouse pattern aligns with a broader shift toward analytical systems and data engineering—an area I’m exploring as part of a career transition.
Local-First Data Ownership
The local-first essay rounds out the week’s data-systems reading with a user-centric counterpoint to cloud defaults: you should own your data even when services run in the cloud.
- You own your data, in spite of the cloud argues that Git/GitHub is the closest existing model for local-first collaboration—data lives on your machine, sync is explicit, and the remote is optional infrastructure rather than the source of truth.
- CRDTs and tools like Automerge offer a technical foundation for convergence without a central coordinator, which maps directly to DDIA 2nd edition’s new section on sync engines and local-first software.
- The essay’s further-reading list (Baboulevitch’s Data Laced with History, Kleppmann’s Convergence vs Consensus, formal CRDT verification) is a good bridge from informal local-first ideals to the formal-methods thread above.
Obsidian Links
- ♯ Kleppmann - Designing data-intensive applications
- FizzBee
- Marc Brooker et al. - Systems Correctness Practices At Aws (Highlights)
- Rupak Majumdar et al. - Why Is Random Testing Effective for Partition Tolerance Bugs (Highlights)
- Haryadi S. Gunawi et al. - What Bugs Live in the Cloud A Study of 3000+ Issues in Cloud Systems (Highlights)
- Rihan Hai et al. - Data Lakes A Survey of Functions and Systems (Highlights)
- Alex Merced - 2025 Guide to Architecting an Iceberg Lakehouse (Highlights)
- Martin Kleppmann et al. - You own your data, in spite of the cloud (Highlights)