This journal is generated by AI

Dimensional Modeling Meets the Open Lakehouse

The bulk of this week’s reading was data engineering: Kimball’s The Data Warehouse Toolkit delivered over a hundred highlights on dimensional modeling, while Architecting an Apache Iceberg Lakehouse covered how the same presentation-layer discipline lands on open table formats.

  • Kimball’s bedrock principle: one measurement event in the physical world maps to exactly one row in the fact table. The four design decisions follow in order—select the business process, declare the grain, identify the dimensions, identify the facts—and the most frequent design error is skipping the grain declaration.
  • Dimension tables trade space for simplicity: denormalize aggressively, replace cryptic codes with verbose textual attributes, and substitute default rows (Unknown, Not Applicable) for null keys. Normalized 3NF models are great for operational writes but “resemble a map of the Los Angeles freeway system” to BI users.
  • Percentages and ratios are non-additive—store the numerator and denominator in the fact table instead. Store derived facts physically so the ETL computes them once and consistently.
  • Conformed dimensions are the enterprise glue: separate teams can build separate star schemas at different times, and they plug together as long as shared dimensions mean the same thing everywhere. That’s 95 percent of the data architecture effort.
  • Iceberg modernizes the storage side: layered immutable metadata gives atomic commits, time travel, and optimistic concurrency, with the catalog as the arbiter of each table’s current state. The REST Catalog specification decouples clients from catalog implementations; Nessie adds git-like branching and merging so pipelines can write in isolation and merge atomically.
  • Adjacent bookmarks rounded out the architecture view: Zhamak Dehghani’s data mesh argument for decomposing monolithic lakes, and the Open Data Contract Standard for formalizing the producer–consumer interface.

Fractional Indexing for Realtime Ordered Sequences

A pair of articles explained how collaborative tools keep ordered lists consistent under concurrent edits without the complexity of Operational Transformation.

  • Figma’s Realtime Editing of Ordered Sequences (Evan Wallace) explains why OT was overkill for a design tool: OT’s benefits target enormous text sequences and interleaving avoidance, at a high implementation cost—the first OT paper had subtle bugs that went undiscovered for years.
  • Figma’s alternative: every object gets a fraction between 0 and 1 (exclusive) as its index; ordering is just sorting. Inserting between two objects averages their indices, and the exclusive bounds guarantee you can always generate a key before or after any existing one.
  • Indices are stored as strings and averaged with string manipulation, so precision never runs out. When two clients insert at the same position simultaneously, the server simply assigns a fresh unique position to the second insert.
  • David Greenspan’s Implementing Fractional Indexing works out the details in lexicographic string space: the midpoint algorithm must not needlessly grow key length, and edge cases like trailing zeros (“3” vs “30” have nothing between them) need careful handling.
  • His refinement uses variable-length integer parts with a length prefix (A0–A9, B00–B99, …) so appending grows keys logarithmically instead of one digit per insert, and base 95 packs 65 bits of integer precision into ten characters.

War Stories of Scaling Relational Databases

Two production write-ups showed different paths to the same destination: keeping a relational database alive under massive growth.

  • OpenAI’s Scaling PostgreSQL to power 800 million ChatGPT users keeps a single-writer Postgres and offloads pressure instead of sharding it: shardable write-heavy workloads migrate to Cosmos DB, backfills are strictly rate-limited, and complex joins move to the application layer.
  • The operational playbook: isolate low-priority and high-priority requests onto dedicated instances to kill the noisy-neighbor problem, pool connections with PgBouncer in transaction mode, use cache locking so a cache-miss storm sends only one reader per key to the database, rate-limit at every layer, and enforce a 5-second timeout on schema changes.
  • Slack took the sharding path with Vitess: the original architecture had three MySQL clusters with all data-access and shard-lookup logic living in the webapp itself, which became the bottleneck to evolve—motivating a datastore layer that owns routing instead. The Vitess project itself landed in my bookmarks as the reusable version of that lesson.