How to Design a Stable Storage System

Over the years, we’ve traveled the full evolutionary chain of storage technologies — from traditional filesystems to network filesystems to modern object storage. Each stage taught us valuable (sometimes painful) lessons about scalability, reliability, and the limits of conventional approaches.

Stage 1: Traditional Filesystems

Mature filesystems like XFS and Ext4 are fast and reliable for single-server use. However, as you add disks and grow LVM volumes, the risk of downtime rises sharply. As we noted in our scalability article, the Dirichlet principle is merciless: every additional drive is one more component that can fail, so with enough drives the probability that at least one of them fails approaches certainty.
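As a rough illustration of that claim, here is a minimal sketch. It assumes that drives fail independently and with the same probability, which real hardware only approximates. The 2% per-drive figure and the function name are made up for the example.

```python
def prob_any_failure(per_drive_failure_prob: float, drives: int) -> float:
    """Probability that at least one of `drives` independent drives fails.

    Computed as 1 - (1 - p)^n, the complement of "every drive survives".
    """
    if not 0.0 <= per_drive_failure_prob <= 1.0:
        raise ValueError("per_drive_failure_prob must be between 0 and 1")
    if drives < 0:
        raise ValueError("drives must be non-negative")
    return 1.0 - (1.0 - per_drive_failure_prob) ** drives


if __name__ == "__main__":
    # Hypothetical 2% chance that a single drive fails in a given period.
    for n in (1, 10, 50, 100, 500):
        print(f"{n:>4} drives: {prob_any_failure(0.02, n):.1%}")
```

With the assumed 2% figure, 100 drives already put the chance of at least one failure well above 80%. A single large volume spanning all of those drives inherits that risk.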

Stage 2: Network Filesystems (NFS et al.)

To reduce risk, we moved to network filesystems. The results were mixed at best. Scalability and fault tolerance remained problematic. In one production setup with 40 application servers using NFS, occasional file losses would break the UI because links in the database pointed to missing assets. Not ideal for production environments.

Stage 3: Object Storage — The Logical Next Step

Object storage solves many of these issues by providing better fault tolerance, horizontal scalability, rich metadata support, and a clean HTTP-based API that applications can use directly. No more fragile filesystem links in a database — keys and metadata live with the objects themselves.
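To make the "keys and metadata live with the objects" idea concrete, here is a small in-memory sketch of the object model. It is not the API of any particular product: the class names, method names, and the SHA-256 tag are assumptions chosen for illustration, and a real object store would expose the same operations over HTTP and replicate the data across nodes.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    content_type: str
    metadata: Dict[str, str]
    etag: str  # content hash, lets clients detect changes or corruption


class InMemoryObjectStore:
    """Toy key -> object map; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes,
            content_type: str = "application/octet-stream",
            metadata: Optional[Dict[str, str]] = None) -> str:
        # Data and metadata are written together under one key.
        etag = hashlib.sha256(data).hexdigest()
        self._objects[key] = StoredObject(data, content_type,
                                          dict(metadata or {}), etag)
        return etag

    def get(self, key: str) -> StoredObject:
        try:
            return self._objects[key]
        except KeyError:
            raise KeyError(f"no such object: {key}") from None

    def head(self, key: str) -> Dict[str, str]:
        # Metadata lookup without returning the payload.
        obj = self.get(key)
        return {"content-type": obj.content_type, "etag": obj.etag,
                **obj.metadata}


if __name__ == "__main__":
    store = InMemoryObjectStore()
    store.put("avatars/42.png", b"\x89PNG", "image/png", {"owner": "42"})
    print(store.head("avatars/42.png"))
```

The application keeps only the key. Because the metadata travels with the object, there is no separate database row that can drift out of sync with a file on disk.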

The Object Storage Zoo

We evaluated many options:

  • CouchDB: Everything is stored as JSON, which makes binary objects bloated.
  • Cassandra: Unpredictable performance under load.
  • OpenStack Swift: Acceptable until we found that a single node failure could stop object serving, which made its eventual consistency look questionable.
  • LeoFS, ZODB, MongoDB, Redis: Each had significant drawbacks in stability, predictability, or maintenance.

Why Riak CS Stood Out

After extensive testing, Riak CS (developed by Basho on top of Riak) proved to be the most reliable choice. Built on Erlang/OTP, a platform known for long-running systems that need little intervention, it offered predictable performance, consistent latency, and solid distributed systems design. It shines in clusters of 3+ nodes and handles real-world loads gracefully.
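One way to see why a cluster needs several nodes before a failure stops hurting is quorum arithmetic over replicas. The sketch below is a generic model, not Riak CS code. The class name and the values N=3, R=2, W=2 are illustrative assumptions, not settings taken from our deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    replicas: int      # N: copies kept of each object
    read_quorum: int   # R: replicas that must answer a read
    write_quorum: int  # W: replicas that must acknowledge a write

    def __post_init__(self) -> None:
        if self.replicas < 1:
            raise ValueError("replicas must be at least 1")
        for name in ("read_quorum", "write_quorum"):
            value = getattr(self, name)
            if not 1 <= value <= self.replicas:
                raise ValueError(f"{name} must be between 1 and replicas")

    def reads_survive(self, failed_replicas: int) -> bool:
        # Reads succeed while enough replicas remain to form a read quorum.
        return self.replicas - failed_replicas >= self.read_quorum

    def writes_survive(self, failed_replicas: int) -> bool:
        return self.replicas - failed_replicas >= self.write_quorum

    def reads_overlap_writes(self) -> bool:
        # R + W > N means every read quorum shares a replica with
        # every write quorum, so a read reaches the latest acknowledged write.
        return self.read_quorum + self.write_quorum > self.replicas


if __name__ == "__main__":
    cfg = QuorumConfig(replicas=3, read_quorum=2, write_quorum=2)
    print(cfg.reads_survive(1), cfg.writes_survive(1), cfg.reads_overlap_writes())
```

Under these assumed values, losing one replica still leaves both reads and writes available, while losing two does not. That is the kind of behaviour we expect from a store that has to keep serving objects through a single node failure.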

Lessons Applied at LightUp.Cloud

These experiences heavily influenced our architecture. We prioritize systems that are predictable, fault-tolerant, and scalable by design rather than by heroic effort. For users who need fast, reliable synchronization of large files with strong consistency guarantees, this foundation makes all the difference.

The evolution of storage reminds us that the “latest and greatest” isn’t always the most stable. Sometimes the best solutions come from understanding the entire journey — and choosing tools that were built for the long haul.

28 November, 2017