System Design - RAID and Storage

December 20, 2025 | system-design, storage


Notes on computer storage architecture regarding system design.

Storage#

  • Persisting data in an organized manner
    • Even on system failure
  • Recovering and accessing when needed
  • Long-term storage
  • Security, performance, latency, scalability, redundancy, replication, availability, etc.
  • HDD, SSD, Network Systems, etc.
    • Where to store the data, depending on our use-case
    • Best storage model for cases where we write a lot, or read a lot
  • We want to expand the size in case we reach the maximum capacity

Metrics and dimensions#

Some metrics can help us choose a storage option based on our requirements.

Throughput#

Number of operations a system can handle within a determined period of time.

  • Operations / time
  • Data transferred to a storage unit
  • Number of reads and writes
  • Megabytes per second (MB/s)
  • Gigabytes per second (GB/s)

Disks with higher throughput are generally more expensive and are used in systems where the volume of reads and writes is extremely high, such as those handling a large number of requests.

Throughput is the amount of data we read or write within a given time range.
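As a rough illustration of the definition above, throughput can be computed by dividing the amount of data transferred by the elapsed time. The function name and the sample numbers are assumptions made for this sketch, not measurements.

```python
def throughput_mb_per_s(bytes_transferred: int, elapsed_seconds: float) -> float:
    """Return throughput in megabytes per second (1 MB = 1,000,000 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_transferred / 1_000_000 / elapsed_seconds


# Hypothetical measurement: 3 GB written in 12 seconds.
print(throughput_mb_per_s(3_000_000_000, 12.0))  # 250.0
```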

Bandwidth#

The maximum capacity available for throughput.

  • Maximum quantity of data the communication channel can carry
  • Represents the maximum possible throughput
  • MB/s, GB/s, etc.

Bandwidth is the max throughput our channel can accept.
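Since bandwidth is the ceiling and throughput is what is actually being used, their ratio gives a simple utilization figure. This is a minimal sketch; the function name and the sample values are hypothetical.

```python
def bandwidth_utilization(throughput_mb_s: float, bandwidth_mb_s: float) -> float:
    """Return the fraction of the channel's bandwidth currently in use."""
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth_mb_s must be positive")
    return throughput_mb_s / bandwidth_mb_s


# Hypothetical channel: 600 MB/s of bandwidth, 450 MB/s of measured throughput.
print(bandwidth_utilization(450.0, 600.0))  # 0.75
```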

I/O and IOPS#

I/O represents a read/write operation that happens between a system and a storage unit.

  • I/O operations per second
  • Read/Write operations

IOPS shows the number of read/write operations performed per second. We want to keep in mind how many IOPS a storage unit can handle, and how many we are using now.
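IOPS and throughput are linked by the size of each operation: throughput is roughly IOPS multiplied by the I/O size. The sketch below shows that relationship and a simple headroom check; the function names, the 4 KiB operation size, and the 3000 IOPS limit are assumptions for illustration.

```python
def throughput_from_iops(iops: int, io_size_bytes: int) -> float:
    """Approximate throughput in MB/s for a given IOPS rate and I/O size."""
    return iops * io_size_bytes / 1_000_000


def iops_headroom(current_iops: int, max_iops: int) -> int:
    """Return how many more IOPS the storage can take before saturating."""
    return max(max_iops - current_iops, 0)


# Hypothetical disk: 3000 IOPS limit, 4 KiB (4096-byte) operations.
print(throughput_from_iops(3000, 4096))  # 12.288
print(iops_headroom(2200, 3000))         # 800
```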

Storage Models#

Similar to the OSI model, storage can also be described with a layered model.

DAS (Direct-Attached Storage)#

  • Traditional model, generally connected to the host through SATA, USB, etc.
  • Disk allocation is done within the host
  • Direct access
  • No additional latency
  • No complex protocol, nor any networking connection
  • Frequent and intense disk access
  • Does not support attaching to multiple systems simultaneously (though this can be done through software)
  • Hard to scale horizontally.
  • Centralized

NAS (Network-Attached Storage)#

  • NFS (Network File System)
  • Directly connected to the network
  • Allows multiple hosts to connect to the volume
  • SMB
  • Multiple hosts can modify the same data simultaneously
  • Data is available through a network
  • Centralized server
  • Can scale horizontally
  • Requires additional software and network layers, so there is more latency; it is limited by network bandwidth and can be affected by network bottlenecks or spikes in requests

Block Storage#

  • Stores data split into blocks, which can be scattered across the underlying storage
  • Direct access through the OS, as a mounted storage device
  • The server responsible for managing the blocks can wipe them and use them as a file system
  • Each block has a unique address and can be organized individually
  • Can be an isolated physical disk, but can also be virtualized
  • The size of a block is generally fixed, unless the system supports dynamic resizing
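The idea of fixed-size, individually addressed blocks can be sketched as follows. This is an illustrative model only: the 4096-byte block size and the function names are assumptions, and real block devices handle addressing in the driver and hardware layers.

```python
BLOCK_SIZE = 4096  # assumed fixed block size in bytes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Split data into fixed-size blocks keyed by a unique block address."""
    blocks = {}
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        # Pad the final block so that every block has the same size.
        blocks[offset // block_size] = chunk.ljust(block_size, b"\x00")
    return blocks


def locate(byte_offset: int, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map a byte offset to (block address, offset inside that block)."""
    return divmod(byte_offset, block_size)


blocks = split_into_blocks(b"x" * 10_000)
print(len(blocks))   # 3
print(locate(5000))  # (1, 904)
```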

File Storage#

  • File-level or file-based storage
  • Hierarchically structured system of files and directories
  • Has metadata (created_at, updated_at, permissions, etc.)
  • The file name combined with its position in the hierarchy (its path) forms the unique identifier
  • These systems can be configured using RAID and are attached to a NAS system
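The points about hierarchical paths and file metadata can be seen with the Python standard library. The directory layout and file name below are made up for the example; it runs inside a temporary directory.

```python
import stat
import tempfile
from datetime import datetime, timezone
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # The hierarchical path is what identifies the file.
    path = Path(root) / "reports" / "2025" / "summary.txt"
    path.parent.mkdir(parents=True)
    path.write_text("hello")

    info = path.stat()
    metadata = {
        "size_bytes": info.st_size,
        "updated_at": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "permissions": stat.filemode(info.st_mode),
    }
    relative_id = path.relative_to(root).as_posix()

print(relative_id)             # reports/2025/summary.txt
print(metadata["size_bytes"])  # 5
```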

Object Storage#

  • Architecture oriented toward services, which makes it more scalable than the others
  • Implemented at the protocol level (layer 7): objects are accessed through APIs
  • Highly abstracted
  • Handles large amounts of data, completely detached from the application
  • Content and metadata support
  • Allows configuring things such as life-cycle, exclusions, etc.
  • Available through cloud providers (GCP, AWS, etc.)
  • Backup, replication and lifecycle support
    • Transparent to the application
  • Cloud Native architecture
  • High decoupling from the application
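To make the "content plus metadata behind an API" idea concrete, here is a toy in-memory model. It is not the API of any real provider; the class, method, and bucket names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    content: bytes
    metadata: dict[str, str] = field(default_factory=dict)


class InMemoryObjectStore:
    """Toy stand-in for an object storage service; not a real provider API."""

    def __init__(self) -> None:
        self._buckets: dict[str, dict[str, StoredObject]] = {}

    def put(self, bucket: str, key: str, content: bytes, **metadata: str) -> None:
        # Keys live in a flat namespace; "/" in a key is only a naming convention.
        self._buckets.setdefault(bucket, {})[key] = StoredObject(content, dict(metadata))

    def get(self, bucket: str, key: str) -> StoredObject:
        return self._buckets[bucket][key]


store = InMemoryObjectStore()
store.put("backups", "db/2025-12-20.sql", b"-- dump --", content_type="text/plain")
obj = store.get("backups", "db/2025-12-20.sql")
print(obj.content, obj.metadata)
```

The application only sees `put` and `get`; where and how the bytes are stored stays hidden, which is the decoupling described in the list above.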

RAID models#

  • Redundant Array of Independent Disks: combines multiple physical disks into a single logical system
  • Tradeoff between resilience, fault tolerance, performance and data integrity
  • RAID 0, 1, 5, 6 and 10 (1+0)

RAID 0 (Striping)#

  • Focuses on space and performance
    • Data is distributed equally across two or more disks
  • R/W optimized
    • Parallelizes the I/O operations across the disks
  • Lack of availability and resilience
  • Total IOPS is the sum of the IOPS of the disks
    • With 2 disks limited to 1000 IOPS each, the system has 2000 IOPS in total
  • If a single disk fails, all data is lost
  • Physical disks can be expanded horizontally
  • Useful for temporary data processing
  • Useful for persistent cache with a low TTL
  • Useful for video and image rendering
  • Useful for CI/CD clusters or stateless workloads with high I/O
  • Avoid for critical workloads
  • Useful for when R/W is a priority
  • Useful for temporary or redundant data
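Striping can be sketched as a round-robin distribution of fixed-size chunks across disks. This is a simplified model, assuming in-memory lists as "disks" and a tiny 4-byte stripe size so the output stays readable.

```python
def stripe(data: bytes, num_disks: int, stripe_size: int = 4) -> list[list[bytes]]:
    """Distribute fixed-size chunks across disks in round-robin order."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    for i in range(0, len(data), stripe_size):
        disks[(i // stripe_size) % num_disks].append(data[i:i + stripe_size])
    return disks


def read_back(disks: list[list[bytes]]) -> bytes:
    """Reassemble the data by reading one chunk from each disk in turn."""
    out = bytearray()
    for row in range(max(len(d) for d in disks)):
        for disk in disks:
            if row < len(disk):
                out += disk[row]
    return bytes(out)


disks = stripe(b"ABCDEFGHIJKLMNOP", num_disks=2)
print(disks)             # [[b'ABCD', b'IJKL'], [b'EFGH', b'MNOP']]
print(read_back(disks))  # b'ABCDEFGHIJKLMNOP'
```

Each disk holds only part of the data, which is why losing any single disk makes the whole volume unreadable.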

RAID 1 (Mirroring)#

  • Focuses on availability and redundancy
  • Applies mirroring of data – each disk has an exact copy
  • Every write is copied to the other disk
  • If a disk has a problem, another disk takes responsibility without any interruption or data loss
  • Uses half of the storage due to replication (50% of usable space)
  • Useful for high availability and simplicity
  • Useful for Cluster Management
  • Useful for Etcd, state managers, etc.
  • Useful for Storage with Edge Nodes
  • Useful for file systems
  • Useful for critical systems
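Mirroring can be modeled as writing every block to all disks and reading from any healthy one. The class below is a toy sketch with hypothetical names; real RAID 1 also has to resynchronize a replaced disk, which is not shown.

```python
class MirroredVolume:
    """Toy RAID 1 volume: every write goes to all disks, reads use any healthy one."""

    def __init__(self, num_disks: int = 2) -> None:
        self.disks: list[dict[int, bytes]] = [{} for _ in range(num_disks)]
        self.failed: set[int] = set()

    def write(self, block: int, data: bytes) -> None:
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def read(self, block: int) -> bytes:
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise IOError("all mirrors have failed")


volume = MirroredVolume()
volume.write(0, b"config")
volume.failed.add(0)   # simulate losing the first disk
print(volume.read(0))  # b'config'
```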

RAID 5 (Striping with Distributed Parity)#

  • Balances the tradeoffs of RAID 0 and RAID 1
  • Performs well with R/W without sacrificing availability and data safety
  • Performs write operations distributed between the volumes
  • Maintains parity metadata distributed between volumes
  • Requires at least 3 disks; with N disks, N-1 disks' worth of space is usable and 1 disk's worth is used for distributed parity
  • Performance is reduced when one disk fails or is removed
  • Tolerates 1 disk failure
  • Slow reconstruction
  • Good use of disk space
  • Useful for data that rarely changes; good for read-heavy workloads
  • Useful for long-term data
  • Useful for storing logs
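Single-parity schemes such as RAID 5 are typically built on XOR: the parity block is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt from the rest. The sketch below shows one stripe only and does not model how parity rotates between disks; the helper name is an assumption.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_blocks = [b"AAAA", b"BBBB", b"CCCC"]  # data blocks of one stripe
parity = xor_blocks(data_blocks)

# Simulate losing the second block and rebuilding it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'BBBB'
```

Rebuilding requires reading every surviving block in the stripe, which helps explain why reconstruction is slow and performance drops in degraded mode.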

RAID 6 (Striping with double parity)#

  • Similar to the distributed parity of RAID 5
  • Additional layer of distributed parity
  • 2 disks can fail simultaneously without data loss
  • RAID 6 uses 2 disks' worth of capacity for parity instead of 1 (as in RAID 5)
  • Write performance is slightly decreased due to double parity
  • Useful for high data ingestion and retention
  • Useful for geographical replication
  • Useful for Data Lakes and Data Warehouse
  • Generally standard in Datacenters

RAID 10 (Combination of RAID 1 with RAID 0)#

  • Combines RAID 1 and RAID 0
  • Distributes the blocks between different disks
  • High availability
  • Tolerates simultaneous disk failures, as long as both disks of the same mirrored pair do not fail
  • Total capacity is reduced by half, given 50% is used for data replication
  • Expensive
  • Useful for financial systems, where availability is a priority
  • Useful for systems where the cost of losing data is higher than the cost of RAID 10
  • Requires a minimum of 4 disks – for mirroring and striping
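Whether a RAID 10 array survives depends on which disks fail, not only on how many. A minimal sketch, assuming two-way mirrors and hypothetical disk numbering:

```python
def raid10_survives(mirror_pairs: list[tuple[int, int]], failed: set[int]) -> bool:
    """Data survives as long as every mirror pair keeps at least one healthy disk."""
    return all(not (a in failed and b in failed) for a, b in mirror_pairs)


pairs = [(0, 1), (2, 3)]  # 4 disks: two mirrored pairs, striped together
print(raid10_survives(pairs, {0}))     # True
print(raid10_survives(pairs, {0, 2}))  # True
print(raid10_survives(pairs, {0, 1}))  # False
```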

Comparison#

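| RAID | Fault Tolerance | Performance | Usable space | Use-case |
|------|-----------------|-------------|--------------|----------|
| 0 | None | High | 100% | Data that has no transactional value |
| 1 | 1 Disk | Medium | 50% | |
| 5 | 1 Disk | Good | ~80% | Critical OSs, we can restore old backups, etc. |
| 6 | 2 Disks | Regular | ~65% | Systems that require more resilience |
| 10 | 1 per pair | High | 50% | |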
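The usable-space column depends on the number of disks for the parity levels. The sketch below computes the usable fraction for equally sized disks; the function name is hypothetical, and RAID 10 is assumed to use two-way mirrors.

```python
def usable_fraction(level: int, num_disks: int) -> float:
    """Return the usable share of raw capacity for equally sized disks."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if num_disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 10 and num_disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    if level == 0:
        return 1.0
    if level == 1:
        return 1 / num_disks  # every disk holds a full copy
    if level == 5:
        return (num_disks - 1) / num_disks
    if level == 6:
        return (num_disks - 2) / num_disks
    return 0.5  # RAID 10 with two-way mirrors


print(usable_fraction(5, 5))   # 0.8
print(usable_fraction(6, 6))   # 0.6666666666666666
print(usable_fraction(10, 4))  # 0.5
```

For example, 5 disks in RAID 5 give 80% usable space and 6 disks in RAID 6 give about 67%, in line with the approximate figures in the table.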