Designing Data-Intensive Applications

Foundations of Data Systems

  • Data system categorization: Look beyond traditional categories (database, queue, cache) to see data systems as tools with overlapping functionality
  • Key requirements: Focus on reliability, scalability, and maintainability as foundational properties
  • Reliability definition: Build systems that work correctly even when faults occur
  • Scalability approach: Design for graceful handling of growth in data, traffic, and complexity
  • Maintainability emphasis: Create systems that many engineers can work on productively
  • Fault vs. failure: Distinguish between faults (one component deviating from its spec) and failures (the system as a whole no longer providing the required service to the user)
  • Hardware faults: Design redundancy for hardware components expected to fail
  • Software faults: Prevent systematic errors through testing, isolation, and monitoring
  • Human errors: Minimize through well-designed interfaces, sandboxed environments, and easy recovery
  • Measurement importance: Use metrics to make quantitative statements about scalability
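
The point about measurement can be made concrete with response-time percentiles, which describe what users actually experience better than an average does. The following is a minimal sketch using the nearest-rank method; the sample latencies and the helper name `percentile` are made up for illustration.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, including two slow outliers.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
mean_ms = sum(response_times_ms) / len(response_times_ms)

print(f"median={median_ms} ms, p95={p95_ms} ms, mean={mean_ms} ms")
```

In this made-up sample the two outliers pull the mean far above the median, which is why tail percentiles are usually reported alongside it.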

Data Models and Query Languages

  • Data model impact: Recognize how your data model shapes application code and thinking
  • Relational model benefits: Leverage the relational model for its generality and flexibility
  • Document model advantages: Consider document model for schema flexibility and locality
  • Impedance mismatch awareness: Be conscious of the object-relational mismatch
  • Schema evolution: Plan for schema changes over time regardless of data model
  • Normalization trade-offs: Balance normalization against read performance and complexity
  • Graph model applicability: Use graph models when many-to-many relationships are core to your data
  • Query language paradigms: Understand differences between declarative and imperative approaches
  • MapReduce limitations: Recognize the expressiveness limitations of MapReduce patterns
  • Declarative preference: Prefer declarative query languages to enable optimization and parallelization
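
To illustrate the declarative versus imperative distinction, here is a small sketch that answers the same question twice: once with a hand-written loop, and once with a SQL query run through Python's built-in `sqlite3` module. The table name, columns, and sample rows are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical sample data: (name, family)
animals = [
    ("Great white shark", "Sharks"),
    ("Lion", "Cats"),
    ("Whale shark", "Sharks"),
]


def sharks_imperative(rows):
    """Imperative: spell out how to walk the data, step by step."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result


def sharks_declarative(rows):
    """Declarative: state what is wanted and let the query engine decide how."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
        conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM animals WHERE family = ? ORDER BY name",
            ("Sharks",),
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


print(sorted(sharks_imperative(animals)))
print(sharks_declarative(animals))
```

The SQL version says nothing about iteration order or index use, which leaves the database free to change its execution strategy without the query changing.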

Storage and Retrieval

  • Storage engine classification: Distinguish between OLTP (transaction processing) and OLAP (analytics) workloads
  • Log-structured designs: Consider log-structured storage engines for high write throughput
  • B-tree characteristics: Leverage B-trees for ordered data with strong read performance
  • Comparing LSM and B-trees: Choose LSM-trees for high write throughput, B-trees for strong read performance
  • Clustered index impact: Understand impact of clustered indexes on data locality
  • Secondary index overhead: Account for write amplification from maintaining secondary indexes
  • In-memory databases: Use in-memory databases when performance justifies the cost
  • Data warehouse purpose: Build separate systems optimized for analytics queries
  • Column-oriented storage: Use column storage for analytical queries that scan many rows but read only a few columns of a wide table
  • Compression benefits: Apply compression to reduce storage costs and improve throughput
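
The log-structured idea can be sketched in a few lines: writes only ever append to a log, and an in-memory hash index remembers where the latest value for each key lives. This is a toy illustration under simplifying assumptions (an in-memory buffer instead of a file, no compaction, no crash recovery); the class name `AppendOnlyStore` is hypothetical.

```python
import io


class AppendOnlyStore:
    """Toy log-structured key-value store with an in-memory hash index."""

    def __init__(self):
        self._log = io.BytesIO()  # stands in for an append-only file on disk
        self._index = {}          # key -> (offset, length) of the latest value

    def put(self, key, value):
        """Append the value to the end of the log and point the index at it."""
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(value)
        self._index[key] = (offset, len(value))

    def get(self, key):
        """Look up the offset in the index, then read the value from the log."""
        offset, length = self._index[key]
        self._log.seek(offset)
        return self._log.read(length)

    def log_size(self):
        """Total bytes in the log, including values that were overwritten."""
        return len(self._log.getvalue())


store = AppendOnlyStore()
store.put("a", b"1")
store.put("a", b"22")  # the old value stays in the log until compaction
print(store.get("a"), store.log_size())
```

Overwritten values remain in the log, which is why real log-structured engines run compaction in the background to reclaim space.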

Encoding and Evolution

  • Encoding importance: Select appropriate encodings for data in memory, on disk, and over networks
  • Backward compatibility: Maintain the ability to read old data with new code
  • Forward compatibility: Ensure new data can be read with old code
  • Language-specific formats: Avoid language-specific serialization formats due to coupling and performance issues
  • Text format trade-offs: Understand that text formats (JSON, XML) are human-readable but verbose
  • Binary encoding benefits: Use binary encodings for efficiency and reduced size
  • Schema evolution: Design for schema changes without downtime
  • Schema registry value: Consider schema registries to manage schemas and evolution
  • RPC complications: Be aware of the many failure modes unique to RPC calls
  • Async workflows: Design for asynchronous message passing when appropriate
  • Message broker benefits: Use message brokers to decouple producers from consumers
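
Backward and forward compatibility can be illustrated with two versions of a reader for the same record. This is a sketch using JSON and a made-up schema change (version 2 adds an optional `email` field); schema-based binary formats follow the same principle with field tags and defaults.

```python
import json


def decode_v1(payload):
    """Old reader: knows only 'name' and ignores fields it does not recognize.

    Ignoring unknown fields is what makes old code forward compatible
    with data written by newer code.
    """
    data = json.loads(payload)
    return {"name": data["name"]}


def decode_v2(payload):
    """New reader: 'email' is optional and falls back to a default.

    Giving new fields a default is what makes new code backward compatible
    with data written by older code.
    """
    data = json.loads(payload)
    return {"name": data["name"], "email": data.get("email")}


old_record = json.dumps({"name": "Ada"})
new_record = json.dumps({"name": "Ada", "email": "ada@example.com"})

print(decode_v2(old_record))  # new code reading old data
print(decode_v1(new_record))  # old code reading new data
```

One caveat this sketch does not handle: if old code reads a new record and writes it back, it should preserve the unknown fields rather than drop them.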

Replication

  • Replication purpose: Replicate data for data locality, availability, and read scaling
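  • Synchronous vs. asynchronous: Understand the trade-off: synchronous replication guarantees the follower has an up-to-date copy but blocks writes when that follower is unavailable, while asynchronous replication keeps writes fast but can lose recent writes on failover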
  • Single-leader benefits: Use single-leader replication for simplicity when read scaling is sufficient
  • Replica setup: Configure new replicas properly to avoid inconsistency
  • Replica failure: Design robust processes for handling node failures and rejoins
  • Replication lag: Account for replication lag in system design and user experience
  • Read consistency: Choose appropriate read consistency for your application needs
  • Multi-leader complexities: Recognize the conflict resolution requirements of multi-leader setups
  • Leaderless trade-offs: Understand durability and consistency trade-offs in leaderless systems
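  • Quorum calculations: Choose w + r > n in leaderless replication so that read and write quorums overlap and reads are likely to see the latest write; this condition alone does not guarantee linearizability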
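
The reasoning behind the quorum condition is that when w + r > n, every set of r nodes read from must share at least one node with every set of w nodes written to. The sketch below checks this by brute force for small clusters; the function names are hypothetical.

```python
from itertools import combinations


def quorum_condition_holds(n, w, r):
    """The arithmetic condition on n replicas, w write acks, and r read responses."""
    return w + r > n


def every_read_overlaps_every_write(n, w, r):
    """Brute force: does each possible read set share a node with each write set?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(n, w, r, quorum_condition_holds(n, w, r),
          every_read_overlaps_every_write(n, w, r))
```

Overlap only ensures that at least one contacted node has seen the write. Concurrent writes, partially failed writes, and sloppy quorums can still produce stale or conflicting reads.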

Partitioning

  • Partitioning purpose: Partition data to improve scalability and balance load
  • Key-based partitioning: Choose partitioning schemes based on key distribution knowledge
  • Hash partitioning: Use hash partitioning to distribute load evenly
  • Range partitioning: Apply range partitioning when range queries are important
  • Secondary index challenges: Address the scatter/gather overhead of secondary indexes across partitions
  • Rebalancing approaches: Select rebalancing strategies that minimize data movement
  • Request routing: Create a robust system for routing requests to appropriate partitions
  • Skew avoidance: Design to avoid hot spots from skewed workloads
  • Parallel query execution: Build mechanisms for parallel query execution across partitions
  • Combining techniques: Combine replication and partitioning for scalable, available systems
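
Hash partitioning and low-movement rebalancing can be sketched together. The example assumes a fixed number of partitions chosen up front: keys map to partitions with a stable hash, and rebalancing moves whole partitions between nodes without ever changing which partition a key belongs to. The node names, partition count, and helper functions are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical; fixed when the data store is first set up


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (not Python's per-process hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def initial_assignment(nodes, num_partitions=NUM_PARTITIONS):
    """Spread partitions over the nodes round-robin."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def add_node(assignment, new_node):
    """Return a new assignment where new_node takes over a fair share of partitions.

    Assumes new_node is not already part of the assignment.
    """
    updated = dict(assignment)
    counts = Counter(updated.values())
    fair_share = len(updated) // (len(counts) + 1)
    moved = 0
    for partition in sorted(updated):
        if moved >= fair_share:
            break
        owner = updated[partition]
        if counts[owner] > fair_share:
            updated[partition] = new_node
            counts[owner] -= 1
            moved += 1
    return updated


before = initial_assignment(["node-a", "node-b"])
after = add_node(before, "node-c")
moved_partitions = [p for p in before if before[p] != after[p]]
print("moved partitions:", moved_partitions)
print("user:42 lives in partition", partition_for("user:42"))
```

Only the partitions handed to the new node move. By contrast, hashing keys directly modulo the number of nodes would relocate most keys whenever the node count changes.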

Transactions

  • Transaction purpose: Use transactions to simplify error handling and concurrency for applications
  • ACID interpretation: Understand the practical meaning of Atomicity, Consistency, Isolation, and Durability
  • Weak isolation impacts: Be aware of anomalies possible under weak isolation levels
  • Read phenomena: Recognize and prevent dirty reads, non-repeatable reads, and phantom reads
  • Isolation levels: Choose appropriate isolation levels based on application requirements
  • Optimistic vs. pessimistic: Select optimistic or pessimistic concurrency control based on contention expectations
  • Serializability cost: Understand the performance cost of serializable isolation
  • Two-phase locking: Use 2PL when strong isolation is required despite performance impact
  • Serializable snapshot isolation: Consider SSI for serializable isolation with better performance
  • External consistency need: Determine if your application requires external consistency/linearizability
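
Optimistic concurrency control can be sketched with a version check at write time: a writer proceeds without locks and retries only if someone else changed the value first. This is a single-process illustration of the idea, not a real database client; `VersionedStore`, `VersionConflict`, and `increment` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the value changed since it was read."""


class VersionedStore:
    """Toy store where each key carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0, value None."""
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        """Write only if nobody has written since expected_version was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, "
                                  f"found v{current_version}")
        self._data[key] = (current_version + 1, value)


def increment(store, key, retries=3):
    """Read-modify-write loop that retries on conflict instead of locking."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.write_if_unchanged(key, version, (value or 0) + 1)
            return
        except VersionConflict:
            continue
    raise VersionConflict(f"{key}: gave up after {retries} attempts")


store = VersionedStore()
increment(store, "counter")
increment(store, "counter")
print(store.read("counter"))
```

This approach tends to work well when contention is low, since conflicts and retries are rare. Under heavy contention, the retries can cost more than a pessimistic lock would.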

Distributed System Troubles

  • Partial failure reality: Design for partial failures in distributed systems
  • Unreliable networks: Assume the network will fail in various ways
  • Timeout tuning: Set timeouts carefully based on expected response distributions
  • Network faults handling: Handle network partitions and faults gracefully
  • Clock synchronization limitations: Don’t rely on synchronized clocks for critical functionality
  • Monotonic clock usage: Use monotonic clocks for measuring elapsed time
  • Anomaly detection: Implement detection of clock skew and network issues
  • Failure detection: Create robust failure detection mechanisms with appropriate timeouts
  • Knowledge limitations: Acknowledge the fundamental limitations of knowledge in distributed systems
  • Byzantine fault tolerance: Determine if Byzantine fault tolerance is needed for your threat model
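
The advice on monotonic clocks translates directly into code. A time-of-day clock can jump backward or forward when it is adjusted (for example by NTP), so elapsed time and timeouts are better measured with a monotonic clock. A minimal sketch using Python's `time.monotonic()`; the helper names are hypothetical.

```python
import time


def measure_elapsed(fn):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start  # never negative, even if wall time is adjusted
    return result, elapsed


def deadline_exceeded(started_at, timeout_seconds):
    """Check a timeout against a start time taken from time.monotonic()."""
    return time.monotonic() - started_at > timeout_seconds


result, elapsed = measure_elapsed(lambda: sum(range(100_000)))
print(f"result={result}, elapsed={elapsed:.6f}s")
```

Monotonic readings are only meaningful as differences on the same machine. They cannot be compared across nodes.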

Consistency and Consensus

  • Consistency models: Understand the spectrum from eventual consistency to strict serializability
  • Linearizability benefits: Use linearizability when recency guarantees are critical
  • Linearizability costs: Be aware of the performance and availability costs of linearizability
  • Causality importance: Respect causal dependencies in your system design
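  • Lamport timestamps: Use Lamport timestamps to obtain a total order of events that is consistent with causality
  • Total order broadcast: Recognize that total order broadcast is equivalent to consensus, and use it to deliver messages to all nodes in the same order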
  • Consensus algorithms: Choose appropriate consensus algorithms for your requirements
  • Two-phase commit limitations: Understand the coordinator failure vulnerability in 2PC
  • Distributed transactions: Use distributed transactions cautiously due to performance and operational costs
  • Idempotence design: Design operations to be idempotent when possible
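
Lamport timestamps are simple enough to sketch in full. Each node keeps a counter, a timestamp is the pair (counter, node ID), and a node that receives a message advances its counter past the one it saw. The class and node names below are illustrative.

```python
class LamportClock:
    """Logical clock producing (counter, node_id) timestamps."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Timestamp a local event or an outgoing message."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        """Advance past the sender's counter, then timestamp the receive event."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)


a = LamportClock("A")
b = LamportClock("B")
c = LamportClock("C")

sent = a.tick()               # A sends a message
received = b.receive(sent[0])  # B receives it, so this event is causally later
unrelated = c.tick()          # C does something concurrently

print(sent, received, unrelated)
print(sorted([received, unrelated, sent]))  # total order via tuple comparison
```

Comparing the tuples gives a total order in which a cause always sorts before its effect, with node IDs breaking ties between concurrent events. The order alone cannot tell you whether two events were concurrent or causally related.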

Batch Processing

  • Batch processing value: Leverage batch processing for cost-efficient high-volume data processing
  • Unix philosophy: Apply Unix philosophy of composability to batch job design
  • MapReduce workflow: Structure complex analytics as multi-stage MapReduce jobs
  • Sorting benefits: Utilize sorting for efficient grouping and joining in batch processing
  • Partitioning effect: Use partitioning to parallelize batch processing
  • Fault tolerance: Design batch jobs to handle partial failures gracefully
  • Task granularity: Choose appropriate task sizes to balance parallelism and overhead
  • Output immutability: Treat outputs as immutable for simpler recovery and debugging
  • Separation of concerns: Separate the logical data processing from physical execution
  • Workflow coordination: Coordinate workflows using scheduler tools like Airflow or Oozie
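
The role of sorting in batch processing can be shown with a single-process imitation of the MapReduce pattern: the mapper emits key-value pairs, sorting brings equal keys next to each other, and the reducer then handles each group in one pass. This is a sketch of the pattern only, not of any particular framework's API.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(sorted_pairs):
    """Reducer: input must be sorted by key so each key forms one contiguous group."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


def word_count(lines):
    """Map, sort by key (standing in for the shuffle), then reduce."""
    pairs = sorted(map_phase(lines), key=itemgetter(0))
    return dict(reduce_phase(pairs))


print(word_count(["a b a", "B c"]))
```

Because the reducer only ever looks at adjacent records, it needs very little memory. That property is what lets the same approach scale to inputs larger than RAM when the sort is done on disk and across machines.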

Stream Processing

  • Stream processing applications: Apply stream processing for event-rich domains
  • Event stream origin: Capture event streams at their source to enable derived views
  • Messaging delivery: Choose message delivery semantics based on application needs
  • Messaging systems: Select messaging systems based on throughput, latency, and durability requirements
  • Database integration: Integrate streams with databases through change data capture
  • Stream processing patterns: Implement common patterns like windowing, sampling, and joining
  • Time handling: Handle event time vs. processing time discrepancies
  • Watermark heuristics: Use watermarks to reason about event completeness
  • State management: Manage state carefully in stream processors
  • Idempotent operations: Design stream operations to be idempotent so that retries after a failure still produce effectively-once results
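
Windowing by event time can be sketched as follows. Events are assigned to fixed-size (tumbling) windows based on the timestamp embedded in the event, not on when the processor happens to see them, so out-of-order arrival does not change the result. The window size and sample events are made up, and the sketch ignores the harder question of when a window can be declared complete.

```python
from collections import Counter

WINDOW_SECONDS = 60  # hypothetical window size


def window_start(event_time, size=WINDOW_SECONDS):
    """Start of the tumbling window that contains event_time (in seconds)."""
    return event_time - (event_time % size)


def count_by_window(events, size=WINDOW_SECONDS):
    """Count events per window; events are (event_time, payload) pairs."""
    counts = Counter()
    for event_time, _payload in events:
        counts[window_start(event_time, size)] += 1
    return dict(counts)


# Events listed in arrival order; note that 59 arrives after 61.
events = [(0, "a"), (61, "b"), (59, "c"), (120, "d"), (65, "e")]
print(count_by_window(events))
```

In a real stream processor the counts for a window would be emitted once a watermark suggests no more events for it are expected, with a policy for stragglers that arrive later.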

The Future of Data Systems

  • Data integration approaches: Evaluate different approaches to integrating disparate systems
  • Derived data value: Recognize the value of derived data for system flexibility
  • Batch and stream unification: Unify batch and stream processing conceptually
  • Unbundling databases: Consider unbundling database functionality for specific needs
  • Separation of storage and processing: Separate storage from processing for flexibility
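  • Correctness enforcement: Enforce correctness end to end across the whole application (for example, with unique request IDs), rather than relying only on infrastructure-level guarantees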
  • Metadata importance: Leverage metadata for system observability and evolution
  • Ethical data use: Consider the ethical implications of data collection and processing
  • Predictive privacy impact: Anticipate privacy concerns in data system design
  • End-to-end thinking: Design data systems with end-to-end application needs in mind
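
End-to-end thinking can be illustrated with duplicate suppression. If the client generates a unique ID for each logical request and the server remembers which IDs it has applied, a retry after a lost response cannot apply the operation twice, whatever the layers in between do. A minimal sketch with a hypothetical `Account` class; a real system would persist the applied IDs atomically with the balance.

```python
import uuid


class Account:
    """Toy account that applies each request ID at most once."""

    def __init__(self, balance=0):
        self.balance = balance
        self._applied_request_ids = set()

    def deposit(self, request_id, amount):
        """Apply the deposit unless this request ID was already processed.

        Returns True if the deposit was applied, False if it was a duplicate.
        """
        if request_id in self._applied_request_ids:
            return False
        self._applied_request_ids.add(request_id)
        self.balance += amount
        return True


account = Account()
request_id = str(uuid.uuid4())  # generated once by the client, reused on retry

first_attempt = account.deposit(request_id, 10)
retry = account.deposit(request_id, 10)  # e.g. the first response was lost
print(first_attempt, retry, account.balance)
```

The request ID travels from the end user's action all the way to the data store, which is what makes the guarantee end to end and not a property of any single network hop.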

Key Takeaways

  1. Reliability engineering: Build systems that continue working correctly even when things go wrong
  2. Scalability planning: Design for growth in data volume, complexity, and traffic from the start
  3. Maintainability focus: Prioritize operability, simplicity, and evolvability in system design
  4. Data model selection: Choose data models based on application access patterns, not just data structure
  5. Storage engine fit: Select storage engines based on workload characteristics (write-heavy vs. read-heavy)
  6. Encoding flexibility: Use encodings that allow for independent evolution of services
  7. Replication purpose: Apply replication patterns based on specific needs for availability, latency, or scalability
  8. Partitioning strategy: Partition data to scale write throughput and balance load appropriately
  9. Consistency levels: Select consistency levels based on actual application requirements, not dogma
  10. System integration: Design thoughtful integration between systems using batch, stream processing, and derived data