Grokking the Principles and Practices of Advanced System Design
Learn the principles and practices of advanced system design: how to design distributed, scalable systems and tackle real-world challenges efficiently.
This course teaches you how large, real-world systems are built and operated to meet strict service-level agreements. You’ll learn the building blocks of modern system design, how to pick and combine the right pieces, and how to weigh the trade-offs between them.
You’ll study well-known systems from hyperscalers such as Google, Facebook, and Amazon. The course draws on hand-picked seminal work in system design that has stood the test of time and is grounded in strong principles. You will learn these principles and see them in action in real-world systems.
After taking this course, you will be able to solve a variety of system design interview problems. You will better understand what happens when your favorite app has an outage and be able to follow its post-mortem reports. This course will also raise your system design standards so that you can emulate similar success in your own endeavors.
What You’ll Learn
- Working knowledge of building large-scale systems
- Ability to evaluate common system design trade-offs
- Ability to map interview questions and on-the-job design tasks to well-known systems
- Familiarity with the complexity hidden behind seemingly simple real-world systems
- Understanding of how large cloud services are hosted across geographically dispersed data centers
Course Content
1. Prologue
This chapter sets the stage for the course, emphasizing learning from historical systems and balancing innovation with established design practices.
- Case Studies: Standing on the Shoulders of Giants
2. File Systems
This chapter sets the stage for exploring distributed file systems, focusing on advancements in data management with systems like GFS, Colossus, and Tectonic.
- Introduction to Distributed File Systems
3. Google File System (GFS)
This chapter covers the Google File System (GFS), focusing on efficient management of large data files with scalability, fault tolerance, and high throughput.
- Introduction to GFS
- GFS File Operations
- Detailed Design of GFS
- Workflow of Create and Read File Operations in GFS
- Workflow of Write Operations in GFS
- Workflow of Delete and Snapshot Operations in GFS
- Relaxed Data Consistency Model
- Dealing with Data Inconsistencies in GFS
- Metadata Consistency Model of GFS
- Evaluation of GFS
- Quiz on GFS
4. Google Colossus File System
This chapter covers Colossus, which improves scalability and performance over GFS using a distributed metadata model for better data management and low latency.
- Introduction to Colossus
- Design and Evaluation of Colossus
- Quiz on Colossus
5. Facebook’s Tectonic File System
This chapter discusses Tectonic File System, providing scalable storage with performance isolation and optimized resource management for diverse workloads.
- Introduction to Tectonic
- ZippyDB Design
- Detailed Design of Tectonic
- Multitenancy in Tectonic
- Tenant-specific Optimization in Tectonic
- Empirical Evaluation of Tectonic’s Functional Requirements
- Evaluation of Tectonic
- Quiz on Tectonic
6. Databases
This chapter covers the evolution from relational to NoSQL databases, highlighting the balance between scalability, availability, and consistency.
- Introduction to Distributed Databases
7. Google Bigtable
This chapter covers Bigtable, a scalable storage solution for managing large datasets, enhancing performance and availability with its unique design.
- Introduction to Bigtable
- Data Model of Bigtable
- Detailed Design of Bigtable: Part I
- Detailed Design of Bigtable: Part II
- Design Refinements in Bigtable
- Evaluation of Bigtable
- Quiz on Bigtable
8. Google Megastore
This chapter covers Megastore, blending NoSQL scalability with relational features for high availability, ACID transactions, and optimized cloud performance.
- Introduction to Megastore
- High-level Design for Better Availability and Scalability
- Data Model of Megastore
- Replication in Megastore
- Evaluation of Megastore
- Quiz on Megastore
9. Google Spanner
This chapter covers Google Spanner, combining relational features with NoSQL scalability for strong consistency, high availability, and global data management.
- Introduction to Spanner
- Detailed Design of Spanner
- Database Buckets and Data Model of Spanner
- TrueTime API in Spanner
- Spanner, TrueTime, and the CAP Theorem
- Concurrency Control in Spanner
- Database Operations in Spanner
- Evaluation of Spanner
- Quiz on Spanner
10. Key-value Stores
This chapter introduces key-value stores, crucial for caching, NoSQL databases, and enhancing scalability and availability in modern distributed applications.
- Introduction to Key-value Stores
11. Many-core Key-value Store
This chapter covers the many-core key-value store, enhancing efficiency and scalability while addressing power consumption and performance challenges.
- Motivation and Requirements for a Many-core Approach
- Estimations and Limitations of a Many-core System
- Detailed Design of a Many-core System
- Evaluation of the Many-core System
- Quiz on Many-core Systems
12. Scaling Memcache
This chapter explores Memcache scaling strategies, addressing performance, consistency, and network efficiency challenges across various operational levels.
- Introduction to Scaling Memcache
- Single-server Level of Memcache
- Cluster Level of Memcache
- Regional Level of Memcache
- Cross-regional Level of Memcache
- Evaluation of Memcache
- Quiz on Memcache
13. SILT
This chapter covers SILT, which optimizes key-value storage with a multi-store architecture, focusing on memory efficiency, low latency, and data management.
- Introduction to SILT
- High-level Design of SILT
- A Write-friendly Store for SILT: Part I
- A Write-friendly Store for SILT: Part II
- A Write-friendly Store for SILT: Part III
- Intermediary Store(s) in SILT
- A Memory-efficient Store for SILT: Part I
- A Memory-efficient Store for SILT: Part II
- A Memory-efficient Store for SILT: Part III
- Request Flows in SILT
- Evaluating and Extending the Design of SILT
- Quiz on SILT
14. Amazon DynamoDB
This chapter covers DynamoDB, a managed NoSQL service designed for high availability, strong durability, and scalability, meeting diverse data management needs.
- Introduction to DynamoDB
- High-level Design of DynamoDB
- No Fixed Schema in DynamoDB
- Partitioning and Replication in DynamoDB
- Adapting to Traffic Patterns in DynamoDB
- Durability and Correctness in DynamoDB
- Ensuring High Availability in DynamoDB
- Quiz on DynamoDB
15. Concurrency Management
This chapter introduces concurrency management methods for efficiently handling simultaneous client requests in distributed systems.
- Introduction to Concurrency Management
16. Two-phase Locking (2PL)
This chapter covers 2PL, a concurrency control mechanism ensuring data integrity, while addressing challenges like deadlocks and throughput issues.
- Introduction to Two-Phase Locking (2PL)
- Analysis and Evaluation of Two-Phase Locking (2PL)
- Quiz on 2PL
17. Google Chubby Locking Service
This chapter covers Chubby, a distributed locking service that enhances coordination, availability, and fault tolerance in Google’s systems with robust design.
- Introduction to Chubby
- Detailed Design of Chubby: Part I
- Detailed Design of Chubby: Part II
- Detailed Design of Chubby: Part III
- Detailed Design of Chubby: Part IV
- The Rationale Behind Chubby’s Design
- Evaluation of Chubby
- Quiz on Chubby
18. ZooKeeper
This chapter covers ZooKeeper, a coordination system for distributed environments, offering efficient resource management and high availability.
- Introduction to ZooKeeper
- Detailed Design of ZooKeeper
- Primitives of ZooKeeper
- Evaluation of ZooKeeper
- Quiz on ZooKeeper
19. Big Data Processing: Batch to Stream Processing
This chapter explores the evolution and significance of big data processing systems like MapReduce, Spark, and Kafka in data handling and management.
- Introduction to Big Data Processing Systems
20. MapReduce
This chapter covers MapReduce, which simplifies processing large datasets with a user-friendly model that enables efficient parallelization and fault tolerance.
- System Design: MapReduce
- High-level Design of MapReduce
- MapReduce: Detailed Design
- Design Refinements in MapReduce: Part I
- Design Refinements in MapReduce: Part II
- MapReduce: Evaluation
- Concluding MapReduce
- Quiz on MapReduce
21. Spark
This chapter covers Spark’s architecture, focusing on in-memory processing, RDDs, and features for low latency and fault tolerance.
- Introduction to Spark
- Requirements of Spark
- High-level Design of Spark
- Resilient Distributed Datasets of Spark
- Parallel Operations in Spark
- Shared Variables in Spark
- Detailed Design of Spark
- Refinements in Spark
- Evaluation of Spark
- Quiz on Spark
22. Kafka
This chapter introduces Kafka, a powerful messaging system for real-time event streaming, known for high scalability, efficiency, and reliable data delivery.
- Introduction to Kafka
- High-level Design of Kafka
- Detailed Design of Kafka
- Efficiency of Kafka
- Distributed Coordination in Kafka
- Delivery Guarantees of Kafka
- Evaluation of Kafka
- Quiz on Kafka
23. Consensus
This chapter introduces consensus in distributed systems, covering algorithms like Paxos and Raft, and key concepts like FLP and Byzantine faults.
- Introduction to Consensus in Distributed Systems
24. Understanding Consensus: Two Generals, FLP, & Byzantine Generals
This chapter explores consensus challenges in distributed systems, focusing on the Two Generals’ Problem, the FLP impossibility result, and the Byzantine Generals Problem.
- Consensus Prerequisites and Two Generals’ Problem
- FLP Impossibility
- The Byzantine Generals Problem
- Let AI Evaluate Your Understanding of Consensus Fundamentals
25. Two-phase Commit
This chapter explains 2PC, a protocol that ensures atomicity in distributed transactions by coordinating the commit decision across nodes, and examines how it handles failures.
- Introduction to Two-Phase Commit (2PC)
- Working of the Two-Phase Commit Protocol
- Failures in the Two-Phase Commit Protocol
- Quiz on Two-Phase Commit
26. State Machine Replication
This chapter covers State Machine Replication, which ensures fault tolerance by using replicated state machines to maintain consistency despite failures.
- Introduction to State Machine Replication
- State Machines
- Replication and Coordination of State Machines
- Ordering Requests: Part I
- Ordering Requests: Part II
- Fault Tolerance for Outputs and Clients
- Protocols for Maintaining Fault Tolerance: Part I
- Protocols for Maintaining Fault Tolerance: Part II
- SMR in Practice Via a Log
- Quiz on State Machine Replication
27. Paxos
This chapter explores the Paxos consensus algorithm, detailing its design, operation, and use in achieving reliable distributed consensus.
- Introduction to Paxos
- Basic Paxos Protocol Design
- Basic Paxos in Action
- The Rationale behind Paxos Design Choices
- Multi-Paxos
- Quiz on Paxos
28. Raft
This chapter covers Raft, a consensus algorithm ensuring consistency and fault tolerance through leader election, log replication, and cluster management.
- Introduction to Raft
- Raft’s Basics and High-Level Workflow
- Raft’s Leader Election Protocol
- Raft’s Log Replication Protocol
- Raft’s Safety, Fault-Tolerance, and Availability Protocols
- Raft’s Cluster Membership Changes
- Log Compaction and Client Interaction in Raft
- Quiz on Raft
29. Epilogue
This chapter concludes the course by emphasizing the application of system design principles to real-world challenges and encouraging ongoing exploration and learning.
- Conclusion
deepal –
“I just wrapped up ‘Grokking the Principles and Practices of Advanced System Design,’ and I can’t believe how much I learned! The way the course breaks down complex concepts made everything so much clearer. I loved the real-world examples; they really helped me see how to apply these ideas in practice. The hands-on approach was a game-changer for me, and I feel so much more ready for my interviews now. If you want to boost your system design skills, I highly recommend this course!”