A dive into Shopify’s multi-tenant architecture that allows them to fail over between
regions with zero downtime, move shops between shards, minimize the blast radius of catastrophes,
and throttle requests and serve cache hits straight out of the load balancers.
Some notes I took while watching this great talk:
Lots of customers have their own domains pointing at Shopify’s IPs:
- Not much Shopify can do at the DNS level in terms of traffic engineering.
- Instead they use BGP Anycast to announce their IP blocks to neighbouring ISPs (see the sketch after this list).
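A minimal sketch of what announcing an anycast block could look like in BIRD (1.x-style config). The ASNs, neighbour address, and prefix are documentation/example values, not anything from the talk:

```
# bird.conf -- hypothetical anycast announcement from one edge location.
protocol static anycast_block {
    # Originate the anycast prefix locally; the same prefix is announced
    # from every region, and BGP steers each client to the nearest one.
    route 198.51.100.0/24 blackhole;
}

protocol bgp upstream_isp {
    local as 64496;
    neighbor 192.0.2.1 as 64511;
    import none;                          # take no routes from the ISP here
    export where proto = "anycast_block"; # announce only the anycast prefix
}
```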
They are happy with nginx plus OpenResty and have developed some Lua
scripts to solve problems right at the load-balancer level (a combined sketch follows this list):
- BotSquasher, a Lua script that analyzes the Kafka stream of incoming requests and bans bots by spotting suspicious patterns, like very frequent refreshes or strange flows.
- EdgeCache, which skips the entire application stack and serves cached content directly from memcached.
- Pauser, which holds requests for a few seconds while a failover is in progress to avoid serving errors.
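A rough OpenResty sketch of how the three could hang together, assuming a shared dict that some out-of-band consumer of the Kafka stream keeps updated, plus the lua-resty-memcached library. All names, keys, and the `@app` location are made up for illustration, not taken from the talk:

```lua
-- nginx.conf (fragment):
--   lua_shared_dict edge_state 10m;
--   location / {
--       access_by_lua_file  /etc/nginx/lua/edge_access.lua;
--       content_by_lua_file /etc/nginx/lua/edge_cache.lua;
--   }

-- edge_access.lua: BotSquasher-style ban check plus Pauser.
-- A separate process consuming the Kafka request stream is assumed to
-- write "ban:<ip>" and "failover_in_progress" keys into the shared dict.
local state = ngx.shared.edge_state

-- BotSquasher: drop requests from IPs the stream analyzer flagged.
if state:get("ban:" .. ngx.var.remote_addr) then
    return ngx.exit(ngx.HTTP_FORBIDDEN)
end

-- Pauser: hold the request for a few seconds while a failover completes,
-- instead of forwarding it into a pod that would answer with errors.
local waited = 0
while state:get("failover_in_progress") and waited < 10 do
    ngx.sleep(0.5)  -- non-blocking; the worker keeps serving other requests
    waited = waited + 0.5
end

-- edge_cache.lua: EdgeCache, serving straight from memcached and only
-- falling back to the application stack on a miss.
local memcached = require "resty.memcached"

local memc = memcached:new()
if not memc then
    return ngx.exec("@app")  -- named fallback location, assumed to exist
end
memc:set_timeout(100)  -- ms; the edge should fail fast

local ok = memc:connect("127.0.0.1", 11211)
if not ok then
    return ngx.exec("@app")
end

local body = memc:get("page:" .. ngx.var.host .. ngx.var.request_uri)
if body then
    ngx.header["X-Edge-Cache"] = "hit"
    ngx.print(body)
    return ngx.exit(ngx.HTTP_OK)
end
return ngx.exec("@app")  -- miss: hand over to the full application stack
```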
A group of shops lives in a pod, which is composed of MySQL, Redis, memcached, and a cron runner. Pods are isolated from one another and form the stateful layer. The application workers are shared among pods and are stateless. A pod balancer moves bigger shops to less crowded pods to keep load and size balanced across pods.
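A minimal sketch of the kind of routing table a stateless worker could consult; the pod definitions and shop assignments are made up, and the real mapping presumably lives in a shared store so the pod balancer can move a shop without redeploying workers:

```lua
-- Hypothetical pod routing for stateless workers. Each pod bundles its
-- own stateful services; moving a shop between pods means rewriting one
-- row in this mapping (plus copying the data).
local pods = {
    [1] = { mysql = "mysql.pod1:3306", redis = "redis.pod1:6379",
            memcached = "memcached.pod1:11211" },
    [2] = { mysql = "mysql.pod2:3306", redis = "redis.pod2:6379",
            memcached = "memcached.pod2:11211" },
}

-- shop -> pod assignment; maintained by the pod balancer rather than
-- hashed, so big shops can be moved to less crowded pods.
local shop_to_pod = { [101] = 1, [102] = 1, [103] = 2 }

local function backends_for_shop(shop_id)
    local pod_id = assert(shop_to_pod[shop_id], "unknown shop")
    return pods[pod_id]
end

-- e.g. backends_for_shop(103).mysql  --> "mysql.pod2:3306"
```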
Application-level sharding, sharded by the shopId key: every table has a shopId key.
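A sketch of what "every table has a shopId key" implies for queries, with a hypothetical table and a lua-resty-mysql-style `db:query` helper: every statement is scoped to one shop, which is what keeps all of a shop's rows on one shard and makes the shop cheap to move:

```lua
-- Hypothetical query helper: every statement carries the shop_id, so all
-- of a shop's rows land on one shard and can be migrated together.
local function orders_for_shop(db, shop_id)
    assert(type(shop_id) == "number", "queries must always carry a shop_id")
    return db:query(string.format(
        "SELECT * FROM orders WHERE shop_id = %d", shop_id))
end
```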
They are running containers and Kubernetes in production.
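The stateless worker tier maps naturally onto something like a Kubernetes Deployment; a minimal hypothetical manifest (the image name, port, and replica count are invented, and the pods' stateful services would be managed separately):

```yaml
# Hypothetical manifest for the stateless, shared worker tier.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-worker
spec:
  replicas: 20
  selector:
    matchLabels:
      app: app-worker
  template:
    metadata:
      labels:
        app: app-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/shop/app-worker:latest
          ports:
            - containerPort: 8080
```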