Real time at larger scale

Mikito Takada

@mikitotakada

Mikito Takada, Zendesk (mixu.net, twitter.com/mikitotakada)

The stack at Zendesk

  • ejabberd (2010)
  • HAProxy/F5 + Socket.io + Redis (2011) - chat
  • HAProxy/F5 + Engine.io + Redis (2012) - chat, presence, status

Custom API and management code on top of Socket.io/Engine.io.

Easy to get started with a real-time app

  • npm install --save socket.io redis
  • receive messages and push them into the database
  • ...
  • profit!

Now comes the hard part

  • Scaling
  • Tooling and monitoring
  • Accommodating new use cases

Techniques

  • It's all about imposing the right constraints, so that your architecture allows for scaling.
  • This means presenting a consistent external API that abstracts over potential complexities and failure points.
  • Layers: [browser] [server] [backend service]
  • It also means following certain core techniques.

Techniques

  • Independently scalable components
  • Stateless server
  • Disposable processes
  • Monitoring over debugging
  • Client-side recovery

Independently scalable processes

  • It should be possible to add capacity for each layer of the app independently of the others.
  • Pass off/queue work to the next layer; don't care about the details.
  • Socket.io (client IDs), Redis (keys/values/pubsub) - write your stack on top to deal with things that are meaningful to your app (user IDs, resource subscriptions etc.). Client IDs are an implementation detail that your APIs should abstract over.
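
As a rough illustration of abstracting over client IDs, here is a minimal sketch of a registry that maps transport-level client IDs to user IDs and resource subscriptions. The class and method names (`SubscriptionRegistry`, `identify`, `usersSubscribedTo`) are hypothetical and not part of Socket.io or Engine.io; the point is only that callers ask questions in app terms and never see a client ID.

```javascript
// Hypothetical registry: all names are illustrative, not a real library API.
// It holds connection-level data only, which is rebuilt when clients reconnect.
class SubscriptionRegistry {
  constructor() {
    this.userByClient = new Map();      // clientId -> userId
    this.clientsByResource = new Map(); // resource -> Set of clientIds
  }

  // Called once a connection has been authenticated.
  identify(clientId, userId) {
    this.userByClient.set(clientId, userId);
  }

  subscribe(clientId, resource) {
    if (!this.clientsByResource.has(resource)) {
      this.clientsByResource.set(resource, new Set());
    }
    this.clientsByResource.get(resource).add(clientId);
  }

  // Forget everything about a connection when it closes.
  disconnect(clientId) {
    this.userByClient.delete(clientId);
    for (const clients of this.clientsByResource.values()) {
      clients.delete(clientId);
    }
  }

  // App-level question: which users are watching this resource?
  // Returns a sorted list of distinct user IDs.
  usersSubscribedTo(resource) {
    const clients = this.clientsByResource.get(resource) || new Set();
    const users = new Set();
    for (const clientId of clients) {
      const userId = this.userByClient.get(clientId);
      if (userId !== undefined) users.add(userId);
    }
    return [...users].sort();
  }
}
```

A user with two open tabs has two client IDs but shows up once in `usersSubscribedTo`, which is the level the rest of the app cares about.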

Do as little as possible

  • Processes should be simple; they should be managed by some other mechanism (e.g. a process manager, a load balancer etc.)
  • If something fails, just start a new instance
  • Separate tasks

Stateless processes

  • Data is what's important (keep multiple independent copies), not computation
  • Statelessness: any data that is important must be stored in a reliable, stateful backing service (e.g. DB)
  • Shared nothing architecture: each node is independent and self-sufficient
  • The process is just a cache for the backend service, and nothing important is on it.
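
The "process is just a cache" idea can be sketched as follows. `BackingStore` is a hypothetical in-memory stand-in for a real database, and both classes are synchronous only to keep the example short; a real backing service would be asynchronous. The sketch also ignores cache invalidation between processes.

```javascript
// Hypothetical stand-in for a reliable backing service (e.g. a database).
class BackingStore {
  constructor() {
    this.rows = new Map();
  }
  get(key) {
    return this.rows.get(key);
  }
  set(key, value) {
    this.rows.set(key, value);
  }
}

// A server process: anything important is written to the store first,
// and the in-memory map is only a cache that can be thrown away.
class ServerProcess {
  constructor(store) {
    this.store = store;
    this.cache = new Map();
  }

  setStatus(userId, status) {
    this.store.set(userId, status); // durable write comes first
    this.cache.set(userId, status);
  }

  getStatus(userId) {
    if (!this.cache.has(userId)) {
      // Cache miss: fall back to the backing service.
      this.cache.set(userId, this.store.get(userId));
    }
    return this.cache.get(userId);
  }
}
```

If one `ServerProcess` is discarded and a new one is created against the same store, the new instance answers `getStatus` correctly without any handover from the old one.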

Stateless processes

  • Socket.io requires sticky sessions (handshakes), but don't build your app in a way that uses in-memory sessions.
  • Things that matter are persisted elsewhere and things that don't can be automatically recovered.
  • Avoid server-side sessions; write your APIs and authentication to work with stateless servers.
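
One common way to authenticate without server-side sessions is a signed token that any server can verify on its own. The sketch below is an assumption about how that might look, not the approach used in the talk: `issueToken`, `verifyToken` and `SECRET` are hypothetical names, and only Node's built-in `crypto` module is used.

```javascript
const nodeCrypto = require("crypto");

// Hypothetical shared secret; every server instance would be configured with it.
const SECRET = "replace-with-a-real-secret";

function sign(payload) {
  return nodeCrypto.createHmac("sha256", SECRET).update(payload).digest("hex");
}

// Issue a token carrying the user ID and an expiry time (ms since epoch).
function issueToken(userId, ttlMs, now = Date.now()) {
  const json = JSON.stringify({ userId, exp: now + ttlMs });
  const payload = Buffer.from(json).toString("base64");
  return payload + "." + sign(payload);
}

// Any server can verify a token without looking up a session.
// Returns the user ID, or null if the token is invalid or expired.
function verifyToken(token, now = Date.now()) {
  const [payload, signature] = String(token).split(".");
  if (!payload || !signature) return null;
  const expected = Buffer.from(sign(payload));
  const actual = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== actual.length) return null;
  if (!nodeCrypto.timingSafeEqual(expected, actual)) return null;
  const { userId, exp } = JSON.parse(Buffer.from(payload, "base64").toString("utf8"));
  return now < exp ? userId : null;
}
```

Because verification needs only the secret, a client can reconnect to a different server after a crash and be authenticated again with the same token.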

Disposability

  • Detect failures, fail gracefully
  • Robustness / recoverability
  • Use the same logic in the normal case and in the exceptional case as much as possible

Disposability - Fail gracefully

  • Client crash: reload.
  • Server crash: Re-establish existing state; this is app specific and more than just reconnecting.
  • Write things of significance to the backend service (survive restarts). Just connections, no sessions; state is either in the client or in the database. Adding more capacity becomes a matter of just starting a new server.
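
A minimal sketch of re-establishing state after a server crash, assuming the client remembers its own subscriptions. `RecoveringClient` and the message shapes (`op: "auth"`, `op: "subscribe"`) are hypothetical; `transport` stands for any object with a `send(message)` method. The first connection and every reconnection go through the same `onConnected` path.

```javascript
// Hypothetical client wrapper; a real transport would wrap a socket.
class RecoveringClient {
  constructor(userToken) {
    this.userToken = userToken;
    this.subscriptions = new Set(); // state lives on the client, not the server
    this.transport = null;
  }

  subscribe(resource) {
    this.subscriptions.add(resource);
    // If offline, the subscription is replayed on the next connect.
    if (this.transport) this.transport.send({ op: "subscribe", resource });
  }

  // Used for the first connection and for every reconnection alike.
  onConnected(transport) {
    this.transport = transport;
    transport.send({ op: "auth", token: this.userToken });
    for (const resource of this.subscriptions) {
      transport.send({ op: "subscribe", resource });
    }
  }

  onDisconnected() {
    this.transport = null;
  }
}
```

Since the server learns everything it needs from the replayed messages, a freshly started server needs no knowledge of what the crashed one was doing.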

Monitoring over debugging

  • Fault identification and failure rates
  • Simple console/file-based logging doesn't give you enough visibility into what's going on
  • KISS; if you aren't going to react in real time, hourly reporting is fine.
  • Make collecting data easy - both on the server and on the client

Monitoring over debugging

  • Like Olark: log('cookie problems #nocookies_for_session #warn') - makes counting things more seamless.
  • Server and client-side logging using Minilog (mixu.net/minilog)
  • Monitoring graphs and queries: Hive/Hadoop + simple graphing
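
To show why hashtag-style log messages make counting easy, here is a small sketch that extracts tags from log lines and tallies them. `extractTags` and `countTags` are hypothetical helpers written for this example; they are not part of Minilog or any other logging library.

```javascript
// Pull "#tag" tokens out of a log message, without the leading "#".
function extractTags(message) {
  return (message.match(/#[\w-]+/g) || []).map((tag) => tag.slice(1));
}

// Count how often each tag occurs across a batch of log lines.
// Returns a Map from tag name to count.
function countTags(messages) {
  const counts = new Map();
  for (const message of messages) {
    for (const tag of extractTags(message)) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }
  return counts;
}
```

Running `countTags` over an hour of collected log lines gives per-tag failure counts that can be graphed directly.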

Client-side recovery

  • The user should know what their connection status is, because they are in the best position to fix it (e.g. a wifi drop)
  • Surface events: "connected" / "reconnected" / "ready", "disconnecting", "disconnected", "reconnecting", "reconnected", "unavailable"
  • What happens when I am disconnected? How do I know what my connection status is?
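
Surfacing connection status could look roughly like the sketch below. The state names are taken from the list above; the `ConnectionStatus` class, its `onChange` and `set` methods, and the choice of "disconnected" as the initial state are assumptions made for illustration.

```javascript
// State names from the slide; the class around them is hypothetical.
const STATES = [
  "connected", "ready", "disconnecting", "disconnected",
  "reconnecting", "reconnected", "unavailable",
];

class ConnectionStatus {
  constructor() {
    this.current = "disconnected"; // assumed initial state
    this.listeners = [];
  }

  // Register a callback invoked as listener(next, previous).
  onChange(listener) {
    this.listeners.push(listener);
  }

  // Record a new state and notify listeners, but only on an actual change.
  set(next) {
    if (!STATES.includes(next)) throw new Error("unknown state: " + next);
    if (next === this.current) return;
    const previous = this.current;
    this.current = next;
    for (const listener of this.listeners) listener(next, previous);
  }
}
```

The UI can subscribe with `onChange` to show a banner while the state is "reconnecting" or "unavailable", and can read `current` at any time to answer "what is my connection status?".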

Simple

  • Keep it simple; that makes it easier to expand to new use cases.
  • No special casing: generic resources supplemented with client-side code, not server-side code
  • Engine.io vs Socket.io; WS vs. polling? Your end-user doesn't care in most cases; polling works fine
  • Don't try to tackle too many problems at once