Real time at larger scale

Mikito Takada

@mikitotakada

Mikito Takada, Zendesk (mixu.net, twitter.com/mikitotakada)

Socket.io/Engine.io contributor
I write free books:
- New book - Single page apps in depth (singlepageappbook.com)
- Mixu's Node book (book.mixu.net)

The stack at Zendesk

Ejabberd (2010)
HAProxy/F5 + Socket.io + Redis (2011) - chat
HAProxy/F5 + Engine.io + Redis (2012) - chat, presence, status

Custom API and management code on top of Socket.io/Engine.io.

Easy to get started with a real-time app

npm install --save socket.io redis
receive messages and push them into the database
...
profit!

Now comes the hard part

Scaling
Tooling and monitoring
Accommodating new use cases

Techniques

It's all about imposing the right constraints - so that your architecture allows for scaling
This means presenting a consistent external API, which abstracts over potential complexities and failure points.
[browser] [server] [backend service]
And following certain core techniques.

Techniques

Independently scalable components
Stateless server
Disposable processes
Monitoring over debugging
Client-side recovery

Independently scalable processes

It should be possible to add capacity for each layer of the app independently of the others.
Pass off/queue work to the next layer; don't care about the details
Socket.io (client IDs), Redis (keys/values/pubsub) - write your stack on top to deal with things that are meaningful to your app (user IDs, resource subscriptions etc.). Client IDs are an implementation detail that your APIs should abstract over.
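
A minimal sketch of that abstraction, assuming nothing about Socket.io beyond "each connection has a client ID". The class and method names (SubscriptionRegistry, attach, detach, subscribe, clientsFor) are hypothetical, and the maps are in-memory only for brevity; this is connection state that gets rebuilt as clients reconnect.

```javascript
// Hypothetical registry: the rest of the app talks in user IDs and
// resources; only this layer knows about transport-level client IDs.
class SubscriptionRegistry {
  constructor() {
    this.clientsByUser = new Map();   // userId -> Set of clientIds
    this.usersByResource = new Map(); // resource -> Set of userIds
  }

  // Call when a connection has been authenticated as a user.
  attach(userId, clientId) {
    if (!this.clientsByUser.has(userId)) this.clientsByUser.set(userId, new Set());
    this.clientsByUser.get(userId).add(clientId);
  }

  // Call when a connection goes away.
  detach(userId, clientId) {
    const clients = this.clientsByUser.get(userId);
    if (!clients) return;
    clients.delete(clientId);
    if (clients.size === 0) this.clientsByUser.delete(userId);
  }

  // Record that a user wants updates for a resource.
  subscribe(userId, resource) {
    if (!this.usersByResource.has(resource)) this.usersByResource.set(resource, new Set());
    this.usersByResource.get(resource).add(userId);
  }

  // Resolve a resource to the client IDs that should receive a message.
  clientsFor(resource) {
    const result = [];
    for (const userId of this.usersByResource.get(resource) || []) {
      for (const clientId of this.clientsByUser.get(userId) || []) {
        result.push(clientId);
      }
    }
    return result;
  }
}
```

Callers publish to a resource and never see a client ID; a user with two open tabs simply resolves to two client IDs.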

Do as little as possible

Processes should be simple; they should be managed by some other mechanism (e.g. a process manager, a load balancer etc.)
If something fails, just start a new instance
Separate tasks

Stateless processes

Data is important (multiple independent copies) - not computation
Statelessness: any data that is important must be stored in a reliable, stateful backing service (e.g. DB)
Shared nothing architecture: each node is independent and self-sufficient
The process is just a cache for the backend service, and nothing important is on it.
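
One way to picture "the process is just a cache", as a sketch under assumptions: BackingStore stands in for the reliable backing service and StatelessNode for a server process. Both names are hypothetical, and cache invalidation across nodes is deliberately left out.

```javascript
// Stand-in for a reliable backing service (e.g. a database); hypothetical.
class BackingStore {
  constructor() { this.rows = new Map(); }
  get(key) { return this.rows.get(key); }
  set(key, value) { this.rows.set(key, value); }
}

// A server process that only caches. Losing it loses nothing important.
class StatelessNode {
  constructor(store) {
    this.store = store;
    this.cache = new Map();
  }

  // Write to the backing service first, then remember the value locally.
  write(key, value) {
    this.store.set(key, value);
    this.cache.set(key, value);
  }

  // Serve from the cache; on a miss, fall back to the backing service.
  read(key) {
    if (!this.cache.has(key)) {
      const value = this.store.get(key);
      if (value === undefined) return undefined;
      this.cache.set(key, value);
    }
    return this.cache.get(key);
  }
}
```

A freshly started node pointed at the same store can answer reads for data written through any other node, which is what makes the process disposable.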

Stateless processes

Socket.io requires sticky sessions (handshakes), but don't build your app in a way that uses in-memory sessions.
Things that matter are persisted elsewhere and things that don't can be automatically recovered.
Avoid server-side sessions; write your APIs and authentication to work with stateless servers.
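
A rough sketch of authentication that works with stateless servers: a signed token that any process can verify without a session store. This is one common approach, not necessarily what the stack above uses. SECRET, issueToken and verifyToken are hypothetical names, and the token format is made up for illustration.

```javascript
const crypto = require('crypto');

// Assumption: a secret shared by every server process (e.g. via config).
const SECRET = 'replace-with-a-real-secret';

function sign(payload) {
  return crypto.createHmac('sha256', SECRET).update(payload).digest('hex');
}

// Issue a token carrying the user ID and an expiry; nothing is stored server-side.
function issueToken(userId, ttlMs, now = Date.now()) {
  const json = JSON.stringify({ userId, exp: now + ttlMs });
  const payload = Buffer.from(json).toString('base64');
  return payload + '.' + sign(payload);
}

// Any process holding SECRET can verify; returns the user ID, or null if
// the token is malformed, tampered with, or expired.
function verifyToken(token, now = Date.now()) {
  const [payload, signature] = String(token).split('.');
  if (!payload || !signature) return null;
  const expected = Buffer.from(sign(payload));
  const actual = Buffer.from(signature);
  if (expected.length !== actual.length) return null;
  if (!crypto.timingSafeEqual(expected, actual)) return null;
  const { userId, exp } = JSON.parse(Buffer.from(payload, 'base64').toString());
  return exp > now ? userId : null;
}
```

Because verification needs only the shared secret, a client can reconnect to any server behind the load balancer and be recognized.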

Disposability

Detect failures, fail gracefully
Robustness / recoverability
Use the same logic in the normal case and in the exceptional case as much as possible

Disposability - Fail gracefully

Client crash: reload.
Server crash: re-establish existing state; this is app-specific and more than just reconnecting.
Write things of significance to the backend service (survive restarts). Just connections, no sessions; state is either in the client or in the database. Adding more capacity becomes a matter of just starting a new server.
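
A sketch of re-establishing state after a server crash, using the same code path for the first connect and every reconnect. RecoveringClient and FakeTransport are hypothetical; the fake transport only stands in for a real connection so the example runs without a network.

```javascript
const { EventEmitter } = require('events');

// Hypothetical client wrapper. The desired state (subscriptions) lives on
// the client, so a restarted or brand-new server can be brought up to date.
class RecoveringClient {
  constructor(transport) {
    this.transport = transport;
    this.subscriptions = new Set();
    // Same handler for the initial connect and for every reconnect.
    transport.on('open', () => this.sync());
  }

  subscribe(resource) {
    this.subscriptions.add(resource);
    if (this.transport.connected) {
      this.transport.send({ op: 'subscribe', resource });
    }
  }

  // Replay everything the server needs to know about this client.
  sync() {
    for (const resource of this.subscriptions) {
      this.transport.send({ op: 'subscribe', resource });
    }
  }
}

// Fake transport for illustration: `sent` models what the server knows.
class FakeTransport extends EventEmitter {
  constructor() { super(); this.connected = false; this.sent = []; }
  open() { this.connected = true; this.emit('open'); }
  drop() { this.connected = false; this.sent = []; } // server state is lost
  send(message) { this.sent.push(message); }
}
```

Since reconnecting runs exactly the code that connecting runs, the exceptional case is exercised every time a client starts.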

Monitoring over debugging

Fault identification and failure rates
Simple console/file based logging doesn't give you enough visibility into what's going on
KISS; if you aren't going to react in real time, hourly is fine.
Make collecting data easy - both on the server and on the client

Monitoring over debugging

Like Olark: log('cookie problems #nocookies_for_session #warn') - makes counting things more seamless.
Server and client-side logging using Minilog (mixu.net/minilog)
Monitoring graphs and queries: Hive/Hadoop + simple graphing
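
To show why hashtags in log lines make counting easy, here is a small sketch that pulls the tags out of each line and tallies them. extractTags and countTags are hypothetical helpers; this is neither Minilog's API nor Olark's implementation.

```javascript
// Pull '#tag' tokens out of a log line, without the leading '#'.
function extractTags(line) {
  return (line.match(/#[\w-]+/g) || []).map((tag) => tag.slice(1));
}

// Tally how often each tag appears across a batch of log lines.
function countTags(lines) {
  const counts = new Map();
  for (const line of lines) {
    for (const tag of extractTags(line)) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }
  return counts;
}
```

The same idea scales up to a batch query over collected logs: group by tag, count per hour, graph the result.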

Client-side recovery

The user should know what their connection status is, because they are in the best position to fix it (e.g. a Wi-Fi drop)
Surface events: "connected" / "reconnected" / "ready", "disconnecting", "disconnected", "reconnecting", "reconnected", "unavailable"
What happens when I am disconnected? How do I know what my connection status is?
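
A sketch of surfacing connection status to the UI as events. The status names come from the list above; the transition table and the ConnectionStatus class are assumptions made for illustration, not the talk's actual state machine.

```javascript
const { EventEmitter } = require('events');

// Assumption: an illustrative table of which status may follow which.
const TRANSITIONS = {
  disconnected: ['connected', 'reconnecting', 'unavailable'],
  connected: ['ready', 'disconnecting', 'disconnected'],
  ready: ['disconnecting', 'disconnected'],
  disconnecting: ['disconnected'],
  reconnecting: ['reconnected', 'unavailable', 'disconnected'],
  reconnected: ['ready', 'disconnecting', 'disconnected'],
  unavailable: ['reconnecting', 'disconnected'],
};

class ConnectionStatus extends EventEmitter {
  constructor() {
    super();
    this.state = 'disconnected';
  }

  // Move to `next` if allowed; emit the status name plus a generic 'change'.
  transition(next) {
    if (!TRANSITIONS[this.state].includes(next)) return false;
    const previous = this.state;
    this.state = next;
    this.emit(next, previous);
    this.emit('change', next, previous);
    return true;
  }
}
```

The UI listens for 'change' and shows a banner, so the user always has an answer to "what is my connection status?" without the app guessing on their behalf.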

Simple

Keep it simple; that makes it easier to expand to new use cases.
No special casing: generic resources supplemented with client-side code, not server-side code
Engine.io vs. Socket.io; WS vs. polling? Your end user doesn't care in most cases; polling works fine
Don't try to tackle too many problems at once