HAProxy/F5 + Engine.io + Redis (2012) - chat, presence, status
Custom API and management code on top of Socket.io/Engine.io.
Easy to get started with a real-time app
npm install --save socket.io redis
receive messages and push them into the database
...
profit!
Now comes the hard part
Scaling
Tooling and monitoring
Accommodating new use cases
Techniques
It's all about imposing the right constraints, so that your architecture allows for scaling.
This means presenting a consistent external API, which abstracts over potential complexities and failure points.
[browser] [server] [backend service]
And following certain core techniques.
Techniques
Independently scalable components
Stateless server
Disposable processes
Monitoring over debugging
Client-side recovery
Independently scalable processes
It should be possible to add capacity for each layer of the app independently of the others.
Pass off/queue work to the next layer; don't care about the details
Socket.io deals in client IDs and Redis in keys, values and pub/sub. Write your own layer on top that deals with things that are meaningful to your app (user IDs, resource subscriptions, etc.). Client IDs are an implementation detail that your APIs should abstract over.
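As a rough illustration of abstracting over client IDs, the sketch below keeps a per-process map from user IDs to the client IDs of that user's open connections. The `UserRegistry` class, its method names and the `send(clientId, message)` callback are all hypothetical; they stand in for whatever the transport actually provides. The map only describes live connections on this process, so losing it in a crash costs nothing that a reconnect would not rebuild.

```javascript
// Hypothetical registry: maps app-level user IDs to transport-level client IDs.
class UserRegistry {
  constructor() {
    this.clientsByUser = new Map(); // userId -> Set of clientIds
  }

  // Call when a connection has been identified as belonging to a user.
  attach(userId, clientId) {
    if (!this.clientsByUser.has(userId)) {
      this.clientsByUser.set(userId, new Set());
    }
    this.clientsByUser.get(userId).add(clientId);
  }

  // Call when a connection closes.
  detach(userId, clientId) {
    const clients = this.clientsByUser.get(userId);
    if (!clients) return;
    clients.delete(clientId);
    if (clients.size === 0) this.clientsByUser.delete(userId);
  }

  // App code addresses users; `send` is the transport's own delivery function.
  sendToUser(userId, message, send) {
    const clients = this.clientsByUser.get(userId) || new Set();
    for (const clientId of clients) send(clientId, message);
    return clients.size; // number of connections reached
  }
}
```

Application code then calls `sendToUser('alice', ...)` and never sees a client ID, so the transport underneath can change without touching the API.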
Do as little as possible
Processes should be simple; they should be managed by some other mechanism (e.g. a process manager, a load balancer etc.)
If something fails, just start a new instance
Separate tasks
Stateless processes
Data is what matters (keep multiple independent copies), not computation
Statelessness: any data that is important must be stored in a reliable, stateful backing service (e.g. DB)
Shared-nothing architecture: each node is independent and self-sufficient
The process is just a cache for the backend service, and nothing important is on it.
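One way to picture "the process is just a cache" is a write-through cache: every write goes to the backing service first and is only then kept in memory. The sketch below is a minimal illustration under assumptions. `BackingStore` is a hypothetical in-memory stand-in for a database, `ServerProcess` is an invented name, and everything is synchronous for brevity where a real store would be asynchronous.

```javascript
// Hypothetical stand-in for a reliable backing service (e.g. a database).
class BackingStore {
  constructor() {
    this.rows = new Map();
  }
  get(key) {
    return this.rows.get(key);
  }
  set(key, value) {
    this.rows.set(key, value);
  }
}

// A server process holds only a cache, never the sole copy of anything.
class ServerProcess {
  constructor(store) {
    this.store = store;
    this.cache = new Map();
  }

  write(key, value) {
    this.store.set(key, value); // persist first
    this.cache.set(key, value); // then cache
  }

  read(key) {
    if (this.cache.has(key)) return this.cache.get(key);
    const value = this.store.get(key); // cache miss: fall back to the store
    if (value !== undefined) this.cache.set(key, value);
    return value;
  }
}
```

If a `ServerProcess` dies, a fresh instance pointed at the same store answers the same reads; only the warm cache is lost.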
Stateless processes
Socket.io requires sticky sessions (handshakes), but don't build your app in a way that uses in-memory sessions.
Things that matter are persisted elsewhere and things that don't can be automatically recovered.
Avoid server-side sessions, write your APIs and authentication to work with stateless servers.
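A common way to authenticate without server-side sessions is a signed token that any server can verify on its own. The sketch below is one possible shape, not the approach used in the talk: the token format, the `issueToken`/`verifyToken` names and the `SECRET` value are assumptions, and user IDs are assumed not to contain a dot. It uses only Node's built-in `crypto` module.

```javascript
const nodeCrypto = require('node:crypto');

// Hypothetical shared secret; in practice it would come from configuration.
const SECRET = 'replace-me';

function sign(payload) {
  return nodeCrypto.createHmac('sha256', SECRET).update(payload).digest('hex');
}

// Issue a token of the form "<userId>.<expiresAtMs>.<signature>".
function issueToken(userId, ttlMs, now = Date.now()) {
  const payload = `${userId}.${now + ttlMs}`;
  return `${payload}.${sign(payload)}`;
}

// Verify on any server: no session lookup, only the shared secret is needed.
// Returns the user ID, or null if the token is malformed, forged or expired.
function verifyToken(token, now = Date.now()) {
  const parts = token.split('.');
  if (parts.length !== 3) return null;
  const [userId, expiresAt, signature] = parts;
  const expected = Buffer.from(sign(`${userId}.${expiresAt}`));
  const actual = Buffer.from(signature);
  // timingSafeEqual throws on unequal lengths, so check the length first.
  if (actual.length !== expected.length) return null;
  if (!nodeCrypto.timingSafeEqual(actual, expected)) return null;
  if (Number(expiresAt) <= now) return null;
  return userId;
}
```

Because verification needs nothing but the secret, a client can reconnect to any server after a crash and present the same token.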
Disposability
Detect failures, fail gracefully
Robustness / recoverability
Use the same logic in the normal case and in the exceptional case as much as possible
Disposability - Fail gracefully
Client crash: reload.
Server crash: Re-establish existing state; this is app specific and more than just reconnecting.
Write things of significance to the backend service (survive restarts). Just connections, no sessions; state is either in the client or in the database. Adding more capacity becomes a matter of just starting a new server.
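Client-side recovery can be sketched as a client that owns its subscription list and replays it on every connect, so the first connection and a reconnect after a server crash run exactly the same code. `RecoveringClient`, `FakeTransport` and the message shape are hypothetical; a real transport would wrap a socket rather than an in-memory array.

```javascript
const { EventEmitter } = require('node:events');

// Client that keeps its own state (subscriptions) and replays it on connect.
class RecoveringClient {
  constructor(transport) {
    this.transport = transport; // assumed to emit 'connect' and expose send()
    this.subscriptions = new Set();
    // The same handler runs for the first connect and for every reconnect.
    transport.on('connect', () => this.resync());
  }

  subscribe(channel) {
    this.subscriptions.add(channel);
    if (this.transport.connected) {
      this.transport.send({ op: 'subscribe', channel });
    }
  }

  resync() {
    for (const channel of this.subscriptions) {
      this.transport.send({ op: 'subscribe', channel });
    }
  }
}

// Fake transport for illustration only; it records what would be sent.
class FakeTransport extends EventEmitter {
  constructor() {
    super();
    this.connected = false;
    this.sent = [];
  }
  connect() {
    this.connected = true;
    this.emit('connect');
  }
  drop() {
    this.connected = false;
  }
  send(message) {
    this.sent.push(message);
  }
}
```

The server never has to remember who was subscribed to what: after a restart, each reconnecting client tells it again.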
Monitoring over debugging
Fault identification and failure rates
Simple console/file-based logging doesn't give you enough visibility into what's going on
KISS; if you aren't going to react in real time, hourly reporting is fine.
Make collecting data easy - both on the server and on the client
Monitoring over debugging
Like Olark: log('cookie problems #nocookies_for_session #warn') - hashtags in the message make counting things nearly effortless.
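A minimal sketch of how hashtag-style logging could feed counters, assuming nothing about Olark's actual implementation: `createLogger` and its `counts` map are hypothetical names, and the message is passed through to an ordinary writer (`console.log` by default).

```javascript
// Hypothetical logger: every "#tag" in a message bumps a counter for that tag.
function createLogger(write = console.log) {
  const counts = new Map(); // tag -> number of messages seen with that tag

  function log(message) {
    for (const tag of message.match(/#\w+/g) || []) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
    write(message); // still emit the plain log line
  }

  return { log, counts };
}
```

Calling `log('cookie problems #nocookies_for_session #warn')` increments both `#nocookies_for_session` and `#warn`; the counters can then be reported on whatever schedule suits, hourly included.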