Here is what you need to know about Factor.io. It has a distributed architecture with 8 different components. It provides a runtime that executes user-defined workflows. It takes inputs from users, in the form of variables and credentials, to run those workflows. And it depends on numerous 3rd-party services (e.g. GitHub, Cloud Foundry, etc.) as part of the integrations available for user-defined workflows.
In other words, there is a lot of shit that can break. And break it did. So recently we’ve spent quite a bit of time making Factor.io more reliable. Here are 10 of the things we’ve done.
- Instrumented all of our services with Exceptional: We’ve been catching quite a few bugs with this bad boy. Airbrake is another popular service that gets the job done. (See the first sketch after this list.)
- Instrumented incoming API calls with Logentries: Factor.io is distributed, so diagnosing failures is a little challenging. We’ve instrumented all of our services to log incoming messages, so we can track an action step-by-step as it executes across the distributed system. (Sketch below.)
- Set up Pingdom for the front-end: this is just low-hanging fruit. We have great uptime, but sometimes our hosting provider does act up, particularly during deployments. Thus far we’ve always known about issues before getting a notification.
- Background processor runs under the god gem: We handle exceptions as best we can, but when all else fails, we can count on god to restart the back-end service. (Sketch below.)
- Use RabbitMQ durable queues to coordinate work: RabbitMQ is configured to be durable, i.e. it writes queues and messages to disk. In our experience RabbitMQ has been incredibly reliable; we’ve been running the same instance for nearly a year without restarts or any other intervention. But if it does fail, the durable queues mean it will pick up where it left off after a restart. (Sketch below.)
- Gracefully handle restarts: Each component saves its expected state in a DB. If a process (worker, service, etc.) fails or has to restart, it just picks up the expected state from the DB and sets everything back up where it left off. For example, if you start a Hipchat listener and the service restarts, it will rejoin the Hipchat room after the restart. (Sketch below.)
- Handle error conditions that should never occur: In code it is easy to take things for granted, but the things we assume will always work sometimes break too. We make no assumptions. (Sketch below.)
- Provide users with a log of their workflow execution: Factor.io executes instructions (workflows) created by the user. Sometimes they fail because of a dependency, user error, or a change in conditions. We provide a powerful log that lets users drill into the activities to understand what is going on under the hood and diagnose those failures. (Sketch below, together with the status updates.)
- Provide users with an up-to-date status of their workflow: While a workflow is executing we show an up-to-date status in the dashboard so you know what is going on. If something fails, you’ll know right away.
- Architected for eventual consistency: For each component we capture an expected state and a current state. If the two don’t match, we try to reconcile them. If they can’t be reconciled, we update the current state across the entire distributed system. (Sketch below.)
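A few of those deserve a sketch. For the exception instrumentation, here is a minimal example of what wrapping a unit of work looks like, assuming the current airbrake-ruby gem (the Exceptional setup is analogous); `process_message` is a hypothetical worker method, not our actual code:

```ruby
require 'airbrake'

Airbrake.configure do |config|
  config.project_id  = ENV['AIRBRAKE_PROJECT_ID'].to_i
  config.project_key = ENV['AIRBRAKE_PROJECT_KEY']
end

# Hypothetical worker method standing in for a real workflow step.
def process_message(message)
  # ... run the workflow step ...
rescue StandardError => e
  # Report the exception with enough context to reproduce it,
  # then re-raise so the caller can decide how to recover.
  Airbrake.notify(e, message_id: message[:id])
  raise
end
```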
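For the step-by-step tracing, a sketch assuming the `le` gem for Logentries; tagging every line with an action id is our own convention here, not something the gem provides:

```ruby
require 'le'

# Le.new returns a standard Ruby Logger that ships lines to Logentries.
log = Le.new(ENV['LOGENTRIES_TOKEN'])

def handle(message, log)
  # Tag every line with the action id so one action can be traced
  # step-by-step across all of the services it touches.
  log.info("action=#{message[:action_id]} service=workflow-engine event=received")
  # ... do the work ...
  log.info("action=#{message[:action_id]} service=workflow-engine event=completed")
end
```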
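The god setup is roughly this kind of config (names and paths are placeholders); god polls the process and restarts it whenever it dies:

```ruby
# worker.god -- load with `god -c worker.god`
God.watch do |w|
  w.name  = 'factor-worker'              # placeholder name
  w.start = 'bundle exec ruby worker.rb' # placeholder command
  w.log   = '/var/log/factor-worker.log'
  w.keepalive                            # restart the process if it dies
end
```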
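The durable-queue setup looks roughly like this with the Bunny client (queue name and payload are placeholders). The queue is declared durable and messages are published as persistent, so both survive a broker restart; manual acks mean a message is redelivered if a worker dies mid-job:

```ruby
require 'bunny'

conn = Bunny.new(ENV['RABBITMQ_URL'])
conn.start
channel = conn.create_channel

# durable: true writes the queue definition to disk
queue = channel.queue('factor.work', durable: true)

# persistent: true writes the message itself to disk
channel.default_exchange.publish('{"action_id":42}',
                                 routing_key: queue.name,
                                 persistent:  true)

# manual_ack: true means an unacked message is redelivered
# if the worker dies before finishing the job
queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, body|
  # ... process the work ...
  channel.ack(delivery_info.delivery_tag)
end
```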
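The restart recovery is, in spirit, this pattern; `Listener` and `rejoin` are hypothetical names standing in for the real models:

```ruby
# On boot, rebuild runtime state from the expected state stored in the DB.
# Listener and rejoin are hypothetical stand-ins for the real models.
def recover_state
  Listener.where(expected_state: 'connected').find_each do |listener|
    # e.g. a Hipchat listener rejoins its room after a restart
    listener.rejoin
    listener.update(current_state: 'connected')
  end
end

recover_state
```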
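"Make no assumptions" is mostly just defensive coding. A contrived example with hypothetical message types:

```ruby
case message['type']
when 'start'  then start_workflow(message)
when 'cancel' then cancel_workflow(message)
else
  # "can't happen" -- but we make no assumptions, so it is handled anyway
  raise ArgumentError, "unexpected message type: #{message['type'].inspect}"
end
```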
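The execution log and the dashboard status come from the same idea: every step records an event as it runs, and the workflow keeps a current status. A sketch with hypothetical models:

```ruby
# Hypothetical sketch: each workflow step records what it did, and the
# workflow row keeps an up-to-date status for the dashboard.
def run_step(workflow, step)
  workflow.update(status: "running: #{step.name}")
  step.execute
  workflow.events.create(step: step.name, outcome: 'success')
rescue StandardError => e
  # The failure shows up both in the drill-down log and on the dashboard.
  workflow.events.create(step: step.name, outcome: 'failed', detail: e.message)
  workflow.update(status: "failed: #{step.name}")
  raise
end
```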
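And the eventual-consistency loop has roughly this shape, again with hypothetical names (`converge!` and `broadcast_state` stand in for the real reconciliation and fan-out logic):

```ruby
# Hypothetical reconciliation loop: converge current state toward
# expected state, and if convergence fails, broadcast reality instead.
def reconcile(component)
  return if component.current_state == component.expected_state

  begin
    component.converge!  # try to make current match expected
    component.update(current_state: component.expected_state)
  rescue StandardError
    # Couldn't converge: propagate the real current state to the rest of
    # the distributed system so every component agrees on it.
    broadcast_state(component.name, component.current_state)
  end
end
```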
What we’d like to do next…
- Run periodic functional tests in production.
- Add more and more-specific error handling; still a few things we haven’t covered.
- Provide better input validation so that we can prevent error conditions before they occur in the workflow.
- Run Chaos Monkey-style tests and handle the failures they surface.