Factor.io Blog

Deploy from a git tag with Capistrano 3

Do you use Capistrano and Github? Git tags are a great way to mark your code for deployment.

I recently read “Deploy from a Git tag with Capistrano” by Nathan Hoad. It was great, but it was written for Capistrano pre-v3, so I’d like to update that post for Capistrano 3.

Tag a new release using git tag
git tag -a 02.16.2014.01
You can list your tags like this
git tag
02.12.2014.01
02.16.2014.01
Now make sure you push the tags to the remote repository.
git push origin --tags
Now add the following to your Capistrano deploy script (note, this is new for v3)
ask :branch, proc{`git tag`.split("\n").last}
This replaces the set :branch, 'master' code you had before, and the syntax is MUCH shorter than it was pre-v3.

On deployment, this will ask “Please enter branch: |02.16.2014.01|”.
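For context, here is roughly how that line sits in a Capistrano 3 config/deploy.rb; the application name and repo URL below are placeholders.

# config/deploy.rb (Capistrano 3); application name and repo URL are placeholders
set :application, 'myapp'
set :repo_url, 'git@github.com:example/myapp.git'
# Prompt for the ref to deploy, defaulting to the last tag listed by `git tag`
ask :branch, proc { `git tag`.split("\n").last }

One caveat: `git tag` lists tags alphabetically rather than by date, so with a different tag naming scheme you may prefer `git tag --sort=creatordate` or `git describe --tags --abbrev=0` to pick out the latest one.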

Bonus. You can use Factor.io to automatically kick off deployments when your code is tagged and pushed to Github, Bitbucket or Gitlab.

Better than cron

Cron was first introduced in Version 7 Unix in 1979. At 35 years old, it is no surprise that it is the de facto standard for time-based scheduling. But it has drawbacks when it comes to its usage in highly distributed and highly agile environments.

Why not cron?

  • Logs are either dumped to you in email, hidden away on disk, or lost completely.
  • No native way of running remote jobs.
  • Making any changes requires SSHing into a box, or going through a sequence of steps to push the update out via configuration management (e.g. Chef/Puppet).
  • Difficult to manage across numerous systems.

Why Factor.io is better

  • Central place to define the “crontab”
  • No need to SSH into boxes to get started
  • No need to change code, check it in, and deploy just to update a time period or command
  • Run on multiple servers at once
  • Get all the logs in a single place without logging into the servers
  • Super easy to set up
  • Self-documenting

Factor.io isn’t perfect… yet!

The current model for defining workflows through the drag-and-drop interface doesn’t allow for conditions, which means you can’t trigger different actions based on different outcomes (e.g. pass/fail). The good news is that a new model for defining workflows is just around the corner: you’ll be able to define workflows programmatically, with conditions and all the other great stuff a Ruby DSL provides.

Deploying using Capistrano, roles and tags

We use Chef for provisioning servers, Capistrano for deployments, and most of our code is written in Ruby.

We wanted to fully automate `cap production deploy` so that the servers would be self-identifying. Here is how we did it.

First, we used server tags in AWS and Chef to define the role of a given server during the provisioning process. This is how you can tag a server using knife ec2 server create. More docs from Chef here.

knife ec2 server create --tags role=factord ...

Second, we set up our production.rb file in Capistrano like this…
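A simplified sketch of it looks like this (the credential handling and SSH user here are illustrative, not our exact file):

# config/deploy/production.rb; simplified sketch, credentials and SSH user are illustrative
require 'fog'

compute = Fog::Compute.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)

# Register every running EC2 instance under the Capistrano role named in its "role" tag
compute.servers.each do |server|
  next unless server.state == 'running' && server.tags['role']
  role server.tags['role'].to_sym, server.public_ip_address, user: 'deploy'
end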

This uses the Fog gem to get the list of servers from AWS. We then identify the role of each server using the tags we defined in the previous step, and use the “role” directive in Capistrano to map each server to its corresponding role. And that’s it. Now when we run `cap production deploy`, Capistrano fetches the current list of running servers and assigns each one the appropriate role.

10 things we did to make Factor.io more reliable

Here is what you have to know about Factor.io. It has a distributed architecture with 8 different components. It provides a runtime to execute workflows that are defined by the user. It takes inputs from users, in the form of variables and credentials, to run those workflows. And it depends on numerous 3rd-party services (e.g. Github, Cloud Foundry, etc.) as part of the integrations available for user-defined workflows.

In other words, there is a lot of shit that can break. And break it did. So recently we’ve spent quite a bit of time investing in making Factor.io more reliable. Here are 10 of those improvements.

  1. Instrumented all of our services with Exceptional: We’ve been catching quite a few bugs with this bad boy. Airbrake is another popular service to get the job done.
  2. Instrumented incoming API calls with Logentries: Factor.io is distributed, so diagnosing failures is a little challenging. We’ve instrumented all of our services to log incoming messages, so we can track an action step-by-step as it executes across the distributed system.
  3. Set up Pingdom for the front-end: this is just low-hanging fruit. We have great uptime, but sometimes our hosting provider does act up, particularly during deployments. Thus far we’ve always known about issues before getting a notification.
  4. Background processor runs using the god gem: We try to handle exceptions the best that we can. When all else fails, we can count on god to restart the back-end service.
  5. Use RabbitMQ reliable queues to coordinate work: RabbitMQ is configured to be reliable, i.e. it writes stuff to disk. In our experience so far, RabbitMQ has been incredibly reliable; we’ve been running the same instance for nearly a year without restarting or touching it. BUT, if it does fail, we’ve configured it to store queues on disk, so when it restarts it will pick up where it left off (see the sketch after this list).
  6. Gracefully handling restarts: Each component saves its expected state in a DB. If the process (worker, service, etc.) fails or has to restart, it will just pick up the expected state from the DB and get everything set up where it left off. For example, if you start a Hipchat listener and the service restarts, it will rejoin the Hipchat room after the restart.
  7. Handle error conditions that should never occur: In code it is easy to take things for granted. But those things that we assume to work sometimes break too. We make no assumptions.
  8. Provide users with a log of their workflow execution: Factor.io executes instructions (workflows) as created by the user. Sometimes they might fail because of some dependency, user error, or a change in conditions. We provide a powerful log that lets users drill into the activities to understand what is going on under the hood so they can diagnose those failures.
  9. Provide users with an up-to-date status of their workflow: When a workflow is executing we provide an up-to-date status in the dashboard so you know what is going on. If something fails, you will know proactively.
  10. Architected for eventual consistency: For each of the components we capture an expected state and a current state. If the two don’t match, we try to reconcile them; if reconciliation fails, we update the current state across the entire distributed system.
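To make item 5 concrete, here is a minimal sketch of a durable queue with the Bunny gem; the queue name, connection URL, and handler are illustrative rather than our actual setup.

# Durable-queue sketch using the Bunny gem; queue name, URL, and handler are illustrative
require 'bunny'

conn = Bunny.new(ENV['AMQP_URL'])
conn.start
channel = conn.create_channel

# durable: true means the queue definition survives a broker restart
queue = channel.queue('factor.workflow_runs', durable: true)

# persistent: true writes the message to disk instead of holding it only in memory
queue.publish('{"workflow_id": 42}', persistent: true)

# Manual acks: a message is only removed once the worker finishes processing it
queue.subscribe(manual_ack: true, block: false) do |delivery_info, _properties, payload|
  handle_workflow_run(payload)            # hypothetical handler
  channel.ack(delivery_info.delivery_tag)
end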

What we’d like to do next…

  1. Run periodic functional tests in production.
  2. Add more and more-specific error handling; still a few things we haven’t covered.
  3. Provide better input validation so that we can prevent error conditions before they occur in the workflow.
  4. Run Chaos Monkey-style tests and handle the resulting failures.

Rolling out public beta

After 8 months in private beta, we are finally rolling out the public beta. We will be sending out invites in bulk over the next couple of weeks, and expect to open signups once we’ve worked through the full queue.

We wanted to highlight some of the improvements we made based on feedback from our awesome private beta users.

New Features

  • Dashboard
  • Jumpstarts
  • Realtime workflow status
  • Detailed Start/Stop Logs
  • Step-by-Step Activity Logs
  • Detailed Info per Activity

What’s next?

Roadmap

As we launch the public beta, our sights are set on the general availability launch and the next few iterations thereafter.

Programmatically define workflows

We are going to be introducing a Domain Specific Language (DSL) which enables you to define workflows programmatically instead of using the user interface in the console. These programmatic workflows can be executed locally using our APIs, or they can be hosted by Factor.io.

The benefit is that the DSL is much more powerful: it is based on Ruby and lets you use everything Ruby provides. For example, you can create a workflow that performs a deploy if a test passes, or opens an issue in Github if a test fails.
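Purely as an illustration of the idea (the DSL is not finalized, so none of the method or channel names below are real), a conditional workflow might read something like this:

# Hypothetical syntax only; the real DSL may look quite different
listen 'github', 'push', branch: 'master' do |push|
  build = run 'jenkins', 'build', repo: push['repo']
  if build['passed']
    run 'capistrano', 'deploy', stage: 'production'
  else
    run 'github', 'create_issue', repo: push['repo'], title: "Build failed: #{push['sha']}"
  end
end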

On-prem support

Factor.io is great when it runs in the public cloud, but sometimes you need to run on-prem. From our customers we learned that this happens for one of two reasons: either you are in a regulated industry (e.g. banking, healthcare) with requirements to run everything in a controlled and audited environment, or you use tools that are only available on-prem, like Github Enterprise, a private Jenkins, etc.

New channels

The list of channels we support today is fairly limited. Going forward we will be adding channels based on customer demand. These channels basically fall into a few categories.

  • Config Management (e.g. Chef, Puppet, AnsibleWorks, Salt Stack)
  • Project Management
  • Issue/Bug Tracking
  • Continuous Integration (e.g. Jenkins, Circle CI, Codeship, Wercker)
  • Generic (e.g. Timer, web-hook, web-call)

Create your own channels

We’ve already built an easy-to-implement way of extending Factor.io’s functionality by adding new channels. We want to empower our users to do the same. To get there, we need to provide great documentation, open-source the channel service, and provide a finer-grained security model.

SCP Support now here!

Secure Copy Protocol (SCP) is a method of copying files to a remote server via SSH. This is now available as a part of the SSH Channel in Factor.io.

What does this mean?

As part of your workflow, you can upload files to a remote server: an EC2 instance on AWS, a server running on Rackspace, or any other server that supports SSH.
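If you want a feel for the operation itself, the plain-Ruby equivalent using the net-scp gem looks roughly like this (the host, user, key, and paths are placeholders):

# SCP upload with the net-scp gem; host, user, key, and paths are placeholders
require 'net/scp'

Net::SCP.upload!('ec2-203-0-113-10.compute-1.amazonaws.com', 'deploy',
                 'build/app.tar.gz', '/var/www/releases/app.tar.gz',
                 ssh: { keys: [File.expand_path('~/.ssh/id_rsa')] })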

What can I use this for?

  • Easily create and bootstrap a server on AWS or Rackspace with a command like “setup server foobar” in Hipchat.
  • Deploy a new app to a server when you merge code into the master branch in Github.
  • Create a new mini-QA environment for every new feature branch you create in Github.

Introducing activity variables

It’s the Fourth of July and I know I shouldn’t be working, but hey, I’m a founder.

I just finished up a new feature: activity variables. Here is how it works.

When creating a new workflow you sequentially stitch together sets of activities. Each activity in the workflow has input variables and output variables. Now you can see the output parameters available from each of the activities, and using Mustache notation you can insert those parameters into other activities later in the workflow.

Today all the variables are merely strings. In the future we will add types to those variables (e.g. string, file, URL), and the input fields can also be typed. This will make it easier to pass variables from one activity to another. With that in mind, we will also suggest defaults.
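For example (the activity and variable names here are made up), if an earlier Github activity exposes an output called tag, a later SSH activity’s command field could read git checkout {{github.tag}} to pull that value in.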

Are dev tools becoming too much of a good thing?

There is a dev tool for that.

Wherever you look, new tools are sprouting up, built by developers for developers. There is even a service, HelloBar, just for adding a notification bar to websites.

We have configuration management tools like Chef and Puppet, but now Salt Stack is getting some traction too. We have Github as the leading code repository/management service, but there is also an Enterprise SKU, and numerous other services like Atlassian’s BitBucket and Stash. The list goes on: chat services, project management services, continuous integration (test automation), PaaS, IaaS, etc.

This is a double-edged sword for developers. Having options is great so we can always pick the best technology for the problem. However, this creates a whole new world of pain.

On-boarding a new developer to the team takes weeks if not months as they have to ramp up on each of those services/technologies in addition to your company’s code base.

More tools means more overhead in dealing with each one manually.

Once the manual overhead gets high enough, developers start automating these processes. But that investment is invisible: the back-end automation is never seen by customers, it does not generate revenue, and it’s time spent away from working on the product.

There is little integration between them, so automation comes in the form of glue code trying to piece together those services in finicky scripts.

After we invest in automation via code, we lock ourselves into those services as the cost of switching means re-writing a ton of that glue code.

These are the pains we felt as an engineering organization, and they are what inspired us to create Factor.io. I’d love to hear how you have automated your deployment process; shoot us an email at founders@factor.io.

What we’ve been up to over the past few months

The last four months have been an amazing journey in entrepreneurship.

In the last five years of my life I’ve had numerous false starts with businesses. Late last year I finally decided to leave AppFog, where I was the Director of Product, so I could enable developers to spend less time on the crap they hate.

After interviewing some 20 companies earlier this year, I put up a landing page and a sign-up form. Things didn’t just take off; we hustled for the first few dozen sign-ups.

On April 1st we were lucky enough to join the Microsoft Accelerator powered by TechStars. A month into the program I wrote a brief summary of my learnings, “10 Lessons learned in the first month of MS Accelerator by TechStars.” It was tough, as I was going solo most of the time. My co-founder, Alex Parkinson, was still at AppFog as the Engineering Manager, helping push the company through the recent acquisition by Savvis/CenturyLink. He is now kicking ass full time on Factor.io.

During the program we started picking up great traction, hitting 300 sign-ups just a few weeks before the end. The day before Demo Day we presented at the GigaOm Structure conference as one of six startup finalists. We won the People’s Choice Award!!


Thanks to the win we got mentioned in four different GigaOm articles, one TechCrunch article, and some French blog. You can watch the Launchpad presentation here.

If you are curious, Microsoft did a great job putting together a few videos as a part of this experience: Microsoft Accelerator for Windows Azure — Meet Factor.io and Azure Accelerated Season 2 - What it Takes.

Now that things have simmered down after Demo Day and Structure, we are setting up shop in Portland at the Burnside Rocket building at 11th & E Burnside. If you live in Portland, you are always welcome to swing by and say hi.

We’ll be cranking away and getting the public beta rolled out over the next couple weeks.