Build your own chat bot with Gitter.im

By popular request from our customers, we’ve added support for Gitter.im, the Github-centric chat service. That means you can now build your own chat bot with Gitter.im.

What is Gitter.im?

Services like Hipchat and Campfire are popular amongst developers; however, they are designed to be generic team chat services for all sorts of teams. Gitter.im is a little different. While it provides the general chat capabilities you would expect, it has some awesome developer-specific features you won’t find anywhere else. For example, you can set up chat rooms around repos, reference commits in your messages, and even include Markdown in your messages.

What can you do with Gitter.im + Factor.io?

Now that we’ve added support for Gitter.im, you can integrate Gitter with the rest of your development workflow. For example, you can type “deploy qa” in the Gitter channel for a given repo to automatically deploy the qa branch to the QA environment. Here is a Gist to do exactly that with Github and Heroku.

To get started:

  1. Sign up
  2. Click on “Services” and activate the Github, Heroku, and Gitter services
  3. Click “Create Workflow” and select “Custom Workflow”
  4. In the definition, paste in the contents of this gist, replace the appropriate values, and save the workflow.
  5. Now in the Gitter.im room you should be able to type in “deploy master” to deploy to the appropriate environment.

Cool, huh? With the Gitter.im service integration you can receive messages from your room as well as send messages to it. You can also specify regular expressions in the filter option to match specific messages and pull out values. In other words, you can create your own chat bot capable of pulling code from Github/BitBucket/GitLab, compiling it with Jekyll/Middleman, and deploying it to a server via SSH, Heroku, or BitBalloon.
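
For example, a listener along these lines could watch a room for “deploy <branch>” messages and kick off a Heroku deploy. This is just a sketch: the exact parameter names for the Gitter channel (room:, filter:) and the Heroku action are assumptions, so check the Gist above for the real thing.

# Hypothetical sketch of a Gitter-driven deploy bot
listen 'gitter', 'message', room: 'my-org/my-repo', filter: /^deploy (\w+)$/ do |message_info|
  branch = message_info['matches'].first  # value captured by the regex filter
  run 'heroku', 'deploy', app: 'my-app', branch: branch
  run 'gitter', 'send', room: 'my-org/my-repo', message: "deploying #{branch}"
end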

Happy hacking

Docker Garbage Collector

We use Docker extensively at Factor.io to power two different components. First, it is used to host Factor.io workflows. Second, it is used as an isolated build environment when you include things like Middleman or Jekyll builds in your workflows.

In each case we manage the Docker containers and images so they start, stop, and clean up after themselves. However, bugs or deployments sometimes prevent that cleanup. So we created a Docker Garbage Collector for Factor.io, powered by Factor.io (we like meta).

Here is the workflow:

# Run the cleanup every 10 minutes
listen 'timer', 'every', minutes: 10 do |timer_info|
  # Remove all stopped containers
  remove_stopped = 'sudo docker rm $(sudo docker ps -a -q)'
  # Remove all untagged (<none>) images
  remove_untagged = 'sudo docker rmi $(sudo docker images -a | grep "^<none>" | awk \'{print $3}\')'
  commands = [remove_stopped, remove_untagged]
  run 'ssh', 'execute', commands: commands, host: 'docker-01.factor.io', username: 'ubuntu'
end

As you can see, this runs two different commands: one to remove stopped containers, and a second to remove untagged images. For this example we run it on the server ‘docker-01.factor.io’. However, you could just as well use the AWS or Rackspace connectors to list all your Docker servers dynamically so you don’t need to hard-code those addresses, as in the sketch below.
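
A sketch of that dynamic version might look like this (the 'aws' connector name, its 'list_servers' action, and its parameters are all assumptions here):

listen 'timer', 'every', minutes: 10 do |timer_info|
  commands = [
    'sudo docker rm $(sudo docker ps -a -q)',
    'sudo docker rmi $(sudo docker images -a | grep "^<none>" | awk \'{print $3}\')'
  ]
  # hypothetical connector call listing all servers tagged role=docker
  run 'aws', 'list_servers', tag: 'role=docker' do |servers|
    servers.each do |server|
      run 'ssh', 'execute', commands: commands, host: server['address'], username: 'ubuntu'
    end
  end
end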

In some ways this is a glorified cron. Unlike cron, though, this is fully hosted and doesn’t require you to configure a cron job on the server. It can also run on multiple servers (even in parallel). And lastly, you get the log output of these commands across all the servers in real time.

If you want to try this out, just create a new “Custom Workflow” in Factor.io and paste in the code block provided.

New Mini Feature: fixed Web Hook URIs

Now you can create a new Web Hook listener in Factor.io that listens on a fixed URI that looks something like https://connector.factor.io/v0.3/2/my-hook.

If you have ever used the web hooks listener in Factor.io, you may have noticed that the URL generated for the web hook is dynamically defined. This means that every time the workflow starts it gets a new URI. Every time Factor.io redeploys one of a couple of components, the workflow may restart; this ends up happening about twice a day. This can be a problem when you configure other tools to hit that web hook.

Now you can create a web hook like this and specify the ‘id’ parameter.

# Listen on a fixed URI in addition to the per-instance one
listen 'web', 'hook', id: 'my-id' do |hook_info|
  info "received hook"
end

This will create two different web hooks: one for this specific workflow instance, and another at /v0.3/[user_id]/[hook_id], where user_id is your user ID and hook_id is the value specified in the ‘id’ parameter when you created the workflow. This second “fixed” URL can be used by multiple workflows. When you hit that address, all of your workflows listening on it will be triggered.
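
For example, any tool that can make an HTTP request can now trigger those workflows at the fixed address (using the URL format shown above, with a user_id of 2 and the 'my-id' value from the listener):

curl -X POST https://connector.factor.io/v0.3/2/my-id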

Deploy from a git tag with Capistrano 3

Do you use Capistrano and Github? Git tags are a great way to mark your code for deployment.

I recently read “Deploy from a Git tag with Capistrano” by Nathan Hoad. It was great, but it was written for Capistrano pre-v3, so I’d like to update that post for Capistrano 3.

Use git to tag a new release:

git tag -a 02.16.2014.01

You can list your tags like this:

git tag
02.12.2014.01
02.16.2014.01

Now make sure you push the tags to the remote repository:

git push origin --tags

Now add this to your Capistrano deploy script (note, this is new for v3):

ask :branch, proc { `git tag`.split("\n").last }

This replaces the set :branch, 'master' code you had before, and the syntax is MUCH shorter than the pre-v3 equivalent.

On deployment, this will ask “Please enter branch: |02.16.2014.01|”.

Bonus: you can use Factor.io to automatically kick off deployments when your code is tagged and pushed to Github, Bitbucket, or Gitlab.

Better than cron

Cron was first introduced in Version 7 Unix in 1979. At 35 years old, it is no surprise that it is the de facto standard for time-based scheduling. But there are drawbacks when it comes to its usage in highly distributed and highly agile environments.

Why not cron?

  • Logs are either dumped to you in email, hidden away on disk, or lost completely.
  • No native way of running remote jobs.
  • Making any changes requires SSHing into a box or a sequence of steps to get it updated through configuration management systems (e.g. Chef/Puppet).
  • Difficult to manage across numerous systems.

Why Factor.io is better

  • Central place to define the “crontab”
  • No need to SSH into boxes to get started
  • No need to change code, check it in, and deploy just to update a time period or command
  • Run on multiple servers at once
  • Get all the logs in a single place without logging into the servers
  • Super easy to set up
  • Self-documenting

Factor.io isn’t perfect… yet!

The current model for defining workflows with the drag-and-drop interface doesn’t allow for conditions. This means you can’t trigger different actions on different conditions (e.g. pass/fail). The good news is that a new model for defining workflows is just around the corner: you’ll be able to programmatically define workflows with conditions and all the other great stuff a Ruby DSL provides.

Deploying using Capistrano, roles and tags

We use Chef for provisioning servers, Capistrano for deployments, and most of our code is written in Ruby.

We wanted to fully automate `cap production deploy` so that the servers would be self-identifying. Here is how we did it.

First, we used server tags in AWS and Chef to define the role of a given server during the provisioning process. This is how you can tag a server using knife ec2 server create (see the Chef documentation for more details):

knife ec2 server create --tags role=factord ...

Secondly, we set up our production.rb file in Capistrano like this…
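
Here is a minimal sketch of what that looks like (the Fog attributes tags, state, and dns_name come from Fog's AWS compute API; the credentials and SSH user are assumptions):

require 'fog'

# Connect to AWS with the Fog gem
compute = Fog::Compute.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)

# Register each running server under the Capistrano role from its EC2 "role" tag
compute.servers.select { |s| s.state == 'running' }.each do |server|
  role_name = server.tags['role']
  next unless role_name
  role role_name.to_sym, [server.dns_name], user: 'ubuntu'
end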

This uses the Fog gem to get the list of servers from AWS. We then identify the role of each server using the tags we defined in the previous step, and use the “role” directive in Capistrano to register the server under its corresponding role. And that’s it. Now when we run `cap production deploy` it gets the current list of running servers and assigns each server appropriately.

10 things we did to make Factor.io more reliable

Here is what you have to know about Factor.io. It has a distributed architecture with 8 different components. It provides a runtime to execute workflows defined by the user. It takes inputs from users, in the form of variables and credentials, to run those workflows. And it depends on numerous 3rd-party services (e.g. Github, Cloud Foundry, etc.) as part of the integrations available for user-defined workflows.

In other words, there is a lot of shit that can break. And break it did. So recently we’ve spent quite a bit of time investing in making Factor.io more reliable. Here are 10 of the things we did.

  1. Instrumented all of our services with Exceptional: We’ve been catching quite a few bugs with this bad boy. Airbrake is another popular service to get the job done.
  2. Instrumented incoming API calls with Logentries: Factor.io is distributed, so diagnosing failures is a little challenging. We’ve instrumented all of our services to log incoming messages, so we can track an action step-by-step as it executes across the distributed system.
  3. Set up Pingdom for the front-end: this is just low-hanging fruit. We have great uptime, but sometimes our hosting provider does act up, particularly during deployments. Thus far we’ve always known about issues before getting a notification.
  4. Background processor runs using the god gem: We try to handle exceptions as best we can. When all else fails, we can count on god to restart the back-end service.
  5. Use RabbitMQ reliable queues to coordinate work: RabbitMQ is configured to be reliable, i.e. writes stuff to disk. From our experience so far RabbitMQ has been incredibly reliable. We’ve been using the same service without restarts or touching it now for nearly a year. BUT, if it does fail, we’ve configured it to store queues on disk, so if it restarts it will pick up where it left off.
  6. Gracefully handling restarts: Each component saves its expected state in a DB. If a process (worker, service, etc.) fails or has to restart, it just picks up the expected state from the DB and sets everything up where it left off. For example, if you start a Hipchat listener and the service restarts, it will rejoin the Hipchat room after the restart.
  7. Handle error conditions that should never occur: In code it is easy to take things for granted. But those things that we assume to work sometimes break too. We make no assumptions.
  8. Provide users with a log of their workflow execution: Factor.io executes instructions (workflows) created by the user. Sometimes they fail because of a dependency, user error, or a change in conditions. We provide a powerful log allowing users to drill into the activities to understand what is going on under the hood so they can diagnose those failures.
  9. Provide users with an up-to-date status of their workflow: When a workflow is executing, we provide an up-to-date status in the dashboard so you know what is going on. If something fails, you will know proactively.
  10. Architected for eventual consistency: For each of the components we capture an expected state and a current state. If the two don’t match up, we try to get them to match; either way, we update the current state across the entire distributed system. (A sketch of the idea is below.)
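
A minimal sketch of that reconciliation loop, with hypothetical names standing in for the real components:

# Hypothetical sketch of expected-vs-current state reconciliation
class Component
  attr_reader :name, :expected_state
  attr_accessor :current_state

  def initialize(name, expected_state)
    @name = name
    @expected_state = expected_state
    @current_state = :unknown
  end

  # Try to bring current state in line with expected state,
  # e.g. restart a listener or rejoin a chat room
  def converge!
    self.current_state = expected_state unless current_state == expected_state
  end
end

components = [Component.new('hipchat-listener', :listening)]

loop do
  components.each(&:converge!)
  sleep 60 # reconcile once a minute
end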

What we’d like to do next…

  1. Run periodic functional tests in production.
  2. Add more and more-specific error handling; still a few things we haven’t covered.
  3. Provide better input validation so that we can prevent error conditions before they occur in the workflow.
  4. Run chaos-monkey like tests and handle failures.

Rolling out public beta

After 8 months in private beta, we are finally rolling out the public beta. We will be sending out invites in bulk over the next couple of weeks, and expect to open signups after working through the full queue.

We wanted to highlight some of the improvements we made based on feedback from our awesome private beta users.

New Features

  • Dashboard
  • Jumpstarts
  • Realtime workflow status
  • Detailed Start/Stop Logs
  • Step-by-Step Activity Logs
  • Detailed Info per Activity

What’s next?

Roadmap

As we launch the public beta, our sights are set on the general availability launch and the next few iterations thereafter.

Programmatically define workflows

We are going to be introducing a Domain Specific Language (DSL) which enables you to define workflows programmatically instead of through the user interface in the console. These programmatic workflows can be executed locally using our APIs, or they can be hosted by Factor.io.

The benefit is that the DSL gives you far more powerful capabilities, as it is based on Ruby and lets you use everything Ruby provides. For example, you can create a workflow that performs a deploy if a test passes, or opens an issue in Github if a test fails.
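
For example, a workflow like this sketch becomes possible (the 'jenkins' channel and the parameter names here are assumptions):

# Hypothetical sketch: deploy on success, open an issue on failure
listen 'jenkins', 'build_finished', job: 'my-app' do |build_info|
  if build_info['status'] == 'success'
    run 'heroku', 'deploy', app: 'my-app', branch: 'master'
  else
    run 'github', 'create_issue', repo: 'my-org/my-app', title: 'Build failed', body: build_info['log']
  end
end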

On-prem support

Factor.io is great when it runs in the public cloud, but sometimes you need to run on-prem. From our customers we learned that this happens for one of two reasons. Either you are in a regulated industry (e.g. banking, healthcare) with requirements to run everything in a controlled and audited environment, or you use tools that are only available on-prem, like Github Enterprise, a private Jenkins, etc.

New channels

The list of channels we support today is fairly limited. Going forward we will be adding channels based on customer demand. These channels fall into a number of categories.

  • Config Management (e.g. Chef, Puppet, AnsibleWorks, Salt Stack)
  • Project Management
  • Issue/Bug Tracking
  • Continuous Integration (e.g. Jenkins, Circle CI, Codeship, Wercker)
  • Generic (e.g. Timer, web-hook, web-call)

Create your own channels

We’ve already built an easy-to-implement way of extending the functionality of Factor.io by adding new channels, and we want to empower our users to do the same. To get there, we need to provide great documentation, open-source the channel service, and provide a finer-grained security model.

SCP support is now here!

Secure Copy Protocol (SCP) is a method of copying files to a remote server over SSH. It is now available as part of the SSH channel in Factor.io.

What does this mean?

As part of your workflow, you can upload files to a remote server: an EC2 instance on AWS, a server running on Rackspace, or any server that supports SSH.
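
For example, an upload step might look something like this sketch (the 'upload' action name, its parameters, and the host are assumptions):

# Hypothetical sketch: push a build artifact to a server on every merge to master
listen 'github', 'push', repo: 'my-org/my-app', branch: 'master' do |push_info|
  run 'ssh', 'upload', host: 'app-01.example.com', username: 'ubuntu',
      source: 'build/site.tar.gz', destination: '/var/www/site.tar.gz'
end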

What can I use this for?

  • Easily create and bootstrap a server on AWS or Rackspace with a command like “setup server foobar” in Hipchat.
  • Deploy a new app to a server when you merge code into the master branch in Github.
  • Create a new mini-QA environment whenever you create a new feature branch in Github.