Have you heard about ChatOps? I had the pleasure of hosting a panel discussion on ChatOps during Atlassian Summit 2015. To learn from real-world examples, I interviewed Stevan Arychuk from New Relic, David Hayes from PagerDuty, and Raymond Chan from Twitter. The discussion covered how ChatOps grew up in the three organizations, the benefits and costs, and thoughts about the future of ChatOps. Read on for my summary observations from the panel.
What is ChatOps?
ChatOps falls under the DevOps umbrella. Damon Edwards described DevOps with the loose taxonomy of culture, automation, measurement, and sharing — CAMS. For me, ChatOps is about putting sharing at the front. ChatOps is about sharing a collaborative culture, sharing automation and tools, and sharing common measurements. In this sense, sharing is the feedback mechanism that helps teams learn from their own efforts, and from the efforts of others.
Technically, ChatOps isn't all that new. On the panel, Stevan described how IRC and bots have long been part of the sysadmin toolkit. So why has the industry coined a new phrase? And why is this topic rising in popularity? New technologies are changing the cultural aspects of automation in chat. New products like HipChat are more friendly for use beyond sysadmins. That means the chat-based collaborations now include other roles. When you add bots to HipChat, the consequence is more people can get information or trigger actions. The big deal is that even non-technical people can now self-serve to run important reports or scripts.
Another change is the maturity of bot frameworks. Raymond from the panel described how easy it was to set up the popular Hubot. It runs on Node.js, so it is easy to deploy on a local laptop or into the cloud via a PaaS environment. The bot doesn't need to be in your production infrastructure, which makes it easier to try out new things.
What does ChatOps do?
Modern bots are much more about interactions with other tools. That can be used to bring useful information into a chat context. For example, PagerDuty can signal an outage in a team's room. Or New Relic can indicate when there are big drops in performance. Or Bitbucket can announce commits and pull requests.
Pulling information in is just one direction. I like the summary of ChatOps as a "shared command line" because people can also issue commands to bots. People now trigger building a branch, deploying to production, unveiling of features (by flipping a feature toggle), updating JIRA issue statuses, and merging version control branches. When the command to trigger an automated flow is available from chat, then anyone can learn how to take action. That helps share knowledge and remove bottlenecks.
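To make the "shared command line" idea concrete, here is a minimal sketch in Python of how a bot might route chat messages to automation. It is not Hubot's API or any panelist's implementation. The bot name, the command syntax, and the handlers `build_branch` and `flip_toggle` are all hypothetical stand-ins for calls to real tools.

```python
import re
from typing import Optional

# Hypothetical handlers: each stands in for a call to a real tool
# (a build server, a feature-toggle service, and so on).
def build_branch(user: str, branch: str) -> str:
    return f"{user} started a build of branch {branch}"

def flip_toggle(user: str, feature: str, state: str) -> str:
    return f"{user} turned feature {feature} {state}"

# Each entry pairs a command pattern with the handler that runs it.
# The command syntax here is an assumption made for illustration.
COMMANDS = [
    (re.compile(r"^build (\S+)$"), build_branch),
    (re.compile(r"^toggle (\S+) (on|off)$"), flip_toggle),
]

def handle_message(user: str, text: str, bot_name: str = "bot") -> Optional[str]:
    """Return the bot's reply, or None if the message is not addressed to the bot."""
    prefix = f"@{bot_name} "
    if not text.startswith(prefix):
        return None  # ordinary conversation, not a command
    command = text[len(prefix):].strip()
    for pattern, handler in COMMANDS:
        match = pattern.match(command)
        if match:
            # Captured groups become the handler's arguments.
            return handler(user, *match.groups())
    return f"Sorry {user}, I don't know how to '{command}'"

print(handle_message("alice", "@bot build feature-x"))
print(handle_message("bob", "@bot toggle dark-mode on"))
```

Because both the command and the reply appear in the room, everyone watching sees who ran what and how to run it themselves.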
Everyone on the panel mentioned the importance of having fun with bots. That may explain why Sassy is one of the most popular bots. Sassy brings memes and animated GIFs into chat. Obviously, cat pictures don't help during a production outage, but they can make celebrating a win that much more gratifying.
How do people use ChatOps?
Raymond told a story about Twitter's interaction with the World Cup. When a country's team scored, Twitter would see a spike in traffic for that country. Operations folks opened a room to keep track of the World Cup schedule. They collaborated on when and where to expect traffic spikes. Using bots, the team could quickly check the status of JIRA issues that track infrastructure changes. Raymond described how his bot was trained to answer the question, "Can I deploy this now?" This helped avoid deployment collisions, especially when the flurry of activity was most intense.
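As a rough illustration of the "Can I deploy this now?" question, here is a small Python sketch. It is not Twitter's implementation. It assumes the bot already has a list of tracked changes (the `Change` record, its fields, and the issue keys are hypothetical) and simply says no when another change to the same service is still in progress.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Change:
    """A tracked infrastructure change (hypothetical shape, for illustration)."""
    key: str      # issue key, e.g. "OPS-101" (made up)
    service: str  # service the change touches
    status: str   # e.g. "In Progress" or "Done"

def can_deploy(service: str, changes: List[Change]) -> str:
    """Answer the deploy question for one service based on open changes."""
    blocking = [
        c.key for c in changes
        if c.service == service and c.status == "In Progress"
    ]
    if blocking:
        return f"No: {service} has changes in progress ({', '.join(blocking)})"
    return f"Yes: no changes in progress for {service}"

changes = [
    Change("OPS-101", "timeline", "In Progress"),
    Change("OPS-102", "search", "Done"),
]
print(can_deploy("timeline", changes))
print(can_deploy("search", changes))
```

In a real setup the list of changes would come from the issue tracker rather than being hard-coded, and the rule for what counts as a collision would be whatever the team agrees on.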
Raymond's story is typical of events requiring "all hands on deck". Many operations teams open a conference bridge during production outages. Getting everyone together on a call helps coordinate activity. But it also has a downside. People just joining the call will miss information that was previously communicated. A persistent chat room enables anyone to catch up. Just scroll back through history. As David explained during the panel, that makes chat really useful for collaborating on production problems. He explained that the chat log is both a real-time source of status during an outage and a source of inspiration for improvement during post-incident review (known in some circles as a post-mortem). The chat logs indicate what kinds of information were useful in solving the problem and what steps were taken to restore service. Both feed forward into what to teach a bot next.
As the name implies, ChatOps was born as chat about operations. ChatOps is starting to outgrow the name. Beyond production troubleshooting, more and more people are starting to use ChatOps in routine activities. Inside Atlassian, we use a bot to track visitors to this blog. We also use a bot to summarize usage and errors about Atlassian Connect. Generally, as more people participate in chat, the uses of automation multiply too.
What's next for ChatOps?
There are still rough edges. The panel said that a key struggle is managing the signal-to-noise ratio. That's why everyone on the panel was excited by HipChat Connect. With new integration features like cards and glances, external tools can provide information without disrupting conversation flow. Actions will help people respond to information without needing to enter a separate command. The panel also posited that new kinds of integrations would emerge. For example, a bot might grab a screenshot from production to communicate more about what is going on.
Regardless of the next-generation features, the panel agreed ChatOps is here and now for everyone. So why not give it a try? Have a look at HipChat, turn on some integrations, and invite your friends.