How Cloud Custodian conquered cloud resource management

Acquiring cloud infrastructure is easy, but controlling cost and usage is hard. Here’s how open-source project Cloud Custodian flipped the script with policy as code.

Kapil Thangavelu says he probably would have been a history professor if he hadn’t discovered Python in college.

His early-days exposure to Zope and Plone (a Python web application server and content management system, respectively) put him on a career trajectory that began with GIS (geographic information system) work at DARPA, then ephemeral workloads on AWS at the Library of Congress, then an opportunity to work with Mark Shuttleworth and Ubuntu Linux at Canonical. At Canonical, Thangavelu led the development of Juju, an open source project that allows users to assemble cloud applications like Lego bricks.

So when Thangavelu joined Capital One to lead a major cloud modernization effort, he already had strong opinions about coordinating cloud resources across distributed teams. But what he saw at Capital One — application teams struggling to find a common mental model of how they were using cloud resources — led him to an epiphany. In software development, there is so much more time spent reading code than writing code. Cloud infrastructure needed a lingua franca and continuous workflow to get everyone on the same page.

Today Thangavelu is known as the creator of Cloud Custodian, an open-source project that gives enterprises a “policy as code” framework and continuous workflow around cloud optimization, so that managing cost and usage is no longer a “spring cleaning” one-off but an automatic side effect of developers finally having a DSL (domain-specific language) in this critical arena.

I recently met with Thangavelu to learn more about trends he’s seeing in cloud optimization and finops, and how Cloud Custodian and his startup Stacklet see cloud governance evolving in this time of great inflationary pressure and runaway cloud bills.

Travis Van: What led you to the vision for Cloud Custodian when you were at Capital One?

Kapil Thangavelu: Like so many large enterprises eight years ago, they were aggressively moving to the cloud and open source, and the mandate was to accelerate all the developers getting into the cloud environment. Obviously being in financial services, we were dealing with a highly regulated industry: every new cloud service had to have its certs signed off, everything configured correctly in REST. There were a ton of one-off scripts, it was easy to configure things incorrectly and create backlogs of problems, and then you had the other challenges of making sure things were tested and monitored consistently. It was obvious that this was not going to scale across hundreds of engineers and application teams. So we said, let’s create a DSL that can address these issues holistically across these dimensions. Let’s not just identify cloud problems, but figure out a language that would also let us fix them in real-time. We designed Cloud Custodian to be a highly readable YAML DSL. We wanted this language and policy definition for cloud resources to be accessible across many different groups, to developers, to their managers, and even to the auditors in secondary lines like security. And we wanted it to be highly readable, because in coding you’re always going to be reading much more than you write with cloud resources, so let’s make it as readable as possible.

Van: What would you say Cloud Custodian is known for today, in terms of the kinds of problems it solves?

Thangavelu: The initial focuses were tagging, compliance, security, but also doing workflows around cost stuff. Cloud Custodian gives you a workflow where you can define things like grace periods for cloud resources where they then shut off if unused — those types of constructs for building logical workflows around cloud resources, as policies. Even today, eight years after open sourcing the project in 2016, Cloud Custodian’s claim to fame is being best in class in remediation. It doesn’t just let you admire problems, it’s designed to help you solve the problems in your cloud footprint. The big areas where it thrives are things like garbage collection and dealing with under-utilized cloud resources, right-sizing resources that may be overprovisioned, handling the life cycle of objects and buckets and all the reclamation policies that go with that, and making sure configurations are in line with the desired policies, pre-deployment. Those are some of the big areas, but Cloud Custodian also has things like blast radius protection and other types of tooling to help deal with the risks of remediation in production, which is always tricky.
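To make the grace-period idea concrete, here is a minimal Python sketch of a mark-then-act workflow: an under-used resource is first marked, and it is stopped only if it is still under-used once the grace period runs out. This is not Cloud Custodian's code or its YAML syntax. The `Resource` class, the `evaluate` function, the four-day grace period, and the 5% CPU threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

GRACE_DAYS = 4  # hypothetical grace period before an unused resource is stopped


@dataclass
class Resource:
    """Hypothetical stand-in for a cloud resource record."""
    name: str
    avg_cpu_percent: float             # average utilization over a lookback window
    stop_after: Optional[date] = None  # set once the resource has been marked
    running: bool = True


def evaluate(resource: Resource, today: date, cpu_threshold: float = 5.0) -> str:
    """Apply the mark-then-act policy once and return the action taken."""
    if not resource.running:
        return "already-stopped"
    if resource.avg_cpu_percent >= cpu_threshold:
        # Usage recovered, so clear any pending mark.
        resource.stop_after = None
        return "ok"
    if resource.stop_after is None:
        # First time seen as under-used: mark it and start the grace period.
        resource.stop_after = today + timedelta(days=GRACE_DAYS)
        return "marked"
    if today >= resource.stop_after:
        # Grace period expired and the resource is still under-used.
        resource.running = False
        return "stopped"
    return "waiting"
```

Run daily against an inventory, a loop like this gives owners a window to react before anything is shut off, which is the point of separating the "mark" step from the "act" step.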

Van: What is the operating model for Cloud Custodian?

Thangavelu: We help users treat policies as code: put them in Git, version them, do pull requests on them. Cloud Custodian provides open source CI [continuous integration] tools for validation, so as you roll that out for your deployment practices, you merge to trunk and the policies get auto-deployed. You can use your existing databases, and Cloud Custodian focuses on being a native app on each of the clouds you use. So if you use GCP, you’re going into those metric stores that are already there, and same with AWS or Azure. Then we’ve got Cloud Custodian set up to file Jira tickets, kick out Slack messages and provide that real-time feedback loop directly with the person who made a change. So if you deployed a database to the Internet by accident and broke a policy, you get messaged to get that fixed.
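As a rough illustration of what a validation step in CI might check before policies merge to trunk, the Python sketch below lints a policy document that has already been parsed into a dictionary. It is an assumption-laden stand-in, not Cloud Custodian's validator (the project ships its own `custodian validate` command). The `validate_policies` function and the rules it enforces are hypothetical, and the only structure assumed is a top-level `policies` list whose entries carry a `name`, a `resource`, and optional `filters` and `actions` lists.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("name", "resource")  # assumed minimum fields for this sketch


def validate_policies(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    policies = doc.get("policies")
    if not isinstance(policies, list):
        return ["top-level 'policies' must be a list"]

    errors: List[str] = []
    seen_names = set()
    for index, policy in enumerate(policies):
        if not isinstance(policy, dict):
            errors.append(f"policy #{index}: must be a mapping")
            continue
        label = policy.get("name") or f"#{index}"
        # Every policy needs the required identifying fields.
        for key in REQUIRED_KEYS:
            if not policy.get(key):
                errors.append(f"policy {label}: missing '{key}'")
        # Filters and actions, when present, must be lists.
        for key in ("filters", "actions"):
            if key in policy and not isinstance(policy[key], list):
                errors.append(f"policy {label}: '{key}' must be a list")
        # Names must be unique so notifications and logs are unambiguous.
        name = policy.get("name")
        if isinstance(name, str):
            if name in seen_names:
                errors.append(f"policy {label}: duplicate name")
            seen_names.add(name)
    return errors
```

A CI job could call this on every pull request and fail the build when the returned list is non-empty, so a malformed policy never reaches the auto-deploy step.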

Van: What does it look like when companies are not handling cloud governance well?

Thangavelu: Failing means your resources are misconfigured, you’re wasting money, potentially you have security vulnerabilities in your environment caused by configuration errors. It also may mean you are using dedicated reporting tools that go into different teams. I’ve seen security teams that report into the CISO and have to go all the way up the org chain into a VP, then back down through another VP team before getting back to the app team. That tends to lead to a ton of inefficiency.

Everybody knows the cloud bill is basically rate multiplied by usage. But while most enterprises have a handle on rate, usage is the hard part. You have different application teams provisioning infrastructure. You go through code reviews. Then when you get to five to 10 applications, you get past the point where anyone can possibly know all the components. Now you have containerized workloads on top of more complex microservices architectures. And you want to be able to allow a combination of cathedral (control) and bazaar (freedom of technology choice) governance, especially today with AI and all of the new frameworks and LLMs [large language models]. At a certain point you lose the script to be able to follow all of this in your head. There are a lot of tools to enable that understanding — architectural views, network service maps, monitoring tools — all feeling out different parts of the elephant versus giving an organization a holistic view. They need to know not only what’s in their cloud environment, but what’s being used, what’s conforming to policy, and what needs to be fixed, and how. That’s what Cloud Custodian is for — so you can define the organizational requirements of your applications and map those up against cloud resources as policy.

Van: Tell us a little bit about the project’s momentum and what’s next on the horizon.

Thangavelu: We’re getting new features every week from the community, which is 367 contributors and a dozen maintainers. I’d say the greatest activity is all the providers being created, with a particular emphasis on new Cloud Custodian providers for the Kubernetes ecosystem. As a broader trend there’s a big “shift left” evolution that’s happening in Cloud Custodian today. We’re bringing Cloud Custodian into the IDE, we’re expanding its integration with CI systems so the rest of the team can get code compliance into the build process. We’re creating primitives that make it easier to parse the output and understand not just when there are problems, but how developers can fix the problems.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
