by Andrey Polkovnychenko and Shachar Menashe

Public package repos expose thousands of API security tokens—and they’re active

Feature | Oct 18, 2022 | Application Security, Cloud Computing, Cloud Security

JFrog’s new Xray Secrets Detection uncovered active access tokens in popular open-source software registries including Docker, npm, and PyPI. Here are our findings and takeaways.


As part of the development of JFrog Xray’s new Secrets Detection feature, we wanted to test our detection capabilities on as much real-world data as possible, both to eliminate false positives and to catch any errant bugs in our code.

As we continued testing, we discovered far more active access tokens than we expected. We broadened our tests into full-fledged research, to understand where these tokens come from, to assess the viability of using them, and to be able to privately disclose them to their owners. In this blog post we’ll present our research findings and share best practices for avoiding the exact issues that led to the exposure of these access tokens.

Access tokens – what are they all about?

Cloud services have become synonymous with modern computing. It’s hard to imagine running any sort of scalable workload without relying on them. The benefits of these services come with the risk of entrusting our data to machines outside our control, and the responsibility of managing the access tokens that gate access to our data and services. Exposure of these access tokens may lead to dire consequences. A recent example was the largest data breach in history, which exposed one billion records containing PII (personally identifiable information) due to a leaked access token.

Unlike the presence of a code vulnerability, a leaked access token usually means the immediate “game over” for the security team, since using a leaked access token is trivial and, in many cases, negates all investments into security mitigations. It doesn’t matter how sophisticated the lock on the vault is if the combination is written on the door.

Cloud services intentionally add an identifier to their access tokens so that their services may perform a quick validity check of the token. This has the side effect of making the detection of these tokens extremely easy, even when scanning very large amounts of unorganized data.

Platform | Example token
---------|--------------
AWS      | AKIAIOSFODNN7EXAMPLE
GitHub   | gho_16C7e42F292c6912E7710c838347Ae178B4a
GitLab   | gplat-234hcand9q289rba89dghqa892agbd89arg2854
npm      | npm_1234567890abcdefgh
Slack    | xoxp-123234234235-123234234235-123234234235-adedce74748c3844747aed48499bb
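Because of these identifying prefixes, a handful of regular expressions can surface candidate tokens even in huge amounts of unstructured data. A minimal sketch of the idea (the patterns below are illustrative approximations, not the exact expressions a production scanner like Xray would use):

```python
import re

# Illustrative approximations of well-known token formats.
# A real scanner uses stricter patterns plus dynamic validation.
TOKEN_PATTERNS = {
    "aws": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github": re.compile(r"\bgho_[0-9A-Za-z]{36}\b"),
    "npm": re.compile(r"\bnpm_[0-9A-Za-z]{18,}\b"),
    "slack": re.compile(r"\bxoxp-[0-9]+-[0-9]+-[0-9]+-[0-9a-f]+\b"),
}

def find_candidate_tokens(text):
    """Return (platform, token) pairs for every candidate match in the text."""
    hits = []
    for platform, pattern in TOKEN_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((platform, match))
    return hits
```

Static matches like these are only candidates; confirming that a hit is a live credential requires the dynamic verification described below.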

Which open-source repositories did we scan?

We scanned artifacts in the most common open-source software registries: npm, PyPI, RubyGems, crates.io, and DockerHub (both Dockerfiles and small Docker layers). All in all, we scanned more than 8 million artifacts.

In each artifact, we used Secrets Detection to find tokens that can be easily verified. As part of our research, we made a minimal request for each of the found tokens to:

  1. Check whether the token is still active (i.e., it wasn’t revoked or otherwise made unavailable).
  2. Understand the token’s permissions.
  3. Identify the token’s owner (whenever possible) so we could disclose the issue privately to them.

For npm and PyPI, we also scanned multiple versions of the same package, to try and find tokens that were once available but removed in a later version.
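The version diff can be sketched as follows, assuming per-version scan results have already been collected (the data shape here is an illustrative assumption):

```python
def tokens_removed_in_later_versions(findings_by_version):
    """Given {version: set_of_tokens_found}, return tokens that appear in
    some version but are absent from the latest one -- i.e. secrets that
    were "fixed" by removal, yet remain exposed in old, still-downloadable
    versions.

    Versions are assumed to be passed in release order (dicts preserve
    insertion order in Python 3.7+).
    """
    versions = list(findings_by_version)
    latest = findings_by_version[versions[-1]]
    ever_seen = set().union(*findings_by_version.values())
    return ever_seen - latest
```

Any token this returns is worth verifying dynamically: removing it from the latest release does nothing to revoke it.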


‘Active’ vs. ‘inactive’ tokens

As mentioned above, each token that was statically detected was also run through a dynamic verification. This means, for example, trying to access an API that doesn’t do anything (no-op) on the relevant service that the token belongs to, just to see that the token is “available for use.” A token that passed this test (“active” token) is available for attackers to use without any further constraints.

We will refer to the dynamically verified tokens as “active” tokens and the tokens that failed dynamic verification as “inactive” tokens. Note that there might be many reasons that a token would show up as “inactive.” For example:

  • The token was revoked.
  • The token is valid, but has additional constraints to using it (e.g., it must be used from a specific source IP range).
  • The token itself is not really a token, but rather an expression that “looks like” a token (false positive).
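The verification step can be sketched as follows. The request function is injected so the checker stays service-agnostic and testable; in practice it would issue a harmless, read-only API call (for example, fetching the token owner's own account metadata) against the service the token belongs to:

```python
def check_token(token, send_noop_request):
    """Classify a token as 'active' or 'inactive'.

    send_noop_request(token) is assumed to perform a harmless, read-only
    API call against the relevant service and return the HTTP status code.
    """
    try:
        status = send_noop_request(token)
    except OSError:
        return "inactive"  # network failure: cannot confirm the token works
    # 2xx means the service accepted the credential; anything else
    # (revoked, IP-restricted, malformed, false positive) counts as inactive.
    return "active" if 200 <= status < 300 else "inactive"
```

Note that "inactive" here is deliberately conservative: as the list above explains, a token may fail verification and still be valid under other conditions.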

Which repositories had the most leaked tokens?

The first question that we wanted to answer was, “Is there a specific platform where developers are most likely to leak tokens?”

In terms of sheer volume of leaked secrets, developers need to be especially careful about leaking secrets when building their Docker images (see the “Examples” section below for guidance).

[Chart: number of leaked secrets found per platform]

We hypothesize that the vast majority of Docker Hub leaks are caused by the closed nature of the platform. While other platforms let developers link to the source repository and get security feedback from the community, Docker Hub has a higher price of entry: the researcher must pull the Docker image and explore it manually, possibly dealing with binaries and not just source code.

An additional problem with Docker Hub is that no contact information is publicly shown for each image, so even if a leaked secret is found by a white hat researcher it might not be trivial to report the issue to the image maintainer. As a result, we can observe images that retain exposed secrets or other types of security issues for years.

The following graph shows that tokens found in Docker Hub layers have a much higher chance of being active, compared to all other repositories.

[Chart: share of active vs. inactive tokens per repository]

Finally, we can also look at the distribution of tokens normalized to the number of artifacts that were scanned for each platform.

[Chart: leaked tokens normalized to the number of artifacts scanned per platform]

When ignoring the number of scanned artifacts for each platform and focusing on the relative number of leaked tokens, we can see that Docker Hub layers still provided the most tokens, but second place is now claimed by PyPI. (When looking at the absolute data, PyPI had the fourth most tokens leaked.)

Which token types were leaked the most?

After scanning all token types that are supported by Secrets Detection and verifying the tokens dynamically, we tallied the results. The top 10 results are displayed in the chart below.

[Chart: top 10 most-leaked token types]

We can clearly see that Amazon Web Services, Google Cloud Platform, and Telegram API tokens are the most-leaked tokens (in that order). However, it seems that AWS developers are more vigilant about revoking unused tokens, since only ~47% of AWS tokens were found to be active. By contrast, GCP had an active token rate of ~73%.

Examples of leaked secrets in each repository

It is important to examine some real-world examples from each repository in order to raise awareness of the places where tokens leak. In this section, we will focus on these examples, and in the next section we will share tips on how they should have been handled.

DockerHub – Docker layers

Inspecting the filenames present in Docker layers that contained leaked credentials shows that the most common source of leakage is Node.js applications that use the dotenv package to store credentials in environment variables. The second most common source was hardcoded AWS tokens.

The table below lists the most common filenames in Docker layers that contained a leaked token.

Filename          | # of instances with active leaked tokens
------------------|-----------------------------------------
.env              | 214
./aws/credentials | 111
config.json       | 56
gc_api_file.json  | 50
main.py           | 47
key.json          | 40
config.py         | 38
credentials.json  | 35
bot.py            | 35

Docker layers can be inspected by pulling the image and running it. However, in some cases a secret might have been removed by an intermediate layer (via a “whiteout” file), in which case it won’t show up when inspecting the final Docker image. Each layer can still be inspected individually, using tools such as dive, to find the secret in the “removed” file. See the screenshot below.

[Screenshot: Docker layer with credentials opened in the dive layer inspector.]

Inspecting the contents of the “credentials” file reveals the leaked tokens.

[Screenshot: AWS credentials leaked via ./aws/credentials.]

DockerHub – Dockerfiles

Docker Hub contained more than 80% of the leaked credentials in our research.

Developers usually use secrets in Dockerfiles to initialize environment variables and pass them to the application running in the container. After the image is published, these secrets become publicly leaked.

[Screenshot: AWS credentials leaked through Dockerfile environment variables.]
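A pre-publish check can catch this pattern by scanning Dockerfile ENV and ARG instructions for values that look like hardcoded credentials. A rough sketch (the keyword heuristic and regexes are illustrative):

```python
import re

# Heuristic: ENV/ARG lines whose variable name suggests a credential
# and that carry a hardcoded value rather than a build-time reference.
SUSPICIOUS_NAME = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL)", re.I)
ENV_LINE = re.compile(r"^\s*(?:ENV|ARG)\s+([A-Za-z_][A-Za-z0-9_]*)[=\s]+(\S+)", re.M)

def suspicious_env_lines(dockerfile_text):
    findings = []
    for name, value in ENV_LINE.findall(dockerfile_text):
        # ${VAR} references are resolved at build time and are not
        # hardcoded secrets by themselves.
        if SUSPICIOUS_NAME.search(name) and not value.startswith("$"):
            findings.append((name, value))
    return findings
```

Running a check like this in CI, before `docker push`, fails the build instead of leaking the credential with the published image.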

Another common pattern is using secrets in Dockerfile commands that download the content required to set up the Docker application. The example below shows how a container uses an authentication secret to clone a repository into the container.

[Screenshot: AWS credentials leaked through the Dockerfile via a git clone command.]

crates.io

With crates.io, the package registry for Rust, we were happy to see a different outcome than in all the other repositories. Although Xray detected nearly 700 packages that contain secrets, only one of these secrets showed up as active. Interestingly, this secret wasn’t even used in the code, but was found within a comment.

[Screenshot: the active secret, found inside a code comment.]

PyPI

In our PyPI scans, most of the token leaks were found in actual Python code.

For example, one of the functions in an affected project contained an Amazon RDS (Relational Database Service) token. Storing a token like this may be fine, if the token only allows access for querying the example RDS database. However, when collecting permissions for the token, we discovered that the token gives access to the entire AWS account. (This token has been revoked following our disclosure to the project maintainers.)

[Screenshot: AWS token leakage in the source code of a PyPI package.]

[Screenshot: unintended full admin permissions (*/*) on an “example” Amazon RDS token.]

npm

In addition to hardcoded tokens in Node.js code, npm packages can have custom scripts defined in the scripts block of the package.json file. These scripts, defined by the package maintainer, run in response to certain triggers, such as the package being built or installed.

A recurring mistake we saw was storing tokens in the scripts block during development, but then forgetting to remove the tokens when the package is released. In the example below we see leaked npm and GitHub tokens that are used by the build utility semantic-release.

[Screenshot: npm token leakage in the npm “scripts” block (package.json).]
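Since the scripts block is plain JSON, this mistake is easy to catch mechanically before running `npm publish`. A sketch that reuses simple prefix patterns (illustrative, not exhaustive):

```python
import json
import re

# Illustrative token prefixes (npm tokens and GitHub ghp_/gho_/ghs_ tokens);
# a real scanner would cover many more formats.
TOKEN_RE = re.compile(r"\b(?:npm_[0-9A-Za-z]{18,}|gh[pos]_[0-9A-Za-z]{36})\b")

def tokens_in_scripts(package_json_text):
    """Return (script_name, token) pairs for token-like strings found in
    package.json's scripts block."""
    manifest = json.loads(package_json_text)
    findings = []
    for script_name, command in manifest.get("scripts", {}).items():
        for token in TOKEN_RE.findall(command):
            findings.append((script_name, token))
    return findings
```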

Usually, the dotenv package is supposed to solve this problem. It allows developers to create a local file called .env in the project’s root directory and use it to populate environment variables in a test environment. Used correctly, this prevents the leak, but unfortunately we found improper usage of dotenv to be one of the most common causes of secrets leakage in npm packages: we found many packages where the .env file was published to npm and contained secrets.

The dotenv documentation explicitly warns against this, answering the FAQ question “Should I commit my .env file?” as follows:

No. We strongly recommend against committing your .env file to version control. It should only include environment-specific values such as database passwords or API keys. Your production database should have a different password than your development database.
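A pre-publish guard can enforce this advice by checking the list of files that would ship in the artifact. A sketch, where the input is assumed to come from something like `npm pack --dry-run` and the name list is an illustrative subset of common offenders:

```python
def files_that_should_not_ship(package_files):
    """Flag files in a package's publish manifest that commonly carry secrets."""
    risky_names = {".env", ".npmrc", "credentials", "credentials.json"}
    flagged = []
    for path in package_files:
        name = path.rsplit("/", 1)[-1]
        # Also catch environment-specific variants such as .env.production.
        if name in risky_names or name.startswith(".env."):
            flagged.append(path)
    return flagged
```

Failing the release when this returns anything is far cheaper than rotating the credentials after publication.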

RubyGems

Going over the results for RubyGems packages, we saw no special outliers. The detected secrets were found either in Ruby code or in arbitrary configuration files inside the gem.

For example, here we can see an AWS configuration YAML that leaked sensitive tokens. The file is supposed to be a placeholder for AWS configuration, but the development section was altered with a live access/secret key.

[Screenshot: AWS token leakage in spec/dummy/config/aws.yml.]

The most common mistakes when storing tokens

After analyzing all the active credentials we’ve found, we can point to a number of common mistakes that developers should look out for, and we can share a few guidelines on how to store tokens in a safer way.

Mistake #1. Not using automation to check for secret exposures

There were plenty of cases where we found active secrets in unexpected places: code comments, documentation files, examples, or test cases. These places are very hard to check for manually in a consistent way. We suggest embedding a secrets scanner in your DevOps pipeline and alerting on leaks before publishing a new build.

There are many free, open-source tools that provide this kind of functionality. One of our OSS recommendations is TruffleHog, which detects a plethora of secret types and validates its findings dynamically, reducing false positives.
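Alongside pattern matching, scanners in this space commonly use an entropy heuristic: long, high-entropy strings are likelier to be machine-generated secrets than ordinary identifiers. A toy version of the Shannon-entropy check (the 20-character minimum and 4.0 bits/char threshold are illustrative values, not any particular tool’s defaults):

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_machine_generated(candidate, threshold=4.0):
    # Random base64/hex blobs score high; repeated or short strings do not.
    return len(candidate) >= 20 and shannon_entropy(candidate) >= threshold
```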

For more sophisticated pipelines and broad integration support, we provide JFrog Xray.

[Screenshot: a GitHub token leaked in documentation, intended as read-only but in reality providing full edit permissions.]

Mistake #2. Generating tokens with broad permissions that never expire

The majority of cloud services allow setting up fine-grained scopes and permissions when generating an access token.

[Screenshot: GitHub’s Personal Access Token generation screen, with fine-grained token permissions.]

Unfortunately, we have observed a very common anti-pattern where a single token is generated with full admin permissions.

[Chart: approximately 25% of the active AWS tokens we found had full admin capabilities.]

Setting up IAM permissions for an AWS access token is a daunting task, due to the long and complicated list of possible permissions in the IAM permissions model. In many instances, we’ve observed tokens with wildcards: for example, s3:* instead of adding the s3:ListBucket and s3:GetObject permissions individually.

For users who don’t want to dig through the documentation, AWS offers a “full admin” set of permissions, marked as */*, which grants unlimited access to all possible functionality. So when a token that was only meant to download pictures from an S3 bucket is leaked, the attacker gains access to the user’s whole AWS infrastructure.
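A quick sanity check before attaching a policy is to flag wildcard actions. A sketch over the standard IAM JSON policy document shape (Version / Statement / Effect / Action):

```python
import json

def wildcard_actions(policy_json):
    """Return the wildcard actions granted by an IAM policy document.

    Flags '*' and 'service:*' style actions on Allow statements -- the
    over-broad grants that turn a leaked token into full account (or
    full service) access.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be unwrapped
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        flagged.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return flagged
```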

We suggest following the principle of least privilege and granting only the permissions required to perform a task. Identifying the scope and choosing the matching permissions up front is a far better use of time than conducting a data breach investigation later.

In addition to the permissions given to the token, setting an expiration date on the token makes sure that “long lost” instances of the token are useless even if the token is leaked in the future.

Mistake #3. No access moderation for the secret

In many cases, a developer hardcodes a token somewhere because they believe there is no alternative for moderating access to it. In the better cases, the token has exactly the permissions it needs, so exposing it to everyone is not an issue. However, not all cloud services support the exact set of permissions a token needs, and in such cases exposing the token at all is a problem.

For example, it is a common practice to keep API secrets in environment variables. This can be safe in a local development environment. But when it comes to cloud services, especially those published in public repositories, the secrets become available to anyone interested.

[Screenshot: AWS credentials leaked through Dockerfile environment variables.]

Fortunately, there are a few moderation mechanisms that can be applied to make sure only validated users may access the token. In a Docker-based environment, these include orchestrator-managed secret stores such as Docker Swarm secrets and Kubernetes Secrets, and dedicated secret managers such as HashiCorp Vault.

The common denominator among these tools is that the secret is not stored hardcoded in the image, but rather the secret is fetched from an external source at runtime, after verifying that the client has the proper authorization to use the secret. This means the secret is not exposed when the image itself is exposed. This decoupling also helps keep the image running smoothly after a secret rotation.

Mistake #4. Fixing a leak by unpublishing the token

Because a token leak is a security vulnerability, the naive solution that often comes to mind is to remove the key from the code and publish a new, “fixed” version, same as with a CVE. However, the internet always remembers. Once a package is published in a repository, it could be automatically cached by any number of benign or malicious actors. These actors will still have access to the package even if it’s unpublished from its original source.

In the physical world, when you lose a key, the first thing you should do is change the lock. Similarly, when you lose an API key, you should immediately revoke and replace the key.

[Screenshot: secret tokens leaked in an .env file in version 1.1.1 of a package, “fixed” by unpublishing in version 1.1.2.]

Mistake #5. Exposing unnecessary assets publicly

Many of the images containing the leaked tokens we detected seemed to have no business being stored in public repositories. Internal test images, company development builds, and other private packages somehow ended up on popular public repositories and carried sensitive secrets along with them.

For example, a Docker image that contained tests for a blockchain-related framework was published on Docker Hub.

[Screenshot: the internal test image, publicly available on Docker Hub.]

Inside the Docker image, a private GitHub repository is pulled by using a hardcoded GitHub access token.

[Screenshot: a hardcoded GitHub access token used to pull the private repository inside the image.]

This is bad enough, because this private GitHub repository has now been exposed to attackers. But even more alarming is that this GitHub token gives complete admin access to the GitHub account, exposing all private repositories and giving complete access to the framework’s website.

Although there are multiple faults here, the fundamental issue is that this internal testing container should never have been published to a public Docker Hub registry.

By default, many development tools use a centralized repository that is publicly available and easy to crawl. Some companies may choose to rely on these central repositories and not roll their own private instances, which leaves the company assets exposed when artifacts are published.

Fortunately, for each of the repositories that we’ve scanned, there are open source solutions for deploying your own private repository easily.

  • Docker – Harbor is a CNCF-graduated container registry.
  • PyPI – private-pypi allows you to deploy a PyPI index privately.
  • npm – Verdaccio is a lightweight Node.js private proxy registry.
  • RubyGems – An official guide by RubyGems highlights several solutions such as Gemstash.
  • crates.io – The crates.io registry software itself is open source and can be easily deployed as a private instance.

For diverse environments that require supporting many types of artifacts with top-of-the-line DevOps capabilities, we recommend using JFrog Artifactory.

Giving back to the open source community

Although the initial goal of our research was to find and fix false positives in JFrog Xray’s new Secrets Detection tool, we stumbled upon a much greater volume of exposed active secrets than we envisioned. To complete the research, we privately disclosed all leaked secrets to their respective owners (when such owners could be identified) so that the secrets could be replaced or revoked as needed. 

Shachar Menashe is senior director of JFrog Security Research. With over 10 years of experience in security research, including low-level R&D, reverse engineering, and vulnerability research, Shachar is responsible for leading a team of researchers in discovering and analyzing emerging security vulnerabilities and malicious packages. He joined JFrog through the June 2021 acquisition of Vdoo, where he served as vice president of security. Shachar holds a B.Sc. in electronics engineering and computer science from Tel-Aviv University.

Andrey Polkovnychenko is a security researcher at JFrog Security Research. Skilled in reverse engineering, malware analysis, Arm assembly, x86 assembly, Python, and C, Andrey brings over a decade of experience working in the mobile security industry.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.