Context
Say you inherit custody of large-ish git repositories, e.g. because of staff turnover, and these repositories may be made publicly accessible down the track. How do you systematically check that this internal development has no sensitive information, or downright secret keys, buried in the codebase, either in the current files or somewhere in the git history?
gitleaks
Upfront disclaimer: I found out about gitleaks only today. It looks good to me for what I want. This short post does not imply any suitability or fitness for purpose for your particular needs, if you have any.
After a very cursory search via Google today I landed on gitleaks (see also https://gitleaks.io/ and its blog). Out of curiosity I decided to give it a try on some git repositories, and to assess whether I could add some custom rules.
Log
Installation
In general I’d prefer to build tools from source if I can. For gitleaks I figured I needed sudo apt install golang to then try make build, but bumped into:
config/config.go:4:2: package embed is not in GOROOT (/usr/lib/go-1.15/src/embed)
detect/detect.go:8:2: package io/fs is not in GOROOT (/usr/lib/go-1.15/src/io/fs)
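which is a bit of a hurdle for folks not familiar with the Go ecosystem. Both embed and io/fs were added to the standard library in Go 1.16, so the Go 1.15 toolchain found here is too old to build gitleaks. Binary releases are readily available, so how you install is your choice. See this issue.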
Trial basic use
Pick a random repo to test, and see what happens:
gitleaks detect -r report.json
6:22PM INF 284 commits scanned.
6:22PM INF scan completed in 260ms
6:22PM WRN leaks found: 2
Uh oh… What? Thankfully this is a false positive: some generated R code uses arbitrary tokens, with no security implications. The report does identify the commit in which this information was added.
{
  "Description": "Generic API Key",
  "StartLine": 2,
  "EndLine": 3,
  "StartColumn": 14,
  "EndColumn": 1,
  "Match": "token: 10BE3573-1514-4C36-9D1C-5A225CD40393",
  "Secret": "10BE3573-1514-4C36-9D1C-5A225CD40393",
  "File": "path/to/some/R/RcppExports.R",
  "SymlinkFile": "",
  "Commit": "21324567890oiuasheroiuhawoiruhaoiuwrh",
  "Entropy": 3.6943858,
  "Author": "J-M",
  "Email": "some.guy@example.com",
  "Date": "2017-04-05T06:20:11Z",
  "Message": "Add R package 'blah' wrapping for mylibrary.dll",
  "Tags": [],
  "RuleID": "generic-api-key",
  "Fingerprint": "21324567890oiuasheroiuhawoiruhaoiuwrh:path/to/some/R/RcppExports.R:generic-api-key:2"
},
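On a larger repository the report can get long, so a quick tally per rule and file helps to spot clusters of false positives like the one above. A minimal sketch, assuming report.json is a JSON array of findings carrying the RuleID and File fields shown above; the function name summarise_findings is mine, not part of gitleaks:

```python
import json
from collections import Counter
from pathlib import Path


def summarise_findings(report_path):
    """Count findings per (RuleID, File) pair in a gitleaks JSON report."""
    # Assumption: the report is a JSON array of finding objects,
    # each with at least the "RuleID" and "File" keys.
    findings = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return Counter((finding["RuleID"], finding["File"]) for finding in findings)
```

Calling summarise_findings("report.json").most_common() then lists the noisiest rule and file pairs first.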
Custom rules
OK, I get the gist of the tool, and this is appealing.
I created a test repository with a readme, to further test the default rules and the upcoming custom rules.
I started with a segment of the readme like the following, which I expected to violate the rule “Generic API Key”:
username: abcdef
password: password
Nope.
I suspect this is because there is an entropy threshold above which a password is considered a real secret, rather than a dummy documentation artifact. So, let’s complicate the password:
password: ;oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i
Nope.
On a complete hunch: is it because the starting character is not a letter or a number?
password: oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i
Bingo! That was it. The underlying regex rules are quite complicated (to me at least), so I am not quite sure what is going on, but not flagging the line password: ;oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i may be a small flaw in the tool.
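To make the hunch concrete, here is a deliberately simplified pattern. It is not the rule gitleaks ships: SIMPLIFIED_RULE is a hypothetical stand-in, written with Python's re module, that only accepts letters, digits and a few punctuation marks in the secret. Under that assumption a leading ; is enough to stop the match:

```python
import re

# Hypothetical, simplified stand-in for a "generic key" rule.
# This is NOT the actual gitleaks regex, only an illustration of the hunch.
SIMPLIFIED_RULE = re.compile(r"password\s*:\s*([0-9a-z\-_.=]{10,150})", re.IGNORECASE)


def find_secret(line):
    """Return the captured secret, or None when the line does not match."""
    match = SIMPLIFIED_RULE.search(line)
    return match.group(1) if match else None


for line in (
    "password: ;oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i",
    "password: oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i",
):
    print(repr(line), "->", find_secret(line))
```

Whether the real rule behaves this way for the same reason would need checking against the default gitleaks.toml.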
I note that we have a high entropy in the reported match:
"Match": "password: oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i",
"Secret": "password: oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i",
"File": "README.md",
"SymlinkFile": "",
"Commit": "19fbedaa6e0cf9907d8779583aa391f49bb09a03",
"Entropy": 4.3028474,
Custom rule with a low entropy threshold
We create a file ~/.config/gitleaks.toml. Let’s say we want to catch anything that remotely looks like a password (though still restricted to lines containing the characters password followed, somewhere later on the line, by a colon):
# Title for the gitleaks configuration file.
title = "J-M's gitleaks config"
[extend]
# useDefault will extend the base (this) configuration with the default gitleaks config:
# https://github.com/zricethezav/gitleaks/blob/master/config/gitleaks.toml
useDefault = true
[[rules]]
# Unique identifier for this rule
id = "any-kind-of-password"
# Short human readable description of the rule.
description = "Any passwords"
regex = '''password.*:.*'''
# Float representing the minimum shannon entropy a regex group must have to be considered a secret.
entropy = 0
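Before running it, a quick check of what this pattern actually captures. The sketch below uses Python's re module purely for illustration (gitleaks is written in Go and uses Go's regexp engine, but a pattern this simple should behave the same way); the sample lines and the helper name matched_text are made up for the test:

```python
import re

# Same pattern as the custom rule above, compiled with Python's re for illustration.
CUSTOM_RULE = re.compile(r"password.*:.*")


def matched_text(line):
    """Return the part of the line the rule would capture, or None."""
    match = CUSTOM_RULE.search(line)
    return match.group(0) if match else None


# Made-up sample lines, not taken from any real repository.
sample_lines = [
    "password: password",
    "db_password = unset  # TODO: rotate",
    "Passwords are documented elsewhere: see the wiki",
    "username: abcdef",
]

for line in sample_lines:
    print(repr(line), "->", repr(matched_text(line)))
```

Note that the pattern is not anchored to the start of the line and is case sensitive, so it is broad in some ways and narrow in others; it is a blunt instrument by design.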
gitleaks -c ~/.config/gitleaks.toml detect -r report.json
{
  "Description": "Any passwords",
  "StartLine": 70,
  "EndLine": 70,
  "StartColumn": 2,
  "EndColumn": 19,
  "Match": "password: password",
  "Secret": "password: password",
  "File": "README.md",
  "SymlinkFile": "",
  "Commit": "5c4782a83541baa26e98f8c162d2eede1744d316",
  "Entropy": 3.0588138,
  "Author": "J-M",
  "Email": "blah@blah.au",
  "Date": "2022-11-11T06:13:53Z",
  "Message": "Try to trigger rule for password",
  "Tags": [],
  "RuleID": "any-kind-of-password",
  "Fingerprint": "5c4782a83541baa26e98f8c162d2eede1744d316:README.md:any-kind-of-password:70"
},
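The Entropy values in the two README findings can be reproduced by hand, which helps when picking a threshold. A minimal sketch, assuming gitleaks computes the Shannon entropy of the character frequencies of the reported Secret string, in bits per character; I only checked that assumption against the two values in this post:

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for candidate in (
    "password: password",
    "password: oiaspgoih-1514-4C36-9D1C-poainrpoiunpwe4i",
):
    print(f"{shannon_entropy(candidate):.4f}  {candidate}")
```

For these two strings the function gives approximately 3.0588 and 4.3028, in line with the reports above. A rule with entropy = 0 therefore filters nothing out on entropy grounds.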
Conclusion
This appears to be a suitable tool for auditing git repositories for more or less sensitive information, including with custom criteria. I did find the introductory material a bit lacking for newcomers, but this is far from a showstopper. And, I know what it’s like…
I recommend you read this post from the tool creator. Food for thought.