A very important GitHub feature for teams is to configure branch protections. These typically include requiring pull requests and a minimum number of reviews, preventing force pushes to main branches, signing commits, and most importantly, passing status checks.
Status checks can be added by external services to commits to signal issues. In addition to external services, GitHub Actions also reports workflow runs as status checks. This feature can be used to set up required CI steps (building the codebase, running tests, linting, etc.) that must pass before allowing to merge pull requests.
At Hygraph, we recently observed that GitHub would allow us to merge pull requests even though workflows failed. As this could lead to dangerous situations where broken code is merged based on false assumptions, I started investigating the matter.
After reading up on a couple of forum posts and performing my own experiments to reproduce the issue, I came to the conclusion that combining the job matrix feature with required status checks leads to issues. We’ll first walk through a short primer on this specific feature, then see what is failing, and follow up by thinking about possible solutions.
GitHub Actions Job Matrix or Sharding Jobs
The job matrix feature is a great solution when you have multiple targets to run the same job against, or if you want to parallelize workflows like slow-running tests. You can define a matrix strategy, which will then schedule a dynamic number of jobs based on your requirements. Every job receives its own name.
jobs:
build:
strategy:
matrix:
node: [10, 12, 14]
steps:
# Configures the node version used on GitHub-hosted runners
- uses: actions/setup-node@v2
with:
# The Node.js version to configure
node-version: ${{ matrix.node }}
I wrote a comprehensive post on the job matrix strategy last fall.
Failing jobs will skip subsequent steps
Whenever a job fails in CI, all subsequent jobs are skipped. This behaviour makes sense, as you want to fail fast in most cases. However, skipped jobs are not recognized by the required status check enforcement, allowing to merge in code even though the required step was never run.
As an example, take the following GitHub Actions workflow
name: Test
on:
pull_request:
jobs:
test:
runs-on: ubuntu-20.04
steps:
- run: exit 1
ci:
runs-on: ubuntu-20.04
needs: test
steps:
- run: exit 0
We expect the test workflow to fail, after which the ci
step will be skipped. If we check the GitHub status, we’ll see the issue with this process.
We can merge in our pull request even though the ci
step never ran. The same happens when we use the job matrix on the first workflow. That’s really bad. So what can we do to solve this?
Requiring job matrix shards
A rather naive fix is that you can mark partial jobs from the job matrix as required. However, this also means that every time you add a new matrix item, you’ll have to update your branch configuration. It would be much better to have a required second job that always runs but only passes when all shards or previous jobs actually passed.
Conditional success output
After a lot of experimenting, I came across a pattern that works both for all possible configurations: Running two jobs after the build/test/etc. step. After running the tests, a success job will only run if all jobs before it succeeded. If any job before it failed, it will be skipped. In order to make it work with required status checks (to avoid allowing merges of skipped checks), we introduce another job, which always runs and only passes in case the success job ran.
For this, we use GitHub Actions job outputs. These can be declared on jobs to reuse values in downstream jobs.
First, let’s check out the job running after our tests. In this case, I used the job matrix strategy to parallelize tests.
after-shards:
needs: shards # run after shards
runs-on: ubuntu-20.04
if: success() # only run when all shards have passed
# store success output flag for ci job
outputs:
success: ${{ steps.setoutput.outputs.success }}
steps:
- id: setoutput
run: echo "::set-output name=success::true"
In this job, we merely set an output value to true in case we run this job. This only happens when all previous jobs it depends on have passed.
ci:
runs-on: ubuntu-20.04
if: always() # always run, so we never skip the check
needs: [shards, after-shards]
steps:
# pass step only when output of previous after-shards job is set
# in case at least one of the shard fails, after-shards is skipped
# and the output will not be set, which will then cause the ci job to fail
- run: |
passed="${{ needs.after-shards.outputs.success }}"
if [[ $passed == "true" ]]; then
echo "Shards passed"
exit 0
else
echo "Shards failed"
exit 1
fi
The second step reads the previous job’s output and passes only if it’s set. The key piece of the second job is that it always runs, irrespective of any failed predecessors. Not adding always()
will lead to the original issue, where the required job is skipped. Last but not least, make sure to require the ci
job in your branch protection settings.
Below is the complete test action, including an easy way to simulate failing tests to check different scenarios.
name: Test
on:
pull_request:
jobs:
shards:
strategy:
matrix:
shards: [1, 2, 3, 4]
runs-on: ubuntu-20.04
steps:
- run: echo doing stuff ${{ matrix.shards }}
# simulate failure in one or more shards
- name: fail one
run: exit 1
if: ${{ matrix.shards == 1}}
- run: exit 0
after-shards:
needs: shards # run after shards
runs-on: ubuntu-20.04
if: success() # only run when all shards have passed
# store success output flag for ci job
outputs:
success: ${{ steps.setoutput.outputs.success }}
steps:
- id: setoutput
run: echo "::set-output name=success::true"
ci:
runs-on: ubuntu-20.04
if: always() # always run, so we never skip the check
needs: [shards, after-shards]
steps:
# pass step only when output of previous after-shards job is set
# in case at least one of the shard fails, after-shards is skipped
# and the output will not be set, which will then cause the ci job to fail
- run: |
passed="${{ needs.after-shards.outputs.success }}"
if [[ $passed == "true" ]]; then
echo "Shards passed"
exit 0
else
echo "Shards failed"
exit 1
fi
This was quite an adventure. Noticing that you can merge broken code when you rely on CI is a DevOps nightmare come reality. I do understand why GitHub Actions work the way they do, but the integration with required status checks isn’t optimal.