This post is about a recent chain of interconnected bugs and mistakes that we found. I feel there is learning in this tale of many interconnected bugs/mistakes…even if I cannot quite put my finger on what exactly that learning is.
So our story all begins with the great UI refactoring that is JENKINS-43507…
Ideally, any change should be small. Code review works best when the changes are small…but we also have to balance that with ensuring that the changes are complete, so while the refactoring started off as a series of small changes, there came a point of The Great Switcheroo™ where it was necessary to swap everything over, with new tests to cover the switch-over.
Ideally a lot of the preparation code could have been merged in small change requests one at a time, but it can be hard to test code until it can be used, and adding a change request that consists of new code that isn’t used (yet) and cannot be tested (yet) can make things hard to review… anyway that is my excuse for a collection of change requests that clock in at nearly 40k LoC.
If we take the GitHub Branch Source PR as an example of the code: github-branch-source-plugin#141
GitHub reports this as:
First off, in this refactoring of the GitHub Branch Source I made a simple mistake:
In all the changes in the PR, I was refactoring old methods such that they called through to the corresponding new methods (to retain binary API compatibility).
Pop-Quiz: Can you spot the mistake in maintaining binary API compatibility in this code?
For comparison, here is what the old code looked like:
The mistake is that the effective behavior of the code is changed. I had maintained binary compatibility. I had deliberately not maintained source compatibility (so that when you update the dependency you are forced to switch to the new method) but I was missing behavioral compatibility.
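To make the distinction between binary and behavioral compatibility concrete, here is a minimal sketch. It is not the plugin's actual code: the class SketchSource, its constructor signatures, and the trait names are all hypothetical stand-ins. The legacy constructor still links for old callers (binary compatibility), and it also has to reproduce the discovery defaults the old constructor implied (behavioral compatibility), because the new constructor starts with nothing configured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a source class migrating from boolean flags to traits.
class SketchSource {
    private final String owner;
    private final String repository;
    private final List<String> traits = new ArrayList<>();

    // New constructor: callers are expected to configure traits explicitly,
    // so it starts with no discovery behavior at all.
    SketchSource(String owner, String repository) {
        this.owner = owner;
        this.repository = repository;
    }

    // Legacy constructor, retained so already-compiled callers still link
    // (binary compatibility).
    @Deprecated
    SketchSource(String id, String owner, String repository) {
        this(owner, repository);
        // Behavioral compatibility: old callers relied on discovery being
        // enabled by default. Without these lines the class still links and
        // compiles, but a legacy-constructed instance discovers nothing.
        // The trait names are made up for this sketch.
        traits.add("discover-origin-branches");
        traits.add("discover-origin-pull-requests");
        traits.add("discover-fork-pull-requests");
    }

    // Returns a defensive copy of the configured traits.
    List<String> getTraits() {
        return new ArrayList<>(traits);
    }
}
```

A test of the behavioral contract would construct the object through the legacy constructor and assert on the resulting defaults, rather than configuring the instance explicitly first.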
The fix is to add these four lines:
So, I hear you ask, with 6k LoC of new tests, how come you didn’t catch that one?
The existing tests all called the now-deprecated setBuildOriginBranchWithPR(boolean) and similar methods in order to configure the branch and pull request discovery settings to those required for the test. Those methods were changed. Previously they were simple setters that just wrote to the backing boolean field. One of the points of this PR is to refactor away from 6 boolean fields with 64 combinations and replace them with more easily tested traits, so the setters now add or update the traits as necessary:
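As a rough illustration of that pattern (a sketch under assumptions, not the plugin's implementation), a legacy boolean setter can be rewritten to add or remove an entry in a trait collection. The class SketchDiscovery, the Trait enum, and the setBuildOriginBranch sibling method are hypothetical; only the name setBuildOriginBranchWithPR is taken from the text above, and the real traits are richer objects than enum constants.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: legacy boolean setters that delegate to a set of traits.
class SketchDiscovery {
    // Simplified stand-ins for traits; assumed for illustration only.
    enum Trait { ORIGIN_BRANCH, ORIGIN_BRANCH_WITH_PR }

    private final EnumSet<Trait> traits = EnumSet.noneOf(Trait.class);

    // Legacy setter: no backing boolean field any more, it updates the traits.
    @Deprecated
    void setBuildOriginBranch(boolean enabled) {
        update(Trait.ORIGIN_BRANCH, enabled);
    }

    @Deprecated
    void setBuildOriginBranchWithPR(boolean enabled) {
        update(Trait.ORIGIN_BRANCH_WITH_PR, enabled);
    }

    // Legacy getter: derives the old boolean from the traits.
    @Deprecated
    boolean getBuildOriginBranchWithPR() {
        return traits.contains(Trait.ORIGIN_BRANCH_WITH_PR);
    }

    // Returns a copy so callers cannot mutate the internal state.
    Set<Trait> getTraits() {
        return EnumSet.copyOf(traits);
    }

    // Adds the trait when enabled, removes it otherwise.
    private void update(Trait trait, boolean enabled) {
        if (enabled) {
            traits.add(trait);
        } else {
            traits.remove(trait);
        }
    }
}
```

Tests that always call such setters before asserting exercise the migration logic, but they never observe what a freshly constructed instance looks like.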
So because the tests were setting up the instance to test explicitly, they were not going to catch any issues with the legacy constructor’s default behavior settings, though they did catch some issues with my migration logic.
I used code coverage to verify that I had tests for all of the new methods containing logic…so of course I had added tests like:
Which were checking the branch point in the constructor…so when self-reviewing the code I looked at the 100% code coverage for the method and said Woot! (This was Mistake 2)
I did not have any tests that verified the behavioral contract of the legacy constructor.
Now these plugins have a semi-close coupling with BlueOcean, so one of our critical acceptance criteria includes verification against BlueOcean.
The first step in all that was to bump the dependency versions in BlueOcean to run the acceptance tests…
Now you may remember that I said that I had explicitly broken source compatibility for the legacy methods. This was in order to catch cases where people are assuming that the old getters / setters are exclusively the entirety of the configuration. If you are copying or re-creating a GitHubSCMNavigator instance via code and you use the legacy methods, the new instance will be invalid; your code needs to upgrade to the new traits-based API to function correctly.
So when I bumped the dependencies in BlueOcean without changing the code, my expectation was that the build would blow up because of the source incompatibility and it would then be a compiler-assisted replacement of the legacy methods… oh, but little did I count on this subtle behavioral change between ASM4 and ASM5…
Back in early May, Jesse spotted and fixed this issue with the ASM upgrade…
Without blaming anyone, the mistake here is that the BlueOcean code had not picked up the fix, so there were no compiler errors. The code compiled correctly.
This turns out to be fortuitous…
BlueOcean’s create flow for GitHub needs to reconfigure the GitHubSCMNavigator to add each repository that is “created” into the regex of named repositories to discover.
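For readers unfamiliar with the idea, one way to build such a regex is to quote each repository name and join the names with alternation. This is a hypothetical sketch, not BlueOcean's actual code; the class RepoPatternSketch and its patternFor method are invented for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds a regex matching exactly the named repositories.
class RepoPatternSketch {
    static String patternFor(List<String> repositoryNames) {
        return repositoryNames.stream()
                // Quote each name so characters such as '.' are matched literally.
                .map(Pattern::quote)
                // Join the quoted names as alternatives.
                .collect(Collectors.joining("|"));
    }
}
```

Each newly "created" repository would be appended to the list of names and the pattern regenerated from it.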
Now, in hindsight, there is a lot wrong with that… but the mistake was to recreate a new instance of the GitHubSCMNavigator each time, rather than reconfigure the instance.
In fact the original code even had a setter for the regex field:
So to some extent there was no need to replace the existing instance with every change:
But, in principle there should be nothing wrong with replacing it each time…and in any case the new repository may require the credentialsId to be updated and the pre-JENKINS-43507 code used a final field and did not provide a setter…
The mistake here was not to replicate the rest of the configuration. In effect, every time you created a pipeline on GitHub using the BlueOcean creation flow, you blew away any configuration that had been applied via the classic UI: JENKINS-45058…the code should really have looked like this:
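The general shape of the idea can be sketched with a hypothetical SketchNavigator class rather than the real GitHubSCMNavigator API. All names below are assumptions made for illustration. Because credentialsId is final, a new instance is needed, but everything else is copied across from the instance being replaced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a navigator whose credentials cannot be changed in place.
class SketchNavigator {
    private final String owner;
    private final String credentialsId; // final with no setter, hence the replacement
    private String pattern = ".*";
    private List<String> traits = new ArrayList<>();

    SketchNavigator(String owner, String credentialsId) {
        this.owner = owner;
        this.credentialsId = credentialsId;
    }

    String getOwner() { return owner; }
    String getCredentialsId() { return credentialsId; }
    String getPattern() { return pattern; }
    void setPattern(String pattern) { this.pattern = pattern; }
    List<String> getTraits() { return new ArrayList<>(traits); }
    void setTraits(List<String> traits) { this.traits = new ArrayList<>(traits); }

    // Builds a replacement with new credentials and pattern while carrying over
    // the rest of the configuration, so settings made elsewhere are not lost.
    static SketchNavigator replacePreserving(SketchNavigator old,
                                             String credentialsId,
                                             String pattern) {
        SketchNavigator replacement = new SketchNavigator(old.getOwner(), credentialsId);
        replacement.setPattern(pattern);
        // The step that was missing: copy everything that is not being changed.
        replacement.setTraits(old.getTraits());
        return replacement;
    }
}
```

Dropping the setTraits call reproduces the shape of the bug: the replacement silently falls back to whatever the constructor defaults are.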
So how did we discover all four of these mistakes?
Well my PR #1186 that bumped the plugin versions had test failures:
The fact that the compilation succeeded rather than failing as expected (because of the @Restricted annotation) exposed Mistake 3.
Then Mistake 4 actually exposed Mistake 1… if BlueOcean had been preserving the configuration on creation, these tests might not have failed…certainly a manual verification of the test scenario might have resulted in the test failure being chalked up as a bad test, but because the configuration was continually being reset to the constructor default, the manual verification forced Mistake 1 to the surface:
So Mistake 1 would have been caught if we didn’t have Mistake 2…
Once you catch a mistake in production code, you typically add tests, so part of fixing Mistake 1 was to also fix Mistake 2.
If it were not for Mistake 3, running the BlueOcean tests with the updated plugin versions would have required code changes that would probably have bypassed Mistake 4 and we would have missed Mistake 1 in making those code changes…
Without Mistake 4 we might not have found Mistake 1…
Without Mistake 1 we might not have found Mistake 4…
Four interrelated mistakes, and without any one of them we might not have found any of the others.
As I said at the beginning, I feel there is learning in this tale of many interconnected bugs/mistakes…even if I cannot quite put my finger on what exactly that learning is.
Hopefully you have enjoyed reading this analysis!