Thursday, July 31, 2014

TFS Batched Gated Build – Stopping the starvation



Until TFS 2013, only one gated check-in build was allowed to run at a time, which caused resource “starvation” in medium and large development teams.
A separate validation build ran for every check-in, so teams had to choose between a long queue, which delayed development and code sharing, and a short, insufficient validation process that rendered the gated build validation redundant.
No more: in TFS 2013 a new option was added, the batched gated build.
Let's take a step back and remind ourselves that the purpose of a gated build is to protect the product from being broken by a single developer's error. When the validation is short and quick (for example, compilation only) it provides little protection; on the other hand, adding validation steps (tests, etc.) “costs” valuable time.
For example, a 30-minute validation in a five-developer team can leave a check-in waiting over 2 hours in line for validation.
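To see how quickly a serial gated queue backs up, here is a back-of-the-envelope sketch in Python (the 30-minute build and five developers are just the numbers from the example above, not measurements):

# Rough illustration only: a serial gated check-in queue, one build at a time.
BUILD_MINUTES = 30   # average validation build length (the number from the example)
DEVELOPERS = 5       # check-ins arriving at roughly the same time

for position in range(1, DEVELOPERS + 1):
    # The Nth check-in in line waits for the (N - 1) builds ahead of it.
    wait = (position - 1) * BUILD_MINUTES
    print(f"check-in #{position}: waits {wait} min, finishes after {wait + BUILD_MINUTES} min")

The last check-in waits two full hours before its own 30-minute validation even starts.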
Batched gated builds help solve this issue.

When setting up the build definition trigger you can set the maximum number of shelvesets (check-ins) you want merged together while they wait in the queue.


This will cause several check-ins to run together in a single validation build.


The logic of the trigger is pretty simple: when a build is queued and the queue is empty, it starts right away. If a build is already running, the request is queued. After the running build completes, the server takes the next requests from the queue and batches them together, up to the number stated in the build definition.
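Here is a small Python sketch of that trigger logic as I understand it (purely illustrative: this is not the actual TFS scheduler, and the shelveset names and batch size are made up):

from collections import deque

MAX_BATCH = 4  # the "maximum" value from the build definition trigger (assumed here)

# Pretend these gated check-in requests piled up while a build was running.
queue = deque(["shelveset-A", "shelveset-B", "shelveset-C",
               "shelveset-D", "shelveset-E", "shelveset-F"])

def next_batch(pending, max_batch):
    """Take up to max_batch waiting requests and merge them into one gated build."""
    batch = []
    while pending and len(batch) < max_batch:
        batch.append(pending.popleft())
    return batch

while queue:
    batch = next_batch(queue, MAX_BATCH)
    print(f"validating {len(batch)} check-ins in a single build: {batch}")
    # ...unshelve and merge the shelvesets, then compile and test the result...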
In my experience this practice cuts wait time significantly and speeds up development and code sharing for the team.
This does not come without implications and concerns, though; here are five:

 What happens if a batch fails to merge while unshelving?
When several shelvesets are unshelved together there can be conflicts (this can happen even with a single shelveset, if its baseline is not the latest version of the code). By default, the build process template marks each build request in the failed batch for retry (only in the Get workspace and unshelve step), and the retry request states that when the shelveset is retried it will run without a batch.



The retry behavior options are:
Do not Batch – each failed request in a batch will be retried separately by the server.
Batch Dynamically – the build server will allow retried requests to be batched regularly in the queue.
Batch Isolated – the batch will only be retried with the requests it originally ran with.
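To make the three behaviors concrete, here is a toy Python model of where a failed request ends up under each option (this is not the TFS object model; the names and data structures are mine):

from enum import Enum

class RetryBehavior(Enum):
    DO_NOT_BATCH = "DoNotBatch"             # retried on its own, never re-batched
    BATCH_DYNAMICALLY = "BatchDynamically"  # re-queued normally, may join any new batch
    BATCH_ISOLATED = "BatchIsolated"        # retried only with its original batch mates

def requeue(failed_request, original_batch, behavior, queue):
    """Toy model of where a failed gated request goes after its batch fails."""
    if behavior is RetryBehavior.DO_NOT_BATCH:
        # A fixed batch of one: the server will not merge anything else into it.
        queue.append({"requests": [failed_request], "may_merge": False})
    elif behavior is RetryBehavior.BATCH_DYNAMICALLY:
        # Back into the normal queue; the trigger may batch it with new check-ins.
        queue.append({"requests": [failed_request], "may_merge": True})
    elif behavior is RetryBehavior.BATCH_ISOLATED:
        # The whole original group is retried together, and only together.
        queue.append({"requests": list(original_batch), "may_merge": False})
    return queue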

 How to set up automatic retry for a batch?
To have the build server automatically start failed requests ahead of the queue, you can use the “Force Retry” option.



 How to avoid an endless loop of automatic attempts?
The “Force” option should only be used together with the “DoNotBatch” behavior, to avoid an endless loop of failed builds.

 What happens if a batch fails to validate?
The retry requests activity in the default process template resides only in the “Get Workspace” step, so other failures are not treated the same way by default. You can, however, use it again (by creating a custom template) elsewhere in the workflow, with simple logic that automatically retries batched requests on their own, ahead of the queue, and marks an unbatched request for retry dynamically (not automatically, of course).

(Settings shown in the screenshots: batched request – Force = True, DoNotBatch; unbatched request – Force = False, BatchDynamically.)
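Here is a sketch of the decision logic such a custom step might implement, written as Python pseudologic for readability; in practice this would be an activity in the build process template (a Windows Workflow XAML file), and the retry helper here is purely hypothetical:

def on_validation_failed(request, batch, retry):
    """Illustrative logic for a custom "retry on validation failure" step.

    'retry' stands in for whatever mechanism the custom template uses to
    re-queue a gated request; it is not a real TFS API call.
    """
    if len(batch) > 1:
        # The failure may have been caused by any one of the merged check-ins,
        # so retry each request on its own, automatically, ahead of the queue.
        for member in batch:
            retry(member, behavior="DoNotBatch", force=True)
    else:
        # A single (unbatched) request really is broken: mark it for retry
        # dynamically, but do not force an automatic retry that would fail again.
        retry(request, behavior="BatchDynamically", force=False)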

 What is the optimal batch size?
Using this (Kung Fu) trick will indeed shorten the build queue, but beware of setting the batch size too small or too large. Keep in mind that with auto retry, each failed batch can take up to n+1 times the average build time (n being the batch size). Setting the batch size too small will not speed up queue progress, while setting it too large can hang the queue for a long time while the error is isolated.
A large batch size can increase the probability of merge conflicts as well.
My educated guess is to keep the batch size between 3 and 5; this should shorten the wait time significantly without blocking the queue for excessive time on a failure.
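A quick Python sketch of that worst case, again assuming 30-minute builds and the DoNotBatch auto-retry behavior described above:

BUILD_MINUTES = 30  # assumed average build length, as in the earlier example

# Worst case with auto retry and DoNotBatch: the failed batch build runs once,
# then each of its n requests is retried on its own -> up to (n + 1) builds.
for batch_size in (2, 3, 5, 8):
    worst_case = (batch_size + 1) * BUILD_MINUTES
    print(f"batch of {batch_size}: up to {worst_case} minutes to clear a single failure")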

To sum up: the batched build solution is optimal for using gated check-in validation without the resource starvation it used to cause. There are other issues to take into consideration, such as modifying the shelveset validation and merge process and customizing the build process to save time. Furthermore, with batched builds a single developer can block the queue for a long time [(batch size + 1) x (average build time)]. Analyzing, publishing and reporting the “shame list” of developers who checked in unvalidated code, causing resource starvation once more, can motivate your team to run local pre-validation, which will improve your developers as well as your code while keeping the product stable (win-win-win).

Till next time.