Until TFS 2013, only one gated check-in build was allowed to run at a time, which caused resource "starvation" in medium and large development teams.
Only one validation process ran for each check-in, causing either a long queue that delayed development and code sharing, or a short and insufficient validation process that rendered the gated build validation system redundant.
No more: TFS 2013 adds a new option, the batched gated build.
Let's take a step back and remind ourselves that the purpose of a gated build is to protect the product from breaking on a single developer's error. When it is short and quick (for example, only validating compilation) it provides little protection; on the other hand, adding validation steps (tests, etc.) "costs" valuable time.
For example, a 30-minute validation in a five-developer team can cause a request to wait over two hours in line for validation.
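The arithmetic behind that estimate is simple queueing: with serial gated builds, each request must wait for every build ahead of it to finish. A minimal sketch (the function name is mine, for illustration only):

```python
def max_wait_minutes(build_minutes, queued_requests):
    # In a serial gated check-in queue, the last request waits for
    # every request ahead of it to finish before its own build starts.
    return build_minutes * (queued_requests - 1)

# Five developers check in at roughly the same time, 30-minute validation:
print(max_wait_minutes(30, 5))  # -> 120 minutes of waiting, before its own 30-minute build
```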
Batched gated builds help solve this issue.
When setting up a build definition trigger, you can set the maximum number of shelvesets (check-ins) you want merged when queued. This causes several check-ins to run together in a single validation build.
The logic of the trigger is fairly simple: when a build is queued and the queue is empty, it starts right away; if a build is already running, the request is queued. After the running build completes, the server batches the next requests in the queue together, up to the number stated in the build definition.
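That drain logic can be sketched as a toy model (this is my own illustration, not TFS code): the first check-in finds an empty queue and builds alone, and requests that arrive while it runs are merged into batches of at most the configured size.

```python
def plan_builds(arrivals, max_batch):
    # The first check-in finds an empty queue and starts immediately on its own.
    builds = [[arrivals[0]]]
    waiting = list(arrivals[1:])
    # Requests that queued up while a build ran are merged, up to max_batch.
    while waiting:
        builds.append(waiting[:max_batch])
        waiting = waiting[max_batch:]
    return builds

print(plan_builds(["cs1", "cs2", "cs3", "cs4", "cs5"], 3))
# -> [['cs1'], ['cs2', 'cs3', 'cs4'], ['cs5']]
```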
From my experience this practice drops wait time significantly and speeds up development and sharing for the team.
It does not come without implications or concerns, though; here are five:
What happens if a batch
fails to merge while unshelving?
When several shelvesets are unshelved together there can be conflicts (this can happen even with a single shelveset, if its baseline is not the latest version of the code). By default, the build process template marks each build request for retry (only in the Get Workspace and unshelve step), and the retry request states that when the shelveset is retried it will run without a batch.
The retry behavior options are:
Do not Batch – each failed request in a batch will be retried separately by the server.
Batch Dynamically – the build server will allow retried requests to be batched normally in the queue.
Batch Isolated – the batch will only be retried with the requests it originally ran with.
How do you set up automatic retry for a batch?
To have the build server automatically start failed requests ahead of the queue, you can use the "Force Retry" option.
How do you avoid an endless loop of automatic attempts?
The "Force" option should only be used with the "DoNotBatch" behavior, to avoid an endless loop of failed builds.
What happens if a batch
fails to validate?
The retry-requests activity in the default process template resides only in the "Get Workspace" step, so other failures are not treated the same way by default. You can, however, use it again (by creating a custom template) elsewhere in the workflow, with simple logic that automatically retries batched requests on their own ahead of the queue, and marks an unbatched request for retry dynamically (not automatically, of course):
Batched request: Force = True, DoNotBatch
Unbatched request: Force = False, BatchDynamically
What is the optimal batch
size?
Using these (kung fu) tricks will indeed shorten the build queue, but beware of setting too small or too large a batch size. Keep in mind that, with auto retry, each failed batch can take up to n+1 times the average build time (n being the batch size). Setting the batch size too small will not speed up queue progress, and setting it too large can hang the queue for a long time while the error is validated.
A large batch size can also increase the probability of merge conflicts.
My educated guess is to keep the batch size between 3 and 5; this should shorten the wait time significantly without blocking the queue for excessive time on failure.
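To make the trade-off concrete, here is a small sketch of the worst-case cost of a failed batch with automatic non-batched retries (my own model; only the n+1 rule comes from the text above):

```python
def worst_case_failure_minutes(batch_size, avg_build_minutes):
    # A failed batch builds once as a group, then each of its n
    # shelvesets is retried alone: up to n + 1 builds in total.
    return (batch_size + 1) * avg_build_minutes

for n in (2, 4, 8):
    print(n, worst_case_failure_minutes(n, 30))
# A batch of 4 with 30-minute builds can cost up to 150 minutes on failure.
```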
To sum up: the batched build solution is optimal for using gated check-in validation without the resource starvation it used to cause. There are other issues to take into consideration, such as modifying the shelveset validation and merge process and customizing the build process to save time. Furthermore, with batched builds a single developer can block the queue for a long time [(batch size + 1) × (average build time)]. Analyzing, publishing, and reporting the "shame list" of developers who checked in invalid code, causing resource starvation once more, can motivate your team to run local pre-validation, which will improve your developers as well as your code while keeping the product stable (win-win-win).
Till next time.