GitLab CI outage
Incident Report for Appirio
Postmortem

Appirio uses a hosted instance of GitLab to which we don’t have direct admin access. Our primary CI job runner ran out of disk space, causing CI jobs to fail.

We responded by creating another CI runner on the same hosted service, but this runner encountered errors indicating a misconfiguration that we couldn’t fix without admin access.

Our hosted service provider has reduced its SLA to a response time of 1 business day, and it did not respond to this High Severity incident as quickly as it has in the past.

After the incident had persisted for several hours without a response from the service provider, we decided to create a CI runner on a separate service to which we had admin access. That resolved the outage.

Our plan for some time has been to move to a fully self-hosted GitLab instance, and this outage gives us additional motivation for that. In addition, we are now running multiple redundant CI runners to provide failover support.
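
Because the root cause was a runner running out of disk space, one way such a condition might be caught earlier is a periodic disk-usage check on each runner host. The following is a minimal illustrative sketch only, not something this report describes us running; the function names, the checked path, and the 90% threshold are all assumptions.

```python
import shutil


def disk_usage_percent(path="."):
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_disk_low(used_percent, threshold_percent=90.0):
    """Return True when usage meets or exceeds the (assumed) alert threshold."""
    return used_percent >= threshold_percent


if __name__ == "__main__":
    # Hypothetical usage: run from cron on the runner host and alert on a warning.
    pct = disk_usage_percent(".")
    if is_disk_low(pct):
        print(f"WARNING: disk usage at {pct:.1f}%")
    else:
        print(f"OK: disk usage at {pct:.1f}%")
```

A check like this only reports the condition; what to do about it (cleaning caches, pausing the runner, or failing over to another runner) would depend on the setup.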

Posted Apr 12, 2019 - 01:28 EDT

Resolved
This incident has been resolved. We apologize for the disruption and are looking at ways to prevent this in the future.
Posted Apr 12, 2019 - 01:27 EDT
Investigating
Our GitLab CI runners are currently failing. We have escalated this issue and will resolve it as soon as possible.
Posted Apr 11, 2019 - 19:29 EDT
This incident affected: Appirio DX.