Meta Incident Management

* Please be aware this page is not linked on the main home page

Tasks
UX, UI, Research, Product strategy

Time
Apr 2020 – Oct 2021

I led design and product initiatives, building tools and introducing systems that helped software engineers record incidents and collaborate efficiently to mitigate system failures quickly.

A SEV is a ‘ticket’ that is created when an incident occurs within a product in the company. It documents the incident and alerts others who may be needed to help mitigate it or who may be affected by it.
Once a SEV has been mitigated, a review takes place to understand the root cause of the problem and to prevent recurring incidents by extracting learnings and creating follow-up tasks.

The challenge

There are approximately 2,000 SEVs a month, ranging from minor incidents to, very rarely, entire company outages.
The biggest problem was that too many tickets were not being reviewed, so learnings weren’t being extracted to prevent recurrence.

The entire SEV process and tooling were created without a designer, so the tooling was fragmented and not intuitive.

Users had to jump between multiple tools to plan and conduct reviews: Agenda, Calendar, Configurator, Google Docs/Quip, and Tableau.

As the company grew, this fragmentation created inefficiencies in reviewing SEVs, which meant that not enough SEVs were being reviewed. It was vital to address this, as the entire company relied on this tool.

Journey map

I conducted user research to identify all the various personas involved and outlined the entire SEV review process. 

A detailed project flowchart for creating and configuring a review series, including sections on personas, actions, pain points, opportunities, ideas, and a flow chart visualizing the process.

Old design

SEV tool and SEV Review tool

A computer screen displaying a project management or incident tracking software with a list of incident review tasks, their owners, status, scheduled and update dates.
Screenshot of a software interface titled 'SEV Manager'. The visible page reports an issue with a chef run failing on macOS and Linux, owned by Menlo Park, categorized as SEV 2, with status 'In Progress'. The summary states 'Chef runs for macOS & Linux failing', and the incident affected the 'CPE - Chef' area. The interface includes task links, a review section, and a list of primary on-call personnel.

Goal

The high-level goal was to help teams review incidents and reduce recurrence.
The main goal was to create seamless end-to-end review tooling by helping users:

  • Easily find SEVs to review

  • Triage SEVs to the right review meeting

  • Easily schedule regular review meetings

This can be achieved by:

  • Streamlining the different workflows into one tool, SEV Review Series.

  • Introducing automations and auto-populating data based on Series configuration

  • Linking to other tools for seamless communication: email, calendar and team chat

Process map

I mapped out the current high-level process and created an ideal journey, including how the process could be improved with automation.

Comparison of two flowcharts titled 'Configuring series' for current user journey and ideal user journey, with a dark theme, depicting detailed processes with pathways and decision points, organized in rows and columns.

Projects

We identified key areas to explore as separate projects to reach each goal.

Four screenshots of presentation slides related to SEV review processes and collaboration strategies with review owners and attendees.

HMW workshop

Using the qualitative research, I ran an ideation workshop with the team to gather high-level ideas, with the new process as the goal.

Screenshot of a presentation slide divided into three sections with notes and ideas about managing and reviewing SEVs (Service Event Tickets). The left section addresses finding SEVs for review, the middle discusses collaboration between coordinator and engineering leads on SEV selection, and the right covers surfacing SEVs that haven't been reviewed in their SLAs.

Data Capture

Screenshot of a presentation slide discussing creating SEV (Service Event) workflows, highlighting problems, goals, proposals, and user roles like owner and affected team member.

We realised that these projects depended on quality data capture when the SEV was created.
The data showed that 67% of SEVs were recurring incidents, and that most incidents were high severity, with few minor ones. We knew that this was caused by delays in SEV reviews.

Through qualitative research we discovered that people weren’t opening tickets for minor incidents before they had grown, as they were worried about filing false tickets and adding to the backlog of tickets needing review.

But when incidents are not filed early enough, engineers aren’t able to stop them from becoming large incidents, which in turn creates even more work for the team.

SEV Creation Form

The SEV Creation form is complex and contains 15 fields, not all of which are necessary for creating a SEV.

Screenshot of an enterprise service management dashboard showing a report on a failing chef run for macOS and Linux with details on the issue, status, and affected areas.

Measuring success

Measuring success for this project was not as straightforward as seeing a reduced number of SEVs.
Through qualitative research and an understanding of work pressures, we uncovered that a drop in SEVs could instead indicate that users were worried about adding admin work to the backlog of tickets.
As the higher level goal is the health of the company’s systems, we had to be careful to not focus on a reduced number of SEVs as a goal.

How might we 

  • Encourage engineers to open more low SEVs, without the worry of additional admin work

  • Make the SEV creation process quicker, only capturing relevant data

  • Ensure high quality data capture 

Grouping information

To reduce the form to only the fields required at creation, I assessed all the fields, then grouped and prioritised them.

Flowchart with sections labeled Summary, Escalation, Communication, Collaborate, and Tasks and Review, each containing bullet points with relevant topics.

Design concept

The main goal was to create a short, focused form with essential information only, so the design was an overlay window with a 3-step guided process.

A series of six screenshots demonstrating a step-by-step guide for managing SEV (Sev) types in a system. The process involves creating, setting up communication channels, escalating, and changing the status of SEVs based on their status on the left side of the interface. The screenshots include a blue and gray user interface with labels, buttons, and instructions overlaid with cyan and red annotation lines and notes.

Launch and results

After multiple rounds of user testing and feedback, we launched the new SEV creation flow to 3 controlled teams over a period of 2 weeks. This allowed us to test using real incidents and gave us confidence to launch to the wider company.

Results

  • We successfully decreased the average time to report an incident from 8 minutes to 3 minutes.

  • After 2 months we noticed that there were 17% fewer SEV 3s. We used this as a measure of success because SEV 3s were the most common type of SEV, and they had one root cause.

  • More pre-emptive and low SEVs were being created, which meant problems were being caught early.

This also resulted in fewer SEVs needing to be reviewed, which aligned with our original goal.

Screenshot of a computer screen showing a user interface for creating a Service Event (SEV). The form is labeled 'Create SEV' and has fields for Title, Type, Status, Stack, Level, Incident Impact, and Impacted areas, with values such as 'Like Button not working' and 'In Progress'.
Screenshot of a communication setup window within a user interface. The window shows options for setting up a chat, with tabs for 'Workplace Chat' and 'IRC'. It includes fields for adding people to the chat, a link to a template, and mentions of link creation for canvas and VC. The background shows a blurred project management interface with a sidebar, search bar, and a list of messages or notifications.

SEV Tool Re-design

Screenshot of a task management and reporting interface titled 'SEV Manager' showing details for a service issue 'Like button not working'. The page includes sections for summary, collaboration links, and status, with options for chat, video call, and live document sharing. The summary indicates the issue is 'In Progress' and owned by Ruben Huertas in London, with tags like 'Systems' and 'FBApp', and impacted areas including infrastructure and network.
Screenshot of an SEV management system interface showing a workflow for handling a 'Like button not working' incident, including sections for escalation, communication, and broadcasting updates.
Screenshot of a service management dashboard showing a task titled 'Like button not working,' with status 'In Progress.' The interface includes sections for comments, escalation incident, collaboration, tasks and issues, and a chart titled 'Issues with Suppression Values.'
