Company: Vista
Our Team: Vista is the design and marketing partner to millions of small businesses worldwide, helping them promote their business professionally with quality printed and digital marketing products at an affordable price.
Over 80 product teams in our engineering department build and operate frontends, services, data pipelines, and ML models with a "You Build It, You Run It" principle.
Incident management at Vista is all about uncovering the unknown and helping our organization continually improve its operational posture.
It goes beyond pure remediation and plays a crucial role in turning individual incidents into learning for the organization as a whole.
The Incident Management team's primary goal is to reduce the likelihood of major incidents, minimize impact, and reduce the duration of incidents across the services we build and operate.
We support teams in identifying, triaging, and coordinating the remediation of production incidents, both large and small.
We help teams to be prepared for future incidents by advising on monitoring and alerting practices, running incident drills, and sharing best practices.
What You Will Do: The Senior Manager, Incident Management will oversee the Incident Management team within Vista Engineering.
This role is crucial for advancing our service operations and incident remediation maturity.
The successful candidate will drive initiatives to proactively improve our incident preparedness and our organization's maturity in running 500+ services.
They will optimize our processes and systems to reduce MTTD and MTTR, champion organizational learning from incidents, and strengthen communication and stakeholder relationships.
Leadership and Strategy: Develop and implement a strategy for the Incident Management team to improve incident identification, handling, reporting, and prevention.
Establish, align, and communicate clear priorities, objectives, and key results.
Lead, mentor, and develop a global, follow-the-sun team of incident managers.
Oversee incident resolution to minimize customer and business impact.
Improve our incident management procedures.
Enhance the quality, rigor, and technical depth of our post-mortem analyses.
Establish a culture of continuous learning and blameless retrospectives.
Ensure action items get implemented.
Continuously Grow Organizational Maturity: Define and implement organization-wide initiatives to measure the health of our systems (SLIs / SLOs) and to improve the triaging and resolution of incidents.
Share lessons learned to prevent the recurrence of similar issues.
Reporting and Metrics: Improve our key performance indicators (KPIs) to measure our ability to react to incidents.
Use data-driven insights to identify trends and areas for improvement.
Share and review regular updates and KPIs with the organization and senior leadership.