Job Title: Platform Reliability Engineer
Career Level - E

Introduction to role:
Join us as a Platform Reliability Engineer in our Commercial IT – SSD, Data, Analytics and AI Platform Success Team. Your primary focus will be to ensure the stability, performance, and reliability of our Data, Analytics, and AI systems. You will bridge the gap between development and operations by generating insights into sub-optimal processes and optimization opportunities. This role offers an exciting opportunity to integrate Agile, Lean, and SAFe practices within monitoring and observability initiatives and to continuously improve delivery cycle times.

Accountabilities:
As a Platform Reliability Engineer, you will be responsible for the evaluation, selection, and deployment of monitoring & observability technologies. You will manage and maintain monitoring infrastructure, ensuring it aligns with industry best practices. You will collaborate with DevOps, CriticalOps, and IT leadership teams to understand system requirements and design effective monitoring strategies. You will also develop and implement monitoring solutions for infrastructure, applications, and services.

Responsibilities:
- Ensuring the stability, performance, and reliability of Data, Analytics, and AI systems by implementing and maintaining robust monitoring and observability solutions.
- Designing, deploying, and managing monitoring tools and practices that provide insights into the health and performance of our data infrastructure and analytics processes.
- Bridging the gap between development and operations by generating insights into sub-optimal processes and optimization opportunities.
- Maintaining a working knowledge of platform architecture, along with business acumen.
- Integrating Agile, Lean, and SAFe practices within monitoring and observability initiatives to continuously improve delivery cycle times.
- Exploring and implementing new ways to automate systems, designing and testing automation processes, identifying quality issues, and supporting IT platform teams to eliminate defects and errors in product and platform development.
- Leveraging AIOps capabilities to uplift existing production operations.

Technology/Tool Management
- Evaluate, select, and deploy monitoring & observability technologies suitable for the organization's needs.
- Manage and maintain monitoring infrastructure, ensuring it aligns with industry best practices.

Monitoring & Observability Practice Management
- Collaborate with DevOps, CriticalOps, and IT leadership teams to understand system requirements and design effective monitoring strategies.
- Establish key metrics and KPIs that enable insights and analytics to achieve data-driven continuous improvement.
- Provide training and support to other teams on using monitoring tools effectively.
- Create and maintain documentation for monitoring and observability practices, including standard operating procedures and best practices.
- Stay abreast of industry trends, emerging technologies, and best practices related to monitoring and observability platforms.

Monitoring & Observability Implementation & Operations
- Develop and implement monitoring solutions for infrastructure, applications, and services.
- Design and configure alerting mechanisms to proactively respond to potential issues.
- Use monitoring tools to identify and troubleshoot issues in real time.
- Collaborate with other teams to resolve incidents promptly and prevent recurrence.
- Analyze monitoring data to identify performance bottlenecks and areas for improvement.
- Work with development and operations teams to optimize system performance based on monitoring insights.
- Implement automation scripts and workflows to streamline monitoring processes.
- Integrate monitoring solutions with existing frameworks for seamless operation.
- Identify and evaluate "self-healing" opportunities based on production issue trend analysis to inform the AIOps roadmap.

Essential Qualifications:
- Degree-level education in computer science, information technology, or a related field.
- Proven experience as a monitoring and observability engineer or in a similar role.
- Proficiency in developing monitoring capabilities and configuring integration with tools such as Prometheus, Grafana, Splunk, SumoLogic, DataDog, DynaTrace, etc.
- Strong scripting skills (e.g., Python) for automation in data environments.
- Familiarity with logging, tracing, and APM (Application Performance Monitoring) solutions.

Desirable Qualifications:
- Customer engagement experience.
- Knowledge of data processing frameworks (e.g., Apache Spark) and data storage solutions (e.g., data lakes, warehouses).
- Experience with data orchestration tools (e.g., Apache Airflow).
- Understanding of data lineage and metadata management.

Ready to make a difference? Apply today and be part of a team that has the backing to innovate, disrupt an industry, and change lives.