The Site Reliability Engineer Lead (SRE Lead) at Screening Eagle will lead a team of SREs to ensure the stability, resilience, and scalability of our services through automation, testing, and engineering.
This role involves leveraging expertise from product systems/operations, cloud infrastructure (AWS), build and release engineering, software development, and stress/load testing to guarantee our services are available, cost-efficient, and fit for purpose from the early stages of development.
Responsibilities:
- Develop and implement cloud infrastructure using Terraform.
- Optimize resources for cost-efficiency and performance.
- Ensure infrastructure security and implement service control policies (e.g., Control Tower).
- Configure AWS VPC flow logs, load balancer logging, Direct Connect, AWS VPN, TGX, etc.
- Implement robust monitoring and alerting systems.
- Set up and monitor CI/CD pipelines both on-premises and in the cloud.
- Enhance monitoring, logging, and alerting practices.
- Create prototypes and lead development teams in implementing solutions.
- Lead the SRE team, ensuring technical quality and best practices.
- Collaborate with developers and operations to integrate infrastructure changes.
- Document DevOps changes, technical partnerships, design, integration, testing, and deployment.
- Evaluate risks, customize applications, and lead quality practices.
- Focus on agile methodologies, test automation, and continuous integration.
- Simplify and automate complex processes to ensure quality and operational excellence.
- Improve the DevOps toolchain and streamline software delivery processes.
- Stop projects/products if solutions are not technically acceptable.

Minimum Requirements:
- 5+ years of experience developing AWS cloud infrastructure.
- 7+ years of experience leading teams.
- Extensive experience in implementing and evolving DevOps practices across multi-disciplinary teams and business frameworks.
- Strong background in leading technology change programs and managing projects.
- In-depth knowledge and experience with AWS services (EC2, S3, VPC, IAM, etc.).
- Expert-level proficiency in Terraform, including writing reusable modules and leveraging best practices.
- Highly skilled with Kubernetes, Terraform, serverless, and AWS in general.
- Proficient in non-functional testing, including performance, security, and cost optimization.
- Experience working with advanced architectures such as ARM and AWS Graviton.
- Knowledge of K8s operator programming and GPU-based architectures.
- Competent in using build tools and practices for different architectures.
- Expertise in Git and GitOps philosophy.
- Expert in logging and monitoring tools like ELK, Prometheus, and Grafana.
- Demonstrable MLOps experience.
- Ability to quickly gain domain knowledge.
- Operational experience in maintaining applications.