Site Reliability Engineer
Job Summary
As a Site Reliability Engineer, you will solve complex problems in a fast-paced, collaborative environment, supporting transformational projects in cloud technologies and automation for a leading SaaS product in the healthcare industry. You will drive reliability, scalability, and performance across applications, microservices, and infrastructure, partnering with cross-functional teams to ensure a highly available, resilient, and efficient platform.
Key Responsibilities
Monitoring & Alerting
- Develop and maintain comprehensive monitoring for system health, applications, microservices, dependencies, and infrastructure.
- Establish baseline metrics and set up customized alerts for deviations, collaborating with development teams to define key indicators and thresholds.
- Manage alert routing to appropriate audiences to prevent alert fatigue.
- Monitor and debug Microsoft Azure resources and assist support teams with technical troubleshooting.
- Oversee incident management processes, including triage, severity assessment, and coordination of war rooms with relevant stakeholders (DevOps, development, QA, customer service).
- Maintain on-call rotation to monitor, analyze, and resolve critical infrastructure issues and incidents, including emergency response.
- Document incidents and perform root cause analysis (RCA), ensuring thorough follow-up and continuous improvement.
- Lead postmortem reviews and drive action items to completion.
- Design and execute tabletop exercises and disaster recovery simulations (e.g., service/region interruptions, failover testing, traffic management) to validate high availability and resiliency.
- Document outcomes and collaborate with DevOps and development teams to implement infrastructure and application improvements.
- Create standard procedures for incident scenarios where fixes cannot be immediately implemented.
- Develop dashboards and reports to monitor system performance, error rates, resource consumption, latency, and other key indicators.
- Establish consistent reporting schedules for SLA, system utilization, and customer metrics.
- Provide feedback to development and DevOps teams to drive improvements in application and infrastructure performance.
- Analyze and optimize DevOps pipelines for efficient and reliable deployment operations.
- Drive automation initiatives to improve operational efficiency and reduce manual intervention.
- Apply progressive rollout strategies and canary deployments.
- Work closely with cross-functional teams across the organization, including QA, development, DevOps, and customer service.
- Clearly articulate technical concepts to non-technical colleagues and stakeholders.
- Foster a culture of accountability, respect, excellence, and customer service.
- Contribute to documentation and knowledge sharing across teams.
- Advocate for reliability and operational excellence in architecture and design discussions.
- Participate in capacity planning and scalability assessments to ensure systems can meet future growth and demand.
Required Qualifications:
Education & Experience Guidelines
- Bachelor's degree in computer science or a relevant field
- 8-10 years of relevant work experience
- Experience with Azure DevOps, Kubernetes, Docker, CI/CD, Azure, and AWS.
- Significant experience with database technologies especially Microsoft SQL.
- Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, ARM templates).
- Familiarity with observability platforms (e.g., Prometheus, Grafana).
- Scripting skills (e.g., PowerShell, Python, Bash) for automation and tooling.
- Knowledge of security best practices for cloud environments.
- Strong communication and collaboration skills.
- Excellent self-management and time management abilities.
- Creative problem-solving skills.
- Willingness to learn new technologies and adapt to a rapidly changing environment.
- Technology certifications (e.g., Azure DevOps Engineer) or willingness to obtain them.
- Occasional travel may be required.
Other Preferred Knowledge, Skills, Abilities, or Certifications:
- Security Certifications: HITRUST CSF Practitioner, CISSP
- AI/ML Integration Awareness: Supporting ML pipelines in DevOps workflows
- Policy-as-Code Tools: Open Policy Agent (OPA), HashiCorp Sentinel
- Disaster Recovery Planning: High-availability architecture and recovery strategies
- Cloud Cost Optimization: Performance tuning and resource efficiency
- Multi-Cloud Experience: Supporting hybrid or multi-cloud environments
Fortive 9 Behaviors by Level:
Executing and Contributing
Customer Obsessed: Understands the customer's needs through observation, questioning, and going to Gemba.
Strategic: Uses data to make informed decisions while anticipating future trends and aligning actions with organizational goals.
Innovation for Impact: Proactively explores new perspectives and experiments to solve day-to-day problems.
Inspiring: Understands how their work contributes to the organization's purpose.
Builds Extraordinary Teams: Actively fosters collaboration by contributing positively, supporting shared goals, helping others succeed, and celebrating team achievements together.
Courageous: Shows strength through action: moves quickly toward goals, embraces uncertainty, speaks up, and perseveres through challenges with confidence and integrity.
Delivers Results: Sets high standards and consistently delivers by focusing priorities, overcoming obstacles, and upholding organizational values.
Adaptable: Applies rigor by working thoroughly and following processes without cutting corners, while remaining adaptable.
Lead with FBS: Goes to Gemba: observes real-world processes, not just meetings. Embraces FBS by applying its fundamentals to improve work, engage in kaizen, and continuously grow knowledge and usage.
Required Experience:
IC
About Company
Fortive Corporation Overview
Fortive’s essential technology makes the world stronger, safer, and smarter. We accelerate transformation across a broad range of applications including environmental, health and safety compliance, industrial condition monitoring, next-generation product d ...