Own the reliability, availability, performance, and scalability of customer- and employee-facing platforms. Partner with application, infrastructure, security, and NOC teams to engineer resilient services and automate operations across Azure and on-prem environments. Drive incident response and post-incident reviews, implement observability, and continuously improve service health through automation and best practices.
Responsibilities:
Build and operate production platforms across Azure (e.g., AKS, App Services, Functions), Windows/Linux, and networking layers in partnership with Platform/Server/Network teams.
Engineer end-to-end observability: metrics, logs, and traces via Azure Monitor, Application Insights, Log Analytics, Prometheus, Grafana, and centralized logging.
Automate provisioning and configuration using Infrastructure as Code (Terraform/Bicep) and configuration management (Ansible/PowerShell DSC).
Design and maintain CI/CD pipelines (Azure DevOps/GitHub Actions) with automated testing, canary/blue-green deployments, and change control alignment.
Establish runbooks, SOPs, and self-healing automations to reduce MTTR and ticket volume from the NOC and Service Desk.
Harden platform security (identity, secrets, certificates, network segmentation) leveraging Azure Key Vault, managed identities, and policy guardrails.
Perform capacity planning, performance tuning, and cost optimization (FinOps) for compute, storage, and networking.
Partner with Data/ETL teams to ensure reliability of batch and streaming jobs, scheduling, and dependencies.
Create and maintain documentation (architecture, runbooks, dashboards) and support audits and compliance requirements.
Qualifications:
Bachelor's degree in Computer Science, Engineering, or equivalent experience.
2-5 years in SRE/DevOps/Platform Engineering with hands-on production ownership.
Proficiency with Azure services (AKS, App Services, Functions, Azure Monitor, Log Analytics, Application Insights).
Strong Kubernetes/Docker skills; Helm, ingress, and service mesh (e.g., Istio/Linkerd) experience is a plus.
IaC (Terraform or Bicep) and scripting (PowerShell and/or Python); Git-based workflows.
CI/CD (Azure DevOps or GitHub Actions), artifact management, and release strategies (canary/blue-green).
Observability tooling (Prometheus, Grafana, ELK/OpenSearch, Azure Monitor) and alert design to minimize noise.
Experience with ITIL processes (incident, change, problem) and tools (ServiceNow/Jira).
Knowledge of networking, DNS, TLS/certificates, load balancers, and security fundamentals.
Excellent troubleshooting, communication, and cross-functional collaboration skills.
Certifications such as Microsoft Azure Administrator/DevOps, CKA/CKAD, or ITIL Foundation are a plus.
Additional Information:
All your information will be kept confidential according to EEO guidelines.
Remote Work:
No
Employment Type:
Full-time
BETSOL is a cloud-first digital transformation and data management company offering products and IT services to enterprises in over 40 countries. The BETSOL team holds several engineering patents and is recognized with industry awards, and BETSOL maintains a net promoter score that is 2x the ...