AWS Cloud Ops SRE
The AWS Cloud Operations / Site Reliability Engineer (SRE) is responsible for delivering secure, reliable, and scalable cloud infrastructure. This role covers Infrastructure as a Service (IaaS), AWS platform release activities, AMI lifecycle management, patching, infrastructure design documentation, Terraform scripting, and maintaining visibility into the application layer and how it functions in production environments. Experience with Harness for DevOps pipelines is a strong plus.
Required Qualifications
10 years in SRE, Cloud Ops, or DevOps with heavy AWS experience.
Strong hands-on experience with:
o AWS compute (EC2, ASG, EKS/ECS, Lambda)
o Networking (VPC, Route 53, SG/NACL, ALB/NLB)
o Storage (S3, EBS, EFS)
o Databases (RDS, Aurora, DynamoDB)
Expertise in AMI pipeline management, image building, and OS-level hardening.
Solid experience with Terraform or CloudFormation for IaC.
Demonstrated ability to troubleshoot AWS and application stack issues end-to-end.
1. AWS Platform Operations & Releases
Own and execute AWS platform release management across environments, including validation, regression checks, and readiness reviews.
Operate and evolve AWS core services: VPC, IAM, KMS, Route 53, networking baselines, proxy layers, and organizational guardrails.
2. Infrastructure as a Service (IaaS) using Terraform
Build, manage, and scale cloud infrastructure using Terraform as the primary IaC tooling.
Create reusable Terraform modules covering networking, compute, storage, EKS, and security.
Ensure IaC follows best practices: versioned, immutable, peer-reviewed, and automated through CI/CD.
3. Amazon EKS (Kubernetes) Operations
Deploy, manage, and maintain production-grade Amazon EKS clusters, node groups, and cluster add-ons.
Implement Kubernetes platform standards for security, networking, namespaces, RBAC, and secrets management.
Work closely with application teams to ensure workloads run reliably and securely within EKS.
Optimize cluster scaling, workload scheduling, resource limits, and performance tuning.
4. AMI Lifecycle & Image Management
Manage the complete AMI lifecycle: creation, CIS hardening, vulnerability scanning, tagging, publishing, and deprecation.
Build automated AMI pipelines using image builders, Packer (if applicable), and validation workflows.
Maintain golden images for EC2 fleets, containers, and hybrid workloads.
5. VIT (Vulnerability / Integration / Integrity Testing) & Patch Management
Lead the VIT process, including vulnerability assessments, remediation workflows, compliance tracking, and closure.
Own OS-level and image patching using AWS Systems Manager (SSM) Patch Manager and automated maintenance windows.
Generate patch baselines, dashboards, and compliance reports, and ensure measurable SLA adherence.
6. Observability & Application Layer Insights
Build and maintain an observability stack with CloudWatch, X-Ray, OpenTelemetry, and log analytics.
Establish deep visibility into application behavior, dependencies, performance, and error patterns.
Create golden-signals dashboards covering latency, traffic, errors, and saturation for both infrastructure and applications.
7. CI/CD & DevOps Automation
Implement and maintain CI/CD pipelines for infrastructure and application deployments.
Harness experience is an added advantage, particularly with its workflows, verification steps, and deployment strategies (canary, blue/green).
Integrate Terraform, AMI pipelines, EKS updates, and patch automation into CI/CD systems.
8. Reliability Engineering & Incident Response
Participate in the on-call rotation; lead incident triage and root cause analysis.
Build automation and runbooks to reduce operational toil.
Drive architectural improvements to increase availability, resiliency, and performance.
9. Documentation & Architecture
Produce high-quality Infrastructure Design Documents (IDDs), runbooks, DR procedures, release notes, and architectural diagrams.
Conduct operational readiness reviews, capacity planning, and cost-optimization assessments.