Position: Senior Messaging Platform SRE (Kafka & RabbitMQ)
We are seeking a Senior Messaging Platform SRE to own the reliability, scalability, and operational excellence of enterprise messaging and event-streaming platforms, including Confluent Kafka and RabbitMQ running on AWS and the Confluent Platform. This role is focused on platform operations, SRE practices, and infrastructure engineering, ensuring these platforms meet strict SLAs/SLOs for availability, latency, durability, and security. The engineer will be part of the Operations team and will be the single point of contact for Kafka and RabbitMQ issues.
Qualifications:
- 8 years of experience in SRE, Platform Engineering, or Infrastructure Operations.
- 3 years operating Confluent Kafka in production at scale.
- 2 years operating RabbitMQ in high-availability distributed environments.
- Strong hands-on experience with AWS-based deployments (MSK, EC2, EBS, ALB/NLB, IAM).
- Deep knowledge of Kafka internals (brokers, partitions, ISR, replication, rebalancing).
- Strong operational understanding of RabbitMQ internals (clustering, mirroring/quorum queues, flow control).
- Expertise in Kubernetes (EKS) for platform workloads and supporting microservices.
- Infrastructure-as-Code experience using Terraform and Helm.
- Advanced experience with monitoring, alerting, and logging platforms (Splunk, Prometheus, Grafana, ELK).
Responsibilities:
- Own the end-to-end reliability of Kafka and RabbitMQ platforms, including uptime, performance, capacity, and fault tolerance.
- Define and track SLOs and operational KPIs for messaging platforms.
- Lead incident response, root cause analysis (RCA), and post-incident reviews for Kafka and RabbitMQ outages.
- Operate and maintain Confluent Kafka and RabbitMQ clusters.
- Standardize operational runbooks for cluster lifecycle management, broker/node failures, rebalancing, and disaster recovery.
- Act as the primary escalation point for Kafka and RabbitMQ production issues.
- Mentor junior engineers and influence platform-wide SRE best practices.
- Partner with architecture, security, and application teams to evolve the messaging platform roadmap.
| Skill Category | Your Experience in Years |
| Confluent Kafka Platform Engineering | |
| RabbitMQ Platform Operations | |
| AWS Infrastructure for Messaging Platforms | |
| SRE Observability & Operational Excellence | |
| Kubernetes, IaC & Automation | |
Brandon Consulting Associates Inc. is an EQUAL OPPORTUNITY EMPLOYER and has been in business for 29 years.