Role: Principal Kafka Support & Reliability Engineer
Location: Canton MA
Role Descriptions:

1. Tier 3 Incident Management and Escalation Support
- Act as the highest technical escalation point for Kafka production incidents (Sev 1, Sev …).
- Deep troubleshooting across:
  - Broker instability, controller elections, and ISR shrinkage
  - Under-replicated partitions and leader imbalance
  - Producer/consumer failures, lag spikes, and rebalance storms
  - Disk, network, JVM, and request handler saturation
- Provide hands-on remediation for complex issues, including:
  - Partition reassignment and leader rebalance
  - Broker configuration tuning
  - Throttle/quota strategies for noisy producers or consumers
- Coordinate with vendor support during service incidents, providing logs, metrics, and forensics.
- Work with Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering and Optimization
- Analyze Kafka workloads for performance and scalability risks:
  - Partition skew and hot partitions
  - Inefficient producer batching/compression
  - Consumer lag root cause analysis
  - Thread pool, I/O, and network bottlenecks
- Recommend and validate:
  - Topic design (partition count, replication factor, retention, compaction)
  - Producer and consumer configuration best practices
  - Quotas, quota enforcement, and multi-tenant controls
- Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability, and Resilience
- Diagnose and resolve systemic Kafka stability issues:
  - Repeated broker failures or flapping
  - Metadata/controller instability (ZooKeeper or KRaft)
  - Recovery issues following failovers or maintenance events
- Support resilience initiatives:
  - Multi-AZ cluster health validation
  - Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
  - Failover testing and validation
- Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade, and Configuration Leadership
- Lead medium- to high-risk Kafka changes, including:
  - Broker and cluster configuration changes
  - Partition expansion or large-scale reassignment
  - Topic policy changes impacting durability or performance
- Support and plan:
  - Kafka version upgrades
  - MSK/Confluent upgrade cycles
  - Client compatibility and rollout strategies
- Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis and Continuous Improvement
- Own RCA documentation for major incidents, with clear corrective and preventive actions (CAPA).
- Identify recurring failure patterns and architectural or platform-level improvements:
  - Automation opportunities
  - Guardrails and standards
  - Monitoring and alerting enhancements
- Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
Essential Skills: Role Overview

The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments. This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK, Confluent, cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.
Desirable Skills:
Keyword:
Skills: Digital: Kafka; Digital: Amazon Connect; Digital: Kubernetes
Experience Required: 10 & Above