Roles & Responsibilities
End-to-End Reliability & Operations
- Take full ownership of availability, latency, scalability, and durability across all services and databases.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for critical systems.
- Lead incident response protocols, conduct blameless Root Cause Analyses (RCAs), and drive systemic fixes to improve MTTR and MTTD.
- Build production readiness frameworks and establish best practices for capacity planning, deployments, rollbacks, and change management.
Database Reliability & Architecture
- Ensure the end-to-end reliability of relational databases, NoSQL databases, caching layers, and streaming platforms.
- Design highly available, multi-region architectures, implementing robust cross-region replication and failover mechanisms.
- Formulate and implement comprehensive backup, restore, and disaster recovery (DR) strategies.
- Lead system design reviews with a strict focus on fault tolerance, scalability bottlenecks, data partitioning, and sharding.
Platform Automation & Tooling
- Build and evolve internal platforms for database provisioning, lifecycle management, and service deployment.
- Champion Infrastructure as Code (IaC) and GitOps practices to reduce operational toil through automation and self-healing systems.
- Define golden signals (latency, traffic, errors, saturation) and build comprehensive observability and tooling across the application, infrastructure, and database layers.
- Develop reusable frameworks for failover automation, chaos testing, and reliability validation.
Performance, Cost & Security
- Optimize system performance and drive cost efficiency across cloud infrastructure (compute, network, storage) and database usage (IOPS, replication, backups).
- Ensure systems comply with rigorous security and governance standards by implementing access controls, encryption (at rest and in transit), and audit logging.
The Impact You Can Create
As a Staff Engineer (IC4), you will act as a technical leader across the infrastructure, platform, and data layers. By blending Site Reliability Engineering (SRE) and Database Reliability Engineering (DBRE), you will:
- Drive the organization-wide reliability strategy and solve highly ambiguous, high-impact engineering problems.
- Influence system architecture across multiple teams, guiding product teams on resilient architecture patterns.
- Raise the overall engineering standards through mentorship, design leadership, and by operating with high ownership and autonomy.
Skills
- Cloud & Architecture: Strong expertise in distributed systems, multi-region architectures, Disaster Recovery (DR), and cloud platforms (AWS preferred).
- Databases & Streaming: Deep knowledge of Relational DBs (MySQL, PostgreSQL, Aurora), NoSQL (DynamoDB, Cassandra), Caching (Redis), and event-driven streaming systems (Kafka).
- Programming: Proficiency in coding with Python, Go, or Java.
- Systems & Observability: Strong understanding of Linux internals, networking, and storage systems, alongside hands-on experience with observability stacks like Prometheus, Grafana, and Datadog.
Qualifications
- Experience: 10 years of professional experience in SRE, DBRE, Infrastructure, or Platform Engineering.
- Technical Mastery: Proven hands-on experience managing high-scale production systems, reliability engineering, and complex incident management.
- Bonus / Preferred: Previous experience building Database-as-a-Service (DBaaS) offerings or robust internal platform engineering systems is highly preferred.
Success Measures
Your impact in this role will be measured by the following outcomes:
- Delivering a measurable improvement in overall system uptime and reliability.
- Driving a demonstrable reduction in incident frequency and Mean Time To Recovery (MTTR).
- Increasing system automation, resulting in significantly reduced operational toil.
- Achieving improved database performance alongside measurable cost efficiency gains.
- Successfully executing and deploying multi-region and Disaster Recovery (DR) initiatives.
Additional Information
At Freshworks, we have fostered an environment that enables everyone to find their true potential, purpose, and passion, welcoming colleagues of all backgrounds, genders, sexual orientations, religions, and ethnicities. We are committed to providing equal opportunity and believe that diversity in the workplace creates a more vibrant, richer environment that boosts the goals of our employees, communities, and business. Fresh vision. Real impact. Come build it with us.
Remote Work: No
Employment Type: Full-time