Job Title: Site Reliability Engineer (SRE)
Department: Technology
Location: Bangalore
Reporting To: Head of Infra
Tookitaki is looking for a Site Reliability Engineer (SRE) with 3-6 years of experience to help maintain and scale the infrastructure that powers our flagship products, FinCense and the AFC Ecosystem. As an SRE, you will work at the intersection of software engineering and infrastructure, ensuring high availability, performance, and scalability of our platforms.
You will collaborate with engineering, DevOps, and client success teams to operationalize deployments across on-premise, VPC, and Compliance as a Service (CaaS) environments while improving monitoring, automation, and incident response.
The SRE role is responsible for ensuring the reliability and efficiency of Tookitaki's production systems and environments. This includes building monitoring systems, improving deployment pipelines, automating routine operations, and responding to production incidents. You'll help build a resilient infrastructure that supports our mission to provide AI-driven solutions that prevent financial crime.
System Monitoring & Incident Management
Build and maintain monitoring, alerting, and logging systems using tools like Prometheus, Grafana, and ELK.
Respond to incidents and outages, conduct post-mortems, and implement corrective actions.
Infrastructure & Deployment Automation
Automate infrastructure provisioning and application deployment using Terraform, Ansible, or Helm.
Contribute to CI/CD pipelines to improve the reliability and speed of software delivery (GitLab CI, Jenkins, etc.).
Container & Orchestration Management
Manage and troubleshoot Docker containers and Kubernetes clusters, ensuring workload scaling, resource management, and health.
Support application updates, rollbacks, and blue-green or canary deployments.
Cloud & Platform Operations
Operate within AWS (preferred) or GCP environments (EC2, S3, VPC, IAM).
Monitor system availability and resource usage across environments.
Security & Reliability Enhancements
Implement and monitor TLS/SSL, RBAC, SSO, and secure API practices.
Support compliance and security audit activities by maintaining logs, access controls, and operational hygiene.
Collaboration & Documentation
Work closely with developers, infra engineers, and support teams to ensure production readiness.
Maintain playbooks, runbooks, and system documentation for reliability engineering activities.
Bachelor's degree in Computer Science, Engineering, or a related technical field.
3-6 years in Site Reliability Engineering, DevOps, Platform Engineering, or a related role.
Experience with production environments and live system debugging.
Kubernetes, Docker, and Helm: experience deploying and scaling services.
Linux administration and command-line debugging.
Hands-on with AWS (preferred) or GCP cloud platforms.
Scripting in Bash and Python for automation and monitoring tasks.
Experience with monitoring and alerting tools like Prometheus, Grafana, ELK, or Datadog.
Familiarity with databases (e.g., MariaDB, ScyllaDB) and SQL/CQL querying.
Strong problem-solving and debugging skills.
Ability to work in on-call rotations and high-pressure production environments.
Excellent communication and documentation abilities.
Operational Reliability: Ensures system uptime and performance through proactive monitoring and maintenance.
Automation Mindset: Reduces manual effort through scripting and tooling.
Incident Response: Quick identification and resolution of issues to minimize downtime.
Cross-Functional Collaboration: Works effectively with engineering, support, and infra teams.
Security Awareness: Applies best practices in infrastructure and platform security.
Maintain 99.9% uptime across production environments.
Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for critical incidents.
Increase automation coverage and reduce manual deployment steps.
High internal satisfaction from developers on CI/CD and platform reliability.
Compliance readiness and security log availability for audits.
Competitive compensation
Work on a globally recognized RegTech platform transforming financial crime prevention.
Exposure to cutting-edge AI and big data infrastructure (Spark, Kafka, ScyllaDB, Flink).
Full-Time