A fast-growing provider of AI-powered solutions is scaling its operations. With a strong customer base and increasing demand, the existing engineering team is under pressure to handle both infrastructure improvements and customer-facing support.
To meet this growth, the company is looking to add an Infrastructure Engineer to join a team of two (as the third engineer), supporting Kafka, Redis, OpenSearch, RabbitMQ, and ClickHouse for its products.
Tasks
- Manage, monitor, and optimize ClickHouse clusters in production, including schema design, query performance tuning, replication configuration, and capacity planning;
- Operate and maintain Kafka clusters, OpenSearch deployments, and other distributed systems, ensuring high availability and optimal performance;
- Deploy, configure, and manage containerized applications and stateful workloads on Kubernetes, implementing best practices for resource management and scaling;
- Implement and maintain GitOps workflows for infrastructure and application deployments, ensuring version-controlled and automated deployment processes;
- Design and implement comprehensive monitoring, logging, and alerting solutions for distributed systems, enabling proactive issue detection and rapid troubleshooting;
- Conduct performance analysis, identify bottlenecks, and implement optimizations across distributed systems to meet SLA requirements and improve system resilience;
- Create and maintain technical documentation, runbooks, and operational procedures while collaborating with development teams to ensure smooth integration and operations.
Requirements
- Hands-on experience operating distributed systems in production environments, with a strong understanding of distributed computing concepts, data consistency, and fault tolerance;
- Solid experience with ClickHouse, including cluster management, MergeTree engine families, data modeling, query optimization, and replication strategies;
- Practical experience deploying and managing applications on Kubernetes, including StatefulSets, persistent volumes, networking, and security configurations;
- Working knowledge of Apache Kafka (brokers, topics, partitions, consumer groups) and OpenSearch or similar search and analytics engines;
- Experience with GitOps practices and Infrastructure as Code tools (Terraform, Helm, or similar), with the ability to manage infrastructure through declarative configuration;
- Proficiency with monitoring and observability platforms (Prometheus, Grafana, or similar) and experience implementing metrics collection and alerting strategies;
- Hands-on experience with at least one major cloud platform (AWS, GCP, or Azure), including compute, storage, and networking services;
- Strong scripting and programming skills in Python, Go, or Bash for automation, tooling development, and operational tasks.
Nice to have:
- Experience with other distributed data systems (Redis, Spark, Flink, etc.);
- Knowledge of data streaming patterns and event-driven architectures;
- Strong analytical and troubleshooting skills, with the ability to diagnose complex distributed systems issues, coupled with clear communication skills for cross-functional collaboration.
Benefits
Working conditions:
- This role is available only to candidates based in Croatia, Serbia, Portugal, or Poland;
- Duration: 1 year, with the possibility of extension;
- Locations: Serbia, Portugal, Croatia, Poland;
- Overlap: until 11:00 AM PST at most;
- Employment Type: Full-time.