Senior Site Reliability Engineer
Location: NYC, NY (Hybrid)
What You'll Do:
- Support the SRE team in developing and implementing enhancements to support workflows, focusing on automation and efficiency improvements
- Handle technical escalations, troubleshoot complex FIX and API connectivity issues, and actively participate in on-call rotations during non-traditional hours to ensure rapid response and resolution
- Adhere to and administer incident and change management policies
- Coordinate incident resolution efforts and implement change management protocols to maintain and enhance system reliability
- Work closely with the Lithuania office to ensure smooth operation and alignment of SRE practices across time zones
- Coordinate incident postmortems and root cause analysis (RCA)
- Design, implement, and maintain comprehensive monitoring, logging, and tracing solutions (observability stack) to provide deep insights into system performance and user experience
- Partner with product and engineering teams to define clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs), managing error budgets to ensure service reliability meets business needs
Required Qualifications:
- 5 years in a senior SRE role or a similar position, demonstrating deep knowledge and expertise in site reliability engineering and operations
- Knowledge of the FIX protocol and its messages, including the ability to read FIX logs
- Familiarity with REST APIs and a strong understanding of API integration
- Proficient in Python and scripting for automation and system management, with a proven track record of developing and implementing automation solutions
- Expertise in SQL and transactional databases, including querying and troubleshooting
- Strong analytical and troubleshooting skills, with a proven ability to identify and resolve technical issues through root cause analysis
- In-depth knowledge of core networking concepts, including TCP/IP, routing, and DNS
- Familiarity with maintaining and troubleshooting systems in both cloud (AWS) and co-location (colo) environments
- Availability for flexible work hours and willingness to cover US market trading sessions, including L2 on-call coverage
- Knowledge of change management processes and risk management
Preferred Qualifications:
- Experience in the brokerage or financial industry.
- Proficient with cloud services, particularly AWS, and knowledgeable about cloud architecture best practices, including IAM, EC2, S3, and DynamoDB.
- Experience maintaining and supporting containerized systems with familiarity in orchestration tools.
- Knowledge of Infrastructure as Code (IaC) practices and tools such as Terraform or CloudFormation.
- Ability to manage and troubleshoot job scheduling tools like Rundeck or Apache Airflow.
- Advanced skills in managing containerized environments using Kubernetes and OpenShift.
- Practical experience with Confluent Cloud or RedPanda for event streaming architectures.
- Experience with API-based applications and a basic understanding of using the browser developer console for front-end debugging.