Salary Not Disclosed
1 Vacancy
Location: Hartford, CT
Position: Senior SRE Engineer - (Cloud Platform)
Job Role: Lead the SRE implementation for frontend portal monitoring, reliability, and performance on Google Cloud Platform or Microsoft Azure.
Job Description:
- Design and implement comprehensive SRE monitoring for the web portal on GCP
- Set up JVM metrics collection and performance monitoring for Java applications using GCP Monitoring
- Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace
- Configure APIGEE monitoring and API performance tracking for portal services
- Implement distributed tracing with W3C Trace Context headers and OpenTelemetry
- Create drill-down dashboards with correlation between metrics, logs, and traces using GCP tools
- Integrate GCP Monitoring, Logging, and Trace with the existing Prometheus/Grafana stack
- Configure GMP (Google Managed Prometheus) for enhanced metrics collection
- Implement zero-code UI instrumentation for frontend monitoring and traceability
- Create RED (Request rate, Errors, Duration) dashboards for Performance and Production environments
- Build service health dashboards with drill-down capabilities and error message analysis
- Develop and maintain SRE automation/scripts within GKE namespaces (SRE and others) for monitoring, deployment, and troubleshooting.
Experience: 5 years in SRE/DevOps with proven implementation experience in JVM, APIGEE, GCP observability, the Grafana stack, GKE, OpenTelemetry, and UI instrumentation
Clear Skills Needed:
- Technical: Python, Linux, Prometheus, Grafana, Kubernetes, Docker, Loki, Tempo
- JVM Metrics: Java application monitoring, JVM performance tuning, heap analysis, garbage collection optimization for portal applications
- Logging & Tracing: Splunk, distributed tracing, log aggregation standards, correlation IDs across portal systems
- API Management: APIGEE experience, API monitoring, rate limiting, security, performance tracking for portal APIs
- Infrastructure: CI/CD pipelines, AI tools such as GitHub Copilot and Cursor
- Observability Tools & Query Languages: PromQL and InfluxQL for querying metrics (Grafana)
- Strong experience with Kubernetes (GKE), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Python, Bash, YAML, Helm).
Additional Critical Skills:
- Distributed Tracing Standards: W3C Trace Context headers implementation
- Structured Logging: JSON format with specific fields (traceid)
- Performance Baseline Establishment: Ability to collect and analyze 2-4 weeks of historical data for performance baselines
- Dashboard Implementation: Drill-down capabilities, service mapping from trace data, correlation between metrics/logs/traces
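For candidates who want a concrete picture of the tracing and structured-logging skills above, here is a minimal Python sketch, not a description of the employer's actual setup. It builds and validates a W3C Trace Context `traceparent` header (version 00 layout only) and emits a JSON log line carrying the trace ID. The function names and all log fields other than `traceid` (`timestamp`, `severity`, `message`, `spanid`) are assumptions chosen for illustration.

```python
import json
import re
import secrets
from datetime import datetime, timezone

# W3C Trace Context "traceparent" layout (version 00):
#   version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase hex only.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Create a traceparent value for a new trace (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the IDs from a traceparent value, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "traceid": trace_id,
        "spanid": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(message, severity, traceparent):
    """Build one JSON log line; every field except traceid is an assumed name."""
    ctx = parse_traceparent(traceparent) or {}
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "message": message,
        "traceid": ctx.get("traceid"),
        "spanid": ctx.get("spanid"),
    }
    return json.dumps(record)


if __name__ == "__main__":
    tp = new_traceparent()
    print("traceparent:", tp)
    print(log_line("portal page loaded", "INFO", tp))
```

Because the same `traceid` appears in the request header and in each log line, a dashboard can join logs to traces on that one field, which is the correlation the drill-down requirement describes.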
GCP-Specific Observability Skills (CRITICAL):
- Google Cloud Monitoring: GMP (Google Managed Prometheus), Cloud Monitoring dashboards, alerting policies
- Google Cloud Logging: Centralized logging, log-based metrics, log exports
- OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
UI Instrumentation & Frontend Monitoring (CRITICAL):
- UI Span Management: Naming conventions for UI-initiated spans, W3C Trace Context headers for frontend
- Frontend Observability: User session tracking, component-level monitoring, UI performance metrics
- Cross-Platform Tracing: End-to-end traceability from UI to backend services
Required Skills : Prometheus, Grafana, Google Cloud Platform (GCP), Google Cloud Logging, Kubernetes, web metrics, Cloud
Basic Qualification :
Additional Skills :
This is a high PRIORITY requisition. This is a PROACTIVE requisition.
Background Check : Yes
Drug Screen : No
Full-time