Role: GCP Agentic Platform Support Lead
Location: New York, NY 10019 (hybrid; local candidates needed)
Client: Persistent
Detailed JD:
The Platform Support Lead will set the foundation and requirements for support on the GCP Data & AI platform. They will define standards for platform health, manage incident resolution, and execute routine maintenance to support the platform. They will also develop GCP Cloud Logging and Monitoring reports to provide visibility across the platform.
Activities include:
1. SLA & Reliability Reporting
   1. Establish the initial framework for tracking Mean Time to Repair (MTTR) and Mean Time Between Failures (MTBF)
   2. Configure self-service billing and uptime dashboards for Con Edison stakeholders
2. Foundation Maintenance & Optimization
   1. Develop and deploy the initial suite of Cloud Logging and Monitoring reports to establish platform visibility
   2. Monitor GCP billing for anomalies (e.g., BigQuery slot spikes) and implement tactical fixes to ensure budget adherence
   3. Build and maintain the Golden Path runbooks so that operational procedures are documented as they are established
3. Platform Monitoring & Incident Management
   1. Conduct solo reviews of overnight batch processing logs (e.g., Cloud Composer, Dataflow) to verify completion and identify failures before the business day progresses
   2. Receive and prioritize platform-related tickets; determine whether issues stem from infrastructure, pipelines, or upstream sources
   3. Execute root cause analysis (RCA) and apply fixes for code-based failures, IAM errors, or configuration drift
   4. Act as the primary technical point of contact for Google Cloud Support or Con Edison Source System teams (SAP, GIS) when issues are external to the platform
4. Minor Enhancements (Capacity-Based)
   1. Maintain a prioritized backlog of minor requests to be addressed only after platform stability and incidents are managed
   2. Within available bandwidth, execute minor schema updates, ingestion schedule tweaks, or IAM modifications
Workstream Deliverables:
1. Operations Runbook: The definitive resource reflecting current operational procedures and recovery steps (MS Word)
2. Integrated Health & Cost Reporting: Automated tracking of service uptime and GCP spend via Cloud Monitoring (Cloud Monitoring Reports)
3. Unified Incident & RCA Logs: A centralized record of Critical/High severity incidents and their resolutions stored in the agreed management tool (ServiceNow/Jira or similar)
4. Recovery & Maintenance Code: Validated code merged into the repository for bug fixes and configuration updates, including detailed release notes (GCP Code)