At T-Mobile, we invest in YOU! Our Total Rewards Package ensures that employees get the same big love we give our customers. All team members receive a competitive base salary and compensation package - this is Total Rewards. Employees enjoy multiple wealth-building opportunities through our annual stock grant, employee stock purchase plan, 401(k), and access to free, year-round money coaches. That's how we're UNSTOPPABLE for our employees!
Job Overview
Design, implement, and scale observability and incident response frameworks for AI infrastructure supporting Azure OpenAI, Glean, AI Gateway, and internal LLM-enabled applications.
Architect and scale observability frameworks to support real-time monitoring and incident response for distributed AI platforms across hyperscale environments (Azure, AWS, GCP), with Azure as the preferred cloud provider.
Champion and implement end-to-end telemetry, tracing, and logging across AI workloads, inference APIs, and LLM-integrated services using tools like Solo Application Gateway, LiteLLM, Splunk Olly, and Weave.
Develop and manage SLOs, alerting systems, and dashboards that provide insight into LLM performance, latency, throughput, and cost metrics.
Ensure the reliability and performance of AI systems through monitoring, alerting, and automation, including capacity planning and performance tuning.
Partner with AI Architecture, Security, and Engineering teams to enforce governance guardrails and compliance standards in platform operations.
Solve high-severity incidents related to platform or model performance, API errors, or scaling issues, and lead root cause analysis and resolution.
Guide and mentor junior SREs while contributing to a culture of operational excellence in AI/ML systems engineering.
Bachelor's Degree in Computer Science, Engineering, or a related field (Preferred)
Master's/Advanced Degree in Computer Science, Engineering, or a related field (Preferred)
7-10 years working in operations or DevOps environments (Required)
7-10 years solving customer-related issues and managing customer relationships (Required)
7-10 years: very adept with scripting (Bash, PowerShell, etc.) in addition to other programming languages (Java, C#, etc.); SQL also a plus (Preferred)
Proficiency with observability and tracing tools such as Splunk, Grafana, Prometheus, or Datadog APM.
Hands-on expertise in managing AI or ML infrastructure at scale, including OpenAI APIs, ChatGPT, Glean, AI Gateways, and LLM evaluation platforms.
Strong coding/scripting abilities in Python, Go, or Java, and experience with IaC tools such as Terraform and Azure Resource Manager (ARM) templates.
Deep understanding of container orchestration (Kubernetes), CI/CD pipelines, and automation standard processes.
Strong knowledge of SRE practices: monitoring, incident response, service level objectives (SLOs), and chaos testing.
AI RAG experience
GitLab deployment pipeline and SDLC deployment flow, including new feature definition, development, testing, and deployment.
Experience managing LLM observability (e.g., token usage, latency, scoring, golden set accuracy) and familiarity with frameworks like Weave or Weights & Biases.
Knowledge of LLMOps, model evaluation, and audit compliance strategies.
Background in AI/ML environments within large-scale enterprise or telecom settings.
Experience with Azure Data Factory, Snowflake, and API-based data ingestion and monitoring pipelines.
Familiarity with AI governance, policy enforcement, and secure API access control mechanisms.
Proficiency in site reliability engineering and automation tools. (Required)
Ability to solve complex technical challenges effectively. (Required)
Understanding of Agile and DevOps methodologies. (Required)
Experience in designing and maintaining CI/CD pipelines. (Required)
Ability to learn and adapt to new technologies quickly. (Required)
Understanding of software scalability, availability, latency, performance, security, and capacity management. (Required)
Experience in building applications on cloud native platforms. (Preferred)
Ability to collaborate effectively with various teams. (Required)
Understanding of how to deliver software faster, better, and cheaper. (Preferred)
Familiarity with statistical analysis and data collection, evaluation, and presentation. (Preferred)
The pay range above is the general base pay range for a successful candidate in the role. The successful candidate's actual pay will be based on various factors, such as work location, qualifications, and experience, so the actual starting pay will vary within this range.
At T-Mobile, employees in regular, non-temporary roles are eligible for an annual bonus or periodic sales incentive or bonus, based on their role. Most Corporate employees are eligible for a year-end bonus based on company and/or individual performance, which is set at a percentage of the employee's eligible earnings in the prior year. Certain positions in Customer Care are eligible for monthly bonuses based on individual and/or team performance. To find the pay range for this role based on hiring location.
At T-Mobile, our benefits exemplify the spirit of One Team Together! A big part of how we care for one another is working to ensure our benefits evolve to meet the needs of our team members. Full and part-time employees have access to the same benefits when eligible. We cover all of the bases, offering medical, dental, and vision insurance, a flexible spending account, 401(k), employee stock grants, employee stock purchase plan, paid time off and up to 12 paid holidays - which total about 4 weeks for new full-time employees and about 2.5 weeks for new part-time employees annually - paid parental and family leave, family building benefits, back-up care, enhanced family support, childcare subsidy, tuition assistance, college coaching, short- and long-term disability, voluntary AD&D coverage, voluntary accident coverage, voluntary life insurance, voluntary disability insurance, and voluntary long-term care insurance. We don't stop there - eligible employees can also receive mobile service & home internet discounts, pet insurance, and access to commuter and transit programs! To learn about T-Mobile's amazing benefits, check out.
Never stop growing!
As part of the T-Mobile team, you know the Un-carrier doesn't have a corporate ladder - it's more like a jungle gym of possibilities! We love helping our employees grow in their careers, because it's that shared drive to aim high that drives our business and our culture forward. By applying for this career opportunity, you're living our values while investing in your career growth - and we applaud it. You're unstoppable!
T-Mobile USA, Inc. is an Equal Opportunity Employer. All decisions concerning the employment relationship will be made without regard to age, race, ethnicity, color, religion, creed, sex, sexual orientation, gender identity or expression, national origin, religious affiliation, marital status, citizenship status, veteran status, the presence of any physical or mental disability, or any other status or characteristic protected by federal, state, or local law. Discrimination, retaliation, or harassment based upon any of these factors is wholly inconsistent with how we do business and will not be tolerated.
Talent comes in all forms at the Un-carrier. If you are an individual with a disability and need reasonable accommodation at any point in the application or interview process, please let us know by emailing or calling 1-. Please note, this contact channel is not a means to apply for or inquire about a position, and we are unable to respond to non-accommodation related requests.
Required Experience:
Staff IC
Full-Time