Site Reliability Engineer (SRE)
Location: Hybrid (partly remote, partly on-site at Worlds' Plano HQ), with preference given to candidates near the Dallas-Fort Worth Metroplex.
Reporting to: Leader of Client Delight
The Client Delight team is responsible for comprehensive service delivery and ensuring ultimate customer satisfaction, encompassing Implementation, Operations, Security, Customer Support, Change Management, and IT. This role involves close collaboration with Worlds Engineering & Development, Sales & Solutions Architecture, and other cross-functional teams to provide holistic management of the Worlds customer experience.
About Worlds: Worlds is an AI Platform that enhances visibility and automates physical operations by applying AI across existing camera networks. Our end-to-end solution enables enterprises to model, train, and build automation into their physical environments, helping them develop applications that measure, detect, and track objects in real time, impacting efficiency, safety, and security. Learn more at .
Job Summary: The Site Reliability Engineer (SRE) is a critical, hands-on role responsible for the deployment, monitoring, and operational support of the Worlds AI platform for our customers. The SRE ensures the reliability, scalability, and security of customer solutions, acting as the primary technical resource for implementation and incident management. This role requires a blend of cloud infrastructure expertise, automation skills, and a passion for customer success, ensuring our Fortune 500 clientele receive best-in-class service and support.
Key Responsibilities:
- Solution Implementation: Deploy, configure, and update new customer solution environments in Azure Kubernetes Service (AKS) and other cloud platforms (AWS, GCP, private cloud), utilizing infrastructure-as-code tools like Bicep and Helm charts.
- Custom Solution Integration: Work with our Forward Deployment Engineering team on custom development and integration of the Worlds app within the customer's operations. Provide monitoring guidance, support documentation, and solution health dashboards as required to manage the custom solution.
- Monitoring & Alerting: Implement, tune, and manage monitoring and alerting solutions using Prometheus and Grafana to meet customer SLAs and ensure optimal performance. Collaborate with core engineering to define and integrate application telemetry.
- Incident Management & Support: Provide production support and lead incident management processes following ITIL guidelines. Troubleshoot and resolve issues, escalating to DevOps for tool-related issues or to Core Engineering for any Worlds app stack issues (functionality or performance), with the goal of gaining knowledge to reduce escalations over time.
- Knowledge Management: Develop and maintain comprehensive customer runbooks in Confluence, documenting unique solution architectures and return-to-service procedures to ensure operational readiness.
- System & Performance Testing: Test new configurations, including performance and load testing, to validate solution stability and scalability.
- Security & Compliance: Adhere strictly to Worlds' Acceptable Use Policy (AUP) and Access Control Policy, operating with the principle of least privilege to ensure the security and compliance of customer environments.
- Customer Communication: Serve as a key technical point of contact for customers, communicating effectively on project status, incidents, and operational performance.
Qualifications & Experience:
- Networking (5 years): Deep experience configuring and troubleshooting TCP/IP networks, including subnetting, routing, firewalls, and VPN solutions (OpenVPN, WireGuard).
- Linux Administration (3 years): Proficient in building, troubleshooting, and managing Linux servers, including remote access, service verification, and log analysis.
- Cloud Administration (2 years): Demonstrable experience managing cloud solutions in Azure (required), with familiarity with AWS or Google Cloud as a bonus. Expertise in containerized solutions (Docker, Kubernetes), IaaS (VMs), DNS, IAM, and logging services is essential.
- Automation & Scripting: Experience with configuration management and container orchestration tools such as Ansible, Docker, Kubernetes, and Helm is highly preferred. Proficiency in scripting with Bash and Python for automation is required.
- Database: Ability to write and execute basic SQL SELECT queries for troubleshooting and data verification.
- IT Service Management: Experience with ITSM frameworks (ITIL) and tools (e.g., Jira Service Management) for incident and problem management.
- AI/ML: Experience with Artificial Intelligence (AI) and Machine Learning (ML) concepts is a plus; Worlds will provide training on our specific platform.
Personal Qualities:
- Passionate about delivering Client Delight and taking ownership of the customer experience.
- Ability to thrive in a fast-paced startup environment, iterating quickly on solutions and processes while driving the maturation of operations for security and efficiency.
- A proactive and collaborative mindset with excellent problem-solving and communication skills.
Perks and Benefits:
- 100% employer-paid medical coverage for employees and dependents.
- Comprehensive benefits, including dental, vision, 401(k), and disability.
- Flexible PTO policy.
- Employee stock options.
Qualified candidates should send a cover letter and resume to
The above statements are intended to describe the general nature and level of work performed by employees assigned to this job. They are not intended to be an exhaustive list of all duties, responsibilities, and qualifications.
Join us at Worlds and help shape the future of industrial operations with cutting-edge technology!