TekWissen is a global workforce management provider headquartered in Ann Arbor, Michigan, that offers strategic talent solutions to our clients worldwide. Our client is a provider of digital technology and transformation, information technology, and services.
Location: Frisco, TX 75034
- The AIOps Engineer is responsible for integrating machine learning and advanced analytics into our existing monitoring and logging systems.
- This role will leverage artificial intelligence to automate routine operational tasks, detect anomalies proactively, and implement self-healing frameworks to enhance the stability and performance of our infrastructure.
- The ideal candidate will be proactive in identifying gaps, creating strategic roadmaps, and implementing phased improvements to achieve operational excellence.
Key Responsibilities:
- Apply machine learning algorithms to existing operational data (logs, metrics, events) to predict system failures and proactively address potential incidents.
- Implement automation for routine DevOps practices, including automated scaling, resource optimization, and controlled restarts.
- Develop and maintain self-healing systems to reduce manual intervention and enhance system reliability.
- Build anomaly detection models to quickly identify and address unusual operational patterns.
- Collaborate closely with SREs, developers, and infrastructure teams to continuously enhance the operational stability and performance of the system.
- Provide insights and improvements through visualizations and reports, leveraging AI-driven analytics.
- Create a phased roadmap to incrementally enhance operational capabilities and align with strategic business goals.
Required Skills and Qualifications:
- Strong experience with AI/ML frameworks and tools (e.g., TensorFlow, PyTorch, scikit-learn).
- Proficiency in data processing and analytics tools (e.g., Splunk, Prometheus, Grafana, ELK Stack).
- Solid background in scripting and automation (Python, Bash, Ansible, etc.).
- Experience with cloud environments and infrastructure automation.
- Proven track record in implementing proactive monitoring, anomaly detection, and self-healing techniques.
- Excellent analytical, problem-solving, and strategic planning skills.
- Strong communication skills and the ability to effectively collaborate across teams.
Preferred Experience:
- Background in DevOps/Site Reliability Engineering.
- Familiarity with containerization and orchestration platforms (Kubernetes, Docker).
- Experience in building scalable distributed systems.
This role is pivotal in enabling our organization to achieve and sustain operational excellence through intelligent automation and proactive monitoring practices.
Short summary: An experienced SRE with knowledge of how to implement AI/ML.
Mandatory Skills:
- Machine Learning & AI Frameworks (e.g., TensorFlow, PyTorch, scikit-learn)
- Monitoring & Observability Tools (e.g., Splunk, Prometheus, Grafana, ELK Stack)
- Automation & Scripting (e.g., Python, Bash, Ansible)
- DevOps / SRE Background
TekWissen Group is an equal opportunity employer supporting workforce diversity.