Who we are
Artmac Soft is a technology consulting and service-oriented IT company dedicated to providing innovative technology solutions and services to customers.
Job Description:
Job Title: AI/ML Inference Optimization Engineer
Job Type: W2/C2C
Experience: 8-15 years
Location: San Jose, California
Responsibilities:
- 8+ years of experience in MLOps, Machine Learning Engineering, or a specialized Inference Optimization role.
- A portfolio or project experience demonstrating successful deployment of high-performance, containerized ML models at production scale.
- Model-Specific Optimization: Analyze and understand the underlying logic and dependencies of various AI/ML models (primarily using PyTorch and TensorFlow) to identify bottlenecks in the inference pipeline.
- High-Performance Serving Implementation: Design, implement, and manage high-performance inference serving solutions using specialized inference servers (e.g., vLLM) to achieve low latency and high throughput (see the sketch following this list).
- GPU Utilization Optimization: Tune model serving configurations specifically for GPU hardware to maximize resource efficiency and meet performance targets in a production environment.
- Containerization for Deployment: Create minimal, secure, and production-ready Docker images for streamlined deployment of optimized models and inference servers across environments.
- Collaboration: Work closely with core engineering and data science teams to ensure a smooth transition from model development to high-scale production deployment.
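For illustration, here is a minimal sketch of the kind of serving configuration this role involves, using vLLM's offline Python API. The model name and parameter values are assumptions chosen for the example, not project specifics:

```python
# Minimal vLLM serving sketch (illustrative only; model and values are assumptions).
from vllm import LLM, SamplingParams

# Key GPU-utilization knobs: the fraction of GPU memory reserved for the engine,
# the maximum number of sequences batched concurrently, and the tensor-parallel degree.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    gpu_memory_utilization=0.90,
    max_num_seqs=256,
    tensor_parallel_size=1,
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of continuous batching."], params)
print(outputs[0].outputs[0].text)
```

In production, the same engine is typically exposed through vLLM's OpenAI-compatible server and packaged into a slim Docker image, per the containerization responsibility above.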
Required Skillsets:
AI/ML Domain Expertise
- Deep understanding of the AI/ML domain, with the core effort centered on model performance and serving rather than general infrastructure.
ML Frameworks
- Expertise in PyTorch and TensorFlow: Proven ability to work with and troubleshoot model-specific dependencies, logic, and graph structures within these major frameworks.
Inference Optimization
- Production Inference Experience: Expertise in designing and implementing high-throughput, low-latency model serving solutions.
- Specialized Inference Servers: Mandatory experience with high-performance inference servers, specifically vLLM or similar dedicated LLM serving frameworks.
- GPU Optimization: Demonstrated ability to optimize model serving parameters and infrastructure to maximize performance on NVIDIA or equivalent GPU hardware.
Deployment and Infrastructure
- Containerization (Docker): Proficiency in creating minimal, secure, and efficient Docker images for model and server deployment.
- Infrastructure Knowledge (Helpful but Secondary): General knowledge of cloud platforms (AWS, GCP, Azure) and Kubernetes/orchestration is beneficial, but the primary focus remains on model serving and optimization.