The Global Data Insight & Analytics organization is looking for a Principal Software Engineer focused on building and driving the strategy forward for our internal Data Science / AI/ML products and platform. This role will work in a small cross-functional team. The position will collaborate directly and continuously with other engineers, business partners, product managers, and designers, and will release early and often. The team you will be working on is focused on building the Mach1ML platform, an AI/ML enablement platform to democratize Machine Learning across the Ford enterprise.
Engineer for Scale: Design and build our core Generative AI and ML products and platform end to end. You will own the technical blueprints for highly available, scalable, and modular systems, ensuring our infrastructure can handle petabyte-scale data and millions of API requests. You will design a decoupled system of microservices (using FastAPI and gRPC) and event-driven workflows (using Google Pub/Sub and Kafka) that allows teams to self-serve jobs, manage model lifecycles, and deploy inference endpoints with minimal friction.
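To make that pattern concrete, here is a minimal sketch of such a service, assuming a FastAPI front end that accepts a job request and publishes an event to a Google Pub/Sub topic for downstream workers; the project, topic, and field names are illustrative, not the platform's actual API.

    # Minimal sketch: FastAPI endpoint that queues a training job as a Pub/Sub event.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from google.cloud import pubsub_v1

    app = FastAPI()
    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names, for illustration only.
    TOPIC = publisher.topic_path("my-gcp-project", "ml-training-jobs")

    class TrainingJobRequest(BaseModel):
        model_name: str
        dataset_uri: str

    @app.post("/jobs")
    def submit_job(req: TrainingJobRequest):
        # Publish the job as an event; subscribed workers do the heavy lifting asynchronously.
        future = publisher.publish(TOPIC, req.model_dump_json().encode("utf-8"))
        return {"message_id": future.result(), "status": "queued"}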
Lead Hands-On Development: Be the lead developer and subject matter expert for our most complex technical challenges. You will write production-grade Python code for both backend services and frontend tooling, creating a seamless experience for our users. This includes developing robust APIs, data processing pipelines, and core platform components.
Productionize Cutting-Edge AI: Go beyond the notebook. You will be responsible for taking advanced AI/ML models, including LLMs, RAG systems, and agentic workflows, and productionizing them in a robust, repeatable, and monitored fashion. Your work ensures our AI is not just smart but also reliable and performant. This includes:
Serving: Containerizing models with Docker and deploying them on Google Kubernetes Engine (GKE) using sophisticated serving frameworks like KServe or NVIDIA Triton Inference Server, configured for auto-scaling on GPU node pools (H100s); see the deployment sketch after this list.
Optimization: Implementing advanced model optimization techniques like quantization, pruning, and knowledge distillation to reduce latency and cost for LLM inference.
Monitoring & Reliability: Building a comprehensive monitoring stack using Prometheus, Grafana, and OpenTelemetry. You will implement observability for everything: GPU utilization, inference latency, token costs, and model-specific metrics like data drift and output quality, using tools like LangSmith or Arize AI (see the instrumentation sketch after this list).
CI/CD for Models: Designing and implementing robust CI/CD (Continuous Integration/Continuous Delivery) pipelines using Cloud Build or Jenkins/GitLab CI, enabling automated RAG/agentic evaluation, A/B testing, and canary deployments.
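As a rough illustration of the serving workflow above, this sketch uses the official Kubernetes Python client to create a KServe InferenceService on GKE; the namespace, model URI, model format, and resource settings are assumptions for illustration, not the platform's real configuration.

    # Sketch: apply a KServe InferenceService custom resource with the Kubernetes client.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    inference_service = {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "llm-demo", "namespace": "ml-serving"},
        "spec": {
            "predictor": {
                "minReplicas": 1,
                "maxReplicas": 4,  # scaled out automatically by KServe/Knative
                "model": {
                    "modelFormat": {"name": "huggingface"},   # illustrative runtime choice
                    "storageUri": "gs://my-bucket/models/llm-demo",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="ml-serving",
        plural="inferenceservices",
        body=inference_service,
    )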
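And for the monitoring stack described above, here is a minimal instrumentation sketch using prometheus_client and OpenTelemetry; the metric names, span name, and call_model stub are hypothetical.

    # Sketch: expose inference latency and token-cost metrics and wrap calls in a trace span.
    import time
    from prometheus_client import Counter, Histogram, start_http_server
    from opentelemetry import trace

    INFERENCE_LATENCY = Histogram("inference_latency_seconds", "Model inference latency", ["model"])
    TOKENS_USED = Counter("llm_tokens_total", "Tokens consumed by LLM calls", ["model"])
    tracer = trace.get_tracer("mach1ml.inference")

    def call_model(prompt: str):
        # Stand-in for the real model client; returns (completion, tokens_used).
        return "stub completion", len(prompt.split())

    def generate(model_name: str, prompt: str) -> str:
        with tracer.start_as_current_span("llm.generate"):
            start = time.perf_counter()
            completion, n_tokens = call_model(prompt)
            INFERENCE_LATENCY.labels(model=model_name).observe(time.perf_counter() - start)
            TOKENS_USED.labels(model=model_name).inc(n_tokens)
            return completion

    start_http_server(9100)  # Prometheus scrapes metrics from this port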
Master the Deployment Ecosystem: Implement and manage the deployment of applications and services on our hybrid infrastructure, leveraging the best of Google Cloud Platform (GCP), or a comparable cloud platform, including GKE, Vertex AI, and BigQuery, and our on-premise High-Performance Computing (HPC) clusters for large-scale model training and inference. You will use Terraform to provision and manage a sophisticated infrastructure that seamlessly blends cloud and on-premise resources.
Champion Technical Excellence & Mentorship: As our most senior technical expert, you will lead through influence. Mentor other engineers, conduct rigorous design and code reviews, and establish the patterns and best practices that define our engineering culture. You will be the go-to person for our hardest technical problems.
Amplify Productivity with GenAI: Actively leverage and pioneer the use of Generative AI productivity tools (e.g., GitHub Copilot, internal code generation models, automated testing agents) to accelerate our development lifecycle and foster a culture of hyper-efficiency across the enterprise. You'll build the developer experience: create a clean, well-documented Python SDK and a set of REST/gRPC APIs that become the paved road for data scientists and application developers to interact with the platform.
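A minimal sketch of what such a "paved road" SDK client could look like follows; the class name, endpoint path, and parameters are hypothetical and intended only to show the shape of the developer experience.

    # Sketch: a thin, typed client data scientists could use instead of raw HTTP calls.
    import requests

    class Mach1MLClient:
        def __init__(self, base_url: str, api_token: str):
            self.base_url = base_url.rstrip("/")
            self.session = requests.Session()
            self.session.headers["Authorization"] = f"Bearer {api_token}"

        def deploy_endpoint(self, model_name: str, model_uri: str) -> dict:
            """Request an inference endpoint for a registered model."""
            resp = self.session.post(
                f"{self.base_url}/v1/endpoints",
                json={"model_name": model_name, "model_uri": model_uri},
            )
            resp.raise_for_status()
            return resp.json()

    # Usage a data scientist might write:
    # client = Mach1MLClient("https://mach1ml.internal.example", api_token="...")
    # endpoint = client.deploy_endpoint("churn-model", "gs://models/churn/v3")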
Software Engineering with Agentic AI: You will design and build sophisticated multi-agent systems from the ground up to automate complex segments of our SDLC. Your primary toolkit will include state-of-the-art frameworks like LangGraph for building stateful, cyclical agentic architectures, alongside CrewAI for multi-agent collaboration. Your work will involve:
Implementing Advanced Agentic Patterns: You will move beyond simple ReAct loops to design complex systems involving dynamic planning, self-reflection, and hierarchical tool use. You will build a robust library of custom tools (e.g., functions to interact with our codebase, databases, and internal APIs) that agents can intelligently select and execute; a custom-tool sketch follows this list.
Building Goal-Oriented AI Agents: You will design, build, and deploy autonomous systems such as:
A DevOps Agent that can autonomously diagnose production alerts by querying logs via the Splunk/Datadog APIs, inspect infrastructure state using the GCP APIs, and execute remediation plans like rolling back a deployment on GKE.
A Code Generation & Refactoring Agent that takes a Jira ticket as input, writes the initial Python implementation using FastAPI, generates corresponding unit tests with Pytest, runs the tests, and then iteratively refactors the code based on feedback from static analysis tools.
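As a sketch of the custom tools such agents would rely on, the example below wraps a Pytest run as a tool an agent can call and iterate on; it assumes the langchain-core @tool decorator (usable with LangGraph), and the test path and flags are illustrative.

    # Sketch: a custom tool that runs the test suite and reports results back to the agent.
    import subprocess
    from langchain_core.tools import tool

    @tool
    def run_pytest(test_path: str = "tests") -> str:
        """Run the Pytest suite for the given path and return the combined output."""
        result = subprocess.run(
            ["pytest", test_path, "-q", "--maxfail=5"],
            capture_output=True,
            text=True,
        )
        return f"exit_code={result.returncode}\n{result.stdout}\n{result.stderr}"

    # Bound to a LangGraph or CrewAI agent, the model can decide when to call run_pytest
    # and use the failures it reports to drive the next refactoring step.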
Experience: 10 years of professional software engineering experience, with a proven track record of designing, building, and operating large-scale distributed systems in a production environment.
Technical Leadership: Demonstrated experience as a technical lead, principal engineer, or staff engineer, where you were responsible for the architectural direction of a team or major project and mentored fellow engineers.
Expert-Level Python: Deep, authoritative knowledge of Python and its ecosystem. You have extensive experience building high-performance backend services (e.g., with FastAPI and gRPC) and data-intensive applications, and you understand the nuances of the language. You write clean, performant, and testable code with modern backend frameworks (FastAPI, Pydantic), testing tools (Pytest), and high-performance data libraries (Pandas 2.0).
Architectural Depth: You are a systems thinker who deeply understands architectural trade-offs, with a proven history of architecting for resilience, scalability, and maintainability and of designing modular, decoupled systems (e.g., microservices, event-driven architecture). You are a vocal advocate for best practices in API design, data modeling, and clean code, and you think in terms of APIs, data contracts, and long-term maintainability.
Production AI/ML Expertise: Proven hands-on experience productionizing machine learning systems (MLOps). You have expertise across the MLOps lifecycle, including hands-on experience with tools like MLflow, Kubeflow, and KServe/KFServing, and deep familiarity with the challenges of deploying and monitoring ML models, especially Large Language Models (LLMs).
Cloud & Infrastructure Proficiency: Extensive experience with cloud platforms, primarily Google Cloud (GCP) or a comparable cloud platform. You are an expert in containerization (Docker, Kubernetes/GKE) and infrastructure-as-code (Terraform). Experience with HPC environments (e.g., Slurm, MPI) is a significant plus.
Education: A Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent industry experience.
Required Experience:
Staff IC