Data Scientist Lead
Job Summary
As a Data Scientist Lead within the Commercial & Investment Bank on the Healthcare Provider team, you will lead a team in building advanced solutions for image classification, text categorization, and intelligent data extraction from scanned documents. You will have deep proficiency in Python, PyTorch, TensorFlow, Hugging Face Transformers, and AWS SageMaker/Bedrock, along with hands-on experience with CNN/transformer architectures, OCR technologies, and multimodal document understanding models. This role involves managing the full ML lifecycle, from prototyping to production deployment on AWS EKS.
Job responsibilities
- Lead and mentor a team of data scientists in designing and executing advanced analytics and modeling projects focused on image classification, text categorization, and intelligent data extraction from scanned document images. Foster a culture of curiosity, analytical rigor, and continuous learning by developing team members in deep learning, computer vision, NLP, and document AI techniques.
- Define and drive the analytical strategy for document understanding use cases, identifying the optimal combination of computer vision, NLP, and multimodal approaches.
- Build and fine-tune multimodal document understanding and text categorization models. Leverage the interplay of textual content, spatial layout, and visual features to extract structured fields and key-value pairs from complex scanned documents, while enabling automated categorization, routing, metadata tagging, and entity extraction.
- Design rigorous experimentation and data quality frameworks, including A/B testing, cross-validation strategies, and statistical significance testing, to evaluate model performance and hyperparameter tuning. Establish best practices for annotation quality management, training data curation, active learning strategies, and ground truth validation to ensure high-quality labeled datasets.
- Design, manage, and optimize the workflows involved in preparing data for machine learning model training, and select the statistical or deep learning models best positioned to achieve business results.
- Develop and deploy models using Python and AWS SageMaker, managing the full lifecycle from exploratory data analysis and prototyping through production deployment, monitoring, and performance tracking. Collaborate with data engineers and ML engineers to ensure seamless integration of analytical models into production document processing pipelines and data workflows.
Required qualifications, capabilities, and skills
- Bachelor's degree, MS, or PhD in a quantitative discipline, e.g., Computer Science, Mathematics, Operations Research, or Data Science.
- 7 years of experience in data science or quantitative analytics, with at least 2 years of experience in document AI, computer vision, or NLP domains.
- Strong foundation in statistics, mathematics, and programming, including probability, mathematical modeling, and experimental design, with the ability to rigorously evaluate model performance.
- Advanced proficiency in Python for data analysis, modeling, and visualization, and deep experience in PyTorch, TensorFlow, Hugging Face Transformers, scikit-learn, OpenCV, pandas, NumPy, matplotlib, and seaborn.
- Hands-on experience with CNN and transformer architectures for document AI, including image classification, transfer learning, and feature extraction; multimodal document understanding combining textual, visual, and layout features; and NLP models for text categorization, sequence labeling, named entity recognition, and semantic analysis. Familiarity with additional computer vision models, including object detection, image segmentation, and Vision Transformers.
- Working experience with OCR technologies and image preprocessing for text extraction from scanned documents, with an understanding of OCR accuracy metrics, preprocessing optimization, and error analysis. Proficiency in image preprocessing techniques for scanned documents in TIF/PNG format, including deskewing, binarization, resolution enhancement, noise removal, and multi-page document handling.
- Hands-on experience with AWS SageMaker and Amazon Bedrock, including building, training, tuning, and deploying ML models in cloud-based production environments (notebook instances, training jobs, inference endpoints), as well as exploring foundation models and generative AI capabilities to augment document understanding and classification workflows.
- Experience with containerized deployments on AWS EKS for productionizing data science models and analytical services at scale.
- Proficiency in SQL, with strong working knowledge of Oracle databases for complex data extraction, transformation, and analysis of document metadata and extracted content.
- Working knowledge of Java and Groovy for collaborating with engineering teams and understanding enterprise application codebases.
- Strong understanding of annotation tools, active learning strategies, and training data management for supervised learning in document AI use cases.
Preferred qualifications, capabilities, and skills
- Domain expertise in the healthcare industry.
Required Experience:
IC
About Company
JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans ov ...