Key Responsibilities
Design, build, and maintain robust, scalable data pipelines using PySpark (see the pipeline sketch after this list)
Leverage the Hadoop ecosystem (HDFS, Hive, etc.) for big data processing
Develop and deploy data workflows in cloud environments (AWS, GCP, or Azure)
Use Kubernetes to manage and orchestrate containerized data services
Collaborate with cross-functional teams to develop integrated data solutions
Monitor and optimize data workflows for performance, reliability, and security
Follow best practices for data governance, compliance, and documentation
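For concreteness, here is a minimal sketch of the kind of PySpark pipeline this role involves. It is a hedged illustration only: the app name, HDFS paths, and column names (user_id, event_ts) are hypothetical placeholders, not references to any actual client system.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: paths and column names are placeholders.
spark = (
    SparkSession.builder
    .appName("daily-events-etl")
    .getOrCreate()
)

# Read raw event data from HDFS (or S3/GCS in a cloud deployment).
events = spark.read.parquet("hdfs:///data/raw/events/")

# Basic cleaning and transformation: drop malformed rows,
# derive the event date, and aggregate activity per user per day.
daily_activity = (
    events
    .dropna(subset=["user_id", "event_ts"])
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

# Write the curated table back out, partitioned for downstream queries.
daily_activity.write.mode("overwrite").partitionBy("event_date").parquet(
    "hdfs:///data/curated/daily_activity/"
)
```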
Requirements
Proficiency in PySpark for ETL and data transformation tasks
Hands-on experience with at least one cloud platform (AWS, GCP, or Azure)
Strong grasp of Hadoop ecosystem tools such as HDFS and Hive
Practical experience with Kubernetes for service orchestration
Proficiency in Python and SQL
Experience working with large-scale distributed data systems
Familiarity with tools like Apache Airflow, Kafka, or Databricks (a minimal Airflow sketch follows this list)
Experience working with data warehouses like Snowflake, Redshift, or BigQuery
Exposure to MLOps or integration of AI/ML pipelines
Understanding of CI/CD pipelines and DevOps practices for data workflows
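As a hedged illustration of the orchestration tooling listed above, here is a minimal Apache Airflow DAG (assuming Airflow 2.4+ for the schedule parameter). The DAG id, schedule, and task bodies are hypothetical stand-ins for real extract and transform steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for real extract/transform work.
def extract(**context):
    print("pulling raw data from the source system")

def transform(**context):
    print("running the PySpark transformation job")

with DAG(
    dag_id="daily_events_etl",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Run the transform step only after extraction succeeds.
    extract_task >> transform_task
```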
Benefits
Opportunity to work on cutting-edge data projects with global clients
A collaborative, innovation-driven work culture
Continuous learning via internal training, certifications, and mentorship
Competitive compensation and growth opportunities
Educational Background
Bachelor's or Master's degree in Data Science, Computer Science, Statistics, or a related field
Technical Skills
Proficient in Python and familiar with key data science libraries (Pandas, Scikit-Learn, TensorFlow, or PyTorch)
Strong understanding of machine learning algorithms, including but not limited to decision trees, random forests, and gradient boosting machines (a brief training sketch appears at the end of this posting)
Competence in data cleaning, transformation, preprocessing, and analysis
Strong SQL skills for data extraction and manipulation, experience working with complex database systems, and familiarity with NoSQL databases
Knowledge of data visualization tools like Matplotlib and Seaborn, and the ability to create interactive dashboards
Advanced knowledge of analytical tools and software such as Excel, Tableau, or more specialized packages depending on the industry (e.g., SAS, SPSS)
Solid knowledge of NLP, deep learning, and machine learning; working knowledge of generative AI, including hands-on implementation of LLMs, fine-tuning, and RAG
Familiarity with cloud services (AWS, Azure, Google Cloud) for data processing and storage
Professional Experience
2-3 years of experience in a data science or related role
Proven track record of developing and deploying machine learning models to solve business problems
Experience with projects involving complex data structures and large-scale datasets
Exposure to model validation and implementation in a production environment
Demonstrated experience in analyzing large datasets and delivering actionable insights that inform business decisions
Experience performing statistical analysis and forecasting, and establishing data structures that optimize analytical capabilities
Soft Skills
Strong problem-solving skills, with the ability to work through complex issues using a logical, analytical approach
Effective communication skills, capable of presenting complex insights and technical details to non-technical stakeholders in an understandable and compelling manner
Ability to collaborate with both technical teams and business units and to manage project timelines effectively
Strong attention to detail, with the capability to work on multiple projects simultaneously
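To make the modeling expectations above concrete, here is a minimal scikit-learn sketch of training and validating a gradient boosting classifier, one of the algorithms named in the technical skills. The synthetic dataset and hyperparameters are illustrative placeholders for a real business problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a gradient boosting machine with illustrative hyperparameters.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# Validate on held-out data before any production deployment.
probs = model.predict_proba(X_test)[:, 1]
print(f"held-out ROC AUC: {roc_auc_score(y_test, probs):.3f}")
```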