Company: Lifesight
Payroll: Lifesight
Responsibilities
- Build highly scalable, available, fault-tolerant distributed data processing systems (batch and streaming) that handle hundreds of terabytes of data ingested every day, a petabyte-sized data warehouse, and an Elasticsearch cluster.
- Build quality data solutions and refine existing diverse datasets into simplified models that encourage self-service.
- Build data pipelines that optimize for data quality and are resilient to poor-quality data sources (see the sketch after this list).
- Own the data mapping, business logic, transformations, and data quality.
- Low-level systems debugging, performance measurement, and optimization on large production clusters.
- Participate in architecture discussions, influence the product roadmap, and take ownership of and responsibility for new projects.
- Maintain and support existing platforms and evolve them to newer technology stacks and architectures.
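The pipeline-resilience responsibility above can be illustrated with a minimal PySpark sketch; the bucket paths, column names, and validity rule below are hypothetical examples, not part of any actual Lifesight codebase.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events_batch").getOrCreate()

# Hypothetical input: raw events landed as JSON, curated output written as Parquet.
raw = spark.read.json("s3://example-bucket/raw/events/dt=2024-01-01/")

# Quarantine rows that fail basic quality checks instead of failing the whole job.
is_valid = F.col("event_id").isNotNull() & F.col("event_ts").isNotNull()
good = raw.filter(is_valid)
bad = raw.filter(~is_valid)

good.write.mode("overwrite").parquet("s3://example-bucket/clean/events/dt=2024-01-01/")
bad.write.mode("overwrite").json("s3://example-bucket/quarantine/events/dt=2024-01-01/")
```

Keeping rejected rows in a quarantine location rather than silently dropping them makes data-quality issues auditable while keeping downstream models clean.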
Mandatory
- Strong Data Engineering Profile
- Mandatory (Experience 1): Must have 5 years of experience in data engineering using data transformation tools such as HDFS, YARN, MapReduce, Hive, Kafka, Spark, Airflow, Presto, etc.
- Mandatory (Experience 2): Must have worked on large distributed architectures using any of Kafka, Spark, Hive, or Hadoop.
- Mandatory (Experience 3): Must have handled projects involving at least 100 GB of data.
- Mandatory (Core Skills): Expertise in Apache Spark (RDDs, DataFrames, Spark tuning) and PySpark.
- Mandatory (Data Handling): Familiarity with data formats such as Parquet and Avro, and with NoSQL databases.
- Mandatory (Tech Stack): Proficiency with distributed systems and big data technologies such as HDFS, YARN, Kafka, Hive, MapReduce, Hadoop, etc.
- Mandatory (Company): Mid-sized product companies or analytics-heavy companies.
- Mandatory (Exclusions 1): Don't want candidates from large companies (e.g., Walmart, McAfee, Oracle).
- Mandatory (Exclusions 2): Don't want candidates from IT services companies.
Preferred
- Preferred (Education): Bachelor's degree in Computer Science, Engineering, or a related field from a Tier 1 or Tier 2 college.
Ideal Candidate
- Proficiency in Python and PySpark.
- Deep understanding of Apache Spark: Spark tuning, creating RDDs, and building DataFrames (see the sketch after this list).
- Experience with big data technologies such as HDFS, YARN, MapReduce, Hive, Kafka, Spark, Airflow, Presto, etc.
- Experience building distributed environments using any of Kafka, Spark, Hive, Hadoop, etc.
- Good understanding of the architecture and functioning of distributed database systems.
- Experience working with various file formats, such as Parquet and Avro, for large volumes of data.
- Experience with one or more NoSQL databases.
- Experience with AWS or GCP.
- 5 years of professional experience as a data or software engineer.
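As a rough illustration of the Spark skills listed above (DataFrames over Parquet, a common tuning knob, and dropping down to the RDD API), here is a minimal sketch; the paths, column names, and partition count are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("orders_rollup")
    .config("spark.sql.shuffle.partitions", "400")  # one common tuning knob
    .getOrCreate()
)

# DataFrame API over columnar Parquet input (hypothetical path and columns).
orders = spark.read.parquet("s3://example-bucket/warehouse/orders/")

daily = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("user_id").alias("buyers"))
)

daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/warehouse/daily_revenue/"
)

# The same data is reachable as an RDD when lower-level control is needed.
sample_rows = orders.rdd.take(5)
```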
nosql databases,spark,kafka,airflow,avro,pyspark,parquet,presto,hive,sql,data engineering,apache spark,distributed systems,map-reduce,yarn,hdfs,hadoop