Job description
Job Title: Spark Developer / Engineer (2 positions)
Location: US Remote, working PST hours
Duration: 6-12 Months
Our workflows are powered by offline batch jobs written in Scalding, a MapReduce-based framework. To improve scalability and performance, we are migrating these jobs from Scalding to Apache Spark.
Key Responsibilities:
Understanding the Existing Scalding Codebase
Analyze the current Scalding-based data pipelines.
Document existing business logic and transformations.
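For illustration, here is a minimal sketch of the kind of Scalding job these pipelines are built from; the job name, paths, and schema are hypothetical, not from this codebase:

```scala
import com.twitter.scalding._

// Hypothetical Scalding job: count events per user from a TSV of (userId, eventType).
class DailyEventCounts(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String)](args("input")))
    .map { case (userId, _) => (userId, 1L) }
    .sumByKey // Algebird Semigroup[Long] sums the per-user counts
    .write(TypedTsv[(String, Long)](args("output")))
}
```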
Migrating the Logic to Spark
Convert existing Scalding jobs into Spark (PySpark/Scala).
Refactor data transformations and aggregations in Spark.
Optimize Spark jobs for efficiency and scalability.
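As a rough sketch of what such a port might look like, the hypothetical Scalding job above could become the following Spark (Scala) application; names and paths are again illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Spark port of the Scalding DailyEventCounts job above.
object DailyEventCounts {
  def main(cmdArgs: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DailyEventCounts").getOrCreate()
    import spark.implicits._

    spark.read
      .option("sep", "\t")
      .csv(cmdArgs(0))                // _c0 = userId, _c1 = eventType
      .groupBy($"_c0".as("userId"))   // replaces Scalding's map + sumByKey
      .count()
      .write
      .option("sep", "\t")
      .csv(cmdArgs(1))

    spark.stop()
  }
}
```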
Ensuring Data Parity & Validation
Develop data parity tests to compare outputs between Scalding and Spark implementations.
Identify and resolve any discrepancies between the two versions.
Work with stakeholders to validate correctness.
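One common shape for such a parity test, sketched here under the assumption that both pipelines write tab-separated output to comparable paths, is a two-way exceptAll diff:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical parity check between Scalding and Spark outputs.
object ParityCheck {
  // Rows in one output but not the other, respecting duplicate rows.
  def diff(legacy: DataFrame, migrated: DataFrame): (Long, Long) =
    (legacy.exceptAll(migrated).count(), migrated.exceptAll(legacy).count())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParityCheck").getOrCreate()
    val legacy   = spark.read.option("sep", "\t").csv(args(0)) // Scalding output
    val migrated = spark.read.option("sep", "\t").csv(args(1)) // Spark output
    val (missing, unexpected) = diff(legacy, migrated)
    assert(missing == 0 && unexpected == 0,
      s"Parity failure: $missing rows missing, $unexpected rows unexpected")
    spark.stop()
  }
}
```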
Writing Unit Tests & Improving Code Quality
Implement robust unit and integration tests for Spark jobs.
Ensure code meets engineering best practices (modular, reusable, and well-documented).
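As one example of the expected testing style, a ScalaTest suite running against a local SparkSession might look like the following; the suite and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical unit test for the per-user count logic shown above.
class DailyEventCountsSuite extends AnyFunSuite {
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("daily-event-counts-test")
    .getOrCreate()

  test("counts events per user") {
    import spark.implicits._
    val input = Seq(("u1", "click"), ("u1", "view"), ("u2", "click"))
      .toDF("userId", "eventType")

    val result = input.groupBy("userId").count()
      .as[(String, Long)]
      .collect()
      .toMap

    assert(result == Map("u1" -> 2L, "u2" -> 1L))
  }
}
```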
Required Qualifications:
Experience in big data processing with Apache Spark (PySpark or Scala).
Strong experience with data migration from legacy systems to Spark.
Proficiency in Scalding and MapReduce frameworks.
Experience with Hadoop, Hive, and distributed data processing.
Hands-on experience in writing unit tests for Spark pipelines.
Strong SQL and data validation experience.
Proficiency in Python and Scala.
Knowledge of CI/CD pipelines for data jobs.
Familiarity with the Apache Airflow orchestration tool.