Senior Data Engineer

Work set-up: Full Remote

CME http://www.gotocme.com
201 - 500 Employees

Job description

This is a remote position.

We are seeking a self-motivated, intellectually curious Data Engineer to join our Data Science and Solutions team. This engineer will be responsible for building robust, scalable data pipelines using Databricks on AWS, integrating a wide range of data sources and structures into our AI and analytics platform.  We have built our ‘minimum viable product’ and are now scaling up to support multi-tenancy in a highly secure environment. 

The ideal candidate has more than two years' experience with Databricks, preferably building scalable, high-quality data pipelines in a distributed, serverless cloud environment. You will be well-versed in CI/CD best practices, system monitoring, and the Databricks control plane, as you will be building infrastructure-as-code to deploy secure, isolated, and monitored environments and data pipelines for our end users and AI agents. Most of all, you will be an expert collaborator in a distributed, remote environment, a team player, and always learning.


Data Pipeline Development

  • Design, build, and maintain ETL/ELT pipelines in Databricks to ingest, clean, and transform data from diverse product sources.
  • Construct gold layer tables in the Lakehouse architecture that serve both machine learning model training and real-time APIs (a sketch of this kind of build follows this list).
  • Monitor data quality, lineage, and reliability using Databricks best practices.
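
To make the gold-layer work above concrete, here is a minimal sketch of such a build, assuming a Databricks job with Delta tables and Unity Catalog three-part names. The silver_orders and gold_daily_revenue tables and their columns are hypothetical illustration names, not part of this role's actual schema.

    # Gold-layer aggregation sketch (PySpark + Delta on Databricks).
    # "spark" is the session provided by the Databricks runtime.
    from pyspark.sql import functions as F

    silver = spark.read.table("main.sales.silver_orders")

    gold = (
        silver
        .where(F.col("status") == "completed")  # keep only settled orders
        .groupBy(F.to_date("order_ts").alias("order_date"), "tenant_id")
        .agg(
            F.sum("amount").alias("daily_revenue"),
            F.countDistinct("customer_id").alias("active_customers"),
        )
    )

    # Overwrite so model training and real-time APIs read one consistent snapshot.
    (
        gold.write.format("delta")
        .mode("overwrite")
        .saveAsTable("main.sales.gold_daily_revenue")
    )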

AI-Driven Data Access Enablement

  • Collaborate with AI/ML teams to ensure data is modeled and structured to support natural language prompts and semantic retrieval across first- and third-party data sources, using vector search and Unity Catalog metadata.
  • Help build data interfaces and agent tools that let AI agents retrieve and analyze customer data in structured sources under role-based permissions (see the sketch after this list).
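
A minimal sketch of tenant-scoped semantic retrieval, assuming the databricks-vectorsearch Python client; the endpoint name, index name, and tenant_id filter column are hypothetical:

    # Semantic retrieval sketch against a Databricks Vector Search index.
    from databricks.vector_search.client import VectorSearchClient

    client = VectorSearchClient()
    index = client.get_index(
        endpoint_name="docs-endpoint",                # hypothetical endpoint
        index_name="main.sales.customer_docs_index",  # hypothetical index
    )

    # Filter on tenant_id so agents retrieve only rows the caller may see,
    # complementing role-based permissions enforced in Unity Catalog.
    results = index.similarity_search(
        query_text="unpaid invoices from last quarter",
        columns=["doc_id", "chunk_text"],
        filters={"tenant_id": "acme-co"},
        num_results=5,
    )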

API & Serverless Backend Integration

  • Work with backend engineers to design and implement serverless APIs (e.g., via AWS Lambda with TypeScript) that expose gold tables to frontend applications (a sketch follows this list).
  • Ensure APIs are performant, scalable, and designed with data security and compliance in mind.
  • Use the Databricks APIs and other platform APIs to implement provisioning, deployment, security, and monitoring frameworks that scale up data pipelines, AI endpoints, and multi-tenant security models.
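
As a sketch of the API side, here is a minimal AWS Lambda handler that exposes a gold table, using the databricks-sql-connector package. The posting names TypeScript for this layer; Python appears here only to keep all sketches in one language, and the environment variables, table, and columns are hypothetical:

    # Serverless read API sketch: API Gateway -> Lambda -> Databricks SQL warehouse.
    import json
    import os

    from databricks import sql

    def handler(event, context):
        tenant_id = event["pathParameters"]["tenant_id"]
        with sql.connect(
            server_hostname=os.environ["DATABRICKS_HOST"],
            http_path=os.environ["DATABRICKS_HTTP_PATH"],
            access_token=os.environ["DATABRICKS_TOKEN"],
        ) as conn:
            with conn.cursor() as cursor:
                # Parameterized query keeps tenant isolation at the data layer.
                cursor.execute(
                    "SELECT order_date, daily_revenue "
                    "FROM main.sales.gold_daily_revenue "
                    "WHERE tenant_id = %(tenant)s",
                    {"tenant": tenant_id},
                )
                rows = [row.asDict() for row in cursor.fetchall()]
        return {"statusCode": 200, "body": json.dumps(rows, default=str)}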


Requirements

  • 3+ years of experience as a Data Engineer or in a related role within an agile, distributed team environment, with quantifiable impact on business or technology outcomes.
  • Proven expertise with Databricks, including job and workflow orchestration, change data capture, and the medallion architecture.
  • Proficiency with Spark (in Python or Scala) for data wrangling and transformation across a wide variety of data sources and structures.
  • A practitioner of CI/CD best practices and test-driven development, familiar with the MLOps/AIOps lifecycle.
  • Proven ability to work in an agile environment with product managers, front-end engineers, and data scientists.

Preferred Skills

  • Familiarity with AWS Lambda (Node.js/TypeScript preferred) and API Gateway or equivalent serverless platforms; knowledge of API design principles and experience working with RESTful or GraphQL endpoints.
  • Exposure to React-based frontend architecture and to the implications of backend data delivery for UI/UX performance, including end-to-end telemetry to measure performance and accuracy of the end-user experience.
  • Experience with A/B testing, experiment and inference logging, and analytics.


Required profile

Spoken language(s): English

Other Skills

  • Teamwork
  • Collaboration
  • Adaptability
  • Problem Solving
