Mozn is a leading, rapidly growing data science and product development firm based in Riyadh, with a proven track record of supporting and growing the analytics ecosystem in Saudi Arabia. Mozn is a trusted analytics partner for the largest government organizations in Saudi Arabia, as well as many large corporations and startups. We are at a critical stage of scaling the company to build institutional analytics knowledge within Mozn and Saudi Arabia. It is an exciting time to work in Saudi Arabia; through Vision 2030, the rate of social and industrial change is staggering.

We seek a proactive and skilled Data and Messaging Reliability Senior Engineer to ensure the high availability, reliability, and scalability of our data systems and messaging platforms. This role is critical in maintaining operational excellence for Kafka, ClickHouse, MySQL, and PostgreSQL systems, enabling robust and secure data operations across the organization.

Requirements

Reliability Engineering for Data Systems:

High Availability Design: Architect and implement resilient systems for ClickHouse, MySQL, and PostgreSQL to meet uptime requirements.
Performance Optimization: Monitor and optimize database performance, ensuring low-latency operations under heavy loads.
Disaster Recovery: Develop and maintain disaster recovery plans and backup strategies for all critical data systems.
Capacity Planning: Forecast and plan for future data growth to ensure system scalability and reliability.

Reliability Engineering for Messaging Systems:

Kafka Cluster Management: Design and maintain highly available Kafka clusters, ensuring fault tolerance and minimizing downtime.
Stream Reliability: Optimize Kafka Streams and other stream processing solutions for real-time data delivery with guaranteed reliability.
Integration Resilience: Ensure robust integration between messaging systems and other components in the data ecosystem.
Proactive Monitoring: Set up and refine monitoring tools and alerting mechanisms to preemptively identify and resolve issues.

Operational Excellence:

Incident Management: Lead root cause analyses and postmortems for incidents, implementing preventive measures.
Automation: Automate operational tasks, including database failovers, scaling, and Kafka partition management.
Security and Compliance: Collaborate with security teams to maintain secure, compliant systems, implementing RBAC and encryption.
Documentation: Maintain thorough documentation for reliability processes, disaster recovery, and operational best practices.

Collaboration and Leadership:

Work closely with engineering, operations, and security teams to align on reliability goals and system improvements.
Provide technical mentorship and guidance to junior engineers.
Stay current with industry trends and innovations in reliability engineering, applying them to enhance platform stability.

Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
Minimum of 5 years of experience in reliability engineering, data engineering, or messaging systems.
Hands-on experience managing Kafka, ClickHouse, MySQL, and PostgreSQL in high-availability production environments.
Technical Skills:
Strong expertise in reliability tools and practices, including monitoring solutions like Prometheus, Grafana, or Datadog.
Proficiency in database tuning and query optimization for ClickHouse, MySQL, and PostgreSQL.
Experience with disaster recovery planning and execution.
Advanced skills in scripting or programming (e.g., Python, Bash, or Java).
Familiarity with orchestration and containerization tools (e.g., Kubernetes, Docker).
Soft Skills: Exceptional problem-solving abilities, collaboration, and communication skills.

Preferred Qualifications:

Experience with advanced Kafka features, including Kafka Connect, Schema Registry, and tiered storage.
Expertise in designing and managing multi-region or geo-distributed architectures.
Familiarity with cloud-native reliability tools and services.
Knowledge of database replication and sharding techniques for scalability and reliability.
Proven track record of automating operational workflows for reliability and efficiency.

Benefits

We think you'll enjoy working at Mozn. Here's why:

We selectively choose to undertake projects with impact; our users and clients trust us to solve mission-critical problems.
We move quickly, but carefully and confidently. Iterations happen on the scale of days to weeks, and we invest considerable effort in minimizing the operational overhead to empower you to do your best work.
You will be given a lot of responsibility and trust. We believe that the best results come when the people responsible for a product are given the freedom to do what they think is best.
