Offer summary

Qualifications:

Experience with cloud platforms such as GCP, AWS, and Azure.

Proficiency in Infrastructure-as-Code tools like Terraform and Helm.

Strong problem-solving and troubleshooting skills.

Excellent communication and collaboration abilities.

Key responsibilities:

Design and maintain scalable, reliable systems in a multi-cloud environment.

Develop automation tools and scripts for deployment and incident response.

Configure monitoring and alerting systems to proactively detect issues.

Participate in incident management and contribute to system security and documentation.

Job description

Position Overview

As a Site Reliability Engineer (SRE), you will play a critical role supporting our Blockdaemon team by ensuring the reliability, scalability, and performance of our systems and services. You will collaborate closely with cross-functional teams to design, implement, and maintain robust and resilient infrastructure solutions in a multi-cloud environment.

The ideal candidate is passionate about automation, possesses strong analytical skills, and thrives in a fast-paced, dynamic environment.

Blockdaemon is a blockchain infrastructure company operating in a multi-cloud configuration with a global footprint. This role calls for a candidate capable of supporting the systems and infrastructure stack across the major clouds: Google Cloud Platform (GCP), Amazon Web Services (AWS), and Azure.

Your Impact

System Architecture and Design: Collaborate with software engineering teams to design scalable, highly available, and resilient systems. Drive architectural improvements to enhance system reliability and performance.
Implement Infrastructure as Code to manage services and deployments in a multi-cloud, multi-project configuration.
Automation and Tooling: Develop automation tools and scripts to streamline deployment, monitoring, and incident response processes. Implement and maintain infrastructure as code frameworks.
Monitoring and Alerting: Configure and maintain monitoring systems to detect and mitigate potential issues proactively. Define alerting thresholds and response procedures to ensure timely incident resolution.
Incident Management: Respond to and resolve critical incidents, perform root cause analysis, and implement preventive measures to minimize the likelihood of recurrence. Participate in an on-call rotation to provide 24/7 support as needed.
Capacity Planning and Performance Optimization: Analyze system performance metrics, identify bottlenecks, and propose optimizations to improve resource utilization and efficiency.
Security and Compliance: Work closely with security teams to implement best practices for data protection, access control, and compliance with regulatory requirements. Conduct periodic security audits and vulnerability assessments.
Documentation and Knowledge Sharing: Document system configurations, procedures, and troubleshooting steps. Share knowledge and best practices with team members to foster a culture of continuous learning and improvement.