Job description

Overview:

We are seeking an experienced Chaos Engineering Architect to join our team. In this role, you will design and implement chaos engineering practices that improve the resilience and reliability of our cloud-based systems. You will work closely with cross-functional teams to create chaos engineering drills and to ensure our observability tools give meaningful insight into system performance and behavior under stress.

Key Responsibilities:

Design and Implementation: Develop and execute chaos engineering strategies, including chaos experiments and drills, to identify weaknesses in our cloud infrastructure and applications.
Cloud Environment Expertise: Apply your experience with cloud platforms (AWS, Azure, GCP) to implement chaos experiments that simulate failure scenarios and verify that systems can withstand unexpected disruptions.
Collaboration: Partner with development, operations, and QA teams to integrate chaos engineering practices into the CI/CD pipeline, fostering a culture of reliability and resilience.
Observability Enhancements: Use observability tools and practices (e.g., Prometheus, Grafana, ELK Stack) to monitor and analyze system performance so that teams understand the impact of chaos experiments.
Documentation and Training: Create comprehensive documentation for chaos engineering methodologies and conduct training sessions to upskill team members on best practices.
Continuous Improvement: Analyze results from chaos experiments to drive improvements in system design, architecture, and operational practices.
Incident Management: Collaborate with incident response teams to refine incident management processes and improve system recovery times based on findings from chaos experiments.

Qualifications:

Education: Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
Experience:
- 5+ years of experience in software engineering, systems architecture, or related fields.
- Proven experience with chaos engineering principles and practices in cloud environments.
- Familiarity with chaos engineering tools (e.g., Gremlin, Chaos Monkey, Litmus) and observability platforms.
Technical Skills:
- Strong knowledge of cloud computing architectures (AWS, Azure, GCP).
- Proficiency in programming/scripting languages (Python, Go, Java, etc.) for automation of chaos experiments.
- Experience with observability tools (e.g., Prometheus, Grafana, Datadog) to derive insights from chaos tests.
Soft Skills:
- Excellent problem-solving skills and ability to think critically under pressure.
- Strong communication skills to effectively share insights and findings with technical and non-technical stakeholders.
- Ability to work collaboratively in a fast-paced, agile environment.

Preferred Qualifications:

Experience with site reliability engineering (SRE) practices.
Familiarity with microservices architectures and container orchestration (e.g., Kubernetes).
Understanding of incident response and disaster recovery planning.
