Codeworks is an IT Services firm, known for our strong commitment to quality and for our direct client relationships.

Our financial services client is looking for a Cloud Reliability Engineer, on a contract basis, with experience in reliability engineering and related strategies for a large-scale transition from private to public cloud.

As a Cloud Reliability Engineer on this team, this individual will make a lasting impact on the company's digital transformation journey, drive customer-centric innovation and automation, and help position the organization as a leader in the competitive digital banking landscape.

Specifically, the Cloud Reliability Engineer will be responsible for the following:

Strategize and drive the building blocks of reliability engineering as we make the transition from private to public cloud.
Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime, optimizing response times, and maintaining high availability for users.
Lead incident response efforts, including identification, triage, resolution, and post-incident analysis to prevent recurrence and improve system resilience.
Develop and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics, enabling proactive issue detection and mitigation.
Implement tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.
Conduct capacity planning, performance tuning, and resource optimization for environments, collaborating with development and operations teams to meet scalability and performance goals.
Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.
Manage deployment pipelines, release processes, and configuration management for app deployments, ensuring consistency, reliability, and version control across environments.
Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services, ensuring business continuity and data integrity in case of failures or disasters.
Participate in on-call rotations and provide 24/7 support for critical incidents, troubleshoot issues, and coordinate with teams for resolution, escalation, and follow-up actions as per defined SLAs.

The ideal candidate will have at least a couple of years of professional experience with the following:

Specific experience in reliability engineering for a large-scale transition from private to public cloud, and in strategies for such transitions.
Proficiency in development technologies, architectures, and platforms (web, API) to understand system complexities and performance considerations.
Experience in cloud platforms (e.g., AWS, Azure, Google Cloud) and infrastructure as code (IaC) tools for managing app infrastructure and deployments.
Knowledge of monitoring tools (e.g., Dynatrace, LogRocket, Datadog) and logging frameworks (e.g., ELK Stack) for real-time visibility into system health, performance metrics, and user experience.
Experience in incident management, including incident response, triage, root cause analysis (RCA), and post-mortem reviews to prevent recurring issues.
Strong troubleshooting skills to diagnose complex technical issues in app environments, infrastructure, networking, and performance bottlenecks.
Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Ansible, Terraform) for automating routine tasks, deployments, and infrastructure management.
Experience in implementing continuous integration/continuous deployment (CI/CD) pipelines for apps using tools like Jenkins, GitLab CI/CD, or Azure DevOps.
Expertise in setting up monitoring solutions, configuring alerts, and creating dashboards to monitor system performance, application metrics, and user experience.
Familiarity with APM (Application Performance Monitoring) tools to analyze app performance, identify bottlenecks, and optimize resource utilization.
Commitment to continuous learning, staying updated with industry trends, new technologies, and best practices in app reliability, performance, and operations.
Adaptability to evolving requirements, technologies, and business needs, with a focus on driving continuous improvement and operational excellence.

For immediate consideration, qualified candidates should send their resumes. Attn: Laura.

About CODEWORKS:

Codeworks has more than 20 years of experience successfully serving Fortune 1000 companies. Our Recruiting team consists of Talent Specialists highly skilled at evaluating, advising, and connecting IT professionals with new career opportunities that facilitate career growth. Codeworks has consistently been recognized by Inc. Magazine as one of the fastest growing private companies in the US.

At Codeworks, we're committed to diversity, equity, and inclusion in our workforce and beyond. We believe in equal opportunities and value the unique perspectives that every individual brings to our team. Join us in creating an inclusive, innovative, and collaborative workplace where your talents can thrive.
