Oceans Code Experts is looking for talented individuals who are ready for the next step in their career. We offer a collaborative professional environment as full of rewarding experiences as it is of challenges.

An SRE Architect at Oceans can expect to work on multiple projects, collaborate with a cross-functional team, and be transparent about time and tasks to help clients understand the progress of their projects.

Candidates must LOVE helping people, solving business problems, and pushing themselves to slay the next beast of a project.

Job Summary

A Site Reliability Engineering (SRE) Architect typically has a background in both software development and operations, with a focus on designing and implementing systems that are reliable, scalable, and maintainable. The role involves collaborating with development and operations teams to ensure the overall health and performance of systems in production. This person will be responsible for building strategies and solutions to help clients solve complex business problems and take advantage of opportunities to be more innovative and competitive. Additionally, they will work with and mentor current and incoming SRE teammates and assist in vetting new team members.

Job Responsibilities

System Architecture Design: Designing and architecting reliable and scalable systems, selecting appropriate technologies, and ensuring alignment with organizational goals.
Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Defining, establishing, and monitoring SLOs and SLIs for critical services to ensure they meet reliability standards.
Incident Management: Leading and participating in incident response efforts, conducting post-incident reviews, and implementing improvements to prevent recurrence.
Automation and Tooling: Developing and implementing automation tools for provisioning, configuration, and maintenance, as well as building monitoring and logging systems for visibility.
Problem-Solving Skills: Analyzing complex issues, identifying root causes, and developing effective solutions to improve system reliability and performance.
Collaboration and Communication: Effectively collaborating with cross-functional teams, clearly communicating ideas and solutions, and working well with both technical and non-technical stakeholders.
Adaptability and Continuous Learning: Adapting to changing technologies, methodologies, and organizational requirements, and committing to ongoing learning and staying updated on industry trends.
Leadership and Mentoring: Providing leadership by influencing and guiding teams, mentoring junior SREs, and fostering a collaborative and inclusive work environment.
Attention to Detail and Quality: Ensuring high-quality solutions and meticulous attention to detail in system design, implementation, and troubleshooting.

Job Requirements

Strong English proficiency (B2+ written and spoken).
16 years of professional experience. Previous roles: Systems Administrator/Operations Engineer, DevOps Engineer, Network Engineer, Database Administrator (DBA), Cloud Engineer/Architect, DevOps Architect.
Impeccable punctuality (schedules are flexible but being on time for meetings is crucial).
Infrastructure as Code (IaC) Tools:

Terraform: Infrastructure provisioning and management.
Ansible: Automation for configuration management and application deployment.

Containerization and Orchestration:

Docker: Containerization for packaging applications and dependencies.
Kubernetes: Container orchestration for automating the deployment, scaling, and management of containerized applications.

Monitoring and Observability:

Prometheus: Monitoring and alerting toolkit for containerized environments.
Grafana: Visualization and monitoring dashboard for various data sources.

Logging and Tracing:

ELK Stack (Elasticsearch, Logstash, Kibana): Log management and analysis.
Jaeger or Zipkin: Distributed tracing for identifying performance bottlenecks.

Performance Testing:

Apache JMeter or Gatling: Load and performance testing tools to simulate user traffic and analyze system performance.

Incident Management:

PagerDuty or VictorOps: Incident alerting and on-call management.
StatusPage: Communication and status updates during incidents.

Version Control:

Git: Version control for managing and tracking changes in code and configurations.

CI/CD Tools:

Jenkins or GitLab CI/CD: Continuous integration and continuous delivery pipelines for automating software delivery processes.

Configuration Management:

Consul or etcd: Service discovery and configuration management.
Zookeeper: Distributed coordination and configuration management.

Security and Compliance:

HashiCorp Vault: Secrets management and data protection.
OpenSCAP or Nessus: Security compliance scanning and vulnerability assessment tools.

Networking:

Istio or Linkerd: Service mesh for managing and securing microservices communication.
Wireshark: Network protocol analyzer for troubleshooting and debugging.

Collaboration and Documentation:

Confluence or Notion: Documentation and collaboration platforms.
Slack or Microsoft Teams: Communication and collaboration tools for teams.

Nice to Have

Additional relevant skills and experiences that enhance the candidate's ability to perform in the role but are not mandatory.

Position Type and Expected Hours of Work

This is a full-time consultancy, with up to 40 hours per week during regular business hours. We operate under a flexible core hours policy to accommodate various schedules, allowing consultants to work during their peak productivity times. Additionally, we offer the flexibility to work remotely.
