MARGO

Network Reliability Engineer

Key Facts

Remote From: 
Full time
English

Other Skills

  • Collaboration
  • Communication
  • Problem Solving
  • Open Mindset
  • Mentorship

Requirements

  • Experience with Go or Python
  • Strong scripting skills (Bash, Python)
  • Hands-on experience with Linux systems (Ubuntu/Debian)
  • Knowledge of networking (VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)

Roles & Responsibilities

  • Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents
  • Troubleshoot high-impact production issues in collaboration with other engineering teams
  • Participate in an on-call rotation to handle incidents and ensure service continuity
  • Implement and maintain observability solutions to monitor AI infrastructure and application health

Job description

 
#HPC #AI #GPU #CLUSTERS
 
YOUR DAILY ROUTINE
- Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on-call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices for stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness
- Participate in team rituals and knowledge-sharing initiatives
 
ABOUT YOU
 
🎯 SOFT SKILLS:
- Proactive and solution-oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
 
💻 HARD SKILLS:
- Experience with Go or Python 
- Strong scripting skills (Bash, Python)
- Hands-on experience with Linux systems (Ubuntu/Debian)
- Hands-on experience with GPU and HPC infrastructure (preferred)
- Knowledge of networking (VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
- Experience managing relational databases (MariaDB)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
 
