NOC / Incident Manager

Remote: Full Remote

Offer summary

Qualifications:

  • 2+ years of experience in a NOC, Incident Management, or technical support role.
  • Experience with monitoring tools such as Grafana, Prometheus, and Datadog.
  • Strong troubleshooting skills and ability to analyze logs and metrics.
  • Familiarity with cloud environments like AWS, Azure, or GCP.

Key responsibilities:

  • Monitor production systems in real time to detect and respond to incidents.
  • Manage live incidents with clear communication and timely resolution.
  • Document and improve incident response processes, including updating runbooks.
  • Collaborate with SREs and developers to implement long-term reliability improvements.

Gloat
201 - 500 Employees

Job description

About the company:

Gloat puts people and companies in motion. Our Agile Workforce Operating System is helping the world's most renowned enterprises become dynamic organizations, future-fit for any eventuality, and poised for continuous growth and innovation in today's ever-changing economic climate.

We deliver AI-powered intelligence, infrastructure, and applications that enable organizations to effectively tackle change with agility, unlock capacity and productivity, and reduce workforce risk. Today we support industry leaders around the world including HSBC, Spotify, Nestle, Standard Chartered Bank, Schneider Electric, and many more.


Life at Gloat:

Gloat is a revolutionary startup with a global workforce. We have offices in Tel Aviv, New York City and London and work with customers around the globe. We value collaboration, innovative thinking, and curiosity and we’re looking for bright, driven, and passionate people to grow with us. If you care about empowering businesses and people to reach their potential, you’re in for a fun ride.


Who we’re looking for:

We’re looking for a NOC / Incident Manager to join our Production Operations team and play a key role in ensuring the stability and reliability of our systems. In this role, you will monitor production environments, detect and respond to incidents, and work closely with SREs and engineering teams to improve system resilience.

This is a hands-on role for someone who thrives in fast-paced environments, enjoys troubleshooting complex issues, and is passionate about reducing downtime and improving incident response processes.


Responsibilities

  • Monitor production systems in real time to detect and respond to incidents.
  • Analyze and triage alerts, identifying root causes and escalating when necessary.
  • Manage live incidents, ensuring clear communication and timely resolution.
  • Document and improve incident response processes, including updating runbooks and playbooks.
  • Collaborate with SREs and developers to drive post-mortem analysis and implement long-term reliability improvements.
  • Reduce alert fatigue by tuning monitoring systems and ensuring alerts are actionable.
  • Participate in on-call rotations, ensuring 24/7 incident response coverage.
  • Proactively suggest improvements to monitoring, alerting, and automation strategies.

Requirements

  • 2+ years of experience in a NOC, Incident Management, or technical support role.
  • Experience with monitoring tools (Grafana, Prometheus, ELK, Datadog, New Relic, etc.).
  • Strong troubleshooting skills, with a structured approach to problem resolution.
  • Ability to analyze logs and metrics to identify root causes of incidents.
  • Excellent communication skills, with the ability to coordinate across teams.
  • Familiarity with cloud environments (AWS, Azure, GCP) and modern infrastructure concepts.
  • Ability to work under pressure, responding to incidents in a high-scale production environment.


Bonus Points

  • Experience with incident automation tools and self-healing mechanisms.
  • Scripting skills (Bash, Python) to automate tasks and improve monitoring.
  • Familiarity with on-call management tools like PagerDuty or Opsgenie.
  • Understanding of SRE principles and site reliability best practices.

Why Join Us?

This role is an opportunity to be at the heart of Gloat’s production operations, ensuring our platform runs smoothly and reliably. If you’re excited about real-time problem-solving, operational excellence, and working with cutting-edge technologies, we’d love to hear from you!


At Gloat, we believe that building the most important company in the history of human capital begins with having a diverse and inclusive workforce ourselves. This means that we look for individuals who can bring unique strengths, perspectives, skills, and backgrounds to our existing teams. Gloat is proud to be an Equal Opportunity Employer, and does not and will not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, gender identity or expression, age, marital status, veteran status, disability status, pregnancy, parental status, genetic information, political affiliation, or any other status protected by the laws or regulations in the locations where we operate.





Required profile

Experience

Spoken language(s):
English

Other Skills

  • Troubleshooting (Problem Solving)
  • Calmness Under Pressure
  • Collaboration
  • Communication
