Key Facts

Remote From:

Full time

Mid-level (2-5 years)

English

Hard Skills

AWS Cloud Services, Terraform, Kubernetes, Observability, Datadog, Internet Protocols Suite, Analytical Dashboard, Unix, Virtual Private Networks (VPN), Containerization, Root Cause Analysis, Continuous Delivery, AWS Cost Management, SMTP (Simple Mail Transfer Protocol), Post-Mortem Care, Splunk, JavaScript (Programming Language), SQL (Programming Language), Category Management, Log Monitoring, Cloud Computing, Vsftpd, Microsoft Azure, Python (Programming Language), Engineering Documentation, Ruby (Programming Language), Bash (Scripting Language), CI/CD, Java (Programming Language), Hybrid Testing, Automated Information Systems, Configuration Management, Incident Management, System Level Troubleshooting, Infrastructure as Code (IaC), New Relic (SaaS), Go (Programming Language), Help Desk Support, System Administration, Software Development, gRPC, Database Administration, Customer Centricity, Stakeholder Communications, Continuous Improvement Process

Other Skills

• Collaboration
• Communication
• Problem Solving

Job description

About Us:

Intrado is dedicated to saving lives and protecting communities, helping them prepare for, respond to, and recover from critical events. Our cutting-edge company strives to become the most trusted, data-centric emergency services partner by uniting fragmented communications into actionable intelligence for first responders. At Intrado, all of our work truly matters.

Responsibilities:

In this Site Reliability Engineering (SRE) role, you’ll partner closely with development and business teams to create effective monitoring, alerting, and observability solutions that improve system performance and visibility. You’ll support production systems, troubleshoot complex issues, and help drive long-term stability through proactive incident management and automation. You’ll also design secure, cost-effective, and reliable cloud infrastructure.

This role works night shifts, starting at 9 or 10 pm and ending at 5 or 6 am SST.

Reliability Engineering & System Operations

Design, implement, and maintain scalable, reliable production systems.
Troubleshoot and resolve complex application and system issues.
Collaborate with development teams to build features with reliability, observability, and performance in mind.
Apply Site Reliability Engineering (SRE) best practices, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).

Monitoring & Observability

Develop and maintain monitoring, alerting, synthetic testing, and dashboards to ensure visibility into system health.
Configure agents for metrics/log collection and manage incident notification channels.
Analyze trends and recurring issues to drive proactive improvements.

Cloud Infrastructure Management

Manage and optimize AWS/Azure environments in staging and production.
Collaborate with architecture, development, and finance teams to design secure, cost-effective, and reliable cloud infrastructure.

Incident & Problem Management

Participate in 24/7 on-call rotations, quickly respond to production incidents, and identify root causes.
Lead post-mortems and implement long-term fixes.
Escalate and communicate issues as appropriate.

Automation & Tooling

Automate repetitive operational tasks and improve system efficiency.
Build and maintain deployment and configuration tools.
Work with CI/CD tools such as GitHub Actions.

Collaboration & Customer Focus

Partner with product and development teams to prioritize and resolve production-impacting issues.
Support internal teams with tools and insights for efficient self-service.
Ensure timely resolution of tickets and clear communication with stakeholders.

Architecture & Documentation

Review technical documentation (HLDs/FRDs) to identify potential issues early.
Maintain knowledge of product platforms and usage patterns.

What You Bring:

Education: Bachelor’s in Computer Science, MIS, or related field (or equivalent experience).
Experience: 4+ years in application support; experience in development, databases, or systems administration preferred.
Cloud: Hands-on expertise in AWS and/or Azure (GCP a plus).
Languages: Skilled in one or more languages (Python, Go, Java, Ruby, JavaScript); scripting with Bash or Python.
Monitoring Tools: Experience with tools like Datadog, Splunk, and New Relic, including dashboard creation and performance monitoring.
Systems & Networking: Strong Linux/Unix skills; troubleshooting experience with SQL, VPN, TCP/IP, and FTP/SMTP.
Containers & IaC: Production-level experience with Kubernetes and Terraform.
SRE Practices: Knowledge of SLIs/SLOs/SLAs, CI/CD, and automation strategies.
Soft Skills: Excellent problem-solving, communication, and collaboration.