Manila Recruitment

Site Reliability Engineer (Remote) - #35039

Key Facts

Remote From: 
Full time
Mid-level (2-5 years)
English

Other Skills

  • Collaboration
  • Communication
  • Proactivity
  • Customer Service
  • Problem Solving

Requirements

  • Strong experience with Linux and Kubernetes (kubectl: logs, exec, describe)
  • Ability to read and interpret Python or Go stack traces to diagnose issues across distributed services
  • Solid proficiency in PostgreSQL / SQL (psql)
  • Experience with Google Cloud Platform (GCP) and hands-on infrastructure provisioning; Terraform (or equivalent IaC)

Roles & Responsibilities

  • Monitor and troubleshoot the running platform across multiple services and components
  • Analyze Cloud Run logs, Temporal workflow UI, GKE pod status, and Pub/Sub queues to identify and resolve issues
  • Perform end-to-end triage to determine whether issues originate from the agent layer (Python), workflow layer (Temporal), API layer (Go), or frontend (Vue)
  • Support new customer onboarding, including provisioning and validating customer environments to maintain reliability

Job description

Company Profile:

Our client is a U.S.-based group of affiliated companies operating at the intersection of legal technology and mass tort litigation. The organization includes a legal technology platform that automates medical record retrieval and case qualification for law firms, a Washington, D.C.–based mass tort litigation firm, and related holding entities. It is a lean, high-growth environment where each team member plays a significant and impactful role.

Overall purpose and responsibilities of the role:
As a Site Reliability Engineer, you will help build and support a technology platform while working closely with support staff and developers. You will be responsible for monitoring and troubleshooting the live platform to ensure optimal performance and stability. The role will also involve participating in new customer onboarding, provisioning customer environments, and resolving production issues to maintain system reliability and performance.

Duties and Responsibilities:

●      Monitor and troubleshoot the running platform across multiple services and components

●      Analyze Cloud Run logs, Temporal workflow UI, GKE pod status, and Pub/Sub queues to identify and resolve issues

●      Perform end-to-end triage to determine whether issues originate from the agent layer (Python), workflow layer (Temporal), API layer (Go), or frontend (Vue)

●      Support resolution of paralegal-facing operational issues such as stuck cases, failed faxes, and pending qualifications

●      Execute and write SQL queries against AlloyDB for investigation, validation, and troubleshooting

●      Participate in platform development and improvement initiatives, including identifying recurring issues and contributing to fixes

●      Support new customer onboarding, including provisioning and validating customer environments

●      Contribute to the build and enhancement of internal tools, services, and platform components

●      Act as a Level 2 support engineer, going beyond surface-level platform monitoring to identify and resolve deeper system and integration errors

●      Develop and maintain runbooks, escalation procedures, and operational documentation to improve incident response and system reliability

Requirements

Must-have Skills / Qualification:

●      Strong experience with Linux and Kubernetes (kubectl: logs, exec, describe)

●      Ability to read and interpret Python or Go stack traces to diagnose issues across distributed services

●      Solid proficiency in PostgreSQL / SQL (psql)

●      Experience with GCP, AWS, or Azure (GCP preferred), including hands-on infrastructure provisioning and management

●      Practical experience with Kustomize or Helm

●      Exposure to workflow orchestration tools (preferably Temporal; also Airflow, Argo, Dagster, or AWS Step Functions)

●      Experience with CI/CD pipelines (e.g., GitHub Actions or equivalent)

●      Hands-on Terraform (or equivalent IaC) experience for provisioning cloud resources

●      Experience with observability tooling: Cloud Logging, Grafana / Prometheus, OpenTelemetry, or equivalent

●      Comfort working with HIPAA-adjacent / PHI data, including an understanding of secure-logging hygiene (no raw PHI in logs or traces)

●      Must have own equipment

Advantageous or Nice-to-Have Skills/Experience:

●      Experience with Google Cloud Platform (GCP) services such as Cloud Run, GKE, Pub/Sub, Cloud SQL / AlloyDB, IAM, and Secret Manager

●      Terraform at scale (multi-environment modules, remote state)

●      Legal ops or litigation support background is a bonus

Location: Work-from-home

Working hours / Job Type:

Monday to Friday, 6:00 AM – 3:00 PM Pacific Time (9:00 PM – 6:00 AM Philippine Time), with a 2-hour overlap for collaboration between teams. The schedule covers 8 core working hours plus a 1-hour break.

**You will be a full-time contractor of our client’s U.S.-based company.**
