Company Profile:
Our client is a U.S.-based group of affiliated companies operating at the intersection of legal technology and mass tort litigation. The organization includes a legal technology platform that automates medical record retrieval and case qualification for law firms, a Washington, D.C.–based mass tort litigation firm, and related holding entities. It is a lean, high-growth environment where each team member plays a significant role.
Overall purpose and responsibilities of the role:
As a Site Reliability Engineer, you will help build and support a technology platform while working closely with support staff and developers. You will be responsible for monitoring and troubleshooting the live platform to ensure optimal performance and stability. The role will also involve participating in new customer onboarding, provisioning customer environments, and resolving production issues to maintain system reliability and performance.
Duties and Responsibilities:
● Monitor and troubleshoot the running platform across multiple services and components
● Analyze Cloud Run logs, Temporal workflow UI, GKE pod status, and Pub/Sub queues to identify and resolve issues
● Perform end-to-end triage to determine whether issues originate from the agent layer (Python), workflow layer (Temporal), API layer (Go), or frontend (Vue)
● Support resolution of paralegal-facing operational issues such as stuck cases, failed faxes, and pending qualifications
● Execute and write SQL queries against AlloyDB for investigation, validation, and troubleshooting
● Participate in platform development and improvement initiatives, including identifying recurring issues and contributing to fixes
● Support new customer onboarding, including provisioning and validating customer environments
● Contribute to the build and enhancement of internal tools, services, and platform components
● Act as a Level 2 support engineer, going beyond surface-level platform monitoring to identify and resolve deeper system and integration errors
● Develop and maintain runbooks, escalation procedures, and operational documentation to improve incident response and system reliability
Requirements
Must-have Skills / Qualifications:
● Strong experience with Linux and Kubernetes (kubectl: logs, exec, describe)
● Ability to read and interpret Python or Go stack traces to diagnose issues across distributed services
● Solid proficiency in PostgreSQL / SQL (psql)
● Experience with GCP, AWS, or Azure (GCP preferred), including hands-on infrastructure provisioning and management
● Practical experience with Kustomize or Helm
● Exposure to workflow orchestration tools (preferably Temporal; also Airflow, Argo, Dagster, or AWS Step Functions)
● Experience with CI/CD pipelines (e.g., GitHub Actions or equivalent)
● Hands-on Terraform (or equivalent IaC) experience for provisioning cloud resources
● Experience with observability tooling: Cloud Logging, Grafana / Prometheus, OpenTelemetry, or equivalent
● Comfort working with HIPAA-adjacent / PHI data, including an understanding of secure-logging hygiene (no raw PHI in logs or traces)
● Must have own equipment
Advantageous or Nice-to-Have Skills/Experience:
● Experience with Google Cloud Platform (GCP) services such as Cloud Run, GKE, Pub/Sub, Cloud SQL / AlloyDB, IAM, and Secret Manager
● Terraform at scale (multi-environment modules, remote state)
● Legal ops or litigation support background
Location: Work-from-home
Working hours / Job Type:
Monday to Friday, 6:00 AM – 3:00 PM Pacific Time (9:00 PM – 6:00 AM Philippine Time), with a 2-hour overlap for collaboration between teams. This schedule includes 8 core working hours, exclusive of a 1-hour break.
**You will be a full-time contractor of our client’s U.S.-based company**
