Offer summary

Qualifications:

Minimum 5 years of experience in Site Reliability Engineering or similar roles.

Extensive expertise in AWS cloud platform and services.

Proficiency in scripting languages like Python or PowerShell.

Strong knowledge of Linux systems, networking, load balancing, and security principles.

Key responsibilities:

Design, build, and maintain cloud infrastructure solutions using AWS.

Lead incident response efforts and perform root cause analysis.

Mentor and provide technical guidance to junior SREs.

Implement automation and process improvements to enhance system reliability.

Job description

Senior Site Reliability Engineer

Pay Status and Classification: Exempt, Regular Full-time

Supervisor Title: Director of DevOps

Work Location: Remote in New York or Texas. Employees in New York who are local to company headquarters in Schenectady, NY, are expected to be in the office on certain days for company meetings.

Position purpose: The Senior Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of Transfinder’s cloud-based software solutions. This role blends software engineering and systems administration to support and enhance critical infrastructure, working closely with development and operations teams to deliver secure and cost-effective cloud environments.

Essential Duties And Responsibilities

Cloud Infrastructure Architecture and Implementation: Designs, builds, and maintains robust cloud infrastructure solutions using AWS and other cloud technologies.
Mentorship and Team Development: Provides technical guidance and mentorship to junior SREs, promoting a culture of continuous learning and improvement.
Operational Efficiency and Automation: Identifies and implements process improvements through automation and optimization to enhance reliability and reduce manual effort.
Performance and Reliability Management: Develops and executes strategies to meet and exceed Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Incident Management: Leads incident response efforts, performs root cause analysis, and implements preventive measures to minimize downtime.
Capacity Planning and System Optimization: Proactively identifies performance bottlenecks, optimizes resource utilization, and ensures system scalability.
Security and Compliance: Implements cloud security best practices, including least-privilege IAM policies, secrets management, and evidence generation for compliance frameworks (e.g., SOC 2, ISO 27001).
Other duties and projects as assigned.

Required Skills/Abilities

Strong problem-solving, troubleshooting, and analytical skills.
Excellent communication and collaboration abilities.
Organizational skills with attention to detail.
Ability to manage time and prioritize tasks.
Proficiency in scripting languages (e.g., Python, PowerShell).
In-depth knowledge of Linux systems, networking, load balancing, and security principles.

Experience

5+ years in Site Reliability Engineering or a similar role.
Extensive expertise in AWS (Amazon Web Services) cloud platform and services.
Experience with GitOps practices and CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD, or similar).
Experience with Infrastructure as Code (e.g., Terraform).
Experience designing and maintaining observability stacks (e.g., Prometheus, Grafana, ELK) with a focus on actionable metrics, alerting, and SLOs.

Physical Requirements

Prolonged periods of sitting at a desk and working on a computer.
Must be able to lift up to 15 pounds at times.

Annual Salary Range: $100,000.00 to $150,000.00

Compensation: Salary is established based on various factors, including, but not limited to, prior employment history, job-related knowledge, education and training, skills, and geographic location.
