Key Facts

Category: Site Reliability Engineer (SRE)

Contract type: Fixed term

Language: English

Hard Skills

Observability, Infrastructure as Code (IaC), Incident Response, Cloud Computing, Datadog, Prometheus, Logstash, Kibana, Graceful Degradation, AWS CloudFormation (+21 more)

Other Skills

• Collaboration
• Communication
• Leadership
• Adaptability

Job description

Title: Senior Site Reliability Engineer (SRE)
Location: Remote

About January

At January, we’re transforming the lives of borrowers by bringing humanity to consumer finance. Our data-driven products empower financial institutions to streamline collections and help borrowers regain financial stability and control over their lives. We’re not just expanding access to credit — we’re restoring dignity and paving the way for millions to achieve financial freedom.

About the Role

As a Senior Site Reliability Engineer (SRE), you will establish SRE practices from the ground up — ensuring reliability, scalability, and performance as January scales from thousands to millions of borrowers. You’ll architect resilient infrastructure, design modern observability solutions, and build sustainable on-call processes that evolve with our rapid growth.

Your work will directly address scaling challenges including database optimization, async workflow infrastructure, and data pipeline reliability — enabling the engineering team to ship confidently and efficiently.

Key Responsibilities

Lead incident response and develop sustainable on-call practices, including runbooks, blameless postmortems, and continuous improvement to reduce MTTR.
Build and maintain self-service observability tools (Datadog, Prometheus, ELK) for proactive monitoring and troubleshooting.
Create and maintain Infrastructure as Code (IaC) using Terraform or CloudFormation for consistent, secure AWS environments.
Partner with development teams to architect resilient, scalable infrastructure for critical components like databases, networking, async workflows, and data pipelines.
Design and implement robust CI/CD pipelines (GitHub Actions) with advanced deployment strategies (blue/green, canary).
Drive best practices in reliability and performance early in the design phase to future-proof January’s systems.

Required Skills & Experience

Proven experience leading incident response and postmortem processes for high-availability production systems.
Deep expertise in designing highly available architectures (EC2, Fargate, auto-scaling, health checks, graceful degradation).
Strong experience with AWS cloud infrastructure and IaC tools (Terraform, CloudFormation).
Hands-on experience with CI/CD automation using GitHub Actions or equivalent tools.
Proficiency in observability and monitoring stacks (Datadog, Prometheus, ELK).
Solid scripting/programming skills in Python (for automation, tooling, and debugging).
Excellent communication and documentation skills, with the ability to collaborate across engineering and platform teams.

Tools & Technologies

Cloud: AWS
IaC: Terraform, CloudFormation
CI/CD: GitHub Actions
Monitoring: Datadog, Prometheus, ELK
Languages: Python
Infrastructure: EC2, Fargate

Additional Details

Remote role (NYC-based preferred for hybrid collaboration).
Opportunity to build and own the entire SRE practice for a growing FinTech startup.
Fast-paced, innovative environment working on AI-forward consumer finance products.
