Basic Function:
The Senior Site Reliability Engineer (SRE) at Lumin Digital is responsible for ensuring the availability, scalability, and performance of our digital banking platform. This role requires a deep understanding of both development and operations, leveraging automation to reduce manual tasks and enhance reliability. The SRE will work closely with Software Engineers to incorporate best practices from design to deployment, ensuring that Service Level Objectives (SLOs) are consistently met.
Essential Functions, Responsibilities, Experience:
Develop and manage CI/CD pipelines, ensuring efficient deployment and system updates.
Monitor and troubleshoot application and infrastructure issues across all environments, proactively ensuring SLOs and uptime requirements are met.
Collaborate with development and security teams to integrate best practices and ensure system resilience.
Engage in capacity planning and demand forecasting to anticipate performance bottlenecks and proactively scale the environment.
Manage change and configuration, ensuring stability and consistency across deployments.
Provide metrics to track system performance and identify areas for improvement.
Implement monitoring and alerting strategies that promote automation, self-healing, and effective incident response.
Participate in a 24x7 on-call rotation to support system reliability and availability.
Perform other duties as assigned.
Growth Opportunities:
~30 Days: Gain familiarity with Lumin Digital’s systems, tools, and processes, and begin supporting CI/CD pipeline and monitoring tasks.
~90 Days: Take ownership of specific areas of the Lumin Digital tech stack, implementing monitoring and alerting best practices while working closely with development teams.
~1 Year: Lead initiatives in capacity planning, proactive scaling, and process improvement to enhance system reliability and SLO attainment.
Knowledge, Skills, & Abilities:
Strong problem-solving skills with an operations mindset and an ability to anticipate issues in large-scale systems.
Proficiency with configuration management tools such as Chef, Ansible, or Puppet.
Knowledge of standard networking protocols and components (HTTP, DNS, TCP/IP, ICMP).
Expertise in AWS or other cloud hosting environments, with a security-focused approach to data integrity and availability.
Hands-on experience with containerization and orchestration technologies, including Docker and Kubernetes.
Advanced understanding of Terraform and CI/CD architecture, with the ability to automate workflows.
Physical Demands:
While performing the duties of this job, the employee is regularly required to sit; use hands to type, handle, or feel; and talk or hear.
Specific vision abilities required by this job include close vision.
Individuals with a disability who are otherwise able to perform the essential functions of the job may request reasonable accommodation through the Human Resources department.
Ability to respond to incidents during off hours.
Travel:
Minimal: generally 12 days or fewer per year, including approximately two team get-togethers.