Remote Systems Engineer III

Remote:

Full Remote

Contract:

Full time

Experience:

Mid-level (2-5 years)

Work from:

United States

Offer summary

Qualifications:

Engineering degree or equivalent experience; experience with monitoring, observability, Site Reliability Engineering, and Azure.

Key responsibilities:

Teach peers best practices in monitoring and observability
Guide proper toolset use for service improvement
Develop and improve monitoring of the software and hardware ecosystem
Provide guidance on monitoring best practices; improve instrumentation and observability across the ecosystem
Create scripts and infrastructure as code for monitoring implementations to lower overhead

EDI Staffing, an EDI Specialists Company https://www.edistaffing.com

11 - 50 Employees

Job description

Your missions

What does a day look like?

Sample Day 1

The day starts with a quick run through email to see if any of the teams you partner with have open questions about using Datadog to monitor & observe their product. From there it's on to the daily Scrum call to give the team a quick update on where you stand. It's now 9:15 AM EST and you're ready to work on your tasks, which are focused on identifying Service Level Indicators (SLIs) for a team running a J2EE application in Azure. You draw on your experience & a bit of online research to identify the key SLIs and create Service Level Objectives (SLOs) based on them in Datadog. From there you build out an Azure DevOps pipeline that uses Infrastructure as Code to create those SLOs. In the afternoon you partner with another team to build a dashboard that visualizes the SLOs over different time periods. After that you schedule a meeting for the following week to guide the team through reviewing how and why the SLOs were or were not met.

Sample Day 2

The day starts with a quick run through email to see if any of the teams you partner with have open questions about the monitoring & observability you guided them in setting up. From there it's on to the daily Scrum call to give the team a quick update on where you stand. It's now 9:15 AM EST and you're ready to work on your tasks, which all relate to adding observability to Product X by integrating it into Datadog (tagging, base agent installation, integration setup, etc.). After lunch you meet with the team that owns Product X and walk them through the basics of Datadog using their product's data (from QA). After that you're back to getting the JMX integration configured and working in the QA environment. Once it's done, you look at the out-of-the-box JMX dashboard and notice it's missing some key information, so you copy it and add a few metrics. It's now almost the end of the day, so you take a quick peek at your upcoming tasks. With JMX done, tomorrow you move on to adding tracing to the application, and next week you'll work with the team to configure their monitors (and validate their ServiceNow alert mapping & routing).

Responsibilities

Teaching peers about monitoring & observability best practices.
Guiding & reinforcing proper use of our toolset to improve the quality, reliability & availability of the services our teams offer.
Implementing and enhancing monitoring of the hardware & software across our ecosystem:
  Developing and improving instrumentation/integrations.
  Providing guidance on monitoring best practices.
  Providing guidance on monitoring specific hardware & software items (key points to monitor).
Implementing and enhancing observability of products & platforms across our ecosystem:
  Developing and improving instrumentation.
  Providing guidance on key areas to observe.
  Educating teams on how observability tools work.
Ensuring we provide our internal customers with the best monitoring & observability possible, helping them raise the quality, reliability & availability of corporate IT infrastructure.
Scripting / Infrastructure as Code / Process Creation for monitoring & observability implementations & enhancements to lower overhead & improve efficiency.

Requirements

Experience with Monitoring / Observability / Site Reliability Engineering
Engineering degree or equivalent experience and familiarity with engineering best practices.
Working knowledge of how hardware & software interact in a corporate retail environment.
Experience with Azure / Azure DevOps
Deeper knowledge in one or more of the following hardware/software domains:
  Application Servers (IIS, Tomcat, WebSphere, JBoss, etc.)
  Containerization (Kubernetes, VMware, etc.)
  Database (SQL Server, Postgres, DB2, Oracle, etc.)
  Message Bus (IBM MQ, Kafka, ActiveMQ, RabbitMQ)
  Networking (Cisco ACI, F5 Load Balancers, Firewalls, etc.)
  Operating Systems (Red Hat, Windows, etc.)
  Programming (Java, .NET, Python, etc.)
  Storage Devices
  Web Servers (Apache, nginx, etc.)
Familiarity with the Agile Scrum process.
Ability to interact with a variety of personalities and technical skill levels across multiple product & platform teams.
Proficient in developing and maintaining technical documentation.

Nice To Haves

Experience with:
  Datadog
  Nagios
  ServiceNow Event Management / Service Operations Workspace
Knowledge of the Google Site Reliability Engineering model
Experience with Infrastructure as Code / Configuration Management tools:
  Terraform
  Ansible
  Azure DevOps
Skills in troubleshooting production environments (this is not a day-to-day responsibility of this role, but this experience will prove valuable as we build the tools those teams use).
Strong ownership attitude / track record of taking responsibility.

Required profile

Experience

Level of experience: Mid-level (2-5 years)

Spoken language(s):

English


Hard Skills

Nagios
Monitoring
Observability
Site Reliability Engineering (SRE)
Azure
Azure DevOps
Server Technologies
Containerization
Database Management
Operating Systems
Coding
Scrum
Datadog Monitoring
ServiceNow
Terraform
Ansible

Soft Skills

Networking
Excellent Communication
Proactive Mindset
Sense of Ownership
