Experience implementing and administering Hadoop infrastructure with cluster maintenance, troubleshooting, monitoring, and backup/recovery strategies
Hands-on experience provisioning and managing multiple clusters (e.g., EMR, EKS) with infrastructure monitoring, logging, and alerting using Prometheus, Grafana, and Splunk
Strong skills in performance tuning, capacity planning for Hadoop clusters/workloads, memory management, and queue allocation in Hadoop/Cloudera environments
Ability to scale production clusters (18/5 or 24/5), monitor connectivity and security, manage HDFS, and contribute to RCA and change management processes
Requirements:
Implement and administer Hadoop infrastructure, including cluster maintenance, troubleshooting, monitoring, and backup/recovery strategies
Provision and manage the lifecycle of multiple clusters (e.g., EMR, EKS) with infrastructure monitoring, logging, and alerting using Prometheus, Grafana, and Splunk
Performance tuning of Hadoop clusters and workloads, capacity planning at application/queue level, memory management, queue allocation, and ensuring scalability in production environments
Meet SLA targets, ensure changes to production systems are planned and approved via Change Management, collaborate with application teams for OS/Hadoop updates, and maintain central dashboards for system, data, utilization, and availability metrics
Job description
Job Title - Hadoop Admin
Location - Remote
Duration - 12+ Months
Rate - DOE
U.S. Citizens and those authorized to work in the U.S. are encouraged to apply. We are unable to sponsor at this time.
Responsible for implementation and ongoing administration of Hadoop infrastructure.
Responsible for cluster maintenance, troubleshooting, monitoring, and following proper backup and recovery strategies.
Provisioning and managing the lifecycle of multiple clusters such as EMR and EKS. Infrastructure monitoring, logging, and alerting with Prometheus, Grafana, and Splunk.
Performance tuning of Hadoop clusters and Hadoop workloads, and capacity planning at the application/queue level. Responsible for memory management and queue allocation; distribution experience in Hadoop/Cloudera environments.
Should be able to scale clusters in production and have experience with 18/5 or 24/5 production environments. Monitor Hadoop cluster connectivity and security; manage and monitor the file system (HDFS).
Investigates and analyzes new technical possibilities, tools, and techniques that reduce complexity, create a more efficient and productive delivery process, or create better technical solutions that increase business value. Involved in fixing issues, performing root cause analysis (RCA), and suggesting solutions for infrastructure/service components.
Responsible for meeting Service Level Agreement (SLA) targets and collaboratively ensuring team targets are met.
Ensure all changes to the Production systems are planned and approved in accordance with the Change Management process.
Collaborating with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
Maintain central dashboards for all system, data, utilization, and availability metrics.