

Python Automation
Key Responsibilities
• Develop Python-based automation solutions to streamline on-prem and cloud infrastructure management on GCP and Kubernetes.
• Continuously identify and implement opportunities to improve operational excellence.
• Build proactive and innovative solutions that can scale.
• Implement and manage configuration automation using Ansible (desirable).
• Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
• Enhance deployment reliability by implementing automated chaos strategies, failover mechanisms, and self-healing infrastructure.
• Develop proactive monitoring and alerting solutions using tools like Splunk, GCP Operations Suite, Grafana, and Prometheus.
• Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
• Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently under high loads.
• Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
• Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
• Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.
Required Skills & Experience
• Strong background in Systems Engineering with a focus on automation and reliability.
• Proficiency in Python (intermediate to expert level) for developing automation and integrations.
• Hands-on expertise with Kubernetes and cloud platforms (GCP or any major cloud).
• Experience integrating various tools and platforms via APIs and client libraries.
• Deep understanding of monitoring and alerting using Splunk, GCP Operations Suite, Grafana, and Prometheus.
• Ability to work in fast-paced, high-stakes environments where reliability and uptime are critical.
• Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
• Experience with Ansible for infrastructure automation.
• Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
• Enthusiasm for AI/ML and AIOps, with a desire to apply them in automation and operations.