Designing Resilient Systems for Highly Available Operations

S1 E010 | DEVOPS | Dec 15, 2023

About Speaker

Suresh Kumar Khemka
Head of Platform Engineering and Infrastructure, apna

Suresh is an accomplished engineering leader with 18 years of experience in Platform Engineering, DevOps, SRE, Security, and Performance Engineering. His expertise lies in building and leading high-performing teams that deliver scalable, reliable, and secure solutions. He has a proven track record of driving innovation, delivering results, and exceeding expectations, whether working with Fortune 100 companies or startups. As a thought leader in his field, Suresh is passionate about sharing his knowledge and mentoring the next generation of engineers.

Currently, as the leader of Engineering Platform at Apna, he is scaling the platform to support explosive growth in both the business and the engineering teams by customizing DevOps, SRE, and Platform Engineering approaches. In his previous roles at Target/EY, he enabled products and services for five nines (99.999%) of availability through SRE practices, application and cloud performance engineering and tuning, observability and monitoring, and Ops strategy. Suresh’s passion for SRE and Performance Engineering drives him to enable applications and products for high performance, scale, and reliability in domains such as e-commerce, retail, telecom, manufacturing, logistics, and banking.

As a DevOps and SRE evangelist, he works with teams to ensure highly scalable and reliable operations using his deep expertise in SRE, Performance Engineering and Testing, Reliability Engineering, Business and infrastructure monitoring, through a methodical and concrete approach for DevOps/SRE Adoption. During his 6-year stint at WalmartLabs, he played an instrumental role in defining best practices for Performance Optimization & Tuning of the full software stack for Walmart's migration to OpenStack Private cloud and Service-oriented architecture. He led operations for Walmart.com's Supply chain, helped launch their two-day delivery capabilities, and improved on-time shipment metrics to over 96.6% despite a 300% increase in shipments. Additionally, he managed Walmart Stores' Online grocery migration as it grew from one to 1,300 stores and maintained a consistent Net Promoter Score by implementing a comprehensive Monitoring/Operations strategy with proactive problem identification through analytics and introducing self-healing capabilities within the software systems.

With his diverse experience across industries, Suresh is well placed to lead high-performing teams, deliver results, and drive innovation in any engineering environment.

About Host

Sathya Narayanan Nagarajan

Co-founder and CTO, Amnic

Sathya is an experienced technologist with over two decades in Artificial Intelligence (AI), Electric Vehicles (EV), and Distributed Systems. As the Co-founder and CTO of Amnic, he drives the development of a cloud intelligence platform, emphasizing efficiency, cost reduction, and reliability. Sathya's leadership spans roles at Ola Electric Mobility, Ola Cabs, Yahoo, and many other internet companies. With 11 patents in AI, EV, and Distributed Systems, he is committed to knowledge sharing and guiding industry thought leaders.

Summary of Podcast

In this podcast, Suresh, an infrastructure and performance engineering expert, shares his diverse career journey that involved building tools for monitoring systems, optimizing infrastructure, and transitioning to the cloud. He discusses how the industry's fast pace of change requires continuous learning and adaptation, leading him to roles in DevOps, site reliability engineering, and automation. Suresh also touches on the importance of reducing cognitive load for developers to help them focus on building features.

He shares his experience implementing a co-pilot system, Gen, to address developer challenges with documentation and infrastructure-related issues. Another speaker from Sterling, a software company, shares their experience using machine learning to match job seekers with jobs. They describe the adoption challenges they faced and discuss the importance of building trust, demonstrating value, and starting small with select teams for successful implementation.

The podcast emphasizes the importance of immediately solving problems and building trust when implementing new technologies or applications, with a specific focus on Kubernetes. It highlights the need for proper setup, auto-scaling, and observability for improved reliability.

About Amnic

Amnic is a cloud cost observability platform that helps businesses measure and rightsize their cloud costs. Amnic helps businesses visualize, analyze, and optimize their cloud spend, in turn building a lean cloud infrastructure. Amnic offers out-of-the-box solutions that help break down cloud bills and provide greater visibility into and understanding of cloud costs, along with recommendations to lower spend, alerts, and anomaly detection.

Amnic delivers a wide range of features including K8s visibility, cost analyzer, alerts and custom reporting, budgeting, forecasting, and smart tagging. DevOps and SRE teams rely on Amnic to deliver a simplified view into their cloud costs, allowing them to maintain governance and build a culture of cost optimization. Set up in 5 minutes and get a 30-day free trial.

Visit www.amnic.com to get started.

Build a culture of cloud cost optimization

Build a culture of cloud cost observability
