SHRUTHI NAIDU – Online Resume

SHRUTHI NAIDU
Site Reliability Engineer
Email : shruthinaidu91@gmail.com
Phone : 0450392123
LinkedIn

Profile

A site reliability engineer looking to make sense of chaos while driving the change needed to build a reliable platform.

Tech Delivery

Leadership — Stakeholder Management | Mentorship | Incident Command
Execution — Agile/Scrum | Tool Optimization
Strategy — Product Roadmapping | Cloud Governance

Tech Stack

Cloud — AWS | OpenStack
OS and Software — Linux | Python Scripting
Container Management — Docker | Kubernetes | Helm
GitOps — GitLab | GHA | Argo
Observability and Alerting — ELK Stack | Grafana Stack | Honeycomb | OpenTelemetry | incident.io
Collaboration Tools — JIRA | Confluence | Backstage | Zendesk
AI — LiteLLM | MCPs | OpenCode
Other Tools — Kafka

Professional Experience

SRE, FDJ

06/2025 – Present | Sydney, Australia

•Led and contributed to a multi-quarter migration initiative to modernize cluster management, streamlining global observability operations and ensuring the platform's long-term scalability.
•Managing and scaling a high-performance on-premise observability platform built on the Grafana stack and OpenTelemetry architecture, deployed as Kubernetes workloads.
•Partnered with diverse service teams to translate business requirements into actionable alerting strategies and SLO-driven dashboards, directly improving Mean Time to Detect (MTTD) across the engineering org.
•Improving platform reliability, scalability, and efficiency through continuous optimization, performance tuning, capacity planning, and automation.
•Applying SRE best practices, including SLI/SLO definition, incident analysis, and proactive monitoring, to increase system resilience and reduce operational toil.

Projects : 

Infrastructure Modernization (ArgoCD): Orchestrated a large-scale migration to GitOps-based deployment (Argo), standardized delivery pipelines, and improved deployment frequency and governance.

Service Level Excellence Program: Led a company-wide initiative to engage service teams in defining SLIs/SLOs, bridging the gap between technical metrics and customer-facing reliability goals.

AI Observability:  Built observability for custom AI hosting workflows using LiteLLM and OpenTelemetry, enabling visibility into request latency, token usage, error patterns, and service health.

Site Reliability Engineer, Workday

03/2022 – 06/2025 | Sydney, Australia

•Spearheaded the integration of Agile methodologies into infrastructure operations, transitioning the SRE team from reactive firefighting to a structured, sprint-based delivery model.
•Cultivated a high-performance reliability culture by leading incident response and post-mortem rituals, shifting the team’s focus toward systematic root-cause analysis (RCA) and long-term remediation.
•Functioned as the SRE Product Owner, managing and prioritizing a complex technical backlog. Successfully balanced high-impact project work with operational stability to meet regional business objectives.
•Managed and prioritized the product backlog for work with engineering teams to define and implement SLOs/SLIs and to onboard alerts onto the existing observability platform, alongside daily operational tasks.

Projects : 

ANZ Reliability Framework (SLO/SLI Engagement): Led a regional program to standardize reliability metrics across disparate service teams, creating a unified language for system health and performance.

SRE Automation Program (Team Bot): Acted as Product Owner for an internal SRE automation bot designed to reduce cognitive load and automate repetitive operational workflows (toil reduction), and managed the roadmap and execution of the initiative.


Site Reliability Engineer (Remote), BlockFi

04/2021 – 12/2021 | Bangalore, India

•Acted as the primary Incident Commander and SRE Evangelist, institutionalizing a culture of blameless post-mortems and systematic root-cause analysis (RCA) to drive continuous organizational learning.
•Partnered with cross-functional service teams as a subject matter expert to architect robust deployment pipelines, standardized alerting frameworks, and reliability metrics, ensuring alignment with global best practices.

Projects : Service Improvements - Observability setup

Site Reliability Engineer, Criterion Networks

07/2019 – 03/2021 | Bangalore, India

•Designed and documented scalable operational workflows and "Standard Operating Procedures" (SOPs) that standardized hybrid cloud management and improved cross-team onboarding efficiency.
•Orchestrated collaborative testing strategies with QA teams, integrating SRE-driven metrics into pre/post-deployment phases to significantly reduce production escape rates.
•Acted as a strategic partner for key clients, leveraging incident management data and usage metrics to ensure service delivery met contractual obligations and customer expectations.

Projects : Client Usage Metrics and cloud costing

Engineer 1 (Systems Engineer), JCPenney Services India

12/2016 – 06/2019 | Bangalore, India

Provided L1 support for issues from 450+ store locations in the US, data centers, and international buying offices (IBOs), handling hardware installation and troubleshooting. Led the operations center (NOC) team on daily task allocation, incident queue management, process documentation, and root cause analysis.

Courses

KCNA, CNCF

08/2024 – 09/2024

Terraform Associate, ACG

06/2023

Google Cloud Professional Certificate: Cloud Engineer

10/2021

Projects

Fun, Supe-story-project

A simple Dockerized Python app that gives you a random superhero story.

Self Hosting, BookStack

A wonderful note keeper spun up locally using Docker Compose.


Education

Master's Degree (MTECH)

Major : Digital Communication and Networking
School : The Oxford College of Engineering

09/2014 – 06/2016

Bachelor's Degree (BTECH)

Major : Electronics and Communication
School : New Horizon College of Engineering

08/2009 – 06/2013

Interests

Singing | Sport enthusiast | Amateur guitarist | Part of the Toastmasters fraternity | Attending tech conferences and webinars

Achievements

Recognised for : Customer Support and Employees, Workday

Received this recognition from my managers for putting customers first while always looking to improve how our team works. 

09/2024

Recognised for : Innovation, Workday

Issued by the service health manager for my curiosity and zest to learn.

04/2024

Employee of the month, Criterion Networks

Received this award for working on building the SRE team while maintaining customer success with all clients, from onboarding to exit.

04/2020

Warrior Spirit Award, JCPenney

Received this award for building the operations (NOC) team from the ground up, including putting processes in place and training team members.