A site reliability engineer, looking to make sense out of chaos while driving necessary change to help build a reliable platform.
AWS, GCP, OpenStack
Linux, Python
Docker, Kubernetes
GitHub, GHA, Argo
ELK Stack, Grafana Stack, Honeycomb, BigPanda, OpenTelemetry
Jira, Confluence, PagerDuty, Slack, Zendesk
Maintaining and scaling a high-performance on-premises observability platform built on the Grafana stack and OpenTelemetry architecture, deployed as Kubernetes workloads. Collaborating with service teams to define meaningful alerting strategies and develop actionable dashboards that drive operational visibility. Ensuring the reliability, scalability, and efficiency of the observability ecosystem through continuous design improvements, performance tuning, capacity planning, and automation. Applying SRE best practices such as SLI/SLO definition, incident analysis, and proactive monitoring to enhance system resilience and reduce operational toil.
Building a reliable and highly available product through active engagement in incidents and postmortems, and guiding the team towards an SRE transformation by fostering a mindset focused on triage and root cause analysis. Additionally, worked as an SRE Product Owner within the ANZ region, aiming to infuse agile methodology into daily operations via project work. Managed and prioritized the product backlog, collaborated with engineering teams to define and implement SLOs/SLIs, and onboarded alerts onto the existing observability platform alongside daily operational tasks.
Projects : Service team SLO/SLI engagement, Mean Time to Detect reduction via alerting, Team SRE Bot.
Being an SRE evangelist, handling incidents as a first-line responder and spearheading blameless incident postmortems. Assisting service teams in adopting best practices for deployments, alerting and metrics while managing tooling and maintenance of the public cloud infrastructure.
Projects : Service Improvements - Observability setup
Ensuring customer success through incident management and maintaining client usage metrics, while assisting the QA team in pre- and post-deployment testing and advocating for the adoption of SRE practices in metrics and observability. Maintaining the hybrid cloud environment while developing good processes as well as maintaining product documentation.
Projects : Client Usage Metrics and cloud costing
Major : Digital Communication and Networking
School : The Oxford College of Engineering
Major : Electronics and Communication
School : New Horizon College of Engineering
Received this recognition from my managers for putting customers first while always looking to improve how our team works.
Issued by the manager of service health for my curiosity and zest to learn.
Received this award for working on building the SRE team while maintaining customer success with all clients from onboarding to exit.
Received this award for building the operations (NOC) team from the ground up, including setting processes in place and training team members.