Site Reliability Engineer-III - Innovaccer
- Job Title: Site Reliability Engineer-III
- Job Location: Noida, Uttar Pradesh, India
Job Description
As a Site Reliability Engineer-III (Cloud Engineer), you will be responsible for both operations/support and site reliability of our platform. Our services are offered as a Managed Service/SaaS, so total ownership of the solution, including securing it and keeping it up and running, remains with us. The platform is part of a critical healthcare application that must stay available 24x7x365, so the stakes are high. We are seeking an experienced engineering team member who brings a combination of Ops/Support and site reliability experience and, most importantly, a “can-do” attitude and a strong sense of ownership.
A Day in the Life
- In this role, you will be responsible for the various pillars of SRE: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, Cost, etc.
- Lead production rollout of new releases and emergency patches using CI/CD pipelines, and continuously improve those pipelines. Establish a solid production promotion/change management process with a solid quality gate, working across Dev/QA teams.
- Roll out a solid observability stack across the components of the tech stack to proactively detect outages and service degradation, and tell the two apart, before the customer notifies us.
- Apply strong analytical skills to understand production system metrics, drive change, optimize system utilization and drive cost efficiency.
- Autoscale the platform up and down to handle peak-season scenarios.
- Understand the end-to-end platform architecture and how to perform triage/RCA quickly and effectively using data points from the observability tool chain.
- Work towards reducing the number of alerts/escalations to the next-level teams (Dev/DevOps).
- Lead the monthly operations review with the executive team. Examples include, but are not limited to: Platform/Application/Infrastructure KPIs (uptime), RCA, CAP (Corrective Action Plan) and PAP (Preventive Action Plan), security reports, and audit reports.
- Operate and manage production and staging cloud platforms, covering both Ops (executing and automating runbooks/SOPs, maintaining uptime/SLA) and site reliability engineering.
- Collaboration is key to this role: work across a spectrum of teams (Dev, DevOps, QA, Customer Success, etc.) to derive RCA/5 Whys analyses and drive product improvements.
- Ensure that the platform is secured as per guidelines established by the CISO, e.g., protect against DDoS attacks by implementing a WAF, perform vulnerability and patch management, and install required security agents.
- Lead least-privilege RBAC for the various production services and tool chain.
- Build and execute the Disaster Recovery plan.
- Participate as a key stakeholder in case of IR (Incident Response).
What You Need
- A minimum of 7 years of solid experience with at least one cloud (AWS, Azure, or GCP) with an automation focus is a MUST. Certification is an advantage.
- Experience building reliable, scalable, and performant systems in production. This requires significant engineering experience and risk evaluation.
- Log/Metrics/Tracing tool chain experience is a MUST, along with strong analytical skills to interpret various data points, understand platform behavior, and perform RCA.
- Hands-on experience with Kubernetes and Linux is a MUST.
- Programming experience with scripting languages, e.g. Python, is a MUST.
- Must be good at writing and structuring documents, whether process documentation or RCAs.
- Experience with ticketing systems and incident management is preferred.
- A security background and a security-first mindset are preferred.
- Experience with CI/CD pipelines and tool chains is preferred.
- Hands-on experience with a few of these is preferred: Kafka, Postgres, Snowflake, etc.
- Must be able to keep a cool head under pressure and avoid taking shortcuts during production issues.
- Collaboration is critical to this role. Possesses excellent verbal and written communication skills and the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts.
- Strong cross-functional collaboration skills, relationship-building skills, and the ability to achieve results without direct reporting relationships.
- Ability to quickly identify and drive to the optimal solution when presented with a series of constraints.
- Excellent judgment, analytical thinking, and problem-solving skills.
- Self-motivated individual who possesses excellent time management and organizational skills.
- Strong sense of personal responsibility and accountability for delivering high-quality work.
- MultiCloud - AWS, Azure, GCP
- Distributed Compute - Kubernetes (EKS/AKS), Containerization
- Persistence stores - Postgres, MongoDB
- Data Warehousing - Snowflake, Databricks
- Messaging - Kafka
- CI/CD - Jenkins, Argo CD, GitOps
- Observability - Elasticsearch, Prometheus, Grafana, Jaeger, New Relic, etc.
What We Offer
- Industry-Focused Certifications: Meet leading healthcare experts, discuss innovative strategies, and become a subject matter expert with our comprehensive set of certifications.
- Rewards and Recognition: Feeling like you’re outperforming on your projects? Get recognition for your dedicated efforts and demonstrated work ethic.
- Health Insurance and Mental Well-being: We offer health benefits and insurance to you and your family for hospital-related expenses pertaining to any illness, disease, or injury. We also have Employee Assistance Programs (EAPs) to give you 24x7 access to certified therapists and psychologists.
- Sabbatical Leave Policy: Do you want to focus on skill development, pursue an academic career, or just reset? We’ve got you covered.
- Open Floor Plan: Cubicles are a thing of the past. To modernize our office space, we have open-floor seating at every office location. Share ideas with your peers and bond better in an open-floor office where there are no barriers and you are inspired to be creative.
- Paternity and Maternity Leave: Enjoy the industry’s best parental leave policy to welcome your bundle of joy and enjoy quality time with them.
Innovaccer Headquarters Location
San Francisco, CA
Innovaccer Company Size
Between 1,000 - 5,000 employees