Staff Site Reliability Engineer

  • The Hartford
  • Chicago, Illinois
  • Full Time
Staff Reliability Engineer - IE07KE

We're determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals and to help others accomplish theirs, too. Join our team as we help shape the future.

The Hartford's CARE - RE&A Organization is seeking an experienced and highly motivated Staff Reliability Engineer to lead infrastructure engineering initiatives, drive AI-powered automation, and integrate Generative AI (GenAI) into reliability engineering.

This role will have end-to-end accountability for the reliability of IT services within a defined application portfolio and for building scalable, self-healing infrastructure by leveraging cloud-native architectures, predictive analytics, and AI-driven automation. The engineer will design and implement AI-powered observability solutions, intelligent incident response, and automated remediation strategies to proactively prevent failures and enhance service resilience.

Successful candidates will have expertise in infrastructure engineering, software reliability, and AI-driven automation while demonstrating strong problem-solving skills and leadership in cross-functional, AI-powered site reliability engineering (SRE) initiatives.

Responsibilities:

Guide the use of best-in-class software engineering standards and design practices for instrumenting code/application technology stack to enable the generation of relevant metrics on overall technology health - availability, performance, quality, currency and resiliency. Serve as key liaison between the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.

Function as the go-to technical expert for the applications supported, requiring depth and breadth of knowledge in technologies, applications, integration, interfaces and business domain.

IT Ops Responsibilities:

  • Ensure operational excellence. Independently drive the triaging and service restoration of all high-impact incidents to minimize the mean time to service restoration and the impact to the business. Demonstrate end-to-end ownership.

  • Partner with infrastructure teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities, and automated service restoration processes. Take proactive measures to prevent high-impact incidents.

  • Architect, build, and maintain highly available, scalable, and fault-tolerant infrastructure in cloud environments (AWS, GCP, Azure).

  • Implement observability solutions using tools like Splunk, Dynatrace, CloudWatch, Prometheus, Grafana, and OpenTelemetry to enhance visibility into system health.

  • Lead capacity planning, performance tuning, and incident response processes across distributed cloud-native architectures.

  • Develop self-healing mechanisms using AI/ML models to predict and mitigate infrastructure failures before they impact production.

DevSecOps Solution Responsibilities:

  • Develop effective tooling, alerts, and response mechanisms to identify and address reliability risks leveraging automation to support problem prevention, detection, mitigation, and resolution.

  • Progressively implement preventative controls and drive increased automation and self-healing capabilities. Continue to improve cost efficiency baselines.

  • Design and develop infrastructure as code (IaC) solutions using Terraform, CloudFormation, and CDK.

  • Implement CI/CD pipelines and enforce DevSecOps best practices for secure, compliant, and scalable deployments.

  • Promote and implement innovative solutions.

  • Automate infrastructure provisioning, configuration management, and remediation workflows using Python or Bash scripting.

Generative AI & Intelligent Operations Responsibilities:

  • Integrate Generative AI models into infrastructure operations to enhance incident detection, root cause analysis, and automated remediation.

  • Develop AI-powered chatbots or copilots to assist with troubleshooting, log analysis, and predictive maintenance.

  • Utilize LLMs and Vector Databases for intelligent automation in site reliability workflows.

  • Research and implement AI-driven anomaly detection to proactively identify risks and performance bottlenecks.

Qualifications:

  • 10+ years of experience in Infrastructure Engineering, Site Reliability Engineering (SRE), or DevOps.

  • Bachelor's degree or equivalent work experience in Computer Science, Information Technology Management, or a related degree

  • Ability to interact with diverse technical and non-technical groups in a matrix organization

  • Solid understanding of SAFe Agile methodologies

  • Familiarity with programming languages (Python, Java or JavaScript/Node.js)

  • Expertise with cloud platforms like AWS and microservices architecture

  • Hands-on experience with observability tools such as Dynatrace, Splunk, CloudWatch, and CloudTrail

  • Hands-on experience with continuous integration and DevOps methodologies and tools such as GitHub, Jenkins, and Nexus

  • Hands-on application development and production support experience is a plus

  • Hands-on experience with AI/ML frameworks, including Generative AI models for infrastructure automation.

  • Experience with AI-driven reliability engineering solutions is a strong plus.

  • Ability to develop, manage, and communicate frameworks (e.g., Cloud Security Alliance)

  • Solid understanding of technologies that support the services offered for cloud applications

  • Excellent analytical and problem-solving skills

  • Must have exceptional communication skills (written, oral, presentation and facilitation)

This role will have a hybrid work schedule, with the expectation of working in an office (Columbus, OH; Chicago, IL; Hartford, CT; or Charlotte, NC) 3 days a week (Tuesday through Thursday).

Candidates must be authorized to work in the US without company sponsorship. The company will not support the STEM OPT I-983 Training Plan endorsement for this position.

Compensation

The listed annualized base pay range is primarily based on analysis of similar positions in the external market. Actual base pay could vary and may be above or below the listed range based on factors including but not limited to performance, proficiency, and demonstration of competencies required for the role. The base pay is just one component of The Hartford's total compensation package for employees. Other rewards may include short-term or annual bonuses, long-term incentives, and on-the-spot recognition. The annualized base pay range for this role is:

$126,160 - $189,240

Equal Opportunity Employer/Females/Minorities/Veterans/Disability/Sexual Orientation/Gender Identity or Expression/Religion/Age


Job ID: 469238223
Originally Posted on: 3/14/2025
