Asset manager Vanguard and global bank Morgan Stanley are trying to carefully balance their software development and operations functions as they make a large-scale transition to the cloud.
Vanguard has been going through what it calls an iterative transformation, from managing 2,000 of its own servers in 2015 to running mostly on Amazon Web Services (AWS). As a result, its 7,000 developers have been moving from updating monolithic applications on a quarterly cycle to maintaining a set of microservices that are built and run by discrete teams.
Those teams are now supported by a centralized platform team that provides standardized CI/CD pipelines and infrastructure for their code to land in, with site reliability engineering (SRE) oversight both centrally and embedded across those teams.
Morgan Stanley started its agile and cloud transformation in 2018, and has more closely aligned with Microsoft Azure.
The initiative began with a three-year training effort to establish modern devops and SRE practices across the bank’s 15,000 technologists. That program hinged on what Gus Paul, executive director for application infrastructure at Morgan Stanley, identified as three key areas: “Accelerate software development and delivery; increase predictability, frequency, and quality of change; and revolutionize how we operate technology,” he said during a presentation at the Devops Enterprise Summit.
Today, Morgan Stanley has “agile teams with a product owner, engineers with dev and ops expertise, and they could be targeting on-premises or cloud infrastructure,” Trevor Brosnan, head of devops and enterprise technology architecture at Morgan Stanley, told InfoWorld. “My philosophy is everyone has specializations; we all have a superpower in technology.”
Changing well-established build and run behaviors will always be challenging for organizations as big, complex, and cautious as Vanguard and Morgan Stanley. That hasn’t stopped them from trying to tread the line carefully between giving developers the ability to move faster and maintaining the level of control expected from firms that manage billions or even trillions of dollars. These are companies and cultures that don’t tolerate risk or downtime.
Flexibility and risk management
Christina Yakomin is a site reliability engineer at Vanguard, where she is part of a team that supports business-aligned developer teams. Her team sets and enforces certain deployment controls by running what she calls “shared service platforms,” such as standardized CI/CD pipelines and cloud infrastructure platforms.
This helps give the risk-averse financial services company confidence that certain controls are being enforced at the deployment stage, while also reducing repeated work across different development teams, “so that every team doesn’t have to reinvent the wheel,” she told InfoWorld.
Yakomin has clearly taken a leaf from streaming giant Spotify’s playbook and the cloud-native concept of providing “golden paths” for developers to follow. “We have found that because of how complex the necessary controls are to build applications in this industry, we strive to pave the standard path with gold, while also making sure it is open to deviation,” she said.
Due to the strict level of control required, however, Yakomin says most developers tend to stick to the golden path. If teams do deviate to a new technology or technique, they instantly become responsible for it.
Despite having a similar structure, Morgan Stanley takes a different approach to managing risk when deploying into production. Previously, this would require a developer to toggle between three separate Jira instances, file a change ticket, and follow 81 steps to get even one line of code approved. Now, the bank has started to adopt modern infrastructure as code and CI/CD practices to streamline that process in pockets across its diverse developer teams, with a central team responsible for encouraging and incentivizing other teams to follow suit.
On top of this, the bank built an automated risk calculator, which assesses each change and assigns a risk score. Changes that come in below a certain threshold can be deployed using an automated pipeline; those that come in above the threshold fall back to a more manual approval process.
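The threshold-based routing described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Morgan Stanley’s calculator: the article does not say which factors the bank scores or where the threshold sits, so the Change fields, the weights, the RISK_THRESHOLD value, and the choice to send a score exactly at the threshold to manual approval are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the real threshold is not given in the article.
RISK_THRESHOLD = 50


@dataclass
class Change:
    """Hypothetical description of a change; these fields are assumptions."""
    lines_changed: int
    touches_production_data: bool
    has_automated_tests: bool


def risk_score(change: Change) -> int:
    """Assign an illustrative risk score; all weights are invented."""
    # Larger changes score higher, capped at 40 points.
    score = min(change.lines_changed // 10, 40)
    if change.touches_production_data:
        score += 40
    if not change.has_automated_tests:
        score += 30
    return score


def route(change: Change) -> str:
    """Send low-risk changes to automation, everything else to manual review."""
    if risk_score(change) < RISK_THRESHOLD:
        return "automated_pipeline"
    # Assumption: a score equal to the threshold also needs manual approval.
    return "manual_approval"


small_fix = Change(lines_changed=20, touches_production_data=False, has_automated_tests=True)
big_migration = Change(lines_changed=500, touches_production_data=True, has_automated_tests=False)

print(route(small_fix))       # automated_pipeline
print(route(big_migration))   # manual_approval
```

Keeping the score and the routing decision in separate functions means the threshold can be tuned without touching the scoring rules, which is one plausible way to let a central team adjust risk appetite over time.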
The SRE safety blanket
Inserting an SRE safety blanket at both the central operations level and within individual developer teams has helped build confidence at both Vanguard and Morgan Stanley that they are striking the right balance between developer velocity and operational stability.
However, this function risks once again separating concerns and creating a disconnect between dev and ops.
“It’s a nuanced problem to solve,” Yakomin said. “Introducing SRE does make people feel like we are siloing ops again into that role.”
Similarly at Morgan Stanley, establishing SRE principles is “sometimes misunderstood as a rebranding of the ops team,” Brosnan said.
Rather than separating dev and ops, Yakomin wants to encourage Vanguard developers and operations specialists to share responsibility for security and ensure that teams with shared platforms take full operational responsibility for them.
Robbie Daitzman, a senior manager for the intermediary technology platform at Vanguard, said they were able to overcome this problem by “creating a rallying cry to centralize around certain platforms.” Centralization benefits engineers “by balancing cognitive load and implementing the shared responsibility model,” he said.
Similarly, at Morgan Stanley, Brosnan sees “SRE as crossing both dev and ops and the whole development lifecycle.” For example, toil, which SRE practice fundamentally aims to eliminate, is typically felt most keenly by operations specialists, but developers are well suited to automating those tiresome tasks away. Likewise reliability, a core SRE concern, also falls to developers, who have a responsibility to architect their applications “to be resilient at their core,” Brosnan said.
Building resilient, observable systems
The central SRE team at Vanguard is also responsible for ensuring its various systems are resilient and observable.
Yakomin and Daitzman both formerly worked on the chaos engineering team at Vanguard. Chaos game days and fire drills were already key to validating the resiliency of new systems at the company.
Vanguard also moved from alert-only visibility of its core systems to a combination of Amazon CloudWatch, Honeycomb’s cloud-native monitoring, and the open source OpenTelemetry standard for collecting metrics, logs, and traces.
“Observability in SRE has been a quality-of-life thing for engineers, to help understand if we are in a good condition or negatively impacting clients,” Daitzman said. “It also helps claims of innocence within that shared responsibility model.”
On top of these shared observability metrics, Vanguard has built out a set of homegrown dashboards, which can be tweaked by each developer team to suit their needs.
However, that hasn’t stopped teams from clamoring for the latest and greatest observability platform to lay on top of this infrastructure. “Every team wants different things and if we had all that it would be very expensive,” Yakomin said.
Seeking the right balance
Despite all this progress, Yakomin admits that her team at Vanguard is still trying to strike the right balance between efficiency and flexibility for its developers.
Her plan is to make sure that everyone gets the training they need to transition to the new shared responsibility model, while still having the capacity to work on delivery and to carry out accurate, blameless post-incident reviews. Lastly, she wants to make it easier for developer teams to safely experiment and deviate from the golden path where it is deemed worthwhile.
For Brosnan at Morgan Stanley, “you are never really done.” He vows to continue to “focus on sustaining that community momentum, to help make this a permanent part of the culture.”