1. Emerging Operational Approaches
**Agile & Scrum**

Agile is a software development approach that breaks development work into small increments to minimize up-front planning and design. Scrum is an Agile framework designed for small developer teams: work is broken into actions that can be completed within time-boxed iterations called sprints. Agile works very well in projects where more is unknown than known, and it partners closely with the customer at every stage of development.

**Site Reliability Engineering (SRE)**

Site Reliability Engineering is a discipline that applies software engineering and systems automation to operations problems. It aims to create highly available, reliable systems that scale easily, striking a balance between operations and innovation. Its principles are: eliminate toil (manual, low-value tasks), set service-level objectives, align on risk tolerance, and monitor. Best practices in this domain use automation for progressive rollouts, problem detection and rollback.
**DevOps**

DevOps is a software engineering culture and practice that aims to unify software development (Dev) and software operations (Ops). Its main characteristic is strong advocacy of automation and monitoring at all steps of software construction: integration, testing, releasing, deployment and infrastructure management. A key goal for DevOps teams is establishing a high cadence of trusted, incremental production releases.

**Infrastructure & Platform as a Service**

Infrastructure as a Service (IaaS) is fundamentally a concept of providing high-level APIs to reference physical computing resources: location, data partitioning, scaling, security and backup. In a sense, hardware is pulled through software. Combined with Platform as a Service (PaaS), the work of managing hardware is abstracted away: developers are given an "environment" in which the operating system and server software, as well as the underlying server hardware and network infrastructure, are taken care of, leaving them free to focus on the business side of scalability and on developing their product or service.
Across the approaches listed above, it is clear that development is trending towards a modular, quick-deploy style where even hardware is "software-managed". While there will still be a need for classic waterfall deployments, they are expected to become fewer. As such, whether an organization adopts Agile, DevOps and/or SRE, the impact on change management will be a surge in the volume of changes, requiring speedy turnaround to accommodate scaling business needs while still doing its part to ensure resiliency.
2. Cultural Impact on Change Management
We can predict some changes to the operating culture that stem from the emerging approaches:
| Operational Culture Change | Impact on Change Management |
|---|---|
| Development teams will become modular | Teams will structure themselves to cater for modular work. Leadership and decision-making will be delegated closer to development, enabling agility. Quicker deployment will increase the number of change owners/sponsors. |
| The risk level of individual changes will drop | Changes stemming from modular, small-scale, low-risk work may not even qualify for Change Advisory Board review. |
| Wide-participation meetings will fade; deployments will be assumed to be known | With more change owners and more changes, intra-department meetings will replace inter-department meetings, Change Advisory Boards will be considered redundant, and memos/tables in email will become the way each team's work is communicated to the department, and one department's work to the next. |
| An acceptance of failure: fail small and fast | To meet the demands of speed, a level of comfort with failing (albeit small and fast) will grow, likely governed by metrics and corresponding issue resolution. |
| An assumption that all deployments are safe | With a focus on generating business value through quicker releases, developers will be continually enabled to deploy. They will grow to assume (if not need) that a governance process is in place that handles conflict management and resiliency assessment for them, as a service. |
3. Value of Governance
Present-day industry prefers the waterfall model for changes, and more so for infrastructure changes, because of three factors:
- Impact assessment
- Pre-implementation conflict management
- The extent of resourcing that must be pre-aligned in the event of rollback or issues
Considering the speed at which software releases continue to accelerate, there will be a continuing need for conflict management, governance and enterprise visibility (operations); these do not become redundant.
- Even with the growth of cloud Infrastructure-as-a-Service (IaaS) models, infrastructure still needs to be housed on premises and maintained for the foreseeable future, unless the corporation completely outsources hosting.
- Breaking change management into independent unit-level processes not only limits the ability to standardize or assure quality; it loses the end-to-end governance capability that prevents collisions. It further erodes the ability to add value for future resiliency, since the service can no longer observe the develop-to-deploy pipeline and advise on chokepoints.
Therefore, the value of change management can be summarized to these key points:
- End-to-end visibility: from planning through post-implementation
- Governance: conflict management, resiliency assessment
- Business insights: process improvement opportunity areas
In the ITIL framework, change management is defined as part of "Service Transition": the transition of something newly developed into operation. It ensures that standardized methods and procedures are used for efficient and prompt handling of all changes to controlled IT infrastructure, in order to minimize the number and impact of related incidents upon the service.
In the past, the waterfall model placed much of the checking and handling up front. The faster world ahead requires change management to operate as a service to development teams. With the goal of letting them focus on development, governance needs to become heavily automated, pulling in resources in an event-driven way when issues surface.
4. Stakeholder Perspectives

Consider the perspectives of management and technology tower leaders: what output do they seek from change management?
| Stakeholder | Questions asked of change management |
|---|---|
| Management | Why did something fail, and what will stop recurrence? |
| Management | Is there an interesting trend affecting operations? |
| Management | What is the forecast of activity with high impact or high probability of failure? |
| Tower Lead | Are there other tasks that conflict with our plan? |
| Tower Lead | Are there any process changes we need to know about? |
| Tower Lead | How are we performing compared to other towers? |
5. Enabling Change Management 2.0
Considering the perspectives we have reviewed and the cultural change we can anticipate, the change management practice could consider process re-engineering on a few fronts.
**Instant Approval**

Consider the typical change record, written in its usual technical way: "71019 COBS ACFS VE UN-ENCRYPTION", plastered with a technical plan that explains this with more acronyms. Change management tools of record try to make the change owner explain internal and external impact in layman's terms, yet the approvers of a change very often approve with little idea themselves, feeling reassured that others have approved it before them. Moreover, change reports float up to management as a list of approved changes for review, with very rare discussion of the changes (barring a previously failed one or a well-known high-impact activity). We should reconsider where change approval systems add value: identify what fundamentally adds value to the change and what merely delays it. It is possible to move to a model where changes are approved from the start and only delayed if there is a known conflict.
**Risk Level Automation**

The sleeping data in our systems (change history, the configuration management database, and real estate information, to name a few) can be energized into insights that adjust our risk ratings for changes. For example, a low-risk change performed at a location that is steadily shrinking, with a corresponding reduction in site assets and therefore in redundancy (continuity), could actually be a high-risk change for the site if it fails, even if it is a fairly routine activity. Most corporations wait for an annual exercise to revisit risk levels, weighing the importance of different assets to different sites and users. In this new age, can we really wait? Moreover, are we really equipped to do this exercise manually?
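This kind of data-driven risk adjustment can be sketched in a few lines. The sketch below is purely illustrative: the inputs (`site_asset_count`, `site_redundancy`) and the escalation thresholds are assumptions, not a real CMDB schema.

```python
# Hypothetical sketch: escalate a change's static risk rating using live site
# data pulled from the CMDB. Field names and thresholds are assumptions.

def adjusted_risk(base_risk: str, site_asset_count: int, site_redundancy: int) -> str:
    """Escalate risk when the target site has little redundancy left."""
    levels = ["low", "medium", "high"]
    score = levels.index(base_risk)
    if site_redundancy == 0:      # no failover capacity remaining at the site
        score += 2
    elif site_asset_count < 5:    # shrinking site with few remaining assets
        score += 1
    return levels[min(score, len(levels) - 1)]

# A routine low-risk change at a shrinking site with no redundancy
# is escalated to high risk.
print(adjusted_risk("low", site_asset_count=3, site_redundancy=0))  # high
```

In a real deployment, the same logic would run automatically whenever a change record is raised, rather than during an annual review.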
**Conflict Management**

Change Advisory Boards (CABs) have fundamentally been the way corporations de-conflict and prioritize changes for the past couple of decades. Changes are submitted, peer reviewed, reviewed by change managers, and then entered into a CAB meeting. It is tried, tested and successful. Now assume a world where all change records are approved immediately upon submission, and a peer review is the only check that nothing has been missed technically. Given that the subject matter experts have already stated their plans, all that remains is to ensure those plans are not in conflict with another activity. There are many factors here: business restrictions that prevent changes during a specific window, a site power outage, or a separate activity by another tower planned on the same technology asset. If conflict management became a triggered model, where a change is only unapproved when a conflict is found, the entire processing pipeline speeds up dramatically.
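A triggered model can be illustrated with a toy conflict checker. The record fields (`ci`, `site`, `start`, `end`) and the two conflict rules below are assumptions for illustration; a real implementation would query the ITSM tool and CMDB for restrictions, outages and co-scheduled activities.

```python
# Illustrative sketch: every change is auto-approved on submission and only
# revoked ("unapproved") if it collides with another activity. The record
# schema here is an assumption, not a real ITSM data model.
from datetime import datetime

def overlaps(a, b):
    return a["start"] < b["end"] and b["start"] < a["end"]

def conflicts(change, others, site_outages=()):
    """Return reasons to revoke auto-approval; an empty list keeps it approved."""
    reasons = []
    for other in others:
        if other["ci"] == change["ci"] and overlaps(change, other):
            reasons.append(f"same CI as change {other['id']}")
    for outage in site_outages:
        if outage["site"] == change["site"] and overlaps(change, outage):
            reasons.append(f"site outage at {outage['site']}")
    return reasons

c1 = {"id": "C1", "ci": "db-01", "site": "SG1",
      "start": datetime(2024, 3, 1, 22), "end": datetime(2024, 3, 1, 23)}
c2 = {"id": "C2", "ci": "db-01", "site": "SG1",
      "start": datetime(2024, 3, 1, 22, 30), "end": datetime(2024, 3, 2, 0)}
print(conflicts(c2, [c1]))  # ['same CI as change C1']
```

The design choice worth noting is that the check runs as an event (on submission or reschedule) rather than in a weekly board meeting, which is what removes the CAB from the critical path.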
**Predictive Analysis**

Consider the enormous history of change data we hold: can it be used to tell us about the future? For example, is there a likelihood that changes will be deferred because of a lack of resources in a given month? In an ever smaller, connected world of globally distributed teams, where a change is planned in one country and executed by a shift-based team in another, it may not be immediately known that people tend to be out of office due to official or unofficial ad-hoc public holidays or seasonal weather events. Intelligent trending can provide insights into our environment beyond our immediate understanding and save wasted effort.
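As a toy illustration of this kind of trending, the sketch below flags months whose deferral counts spike above the monthly average for the year. The input format and the 2x threshold are assumptions; a production version would draw on the change history in the ITSM tool.

```python
# Hypothetical sketch: surface months where change deferrals spike, hinting
# at recurring resourcing gaps (holidays, weather seasons).
from collections import Counter

def deferral_hotspots(deferrals, threshold=2.0):
    """Return months whose deferral count exceeds `threshold` x the
    monthly average across a 12-month year."""
    by_month = Counter(month for month, _reason in deferrals)
    avg = sum(by_month.values()) / 12
    return sorted(m for m, n in by_month.items() if n > threshold * avg)

history = ([("Dec", "staff on leave")] * 9
           + [("Jan", "change freeze")] * 2
           + [("Jun", "resource clash")])
print(deferral_hotspots(history))  # ['Dec']
```

Here December's deferrals run well above the average, suggesting changes planned for that month should be resourced or scheduled differently in advance.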
**Post Implementation**

In a rapid-deployment world, one that accepts a small amount of failure or potential failure, audits become valuable for verifying that the intended actions of a change were done as planned, without inadvertent modifications to assets that were not part of the original activity. There are tools that can continuously audit the technology landscape for irregularities. This would be incredibly beneficial for quickly identifying inadvertent changes that did not cause an immediate incident but are uncovered downstream, potentially even causing issues in separate activities, with no traceability back to the original event that made the modification.
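One simple form of such an audit is a scope comparison: the CIs a change record declared against the CIs a drift-detection tool observed being modified during the window. The data sources and field names below are assumptions for illustration.

```python
# Illustrative post-implementation audit: flag modifications outside the
# change's declared scope, and declared CIs that were never touched.

def audit_change(planned_cis, observed_modifications):
    """Compare declared scope to observed modifications."""
    unplanned = sorted(set(observed_modifications) - set(planned_cis))
    untouched = sorted(set(planned_cis) - set(observed_modifications))
    return {"pass": not unplanned, "unplanned": unplanned, "untouched": untouched}

result = audit_change(
    planned_cis=["app-01", "db-01"],
    observed_modifications=["app-01", "db-01", "lb-02"],  # lb-02 was not in scope
)
print(result["pass"], result["unplanned"])  # False ['lb-02']
```

An out-of-scope modification like `lb-02` above is exactly the kind of silent change that surfaces downstream without traceability; catching it at audit time restores the link to the originating change.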
The journey to Change 2.0 is further down the road, to be enabled by emerging technologies. In the current cultural shift, however, change management should position the service as a contributor of business insights: one that ensures resiliency and pushes for speed in deployment.
A way to enable this is by reframing the service measurement model. It is an old adage that teams focus on what they are measured against. If change management tried a different set of measures, the service could pull insights that drive risk visibility, performance understanding and business insight for the benefit of the larger organization, thereby enabling process improvement efforts. Moreover, a service's metrics are the proof of its value; to have a front seat in the new age, change must demonstrate the value it can provide to the organization from its end-to-end viewpoint.
- Vision: Stewards of Speed and Resiliency
- Mission: Accelerate the governance pipeline; drive business value from the change management process
| Category | Metric | Description |
|---|---|---|
| Speed | Average time spent to raise a change | From draft to pending-approval. Provides insight into the speed of ITSM users, and into usage. |
| Speed | Average time spent to secure change approval | From pending-approval to ready-to-implement. Provides insight into the speed of ITSM approvers. |
| Speed | Average outage window duration by tower | In an Agile/SRE climate, it is vital that towers do not hold assets for maintenance longer than necessary, to ensure maximum uptime. |
| Speed | % of PDPA vs. non-PDPA changes | The ratio between pre-approved, low-time-consuming changes and regular changes. |
| Accuracy | % fail/success (by technology & incident type) | The overall success/failure rate and severity of issues, organized by technology tower. |
| Accuracy | # of changes accepted first time vs. rejected | In an Agile/SRE climate, time should not be lost to poor planning or multiple review cycles of the same activity. |
| Accuracy | # of changes rolled back | Provides visibility on issues leading to partially completed or fully backed-out changes. |
| Resiliency | # of unauthorised changes | Review of process breaks. |
| Resiliency | % of changes scheduled outside maintenance windows | A maintenance window is a pre-scheduled time when a system can be taken offline for maintenance; outside it, the risk level increases. |
| Resiliency | % of changes that cause incidents | Implemented changes that caused incidents, relative to all implemented changes within a time period. A prerequisite for measuring this KPI is that incidents are correlated to changes. |
| Resiliency | % of incidents caused by changes vs. total incidents | Perspective on change-related incident volume in comparison to the wider set of incidents. |
| Trend | Total number of changes, by tower, by status, by year | Provides visibility on work volume as a year-to-year trend: # executed successfully, # failed, # rolled back. |
| Trend | Ratio of incidents to changes | The acceptable failure percentage. |
| Process Time | Benchmarked-change time reduction | Year-on-year comparison of whether a benchmark change (e.g. a regular event such as an AIX frame upgrade) is completed faster or slower. |
| Process Time | Most deferred changes | Visibility on the roadblocks that prevent changes from being implemented as scheduled. |
| Availability | Planned outages due to changes | Outage (unavailability) due to implementation of planned changes, relative to service hours. |
| Availability | Unplanned outages due to changes | "Unplanned" means the outage (or part of it) was not planned before implementation of the change. |
| Configuration Items | CI with the most change-related incidents | Which assets have had the most failed implementations? |
| Configuration Items | CI with the most change rollbacks | |
| Configuration Items | Most touched CIs | CIs with the most associated changes. Provides visibility on critical assets and insight into protecting them. |
| Site | Site with the most change-related incidents | Which sites have had the most failed implementations? |
| Site | Site with the most change rollbacks | |
| Site | Most touched site | The site with the highest change volume. Provides visibility on the organization's busiest (and inherently important) sites. |
| Technology | # of failed changes by technology | Shows which technology areas have been less successful than others. |
| Technology | # of rolled-back changes by technology | |
| Process | Number & type of emergency changes | Trends in emergency changes. |
| Process | # of backlogged changes (by technology) | The change backlog. |
| Process | Same change, repeated failure (rollback and incident) | Changes that were unsuccessful on multiple occasions. |
| Process | % of time spent coordinating changes | Percentage of time (in labor hours) used to coordinate changes, relative to all time used to implement (and coordinate) changes. |
| Audit | # of audited changes (pass/fail) | Changes inspected via weekly random selection or via Evolven to verify that CI/time information matches between the change record and the actual implementation. |
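Two of the resiliency KPIs above can be computed directly once incidents are correlated to changes, which the table itself notes as a prerequisite. In the sketch below, the `caused_by` field is an assumed correlation attribute, not a specific tool's schema.

```python
# Minimal sketch of two KPIs: "% of changes that cause incidents" and
# "% of incidents caused by changes". Assumes each incident record carries a
# `caused_by` field linking it to a change (an assumption, not a real schema).

def change_kpis(changes, incidents):
    caused = {i["caused_by"] for i in incidents if i.get("caused_by")}
    pct_changes_causing = 100 * len(caused & set(changes)) / len(changes)
    pct_incidents_from_changes = 100 * sum(
        1 for i in incidents if i.get("caused_by")) / len(incidents)
    return round(pct_changes_causing, 1), round(pct_incidents_from_changes, 1)

changes = ["C1", "C2", "C3", "C4"]
incidents = [{"id": "I1", "caused_by": "C2"},
             {"id": "I2", "caused_by": None},
             {"id": "I3", "caused_by": "C2"}]
print(change_kpis(changes, incidents))  # (25.0, 66.7)
```

Note that the two percentages answer different questions: the first measures how risky changes are, while the second measures how much of the incident load is self-inflicted by the change process.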