Friday Basics: RPO and RTO

I did a post last month titled RTO and RPO are myths unless you’ve tested recovery, but I only briefly covered what RPO and RTO are. This post goes into a bit more explanation of those acronyms and their importance. First, let’s go back to my working definitions:

  • Recovery Point Objective (RPO) – How much data can you afford to lose.
  • Recovery Time Objective (RTO) – How much time do you have to get the system back on-line.

Recovery Point Objective (RPO)

We measure RPO in terms of time, not data size. That’s why some folks, when first learning about the two, confuse RPO and RTO: both are measured in time. RPO is focused on data loss. At what point in time does the amount of data loss exceed what the organization considers acceptable? This is exactly what it sounds like: a business decision.

What is acceptable is going to be based on cost versus loss. We can get to near zero data loss in a recovery situation, even a disaster recovery situation, if the organization is willing to pay for it. However, the closer to zero we get, the more it is going to cost. For instance, having Always On Availability Groups with more than one secondary replica requires Enterprise Edition. Why would you need more than one secondary replica? If you wanted a replica in the primary data center as well as at least one replica in another data center, so that you cover HA both locally and across data centers, that’s a minimum of two secondary replicas. Basic Availability Groups are out, and Enterprise Edition with Always On Availability Groups is required. Enterprise Edition is much more expensive.

Backups also have to be more extensive in the event that something goes wrong with the current installation. That requires more storage and more compute. Also, if we have an availability group across data centers, we need to minimize latency, meaning the network connectivity has to be top notch and sufficient for all traffic going across that line. That’s a cost, too. So the business has to weigh the cost against the amount of loss and determine what it’s willing to lose versus what it’s willing to pay.
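To make the time-based nature of RPO concrete, here’s a minimal T-SQL sketch. It assumes backup history is available in msdb on the instance you’re checking and uses a hypothetical 15-minute RPO target; it shows how long each database has gone since its last log backup:

-- Minimal sketch: compare time since the last log backup to an RPO target.
-- Assumes backup history is in msdb on this instance; the 15-minute target is hypothetical.
DECLARE @RpoMinutes int = 15;

SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_log_backup,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), SYSDATETIME()) AS minutes_of_exposure,
       CASE WHEN MAX(b.backup_finish_date) IS NULL THEN 'No log backup found'
            WHEN DATEDIFF(MINUTE, MAX(b.backup_finish_date), SYSDATETIME()) > @RpoMinutes THEN 'RPO at risk'
            ELSE 'Within RPO' END AS rpo_status
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'L'                      -- log backups only
WHERE d.recovery_model_desc = 'FULL'
  AND d.database_id > 4                     -- skip system databases
GROUP BY d.name;

If the log backups only run every few hours, the RPO is effectively measured in hours, no matter what the document says.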

Recovery Time Objective (RTO)

This is a measure of how long it takes to bring the system back on-line so that it’s accessible by users. Since it’s also a measure of time, it’s again a tradeoff between cost and loss. In this case it’s the business lost while the system is unavailable. Therefore, it is once again a business decision.

There are a lot of technologies we can employ to help improve RTO, but some of our ability to reduce RTO is based on the solution itself. For instance, a few years ago I had to review the architecture of an identity management system. It only allows for a warm standby mode. Getting data from one set of systems to a system in another data center can only be done asynchronously. Previously it only supported log shipping, but now, at least, an availability group in asynchronous commit mode is allowed. What isn’t allowed is for any services to be running on the application servers at the secondary site. Therefore, there is start-up time required if the secondary cluster needs to be brought on-line.
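For reference, configuring a replica that way is straightforward. A minimal sketch, using a hypothetical availability group name and server name, run on the primary replica:

-- Set the DR replica to manual failover first, then asynchronous commit,
-- matching a warm standby design like the one described. Names are hypothetical.
ALTER AVAILABILITY GROUP [IdentityAG]
MODIFY REPLICA ON N'DRSQL01'
WITH (FAILOVER_MODE = MANUAL);

ALTER AVAILABILITY GROUP [IdentityAG]
MODIFY REPLICA ON N'DRSQL01'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);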

If you’re thinking, “If I can detect the first system being down, I can automate the restart of the secondary cluster,” that’s correct. However, the fact that the services have to be started means there is some downtime. That’s unavoidable. Also, the recovery scripts may be available, but the organization may want them ready to run manually rather than executed automatically, in case there are occasional “blips” in connectivity between data centers. This would increase the time to recover the solution. What’s supported definitely constrains our options to improve RTO.

One last thing to say about RTO: even “zero downtime” solutions can fail, and things can become so bad that restoring from backup is required. RTO should consider this possibility. I’ve seen too many cases where RTO is calculated and reported based on a best case scenario. As IT folks, we need to consider reasonable obstacles to recovery when we communicate solutions for both RPO and RTO. If we are forced to restore from backup, how long will that take? Never assume that backups aren’t necessary. There’s a lesson on that from the security files about a huge worldwide organization that was only saved because a single domain controller in a remote site happened to be offline due to bad power after a nation-state action destroyed their systems with NotPetya. They weren’t targeted. They were just “collateral damage.”
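As a starting point for the “how long will a restore take?” question, here’s a rough T-SQL sketch, assuming backup history is in msdb, that shows the size and duration of recent full backups. It’s only a proxy; the real answer comes from timing an actual test restore, which is the whole point of testing recovery:

-- Rough sketch: how big are the most recent full backups and how long did they take?
SELECT TOP (10)
       b.database_name,
       b.backup_finish_date,
       CAST(b.backup_size / 1048576.0 AS decimal(12, 2)) AS backup_size_mb,
       DATEDIFF(SECOND, b.backup_start_date, b.backup_finish_date) AS backup_seconds
FROM msdb.dbo.backupset AS b
WHERE b.type = 'D'                 -- full backups
ORDER BY b.backup_finish_date DESC;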

Non-Functional Requirements

I have found that non-functional requirements (NFRs) can be hard to define for a given solution. I’ve seen teams struggle with NFRs. However, to ensure I’m speaking the same language as everyone else, let’s look at a common definition to work with. This is taken from the Scaled Agile Framework (SAFe):

Nonfunctional Requirements (NFRs) are system qualities that guide the design of the solution and often serve as constraints across the relevant backlogs.

As opposed to functional requirements, which specify how a system responds to specific inputs, nonfunctional requirements are used to specify various system qualities and attributes, such as:

  • Performance: How fast a system should respond to requests
  • Scalability: How well a system can handle an increase in users or workload
  • Security: How well a system protects against unauthorized access and data breaches
  • Usability: How easy a system is to use
  • Maintainability: How easy it is to update and modify the system

Of course, there’s more, because when we’re talking about qualities and attributes, there are others we need to consider for any technical solution. Recoverability is a good example. For critical systems, resiliency could be another. Some of these are easier than others to properly define. For instance, recoverability can be defined using Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Others are tough. For instance, let’s take a look at performance.

Performance, based on the SAFe definition, is how fast a system responds. Here’s the million dollar question: “How fast is fast enough?” It sounds easy at first glance, but it turns out that unless you have a mature system, that can be an extremely hard question to answer. We have ways to quantify and measure. For instance, we could capture transactions per second. We could record response time for a web page or time to load. But even if we know what to measure, do we know what the target should be?

How do we get to a definition of “fast enough?” One of the easier ways, if you have the setup to do so, is to have users test before production. When the users say, “It’s fast enough,” we can just record that and call it good, right? Not necessarily. One user, all alone on a system, might get blazing fast speeds. Of course that is “fast enough.” But in production, what if there are several thousand concurrent users? That user test isn’t valid. We need to test with production load. Actually, we need to test with production load, using a production-like quantity of data (the old 10 rows in development versus 10,000,000 rows in production problem), executing production-like workflows/processes. If we are doing that, we have a better chance of answering the question of what fast enough looks like.
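On the data volume point, here’s a minimal T-SQL sketch, using a hypothetical dbo.OrderHistory table and columns, of inflating a near-empty development table to a production-like row count so that query plans and timings behave more like they will in production:

-- Generate roughly 10,000,000 rows of throwaway data (1,000 x 1,000 x 10).
-- The table and columns are hypothetical; adjust to your own schema.
;WITH n AS
(
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects
)
INSERT INTO dbo.OrderHistory (OrderDate, CustomerId, Amount)
SELECT DATEADD(DAY, -ABS(CHECKSUM(NEWID()) % 365), SYSDATETIME()),
       ABS(CHECKSUM(NEWID()) % 50000),
       CAST(ABS(CHECKSUM(NEWID()) % 100000) / 100.0 AS decimal(10, 2))
FROM n AS a
CROSS JOIN n AS b
CROSS JOIN (SELECT TOP (10) 0 AS x FROM sys.all_objects) AS c;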

So what happens if we can’t properly define that non-functional requirement? Have you ever deployed a system only to have the business come back and indicate the system is too slow? How do they know it’s too slow? How do you respond? If you’re just now dealing with the question, “How fast is fast enough,” it’s too late to be able to tell the business that the system is running fast enough. There’s no foundation to make that claim.

If you’re in production and you haven’t already answered the question, then baselining the system is critical. Actually, baselining the system is critical even if you have answered, “How fast is fast enough,” because you need to know whether the system is meeting the non-functional requirement. The baseline gives you the numbers to compare against “fast enough.” It also lets you deal with the “the system is slow” complaint if it comes up and you hadn’t answered the question. With a baseline in hand, you can ask whether the system is fast enough, then define and record the NFR. If, later on, the business says the system is slow, you can check against the baseline. If you aren’t meeting the numbers captured in your baseline, you can agree and investigate. If the numbers are still in line with the baseline, you can lead off the conversation with the fact that you have numbers, and then work with the business to narrow down what specifically is slow and see what data you have for it.
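As one example of capturing a baseline number, here’s a minimal T-SQL sketch that samples SQL Server’s cumulative Batch Requests/sec counter twice and computes the rate over the interval. In practice you’d capture and store samples like this over time; this only shows the mechanics:

-- Sample the cumulative counter twice and compute the per-second rate.
DECLARE @first bigint, @second bigint, @seconds int = 60;

SELECT @first = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec'
  AND object_name LIKE '%SQL Statistics%';

WAITFOR DELAY '00:01:00';   -- sample interval; adjust along with @seconds

SELECT @second = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec'
  AND object_name LIKE '%SQL Statistics%';

SELECT (@second - @first) * 1.0 / @seconds AS batch_requests_per_sec;

Capture numbers like this on a schedule, keep them, and you have something concrete to compare against when “the system is slow” comes up.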

I’ve focused in on one NFR, but we can take a planned and methodical approach with each NFR that’s relevant to that particular platform. Without the NFRs, how do we know that we are meeting the business needs for the organization? And keep in mind that in discussing the NFRs, such as performance and availability, the business may have a certain level of expectations. There’s a cost based on the solution. By giving the business those numbers, they can make the decision. For instance, if they want performance for 5,000 concurrent users with a page load of XXX ms, there’s a certain cost to support that. It may be that the cost isn’t worth that level of performance. The more data we can provide related to the NFR, especially around cost, the better equipped the business will be to make a decision. Document the decision, the hows and whys, and keep that handy. You may need it to justify why you are at a YYY ms page load, which is greater than the XXX ms page load, but is what the business agreed to due to cost.

Attacking the Weakest Link

When I look at a system and think about its security model, the first thing I start poking around at is where I think security is weakest. For instance, if my target is a Microsoft SQL Server box, I don’t generally look for a weakness in SQL Server itself. I start looking at the operating system, at accounts that may have access, and, since I’m really worried about the data being taken, at how backups are handled and where they are written.
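On that last point, here’s a quick T-SQL sketch of the same check from the defender’s side, assuming backup history is in msdb: where recent backups were written and whether they were encrypted. An unencrypted backup sitting on a loosely secured file share is exactly the kind of weak link I’m describing.

-- Where did recent backups go, and were they encrypted?
-- key_algorithm requires SQL Server 2014 or later.
SELECT TOP (20)
       b.database_name,
       b.backup_finish_date,
       mf.physical_device_name,
       COALESCE(b.key_algorithm, 'NOT ENCRYPTED') AS encryption_algorithm
FROM msdb.dbo.backupset AS b
JOIN msdb.dbo.backupmediafamily AS mf
  ON mf.media_set_id = b.media_set_id
ORDER BY b.backup_finish_date DESC;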

I believe I started to think like this because of playing a lot of chess growing up. In chess, the name of the game is to checkmate the enemy king. However, once you get past a certain proficiency level, direct attacks against the king don’t work initially. I learned to accumulate advantages elsewhere and press those advantages until the enemy king was vulnerable. That’s how I think about attacking a system.

Of course, the history of warfare also teaches us to think this way. A bad actor isn’t going to play fair. This isn’t a jousting contest between two knights. It’s about one side who’ll do anything to win and the other side having to plan and prepare for such an adversary. Case in point, the Maginot Line looked to be a source of strength for France against an invasion by Germany. The problem was that while France built the series of fortifications along the French/German border, it did not do so along the French/Belgian one. After all, in a likely conflict, Belgium would remain neutral and Germany would be forced to try its hand at the French/German border. But Germany didn’t play fair. It ignored Belgium’s neutrality and rolled through the nation, into France, and took out the French forces at the Maginot Line from behind, thereby gaining full access across the French/German border. This is the way bad actors will “fight” if they want our assets.

2019 PASS Summit – How I Would Attack SQL Server

While this talk is about five years old now, it covers how I would go after SQL Server if I were the bad actor. The principles haven’t changed. I’m going to go after the weakest link. Why do more unnecessary work? This isn’t about trying to beat someone at their strength. I don’t have time for that. If I am doing this to make money, the faster I’m in and out, the better. I have less chance of getting caught and I have more time to raid someone else. The same thing is true if I am an opposing corporation or nation. You can decide to “fight fair,” but I guarantee you there are plenty of adversaries out there who won’t.

Friday Basics: Authentication vs. Authorization

Another security fundamentals topic is authentication versus authorization. For those who have a clear understanding of the difference between the two, like with Recovery Point Objective (RPO) vs. Recovery Time Objective (RTO), it is sometimes easy to forget that others mix them up. In a nutshell:

Authentication is proving who you are.

Authorization is what you’re allowed to do.

Authorization is dependent on Authentication. If I don’t know who you are, I don’t know what your permissions are.

One way I’ve recommended for folks to remember the difference is saying to yourself, “I authorize you to enter this restricted facility,” meaning you are giving permission for that person to enter. You wouldn’t allow someone to enter the facility whom you didn’t already recognize / validate.

Yes, a lot of identity solutions do both, and to the end user it may appear to be one and the same. For instance, when someone logs in to a Windows computer (on a domain or not), that person is authenticated, and if that authentication is successful, the security layer looks up the user and their security group memberships, among other things, and puts it all together into an access token (two for administrators) which, from an authorization perspective, includes those security group memberships and other permissions. To the person logging in, though, it appears to be a single process.
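SQL Server separates the two cleanly as well. A minimal sketch, with hypothetical login, database, and table names: the login is the authentication piece; the database user and its permissions are the authorization piece.

-- Authentication: proving who you are. The login establishes identity at the instance level.
CREATE LOGIN [CORP\PayrollClerk] FROM WINDOWS;

-- Authorization: what you're allowed to do. The database user maps to the login
-- and permissions are granted to it.
USE PayrollDB;
CREATE USER [CORP\PayrollClerk] FOR LOGIN [CORP\PayrollClerk];
GRANT SELECT ON dbo.PayStubs TO [CORP\PayrollClerk];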

Develop and Test Your Rollback Plan

The rollback plan… what to do when things go wrong to get back to where you were before the deployment or implementation. I’ve seen too many cases where a rollback plan is required, but it’s never tested prior to a deployment. Because it’s never tested, it rarely gets properly developed. And when deployments go well, or well enough that issues can be fixed during the deployment, the deployment’s rollback plan doesn’t get used. Eventually, folks wonder why we bother with one. And they wonder until the one deployment when things do go wrong and they don’t have a properly tested rollback plan. And they can’t roll back.

That doesn’t happen in modern enterprises? I beg to differ. Recently I was talking with a representative from an organization I receive services from. The representative apologized profusely because they couldn’t pull up my records. They and many other representatives were locked out of their system. Talk about a nightmare. What was worse was the estimated time until their access to the system would be restored. The representative indicated it would be one, possibly two, days.

If the estimated time to fix the issues in the deployment is measured in days for something critical like authenticating to the system, it’s time for the rollback plan. The fact that the estimated time was in days tells me either there wasn’t a rollback plan or it wasn’t fully tested and they weren’t able to execute it when the implementation failed. One to two days for an identity platform sounds like they chose to roll forward, likely meaning they had no other choice.

Now I could be wrong. They could have had a rollback plan that they did fully test, but management decided to roll forward anyway. I’ve seen that happen. If you’ve been in information technology long enough, you likely have, too. But regardless of what they did, hopefully this real-life scenario has convinced some of you of the imperative to have a proper rollback plan and to test it before the production deployment.
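If you’re wondering what a scripted, testable rollback path can look like for a database-only change, here’s a minimal T-SQL sketch using hypothetical names and paths. Reverting a snapshot discards everything done after it was taken, so it only fits some scenarios, but the point is that the rollback path exists as a script and can be rehearsed:

-- Before the deployment: create a snapshot of the database being changed.
-- NAME must match the logical data file name of the source database.
CREATE DATABASE AppDb_PreDeploy
ON (NAME = AppDb_Data, FILENAME = N'D:\Snapshots\AppDb_PreDeploy.ss')
AS SNAPSHOT OF AppDb;

-- Rollback, if the deployment fails (requires that this be the only snapshot
-- of AppDb and that other connections be removed first):
-- RESTORE DATABASE AppDb FROM DATABASE_SNAPSHOT = 'AppDb_PreDeploy';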

Microsoft Fabric Training and Tutorials

Microsoft Fabric is the new data offering in Microsoft Azure and there is a great deal of interest in it. How do you get started? Where are the tutorials? Sometimes these types of resources can be hard to find, so here’s a small collection for anyone looking to get started:

Microsoft Fabric Documentation

Microsoft Fabric Tutorials

If you’re a visual learner, there are a lot of resources on YouTube, and a quick search will find plenty of new content. However, you cannot go wrong with Adam and Patrick and their channel, Guy in a Cube, where they’ve put together a Microsoft Fabric playlist to help you learn. It starts here:

Friday Basics: the CIA Triad

In information security (INFOSEC), there are several foundational concepts and principles. One of the ones that’s introduced almost immediately is called the CIA triad or the Information Security Triad. While it may look like a version of the Triforce, this triad has nothing to do with a video game.

[Figure: The Confidentiality, Integrity, Availability (CIA) triad. From “Requirements for cybersecurity in agricultural communication networks,” ResearchGate, https://www.researchgate.net/figure/The-Confidentiality-Integrity-Availability-CIA-triad_fig1_346192126, accessed 26 Apr 2024.]

The three elements are defined as:

  • Confidentiality – Read access is restricted only to authorized personnel.
  • Integrity – Write (Add/Change/Delete) access is restricted only to authorized personnel.
  • Availability – The system or platform is available to authorized personnel when needed.

I usually expand “authorized personnel” to “authorized personnel via authorized processes.” This covers the case of service accounts and accounts acting on behalf of a user, and it covers situations like when a database is intended to be accessed only through an app but permissions allow a user to connect via Excel. The addition of “via authorized processes” indicates that a user accessing via Excel would be in violation of the CIA triad. With respect to data, we’re used to CRUD (Create, Read, Update, and Delete) operations. Confidentiality covers the R of CRUD while Integrity covers the C, U, and D.
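In SQL Server terms, a minimal sketch of that mapping, using hypothetical table and role names:

-- Confidentiality: read access only for those authorized to see the data.
GRANT SELECT ON dbo.PatientRecords TO ReportingUsers;

-- Integrity: add/change/delete restricted to the authorized application role.
GRANT INSERT, UPDATE, DELETE ON dbo.PatientRecords TO ClinicalAppRole;

-- "Via authorized processes": make sure end users hold no direct permissions on
-- the table, so an ad hoc connection from Excel gets nothing.
REVOKE SELECT, INSERT, UPDATE, DELETE ON dbo.PatientRecords FROM EndUsers;

Availability isn’t expressed as a permission; it comes from the HA/DR design, which is where RPO and RTO come back into play.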

One of the things I talk about with other security professionals is ensuring Availability is met. I have been the overzealous security engineer who tightened down a system to the point where Availability was broken. That does the business no good. If I can’t access the system the way I need to, when I need to, I might as well not have the system at all. And that’s why Availability is a key part of this security concept.

So if you ever hear anyone talking about CIA or the CIA triad with respect to security, this is what it means.

Going to Cloud? Look at the Shared Responsibility Model

The bottom line here is this: the idea that a CSP takes care of everything for you is a fallacy that really needs to die.

Thompson, Graham. All-in-One CCSK Certificate of Cloud Security Knowledge Exam Guide. Page 3. McGraw Hill. New York: 2020.

I was dealing with a situation recently where a group was looking at licensing a cloud-based resource, but no one had checked the cloud service provider’s (CSP) shared responsibility model. The group assumed the vendor’s model was similar to those of the bigger vendors. Turns out they were wrong.

One of the “must dos” when looking to on-board a new service offering from a CSP is to check the shared responsibility model. In some cases, a vendor may have a single model for all offerings, but that is not always the case. For example, with the CSP the group was looking at, there were two different service offerings and they had different shared responsibility models.

If you aren’t familiar with the concept of a shared responsibility model, here is the one for Microsoft Azure. Every CSP should have this, though you may have to ask for it. Never assume the CSP is going to take care of something for you. Verify what they will and will not handle with the appropriate shared responsibility model document.

Dealing with Change – Two Resources

As I look at the state of information technology today, I see one constant: rapid change. We all see it. For instance, if you had said two years ago that you knew generative AI would become a big deal in 2023, most folks would have looked at you like you were crazy. Yet here we are. And I know more drastic change is still coming. Quantum computing is moving forward. When it gets here in full force, the way we secure the Internet will be obsolete. I’m not exaggerating. Dealing with change is hard. Understanding how to handle and attack change is crucial. While this is the type of post I would normally put on my Goal Keeping DBA blog, seeing how many folks in IT struggle with change led me to post it here. Let me suggest two books that may help.

The first book is Who Moved My Cheese? by Dr. Spencer Johnson. This is a classic, and it uses a fable with four characters, two mice and two humans the same size as the mice, to describe how we respond to change. It’s a quick read, probably a single sitting. The characters encounter a major change and the rest of the fable is about how those characters handle that change.

Wrapped around the fable is a fictional high school reunion where one of the attendees relates the fable to his friends. Each friend is facing a situation of great change. Before the fable, we’re given a hint into some of the attendees’ situations. After the fable, the author presents the discussion of those friends and how they relate to the fable. This book has helped a lot of folks throughout the years.

The second book is the sequel to Who Moved My Cheese?, which is Out of the Maze. It is also a quick read. This book covers the story of Hem, the character in the fable who resisted the change. In the first fable, we never learn the fate of Hem. Out of the Maze looks at his story after those events. It’s a positive take on the fact that even if we are like Hem, we can come around to dealing with the change and eventually get out of the maze altogether.

If you’re more of a visual learner, I did find an animated summary on YouTube that’s around 12 minutes in length which presents the fable from Who Moved My Cheese? along with additional explanation to help understand the fable better.

Note: The links to the books are Amazon affiliate links.

Tomorrow: Webcast on SQL Server security

Tomorrow, April 16, 2024, I will be giving another webcast; this one will be on SQL Server security. It’s scheduled for 1 PM EDT / 5 PM UTC.

Sign up link

As always, the registration is free. Here’s the abstract:

Data is the lifeblood for almost every organization. As a result, platforms like Microsoft SQL Server are high-value targets for attackers. However, knowing what to do and not do can be daunting.

In this webinar, we’ll walk through a framework to secure your SQL Servers from end-to-end. Starting with the install and walking through surface area, permissions, backups, encryption, and concluding with decommissioning, we’ll cover every area you’ll need to consider for your SQL Server environment. Where they are applicable, we’ll also point out industry good practices and where to find the documentation on them.

By the end of the webinar, you should leave with a plan for where to start, what’s most important, and where to go for more information to ensure you can properly harden and secure the SQL Servers in your organization.
