Seven Critical Things To Protect Your Infrastructure and Data

Given some recent happenings in the world, I felt it important to get the word out on a few really key things we need to do/stop doing/do differently as we manage our infrastructure to help prevent data breaches.  This is probably more relevant to IT people, like sysadmins, so here goes…

  1. KEEP A FREAKING INVENTORY OF YOUR SYSTEMS, THEIR IP ADDRESSES, THEIR FUNCTIONS, AND WHO TO CONTACT.  Why is this so hard?  Keep it up to date.  By the way, we know this is hard, because it is the #1 control on the CIS Top 20 Critical Cyber Security Controls.  If you’re all cloud-y, I’m sure you can find a way to stick some inventory management into your Jenkins pipeline.
  2. Monitor the antivirus running on your servers.  Unless the server is a file server, if your AV detects an infection, one that you’re reasonably confident is not a false positive, you should proceed immediately to freak out mode.  While workstations ideally wouldn’t be exposed to viruses, we intuitively know that the activities of employees, like browsing the internet, opening email attachments, connecting USB drives, and so on, will put a workstation in contact with a steady stream of malware.  And so, seeing AV detections and blocks on workstations gives us a bit of comfort that the controls are working.  You should not feel that level of comfort with AV hits on servers.  Move your servers to different group(s) in the AV console, create custom reports, integrate the console with an Arduino to make a light flash or to electrify Sam’s chair – I don’t really care how you are notified, but pay attention to those events and investigate what happened.  An AV hit on a server is not normal, and something is very wrong when it happens.
  3. If you have determined that a server is/was infected with malware, please do not simply install the latest DAT file into your AV scanner and/or run Malwarebytes on the server and put the system back into production.  I know we are measured by availability, but I promise you that, on average, rebuilding will cause you far, far, far less pain and downtime than the clean-and-hope alternative.  When a server is infected, isolate it from the network and try to figure out what happened, but do not put it back into production.  You might be able to clean the malware with some tool like Malwarebytes, but you have no idea if there is a dropper still present, or what else was changed on the system, or what persistence mechanisms may have been implanted.  Build a new system, restore the data, and move on, while trying to figure out how this happened in the first place.  This is a great advantage of virtualized infrastructure, by the way.
  4. If you have an infected or compromised system in the environment, check other systems for evidence of similar activity.  If the environment uses Active Directory, quickly attempt to determine if any administrative accounts are likely compromised, and if so, it’s time to start thinking about all those great ideas you’ve had… you know the ones about how you would do things differently if you were able to start over?  This is probably the point at which you will want to pull in outside help for guidance, but there is little that can be done to assure the integrity of a compromised domain.  Backups, snapshots, and good logging of domain controllers can help more quickly return to operations, but you will need to be wary about any domain-joined system that wasn’t rebuilt.
  5. Periodically validate that you are collecting logs from all the systems that you think you should be, and ensure you have the ability to access those logs quickly (a rough sketch of one way to automate that check appears below, after the footnotes).  Major incidents rarely happen on a Tuesday morning in January.  They usually happen late on the Friday of a long weekend, and if Sally is the only person who has access to the log server and she just left for a seven-day cruise, you’re going to be hurting.
  6. Know who to call when you are in over your head.  If you’re busy trying to figure out if someone stole all your nuclear secrets, the last thing you want to be doing is trying to interview incident response vendors, get quotes, and then approval for a purchase order.  Work that stuff out ahead of time.  Most 3rd party incident response companies offer retainer agreements.
  7. Know when you are in over your head.  The average IT person believes they have far above average knowledge of IT[1], but the tactics malware and attackers use may not make sense to someone unfamiliar with them.  This, by the way, is why I am a strong advocate for IT staff, and network/system admins in particular, spending some time learning about red team techniques.  Note, however, that this can have a significant downside[2].

 

1. Yes, I made that up, but Dunning-Kruger tells me I’m probably right.  Or maybe I am just overconfident in my knowledge of human behavior…

2. Red team is sexy, and exposing sysadmins to those tactics may cause a precipitous drop in the number of sysadmins and a sudden glut of penetration testers. Caveat Emptor.
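Back to items 1 and 5 for a moment: below is a minimal sketch, in Python, of the kind of cross-check I have in mind.  The file names, the CSV columns, and the idea that your SIEM or syslog server can export a plain list of hosts that logged in the last 24 hours are all assumptions you will need to adapt to your own environment; treat it as a starting point, not a finished tool.

# A minimal sketch of the inventory/logging cross-check from items 1 and 5.
# Assumptions (adjust to your environment): the inventory lives in a CSV with
# "hostname" and "owner" columns, and your log platform can export a plain
# text list of hosts that have sent logs in the last 24 hours.

import csv

INVENTORY_CSV = "inventory.csv"        # hypothetical export of your system inventory
RECENT_LOG_HOSTS = "log_sources.txt"   # hypothetical export from your SIEM/syslog server

def load_inventory(path):
    """Return {hostname: owner} from the inventory CSV."""
    with open(path, newline="") as fh:
        return {row["hostname"].strip().lower(): row.get("owner", "unknown")
                for row in csv.DictReader(fh)}

def load_log_sources(path):
    """Return the set of hostnames that logged recently, one per line."""
    with open(path) as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def main():
    inventory = load_inventory(INVENTORY_CSV)
    logging_hosts = load_log_sources(RECENT_LOG_HOSTS)

    silent = set(inventory) - logging_hosts      # in inventory, but not logging
    unknown = logging_hosts - set(inventory)     # logging, but not in inventory

    for host in sorted(silent):
        print(f"NOT LOGGING: {host} (owner: {inventory[host]})")
    for host in sorted(unknown):
        print(f"NOT IN INVENTORY: {host}")

if __name__ == "__main__":
    main()

Even something this crude, run on a schedule once a week, beats discovering the gap in the middle of an incident.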

Hardware Messes As An Opportunity

I’ve been in IT for a long time.  I’ve designed and built datacenters and I’ve created network operations teams.  Not so long ago, the thought of moving my organization’s sensitive data and servers to some 3rd party was laughable to me.  But times have changed, and I hope that I’ve changed some, too.

In the past year, we have seen a spate of significant hardware vulnerabilities, from embedded debug ports, to Meltdown/Spectre, to vulnerable lights-out management interfaces, and now the news about TLBleed.  I suspect that each new hardware vulnerability identified creates incentive for other smart people to start looking for more.  And it appears that there is no near-term end to the hardware bugs waiting to be found.

In the aftermath of Meltdown/Spectre, I wrote a bit about the benefits of cloud, specifically that most cloud providers had already implemented mitigations by the time news of the vulnerabilities became public.  There seem to be many benefits to moving infrastructure to the cloud, and TLBleed looks like another example: if replacement servers ever become necessary, the capital cost of procuring them is transferred to our providers.  (Note: I am not convinced TLBleed is an issue that rises to that level of importance.)  We do, however, need to ensure that the provider has taken the appropriate steps to address the problems.

 

System Restore Points

This is one for the home users…

A while back, I was unfortunate enough to be affected by the Windows 10 version 1803 issue that caused systems with certain SSDs to crash during the update and become unbootable.  When I tried to recover, I realized, frustratingly, that system restore points are not enabled by default on Windows 10.  After a few maddening hours in the recovery console, I decided to reinstall Windows, which is probably not a bad thing to do periodically.

Once I was back up and running, though, I did enable system restore points – instructions are here, if you aren’t sure how.  I later realized that, most unhelpfully, Windows wasn’t actually creating system restore points automatically.  Fortunately, some helpful people wrote this guide on setting up a scheduled task to create system restore points every day.
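If you would rather script the restore point creation yourself than follow the guide, here is a rough sketch of the approach.  It assumes a Windows box with System Protection enabled, Python installed, and the script running elevated (for example, from a daily scheduled task running as SYSTEM); it simply shells out to PowerShell’s Checkpoint-Computer cmdlet.

# A minimal sketch (not a replacement for the linked guide): create a daily
# restore point by shelling out to PowerShell's Checkpoint-Computer cmdlet.
# Assumes Windows with System Protection enabled and that the script runs
# elevated (e.g., from a daily scheduled task running as SYSTEM).

import datetime
import subprocess
import sys

def create_restore_point():
    description = f"Daily checkpoint {datetime.date.today().isoformat()}"
    # Note: by default Windows quietly skips creation if another restore point
    # was made in the last 24 hours (the SystemRestorePointCreationFrequency
    # registry value controls that throttle).
    cmd = [
        "powershell.exe", "-NoProfile", "-Command",
        f"Checkpoint-Computer -Description '{description}' "
        "-RestorePointType 'MODIFY_SETTINGS'",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(create_restore_point())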

Be aware that system restore points ARE NOT BACKUPS.  Restore points will help in some instances, such as when MS forgets to QA a Windows release (or whatever happened with 1803) and you’re left with an unbootable or otherwise unstable system.  Be aware that most destructive malware will disable or delete restore points, and they won’t save you from drive failures, stolen devices, or defenestration.  I personally use iDrive for backups since CrashPlan pulled out of the consumer market.  It was really cheap – I think I got a year for about $6, though that was during “National Back Up Your Crap Month” (or whatever that holiday is called) – and it lets you set a private encryption key that (allegedly) prevents iDrive, the company, from recovering your data without the key that only you hold.

 

*** Edit ***

Well, I got too excited…  Apparently Windows feature upgrades (such as the one that laid waste to my computer) intentionally disable and delete system restore points as part of the upgrade process.  Details here.  Dammit Microsoft… (Thanks to @galaxis@mastodon.infra.de for pointing this out)

Thoughts about the GDPR

I’ve been a bit of an outspoken critic of the GDPR on Twitter, Mastodon, and on the DefSec podcast.  I’m frequently asked to describe my issues with the regulation, and usually leave my view summarized as “it’s just a poorly written regulation”.

Many of my contacts, particularly those in the EU, do not understand the hoopla leading up to the May 25, 2018 enforcement date.  The prevailing view is that the GDPR mandates a few simple things, such as compelling me to:

  • explain why I am collecting your personal data
  • show you the personal data that I’ve collected about you
  • delete your personal data when I no longer need it
  • protect the personal data I have about you using things like encryption
  • obtain your explicit permission prior to doing anything with your personal data

Those seem like noble goals, and if that’s all the GDPR did, I would not be so critical.  But as with most things in life, the devil is in the details.  The GDPR spans 99 articles and, in total, weighs in at nearly 55,000 words.  Despite the voluminous nature of the regulation, the data protection regulators of the various EU countries formed a Working Party to produce a substantial number of guidance documents on how the GDPR will be implemented and enforced, since the regulation was not specific in these areas.  Also, some individual regulators, such as France’s CNIL, have been releasing still more guidance.

With that as a background, I’ll go into my specific issues with the GDPR:

  1. Most of the consternation in the IT world regarding the GDPR relates to Article 32, which defines the requirements for security and integrity of data processing.  Article 32 is intentionally vague in terms of the protections that need to be applied.  This is almost certainly in part to ensure the regulation is technology agnostic and does not become out of date as technology and threats evolve.  No doubt it’s also due to the fact that companies really don’t like to be told how to run their IT, at the level of specific controls and technologies.  There is a major disconnect between the regulation, the realities of running a business that needs to process personal data, and the realities of the threat landscape.  Perfect data security is an illusion, and breaches will happen.  Many organizations are looking down the barrel of a gun loaded with bullets that will poke 4% holes in their revenue, hoping the one pointed at them isn’t the one that gets fired.  The GDPR does not provide a safe harbor for organizations handling personal data.  By that I mean that the regulation doesn’t enumerate a certain set of controls that, if in place and operating correctly, would reduce or eliminate the threat of the massive potential fines.  So, organizations have to prioritize their security investments and improvements based on their understanding of the threat landscape.  But we have plenty of objective evidence that many organizations can’t do this effectively.  The GDPR does encourage the creation of a certification standard, but that has not come to fruition.  Many organizations seem to be latching on to ISO27001/2 as fulfilling that role, but I’m here to dispel any notion that an ISO27k certification is, in any way, evidence that the right controls are in place and operating effectively to protect personal data.
  2. The GDPR requires notification of a data breach to data protection authorities within 72 hours of detection.  As someone who has been involved in far more than my share of investigations, I’ll say that this is not reasonable in most cases.  At least it is not reasonable without providing a lot of half-baked reports, potentially including false positives.  I fear this requirement is going to drive bad behavior… If I’m not actively looking for a breach, I’m much less likely to find that one is happening or has happened.  The GDPR should encourage proactive detection and appropriate investigations.
  3. The GDPR obligates both data controllers (the entity that collects and decides how to process personal data) and data processors (entities engaged by a data controller to perform some processing of personal data) to protect personal data.  This will likely have an adverse effect on IT outsourcing and other IT services because IT suppliers must ensure the data they are entrusted to process is properly protected, regardless of the wishes of their client, the data controller.  This arrangement is inevitably going to lead to significant differences in personal data protection controls between a controller and a processor.  Controllers often engage an IT supplier for cost efficiency purposes, meaning that they are unlikely to want to implement a bunch of expensive new controls for something their IT supplier is responsible for performing, while the supplier, being a concentration of regulatory risk across clients, will demand more stringent (and expensive) controls than their client may be willing to tolerate.  I don’t expect that organizations are going to see this as a wake-up call to in-source IT, and even if they did, I’m not sure that is good for data subjects in the long run, as IT continues to become more like a utility.  A better approach would be to follow the model of nearly every other data protection regulation, where the controller defines the controls required to protect the personal data they process, with penalties for negligence on the part of the supplier, and so on.
  4. The regulation is so opaquely worded that the very people responsible for implementing the mandated protections do not generally understand who it does and does not apply to.  For example, many IT professionals (and indeed EU data subjects at large) seem to believe that the GDPR applies to every organization anywhere in the world that processes the personal data of an EU data subject.  To the best of my understanding, this is not true: the GDPR applies to organizations that are established in the EU and process personal data of EU data subjects, and to non-EU organizations that offer goods or services to, or monitor the behavior of, people in the EU.  This seemingly means that my blog, hosted in Dallas, TX, USA, and which does not specifically target EU data subjects, likely does not need to comply with the GDPR.

What did I miss?  Do you disagree with my observations?

 

Data Breaches and Randomness

The field of information security is a prime example of making decisions under uncertainty.  Generally, there is far more to do than can be done, and therefore we must make priority decisions of what to protect, where to invest, how and who to train, and so on.  We know that we cannot create a perfectly secure system that retains some useful business value, beyond that of a doorstop or paperweight.

I recently started listening to Annie Duke’s book “Thinking in Bets: Making Smarter Decisions When You Don’t Have All the Facts”.  Like others in the field of behavioral economics, Mrs. Duke cautions us against outcome bias and hindsight bias.  Basically, we should reward good process, not good outcomes that might be (and often are) the result of pure luck.  This quote particularly resonated with me:

“An unwanted result doesn’t make our decision wrong if we thought about the alternatives and probabilities in advance and allocated our resources accordingly.”

This, by the way, is why we should be wary of management teams that purport to be “outcome based”.  Such a stance means that management will almost certainly end up valuing luck over sound decision making.  As with any random process, a “lucky” executive who was just promoted will very likely come to understand the concept of “regression to the mean”.

In the IT security world, we shame people and organizations that have a breach.  We look at what happened with the benefit of hindsight and conclude that any reasonable person could have foreseen the breach, therefore condemning the person or organization as incompetent or negligent.  Oddly, though, we (often) don’t know whether any of us could have made better decisions than the person or organization involved in a breach.  Possibly the breached organization made reasonable priority and investment decisions but got unlucky.  Or maybe the organization made crappy decisions and the breach really was inevitable.  We never get to consider the contra scenario, where an organization makes crappy security decisions but gets lucky.

Luck shouldn’t be part of our strategy to defend the assets we are charged with protecting, but good process should.  By that logic, we should direct our criticisms at organizations that have bad processes, whether they were breached or not.

Hopefully this is unsurprising and intuitive, at least upon hearing it.  There are two problems with applying this concept to infosec:

  1. There is no objective way to tell whether a breached organization fell victim to crappy planning or had good planning and got unlucky. We can’t rely on the organization itself to help us out there.  In some rare cases, we do get to make informed decisions based on the civil or criminal court proceedings.
  2. Unlike the poker player in Mrs. Duke’s book who gets unlucky during a card game, when a modern organization is unlucky enough to suffer a breach, it is not just the organization itself that is harmed. Quite often, those harmed are customers, business partners, employees, and others.

Those of us who are harmed in data breaches through no fault of our own can’t simply accept that the breached organization was just “unlucky”.  We believe the very fact that the breach happened is evidence that the organization was not doing enough to protect its systems.  This gets to the heart of a fundamental philosophical issue facing organizations in the age of pervasive data: unlike almost any other business risk an organization faces, the harm from many breaches is not borne by the organization itself.  Organizations are playing poker both with their own chips and with the chips of the people they store data on.

Regardless, organizations do not have perfect visibility into threats, nor do they have unlimited budgets, and so long as they handle such data, they will be making decisions on how to protect it.  Some organizations will find that items that fell “below the line” on the priority list created gaps that led to a breach.  Others will get lucky.
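To illustrate the point with a toy simulation: assume, purely for the sake of argument, that a “good process” organization faces a 5% chance of a breach in any given year and a “bad process” organization faces 20%.  Run a thousand of each for five years and you get plenty of breached good-process organizations and plenty of untouched bad-process ones, which is exactly why a single observed breach tells you so little about the quality of the decisions behind it.

# Toy simulation of the "luck vs. process" argument above. The annual breach
# probabilities are invented for illustration: a "good process" org is assumed
# to face a 5% annual breach probability, a "bad process" org 20%.

import random

random.seed(7)

def simulate(orgs, annual_breach_prob, years=5):
    """Return how many of `orgs` organizations suffer at least one breach."""
    breached = 0
    for _ in range(orgs):
        if any(random.random() < annual_breach_prob for _ in range(years)):
            breached += 1
    return breached

good = simulate(orgs=1000, annual_breach_prob=0.05)
bad = simulate(orgs=1000, annual_breach_prob=0.20)

print(f"Good-process orgs breached over 5 years: {good}/1000")
print(f"Bad-process orgs breached over 5 years:  {bad}/1000")
# Plenty of good-process orgs get breached, and plenty of bad-process orgs
# skate by -- which is why a single breach tells you little about process.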

Laws like the GDPR will help, because the GDPR raises the prospect of significant fines and civil liability for not properly protecting personal data.  Still, I am skeptical that we will see any noticeable decline in data breaches after the law takes effect, because at the end of the day:

  1. We do not have perfect security
  2. Breaches are the result of effectively random processes

Human Error as a Cause of Data Breaches

Australia recently enacted a new law that requires organizations to disclose breaches of personal data.  The Australian Information Commissioner released its Quarterly Statistics Report for Q1 2018 and one of the findings is that about half of the breaches were caused by “human error”.

It occurs to me that human error is at the root of all of these breaches.  Those attributed to human error are instances where (apparently) the last action in the chain of events that led to the breach was a human making a mistake.

But after that, it gets more nuanced.  If we follow the chain of events backward, it seems like we will always end up with human error as a cause.  For those classified as “malicious or criminal attacks”, doesn’t the person clicking on a malicious link count as an error?  Doesn’t the person who designed the environment in a manner that allowed a clicked link to lead to a data breach count as human error?  What about the IT person who ignorantly exposed RDP to the Internet with a weak password?  Or the manager who decided to save some money when implementing a system, with the result that patches are delayed because the system can’t be taken down during the week?  Aren’t all of those human errors, too?  Why does the fact that some opportunistic criminal took advantage of these “upstream” human errors cause us to think about their causes differently?

Oh, and for those that ARE in the human error category in the report, I suspect that, similarly, the cause of the breach was not necessarily the person that “made the error”, but rather the person that designed the process in such a manner that allowed such errors to lead to a breach.

It seems clear to me that we really only consider the “end nodes” in the chain of events that lead to a data breach, and I suspect we will not make material improvements until we accept that we need to begin dealing with the actual causes of breaches, which happen much earlier in the chain.

Cyber Security Immaturity

It is difficult to make improvements to a system without an understanding of the problem that needs to be addressed.  In the IT security world, we face many problems.  It’s clear to me, though, that one of our main problems is, well, agreeing on what the problems are.  A great example is a recent post I read by Rob Graham on the Errata Security Blog regarding the advice to use “strong passwords”.  While the post itself is a good read, I found the comments to be much more interesting.  Granted, the comments only involve a small number of people, but in my experience, they exemplify a broader issue in the IT security world: the objectively “right” approach to a given security challenge is determined by the perspective, experiences, and subjective judgement of each practitioner.  In some ways, that’s a good thing.  I recently listened to the book “Deviate: The Science of Seeing Differently”, in which the author made the point that breakthroughs often only come when people don’t know that they shouldn’t ask a particular question.  In other respects, it seems clear to me that we should not be in a place where there is such disagreement about whether the advice to use strong passwords is good or not.

In this particular instance, I suspect much of the debate stems from two things:

  1. lack of a consistent understanding of modern threats that face password-based authentication systems
  2. lack of a consistent view of what kind of systems we are trying to protect

I suspect, for example, that most people that come from large organizations will view weak passwords as much more of a problem than password reuse, whereas incident responders and those who manage consumer-oriented Internet services will see password reuse as much more problematic.

I hope and expect that this difference of perspective leading to different views is intuitive.  What may not be so intuitive, though, is that our own view of the “objectively right” approach to a given security concern is colored by our knowledge and experience, and may not be a universal truth.  We should always be questioning ourselves, our beliefs, and our approaches to problems.

/LunchTimeThoughts

Treating The Disease of Bad IT Design, Rather Than The Symptoms

I have a lot of opportunities to see and think about how IT security disasters play out.  I talk a lot about how to help avoid them on the Defensive Security Podcast, and I’ve written a good bit here on infosec.engineering.  There are many causes of the weaknesses that lead to intrusions and data breaches, such as underinvestment, bad processes, malicious or careless insiders, and so on.  I am growing more convinced, though, that a significant factor, particularly in the extent of intrusions, is poor design of IT systems.

IT environments in average organizations are exceedingly complex, in terms of the number of entry points and paths that an adversary can use for lateral movement (see Thinking Graphically to Protect Systems).  There is little formalized guidance on how to design an IT environment, and much of the time, there are nearly unlimited ways of connecting and arranging things in a way that “works”, in terms of meeting the core intent of the system.  We are at the mercy of the imagination of the architects who design these systems to foresee the potential for abuse and properly design in concepts like least privilege.  Most of the time, the people working in those design roles aren’t familiar with many of the techniques that adversaries use.
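To make the “paths for lateral movement” point concrete, here is a toy sketch in Python (using the third-party networkx library) that models a made-up environment as a directed graph of “can reach / can authenticate to” relationships and counts the distinct paths from the internet to a database server.  The hosts and edges are invented purely for illustration.

# Toy illustration of lateral movement paths: model hosts as nodes and
# "can reach / can authenticate to" relationships as directed edges, then
# count distinct attack paths from an entry point to a crown-jewel system.
# The environment below is entirely made up; requires `pip install networkx`.

import networkx as nx

G = nx.DiGraph()
edges = [
    ("internet", "vpn-gw"),
    ("internet", "web-dmz"),
    ("web-dmz", "app-server"),
    ("vpn-gw", "workstation"),
    ("workstation", "file-server"),
    ("workstation", "app-server"),
    ("app-server", "db-server"),
    ("file-server", "db-server"),
]
G.add_edges_from(edges)

paths = list(nx.all_simple_paths(G, source="internet", target="db-server"))
print(f"{len(paths)} distinct paths from the internet to the database:")
for p in paths:
    print(" -> ".join(p))

# Every extra trust relationship multiplies the paths a defender has to
# reason about; removing the workstation -> app-server edge, for example,
# eliminates one of them.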

Much of what we do in IT security is treat the symptoms of the underlying disease, the disease being badly designed IT.  We apply increasingly sophisticated antivirus software, next-gen firewalls, and so on, to mitigate risks to the environment.  To make our environments more resilient, we need to spend some time tackling the disease itself.  It’s extremely difficult to make fundamental or large-scale changes to existing IT.  At the same time, IT in most organizations is a constantly changing organism, meaning there are likely opportunities to inject more robust design patterns incrementally.  By that, I mean that we are generally always upgrading some component, optimizing or redesigning some aspect of the network, and so on.  There are fortuitous changes happening in the IT landscape, such as the move to cloud, which may present opportunities to make fundamental improvements.  Some organizations end up in the unfortunate position of having an opportunity to start over – such as was the case with Sony Pictures and many of the victims of the NotPetya worm.

As I previously mentioned, my experience, and the experiences of many colleagues I’ve discussed this with, is that failure modes (of the malicious intruder sort) are often not considered, or are only given passing thought because of a variety of factors including schedule and cost limitations, ignorance of threat actor techniques, and the fact that IT is an art form, and sometimes IT people just want to be artists.

It seems to me that the industry would do well to establish modular IT infrastructure design patterns that are specific in terms of configuration, scalable, and laid out in such a way that the various building blocks can be linked together to form the foundational IT infrastructure.  Some building blocks might effectively be “frameworks” (though not in the manner of the NIST Cyber Security Framework) within which oddball or one-off systems and applications can operate in a standardized way.  The result would be a set of design patterns that are cost efficient, resilient, modeled after best practices, and tuned over time based on changes in technology and new understanding of deficiencies in the original design.

The idea here is to remove much of the design weakness in organizational IT environments by providing an objective set of “tried and true” design patterns that IT people can use, rather than having them design a half-assed, difficult-to-secure environment because they are ignorant of how their custom designs can be exploited.  I see a lot of parallels here to encryption (though I admit it is a tenuous comparison): it’s mostly accepted in the IT world that designing your own custom encryption scheme is a bad idea, and that the most effective approach to encryption is using accepted standards, like AES, that people a lot smarter than the average IT person have designed and demonstrated to be robust.  Also like encryption algorithms, IT environments tend to be vastly complex, and their weaknesses are difficult for a layperson to spot.  We will get the occasional DES and Dual EC DRBG, but that risk seems far preferable to creating something custom that is easy to break.

The move to cloud, virtualization, and infrastructure as code provides an excellent opportunity for such a concept to be adopted by IT teams with minimal effort, if these design patterns exist as Vagrant/Ansible and SDN-style configuration files that can be tailored to meet specific needs, particularly in the areas of scale, dispersion across locations, and so on.
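To make the “building block” idea slightly more concrete, here is a rough sketch, in Python rather than Vagrant or Ansible, of what a parameterized, composable pattern might look like.  Every name, field, and rule in it is invented for illustration; the point is only that a standard block can carry its segmentation and least-privilege assumptions with it.

# A rough sketch of the "building block" idea: a reusable, parameterized
# three-tier segment pattern that emits a default-deny rule set. Everything
# here (names, fields, rules) is invented purely for illustration.

from dataclasses import dataclass, field


@dataclass
class Segment:
    name: str
    cidr: str
    allowed_inbound: list = field(default_factory=list)  # (source, port) pairs


def three_tier_block(app_name: str, base_cidr: str) -> list:
    """Instantiate a standard web/app/db pattern with least-privilege flows."""
    web = Segment(f"{app_name}-web", f"{base_cidr}.0/26",
                  allowed_inbound=[("internet", 443)])
    app = Segment(f"{app_name}-app", f"{base_cidr}.64/26",
                  allowed_inbound=[(f"{app_name}-web", 8443)])
    db = Segment(f"{app_name}-db", f"{base_cidr}.128/26",
                 allowed_inbound=[(f"{app_name}-app", 5432)])
    return [web, app, db]


# Compose blocks into a larger design; each block carries its own
# default-deny posture, so the overall design stays consistent.
design = three_tier_block("payroll", "10.10.1") + three_tier_block("crm", "10.10.2")
for seg in design:
    print(seg.name, seg.cidr, "inbound:", seg.allowed_inbound)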

Is anyone working on such a thing?  If not, is there any interest in such a thing?

Differentiating IT Risk From Other Business Risk

It’s often said that IT risk is just another type of business risk, no different from the risk of hiring a new person, launching a new product, or making an acquisition.

I recently listened to the audiobook “The Undoing Project”, the story of Amos Tversky and Daniel Kahneman and their development of the foundational concepts of behavioral economics.  The book is a great read, though it does not bring much new insight if you’ve already read “Thinking, Fast and Slow” and the works of Dan Ariely.  On this listen, though, something clicked for me when the author recapped discussions about how people value bets.  Consider this scenario:

You are given two options:

  1. A $500 payment
  2. A 50% chance of a $1000 payment, and a 50% chance of nothing.

Most people given this choice will take the $500 sure thing.  The payout in option 2 has to be significantly higher before most people will take that bet, even though they stand to make double the money.

Now consider this scenario:

You are given two options:

  1. A $500 fine that you must pay
  2. A 50% chance of a $1000 fine that you must pay, and a 50% chance that you pay nothing

In this scenario, most people become risk seeking and choose option 2, even though it may cost them twice as much.  The potential loss in option 2 has to be significantly higher before most people will select option 1.
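For what it’s worth, the math says the sure thing and the gamble are equivalent in both scenarios; a trivial bit of Python makes the point that the flip from risk aversion to risk seeking is about preferences, not expected value.

# The arithmetic behind the two scenarios above: in both cases the sure thing
# and the gamble have the same expected value, so the flip from risk-averse
# (gains) to risk-seeking (losses) is about preferences, not math.

def expected_value(outcomes):
    """outcomes: list of (probability, payoff) pairs."""
    return sum(p * v for p, v in outcomes)

gain_sure = expected_value([(1.0, 500)])
gain_bet = expected_value([(0.5, 1000), (0.5, 0)])

loss_sure = expected_value([(1.0, -500)])
loss_bet = expected_value([(0.5, -1000), (0.5, 0)])

print(f"Gain scenario: sure thing EV = {gain_sure}, gamble EV = {gain_bet}")
print(f"Loss scenario: sure thing EV = {loss_sure}, gamble EV = {loss_bet}")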

How does this relate to business risk?  First, businesses are led by people, and those people have the same bias as above.  I contend that there are (at least) two distinct types of business risk that we need to keep separate in our minds:

First, investment risk.  This is the risk arising from investing in some new “thing” that has the promise of generating more money.  That “thing” could be a new employee, a new product line, an acquisition, and so on.  There is a chance that the venture fails, and the money is lost, but a (hopefully) much larger promise of increasing revenue and/or profits.  This very likely explains why many companies won’t take major bets, often opting for something closer to a “sure thing” payoff.

Second, risk of loss.  This is the risk arising from things like theft, fire, flood, data breaches, and so on.  It’s all downside risk, and it maps to the second scenario above.  To what extent do business leaders avoid the sure-thing loss (the $500 fine, in the form of increased spending on IT security) and gamble instead, because they do not fully comprehend the actual potential loss?

 

 

Thoughts on Incentives Driving Bad Security Behavior

Harry Truman once said “Give me a one-handed economist! All my economists say, ‘on the one hand…on the other hand…'”  I am quite frustrated by the state of affairs in the IT world, and I gave a presentation on it at the Tactical Edge conference in Colombia.  (Note: this was my first conference presentation and I consider my thoughts on the matter to be half-baked, at best, right now, so adjust expectations appropriately.)  The premise of my presentation is that we need more sophisticated IT and IT security people, who are able to effectively understand and communicate risk.  In particular, I believe that many IT people, and even many IT security people, do not have imaginations sufficient to envision the ways things can fail and the extent of the harm that can result.

Since giving the presentation, I’ve talked with many people about my ideas and their experiences in various organizations, and I’m beginning to realize that there may not be much desire to improve the situation.  In many organizations, performance is judged by accomplishments and efficiency, rather than by some obscure, hard-to-measure thing like “security”.  Spending too much time assessing risks slows progress on important projects, and trying to account for the many “bad things” that can happen to a system, but probably won’t, is not efficient.  This situation is a gamble: accepting the perceived remote possibility of a bad thing happening in exchange for avoiding the certain pain of missing business objectives.  Viewed through this lens, the less IT managers know about risks, the better off they are, since knowing about a risk and ignoring it moves a person out of the realm of “ignorance” and into the realm of “negligence”.

I’ve had this view that we collectively want and try to improve, but now I am not so sure.

If you’re interested, here is the video and the slides from my presentation.  I am going to make some significant updates to the content in the coming months, as well as improve my presentation abilities, and hopefully deliver this again at some upcoming conference.

Video:

Slides: