The previous articles of this series have focused on the real-life NASA hardware which inspired the fictional equipment found in Andy Weir's novel (and imminent movie) The Martian. Specifically, we looked at many of the components that are used to process water, air, and electrical power in space. This article will be a little different.
Readers of The Martian know that one of the recurring themes in the book deals with fixing broken equipment using whatever is on hand, combined with plenty of ingenuity. Those scenarios have a very real parallel in NASA's day-to-day operations of manned and unmanned spacecraft. Space is an extremely harsh environment and spacecraft components break…a lot. Let's take a look at how NASA deals with these in-flight failures.
Stuff In A Box
It would be difficult to talk about hardware problems in space without mentioning the Apollo 13 mission and the countless miracles performed by mission control to get the crew home alive. In one memorable scene from Ron Howard's 1995 movie about the ordeal, engineers in mission control begin working to reverse rising carbon dioxide levels in the Lunar Module. Someone empties a box of random-looking parts which represent the total resources of the spaceship's crew. The challenge is immediately obvious: use these parts to find a solution or people will die.
In a recent conversation with present-day flight controller Tom Sheene, I asked if the "stuff in a box" scenario still happens. He replied, "All the time… it's the most challenging and rewarding part of my job." Sheene went on to tell me about a custom tool that his team had designed to lubricate the space station's robotic arm, and another that was used by spacewalking astronauts to free a solar array that refused to unfurl.
When these custom tools are being designed, aesthetics takes a back seat to functionality. But no one seems to mind as long as they get the job done. The names given to these tools are equally low-key. Apollo's hacked carbon dioxide scrubber was the "mailbox", and the solar array tool was the "hockey stick". Tools that become a part of the permanent inventory are renamed with more scientific terms and, as with all things NASA, branded with an acronym. Case in point: Sheene's robotic arm tool graduated from "fly swatter" to "BLT" (Ball Screw Lubrication Tool).
While a failed component on the International Space Station (ISS) rarely triggers an immediate life-and-death battle of wits, the stakes are invariably high. Whatever the failing component may be, it was sent up there for a reason and at great expense. You can't just roll down the window, turn up the radio, and pretend that it isn't squeaking.
Sheene and his colleagues man the Operations Support Officer (OSO – pronounced "Oh So") console in mission control. In the multi-layered symbolism that often surrounds names and acronyms at NASA, the official OSO patch contains a rendition of Ursa Major. The inclusion of this constellation, also known as "The Great Bear," is a subtle nod to "oso", the Spanish word for bear.
Most flight controllers are tasked with the operation and management of specific systems under their purview, but OSOs are a slightly different breed. While the group does have some hardware assigned to them (namely the interfaces where visiting spaceships dock with the ISS), OSOs are responsible for the maintenance and repair of all of the spaceship's systems. As Sheene puts it, they are the "glorified janitors" of the ISS. He later suggests that it would be more accurate to think of OSOs as NASA's version of Scotty from Star Trek.
Much of what the OSO team does involves routine scheduled maintenance. Sheene notes, "It's just like your car. If you don't do preventive maintenance, like changing the oil, it's ten times worse later on." In a similar fashion, the ISS is chock full of filters, pumps and other expendable items that are on a prescribed Remove-and-Replace (R&R) schedule. Sheene pointed out that the overhead of executing these caretaking tasks can be significant for the crew aboard the ISS. "Sometimes an astronaut will spend their entire day just on maintenance."
In spite of the planned expiration dates of the R&R items, the OSOs do not necessarily replace every part on schedule. Due to the costs involved with some parts, and the difficulty of keeping orbiting supply shelves stocked, select components are kept in use until they fail. Of course, this only happens with non-critical systems and in instances where failure of the R&R part would not cause trickle-down failures in other components.
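The run-to-failure rule described above can be written down as a small decision function. This is only an illustrative sketch, not NASA software or procedure: the `Part` class, its field names, and the `may_run_to_failure` function are all hypothetical, and the only logic encoded is what Sheene describes (non-critical system, no trickle-down failures, and a cost or resupply reason to defer replacement).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Hypothetical description of an R&R item (all fields are assumptions)."""
    name: str
    on_critical_system: bool      # does the part belong to a critical system?
    failure_cascades: bool        # would its failure damage other components?
    costly_or_hard_to_stock: bool  # expensive, or difficult to keep spares in orbit


def may_run_to_failure(part: Part) -> bool:
    """Return True if the part could be kept in use past its scheduled date."""
    # Critical systems and cascading failures always rule out run-to-failure.
    if part.on_critical_system or part.failure_cascades:
        return False
    # Otherwise, deferring replacement only makes sense when there is
    # a cost or resupply reason to do so.
    return part.costly_or_hard_to_stock


# Made-up example parts, for illustration only.
part_a = Part("part A", on_critical_system=False,
              failure_cascades=False, costly_or_hard_to_stock=True)
part_b = Part("part B", on_critical_system=True,
              failure_cascades=False, costly_or_hard_to_stock=True)
```

Here `may_run_to_failure(part_a)` is true while `may_run_to_failure(part_b)` is false, since the second part sits on a critical system and would be replaced on schedule.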
When the Going Gets Tough
Despite the best efforts of the OSO team, space happens. Unexpected failures and never-before-seen problems are inescapable facets of life with the ISS. These are the situations that often require the seat-of-the-pants ingenuity that NASA is known for (and that is celebrated in The Martian). While the solution may require a significant dose of improvisation, the path to get there is often well-mapped.
When a problem pops up on the ISS, the OSO team is tasked with solving it. Not that they work in a vacuum, mind you. At a minimum, OSO will work with the flight controllers who are responsible for the affected system(s) and with the flight director, who has final say on all matters. The complexity of the situation may dictate that astronauts, hardware experts, contractors, and others are brought into the fold as well.
Sheene stated that the method for fixing a problem begins with a process called Failure Impact Workaround (FIW). As we discussed the nuances of FIW, Sheene used the scenario of an MMOD (Micrometeoroid and Orbital Debris) strike puncturing the hull of the ISS. The ISS does indeed absorb MMOD hits from time to time. Thankfully, none of the "hell in a handbasket" MMOD damage scenarios that flight controllers routinely train for have come to fruition. Even so, the MMOD example provides broad insight into NASA's crisis management methodology, so I will carry it forward here.
The first step of FIW is to gain an understanding of what is actually broken. Sheene explains:
"A lot of times we don't actually know what the failure is…the best we can do is say 'Here is the failure signature.' For instance, we'll see that this box, this computer, has completely lost power. So what is the failure? Is the box dead? Is the power source to the box dead? We treat it like your typical tech-support guy on the phone. We have to ask fundamental questions like 'Did you turn the power on?' That sort of thing helps us hone in on the root cause of the issue."
Sheene goes on to explain that MMOD strikes rarely cause just one issue. There are likely to be a number of affected components that are spread across multiple systems. The array of issues could create greater confusion, but it is just as likely to provide additional clues that help pinpoint what went wrong.
"While we're working these cases, the ADCO [Attitude Determination and Control Officer] might tell us 'We're [the ISS] pitching up and yawing.' And maybe the crew told us that they heard a noise in Node 2. Now we know that we lost a computer that's located in that module, the crew heard something there, and we're venting air that's trying to rotate the station. From this, we can be relatively sure that we took an MMOD hit in Node 2 that breached the hull and knocked out the computer. Based on the direction that the ISS is trying to rotate and the location of the bad computer, we also have a good idea where to look for the hole."
The next step in the FIW process is to determine the criticality of each of the issues and prioritize how resources will be allocated to address them. This task is made somewhat easier by a non-negotiable hierarchy of obligations. The top priority is always to look after the well-being of the crew. Sheene explains:
"Our primary goal is to keep the crew safe. If you can, save the vehicle…obviously. But you don't want to put the crew in harm's way to save the station or to somehow isolate the crew from their escape vehicle [i.e. the 3-seat Russian Soyuz ships that remain docked to the ISS].
"OSO works with the affected disciplines [flight controllers] and the flight director to figure out what impact is the most important, and that's the one we're going to work on first. Most of the time it's going to be the hull. You might lose multiple systems, but they usually have layers of redundancy built in. Other flight controllers may begin working on a recovery plan for their systems, but OSO is likely going to be focused on patching the hole in the hull first.
"The main problem with a hole is that it can get bigger. In these cases, what we'll usually do is get the crew out of that module, seal the hatch, let it vent to vacuum, and think about it for a few days. Once we have a solid plan, we'll repressurize the module and go back in to patch the hole. [note – Sheene is talking about simulation scenarios. Thus far, flight controllers have never needed to seal off any modules on the actual ISS.]
"If somehow the crew immediately happened to know exactly where the hole is, that's a no-brainer…you patch the hole right away. But even in that case…the station is so full of stuff that they might spend an hour moving racks and hardware out of the way just to get to the hole."
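The prioritization step Sheene describes can be sketched as a simple sort. This is a loose illustration under stated assumptions, not a representation of any real mission control tool: the `Impact` levels, the `Issue` class, and the `prioritize` function are hypothetical names, and the ordering encodes only what the quote states (crew first, then the vehicle, with systems that have redundancy ranked behind those that do not).

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical impact levels; a lower value means higher priority."""
    CREW_SAFETY = 0
    VEHICLE = 1   # e.g. the integrity of the hull
    SYSTEM = 2    # an individual onboard system


@dataclass(frozen=True)
class Issue:
    description: str
    impact: Impact
    has_redundancy: bool = False


def prioritize(issues):
    """Order issues by impact level, then put non-redundant losses first."""
    # False sorts before True, so issues without a backup come first.
    return sorted(issues, key=lambda i: (i.impact, i.has_redundancy))


# Made-up issues loosely based on the MMOD scenario in the text.
issues = [
    Issue("rack computer lost power", Impact.SYSTEM, has_redundancy=True),
    Issue("hull breach venting air", Impact.VEHICLE),
    Issue("crew cut off from escape vehicle", Impact.CREW_SAFETY),
]
work_order = prioritize(issues)
```

With these sample issues, `work_order` starts with the crew's access to their escape vehicle, followed by the hull breach, with the redundant computer last. A real assessment involves far more judgment than a sort key, which is why the flight director has the final say.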
The final element of FIW involves actually fixing the broken components. In the case of broken rack computers and other common hardware, there are usually spares available on board. The crew pulls out the bad unit and swaps in a new one.
Holes caused by MMOD impacts can be tricky to repair, depending on their size and location. A small variety of prefabricated patches is available for this task. Rigid patches are used when the hole is on an open and accessible part of the hull. Flexible patches are used when the hole is near a bulkhead or other obstruction. There are also options for using Duxseal or a 2-part epoxy-like mixture to deal with particularly troublesome breaches of the hull.
Inevitably, there will be times when the toolbox on the ISS simply doesn't have a tool for the task at hand. That's when the OSO team goes off-script and works to develop something to get the job done. Mailboxes, fly swatters, and hockey sticks are the results. Not only are the OSOs tasked with developing the solution, they must also instruct the crew on how to assemble and use the new widget. This is usually done with written procedures that are uploaded to the ISS.
The problem is not always urgent. In the case of the robotic arm, Sheene was able to spend time with the Canadian team that designed and manufactured the arm components. Perhaps more significantly, he was able to meet with the astronauts who would use the BLT in space and train them on the tool before they ever launched. The luxury of such lead time and hands-on training opportunities is rare.
As long as humans continue to occupy space, whether on the ISS or as far away as Mars, there will be technical issues to deal with. It helps to have a plan and a process in place to meet these issues head-on. Yet it is impossible to anticipate every scenario. As The Martian's Mark Watney illustrates, a cool head and rational yet innovative thinking can be the most effective tools at your disposal.
Author's Note – I offer my sincere thanks to Tom Sheene for sharing his knowledge and experience as an OSO flight controller.
All images appear courtesy of NASA.