Question

Edit
What software practices are being used in mission-critical industries where safety is paramount? For example nuclear power plant.

Update
Originally this question was: How would you develop software for a nuclear plant? I have changed it to save good answers. I'm also making this question community wiki. Please help to word it better!

Answer 1

Well, not Java. According to the license agreement ^[1] ...

You acknowledge that Licensed Software is not designed or intended for use in the design, construction, operation or maintenance of any nuclear facility

[1] http://www.java.com/en/download/license.jsp

Answer 2

I would use Eiffel with Design by Contract for correctness. Coincidentally it's already used in nuclear plants

DbC is a precondition/postcondition for each method, checked by the runtime during development. It includes class invariants that are checked before and after each method invocation during development. The precondition/postcondition and invariants form an exact specification on a module's interface.

http://dev.eiffel.com

Answer 3

I haven't worked specifically in nuclear production, but I have ample experience in system development where environmental safety (and human safety for that matter) is paramount. A lot of the development I have done in my career has been for use in this type of environment - whether it be Oil & Gas, Hydro-Electric production and could even be used in nuclear facilities, although I've yet to have that final honour - thankfully, perhaps.

The large majority of these types of systems are developed using SCADA ^[1] systems and some form of HMI control system - which is what they call the GUI in industrial systems. This is usually an IDE built on a system designed purely for this purpose - CygNet ^[2], Wonderware ^[3], iFix ^[4] or FactoryLink ^[5] or similar.

Whenever you're coding for this type of environment, your first concern is failsafe. I will simplify to demonstrate my point (at the risk of being chastized by the SCADA community), but a system like this is controlled largely by hardware with safety limits hard-wired, firmware controlled and then software controlled.

The hard-wired limits are the outside boundaries of safety. In the event that firmware or software fails and these limits are breached, the system automatically shuts down. For instance on an oil pipeline this might mean closing a valve on a well to prevent an explosion at one end, or may mean venting excess to atmosphere or a burner if necessary.

Firmware limits are usually predetermined safety limits, considered safe for general use to push the system to.

Software is then used by an operator who will tweak the system to get the best possible performance or to meet other business targets - i.e. most power, coolest operating temperature, optimal performance etc.

In the event that anything fails, the underlying system takes over and operates safely. This means that in the event that the application were to fail catastrophically, the firmware built in the hardware controls can still operate the system safely. In the event that the firmware fails the hardware faults safely - i.e. shutting the system down to prevent environmental or human catastrophy.

[1] http://en.wikipedia.org/wiki/SCADA
[2] http://www.cygnetscada.com/
[3] http://global.wonderware.com/EN/Pages/default.aspx
[4] http://www.gefanuc.com/products/3311
[5] http://www.inotek.com/Catalog/usdata1cn.html

Answer 4

Each CANDU nuclear unit (reactor and turbine-generator) is controlled ^[1] with two independent, 100% redundant computers. An important design concept is that control systems and safety systems are kept completely independent.

Normal operator interface with the computer

The panel closest to the camera, is the panel for one of the computers. The hand-switch selects which computer is controlling. One computer is controlling, one computer is running on standby. The orange lamps show which programs are executing. The CRT is displaying the contents of core memory for specific locations. L-3 MAPPS ^[2] is a current computer supplier.

Where I work, the computers are Varian V72s. Hardware defences include core parity checking logic, restart detection logic and peripheral failure sensing logic. A multiple level interrupt scheme is used where internal interupts are given a higher priority over external interrupts. The two computers communicate through a data link. A program fault on one computer gracefully degrades the control and automatically transfers control to the other computer. Each computer is powered by independent high reliability power supplies. If both computers fail, the unit is shutdown, by the fail-safe insertion of neutron absorbers. In summary, the design concepts include redundancy, independence, graceful degradation and failing safe.

The computers are programmed in Assembler, using absolute coding. (The core location and contents of every word of coding is known by looking at the listing without having to refer to a core map and then use octal arithmetic.) Breakpoint hardware is used for debugging. An executive program runs in core memory which schedules, executes and checks various other control programs and service functions. Core memory is also used for input/ouput validation and alarms. Fast periodic programs like the watchdog timer and reactor power control, remain in core memory and run every half second. Slow periodic programs like boiler level control run every two seconds. Version control is rigourously applied using both software and administrative methods. Changes are tested, installed on the controlling computer, and confirmed safe, before being installed on the standby computer.

[1] http://www.aecl.ca/Assets/Publications/C6-Technical-Summary.pdf?method=1
[2] http://www.mapps.l-3com.com/html/power/candu.html

Answer 5

Over a VPN. As far away as possible

Answer 6

I am not an expert. I just tell you what I heard.

For mission critical systems, the reference language is Ada. The development process is very strict and focuses on a test driven strategy with very small and highly tested (and stressed) routines. To address the potential of a crash, there is not only one system, but multiple redundant systems (in terms of sensors and processing units) which perform a "voting" procedure.

I don't know more than this.

Answer 7

I'd probably use some combination of Windows ME and Visual Basic 6. Then I'd RUN LIKE HELL.

Answer 8

A strange game. The only winning move is not to play.

Answer 9

Not exactly a nuclear power plant but similarly mission critical is software for manned space missions.

The Nasa mission critical software process works off 4 propositions:

The product is only as good as the plan for the product.
The best teamwork is a healthy rivalry.
The database is the software base.
Don't just fix the mistakes -- fix whatever permitted the mistake in the first place.

Consider these stats : the last three versions of the program -- each 420,000 lines long-had just one error each. They must be doing something right.

There is a very good article explaining these propositions here:

"They Write the Right Stuff" ^[1]

Obviously this cost a lot of money!

[1] http://www.fastcompany.com/magazine/06/writestuff.html?page=0%2C0

Answer 10

A formal specification language ^[1] such as Z or Object-Z is a must. The software producing organization should have a high CMMI level as well, 4 or 5.

[1] http://en.wikipedia.org/wiki/Formal%5Fmethods

Answer 11

I remember about a month or so ago there was an article on slashdot about how NASA developed defect free software (that's how it was referred to) - it had a specific example from NASA. They made sure that they had very clear specs (written IIRC using Z), and had lots of testing, etc. You can find one of their documents here ^[1].

I am trying to find the link, but can't see it atm. Will post it later when I can find it.

In general, I would say that the following would be important:

Make sure that you have very specific specifications (use Z or some other formal language)
You can never write too many tests (for such a high risk application)
Make sure that you choose the correct framework / language / development environment (e.g. IIRC the Java license does not even permit the use of Java in nuclear plants)

EDIT: marcc found this ^[2] link ^[3] (second link shows everything on one page), which explains a bit more about how NASA operate, but isn't the link I was looking for.

[1] http://www.hq.nasa.gov/office/codeq/doctree/871913B.pdf
[2] http://www.fastcompany.com/magazine/06/writestuff.html
[3] http://www.fastcompany.com/node/28121/print

Answer 12

Apparently C++ is good enough for managing nuclear warheads. See the Coding Standards for the Joint Strike Fighter ^[1] (PDF).

[1] http://www.research.att.com/~bs/JSF-AV-rules.pdf

Answer 13

There are standards and certifications for software development of safety-critical systems:

Avionics: DO-178B ^[1]
Industrial: IEC 61508 ^[2]
Nuclear: IEC 61513 / IEC 62138
Railway: EN 50128

...and so on. Most of these overlap heavily with variations on CMMI-like processes, restrictions on language usage, requirements for fault analysis, diagnostics, fail safe states, etc.

So, in short, developing software for safety critical systems is not something you need to figure out on your own.

[1] http://en.wikipedia.org/wiki/DO-178B
[2] http://en.wikipedia.org/wiki/IEC%5F61508

Answer 14

You wouldn't believe how much stuff runs on 30-year old software.

But that's besides the point - the answer you're looking for is that nuclear power plants, and pretty much most power plants, and all other kind of objects don't rely on software for their running operations. Think of it, ... how old are some powerplants, how long is their predicted lifetime (and a lot of them exceedes that lifetime) ... do you really wish to make them rely on buggy software on operating systems that change every 10 years ?

No, when it comes to that kind of objects, you have physical controlling mechanisms (from valves up to releys, up to ...), with alarms, then some more valves, then some more alarms and human monitoring, and then maybe some software controlling of processes (but that software still can't avoid the valve) ... you see my point probably.

How that old saying goes ?

If architects builded buildings like programmers build software,
the first woodpecker that came would destroy the civilization.

Answer 15

Very carefully. Or not at all.

Answer 16

Depending on what the nuclear plant does with the material, be careful about what Google software you use in connection with your development work. By agreeing to their Terms of Service, you're also agreeing to:

(iv) not license, sell, provide or distribute the Software for use in connection with chemical, biological, or nuclear weapons or missiles capable of delivering such weapons

Sources:

https://registration.keyhole.com/download_earth_pro.html ^[1]

http://sketchup.google.com/download/license_pro.html ^[2]

http://toolbar.google.com/gmail-helper/terms_mac.html ^[3]

[1] https://registration.keyhole.com/download%5Fearth%5Fpro.html
[2] http://sketchup.google.com/download/license%5Fpro.html
[3] http://toolbar.google.com/gmail-helper/terms%5Fmac.html

Answer 17

You might be interested in this document ^[1], which talks extensively about the current experience (as of 2004).

With regards to the programming languages specifically, here's paragraph 5.5.2 (p. 53):

Most evolutionary I&C; [TR: Instrumentation and Controls] designs use some variant of the C computer language. Overall, there were no reported problems when the C language was used. By contrast, other software languages have had various issues. For Westinghouse implementations, the choice of the PUM-86 computer language proved to be too microprocessor-specific. Because of the limited use of this language, it proved difficult to expand its use across different applications. The lack of familiarity with the language among vendor and plant personnel also contributed to problems, such as reduced sources of support and limited data. Because of similar problems, the PL-1 and PASCAL languages have been replaced by C. ADA was adopted for use in the Temelin - Class 1 E diverse protection system because of its unique characteristics and its history of development and use by the U.S. Military. However, for the above-mentioned reasons, ADA will most likely not be used in future reactor designs.

So, it's C.

[1] http://www.nrc.gov/reading-rm/doc-collections/nuregs/contract/cr6842/cr6842.pdf

Answer 18

You might wish to read the book SafeWare ^[1] by Nancy Leveson, it has some good case studies for software and dealing with preventing hazards.

[1] http://books.google.com/books?id=ZrZQAAAAMAAJ

Answer 19

Can you say what you are trying to learn from answers to this question?

Possibly you're wondering "could we do more to make sure our software doesn't break? What if we wrote software like they do for nuclear power plants"?

If so, then I don't think you've found the correct analogy. The cost of bugs in the systems of a nuclear power plant would be so high that, it's possible that software is not even permitted.

If this is what you're looking for, then I think you should look for examples of software where failure would be very expensive, but would not be life-threatening. Maybe systems that deal in millions of dollars per second, I don't know. But I think you want something achievable.

Chances are the differences aren't so much in QA, as in process, to make sure the bugs never get into the code in the first place.

Answer 20

Use what the regulator allows. No, seriously, you do NOT get to choose sometimes. It's quite possible that people will suggest crazy things like commodity operating systems.

This is the same mess the US SCADA industry is in with little to no security.

So the my money would be on locked down Solaris X ( it has quite nice real-time support, in addition to being like a bank vault.)

Ravenscar Ada springs to mind for the code . As noted you can't use Java. I've used Real-time Java for weapon systems and it works really well; but maybe one day Realtime java will be okay for nuclear plants; in which case it would be good.

Big ups on heavy formal methods, and using a whole-system simulator build in Matlab or similar. No I'm not smoking crack. The flight system guys now use Mathtlab's code generator at least for simulation.

And really heavy testing. Yes Veronica, we will be expecting 100% scenario coverage.

Answer 21

I think I would spend 98.99999997% of my time writing test cases and testing my code. I would spend 1% writing code and the remainder on StackOverflow.

Answer 22

During the landing of the first space shuttle mission (STS-1) all five redundant computers failed (due to a hardware fault). Mission commander John Young took over manual control for the landing.

So the lesson is... always have a manual override.

Answer 23

Don't use Java. It is not approved for use in nuclear power plants.

Answer 24

Erlang ^[1] with reported cases of 99,9999999% of availability would be a serious candidate language.

I'd also count on a very experienced team and a lots of effort on code coverage tests as well on stress tests.

[1] http://www.pragprog.com/articles/erlang

Answer 25

Using a formal method that you can mathematically prove you are not going to fail.

There are methods such as the B- Method ^[1] that are used in safety critical systems, notably the Paris metro.

[1] http://en.wikipedia.org/wiki/B-Method

Answer 26

I'd spent most of the time writing detailed spec, as the guys in NASA do.

I found this article ^[1] very interesting.

[1] http://www.fastcompany.com/node/28121/print

Answer 27

I would only work as telecommute. Better from some other continent.

And for design would definitely recommend a hardware switch that cuts out PC control and puts everything on manual.

I wonder if there are developers that work on nuclear plant software among stackoverflow.com audience.

There are sure such developers somewhere, like guys working in CERN (haven't seen them alive though).

There should also be developers who work with hadron collider. They have likely already made a few bugs there. Though the thing crashed after a few days of operation, there is likely to be a memory leak. I mean, I used to find a few things on my desktop in Germany somewhat shifted from the original position I left them in the direction of Switzerland (micro black hole or whatever they created but did not properly disposed). Scary...

Answer 28

For industrial control systems using PLCs there are tools availible that can analyze every possible state of the software. By using this data (in form of state graphs for example) you can see if there are any dead ends or other strange situations and thus rewrite the program to prevent those states from even existing.

Disclaimer: I really have no idea how nuclear software is made but i belive that such tools would be really helpful for this kind of application.

Answer 29

I'm sure a lot of the "software" actually used in nuclear power plants is on Windows, like most businesses: Excel and Word and Acrobat and Outlook. I'm sure they have boring old CRUD applications for rod inventory.

Nuclear power plants, like most large systems, are going to be made up of a combination of digital and analog controls and embedded and general purpose computers. The components will be programmed in a variety of languages and the choice in each case is going to be dictated by the individual requirements.

Answer 30

Definitely not with agile, and yes with an waterfall process.

In the tools wich I would use, certainly there will be formal tools wich I could verified with some mathematics, like Petri networks.

If the software will able to run in Windows I will write it in Delphi, just because java doesn't allows it.

Answer 31

This posting ^[1] discusses safety critical software with quite a lot of fan-out links.

[1] http://stackoverflow.com/questions/243387/best-language-for-safety-critical-software

Answer 32

CERN uses LabVIEW to control the LHC. If LabVIEW is good enough to recreate black holes, Higgs Bosons and the Big Bang I'm sure it can handle whimply ole' nuclear fision. :)

Answer 33

No garbage collection, no dynamic memory allocation, no multitasking and multi threading. Evrything must be stupid safe.

Answer 34

Design Diversity (n-version programming):

Quoting " Choosing Effective Methods for Diversity ^[1]":

Design diversity is a popular defence against design faults in safety critical systems. Design diversity is at times pursued by simply isolating the development teams of the different versions, but it is presumably better to “force” diversity, by appropriate prescriptions to the teams. There are many ways of forcing diversity.

Quoting " Simulating Specification Errors and Ambiguities in Systems Employing Design Diversity ^[2]":

In n-version programming different software versions, written to the same specification but developed independently execute in parallel. It is imperative that there is no communication between the teams responsible for developing the different versions. Quarantining the different teams is essential such that misunderstandings from one team do not affect the understanding of other teams. But quarantining teams is not always enough uncorrelated faults in distinct versions can lead to identical failures. [...] Many people have written off n-version programming as a dead approach to attaining high integrity software because of the n-version problem.But to our amazement n-version programming is alive and well in several different safety critical domains and it is particularly popular outside of the United States.

Quoting " Research on Diversity and Software Fault Tolerance at the Center for Software Reliability ^[3]":

The use of diversity – doing things differently, in two or more ways, to protect against the failures of single procedures – has been ubiquitous in safety-critical industries for decades. In many of these applications, the benefits have been regarded as ‘obvious’, and it is only in more recent years that there have been formal models and studies of efficacy. [...] More recently (in the past 25 years) there has been considerable interest in the use of diversity in software-based systems. A driver for this research was the need for very highly reliable software, coupled with the realisation that there were severe difficulties in making a single version of a program very reliable (e.g. via reliability growth from extensive testing and debugging) (Miller, Morell et al. 1992; Littlewood and Strigini 1993). The use of multi-version software, developed independently and adjudicated at run-time, seemed a possible way out of the difficulties: early work in the field was probably motivated by an analogy with hardware redundancy. [...] There are some early applications of software diversity that appear to have been successful: examples include critical flight-control computers on Airbus aircraft (Briere and Traverse 1993); various railway signalling and control systems, see e.g. (Hagelin 1988). After experiencing many years of operational use, there seem to be no reports of catastrophic failure of these systems attributable to software design faults.

Quoting " Digital Avionics: A Computing Perspective ^[4]":

Software is intangible, so it cannot exhibit degradation faults. Rather, software failures are necessarily due to design faults that cannot be masked through simple replication due to the lack of failure independence. Design diversity is a popular technique that attempts to overcome this difficulty by employing arrays of redundant components, each with a dissimilar design or implementation. Airbus and Boeing both use design diversity in their flight control systems but in different ways.

Airbus employs a design diversity technique called multiversion programming or N-version programming [3]. In multiversion programming, several system implementations are prepared from the same set of requirements by different developers under the presumption that the designs prepared by each developer will be independent—that is, the probability that one implementation will fail on a particular set of inputs given that another implementation has failed on those inputs is equal to the probably of the implementation’s failing alone. The various implementations are then assembled into a classical redundancy architecture in which they are run in parallel on the same inputs and their outputs are passed into a voter to check agreement. If a design fault is activated in one of the implementations, then, according to the theory, it is unlikely that the other implementations will also possess the fault and should continue to function. Clearly, the assurance that can be placed on multiversion programming rests on the assumption of design independence, and evidence exists that this assumption does not hold for all types of software systems [7].

The Boeing 777 FCS was not developed through multiversion programming but rather by employing diversity in the microprocessor architecture. Boeing compiled the software for the 777 FCS for multiple machine architectures and runs each version in tandem during system operation. This approach allows the 777 FCS to tolerate design faults in a specific microprocessor well as those introduced during compilation. It does not, however, provide any resilience to faults resulting from errors in the common source code from which the versions were built [13].

[1] http://www.springerlink.com/content/ejcg5kruncye34uu/
[2] http://www.cigital.com/papers/download/pnsqc97.pdf
[3] http://www.csr.city.ac.uk/projects/diversity/
[4] http://www.computer.org/portal/web/readynotes/sample-digital-avionics-a-computing-perspective

Answer 35

Remotely, if possible.

Answer 36

I understood that for some space missions in the past CLIPS expert system was used to control launching ..etc but also programming languages like LISP which keeps you in the state of the flow are considered to be safe for safety critical applications.

Answer 37

Not really a practice, but a strong OS might be a start: http://en.wikipedia.org/wiki/VxWorks

Answer 38

I once heard at university that some coders aren't allowed to compile or run their own code. They have to work out for themselves whether it works as it ought to, and only then does it get compiled and tested.

Answer 39

Great post - I enjoyed reading it. Here is a great Canadian Standard for Nuclear Power Plants. N290.14-07 "Qualification of pre-developed software for use in safety related instrumentation and control applications in nuclear power plants". In it, it describes what standards to use when deciding if software can be installed in safety systems. Example if you can get somethng the is IEC sil level 4 then yes its good to go into the safety system.