I have just discovered this report on the reasons for the failure of the Ariane 5 rocket. According to the report, the failure occurred when a 64bit floating point number was assigned to a 16bit integer.
One of the, many, reasons why Ada is a reliable language, is that it uses strong typing, and has proper handling of exceptions. I don't understand how it was possible to write code that attempted this conversion and have it compile correctly.
There's also the question of why an exception handler didn't exist for this condition, which is also peculiar, but, perhaps, more of a failure of the programmer than the language - though an Ada project that left live code with potential exceptions but no exception handler is difficult to imagine.
Any ideas?
http://www-users.math.umn.edu/~arnold/disasters/ariane5rep.html
The answer is easy: it is always possible to explicitly convert, and that seems to have been done with the Ariane 5 code:
-- Overflow is correctly handled for the vertical component
L_M_BV_32 := TBD.T_ENTIER_16S((1.0 / C_M_LSB_BH) *
G_M_INFO_DERIVE(T_ALG.E_BH));
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elseif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TBD.T_ENTIER_16S(L_M_BV_32));
end if;
-- But not for the horizontal one
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TBD.T_ENTIER_16S
((1.0 / C_M_LSB_BH) *
G_M_INFO_DERIVE(T_ALG.E_BH));
Here the T_ENTIER_16S function no doubt converts the floating point to a 16 bit signed value. Note that "ENTIER" is French so the function is likely available from an internal library.
How to handle explicit floating point to integer conversion in Ada is of course available from your favorite Q/A site.
For Ada the conversion is very explicit; the programmers should have thought about the scenario. But maybe they did and thought it should never occur. The calculation probably needs to be performant as well; time is of essence when flying a rocket after all.
That the result of the calculation doesn't require a very high amount of precision is clear from the vertical calculations: it just maxes out the values of the 16 bit signed integer instead of upgrading the integer to 32 bit. Returning negative values of values close to zero is what lead to the failure - it would warp the calculations too much.
Be warned that maxing out a bounded integer value is a stop-gap measure that may not work in every situation. The "final solution" used for this project may not work in your project.
Note that somehow they assumed that returning an almost random value made more sense then throwing an exception in the T_ENTIER_16S function. this is possibly because of performance reasons; the function could simply copy the corresponding bits, saving two if statements to check for overflow / underflow.
Exceptions are nice, but they are of little value for these kind of calculations during runtime. The problem is that the rocket depends on the function to execute correctly. If it doesn't it will fail and crash. The only time an exception is handy is when there is recourse. Otherwise exceptions may just help with analyzing the error after the fact.
In this case an assertion guard statement could have been used within the T_ENTIER_16S. This error would then have been caught during testing, if the input of the test set would be adequate. Unfortunately, at that time, assertions were not available in Ada.
And the inflight code - when the assertions are disabled - could have returned MIN / MAX values instead of a direct bit representation. Anything is better than returning completely unexpected results. That is: if the additional testing for the values isn't an issue with regards to the running time of the function.
In short: it hasn't be pure programming problem, it's more complex problem
The defect on Ariane 5 was not caused by a single cause. Throughout the development and testing processes, there were many stages at which this defect could be identified.
• The software module was reused in a new environment where operating conditions differed from the requirements of the software module. These requirements have not been revised.
• The system has identified and recognized the error. Unfortunately, the specification of the error handling mechanism was inconsistent and caused final destruction.
• An erroneous module has never been properly tested in a new environment — neither at the hardware level, nor at the system integration level. Therefore, the fallacy of development and implementation was not detected.
See diff with correct code:
--- LIRE_DERIVE.ads 000 Tue Jun 04 12:00:00 1996
+++ LIRE_DERIVE.ads Fri Jan 29 13:50:00 2010
@@ -3,10 +3,17 @@
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-P_M_DERIVE(T_ALG.E_BH) :=
- UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
+L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
+
+if L_M_BH_32 > 32767 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
+elsif L_M_BH_32 < -32768 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
+else
+ P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
+end if;
More detailed info:
A) The “Operand Error” error occurred due to an unexpectedly large value of BH (Horizontal Bias) calculated by an internal function based on the value of the “horizontal speed” measured by the sensors on the Platform.
The value of BH served as an indicator of the accuracy of the Platform positioning. the BH value was much larger than expected because the flight path of the Ariane 5 at an early stage was significantly different from the flight path of the Ariane 4 (where this software module was used earlier), which led to a significantly higher "horizontal speed".
The final action, which had fatal consequences, was the termination of the processor. Accordingly, the entire Navigation System has ceased to function. It was technically impossible to resume her actions.
B) However, Ariane 5, unlike the previous model, had a fundamentally different discipline for performing pre-flight actions - so different that the work of the fatal program module after the launch time did not make sense at all. However, the module was reused without any modifications.
The investigation revealed that there were as many as seven variables involved in type conversion operations in this software module. It turned out that the developers conducted an analysis of all operations that could potentially generate an exception for vulnerability.
It was their very conscious decision to add proper protection to four variables, and leave three - including BH - unprotected. The reason for this decision was the conviction that for these three variables the occurrence of an overflow situation is impossible in principle.
This confidence was supported by calculations showing that the expected range of physical flight parameters, on the basis of which the values of these variables are determined, is such that it cannot lead to an undesirable situation. And that was true - but for the trajectory calculated for the Ariane 4 model.
And the rocket of the new generation Ariane 5 was launched on a completely different trajectory, for which no evaluations were performed. Meanwhile, it (together with a high initial acceleration) was such that the “horizontal speed” exceeded the calculated (for Ariane 4) more than five times.
Protection for all seven (including BH) variables was not provided, because the maximum workload of 80% was declared for the IRS computer. Developers had to look for ways to reduce unnecessary computational costs and they weakened protection where a theoretically undesirable situation could not arise. When it arose, such an exception handling mechanism came into effect, which turned out to be completely inadequate.
This mechanism included the following three basic actions.
• Information about the occurrence of an emergency situation must be transmitted via the bus to the onboard computer OBC.
• In parallel, she - along with the entire context - was recorded in the reprogrammable EEPROM memory (which, during the investigation, it was possible to restore and read its contents).
• The IRS processor was supposed to crash.
The last action turned out to be fatal - it was he who happened in a situation that actually was normal (despite the software exception generated due to an unprotected overflow), which led to the catastrophe.
C) Velocity was represented as a 64-bit float. A conversion into a 16-bit signed integer caused an overflow The current velocity of Ariane 5 was too high to be represented as a 16-bit integer Error handling was suppressed for performance reasons
( Protection for all seven (including BH) variables was not provided, because the maximum workload of 80% was declared for the IRS computer. Developers had to look for ways to reduce unnecessary computational costs and they weakened protection where a theoretically undesirable situation could not arise. )
According to a presentation 1,2 by Jean-Jacques Levy (who was part of the team who searched for the source of the problem), the actual source code in Ada that caused the problem was as follows.:
-- Vertical velocity bias as measured by sensor
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
-- Check, if measured vertical velocity bias ban be
-- converted to a 16 bit int. If so, then convert
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-- Horizontal velocity bias as measured by sensor -- is converted to a 16 bit int without checking P_M_DERIVE
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
The last line (shown here as two lines of text) caused the overflow, where the conversion from 64 bits to 16 bits unsigned is not protected. The code before is protected by testing before the assignment if the number is too big.
The correct code would have been:
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
if L_M_BH_32 > 32767 then
P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
elsif L_M_BH_32 < -32768 then
P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
end if;
in other words, the same overflow check should have been present for the horizontal part of the calculation (E_BH) as was already present for the vertical part of the calculation (E_BV).
'Source: http://moscova.inria.fr/~levy/talks/10enslongo/enslongo.pdf
( On 2019-01-24 can be found at: http://para.inria.fr/~levy/talks/10enslongo/enslongo.pdf )
https://habr.com/ru/company/pvs-studio/blog/306748/
https://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%9F%E3%83%83%E3%82%B7%E3%83%A7%E3%83%B3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With