Unix & Linux: What is a better way to sum floating-point values? Summing values with 8 to 9 decimal places in Unix gives an incorrect sum
[0] [4] kawenal6962go-mail.com
[2020-04-15 08:22:00]
[ linux shell-script ]
[ https://unix.stackexchange.com/questions/580167/what-is-better-way-to-sum-floating-values-unix-sum-the-value-up-to-8-to-9-scal ]

I have a scenario where I am running into an issue: summing values with 8 to 9 decimal places in Unix gives me an incorrect sum. How do I fix it?

The command I used:

awk -F '"?\\|"?' '{T+=$(2)} END {printf "%.2f\n",T}' demofile.txt

This is the link to the previous question I posted: Why is there a difference between these two sum commands? [1]

Is there a better way to deal with this, so that I can get an accurate sum using awk, bc, or dc?

Demo data

1|"12.8"|demo1
2|"13.5678341234567"|demo1
3|"14.578"|demo1
4|"15.58"|demo1
5|"16.56784"|demo1
6|"17.578"|demo1
7|"18.678"|demo1
8|"19.568890123"|demo1
9|"20.588792"|demo1
Without seeing the input data, it's difficult to even reproduce your issue. - Kusalananda
@Kusalananda I have added a testing file - unix.stackexchange.com/questions/579958/… - kawenal6962go-mail.com
@Kusalananda I will add it to the above question as well; wait 2 minutes, I am editing - kawenal6962go-mail.com
GNU awk as standard does full IEEE precision arithmetic, accurate to 15 or 16 digits for integers and real numbers. There is an option to extend that precision with -M but you probably don't need it. You realise %.2f rounds to 2 decimal places? Not possible to diagnose this without seeing some input, and a short version of data and results that clearly demonstrates the issue. - Paul_Pedant
@Paul_Pedant - i have posted my full issue in this post please look may this help you to further analysis and provide me the output what need to be done for it unix.stackexchange.com/questions/579958/… - kawenal6962go-mail.com
@Kusalananda please look the question posted previous which is interrelated to this unix.stackexchange.com/questions/579958/… - kawenal6962go-mail.com
which sum do you get and which sum do you expect instead? - frostschutz
(1) ALERT: I GET A FRAUD ALERT FROM MY BROWSER FOR THE SENDSPACE FILE DOWNLOAD. PROCEED AT YOUR OWN RISK. I DON'T NEED THIS, AND I DON'T WANT AN 18MB TEXT FILE TO DEMONSTRATE THE ISSUE. - Paul_Pedant
@Paul_Pedant - look at this post - i have added question with example and command and tested on same file which i sent you unix.stackexchange.com/questions/579958/… - kawenal6962go-mail.com
(2) @kawenal6962go-mail.com Please add a representative sample of the data to the question itself. Data kept on a remote server may be gone tomorrow, which renders your question useless. - Kusalananda
@Kusalananda should I upload the file again? - kawenal6962go-mail.com
@Kusalananda pls download and test and give me any solution - download file sendspace.com/file/ufwmzu - kawenal6962go-mail.com
@kawenal6962go-mail.com If you can provide a representative sample of the data that still triggers the problematic behavior, please add it to the text of the question. - Kusalananda
@kawenal6962go-mail.com You cannot demand solutions, not even by adding "please". Users here solve issues if they are easy to reproduce and if the issue interests them. Linking off to an external site makes it harder for people to care. - Kusalananda
@Kusalananda the file has 18MB of data, but I will share 10 records in the question to show how it looks; I am editing the question now, please look - kawenal6962go-mail.com
(1) the problem does not occur on a smaller file. Use yes 0.1 | head -n 10000000 | awk '{sum+=$1}END{printf "%.5f",sum}' for demonstration instead of a file. - pLumo
@Kusalananda I have edited the question with sample data ... the problem occurs on a large data set file ... I have shared a sample of 10 records - kawenal6962go-mail.com
(1) I'm with @Paul_Pedant here. My commercially supported antivirus is screaming at me that there's a trojan on the referenced site. - Chris Davies
[+2] [2020-04-15 09:05:58] Paul_Pedant

You don't say the file size (i.e. how many rows you are adding). The download claimed 18.3MB before the site showed up as "Dangerous" and "Fraud Alert". If the average row length is 18, that's a million floats being added, and we don't know the span of values. The total you show in the question is 13.2 digits, so the average value per line is around 7 digits, with unknown variability.

If you keep adding values like 27.865326635297 to a running total that is getting close to 13 whole-number digits, then only the 27.87 (rounded) part is going to make it into the total because the .00532... is outside the 15 or 16 digits result range. Sometimes those errors cancel out, sometimes they don't: Monte Carlo arithmetic.
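The erosion described above can be demonstrated without any data file. A minimal sketch in plain awk (IEEE doubles): 0.1 is not exactly representable in binary, and adding it ten million times drifts measurably away from the exact total of 1000000.

```shell
# Each addition rounds the running total back to 53 bits of mantissa,
# so the low-order error accumulates over ten million iterations.
awk 'BEGIN {
    for (i = 0; i < 10000000; i++) sum += 0.1
    printf "%.5f\n", sum    # not exactly 1000000.00000
}'
```

This is the same effect pLumo's comment demonstrates with `yes 0.1 | head -n 10000000`, just without generating the intermediate stream.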

Check output from awk --version. If it mentions MPFR and MP, your awk is compiled with extended-precision math. You just add -M 113 to your awk command. That's the mantissa length that gets you quadruple precision real arithmetic -- 33 digit accuracy.

www.gnu.org/software/gawk/manual/gawk.html#Arbitrary-Precision-Arithmetic


I downloaded the file yesterday, there are 722277 lines in it. I cannot guarantee it's the same file though. - Kamil Maciorowski
@KamilMaciorowski yes, it's the same file. Any other solution? - kawenal6962go-mail.com
@KamilMaciorowski gawk -M ... is not working on my PC - kawenal6962go-mail.com
@kawenal6962go-mail.com What stops you from installing gawk? - Kamil Maciorowski
You previously mentioned running dc and bc, so you are not on Windows. What distrib are you on? I can make a dc version to any accuracy you like, but I can't vouch for performance. You would need an awk to convert the file into a stream of dc commands, but not do any numerical work. Easy: if you need it, post again. You might check your own dc man page to see if it conforms -- I heard some versions of dc are fake and still only do IEEE 16-digit accuracy. - Paul_Pedant
There is some likelihood of getting better accuracy in standard awk. The issue is the discrepancy in numerical magnitude between the running total and the individual values. So totalling the rows in groups of 100, and then those in 10,000 (100 x 100), and then those in 1,000,000 (10,000 * 100), should preserve the cumulative accuracy much better. - Paul_Pedant
I mis-quoted the magic for -M. The full preamble is like: awk -M -v PREC=200 ... where -M invokes extended precision and the rest sets the PREC variable to the number of mantissa bit required. - Paul_Pedant
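Putting the corrected invocation together with the command from the question, the extended-precision version would look something like the sketch below. It assumes a gawk built with MPFR support (check `gawk --version` for "MPFR"); `demofile.txt` and the field separator are from the question.

```shell
# -M switches gawk to arbitrary-precision (MPFR) arithmetic;
# PREC sets the mantissa length in bits (200 bits is roughly
# 60 decimal digits, far beyond IEEE double's 15-16).
gawk -M -v PREC=200 -F '"?\\|"?' '{T += $2} END {printf "%.10f\n", T}' demofile.txt
```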
1
[0] [2020-04-15 11:51:31] Paul_Pedant

This is a method based on the dc command (assuming this has adequate accuracy compiled in). It dresses up the second column with dc commands, and works to 60-digit (200-bit) precision.

This runs on the 10 data lines provided previously, plus a couple of extreme values. It shows intermediate sums: to remove these, remove the 'p' just before the \n where awk emits $2.

Paul--) cat awkToDc
#! /bin/bash

function Data { cat <<'EOF'
1|"12.8"|demo1
2|"13.5678341234567"|demo1
3|"14.578"|demo1
4|"15.58"|demo1
5|"16.56784"|demo1
6|"17.578"|demo1
7|"18.678"|demo1
8|"19.568890123"|demo1
9|"20.588792"|demo1
10|"55555555555555555555000000000000"|demo1
11|"20.588792"|demo1
12|"0.000000000000000000077777777777"|demo1
EOF
}

function dataDC {

    AWK='
BEGIN { FS = "\042"; printf ("60 k 0\n"); }
{ printf ("%s + p\n", $2); }
END { printf ("p q\n"); }
'
    awk "${AWK}"
}

Clarification on the emitted dc commands (which are in reverse polish notation):

'60 k' sets the arithmetic precision, and '0' initialises the total.
' +' adds the value from $2 to the total. 'p' prints the running total for illustration.
'p q' prints the final total, and quits.

    Data | dataDC | dc

Paul--) ./awkToDc
12.8
26.3678341234567
40.9458341234567
56.5258341234567
73.0936741234567
90.6716741234567
109.3496741234567
128.9185642464567
149.5073562464567
55555555555555555555000000000149.5073562464567
55555555555555555555000000000170.0961482464567
55555555555555555555000000000170.096148246456700000077777777777
55555555555555555555000000000170.096148246456700000077777777777
Paul--) 
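For readers unfamiliar with dc, the generated command stream can also be typed by hand. This minimal sketch (assuming a GNU-style dc is installed) reproduces the first two additions from the transcript above:

```shell
# "60 k" sets the precision and "0" seeds the total; each
# "<value> + p" line adds one value and prints the running total;
# "q" quits -- exactly the stream the awk program emits per row.
printf '60 k 0\n12.8 + p\n13.5678341234567 + p\nq\n' | dc
```

Because dc works in decimal, the two printed totals (12.8 and 26.3678341234567) are exact, with no binary rounding.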

I now have four tested techniques (against your test file of 722277 rows), with accuracy ratings.

Using gawk with precision 200-bits, and dc with precision 60-digits, both agree on the same 33-digit total, which I suspect is exact.

25396577843.7560139069641121618832

Using gawk in standard IEEE accuracy (should be 15 or 16 digits) only agrees with the first 12 of those digits. I assume a million additions erode the accuracy as the exponents become more disjoint.

25396577843.7769622802734375

I found a recursive addition algorithm in standard awk too. This initially adds values according to the last 5 digits of NR, to make 100,000 subtotals. Then it totals those, reducing the number of digits to 4, 3, 2, 1, and finally a single total. Each number therefore gets only 60 additions. That result agrees with the first 16 digits of the high-precision ones, which is as good as could be expected.

25396577843.756011962890625
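A single-level version of that grouping idea can be sketched in standard awk: one bucket per 100 rows, then a final pass over the buckets (`demofile.txt` and the field layout are taken from the question; the full recursive version would repeat the bucketing on the subtotals).

```shell
# Partial sums keep numbers of similar magnitude together, so fewer
# low-order bits are lost than when every value is added straight
# into one very large running total.
awk -F '"?\\|"?' '
    { bucket[int((NR - 1) / 100)] += $2 }   # 100 values per bucket
    END {
        for (b in bucket) total += bucket[b]
        printf "%.10f\n", total
    }
' demofile.txt
```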


Performance is not an issue: my laptop does a million rows in 13 seconds. - Paul_Pedant
2
[0] [2020-04-16 03:41:42] vonbrand

Check out Kahan summation [1], it tries to keep track of the rounding error and compensates. A must for such huge sums.

[1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
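Kahan's compensated loop is short enough to write directly in standard awk. A sketch against the question's field layout (`demofile.txt` assumed):

```shell
# c carries the low-order error lost in the previous additions;
# subtracting it from the next value feeds the lost bits back in.
awk -F '"?\\|"?' '
    {
        y = $2 - c          # compensate with the accumulated error
        t = sum + y         # big + small: low bits of y are lost here
        c = (t - sum) - y   # recover exactly what was lost
        sum = t
    }
    END { printf "%.10f\n", sum }
' demofile.txt
```

One caveat: aggressive compiler or interpreter optimizations that reassociate floating-point expressions can cancel the `(t - sum) - y` step; gawk evaluates it as written.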

3
[0] [2020-04-28 18:15:12] Stéphane Chazelas

With csvtool and bc:

$ csvtool -t '|' col 2 A | paste -sd + - | bc
149.5073562464567
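If csvtool is not available, the same pipeline can be approximated with awk doing only the field extraction and bc doing all the arithmetic (a sketch; `demofile.txt` and the separator are from the question):

```shell
# awk extracts column 2, paste joins the values with "+", and bc
# evaluates the resulting expression in exact decimal arithmetic.
awk -F '"?\\|"?' '{print $2}' demofile.txt | paste -sd + - | bc
```

Since bc never converts the values to binary floating point, the decimal digits of the inputs are preserved exactly through the additions.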

4