I am puzzled...


Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 914
Credit: 2,962,243
RAC: 42,015
United States
Message 1202091 - Posted: 3 Mar 2012, 14:51:09 UTC

I am puzzled by the behavior of my Q6600 system overnight. Last night, when I last looked, all the ATI WUs had a remaining time of 7 hours or so. This morning they are all at 90+ hours. The CPU WUs are showing 1-3 hours remaining, which looks to be the same as last night. Everything in the log looks OK to me and work is being done in the expected amount of time. I'm just not sure why the DCF for the GPU went whacko.
____________
I used to be a cruncher like you, then I took an arrow to the knee.

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22356
Credit: 29,266,927
RAC: 24,038
Germany
Message 1202093 - Posted: 3 Mar 2012, 14:56:43 UTC

You had one unit running over 16,000 seconds.

____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202094 - Posted: 3 Mar 2012, 14:59:52 UTC - in response to Message 1202093.

You had one unit running over 16,000 seconds.

And several VLAR running over 25,000 seconds - like task 2334375975.

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3960
Credit: 31,820,065
RAC: 10,250
United Kingdom
Message 1202098 - Posted: 3 Mar 2012, 15:03:34 UTC - in response to Message 1202094.

You had one unit running over 16,000 seconds.

And several VLAR running over 25,000 seconds - like task 2334375975.

Sounds like the ATI low GPU use Bug,

Claggy

Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 914
Credit: 2,962,243
RAC: 42,015
United States
Message 1202099 - Posted: 3 Mar 2012, 15:07:07 UTC

OK, so they should come back down over time? The system has only been up and crunching for 38 hours since it got overhauled. I know the two ATI 6670 GPUs aren't the fastest crunchers, but 90 hours is a little out of the ballpark.
____________
I used to be a cruncher like you, then I took an arrow to the knee.

Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 914
Credit: 2,962,243
RAC: 42,015
United States
Message 1202101 - Posted: 3 Mar 2012, 15:09:07 UTC

How do I check for this bug and is it correctable?
____________
I used to be a cruncher like you, then I took an arrow to the knee.

Michel448a
Volunteer tester
Joined: 27 Oct 00
Posts: 1188
Credit: 2,891,544
RAC: 174
Canada
Message 1202103 - Posted: 3 Mar 2012, 15:12:51 UTC
Last modified: 3 Mar 2012, 15:13:33 UTC

Each time a GPU task ticks over, the estimates can drop 30 seconds to 1 minute each; but when a CPU task ticks over, especially the big long ones, they can jump 8 to 12 minutes (on the estimates for each WU in the queue).

BOINC doesn't do estimates for CPU and GPU separately. It's one estimate for both.


Am I right, guys?
____________
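That single scale factor is the pre-7.0 BOINC duration correction factor (DCF): one per project, applied to the estimate of every queued task, CPU and GPU alike. A minimal sketch of the idea, with made-up numbers and simplified damping (illustrative only, not the actual client code):

# One DCF shared by CPU and GPU tasks (pre-7.0 BOINC clients).
dcf = 1.0

def on_task_finished(estimated_secs, actual_secs):
    # Nudge the shared DCF toward actual/estimated runtime: it jumps
    # up at once on a slow task, drifts down slowly on fast ones.
    global dcf
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        dcf = ratio
    else:
        dcf += 0.1 * (ratio - dcf)

def remaining_estimate(base_estimate_secs):
    # Every queued task, CPU or GPU, is scaled by the same factor.
    return base_estimate_secs * dcf

# One 25,000-second VLAR that was estimated at 3,000 seconds...
on_task_finished(3000, 25000)
# ...inflates every GPU estimate in the cache by the same factor:
print(remaining_estimate(600) / 60)  # a 10-minute GPU task now shows ~83 minutes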

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202107 - Posted: 3 Mar 2012, 15:21:32 UTC

The Application details for that host are showing a pretty insane APR for the ATI app.

I'll leave it to the ATI specialists to decide how plausible it is.

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22356
Credit: 29,266,927
RAC: 24,038
Germany
Message 1202117 - Posted: 3 Mar 2012, 15:44:13 UTC - in response to Message 1202101.

How do I check for this bug and is it correctable?


Not much you can do.
First thing is to restart the computer.
Reduce instances to 3; better would be 2 on your card.

Each time you notice a slowdown, suspend the GPU for 30 seconds and then resume.

____________
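With the optimised apps, the instance count comes from the -instances_per_device switch in app_info.xml (it appears in janneseti's snippet below). For two tasks per GPU, the relevant lines would look something like this (the exact values are illustrative; check the app's readme for your card):

<!-- two tasks share each GPU, so each task claims half a device -->
<cmdline>-period_iterations_num 20 -instances_per_device 2</cmdline>
<coproc>
    <type>ATI</type>
    <count>0.5</count>
</coproc>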

janneseti
Joined: 14 Oct 09
Posts: 79
Credit: 469,720
RAC: 140
Sweden
Message 1202210 - Posted: 3 Mar 2012, 22:31:12 UTC

I have the same problem with an ATI GPU.
The APR value shown on the Application details page is insane.
But if you want the remaining time in BOINC Manager to show more accurately for your GPU, there is a way.
In your BOINC folder, C:\ProgramData\BOINC\projects\setiathome.berkeley.edu, there is a file app_info.xml.
Add a new entry <flops>16000000000</flops> to the <app_version> section.
The value, in my case 16 GFlops, will probably have to be adjusted to match your GPU. (BOINC only reads app_info.xml at startup, so restart the client after editing it.)

Here is a snippet from app_info.xml.

<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>610</version_num>
    <avg_ncpus>0.05</avg_ncpus>
    <max_ncpus>0.05</max_ncpus>
    <!-- estimated speed of this app on the GPU; tune to your card -->
    <flops>16000000000</flops>
    <plan_class>ati13ati</plan_class>
    <cmdline>-period_iterations_num 20 -instances_per_device 1</cmdline>
    <coproc>
        <type>ATI</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <file_name>MB6_win_x86_SSE3_OpenCL_ATi_HD5_r390.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>MultiBeam_Kernels_r390.cl</file_name>
        <copy_file/>
    </file_ref>
</app_version>

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 3542
Credit: 46,067,116
RAC: 30,320
United States
Message 1202230 - Posted: 4 Mar 2012, 0:18:03 UTC - in response to Message 1202210.

Actually you want it in a different format than that.

It should look something like this:
<flops>200.52843553287e09</flops>

That is for my GPUs.
____________

janneseti
Joined: 14 Oct 09
Posts: 79
Credit: 469,720
RAC: 140
Sweden
Message 1202235 - Posted: 4 Mar 2012, 0:46:04 UTC - in response to Message 1202230.

Actually, you can use either format to give this value.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202237 - Posted: 4 Mar 2012, 0:50:49 UTC - in response to Message 1202230.
Last modified: 4 Mar 2012, 1:28:57 UTC

Actually you want it in a different format than that.

It should look something like this:
<flops>200.52843553287e09</flops>

That is for my GPUs.

According to Joe Segur, the parser that handles that value can cope with most reasonable numeric and scientific formats.

You do, however, need to keep your wits about you when dealing with numbers as large as that. Not many of us (except the bankers) regularly deal with a couple of hundred billion - of anything, even flops.

If you're used to exponential notation, 200e9 is indeed easier to get right than 200000000000. But then, why not 2e11? For this purpose, the minute fractional bits after the decimal point really don't matter.
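Assuming the parser really is that tolerant, all of these should come to effectively the same value:

<flops>200528435533</flops>        <!-- plain integer, easy to miscount -->
<flops>200.52843553287e09</flops>  <!-- arkayn's form -->
<flops>2e11</flops>                <!-- near enough; the fraction doesn't matter -->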

There is a nice example of this in today's New Scientist magazine. The website is subscription-only, so I'll have to quote instead of link:

BANKS are under orders to tighten their accounting, but we fear HSBC may be paying too much attention to the small things rather than the gigapounds. Andrew Beggs wanted to know the distance from his home to the bank's nearest branch. He consulted the bank's website, which came up with the answer 0.9904670356841079 miles (1.5940021810076005 kilometres). This is supposedly accurate to around 10^-13 metres, much less than the radius of a hydrogen atom. Moisture condensing on the branch's door would bring it several significant digits closer.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202321 - Posted: 4 Mar 2012, 10:28:57 UTC - in response to Message 1202237.
Last modified: 4 Mar 2012, 10:31:46 UTC

Testing superscripts for DA.

BANKS are under orders to tighten their accounting, but we fear HSBC may be paying too much attention to the small things rather than the gigapounds. Andrew Beggs wanted to know the distance from his home to the bank's nearest branch. He consulted the bank's website, which came up with the answer 0.9904670356841079 miles (1.5940021810076005 kilometres). This is supposedly accurate to around 10^-13 metres, much less than the radius of a hydrogen atom. Moisture condensing on the branch's door would bring it several significant digits closer.

Superscripts seem OK, but don't click the 'Use BBCode tags to format your text' link while editing, or you'll lose your changes. I'll tell him.

tullio
Joined: 9 Apr 04
Posts: 3402
Credit: 344,918
RAC: 89
Italy
Message 1202325 - Posted: 4 Mar 2012, 11:00:32 UTC

I can read many articles in New Scientist without even registering. I am a registered non-paying guest at Nature magazine and can read all its editorials and even some articles; the same goes for Nature Communications.
Tullio
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202327 - Posted: 4 Mar 2012, 11:16:40 UTC - in response to Message 1202325.

I can read many articles in New Scientist without even registering. I am a registered non-paying guest at Nature magazine and can read all its editorials and even some articles; the same goes for Nature Communications.
Tullio

I tried the section link on the front page for that story (Feedback: Highway exit with no return) before posting, but it told me I had to log in first. Not everybody here will want to create even a guest registration just to read one silly story.

tullio
Joined: 9 Apr 04
Posts: 3402
Credit: 344,918
RAC: 89
Italy
Message 1202328 - Posted: 4 Mar 2012, 11:28:25 UTC - in response to Message 1202327.

I have just read and printed a short article "Oceans acidifying at unprecedented speed". I grab what I can without registering from New Scientist, but Nature is a more reliable source.
Tullio
____________

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,003,215
RAC: 231
United States
Message 1202411 - Posted: 4 Mar 2012, 18:13:59 UTC - in response to Message 1202107.

The Application details for that host are showing a pretty insane APR for the ATI app.

I'll leave it to the ATI specialists to decide how plausible it is.

So it was indicating 1.4 TeraFlops and has since grown to 2.2 TeraFlops. I'm no ATI specialist, but I need not be one to say that's ridiculously high.

The cause is result_overflow tasks which haven't been marked as runtime_outlier by a sah_validate process, because the BOINC core client API all too often truncates the stderr. Task 2334954485 is a current example, though it is due to be purged soon. It has Run time 39.31, CPU time 30.91, Credit 0.33, but only the first 8 lines of stderr.txt were captured, and the result_overflow line would have come much later. The Validator has always looked for the result_overflow keyword so that the assimilated result is marked, and now it is also used to tell BOINC not to include the runtime in its averages.

In January, Matt Arsenault (Milkyway) noted on the boinc_dev list that about 2% of reports had truncated stderr sections, and provided a patch to fix the problem. Dr. Anderson apparently wasn't convinced it needed fixing, and even if he had implemented the patch, it would only have been effective for those alpha testing BOINC 7.0.x clients.

As a related note, the runtime_outlier detection for Astropulse v6 will be based on information in the uploaded result file rather than the stderr, which is supposed to go back in requests to the Scheduler. Perhaps it would be a good idea to make a similar change for SETI@home v7.
Joe
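A rough sketch of the check Joe describes (illustrative only; the field and function names here are invented, not the real sah_validate source):

# Mark overflow results so their runtimes don't feed the host's APR.
def mark_if_overflow(task):
    if "result_overflow" in task.stderr_txt:
        task.runtime_outlier = True  # runtime excluded from the averages
    # If the client truncated stderr before the overflow line was written,
    # the keyword is lost and the abnormally short runtime wrongly
    # inflates APR: the failure mode seen on this host.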


Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 914
Credit: 2,962,243
RAC: 42,015
United States
Message 1202422 - Posted: 4 Mar 2012, 18:56:22 UTC

I dropped to 2 tasks per GPU and the times have dropped from 90 hours to around 30. This is still 5-6 times longer than I have seen most tasks finish in. I'll just let it keep crunching and watch it.
____________
I used to be a cruncher like you, then I took an arrow to the knee.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,909,639
RAC: 13,531
United Kingdom
Message 1202477 - Posted: 4 Mar 2012, 21:34:16 UTC - in response to Message 1202321.

Superscripts seem OK, but don't click the 'Use BBCode tags to format your text' link while editing, or you'll lose your changes. I'll tell him.

It's safe to use again now, though I'm not sure I remember all the page headers being in the formatting pop-up before.
