AP 7.01 checkpoints wrong and cause lockup

Rasputin42
Volunteer tester

Joined: 6 Mar 09
Posts: 8
Credit: 72,401
RAC: 0
United States
Message 51565 - Posted: 11 Jul 2014, 18:57:44 UTC

Hi,
I have my checkpoint interval set to 3 minutes.
When the AstroPulse 7.01 task (opencl_nvidia_100) writes its first checkpoint, after about 25 minutes of runtime or 5 seconds of CPU time, the progress is reset to 0.900% (it was at about 1.5%) and does not increase any further. No further checkpoints are written.

It uses 1 NVIDIA GPU and 0.0896 CPUs.
ID: 51565 · Report as offensive
Rasputin42
Volunteer tester

Joined: 6 Mar 09
Posts: 8
Credit: 72,401
RAC: 0
United States
Message 51566 - Posted: 11 Jul 2014, 19:10:48 UTC

Correction:
The 2nd checkpoint was written at 11 seconds of CPU time. Progress jumps to 1.801% but is otherwise stuck (until the next checkpoint, I guess).
ID: 51566 · Report as offensive
Claggy
Volunteer tester

Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 33
United Kingdom
Message 51569 - Posted: 11 Jul 2014, 19:54:15 UTC - in response to Message 51565.  

The Astropulse GPU builds only show progress every 0.9%. Up until the first checkpoint, if an app doesn't report its progress, BOINC will estimate it.
This might show as progress, then a drop to zero, before showing real progress:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=9136a369d4e15cc727c06b55b50c833e184bf9fc

client: if app doesn't report fraction done, estimate it

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=34f252870310b18c7cbe3e71573daff6b01e768c

client: if app doesn't report fraction done, estimate fraction done in a way that converges to but never reaches 100%.
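
As a rough illustration of the second commit message, here is a minimal sketch of an estimate that approaches but never quite reaches 100%. The exponential form and the names `estimated_fraction_done`, `elapsed` and `estimated_total` are assumptions made for this example; the linked commit is the authority on what the client actually computes.

```python
import math

def estimated_fraction_done(elapsed: float, estimated_total: float) -> float:
    """Hypothetical estimator: grows with elapsed time but stays below 1.0.

    Mathematically 1 - exp(-x) never reaches 1; in floating point it only
    rounds to 1.0 for very large ratios of elapsed to estimated time.
    """
    if estimated_total <= 0:
        return 0.0
    return 1.0 - math.exp(-elapsed / estimated_total)

# Illustrative numbers only: a task estimated at 100 minutes.
for minutes in (5, 25, 100, 1000):
    frac = estimated_fraction_done(minutes * 60, 100 * 60)
    print(f"{minutes:5d} min -> {100 * frac:.3f}%")
```

Once the app reports a real fraction done, a value like this would simply be replaced, which is one way the displayed progress could appear to fall back.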


Claggy
ID: 51569 · Report as offensive
Rasputin42
Volunteer tester

Joined: 6 Mar 09
Posts: 8
Credit: 72,401
RAC: 0
United States
Message 51570 - Posted: 11 Jul 2014, 20:09:21 UTC

However, the checkpoint intervals are not 3 minutes (as I set them) but more like 25 minutes.
Also, they are not at regular intervals (in either CPU time or real time).

Progress increases at every checkpoint by 0.901%.
So how do I know how much progress is really being made?

More concerning is the long time between checkpoints.
ID: 51570 · Report as offensive
Claggy
Volunteer tester

Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 33
United Kingdom
Message 51572 - Posted: 11 Jul 2014, 21:45:38 UTC - in response to Message 51570.  
Last modified: 11 Jul 2014, 21:49:05 UTC

However, the checkpoint intervals are not 3 minutes (as I set them) but more like 25 minutes.

That setting means an app may checkpoint every 3 minutes, NOT that it must checkpoint at 3 minutes. Apps are programmed to checkpoint at particular points; if the app hasn't reached a point where it may checkpoint, then it can't.
Your GPU is very slow, so you're just going to have to put up with it only checkpointing every 25 minutes.
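
To make the "may, not must" point concrete, here is a toy simulation. It assumes the app can only save state at the end of a fixed-length stage of work; the function name, the stage lengths and the stage counts are hypothetical and are not taken from the AstroPulse code.

```python
def checkpoint_times(stage_seconds, n_stages, min_interval):
    """Return the times (in seconds) at which checkpoints are written.

    A checkpoint is only possible at the end of a stage, and only if at
    least `min_interval` seconds have passed since the previous one.
    """
    times = []
    last = 0.0
    now = 0.0
    for _ in range(n_stages):
        now += stage_seconds  # no state can be saved in the middle of a stage
        if now - last >= min_interval:
            times.append(now)
            last = now
    return times

# Slow GPU: 25-minute stages with a 3-minute preference.
slow = checkpoint_times(stage_seconds=25 * 60, n_stages=4, min_interval=180)
# Fast GPU: 1-minute stages with the same preference.
fast = checkpoint_times(stage_seconds=60, n_stages=9, min_interval=180)
print([t / 60 for t in slow])  # [25.0, 50.0, 75.0, 100.0]
print([t / 60 for t in fast])  # [3.0, 6.0, 9.0]
```

In this model the 3-minute setting only takes effect when stages are shorter than 3 minutes; on the slow device the stage length decides the spacing.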

And if you keep interrupting it before it gets to the first checkpoint, then of course it won't make progress:

### Restart at 0.00 percent.
state.fold_buf_size_short=65536; state.fold_buf_size_long=262144
GPU device synched
Termination request detected or computations are finished. GPU device synched, exiting...


Claggy
ID: 51572 · Report as offensive
Rasputin42
Volunteer tester

Joined: 6 Mar 09
Posts: 8
Credit: 72,401
RAC: 0
United States
Message 51573 - Posted: 11 Jul 2014, 23:08:26 UTC

Yes, my GPU is slow.
This is why the software needs to compensate for the vast range of GPU speeds that are out there.
Other projects can do that.
It is important not to lose too much computing time when switching occurs.
If this project does not want slowish GPUs, then exclude them.

ALL glory to the latest, greatest, most expensive cards that exist, f*** the rest.
ID: 51573 · Report as offensive
Josef W. Segur
Volunteer tester

Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 51576 - Posted: 12 Jul 2014, 4:57:10 UTC

The project doesn't have the resources to make OpenCL GPU apps, so it is using Raistmer's builds. Those definitely don't waste time on non-essential things, and indeed have been somewhat aimed at high performance crunching.

It of course would be possible to add more frequent progress updates contingent on a command line parameter, or perhaps automatic for slower GPUs. This is Beta testing, so requests for enhancement are certainly allowed. But please understand that Raistmer's efforts are volunteered and his view of what's most important may differ from yours.
Joe
ID: 51576 · Report as offensive
Raistmer
Volunteer tester

Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 51618 - Posted: 18 Jul 2014, 8:13:52 UTC - in response to Message 51573.  


ALL glory to the latest, greatest, most expensive cards that exist, f*** the rest.


I would say that is quite a misleading and irritating conclusion. Until recently, efforts were made to support even pre-OpenCL ATi GPUs (via Brook+), not only almost the whole range of OpenCL ones.
Pay more attention to forum discussions, or pay a hired programmer to increase that support instead.
Issue reports and suggestions are welcome; whining is not.
News about SETI opt app releases: https://twitter.com/Raistmer
ID: 51618 · Report as offensive
Raistmer
Volunteer tester

Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 51619 - Posted: 18 Jul 2014, 8:25:17 UTC - in response to Message 51573.  

Yes, my GPU is slow.
This is why the software needs to compensate for the vast range of GPU speeds that are out there.
Other projects can do that.
It is important not to lose too much computing time when switching occurs.


Before it can make any progress in the computation, the app needs to do some preparation.
On resume, that preparation has to be repeated. This takes time.
If a slow GPU can't work uninterrupted for long enough, it will only repeat that preparation and make no progress. In that case it is indeed useless for this project, not because the GPU is slow, but because the user of that GPU can't provide uninterrupted work for it. Make your choice.
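
A small model of this effect, as a sketch only: each restart pays a fixed preparation cost, and only whole stages that finish inside the uninterrupted window reach a checkpoint. The function name and all of the numbers are hypothetical and chosen just to show the threshold behaviour.

```python
def stages_saved(window_seconds, setup_seconds, stage_seconds):
    """Number of stages checkpointed in one uninterrupted run window.

    Every (re)start first pays `setup_seconds`; only whole stages that
    finish inside the window reach a checkpoint and survive a restart.
    """
    usable = window_seconds - setup_seconds
    if usable <= 0:
        return 0
    return int(usable // stage_seconds)

# Hypothetical numbers: 5 minutes of preparation, 25 minutes per stage.
setup, stage = 5 * 60, 25 * 60
for window_min in (20, 29, 30, 60):
    saved = stages_saved(window_min * 60, setup, stage)
    print(f"{window_min} min window -> {saved} stage(s) saved")
```

With these made-up values, any window shorter than 30 minutes saves nothing, however many times the task is restarted.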

Also, to save state (make a checkpoint), some data has to be returned from the GPU to host memory. This is the slowest memory channel of all. Hence the checkpoint locations were chosen to minimize or fully exclude such transfers. Adding those transfers would allow the slowest GPUs to make some progress (though even more slowly than they currently do when working uninterrupted), but would also slow down ALL other types of GPU. Hence it's not viable for a generalized app. A separate app for the slowest GPUs that can't work uninterrupted is definitely technically possible, but resources for such development would have to be provided. The average-consumer approach will not work with volunteers. One can blame a corporation for the ugly drivers it releases: it sells devices that are just trash without drivers. But here we don't make a profit. If you want to help, you are welcome. But this is not a consumer support forum.
ID: 51619 · Report as offensive



 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.