app_info for AP500, AP503, MB603 and MB608

Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893141 - Posted: 9 May 2009, 20:06:31 UTC - in response to Message 893131.  

Now the similarity - when those WU's finally do run, they error with the same exit code -5 = no result file.

I got my first error -5 today, while doing some manual suspend/resume work to get a Beta test run to start before its time.

The full message is:

<core_client_version>6.6.28</core_client_version>
<![CDATA[
<message>
 - exit code -5 (0xfffffffb)
</message>
<stderr_txt>
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

File: ..\worker.cpp
Line: 123

</stderr_txt>
]]>

- so nothing to do with result files. [BOINC's message about a missing file is always a consequence of the crash, and says nothing whatsoever about the cause of the crash]
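
As a side note for anyone decoding these messages: the two numbers in "exit code -5 (0xfffffffb)" are the same 32-bit value shown once as signed and once as unsigned, and errno=2 is the standard "no such file" code. A minimal Python sketch of the conversion (the helper names are my own, not anything from BOINC):

```python
import errno

def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value

def to_unsigned32(value: int) -> int:
    """Interpret a (possibly negative) integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF

print(to_signed32(0xFFFFFFFB))   # the hex form from the message, as a signed number
print(hex(to_unsigned32(-5)))    # and back again
print(errno.errorcode[2])        # symbolic name for errno=2
```

So the stderr text is the useful part: the application could not find work_unit.sah when it tried to open it.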

The behaviour of BOINC when suspending/resuming CUDA tasks does still seem to have some problems - I'll try to write it up for boinc_alpha.

Knowing the usual response - Fred, would you be willing to catch some debug logs and screen shots (of BOINC, not the footy) next time there's a good match on?

Re-looking at my results, mine matches your result exactly, Richard.
Anything I can do to help, of course. I could even watch something other than soccer ;). What debug flags / screen-shots are needed?

F.
ID: 893141 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893167 - Posted: 9 May 2009, 21:16:39 UTC - in response to Message 893141.  

Jorden usually says "Run at least once with these flags on: <work_fetch_debug>, <sched_op_debug> and one run with <debt_debug>", but I think he's away this weekend, so we can make our own rules ;-)

Don't worry about work_fetch or debt_debug for this one: sched_op hardly seems to do anything. I think we're going to need <cpu_sched> and <cpu_sched_debug>.
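
For anyone who hasn't set these before: as far as I know the flags go inside a <log_flags> section of cc_config.xml in the BOINC data directory, each set to 1. A small Python sketch that builds that fragment (the function name is made up for illustration, and writing the file and getting the client to re-read it is left to you):

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build a cc_config document that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is an element whose text is 1 (on) or 0 (off).
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

config_xml = build_cc_config(["cpu_sched", "cpu_sched_debug"])
print(config_xml)
```

The output is a single unindented line; the layout is cosmetic, so indent it however you like before saving.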

For screenshots, I was thinking of those "every 2 that start, one reports and the second goes to 'waiting to run'" - something showing those 'waiting to run' on the tasks tab, perhaps with a matching later one showing the errors.
ID: 893167 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893208 - Posted: 9 May 2009, 23:04:13 UTC - in response to Message 893167.  

OK. I will give that a whirl tomorrow (I should be watching the Spanish Grand Prix anyway). I've now upgraded to CUDA 2.2 (and had to increase the GPU fan speed considerably) and to BOINC 6.6.28, so if it still happens, that will show that neither the old driver nor 6.6.23 is causing it.

F.
ID: 893208 · Report as offensive
Profile Questor Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 893470 - Posted: 10 May 2009, 19:03:31 UTC - in response to Message 893208.  

Well I've gone back and given this another go and am getting the same results as you are :-


<![CDATA[
<message>
- exit code -5 (0xfffffffb)
</message>
<stderr_txt>
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

File: ..\worker.cpp
Line: 123


</stderr_txt>

Also get <status>-161</status> in the file_info section of client_state.xml which I use to track the errors and do a fix up.

Apologies for the "no result file" red herring. I think I had this problem mixed up with another issue I had.

I've also tried running with a few avg/max_ncpus values ranging from 0.15 down to 0.04 on a Core 2 6600 / GT9600T (driver 18250) on Windows XP (not exhaustively).

I hadn't previously tried the lower settings, but on my setup they seem to produce the same level of -5 errors when suspending as the higher settings.
0.099, which I had found to be the most reliable, now still gives some -5 errors, so it is hard to say what all the influences are.

Another thing I have noticed on just this machine is that some CUDA jobs occasionally hang and the elapsed and completed times continue to clock up into hours but progress doesn't move.
I don't think they are VLAR tasks (I have been doing a fairly coarse rebranding of those since I started getting flooded by them) and stop/starting BOINC kicks them into life again.

I also notice that some versions of app_info.xml in these threads give avg/max_ncpus = 1.000000 settings for the non-CUDA apps.
My files do not have this entry. Is this significant, or is it just there for completeness, with 1.000000 being the default setting anyway?

As I seem to be able to recreate the error -5s at will, I am happy to produce some debug output if it is of use.


John.

GPU Users Group



ID: 893470 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893486 - Posted: 10 May 2009, 19:49:03 UTC

I think I've worked out what's happening here, and I've submitted a bug report. Fred, you have mail (ISP permitting).

It seems likely to happen whenever a running CUDA task is pre-empted, and a new CUDA task started in its place. I think it will happen with every version of BOINC from v6.6.23 onwards, because of a bug with new code added then:

Changes for 6.6.23

- client: for coproc jobs, don't start a job while a quit is pending. Otherwise the new job may fail on memory allocation.

If the job start is delayed to allow the old one to tidy up after itself, the next attempt is treated as a re-start, instead of a new start: that's presumably why the work_unit.sah file isn't copied into the slot directory ready for use.
ID: 893486 · Report as offensive
Profile Questor Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 893499 - Posted: 10 May 2009, 20:07:32 UTC - in response to Message 893486.  

I wasn't familiar with using the debug output, so I thought I'd have a go anyway out of interest. I assume this is what you were also seeing, Richard?

1. When everything works OK, suspending a task is followed by another one starting and running normally.

2. Sometimes an attempt to start a new task after a suspend fails because the previous task has not exited (or BOINC thinks it hasn't).
However, by the time this happens BOINC has already changed the sched state of the new task from 0 to 2.

When it does manage to run the task, it does so as a resume, not a start. As the task has never run before, you get
the error and it exits.
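
To make the two cases above concrete, here is a toy Python model of the suspected logic. It is only a sketch of my reading of the logs, not BOINC's actual code: the constant names, the Task class and enforce_schedule() are all invented for illustration, and only the numeric values 0 and 2 and the file name work_unit.sah come from this thread.

```python
from dataclasses import dataclass, field

# Numeric values as they appear in the cpu_sched_debug output; names are my own labels.
NEVER_SCHEDULED = 0
SCHEDULED = 2

@dataclass
class Task:
    name: str
    sched_state: int = NEVER_SCHEDULED
    slot_files: set = field(default_factory=set)

def enforce_schedule(task: Task, quit_pending: bool) -> str:
    """Toy version of the suspected faulty ordering; returns what happened."""
    first_run = task.sched_state == NEVER_SCHEDULED
    # Suspected bug: the state is flipped before we know the start went ahead.
    task.sched_state = SCHEDULED
    if quit_pending:
        return "deferred"
    if first_run:
        # An initial start copies the input file into the slot directory.
        task.slot_files.add("work_unit.sah")
        return "start (initial)"
    if "work_unit.sah" not in task.slot_files:
        return "resume -> error -5"
    return "resume"

good = Task("151")
print(enforce_schedule(good, quit_pending=False))

bad = Task("234")
print(enforce_schedule(bad, quit_pending=True))    # old task still quitting
print(enforce_schedule(bad, quit_pending=False))   # retry is treated as a resume
```

In this model the fix would be to leave the state alone until the start has really happened, but whether that matches the real client code is for the developers to say.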


John




---------------------------------------------------------------------------------------------------------------------------------------

Here is a suspend that does not error :-


10-May-2009 20:17:07 [SETI@home] task 07fe09ab.14570.6207.11.8.131_1 suspended by user
10-May-2009 20:17:07 [---] [cpu_sched_debug] Request CPU reschedule: result suspended, resumed or aborted by user
10-May-2009 20:17:08 [---] [cpu_sched_debug] schedule_cpus(): start
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] scheduling 07fe09ab.14570.6207.11.8.151_1 (coprocessor job, FIFO)
10-May-2009 20:17:08 [---] [cpu_sched_debug] reserving 1 of coproc CUDA
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] highest debt: -3600.000000 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] scheduling 07fe09aa.24575.16841.15.8.21_0 (CPU job, debt order)
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] highest debt: -7200.000000 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] scheduling 01fe09ae.13226.25021.14.8.65_0 (CPU job, debt order)
10-May-2009 20:17:08 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
10-May-2009 20:17:08 [---] [cpu_sched_debug] enforce_schedule(): start
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] want to run: 07fe09ab.14570.6207.11.8.151_1
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] want to run: 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] want to run: 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] processing 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] processing 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] processing 07fe09ab.14570.6207.11.8.151_1
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09aa.24575.16841.15.8.21_0 sched state 2 next 2 task state 1
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.75_0 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.61_0 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.72_0 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.74_0 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.31_1 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 01fe09ae.13226.25021.14.8.65_0 sched state 2 next 2 task state 1
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.129_1 sched state 1 next 1 task state 0
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.131_1 sched state 2 next 1 task state 1
10-May-2009 20:17:08 [SETI@home] [cpu_sched] Preempting 07fe09ab.14570.6207.11.8.131_1 (left in memory)
10-May-2009 20:17:08 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.151_1 sched state 0 next 2 task state 0
10-May-2009 20:17:08 [SETI@home] Starting 07fe09ab.14570.6207.11.8.151_1
10-May-2009 20:17:08 [SETI@home] [cpu_sched] Starting 07fe09ab.14570.6207.11.8.151_1 (initial)
10-May-2009 20:17:08 [SETI@home] Starting task 07fe09ab.14570.6207.11.8.151_1 using setiathome_enhanced version 608
10-May-2009 20:17:08 [---] [cpu_sched_debug] enforce_schedule: end
10-May-2009 20:17:26 [---] Exit requested by user
10-May-2009 20:17:27 [---] [cpu_sched_debug] Request CPU reschedule: exit_tasks


---------------------------------------------------------------------------------------------------------------------------------------



---------------------------------------------------------------------------------------------------------------------------------------

and here is one that does error from the point of suspend onwards :-


10-May-2009 20:33:18 [SETI@home] task 07fe09ab.14570.6207.11.8.135_1 suspended by user
10-May-2009 20:33:18 [---] [cpu_sched_debug] Request CPU reschedule: result suspended, resumed or aborted by user
10-May-2009 20:33:18 [---] [cpu_sched_debug] schedule_cpus(): start
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] scheduling 07fe09ab.14570.6207.11.8.234_0 (coprocessor job, FIFO)
10-May-2009 20:33:18 [---] [cpu_sched_debug] reserving 1 of coproc CUDA
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] highest debt: -3600.000000 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] scheduling 07fe09aa.24575.16841.15.8.21_0 (CPU job, debt order)
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] highest debt: -7200.000000 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] scheduling 01fe09ae.13226.25021.14.8.65_0 (CPU job, debt order)
10-May-2009 20:33:18 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
10-May-2009 20:33:18 [---] [cpu_sched_debug] enforce_schedule(): start
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] want to run: 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] want to run: 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] want to run: 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] processing 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] processing 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] processing 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09aa.24575.16841.15.8.21_0 sched state 2 next 2 task state 1
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.75_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.61_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.72_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.74_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.31_1 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 01fe09ae.13226.25021.14.8.65_0 sched state 2 next 2 task state 1
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.129_1 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.131_1 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.151_1 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.141_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.154_0 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.138_1 sched state 1 next 1 task state 0
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.142_0 sched state 1 next 1 task state 9
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.135_1 sched state 2 next 1 task state 1
10-May-2009 20:33:18 [SETI@home] [cpu_sched] Preempting 07fe09ab.14570.6207.11.8.135_1 (removed from memory)
10-May-2009 20:33:18 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.234_0 sched state 0 next 2 task state 0

**** state of new task 234 = 0 ****

10-May-2009 20:33:18 [---] [cpu_sched_debug] enforce_schedule: end
10-May-2009 20:33:18 [---] [cpu_sched_debug] coproc quit pending, deferring start
10-May-2009 20:33:18 [---] [cpu_sched_debug] Request enforce CPU schedule: coproc quit retry

**** unable to start new task 234 - last task not exited yet??? ****

10-May-2009 20:33:19 [---] [cpu_sched_debug] enforce_schedule(): start
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] want to run: 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] want to run: 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] want to run: 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] processing 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] processing 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] processing 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09aa.24575.16841.15.8.21_0 sched state 2 next 2 task state 1
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.75_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.61_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.72_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.74_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.31_1 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 01fe09ae.13226.25021.14.8.65_0 sched state 2 next 2 task state 1
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.129_1 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.131_1 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.151_1 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.141_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.154_0 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.138_1 sched state 1 next 1 task state 0
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.142_0 sched state 1 next 1 task state 9
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.135_1 sched state 1 next 1 task state 8
10-May-2009 20:33:19 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.234_0 sched state 2 next 2 task state 0

**** State of new task 234 now = 2 ****

10-May-2009 20:33:19 [---] [cpu_sched_debug] enforce_schedule: end
10-May-2009 20:33:19 [---] [cpu_sched_debug] coproc quit pending, deferring start
10-May-2009 20:33:19 [---] [cpu_sched_debug] Request enforce CPU schedule: coproc quit retry

**** Second attempt to restart task failed ****

10-May-2009 20:33:20 [---] [cpu_sched_debug] Request CPU reschedule: application exited
10-May-2009 20:33:20 [---] [cpu_sched_debug] schedule_cpus(): start
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] scheduling 07fe09ab.14570.6207.11.8.234_0 (coprocessor job, FIFO)
10-May-2009 20:33:20 [---] [cpu_sched_debug] reserving 1 of coproc CUDA
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] highest debt: -3600.000000 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] scheduling 07fe09aa.24575.16841.15.8.21_0 (CPU job, debt order)
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] highest debt: -7200.000000 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] scheduling 01fe09ae.13226.25021.14.8.65_0 (CPU job, debt order)
10-May-2009 20:33:20 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
10-May-2009 20:33:20 [---] [cpu_sched_debug] enforce_schedule(): start
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] want to run: 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] want to run: 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] want to run: 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] processing 07fe09aa.24575.16841.15.8.21_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] processing 01fe09ae.13226.25021.14.8.65_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] processing 07fe09ab.14570.6207.11.8.234_0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09aa.24575.16841.15.8.21_0 sched state 2 next 2 task state 1
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.75_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.61_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.72_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.74_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.4571.11.8.31_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 01fe09ae.13226.25021.14.8.65_0 sched state 2 next 2 task state 1
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.129_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.131_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.151_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.141_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.154_0 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.138_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.142_0 sched state 1 next 1 task state 9
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.135_1 sched state 1 next 1 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched_debug] 07fe09ab.14570.6207.11.8.234_0 sched state 2 next 2 task state 0
10-May-2009 20:33:20 [SETI@home] [cpu_sched] Starting 07fe09ab.14570.6207.11.8.234_0(resume)

**** Third attempt -now trying to resume task 234 which has never run yet ****

10-May-2009 20:33:20 [SETI@home] Restarting task 07fe09ab.14570.6207.11.8.234_0 using setiathome_enhanced version 608
10-May-2009 20:33:20 [---] [cpu_sched_debug] enforce_schedule: end
10-May-2009 20:33:21 [---] [cpu_sched_debug] Request CPU reschedule: application exited
10-May-2009 20:33:21 [SETI@home] Computation for task 07fe09ab.14570.6207.11.8.234_0 finished

**** Task 234 exits without doing anything ****


---------------------------------------------------------------------------------------------------------------------------------------
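
If you want to spot this pattern in your own logs without reading every line, something like the following Python sketch may help. It assumes the [cpu_sched] "Starting ... (initial)" and "Starting ...(resume)" line formats shown above; the function name is made up, and a log excerpt that begins after a task's genuine first start will produce false positives.

```python
import re

# Matches both "Starting <task> (initial)" and "Starting <task>(resume)".
START_RE = re.compile(r"\[cpu_sched\] Starting (\S+?)\s*\((initial|resume)\)")

def find_resumes_without_start(log_lines):
    """Return tasks that were resumed without an earlier initial start in this log."""
    started = set()
    suspicious = []
    for line in log_lines:
        match = START_RE.search(line)
        if not match:
            continue
        task, kind = match.groups()
        if kind == "initial":
            started.add(task)
        elif task not in started:
            suspicious.append(task)
    return suspicious

SAMPLE_LOG = [
    "10-May-2009 20:17:08 [SETI@home] [cpu_sched] Starting 07fe09ab.14570.6207.11.8.151_1 (initial)",
    "10-May-2009 20:33:20 [SETI@home] [cpu_sched] Starting 07fe09ab.14570.6207.11.8.234_0(resume)",
]
print(find_resumes_without_start(SAMPLE_LOG))
```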

GPU Users Group



ID: 893499 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893505 - Posted: 10 May 2009, 20:31:58 UTC - in response to Message 893499.  

I wasn't familiar with using the debug output so thought I'd have a go anyway out of interest. I assume that is what you were also seeing Richard?

1. When everything works OK after suspending a task, another one runs OK.

2. Sometimes an attempt to start a new task after a suspend fails because the previous task has not exited (or it thinks it hasn't)
When this occurs however it has already changed the sched state of the task from 0 to 2.

When it does manage to run the task it does it as a resume not a start. As the task has never run previously you get
the error and it exits.

John

Yes, that's exactly what I reported - both the change from sched state 0 to sched state 2, and the restart.
ID: 893505 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893525 - Posted: 10 May 2009, 22:28:39 UTC - in response to Message 893505.  

I'm trying to work out why the completion of one EDF task on my GTX295 should shove the WU on the other core into "Waiting to run" so that 2 new tasks can be started simultaneously (and that "Waiting to run" WU then loses its output file).
@Richard: You have mail - I can't make head nor tail of it but it may contribute another element to the eventual fix.

F.
ID: 893525 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893527 - Posted: 10 May 2009, 22:39:37 UTC - in response to Message 893525.  

@Richard: You have mail.

Received and forwarded to the bug-reporting list.
ID: 893527 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893529 - Posted: 10 May 2009, 22:41:43 UTC - in response to Message 893527.  
Last modified: 10 May 2009, 22:52:48 UTC

@Richard: You have mail.

Received and forwarded to the bug-reporting list.

Thank you. I guess I really ought to sign up.

F.

[edit]Creating that email sent my DCF from .23 to .64, resulting in another EDF, and I now have 30+ WU's "Waiting to run" that will fail (if I don't abort them).[/edit]

[edit2]Another feature of this is that when the WU that is pre-empted is close to the end of crunching (as is often the case on my GTX295), the elapsed time recorded against it is close to full crunch-time. If it is a "shorty" - no checkpoint - it restarts from the beginning, so the eventual crunch-time is double what it should be, which kicks the DCF further out and prolongs the EDF mode.[/edit2]
ID: 893529 · Report as offensive
MarkJ Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 893622 - Posted: 11 May 2009, 11:44:44 UTC - in response to Message 893486.  

I think I've worked out what's happening here, and I've submitted a bug report. Fred, you have mail (ISP permitting).

It seems likely to happen whenever a running CUDA task is pre-empted, and a new CUDA task started in its place. I think it will happen with every version of BOINC from v6.6.23 onwards, because of a bug with new code added then:

Changes for 6.6.23

- client: for coproc jobs, don't start a job while a quit is pending. Otherwise the new job may fail on memory allocation.

If the job start is delayed to allow the old one to tidy up after itself, the next attempt is treated as a re-start, instead of a new start: that's presumably why the work_unit.sah file isn't copied into the slot directory ready for use.


Presumably it's less likely to preempt tasks under 6.6.28 than under 6.6.23. But it's still an issue that needs to be addressed.

Richard, I cross-posted your message over at GPUgrid (hope you don't mind). A bunch of the guys over there are having issues with tasks aborting, possibly unrelated, but you never know.
BOINC blog
ID: 893622 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 893663 - Posted: 11 May 2009, 15:27:50 UTC

Received error:
11/05/2009 19:25:05 SETI@home [error] No app version for result: windows_x86_64 608

with :
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.04</avg_ncpus>
    <max_ncpus>0.04</max_ncpus>
    <plan_class>cuda</plan_class>
    <file_ref>
        <file_name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>cudart.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>cufft.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>libfftw3f-3-1-1a_upx.dll</file_name>
    </file_ref>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
</app_version>

What is wrong here ??
ID: 893663 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 893675 - Posted: 11 May 2009, 16:36:11 UTC - in response to Message 893663.  
Last modified: 11 May 2009, 16:38:24 UTC

Could it be because the platform is x86_64 and <file_name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</file_name> is only for the x86? That's just a guess though. Other than that and not having the flop count it is exactly like mine.


PROUD MEMBER OF Team Starfire World BOINC
ID: 893675 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893685 - Posted: 11 May 2009, 17:40:01 UTC - in response to Message 893663.  

I would try it without the <platform> tag. I don't think it is contributing anything, anyway, as you are calling a 32-bit App.

F.
ID: 893685 · Report as offensive
Profile KW2E
Joined: 18 May 99
Posts: 346
Credit: 104,396,190
RAC: 34
United States
Message 895622 - Posted: 16 May 2009, 21:30:15 UTC

BuMp.
ID: 895622 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 895633 - Posted: 16 May 2009, 21:46:10 UTC

Rob

There is no need to bump, as it is preserved in FAQ's, Astropulse, Cuda, MultiBeam - Read Only, along with a link to this thread.

Regards
Please consider a Donation to the Seti Project.

ID: 895633 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 895659 - Posted: 16 May 2009, 22:24:00 UTC

I had the same problem. Just remove the _64 from the platform tag and all should be well.
Dave
ID: 895659 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 895960 - Posted: 17 May 2009, 16:34:08 UTC - in response to Message 895659.  

Thanks all.
This stripped version works well now on my host (I crunch only MB on it for now).

<app_info>
    <app>
        <name>setiathome_enhanced</name>
    </app>
    <file_info>
        <name>AK_v8b_win_x64_SSSE3x.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cudart.dll</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cufft.dll</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>603</version_num>
        <file_ref>
            <file_name>AK_v8b_win_x64_SSSE3x.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>608</version_num>
        <avg_ncpus>0.04</avg_ncpus>
        <max_ncpus>0.04</max_ncpus>
        <plan_class>cuda</plan_class>
        <file_ref>
            <file_name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>cudart.dll</file_name>
        </file_ref>
        <file_ref>
            <file_name>cufft.dll</file_name>
        </file_ref>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>
    </app_version>
</app_info>
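
For anyone hand-editing their own file, a quick sanity check before restarting BOINC can save a round of errors. The Python sketch below is only an illustration (check_app_info and the trimmed SAMPLE are my own, not part of BOINC): it lists any <file_ref> whose file has no matching <file_info> entry, which as far as I understand the client expects, and any <app_version> that still carries a <platform> tag like the one that caused trouble above.

```python
import xml.etree.ElementTree as ET

def check_app_info(app_info_xml):
    """Return (undeclared_files, platform_tags) for an app_info.xml string."""
    root = ET.fromstring(app_info_xml)
    declared = {fi.findtext("name") for fi in root.findall("file_info")}
    undeclared, platforms = [], []
    for av in root.findall("app_version"):
        version = av.findtext("version_num")
        platform = av.findtext("platform")
        if platform is not None:
            platforms.append((version, platform))
        for ref in av.findall("file_ref"):
            name = ref.findtext("file_name")
            if name not in declared:
                undeclared.append((version, name))
    return undeclared, platforms

# Deliberately broken, cut-down example: cufft.dll is referenced but not declared.
SAMPLE = """<app_info>
  <app><name>setiathome_enhanced</name></app>
  <file_info><name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</name><executable/></file_info>
  <file_info><name>cudart.dll</name><executable/></file_info>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <platform>windows_x86_64</platform>
    <plan_class>cuda</plan_class>
    <file_ref><file_name>MB_6.08_mod_CUDA_V11_VLARKill_refined.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>cudart.dll</file_name></file_ref>
    <file_ref><file_name>cufft.dll</file_name></file_ref>
  </app_version>
</app_info>"""

undeclared, platforms = check_app_info(SAMPLE)
print(undeclared)
print(platforms)
```

Two empty lists mean the file is at least self-consistent on these two points; it says nothing about whether the executables themselves are the right ones for your host.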
ID: 895960 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 895974 - Posted: 17 May 2009, 17:11:19 UTC
Last modified: 17 May 2009, 17:12:20 UTC

I have just set up this PC to do Astropulse only, using the Intel opt app:
GenuineIntel Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz [x86 Family 6 Model 15 Stepping 11] (2 processors)
It is returning 2 results in around 15 hrs each.
Question: is there a reason why this is quicker than my i7, which is returning 8 results in around 20 hrs each?
I realise that it may be because 1 or more of the i7 cores is feeding the CUDA apps, but 5 hrs is a big difference.
I believe I have the same AP opt app running on both PCs (ap_5.03r112_SSE3.exe).
Should I run an SSE4 app on the i7?
Thanks in advance.
Dave
ID: 895974 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 895990 - Posted: 17 May 2009, 17:53:43 UTC - in response to Message 895974.  

Check around, Dave: you will see others have noticed the i7 being slower per task, though most comparisons are with Quads. It has to do with hyperthreading (basically running two WUs on one core). Though it takes you longer to do each WU, you are doing 8 at a time to their 4 (or 2 in this case).
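
Putting rough numbers on that, using only the figures from Dave's post and assuming all the tasks on each machine really do run side by side for the whole time (the function name is just for illustration):

```python
def results_per_day(concurrent_tasks: int, hours_per_task: float) -> float:
    """Throughput when `concurrent_tasks` run in parallel, each taking `hours_per_task`."""
    return concurrent_tasks * 24.0 / hours_per_task

core2_duo = results_per_day(2, 15)   # 2 tasks at a time, about 15 hrs each
i7 = results_per_day(8, 20)          # 8 tasks at a time, about 20 hrs each

print(f"Core 2 Duo: {core2_duo:.1f} results/day")
print(f"i7:         {i7:.1f} results/day")
```

So on those figures the i7 turns in roughly three times as many results per day, even though each individual WU takes about 5 hrs longer.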


PROUD MEMBER OF Team Starfire World BOINC
ID: 895990 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.