This is not fair


Message boards : Number crunching : This is not fair

Tim
Volunteer tester
Joined: 19 May 99
Posts: 196
Credit: 236,179,048
RAC: 156,540
Greece
Message 1382939 - Posted: 20 Jun 2013, 6:09:22 UTC

Today I had 3 invalid AP tasks on my top rig (ID: 6716400).

All 3 tasks were marked "Completed, can't validate" on my rig.

As far as I could see, all the wingmen were running ATI GPUs.

Why didn't the server send the WU to a different kind of GPU, such as Nvidia, instead of trashing it with "Too many errors (may have bug)"?

Tim

____________

Wiggo
Joined: 24 Jan 00
Posts: 6684
Credit: 92,097,094
RAC: 73,813
Australia
Message 1382944 - Posted: 20 Jun 2013, 6:35:49 UTC - in response to Message 1382939.

Doesn't it just P&%# ya off when that happens?

Sadly I've been there far too many times, but what can 1 person do?

Cheers.

Tim
Volunteer tester
Joined: 19 May 99
Posts: 196
Credit: 236,179,048
RAC: 156,540
Greece
Message 1382946 - Posted: 20 Jun 2013, 6:45:16 UTC - in response to Message 1382944.

> Doesn't it just P&%# ya off when that happens?
>
> Sadly I've been there far too many times, but what can 1 person do?
>
> Cheers.


We are 2 now :-)
____________

Lionel
Joined: 25 Mar 00
Posts: 544
Credit: 221,442,944
RAC: 204,817
Australia
Message 1382948 - Posted: 20 Jun 2013, 6:54:02 UTC - in response to Message 1382946.


It's poorly thought-out code/operation. They should have put more thought into it.
____________

Sakletare
Joined: 18 May 99
Posts: 131
Credit: 20,831,551
RAC: 5,701
Sweden
Message 1382950 - Posted: 20 Jun 2013, 6:57:13 UTC

Sometimes I wish that the scheduler would send the workunit to different types of applications to safeguard against bugs, especially when there's an error.

Wiggo
Joined: 24 Jan 00
Posts: 6684
Credit: 92,097,094
RAC: 73,813
Australia
Message 1382953 - Posted: 20 Jun 2013, 7:03:33 UTC - in response to Message 1382946.


> We are 2 now :-)

I bet that there are a lot more than just us around here. ;-)

Cheers.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,032,696
RAC: 118,881
United States
Message 1382961 - Posted: 20 Jun 2013, 7:28:29 UTC

What do you think about this:

ap_01mr09ad_B1_P1_00224_20130619_24332.wu

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
3044311467 | 7008627 | 20 Jun 2013, 1:05:10 UTC | 20 Jun 2013, 1:10:18 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044311468 | 6797524 | 20 Jun 2013, 1:05:12 UTC | 15 Jul 2013, 1:05:12 UTC | In progress | --- | --- | --- | AstroPulse v6 Anonymous platform (ATI GPU)
3044318853 | 7016051 | 20 Jun 2013, 1:10:24 UTC | 20 Jun 2013, 1:33:16 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044348714 | 6958381 | 20 Jun 2013, 1:33:24 UTC | 20 Jun 2013, 1:38:32 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044354760 | 6743006 | 20 Jun 2013, 1:38:43 UTC | 20 Jun 2013, 1:44:29 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.01
3044363043 | 5944441 | 20 Jun 2013, 1:44:35 UTC | 20 Jun 2013, 1:49:43 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (ati_opencl_100)
3044369811 | 5856725 | 20 Jun 2013, 1:49:48 UTC | 20 Jun 2013, 1:54:57 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (ati_opencl_100)

Why should I even bother? This thing is gonna die. I'm going to run it and then receive an Invalid for my trouble. Whut?

Tim
Volunteer tester
Joined: 19 May 99
Posts: 196
Credit: 236,179,048
RAC: 156,540
Greece
Message 1382965 - Posted: 20 Jun 2013, 8:03:56 UTC - in response to Message 1382961.

> What do you think about this:
>
> ap_01mr09ad_B1_P1_00224_20130619_24332.wu
>
> Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
> 3044311467 | 7008627 | 20 Jun 2013, 1:05:10 UTC | 20 Jun 2013, 1:10:18 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
> 3044311468 | 6797524 | 20 Jun 2013, 1:05:12 UTC | 15 Jul 2013, 1:05:12 UTC | In progress | --- | --- | --- | AstroPulse v6 Anonymous platform (ATI GPU)
> 3044318853 | 7016051 | 20 Jun 2013, 1:10:24 UTC | 20 Jun 2013, 1:33:16 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
> 3044348714 | 6958381 | 20 Jun 2013, 1:33:24 UTC | 20 Jun 2013, 1:38:32 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
> 3044354760 | 6743006 | 20 Jun 2013, 1:38:43 UTC | 20 Jun 2013, 1:44:29 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.01
> 3044363043 | 5944441 | 20 Jun 2013, 1:44:35 UTC | 20 Jun 2013, 1:49:43 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (ati_opencl_100)
> 3044369811 | 5856725 | 20 Jun 2013, 1:49:48 UTC | 20 Jun 2013, 1:54:57 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (ati_opencl_100)
>
> Why should I even bother? This thing is gonna die. I'm going to run it and then receive an Invalid for my trouble. Whut?



Same thing. The server prefers to send WUs to ATI GPUs and CPUs.

I wonder how many of my 500 pending AP WUs are the same.

Tim

____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,032,696
RAC: 118,881
United States
Message 1382966 - Posted: 20 Jun 2013, 8:21:14 UTC - in response to Message 1382965.

It appears a large number of the older machines are having a problem with the new cal_ati app. I also had a problem with the cal_ati app with the 13.1 Legacy driver. There are a few others, but a large number are using that one driver. The app seems to work fine with the older driver 11.12. Interesting...

Workunit 1266285264

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
3044305625 | 5095320 | 20 Jun 2013, 1:00:31 UTC | 20 Jun 2013, 1:05:37 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044305626 | 7024445 | 20 Jun 2013, 1:00:30 UTC | 20 Jun 2013, 1:05:38 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044312144 | 6909960 | 20 Jun 2013, 1:05:43 UTC | 20 Jun 2013, 1:21:58 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044312145 | 5462673 | 20 Jun 2013, 1:05:46 UTC | 20 Jun 2013, 1:10:53 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044318942 | 6991546 | 20 Jun 2013, 1:11:05 UTC | 15 Jul 2013, 1:11:05 UTC | In progress | --- | --- | --- | AstroPulse v6 Anonymous platform (CPU)
3044334081 | 5215447 | 20 Jun 2013, 1:22:10 UTC | 20 Jun 2013, 1:27:22 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (ati_opencl_100)
3044341692 | 6797524 | 20 Jun 2013, 1:27:41 UTC | 15 Jul 2013, 1:27:41 UTC | In progress | --- | --- | --- | AstroPulse v6 Anonymous platform (ATI GPU)


Tim
Volunteer tester
Joined: 19 May 99
Posts: 196
Credit: 236,179,048
RAC: 156,540
Greece
Message 1382978 - Posted: 20 Jun 2013, 10:09:15 UTC

2 more added, again from ATI hosts.

Someone must kick something.

This is a waste of resources.


Tim

____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,032,696
RAC: 118,881
United States
Message 1382982 - Posted: 20 Jun 2013, 10:20:55 UTC - in response to Message 1382978.
Last modified: 20 Jun 2013, 10:27:18 UTC

I'm seeing a lot of these...

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
3044237945 | 6946917 | 19 Jun 2013, 23:55:20 UTC | 14 Jul 2013, 23:55:20 UTC | In progress | --- | --- | --- | AstroPulse v6 v6.02
3044237946 | 6863602 | 19 Jun 2013, 23:55:21 UTC | 20 Jun 2013, 0:00:28 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044244036 | 5877996 | 20 Jun 2013, 0:00:31 UTC | 20 Jun 2013, 3:39:29 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044507440 | 6940010 | 20 Jun 2013, 3:48:58 UTC | 20 Jun 2013, 3:55:02 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044529297 | 6908180 | 20 Jun 2013, 4:09:19 UTC | 20 Jun 2013, 4:14:47 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044554208 | 6991375 | 20 Jun 2013, 4:29:18 UTC | 20 Jun 2013, 5:05:26 UTC | Error while computing | 0.00 | 0.00 | --- | AstroPulse v6 v6.06 (cal_ati)
3044622244 | 6797524 | 20 Jun 2013, 5:28:09 UTC | 15 Jul 2013, 5:28:09 UTC | In progress | --- | --- | --- | AstroPulse v6 Anonymous platform (ATI GPU)

Nasty...
All AstroPulse v6 tasks

Here they come... http://setiathome.berkeley.edu/results.php?hostid=6645126

Tim
Volunteer tester
Joined: 19 May 99
Posts: 196
Credit: 236,179,048
RAC: 156,540
Greece
Message 1382985 - Posted: 20 Jun 2013, 11:00:20 UTC

The list is growing.

Way to go...

Tim

____________

William
Project donor
Volunteer tester
Joined: 14 Feb 13
Posts: 1580
Credit: 9,458,926
RAC: 8,660
Message 1382997 - Posted: 20 Jun 2013, 11:58:43 UTC

Oh dear, that looks like the new Brook app is having problems. One for Raistmer.

I fear that changing the scheduler so that it spreads problematic units across different platforms requires a fair bit of coding on David's part. Not something easily set in motion.
____________
A person who won't read has no advantage over one who can't read. (Mark Twain)

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3988
Credit: 109,731,284
RAC: 132,377
United States
Message 1383023 - Posted: 20 Jun 2013, 13:42:38 UTC

This would be the same as the issue where a CPU task is processed & uploaded. The wingmate is an Nvidia GPU that trashes the workunit, recording 30 spikes and flagging it with a -9 overflow. Then it gets sent to a 3rd host, also on an Nvidia GPU, which proceeds to do the same thing. So the two Nvidia results match up, and the one good CPU result is flagged as invalid.

When this was first noticed, a few years ago iirc, there was a suggestion that something be implemented so specific hardware/software would get flagged and the task sent to something different. So that valid science data could be collected instead of tossed into the bin.
However that would add a lot of complexity to the server backend. Which is already rather complex.
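
To make the failure mode concrete, here is a minimal Python sketch of a quorum-of-two validator. It is an illustration only and not the actual BOINC validator code; the function names, the platform labels, and the string "outputs" are all made up for the example.

```python
from collections import Counter

def pick_canonical(results, quorum=2):
    """Return the first output that `quorum` results agree on, or None."""
    counts = Counter()
    for _platform, output in results:
        counts[output] += 1
        if counts[output] >= quorum:
            return output
    return None

def verdicts(results, quorum=2):
    """Mark each result valid/invalid against the canonical output."""
    canonical = pick_canonical(results, quorum)
    if canonical is None:
        return ["pending"] * len(results)
    return ["valid" if output == canonical else "invalid"
            for _platform, output in results]

def distinct_platforms_agree(results, canonical):
    """Hypothetical extra check: did at least two different platforms agree?"""
    agreeing = {platform for platform, output in results if output == canonical}
    return len(agreeing) >= 2

# Illustrative data: one good CPU result, two identical bad GPU results.
results = [
    ("cpu", "real signals"),
    ("nvidia", "overflow: 30 spikes"),
    ("nvidia", "overflow: 30 spikes"),
]

print(verdicts(results))
print(distinct_platforms_agree(results, pick_canonical(results)))
```

With only agreement counted, the two matching bad results win and the good CPU result loses. A check like the hypothetical `distinct_platforms_agree` is the kind of extra bookkeeping that would add complexity on the server side.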
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Juha
Volunteer tester
Joined: 7 Mar 04
Posts: 176
Credit: 139,103
RAC: 3
Finland
Message 1383041 - Posted: 20 Jun 2013, 14:40:20 UTC

It might be worth considering using the reliable hosts mechanism.

Even though the advertising says it's for accelerating retries, that doesn't mean it can only be used for that. Setting the avg turnaround time to something high and the delay bound multiplier to 1.0 wouldn't exclude any good hosts from getting work, but it would prevent bad hosts from trashing workunits.

I don't think it would increase server load much (no promises!), so the only question is whether we have enough reliable hosts.
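
A rough Python sketch of the idea, under stated assumptions: it assumes a host counts as "reliable" when its average turnaround is under a limit and its error rate is low. The `Host` fields, the thresholds, and the function names are hypothetical and are not taken from the BOINC scheduler.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    avg_turnaround_days: float
    error_rate: float  # fraction of recent tasks that errored (assumed metric)

def is_reliable(host, max_avg_turnaround_days, max_error_rate):
    """Assumed rule: quick enough on average and rarely errors."""
    return (host.avg_turnaround_days <= max_avg_turnaround_days
            and host.error_rate <= max_error_rate)

def deadline_days(normal_delay_bound_days, multiplier):
    """A multiplier of 1.0 leaves the normal deadline unchanged."""
    return normal_delay_bound_days * multiplier

hosts = [
    Host("fast_good", 0.5, 0.01),
    Host("slow_good", 20.0, 0.02),
    Host("fast_bad", 0.2, 0.95),
]

# A generous turnaround limit keeps slow-but-good hosts eligible,
# so only the error rate ends up filtering anything out.
eligible = [h.name for h in hosts
            if is_reliable(h, max_avg_turnaround_days=25.0, max_error_rate=0.05)]
print(eligible)
print(deadline_days(25.0, 1.0))
```

The point of the loose turnaround limit is that speed stops mattering and only the error history decides who gets the work.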

msattler
Project donor
Volunteer tester
Joined: 9 Jul 00
Posts: 38667
Credit: 572,449,909
RAC: 587,130
United States
Message 1383125 - Posted: 20 Jun 2013, 18:10:13 UTC
Last modified: 20 Jun 2013, 18:15:26 UTC

I just contacted Eric and he says that the cal_ati app has been deprecated and is not currently active or being distributed.

Which means the hosts that have been using it will crunch up whatever work they have cached, but the servers will no longer send any new work for that application.

I assume that it may be brought back after bugfix and further testing, but Eric did not specifically say that.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3988
Credit: 109,731,284
RAC: 132,377
United States
Message 1383153 - Posted: 20 Jun 2013, 19:00:59 UTC - in response to Message 1383125.

> I just contacted Eric and he says that the cal_ati app has been deprecated and is not currently active or being distributed.
>
> Which means the hosts that have been using it will crunch up whatever work they have cached, but the servers will no longer send any new work for that application.
>
> I assume that it may be brought back after bugfix and further testing, but Eric did not specifically say that.

He seemed to be rather frustrated with driver version detection issues in BOINC over on beta. So it could be a bit before we see this all released again. If it was due to that kind of issue anyway.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1383155 - Posted: 20 Jun 2013, 19:06:08 UTC
Last modified: 20 Jun 2013, 19:09:58 UTC

Encountering a similar problem: so far 2 tasks completed but can't validate.

This host had processed thousands of valid AP WUs, and for a moment I thought something was wrong with it.

Affected WUs:
1266433106
1266480341

Run a script to sweep through and resubmit affected WUs to a different platform?

EDIT: Specifically not to (cal_ati) and (ati_opencl_100), as both platforms are encountering computing errors.
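
Something along these lines, as a sketch only. The task rows are a subset of the ones TBar posted earlier in the thread; the dictionary layout, the function names, and the "cpu" and "nvidia_gpu" labels are made up for illustration and do not correspond to any real server-side script.

```python
def failing_plan_classes(tasks):
    """Collect every plan class that has at least one errored task."""
    return {t["plan_class"] for t in tasks
            if t["status"] == "Error while computing"}

def resend_targets(tasks, available_plan_classes):
    """Keep only the plan classes that have not errored on this workunit."""
    failed = failing_plan_classes(tasks)
    return [pc for pc in available_plan_classes if pc not in failed]

# Subset of the task list for ap_01mr09ad_B1_P1_00224_20130619_24332.wu
tasks = [
    {"task": 3044311467, "plan_class": "cal_ati",
     "status": "Error while computing"},
    {"task": 3044363043, "plan_class": "ati_opencl_100",
     "status": "Error while computing"},
    {"task": 3044311468, "plan_class": "anonymous (ATI GPU)",
     "status": "In progress"},
]

# Hypothetical labels for the platforms a resend could go to.
available = ["cal_ati", "ati_opencl_100", "cpu", "nvidia_gpu"]

print(resend_targets(tasks, available))
```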
____________
terencewee*
Sic itur ad astra.

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4058
Credit: 32,791,372
RAC: 4,922
United Kingdom
Message 1383186 - Posted: 20 Jun 2013, 20:34:47 UTC - in response to Message 1383155.

> Encountering a similar problem: so far 2 tasks completed but can't validate.
>
> This host had processed thousands of valid AP WUs, and for a moment I thought something was wrong with it.
>
> Affected WUs:
> 1266433106
> 1266480341
>
> Run a script to sweep through and resubmit affected WUs to a different platform?
>
> EDIT: Specifically not to (cal_ati) and (ati_opencl_100), as both platforms are encountering computing errors.

The problem with the (ati_opencl_100) plan_class (for BOINC 6 hosts) is that the app is going out to hosts with really old CAL drivers, from a time when OpenCL support was never included:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5421155

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5798321

The two hosts listed above are running Cat 10.5 (CAL 1.4.636) and Cat 9.7 (CAL 1.4.344). They need at least Cat 11.1 (CAL 1.4.900) for OpenCL support to be included.
But since you can't tell that apart from Cat 10.12, where OpenCL support was only available with the APP edition and not the normal edition,
the minimum needs to be Cat 11.2 (CAL 1.4.1016), and possibly later than that.
(And even that doesn't guarantee that it'll work on every host, since at the time you could download the bare driver without Catalyst Control Centre
and without the OpenCL driver, the OpenCL driver being a smallish additional download.)
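
As a sketch of what such a minimum-version check could look like in Python: the function names are made up for the example, and the cutoff is simply the CAL 1.4.1016 figure given above, not a value taken from the project's actual plan_class configuration.

```python
def parse_cal_version(text):
    """Turn '1.4.1016' into (1, 4, 1016) so parts compare as numbers."""
    return tuple(int(part) for part in text.split("."))

# Assumed cutoff, taken from the Cat 11.2 figure above.
MIN_CAL_FOR_OPENCL = parse_cal_version("1.4.1016")

def may_support_opencl(cal_version):
    """True if the CAL version is at or above the assumed minimum.

    Passing this check does not prove the OpenCL driver is installed.
    """
    return parse_cal_version(cal_version) >= MIN_CAL_FOR_OPENCL

for version in ["1.4.344", "1.4.636", "1.4.900", "1.4.1016"]:
    print(version, may_support_opencl(version))
```

Comparing the parts as integers matters here: compared as plain strings, "1.4.900" would sort after "1.4.1016" and the old drivers would slip through.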

Claggy

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3386
Credit: 46,223,509
RAC: 7,804
Russia
Message 1383211 - Posted: 20 Jun 2013, 21:31:12 UTC - in response to Message 1382997.
Last modified: 20 Jun 2013, 21:33:14 UTC

> Oh dear, that looks like the new Brook app is having problems. One for Raistmer.
>
> I fear that changing the scheduler so that it spreads problematic units across different platforms requires a fair bit of coding on David's part. Not something easily set in motion.

http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=2031&postid=46399
http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=2031&postid=46400
____________


Copyright © 2014 University of California