Strange result, how is this possible?

Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1084957 - Posted: 8 Mar 2011, 11:11:53 UTC - in response to Message 1084948.  
Last modified: 8 Mar 2011, 11:13:59 UTC

How exactly would the server know that the repairs are made?

Either automatically: at 1 WU/day, the server reads the <stderr_txt> again to check whether CUDA_V12 is still there. Still there: no more WUs per day. CUDA_V12 gone: the WU quota is restored.
Or manually: the member gets a message (PM and e-mail, in several languages) and an 'activation URL' he must click once he has updated his machine(s). A short time later the project server looks at the <stderr_txt> again to see whether it's true.

Ok, are you going to code the sophisticated AI needed to 'automatically' discover bad hosts, check their app status and cut them down until something changes?

You mean artificial intelligence?

You could also do it manually.
Run the script, detect the bad hosts, copy and paste the host IDs into another script -> no new WUs. Next time (maybe once a day) only re-check the previously detected bad hosts (so the full server resources wouldn't be needed, if they ever were). And so on.. (This wouldn't take much human time.)

These are only the thoughts of a non-IT human. ;-)
It's up to the project admins whether they want to cut off these bad (Fermi + CUDA_V12) hosts.
I guess IT people could probably find an easier way.
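
For illustration, a minimal sketch (in Python) of what such a script could look like - the input format and all names here are invented, I don't know the real server internals:

[code]
import re

# Fermi GPUs (GTX 4xx / 5xx) must not run the old CUDA_V12 app.
FERMI_RE = re.compile(r"GTX\s*[45]\d\d")
V12_RE = re.compile(r"CUDA[_ ]?V12", re.IGNORECASE)

def find_bad_hosts(recent_results):
    """recent_results: iterable of (host_id, stderr_txt) pairs - a
    stand-in for however the server would really fetch them."""
    bad = set()
    for host_id, stderr in recent_results:
        if FERMI_RE.search(stderr) and V12_RE.search(stderr):
            bad.add(host_id)
    return bad

# A second script (or an admin) would then set the daily WU quota of
# these hosts to zero, and restore it once a later result's
# <stderr_txt> no longer shows CUDA_V12.
[/code]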
ID: 1084957
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1084958 - Posted: 8 Mar 2011, 11:23:49 UTC - in response to Message 1084858.  

Edit - yes, they are discussing this problem on the SETI.Germany forums. Thank you.

I already wrote about the Fermi + CUDA_V12 problem in the S.G forum back in July 2010.

S.G : S@h subforum : SETI@home auf einer GTX4xx/5xx GPU

Discussion maybe, but the message still doesn't seem to have got through.

From the previous list, the SETI.Germany team members are:

1754767 Tauern- Apotheke Berlin
5177668 Orgasmann
5231715 Christian Buckatz
5396192 micha123
5444296 Vigilante
5701024 Frank-SETI-Reit

Of those, hosts 5396192 and 5444296 seem to have been taken off-line; the other four have all returned -9 results from the V12 app within the last 24 hours. Not one seems to be successfully crunching SETI with their expensive GPUs.


We don't need to talk about who or what was/is to blame for people still using the CUDA_V12 app with their GTX4xx-5xx GPUs. The damage is already done. The problem we have now is that the CUDA_V12 app is still in use on Fermi GPUs.

Well, we had a problem with originally 19 hosts, now fewer.

I don't understand people who install software and don't check whether it works smoothly. Normally the BOINC Manager shows 'computation error' in the tasks overview, doesn't it? (I assume people use the currently recommended V6.10.58, where the message overview is easy to find - compared to the current development versions (6.12.x), where the message overview is a little bit hidden.)

These tasks are not errors. There's nothing in the message / event log to reveal the problem.

There are people who never read the S@h forum, nor the team forum.
We can see now that we can't reach those members through the forum. If we send them a PM, there's no guarantee the message will be received or read.
How can we reach them?

They got the apps from somewhere, and took the trouble to install them somehow. If we knew where that was, we might have a chance of the offenders reading there again. It seems unlikely that these particular users got their apps directly from Lunatics, more likely from some re-hosting site. V12 is no longer available from Lunatics - I hope everyone else has taken it down as well.

Computers are smart - why not run a script or something on the project server, so that if a PC produces only computation errors, it gets a message/warning and no more tasks until the problem is fixed?

Computers aren't smart. They do what they're told to do, and nothing else.

It's the people who are smart - and smart people to program the *server* computers are in short supply round here.

The <stderr_txt> of the app isn't only shown in the host's task overview, is it?
The stock and optimised CUDA apps report the GPU series; BOINC does too.
If the script sees GTX4xx-5xx and CUDA_V12 -> no new tasks.

As above. Need someone to write the script first. And for a strictly limited number of faulty hosts, it's not worth wasting that much time.

If this would take a lot of (or all) the project server resources, shut down the project servers' normal/usual work for one or two days and search for the buggy member PCs. Afterwards, no PC under those account IDs gets new WUs. The members get messages in BOINC, by e-mail (if checked in the prefs) and by PM - and a warning on the front page of the project site (even though it's 3rd-party software, the project has to deal with it).

Shut down the project for a couple of days, and devote that much staff time to cleaning up after a few selfish and ignorant people? I doubt it'll happen.

Or: no new CUDA tasks for those specific host IDs.

Yes, that would be the best solution. The vigilance of the users here can act as the eyes and ears of the project staff. And when found, the hosts can - and should - be prevented from downloading new tasks. Permanently.

If a project admin knows there is a PC that produces only errors because of a bad/wrong installation, why not flip a switch and stop sending it WUs? This would need time and manpower, but it would save project resources (server/bandwidth) for well-running member PCs.

One way or another..

There must be a way/solution to reach the members and to switch off buggy PCs.

That, I agree with. But we also need a way to draw the attention of project staff to problems that may be below their radar.
ID: 1084958
Profile Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1084959 - Posted: 8 Mar 2011, 11:44:54 UTC - in response to Message 1084951.  

Just curious..

Which BOINC version is now the minimum needed to get proper credit for MB+AP WUs?
In the past, with the previous credit system, it was V5.2.6.

What will happen if I reschedule GPU/CUDA WUs to the CPU (if BOINC has downloaded too few and the project servers are down)? Will this influence the credit granted for these WUs (+ or -)?


The 'new credit system' is designed in such a complicated way that I don't think anybody has ever bothered to figure out whether it was implemented correctly and is actually working as 'designed'.

You are very welcome to conduct a statistically sound test of the influence of GPU/CPU rescheduling on the amount of credit awarded and report back your results.

Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1084959
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1085316 - Posted: 9 Mar 2011, 15:13:48 UTC - in response to Message 1084958.  
Last modified: 9 Mar 2011, 15:43:33 UTC

In the past I wrote to maybe 10 members whose hosts were wingmen of mine and had a Fermi GPU with the CUDA_V12 app. I didn't write down which members they were.


Do you think there are no more hosts out there with a Fermi GPU and the CUDA_V12 app?

I looked a little at a few of the wingmen of my E7600 + GTX260 (results marked as "Fertig, Bestätigung nicht eindeutig" - 'Completed, validation inconclusive') and found a few strange PCs.


CUDA_V12 and Fermi:

2x GTX470: hostid=5472266 (Terry O'Rourke)

GTX460: hostid=1754767 (already mentioned here in this thread, Tauern- Apotheke Berlin)

GTX460: hostid=3099502 (anonymous)

[EDIT: I see now that these three hosts have already been mentioned here.
Maybe I'll look at more of the above-mentioned results later ;-)]


I don't know.. maybe we could open a thread where members could report wingmen with the CUDA_V12 app and a Fermi GPU (found with the technique mentioned above), and one person writes a PM to them.
But this only works if the host isn't anonymous.

Or an admin acts, and then no new CUDA WUs go to those hosts?


Yes, maybe someone with the knowledge could write a script (if it's possible at all) to detect the number of Fermi hosts with the CUDA_V12 app.
It would be interesting to know whether all this work would be worth it.



BTW.

What could the problems be here?

Stock MB 6.10 + Fermi and a lot of errors:

GTX580: hostid=5328832 (yyama@home)

2x GTX460: hostid=5525141 (Anatath)


GTX295 with x32f: hostid=5672562 (Pollux)

9600GT with stock MB 6.09: hostid=5456056 (Peter Csorgits)

9600GT with stock MB 6.09: hostid=4794619 (anonymous)


Maybe a thread for these kinds of problems, too?



I run nVIDIA driver 190.38 and stock MB 6.09 on my machines, and AFAIK they have never produced a wrong (error) -9 overflow result.
I don't understand why this happens to others.
ID: 1085316
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1085321 - Posted: 9 Mar 2011, 15:32:48 UTC - in response to Message 1084959.  
Last modified: 9 Mar 2011, 15:33:45 UTC

I don't know if it would be possible.

With correctly sent/calculated WUs, I see the granted credit for a 0.44 AR WU vary from 100 to 130 or so.
This is not stable.
And without a stable reference point, how could I compare?
ID: 1085321
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1085324 - Posted: 9 Mar 2011, 15:39:56 UTC

I suppose the "fix" for this would be to do more checking on the validation side. Right now I am guessing that informational messages are not read during validation, so the process would have to spend more time looking at the results.

During the check, if it were to find the -9 message in only one of the two results, require another result without the message.

However, this will not correct the issue of two GPUs with -9 results validating against each other.

Requiring -9 GPU tasks to be validated against a CPU would work server-side, but not always, as tasks can be reassigned on the client and the server records them as the type they were sent out as.
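
Roughly, in Python pseudocode (the names are invented - the real validator is C++ and its internals may well differ):

[code]
OVERFLOW_MARK = "-9 result_overflow"

def check_pair(stderr_a, stderr_b):
    """Extra step before the normal signal comparison of two results."""
    a = OVERFLOW_MARK in stderr_a
    b = OVERFLOW_MARK in stderr_b
    if a != b:
        # Only one of the two claims an overflow: hold the quorum open
        # and ask the scheduler for a third result without the message.
        return "inconclusive"
    return "compare_as_usual"
[/code]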
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1085324
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1085340 - Posted: 9 Mar 2011, 16:14:15 UTC - in response to Message 1085321.  

Just curious..

Which BOINC version is now the minimum needed to get proper credit for MB+AP WUs?
In the past, with the previous credit system, it was V5.2.6.

What will happen if I reschedule GPU/CUDA WUs to the CPU (if BOINC has downloaded too few and the project servers are down)? Will this influence the credit granted for these WUs (+ or -)?

I don't know if it would be possible.

With correctly sent/calculated WUs, I see the granted credit for a 0.44 AR WU vary from 100 to 130 or so.
This is not stable.
And without a stable reference point, how could I compare?

Well, you have a fast enough machine to do some original research, as Carola was suggesting. Here are two exercises I did on my slower machines.

First is a run I did at SETI Beta as the "New Credit" scheme was first being tested. This graph shows 1,150 VHAR 'shorties' (previous credit between 22.11 and 23.43 inclusive) and the credit granted during the experimental period of 19 May to 9 June 2010.


[Graph: credit granted for 1,150 VHAR shorties, 19 May - 9 June 2010]

Second, a comparison of claimed (old-style) and granted ("new credit") credit for a mix of all ARs (980 tasks), recorded over the weekend of 28/29 August 2010.


[Graph: claimed vs. granted credit, all ARs, 28/29 August 2010]

Those are examples of the kind of evidence you can supply - it will take more of us collecting and submitting real facts before any notice is taken.
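
If you want to try it, here is a small Python sketch of the bookkeeping such a test needs - the input format is my invention, e.g. records copied by hand from the task pages:

[code]
from collections import defaultdict
from statistics import mean, stdev

def summarize(records):
    """records: iterable of (ar, device, credit) tuples, where device
    is 'cpu' or 'gpu'. Prints the credit spread per AR bin and device."""
    groups = defaultdict(list)
    for ar, device, credit in records:
        groups[(round(ar, 2), device)].append(credit)
    for (ar, device), credits in sorted(groups.items()):
        spread = stdev(credits) if len(credits) > 1 else 0.0
        print(f"AR {ar:.2f} {device}: n={len(credits)} "
              f"mean={mean(credits):.1f} sd={spread:.1f}")
[/code]

Feed it the same ARs crunched on CPU and on GPU and compare the means per bin; with enough samples, a consistent +/- shift from rescheduling would show up.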
ID: 1085340
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1085378 - Posted: 9 Mar 2011, 20:03:13 UTC - in response to Message 1085316.  
Last modified: 9 Mar 2011, 20:26:44 UTC

I looked through all the 'Pending'/'Completed, validation inconclusive' results of my E7600 + GTX260 machine and found one host not already mentioned..

GTX460 + CUDA_V12: hostid=5231806 (S@NL - NightHawk)
ID: 1085378
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1085388 - Posted: 9 Mar 2011, 20:24:37 UTC - in response to Message 1085340.  
Last modified: 9 Mar 2011, 20:44:22 UTC

Thanks, but I'm not any smarter now.

What I read out of this is that shorties should get ~22 credits.

I was thinking of an overview of a mix of different ARs and the credit granted, but AFAIK this isn't possible because the new credit system varies a lot (more factors are involved in the calculation). It's not like in the past, when a 0.44 AR WU got ~82-84 credits.

I guess only the developer of the new credit system knows the answer to what will happen if I reschedule GPU/CUDA WUs to the CPU.
ID: 1085388
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1085464 - Posted: 9 Mar 2011, 23:06:42 UTC

The solution to V12 on Fermi cards is "SETI@home v7". Assuming the project brings it here as a new application, all "SETI@home Enhanced" applications such as V12 will not be eligible to get any work.

The project is losing far more work to radar RFI, bad "tape" files, etc. than the ~200 results per day being trashed by V12 on Fermi hosts. Modifications to the Validator for this purpose are unlikely, though there are going to be changes related to v7 anyhow, so if someone thinks up a practical modification, now would be a good time to propose it. I can't think of any, but will note that the validator code does check for "result_overflow" in the stderr and sets a flag, so queries to the SaH Science Database could select only overflowed results, or vice versa.
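
For illustration only, that flag works along these lines (sketched in Python; the actual validator is C++, and the field names here are my invention, not the real schema):

[code]
def set_overflow_flag(result):
    """result has .stderr_txt and a writable .overflow_flag column."""
    result.overflow_flag = "result_overflow" in result.stderr_txt

def overflowed_only(results):
    # Science-database queries can then select (or exclude) overflows.
    return [r for r in results if r.overflow_flag]
[/code]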

My guess is that the "zone RFI" removal should take care of the corrupted assimilated results. Don't forget there was a huge incidence of those when users running stock started installing Fermi cards.

For those who have produced good results and had them discarded I can only offer sympathy. Murphy's Law strikes again, nobody has yet figured out a way to repeal it.
                                                                Joe
ID: 1085464
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1085485 - Posted: 9 Mar 2011, 23:58:40 UTC - in response to Message 1085464.  

... if someone thinks up a practical modification now would be a good time to propose it.


If two GPU hosts report "-9 result_overflow", mark the results "Completed, validation inconclusive"
and send a third task to a CPU host for confirmation.

If the CPU says "it is not an overflow", set/raise this CPU result's "vote weight" in the quorum to e.g. 1.5 or 2 (so this one result votes twice)
and send one more (hopefully last) task.

(This would also address hot/overclocked GPUs and software flaws.)
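
Sketched in Python (all names invented; the real validator is C++ and has no per-result vote weight today):

[code]
def decide(results):
    """results: list of (device, is_overflow) pairs for one workunit."""
    gpu_over = sum(1 for dev, over in results if dev == "gpu" and over)
    cpu = [(dev, over) for dev, over in results if dev == "cpu"]

    if gpu_over >= 2 and not cpu:
        # Two GPUs agreeing on -9 never validate on their own: hold
        # the quorum open and send a third copy to a CPU host.
        return "inconclusive - send task to a CPU host"
    if cpu and not cpu[0][1] and gpu_over:
        # The CPU contradicts the overflow: give its result vote
        # weight 2, so one more matching result settles the workunit.
        return "raise CPU vote weight to 2 - send one more task"
    return "normal validation"
[/code]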


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1085485
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1085666 - Posted: 10 Mar 2011, 13:50:39 UTC - in response to Message 1085485.  
Last modified: 10 Mar 2011, 14:03:44 UTC

Here is a host with a 'mix' of nVIDIA 200- and 400-series cards, producing over 700 errors between 3 and 6 March, but later/newer results are valid.
I don't think it's a good idea to use these completely different cards in one host.
The errors do point to (one of) the graphics cards.

Later results, validated today, 10 March 2011, are OK! (But for how long?)
Strangely, other results from 2 March, reported today, are also OK?!

Maybe cards were added in between?
ID: 1085666
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1085683 - Posted: 10 Mar 2011, 14:17:15 UTC - in response to Message 1085666.  

You can use mixed cards in a single host (provided you check that the driver has been installed properly for each card - so re-install the driver every time you add a new card, to be on the safe side).

What would be a bad idea is using fancy tricks that only apply to one set of hardware - e.g. the 400-series (Fermi) cards have extra hardware to cope with multiple tasks running at once, and the older 200-series cards don't. So stick to one task at a time per card in mixed hosts like that.
ID: 1085683
Erdmann

Joined: 10 Jan 11
Posts: 12
Credit: 719,494
RAC: 3
United States
Message 1087857 - Posted: 17 Mar 2011, 20:22:28 UTC - in response to Message 1084958.  
Last modified: 17 Mar 2011, 20:23:03 UTC

I agree with the system not sending WUs to hosts that either can't crunch correctly or abandon all the WUs sent. I seem to have a couple of wingmen that, for some reason, haven't completed a WU and gotten credit, due to erroring out on 100% of the files or abandoning them. I just saw one machine that has over 6000 WUs, and I stopped counting after 300 showed abandoned.

Every machine will burp at times, but these crunchers aren't doing SETI, or themselves, any service by staying connected.
ID: 1087857
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1088154 - Posted: 18 Mar 2011, 18:40:22 UTC - in response to Message 1087857.  

Uhhhh, I believe those marked abandoned are from him doing a detach/attach, so maybe he is working on the problem. Did you check to see if he has any posts here or on the Q&A board asking for help?


PROUD MEMBER OF Team Starfire World BOINC
ID: 1088154
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9960
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1088238 - Posted: 18 Mar 2011, 22:34:18 UTC - in response to Message 1087857.  

I agree with the system not sending WUs to hosts that either can't crunch correctly or abandon all the WUs sent. I seem to have a couple of wingmen that, for some reason, haven't completed a WU and gotten credit, due to erroring out on 100% of the files or abandoning them. I just saw one machine that has over 6000 WUs, and I stopped counting after 300 showed abandoned.

Every machine will burp at times, but these crunchers aren't doing SETI, or themselves, any service by staying connected.


This is the reason I am no longer crunching for SETI. I really don't believe incorrect results should be entered into the database; it makes all my work worthless.
ID: 1088238