Message boards :
Number crunching :
Strange result, how is this possible?
| Author | Message |
|---|---|
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
How exactly would the server know that the repairs have been made? You mean artificial intelligence? You could also do it manually: run the script, detect the bad hosts, copy/paste the host IDs into another script -> no new WUs. Next time (maybe once a day), check only the most recently detected bad hosts (then the whole server's resources wouldn't be needed - if ever). And so on.. (this wouldn't need much human time). These are only the considerations of a non-IT human. ;-) It's up to the project admins whether they want to cut off these bad (Fermi + CUDA_V12) hosts. I guess IT people could maybe find an easier way.
|
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
Edit - yes, they are discussing this problem on the SETI.Germany forums. Thank you.

Well, we had a problem with originally 19 hosts, now fewer.

> I don't understand why people install software and don't check that it works smoothly. Normally the BOINC Manager shows 'computation errors' in the tasks overview, doesn't it? (I guess people use the currently recommended V6.10.58, where the message overview is easy to find - compared to the current development versions (6.12.x), where it is a little bit hidden.)

These tasks are not errors. There's nothing in the message / event log to reveal the problem.

> There are people who never read the S@h forum, nor their team forum. They got the apps from somewhere, and took the trouble to install them somehow.

If we knew where that was, we might have a chance of the offenders reading there again. It seems unlikely that these particular users got their apps directly from Lunatics, more likely from some re-hosting site. V12 is no longer available from Lunatics - I hope everyone else has taken it down as well.

> Computers are smart - why not run a script or something on the project server, so that a PC which produces only calculation errors gets a message/warning and no new tasks until the problem is fixed?

Computers aren't smart. They do what they're told to do, and nothing else. It's the people who are smart - and smart people to program the *server* computers are in short supply round here.

> Isn't the <stderr_txt> of the app shown anywhere besides the host's results pages?

As above. Somebody needs to write the script first. And for a strictly limited number of faulty hosts, it's not worth wasting that much time.

> If this would need much/all of the project's server resources, shut the servers down from their normal work for one or two days and search for the buggy member PCs. Afterwards, no PC under that account ID gets new WUs. The members get messages in BOINC, by email (if checked in the prefs) and by PM - and a warning on the front page of the project site (even if it's 3rd-party software, the project has to deal with it).

Shut down the project for a couple of days, and devote that much staff time to cleaning up after a few selfish and ignorant people? I doubt it'll happen.

> Or, no new CUDA tasks to those particular host IDs.

Yes, that would be the best solution. The vigilance of the users here can act as the eyes and ears of the project staff. And when found, the hosts can - and should - be prevented from downloading new tasks. Permanently.

> If a project admin knows there is a PC that produces only errors because of a bad/wrong installation, why not flip a switch and stop sending it WUs? This would need time and manpower, but would save project resources (server/bandwidth) for well-running member PCs.

That, I agree with. But we also need a way to draw the attention of project staff to problems that may be below their radar. |
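The "bad host" script discussed above - scan for hosts that produce nothing but errors, then stop issuing them work - can be sketched. This is a toy sketch under assumptions: the `result` and `host` tables, the `outcome` column, and `max_results_day = 0` as the cut-off switch are illustrative stand-ins, not the real BOINC server schema.

```python
# Hypothetical sketch of the proposed server-side check: flag hosts whose
# recent results are (almost) all errors, and stop sending them work.
# Table and column names are invented for illustration.
import sqlite3

ERROR_RATE_THRESHOLD = 0.95   # flag hosts with >= 95% errored results
MIN_RESULTS = 50              # ignore hosts with too little history

def find_bad_hosts(conn):
    """Return host IDs whose results are almost all errors."""
    rows = conn.execute(
        """
        SELECT hostid,
               SUM(CASE WHEN outcome = 'error' THEN 1 ELSE 0 END) AS errors,
               COUNT(*) AS total
        FROM result
        GROUP BY hostid
        HAVING total >= ?
        """,
        (MIN_RESULTS,),
    ).fetchall()
    return [hostid for hostid, errors, total in rows
            if errors / total >= ERROR_RATE_THRESHOLD]

def quarantine(conn, hostids):
    """Mark hosts so the scheduler sends them no new work units."""
    conn.executemany(
        "UPDATE host SET max_results_day = 0 WHERE id = ?",
        [(h,) for h in hostids],
    )
```

Run once a day from cron, this would match the "check once per day" suggestion without tying up server resources.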
Miep Send message Joined: 23 Jul 99 Posts: 2412 Credit: 351,996 RAC: 0 |
Just curious.. the 'new credit system' is designed in such a complicated way that I don't think anybody has ever bothered to figure out whether it was implemented correctly and is actually working as 'designed'. You are very welcome to conduct a statistically sound test of the influence of GPU/CPU rescheduling on the amount of credit awarded and report the results back.

Carola
-------
I'm multilingual - I can misunderstand people in several languages! |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
In the past I wrote to maybe 10 members whose hosts were wingmen of mine and had a Fermi GPU with the CUDA_V12 app. I didn't write down which members they were. Do you think there are no more hosts out there with Fermi and the CUDA_V12 app?

I looked a little bit at a few of the wingmen of my E7600 + GTX260 (results which are marked "Fertig, Bestätigung nicht eindeutig" - 'Completed, validation inconclusive') and found a few strange PCs.

CUDA_V12 and Fermi:
2x GTX470: hostid=5472266 (Terry O'Rourke)
GTX460: hostid=1754767 (already mentioned here in this thread, Tauern-Apotheke Berlin)
GTX460: hostid=3099502 (anonymous)

[EDIT: I see now that these three hosts are already mentioned here. Maybe I'll look later at more of the results mentioned above ;-)]

I don't know.. maybe we could make a thread where members could report wingmen with CUDA_V12 and a Fermi GPU (using the technique mentioned above), and a single person writes a PM to them. But this works only if the host isn't anonymous. Or an admin acts, and then no new CUDA WUs go to that host? Yes, maybe someone with knowledge could make a script (if possible at all) to count all the Fermi hosts with the CUDA_V12 app. It would be interesting to know whether all this work would be worth it.

BTW, what could be the problems here? Stock MB 6.10 + Fermi and a lot of errors:
GTX580: hostid=5328832 (yyama@home)
2x GTX460: hostid=5525141 (Anatath)
GTX295 with x32f: hostid=5672562 (Pollux)
9600GT with stock MB 6.09: hostid=5456056 (Peter Csorgits)
9600GT with stock MB 6.09: hostid=4794619 (anonymous)

Maybe a thread for this kind of problem too? I run nVIDIA driver 190.38 and stock MB 6.09 on my machines, and AFAIK they have never had a wrong (error) -9 overflow result. I don't understand why this could happen to others. |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
I don't know if it would be possible. I see that even with correctly sent/calculated WUs, the granted credit for a 0.44 AR WU varies from 100 to 130 or so. It is not stable. And without a stable reference point, how could I compare?
|
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
|
I suppose the "fix" for this would be to do more checking on the validation side. Right now I am guessing that informational messages are not read during validation, so the process would have to spend more time looking at the results. If, during the check, it found the -9 message in only one of the two results, it would require another result without the message. However, this will not correct the issue of two GPUs with -9 validating against each other. Requiring -9 GPU tasks to be validated against a CPU would work server side, but not always, as tasks can be reassigned on the client and the server records them as the type they were sent out as.

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
|
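The check HAL9000 describes - demand an extra result whenever only one of the two results carries the -9 message - could look roughly like this. The function names and the marker string are assumptions for illustration, not the real SETI@home validator code.

```python
# Sketch of a -9 disagreement check on the validation side.
# OVERFLOW_MARKER is an assumed substring of the app's <stderr_txt>.
OVERFLOW_MARKER = "result_overflow"

def reports_overflow(stderr_txt: str) -> bool:
    """Did this result's stderr contain the -9 overflow marker?"""
    return OVERFLOW_MARKER in stderr_txt

def needs_tiebreaker(stderr_a: str, stderr_b: str) -> bool:
    """True when exactly one of the two results reports -9, i.e. the
    pair disagrees and a third result should be issued."""
    return reports_overflow(stderr_a) != reports_overflow(stderr_b)
```

As HAL9000 notes, this catches a lone bad -9 but does nothing when both GPUs report -9 and agree.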
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
> Just curious..

> I don't know if it would be possible.

Well, you have a fast enough machine to do some original research, as Carola was suggesting. Here are two exercises I did on my slower machines.

First, a run I did at SETI Beta as the "New Credit" scheme was first being tested. This graph shows 1,150 VHAR 'shorties' (previous credit between 22.11 and 23.43 inclusive) and the credit granted during the experimental period of 19 May to 9 June 2010. (Direct link)

Second, a comparison of claimed (old-style) and granted ("new credit") credit for a mix of all ARs (980 tasks), recorded over the weekend of 28/29 August 2010. (Direct link)

Those are examples of the kind of evidence you can supply - it will take more of us collecting and submitting real facts before any notice is taken. |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
I looked through all the 'Pending'/'validation inconclusive' results of my E7600 + GTX260 machine and found one host not already mentioned.. GTX460 + CUDA_V12: hostid=5231806 (S@NL - NightHawk) |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
Thanks, but I'm not any smarter now. What I read from this is that shorties should get ~22 credits. I was thinking of an overview of a mix of different ARs and the credit granted, but AFAIK this isn't possible because the new credit system varies a lot (more factors are involved in the calculation). It's not like in the past, when a 0.44 AR WU got ~82-84 credits. I guess only the developer of the new credit system knows what will happen if I reschedule GPU/CUDA WUs to the CPU.
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
The solution to V12 on Fermi cards is "SETI@home v7". Assuming the project brings it here as a new application, all "SETI@home Enhanced" applications such as V12 will no longer be eligible to get any work. The project is losing far more work to radar RFI, bad "tape" files, etc. than the ~200 tasks per day being trashed by V12 on Fermi hosts. Modifications to the Validator for this purpose are unlikely, though there are going to be changes related to v7 anyhow, so if someone thinks up a practical modification, now would be a good time to propose it. I can't think of any, but will note that the validator code does check for "result_overflow" in the stderr and sets a flag, so queries to the SaH Science Database could select only overflowed results, or vice versa. My guess is that the "zone RFI" removal should take care of the corrupted assimilated results. Don't forget there was a huge incidence of those when users running stock started installing Fermi cards. For those who have produced good results and had them discarded, I can only offer sympathy. Murphy's Law strikes again; nobody has yet figured out a way to repeal it.

Joe |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0
|
> ... if someone thinks up a practical modification now would be a good time to propose it.

If two GPU hosts report "-9 result_overflow", mark the results "Completed, validation inconclusive" and send a third task to a CPU host for confirmation. If the CPU says "it is not an overflow", set/raise that CPU result's "vote_weight" in the quorum to e.g. 1.5 or 2 (so this one result votes twice) and send one more (hopefully last) task. (This would also address hot/overclocked GPUs and software flaws.)

- ALF - "Find out what you don't do well ..... then don't do it!" :) |
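BilBg's weighted-quorum idea can be sketched as follows. Everything here is invented for illustration (the `Result` structure, the weight values, the settle rule); the real BOINC quorum logic works differently.

```python
# Toy model of a weighted quorum: two GPU -9 results are inconclusive
# until a CPU host weighs in, and a dissenting CPU result votes with
# extra weight.
from dataclasses import dataclass

QUORUM_WEIGHT = 2.0   # total vote weight needed to settle a work unit

@dataclass
class Result:
    host_kind: str      # "gpu" or "cpu"
    overflow: bool      # did this result report "-9 result_overflow"?
    weight: float = 1.0

def settle(results):
    """Return 'overflow', 'ok', or None (meaning: issue another task)."""
    gpu_overflows = [r for r in results
                     if r.host_kind == "gpu" and r.overflow]
    cpu_votes = [r for r in results if r.host_kind == "cpu"]

    # Two GPU hosts agreeing on -9 is treated as inconclusive:
    # demand a CPU result before anything validates.
    if len(gpu_overflows) >= 2 and not cpu_votes:
        return None

    # A CPU result contradicting two GPU overflows votes with extra
    # weight (BilBg suggested 1.5-2).
    for r in cpu_votes:
        if not r.overflow and len(gpu_overflows) >= 2:
            r.weight = 2.0

    yes = sum(r.weight for r in results if r.overflow)
    no = sum(r.weight for r in results if not r.overflow)
    if yes >= QUORUM_WEIGHT and yes > no:
        return "overflow"
    if no >= QUORUM_WEIGHT and no > yes:
        return "ok"
    return None
```

With two GPU -9 results plus one dissenting CPU result, the weights tie at 2-2, so one more task goes out, exactly the "send one more (hopefully last) task" step above.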
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0
|
Here is a host with a 'mix' of nVIDIA 200- and 400-series cards, producing over 700 errors between 3 and 6 March, but later/newer results are valid. I don't think it's a good idea to use these completely different cards in one host. The errors do point to (one of) the graphics cards. Later results, validated today, 10 March 2011, are OK! (But for how long?) Strangely, other results from 2 March, reported today, are also OK?! Maybe cards were added in between?
|
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
You can use mixed cards in a single host (provided you check that the driver has installed properly for each card - so re-install the driver every time you add a new card, to be on the safe side). What would be a bad idea is using fancy tricks that only apply to one set of hardware - e.g. the 400 (Fermi) series cards have extra hardware to cope with multiple tasks running at once, and the older 200-series cards don't. So stick to one task at a time per card in mixed hosts like that. |
Erdmann Send message Joined: 10 Jan 11 Posts: 12 Credit: 719,494 RAC: 3
|
I agree with the system not sending WUs to hosts that either can't crunch correctly or abandon all the WUs sent. I seem to have a couple of wingmen that, for some reason, haven't completed a WU and gotten credit, due to erroring out on 100% of the files or abandoning them. I just saw one machine that has over 6,000 WUs, and I stopped counting after 300 showed as abandoned. Every machine will burp at times, but these crunchers aren't doing SETI, or themselves, any service by staying connected. |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0
|
Uhhhh, I believe those marked abandoned are from him doing a detach/attach, so maybe he is working on the problem. Did you check to see if he has any posts here or on the Q&A board asking for help?

PROUD MEMBER OF Team Starfire World BOINC |
Bernie Vine Send message Joined: 26 May 99 Posts: 9960 Credit: 103,452,613 RAC: 328
|
> I agree with the system not sending WUs to hosts that either can't crunch correctly or abandon all the WUs sent. I seem to have a couple of wingmen that, for some reason, haven't completed a WU and gotten credit, due to erroring out on 100% of the files or abandoning them. I just saw one machine that has over 6,000 WUs, and I stopped counting after 300 showed as abandoned.

This is the reason I am no longer crunching for SETI. I really don't believe incorrect results should be entered into the database; it makes all my work worthless. |
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.