Panic Mode On (97) Server Problems?

kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688843 - Posted: 7 Jun 2015, 7:50:08 UTC - in response to Message 1688840.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688843
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1688858 - Posted: 7 Jun 2015, 8:23:10 UTC
Last modified: 7 Jun 2015, 8:23:33 UTC

Oh, I got 1 AP.

Nice.


With each crime and every kindness we birth our future.
ID: 1688858
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1688865 - Posted: 7 Jun 2015, 8:40:45 UTC - in response to Message 1688843.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??

Thinking about it, maybe a week and a half was a bit generous. The servers have definitely been keeping the ready-to-send buffer full for the last 3 or so days. I forgot about when we almost ran out of work.
ID: 1688865
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688867 - Posted: 7 Jun 2015, 9:01:19 UTC - in response to Message 1688865.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??

Thinking about it, maybe a week and a half was a bit generous. The servers have definitely been keeping the ready-to-send buffer full for the last 3 or so days. I forgot about when we almost ran out of work.

The GPUs got a bit shy, but with 9 rigs going, the CPU buffer is always doing pretty well. I can run the GPUs dry and still have days' worth of work on the CPUs.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688867
Profile Cactus Bob
Joined: 19 May 99
Posts: 209
Credit: 10,924,287
RAC: 29
Canada
Message 1688869 - Posted: 7 Jun 2015, 9:10:21 UTC

Well, I got 14 of the AP puppies. The most I have ever had. Mind you, I just returned a couple of months ago after being AWOL for several years.

This is on 1 machine, so 157 on 9 machines seems close. I haven't tweaked anything to get more APs, but maybe it would be worth a shot.

May the AP force be with you

Bob
------------------------
Sig files are overrated, maybe
Sometimes I wonder, what happened to all the people I gave directions to?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SETI@home classic workunits 4,321
SETI@home classic CPU time 22,169 hours
ID: 1688869
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688872 - Posted: 7 Jun 2015, 9:33:26 UTC

My APs are buried in cache.

But, I'll turn them in sooner than most.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688872
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1688901 - Posted: 7 Jun 2015, 12:59:06 UTC

I managed to snag 80 APs before they ran out. I disagree that workload distribution doesn't belong in this thread. It should be the function of the scheduling server to distribute work efficiently. If that is not occurring, for whatever reason, it's germane to this thread.

Perhaps a re-think of the scheduling program to take into account the number of errors produced by a user and apportion the work units accordingly, e.g. 0 errors = full distribution, a 25% error rate = a 25% reduction in WUs delivered to those machines, etc.

Just an idea.
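To make that concrete, here is a minimal sketch of the apportioning rule (purely hypothetical; the function and the numbers are invented for illustration, and nothing like this exists in the current scheduler):

```python
# Hypothetical sketch of the error-rate apportioning idea above.
# The function name and all numbers are illustrative only.

def apportion(requested_wus, errored, completed):
    """Scale a host's allocation down by its recent error rate."""
    total = errored + completed
    if total == 0:
        return requested_wus              # no history yet: full distribution
    error_rate = errored / total          # e.g. 0.25 for a 25% error rate
    return max(int(requested_wus * (1 - error_rate)), 1)

# A host asking for 100 WUs with a 25% error rate would get 75.
print(apportion(100, errored=25, completed=75))   # -> 75
```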

"Sour Grapes make a bitter Whine." <(0)>
ID: 1688901
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1688904 - Posted: 7 Jun 2015, 13:19:53 UTC
Last modified: 7 Jun 2015, 13:26:12 UTC

What I think: if you get an error, you should get a test file (with known results), and until you return a valid result for the "test" you don't get more work, you just get "test" files.

Send in a valid test WU, then you get work.

Basically: prove to me you can provide scientific results. I don't want crap!
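A sketch of how such a gate might behave (hypothetical only; BOINC has no such mechanism, and the host fields below are invented):

```python
# Hypothetical sketch of the "prove yourself with a test WU" gate above.
# BOINC has no such mechanism; the host dictionary fields are invented.

def next_task(host):
    """Hand out only known-result test work until the host validates one."""
    return "test_workunit" if host["needs_test"] else "real_workunit"

def on_result(host, task, valid):
    if task == "real_workunit" and not valid:
        host["needs_test"] = True         # an error triggers the test gate
    elif task == "test_workunit" and valid:
        host["needs_test"] = False        # a valid test re-opens real work

host = {"needs_test": False}
on_result(host, "real_workunit", valid=False)
print(next_task(host))                    # -> test_workunit
on_result(host, "test_workunit", valid=True)
print(next_task(host))                    # -> real_workunit
```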
ID: 1688904
wiesel111
Joined: 5 Jan 08
Posts: 9
Credit: 1,227,675
RAC: 1
Germany
Message 1688918 - Posted: 7 Jun 2015, 14:22:41 UTC - in response to Message 1688144.  
Last modified: 7 Jun 2015, 14:25:53 UTC

Wow, I got an AP from computer 7567951. I did some checking and found computers 7567951, 7567912, 7568646, 7568596 and 7567941. Their computer information is identical right down to the floating point and integer speeds. Hmmm, interesting.


Add 2 more: 7568519 and 7568642.


Add 8 more computers:

7568512, 7567924, 7568508, 7568640, 7568643, 7568513, 7568516, 7568501

So 15 identical computers so far...



Here are some more: 7568511, 7568499, 7567886, 7568508, 7567913

Now we have 20 identical...


Now 21, adding 7568637.

Here is a small update on the identical computers. One was listed twice (7568508), but I found some more: 7572890, 7567931, 7568597, 7568503, 7569519, 7572867.


After today's new APs I found 8 more (7567911, 7567922, 7567938, 7568510, 7568518, 7568644, 7569479, 7572925).

All 35 identical computers, in chronological order of creation:

7567886, 7567911, 7567912, 7567913, 7567922, 7567924, 7567931, 7567938, 7567941, 7567951, 7568499, 7568501, 7568503, 7568504, 7568508, 7568510, 7568511, 7568512, 7568513, 7568516, 7568518, 7568519, 7568596, 7568597, 7568637, 7568640, 7568642, 7568643, 7568644, 7568646, 7569479, 7569519, 7572867, 7572890, 7572925
ID: 1688918
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1688926 - Posted: 7 Jun 2015, 14:44:18 UTC - in response to Message 1688808.  

I don't know if this can be considered exploiting the system, but it's not really a new concept (I've known about it for several years, myself).

So with APs being split presently, I tried snagging as many as I could get. When asking for 2.6M seconds of work, I got 6 consecutive "no work available" replies. I changed the 10-day cache down to 1-day, and the next request resulted in getting a single AP. And so did the second, and third.

I figured now that I've got something, I could go back to 10-day and just let it continue filling, but that resulted in three consecutive "no work available" replies. Dropped back to 1-day, and got three more consecutive replies, each with one AP.

Once I got to a point where BOINC decided my 1-day cache was full, I changed it to two days, to keep the "requesting work for [x] seconds" value fairly low, and again, success.

I don't know what the rough cut-off value is, but I know that it is quite difficult to get work when asking for 2.6M seconds of work on an empty cache, unless there are 10+ tapes being split, but asking for smaller amounts of work seems to end up with a higher success-rate.

I'm not sure it has any effect. Looking over the logs from while I was asleep, I managed 92 new APs while they were going out, and I have a 10-day cache set for my venues.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[
ID: 1688926
Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 937
Credit: 8,157,029
RAC: 5
United States
Message 1688927 - Posted: 7 Jun 2015, 14:46:55 UTC

I got 10 wu_0 APs buried in 190 MB tasks. It's been so long since I've seen a wu_0 that I had to look several times to make sure.
I used to be a cruncher like you, then I took an arrow to the knee.
ID: 1688927
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689005 - Posted: 7 Jun 2015, 19:31:08 UTC - in response to Message 1688926.  

I'm not sure it has any effect. Looking over the logs while I was asleep. I managed 92 new AP while they were going out & I set a 10 day cache for my venues.

But did you already have something in your cache (MBs)? I had more success getting something on an empty cache by requesting a small amount of work at a time, rather than 2.6M seconds. I remember that when there were 10+ tapes available, it often took 10-20 "no work available" replies before I'd start to get some APs, but once they started coming in, the cache would fill moderately quickly.
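The rough arithmetic behind those request sizes looks something like this (a sketch only, with an example core count; the real figure comes from the client's cache-days setting and what is already buffered, not from a script):

```python
# Rough illustration of the "requesting work for N seconds" figures above.
# The core count and cache sizes are example values.

SECONDS_PER_DAY = 86400

def rough_request_seconds(cache_days, ncpus, buffered_seconds=0):
    """Very rough estimate of how much work the client asks for."""
    wanted = cache_days * SECONDS_PER_DAY * ncpus
    return max(wanted - buffered_seconds, 0)

# An empty cache on a 3-core host:
print(rough_request_seconds(10, ncpus=3))   # ~2.6M seconds, often refused outright
print(rough_request_seconds(1, ncpus=3))    # ~0.26M seconds, much easier to satisfy
```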



I managed to snag 80 APs before they ran out. I disagree that workload distribution doesn't belong in this thread. It should be the function of the scheduling server to distribute work efficiently. If that is not occurring, for whatever reason, it's germane to this thread.

Perhaps a re-think of the scheduling program to take into account the number of errors produced by a user and apportion the work units accordingly, e.g. 0 errors = full distribution, a 25% error rate = a 25% reduction in WUs delivered to those machines, etc.

Just an idea.

Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).

But that's a rant I've had a few times in another thread already. I'll stop now.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689005
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1689007 - Posted: 7 Jun 2015, 19:36:15 UTC - in response to Message 1689005.  


Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).


I'm pretty sure that some older incarnations of Boinc do not work with the limits in place on the servers at all.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1689007
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689039 - Posted: 7 Jun 2015, 23:18:43 UTC - in response to Message 1689007.  


Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).


I'm pretty sure that some older incarnations of Boinc do not work with the limits in place on the servers at all.

I think I've heard that. Old builds like 5.10.45 can get thousands of WUs presently, and I think it is due to the fact that those really old builds don't send a list of what they have back to the server during scheduler contacts. I think that started in the 6.x series. I know 6.2.19 sends its list during contact.

But other than that, I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689039
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1689042 - Posted: 8 Jun 2015, 0:12:29 UTC - in response to Message 1689039.  

... I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.

If that was a single core system, perhaps you're right. Otherwise, the "max tasks per day" is multiplied by the number of cores for CPU tasks, and for GPU is further multiplied by the project's "gpu_multiplier" setting.

I fully agree that the quota system is woefully inadequate for hosts which turn in a lot of results as successfully processed that are subsequently found to be invalid. I can remember at least two extended discussions on the boinc_dev mailing list where various arguments were presented concerning possible adjustments. Those were probably before GPU processing became a major factor. In any case, IMO it would take concern from multiple projects to convince Dr. Anderson a change should be made.
                                                                   Joe
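Reading that description literally, the arithmetic works out roughly as below (the base quota, core count and gpu_multiplier are example values, not the project's actual settings):

```python
# Rough illustration of the daily quota arithmetic described above.
# The base quota, core count and gpu_multiplier are example values.

def effective_daily_quota(max_tasks_per_day, ncpus, gpu_multiplier):
    cpu_quota = max_tasks_per_day * ncpus    # "multiplied by the number of cores"
    gpu_quota = cpu_quota * gpu_multiplier   # "further multiplied by gpu_multiplier"
    return cpu_quota, gpu_quota

# A "max tasks per day" in the mid-40s on a 4-core host:
cpu, gpu = effective_daily_quota(45, ncpus=4, gpu_multiplier=10)
print(cpu, gpu)   # 180 and 1800, which leaves plenty of headroom for ~70 APs
```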
ID: 1689042
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689046 - Posted: 8 Jun 2015, 1:08:38 UTC - in response to Message 1689042.  

... I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.

If that was a single core system, perhaps you're right. Otherwise, the "max tasks per day" is multiplied by the number of cores for CPU tasks, and for GPU is further multiplied by the project's "gpu_multiplier" setting.

I fully agree that the quota system is woefully inadequate for hosts which turn in a lot of results as successfully processed that are subsequently found to be invalid. I can remember at least two extended discussions on the boinc_dev mailing list where various arguments were presented concerning possible adjustments. Those were probably before GPU processing became a major factor. In any case, IMO it would take concern from multiple projects to convince Dr. Anderson a change should be made.
                                                                   Joe

Okay, yeah, that's kind of what I thought was going on in the other thread where I was discussing/ranting about this. The quotas should be per application, not per device/cores.

So that makes it even scarier for those runaway machines that spew out thousands upon thousands of -9 overflow tasks, because the quota will not go below 33, and with multiple GPUs times the gpu_multiplier it is entirely useless to even have a quota system in the first place.

Thanks for clearing that one up. I've been trying to wrap my head around the problem to try to suggest reasonable solutions, and I needed more information.

So those are my suggestions: make the quotas per application, not per device, and let them go down to 1. That won't fix the problem on runaway machines, but it will do some serious damage control and essentially keep them from being a problem.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689046
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1689453 - Posted: 9 Jun 2015, 11:35:07 UTC

As of 9 Jun 2015, 11:30:04 UTC

Replica seconds behind master: Offline (0m)

:(
Aloha, Uli

ID: 1689453
WezH
Volunteer tester
Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1689498 - Posted: 9 Jun 2015, 15:04:12 UTC - in response to Message 1688918.  

After today's new APs I found 8 more (7567911, 7567922, 7567938, 7568510, 7568518, 7568644, 7569479, 7572925).

All 35 identical computers, in chronological order of creation:

7567886, 7567911, 7567912, 7567913, 7567922, 7567924, 7567931, 7567938, 7567941, 7567951, 7568499, 7568501, 7568503, 7568504, 7568508, 7568510, 7568511, 7568512, 7568513, 7568516, 7568518, 7568519, 7568596, 7568597, 7568637, 7568640, 7568642, 7568643, 7568644, 7568646, 7569479, 7569519, 7572867, 7572890, 7572925


The last 5 are not identical to the first 30.

Anyway, the first 30 (feels like a classroom) identical computers currently have 537 AP units in progress. That's almost 8% of all workunits in the field.

I did see 20 valid WUs and 558 WUs with exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI.

Is there a problem with the GTX 660 driver, version 350.12, on every machine?

Or what?
ID: 1689498
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1689505 - Posted: 9 Jun 2015, 15:21:50 UTC

Is there a problem with the GTX 660 driver, version 350.12, on every machine?



From what I have seen on the threads.....YES.

"Sour Grapes make a bitter Whine." <(0)>
ID: 1689505
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1689510 - Posted: 9 Jun 2015, 15:36:46 UTC - in response to Message 1689498.  
Last modified: 9 Jun 2015, 15:37:20 UTC



Is there a problem with the GTX 660 driver, version 350.12, on every machine?

Or what?



Version 350.12 is OpenCL 1.2.

If you want to crunch APs with this driver version, you need to replace your apps with the ones provided by Raistmer.

Otherwise you need to roll back to an earlier driver version that still has OpenCL 1.1.

I think that is version 347.88.

Zalster
ID: 1689510