Message boards :
Number crunching :
Panic Mode On (94) Server Problems?
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57

I'm only running one computer, using 2 cores of an old Q8200 CPU for CPU tasks and 2 cores feeding a single mid-range GPU, an ATI HD7870. I guesstimate mine to top out somewhere around that level, and that your GPU should, at least, put out 6.8-7.2k more than mine, which would put your machine in the range of what you are thinking as well.

SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873

I find your RAC interesting because it is based predominantly on AP work. Have you been able to keep the system busy with just AP work, considering the whole AP database mess lately? My take on this: AP work can be considered the best bang for the buck with respect to actual credit awarded per unit of time.

Keith

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
OTS · Joined: 6 Jan 08 · Posts: 369 · Credit: 20,533,537 · RAC: 0

New message from Eric K. under Technical News.
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

Well, there goes my 'consecutive valid tasks' count...

Linux laptop: record uptime 1511d 20h 19m (ended due to the power brick giving up)
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

> Well there goes my 'consecutive valid tasks' count...

Yeah, well, now you're stuck with one of those "guaranteed to fail" MB WUs, too. Join the crowd! :^)
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

> Well there goes my 'consecutive valid tasks' count...

You know, I saw that. I noticed the stock app (_7) shows an autocorr count, and I've seen you guys talking about some batches that just will not run with autocorr, but all of you are doing GPUs and such, so I've been wondering whether the Lunatics CPU app will do it right or not. *shrug* Probably not, but oh well. It is what it is. Not that it really matters on that machine anyway.

I'm pretty sure the MBs it got a little while ago are going to end up erroring out for taking 2x more than the estimate. I think it's because of what I did for a pile of APs two weeks ago; the MBs it got now have really, really short estimates (like 1:02:34 for a shorty, when those usually take 3-4 hours). I thought about adjusting those too, but then I'd constantly be chasing this issue. I just need to let it sort itself out, now that there's nothing worth trying to save (the consecutive valid tasks count).
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

> ... but I've been wondering whether the Lunatics CPU app will do it right or not. *shrug* Probably not, but oh well.

Nope, it won't. Take a look at 3901319917, which was one of mine that ran to completion (an hour and 10 minutes on the CPU) before I wised up and started aborting them when I spotted them.

> ... now that there's nothing worth trying to save (the consecutive valid tasks count).

Actually, I don't think getting an Invalid will break your MB valid-task streak if you choose not to abort it. However, it will probably waste over an hour of your CPU's time getting it done. Certainly up to you as to which is more important.

Look on the bright side, though. At least you're not going to be running it on an Android device, like the poor _2 on that WU, which took over 18 hours to achieve absolutely nothing!
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

I thought tasks that get marked as invalid would reset "consecutive valid" back to zero? I'm also going to be facing a few "maximum elapsed time exceeded (-177)" errors with the recent batch of MBs that I downloaded, so I know those will reset consecutive valid as well.

Like I said, when I edited the client_state to get realistic estimates for some APs to avoid that same error, I guess my DCF got scaled way down in the process (I edited the estimates to make them ~3x what they were assigned as, so it makes sense now that the estimates on new tasks are about 1/3 of what they should be). I need to just let this problem fix itself, and that's going to mean getting some errors. I could probably micro-manage and gradually shift the estimates until they sort themselves out without generating errors, but that's too much effort for the reward of keeping the consecutive-valid value from being reset. It wouldn't matter anyway, if that "doomed to fail" task is going to reset it. There's no getting out of that one, other than resetting the project and having the lost tasks resent using the stock app (if the stock app does in fact do it properly).
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004

> OK, who stepped on the Cricket? Both my links show no activity, but downloads are working. Anyone know how to find the router they're using?

LOL... I am actually relieved to find that out. I just got home from work, and the first thing I did was refresh the Cricket graph. I was a bit dismayed and thought that the project was dead in the water. After checking further, I was happy to find that MB work is flowing and the rigs have their caches full again. Meow!

"Freedom is just Chaos, with better lighting." - Alan Dean Foster
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

> I thought tasks that get marked as invalid would reset "consecutive valid" back to zero?

Not necessarily. The "maximum elapsed time exceeded (-177)" errors are based on rsc_fpops_bound, not rsc_fpops_est, and for MB tasks the bound is set at 20x the est. Also, DCF is not used in the calculation of the time limit, so although the estimated run time has been affected by the excursion in DCF, the limit has not.

> I can probably micro-manage and gradually shift the estimates ... (if the stock app does in fact do it properly).

All app versions are doing those tasks properly: the autocorr parameters say not to do that search, and they don't. The stock CPU SaH v7 app happens to show an autocorr count of zero under those conditions; that's a correct value and does not imply that autocorr processing was done. The reason those correctly processed tasks are being judged invalid is that the SaH v7 validator code requires a best_autocorr for tasks which don't overflow.

Joe
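Joe's explanation can be put into a few lines of arithmetic. The sketch below is an illustration, not SETI/BOINC source code; the field names follow client_state.xml, and the flops and fpops figures are made-up example values:

```python
# Sketch of how the -177 "maximum elapsed time exceeded" limit relates to
# the displayed runtime estimate, per Joe's explanation. Illustrative only.

def estimated_runtime(rsc_fpops_est, flops, dcf):
    """The estimate the user sees: scaled by the duration correction factor."""
    return rsc_fpops_est / flops * dcf

def elapsed_time_limit(rsc_fpops_bound, flops):
    """The abort threshold: based on the bound only; DCF plays no part."""
    return rsc_fpops_bound / flops

flops = 2.5e9                          # host speed in FLOPS (example value)
rsc_fpops_est = 30e12                  # example estimate from the scheduler
rsc_fpops_bound = 20 * rsc_fpops_est   # MB tasks: bound fixed at 20x the est

# With DCF squashed to 1/3, the displayed estimate shrinks to a third...
print(estimated_runtime(rsc_fpops_est, flops, dcf=1/3) / 3600)  # hours shown
# ...but the kill limit is unchanged, so a task only errors out with -177
# if its real runtime exceeds 20x the uncorrected estimate.
print(elapsed_time_limit(rsc_fpops_bound, flops) / 3600)        # hours allowed
```

The asymmetry is the whole point: hand-editing estimates (and thereby DCF) distorts what the client displays, but not when it aborts.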
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004

Well, MB seems to be flowing just fine, but I wish the Crickets would wake up.....
Brent Norman · Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835

I caught one of those blips on the Status Page and saved it for comparison, to figure out what is going on. When wrong info is displayed on the Server Status Page and the HaveLand pages, I found that it is not random info, but info from the wrong data fields showing up on the pages.

When the page displays wrong info: (Field) = (What is Shown)

Results ready to send = Results out in the field
Current result creation rate = seems correct
Results out in the field = Workunits waiting for validation
Results received in last hour = correct
Result turnaround time = correct
Results returned and awaiting validation = (not sure; MB shows 0, should be 2,772,932; AP shows 58, should be 2,039,924)
Workunits waiting for validation = Workunits waiting for assimilation
Workunits waiting for assimilation = shows 0; definitely showing a different field
Workunit files waiting for deletion = shows 0; could be showing a different field
Result files waiting for deletion = Workunits waiting for db purging
Workunits waiting for db purging = Results waiting for db purging
Results waiting for db purging = Results returned and awaiting validation

On the small numbers (i.e. 0-58) it is hard to tell which values should be where. To me it looks like info is randomly being read wrong, or worse, data is being written wrong into the database.

EDIT: Forgot Results out in the field.
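Brent's comparison amounts to matching each displayed value against the field that should actually hold it. A minimal sketch of that matching (field names follow his table, but the numbers are placeholders; the real status page is HTML, not structured data):

```python
# Given the values a correct status page should show and the values actually
# displayed, figure out which real field each displayed slot is pulling from.
# Illustrative data only, not live server figures.

def match_shifted_fields(expected, displayed):
    """Map each displayed slot to the expected field holding its value."""
    mapping = {}
    for slot, shown in displayed.items():
        sources = [f for f, v in expected.items() if v == shown]
        # None when the value is ambiguous (small numbers like 0-58 can
        # match several fields) or matches nothing known.
        mapping[slot] = sources[0] if len(sources) == 1 else None
    return mapping

expected = {
    "Results ready to send": 581_000,
    "Results out in the field": 2_772_932,
    "Workunits waiting for validation": 140_000,
}
displayed = {
    "Results ready to send": 2_772_932,   # actually "out in the field"
    "Results out in the field": 140_000,  # actually "waiting for validation"
}
print(match_shifted_fields(expected, displayed))
```

The ambiguity branch mirrors Brent's caveat: when several fields all read 0 or 58, there is no way to tell which slot is feeding which.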
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

> Not necessarily. The "maximum elapsed time exceeded (-177)" errors are based on rsc_fpops_bound, not rsc_fpops_est, and for MB tasks the bound is set at 20x the est.

Thanks, Joe. You always have the answers that explain everything. The first of those MBs ran through to completion properly. The original estimate was 1:04:23, or something very close to that, and it ended up taking over 7 hours instead (normal for that machine).

> Also, DCF is not used in the calculation of the time limit, so although the estimated run time has been affected by the excursion in DCF, the limit has not.

That's also good to know. So _bound is more or less static, and _est is determined by a combination of APR and DCF? Since the above task finished, the estimates for the rest of the cache have increased quite a bit, but still not to where they should be. Currently they're showing in the 2:35:00 range, up from 1:04:00, but not quite near 7:30:00 yet. It'll take a few more tasks for that to happen.

But at the same time, I thought that if a task took more than 10 or 20% longer than the estimate, the remaining tasks would all have their estimates changed to what that long-running task took, and if less than 10%, something like adding 10% to all the others. You've explained it before... long, long ago. I was just thinking the rest of the cache should have all become 7:30:00 upon completion of that first one.
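The half-remembered 10% rule resembles the way the BOINC client is commonly described as updating DCF: quick to rise when a task runs long, slow to fall when it runs short. A rough sketch of that rule (illustrative values; the actual client, and the v7 server-side APR-based estimation, add further wrinkles):

```python
# Sketch of an asymmetric DCF update: underestimates are corrected at once,
# overestimates are eased down gradually. Modeled on descriptions of the
# BOINC client's duration correction factor logic; values are examples.

def update_dcf(dcf, elapsed, est_uncorrected):
    raw_ratio = elapsed / est_uncorrected          # vs. estimate without DCF
    adj_ratio = elapsed / (est_uncorrected * dcf)  # vs. estimate as displayed
    if adj_ratio > 1.1:
        return raw_ratio                     # ran >10% long: jump straight up
    if adj_ratio < 0.9:
        return 0.9 * dcf + 0.1 * raw_ratio   # ran >10% short: ease down 10%
    return raw_ratio                         # close enough: track exactly

dcf = 1 / 3          # squashed by the hand-edited AP estimates
est = 9000.0         # uncorrected estimate in seconds (~2.5 h; example)
print(update_dcf(dcf, elapsed=27000.0, est_uncorrected=est))  # jumps to 3.0
```

On these numbers a single long completion snaps DCF most of the way back, which roughly matches the estimates recovering over the first couple of MB completions.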
OTS · Joined: 6 Jan 08 · Posts: 369 · Credit: 20,533,537 · RAC: 0

I wonder if poor Eric is feeling a lot like Clifford Stoll back in the late '80s: a PhD in astronomy, and all he seems to do is spend his time working on computers. Must be getting a little frustrating.
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

> I was just thinking the rest of the cache should have all become 7:30:00 upon completion of that first one.

This seems to have corrected itself when the second MB task completed. All estimates look right now, and I allowed new tasks and filled the 3.5-day cache; all the new tasks have correct estimates as well. Back to letting that machine be on auto-pilot for a while now, I suppose.
Aurora Borealis · Joined: 14 Jan 01 · Posts: 3075 · Credit: 5,631,463 · RAC: 0

Cricket is alive. Someone found an ink cartridge.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304

> Cricket is alive. Someone found an ink cartridge.

Cause for celebration. Still getting random weirdness on the Haveland graphs, though.

Grant
Darwin NT
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

I'm curious to see what the splitters will do when they get to "tape" 24se13ad. My records show that my machines already processed 287 tasks from that file just last March. They were all MB v7 tasks, however; I didn't get any APs from it. Looks like it should be the next "tape" in line, but I think I'll have to wait until morning to see what happens.
S@NL Etienne Dokkum · Joined: 11 Jun 99 · Posts: 212 · Credit: 43,822,095 · RAC: 0

Something's alive. APs started validating again...
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004

> Something's alive. APs started validating again...

Holy crap, did they ever.... And the kitties have some crickets to chase again! Meow!
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.