Panic Mode On (94) Server Problems?

Message boards : Number crunching : Panic Mode On (94) Server Problems?

Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1630894 - Posted: 21 Jan 2015, 23:14:13 UTC - in response to Message 1630879.  

I'm only running one computer. Using 2 cores of an old Q8200 CPU for CPU tasks, and 2 cores feeding a single Mid-range GPU, ATI HD7870.
Look at the RAC folks, and ask yourselves why it beats so many multi GPU monster computers :-)



Do you cheat?

Do you?

Honestly I would expect that GPU to pump out a bit more. I would guess that the old Core 2 Quad might be slowing down the GPU a bit. At least that is what I found out after moving from my Core 2 to an i5-4670 CPU with the same HD6870.

I expect it to keep on rising (if AP's are available), and stop at around 45-50 thousand in RAC.

I guesstimate mine to top out somewhere around there, and that your GPU should, at least, put out 6.8-7.2k more than mine, which would put your machine in the range of what you are thinking as well.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1630894 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1630953 - Posted: 22 Jan 2015, 0:28:11 UTC - in response to Message 1630875.  

I find your RAC interesting because it is based predominantly on AP work. Have you been able to keep the system busy with just AP work, considering the whole AP database mess lately? My take on this: AP work can be considered the 'best bang for the buck' with respect to actual credit awarded per unit of time.

Keith
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1630953 · Report as offensive
OTS
Volunteer tester

Send message
Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1630970 - Posted: 22 Jan 2015, 0:42:57 UTC - in response to Message 1630953.  

New message from Eric K. under Technical News.
ID: 1630970 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1631039 - Posted: 22 Jan 2015, 1:54:28 UTC

Well there goes my 'consecutive valid tasks' count...
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1631039 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1631076 - Posted: 22 Jan 2015, 2:46:12 UTC - in response to Message 1631039.  

Well there goes my 'consecutive valid tasks' count...

Yeah, well, now you're stuck with one of those "guaranteed to fail" MB WUs, too. Join the crowd! :^)
ID: 1631076 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1631105 - Posted: 22 Jan 2015, 4:12:29 UTC - in response to Message 1631076.  

Well there goes my 'consecutive valid tasks' count...

Yeah, well, now you're stuck with one of those "guaranteed to fail" MB WUs, too. Join the crowd! :^)

You know, I saw that. I noticed the stock app (_7) shows an autocorr count, and I've seen you guys talking about some batches that just will not run with autocorr. All of you are doing GPUs and such, though, so I've been wondering whether the Lunatics CPU app will do it right or not. *shrug* Probably not, but oh well, I suppose. It is what it is.

Not that it really matters on that machine anyway. I'm pretty sure the MBs it just got a little while ago are going to end up erroring out for taking 2x longer than the estimate. I think it's because of what I did for a pile of APs two weeks ago, and now the MBs it got have really short estimates on them (like 1:02:34 for a shorty, when those usually take 3-4 hours). I thought about adjusting those, too, but then I'd constantly be chasing this issue. I just need to let it sort itself out, now that there's nothing worth trying to save (the consecutive valid tasks count).
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1631105 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1631113 - Posted: 22 Jan 2015, 4:37:42 UTC - in response to Message 1631105.  

... but I've been wondering if the Lunatics CPU app will do it right or not. *shrug* Probably not, but oh well, I suppose. Is what it is.

Nope, it won't. Take a look at 3901319917, which was one of mine that ran to completion (an hour and 10 minutes on the CPU) before I wised up and started aborting them when I spotted them.

... now that there's nothing worth trying to save (consecutive valid tasks count.))

Actually, I don't think getting an Invalid will break your MB valid task streak, if you choose not to abort it. However, it will probably waste over an hour of your CPU's time getting it done. Certainly up to you as to which is more important.

Look on the bright side, though. At least you're not going to be running it on an Android device, like the poor _2 on that WU, which took over 18 hours to achieve absolutely nothing!
ID: 1631113 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1631143 - Posted: 22 Jan 2015, 6:47:57 UTC

I thought tasks that get marked as invalid would reset "consecutive valid" back to zero?

And I'm also going to be facing a few "maximum elapsed time exceeded (-177)" with the recent batch of MBs that I downloaded, so I know those will reset consecutive valid as well.

Like I said, when I edited the client_state to get realistic estimates for some APs to avoid that same error, I guess my DCF got scaled way down in the process (I edited the estimates to make them ~3x what they were assigned as, so it makes sense now that the estimates on new tasks are about 1/3 of what they should be). I need to just let this problem fix itself, and that's going to mean getting some errors.
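
For anyone curious, the kind of edit being described boils down to something like the minimal sketch below; the file path, scale factor, and "ap_" name filter are illustrative assumptions rather than the exact edit, and the BOINC client should be stopped first with a backup kept.

[code]
# Hedged sketch: scale rsc_fpops_est for matching workunits in client_state.xml.
# Assumptions: the BOINC client is stopped, the path and name filter are examples,
# and the file parses cleanly as XML.
import shutil
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"   # adjust to your BOINC data directory
SCALE = 3.0                       # make estimates ~3x what they were assigned as
NAME_FILTER = "ap_"               # only touch AstroPulse workunits (example filter)

shutil.copy(STATE_FILE, STATE_FILE + ".bak")   # always keep a backup

tree = ET.parse(STATE_FILE)
for wu in tree.getroot().iter("workunit"):
    if NAME_FILTER not in wu.findtext("name", default=""):
        continue
    est = wu.find("rsc_fpops_est")
    if est is not None:
        est.text = str(float(est.text) * SCALE)

tree.write(STATE_FILE)
[/code]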

I can probably micro-manage and gradually shift the estimates until they sort themselves out without generating errors, but that's too much effort for the reward of keeping the consecutive valid value from being reset. It wouldn't matter anyway, since that "doomed to fail" task is going to reset it regardless. There's no getting out of that one, other than resetting the project and having the lost tasks resent using the stock app (if the stock app does in fact do it properly).
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1631143 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1631155 - Posted: 22 Jan 2015, 7:58:49 UTC - in response to Message 1630843.  

OK, who stepped on the Cricket? Both my links show no activity but downloads are working. Anyone know how to find the router they're using?

Since the switch to the colo, the primary router for SETI@home is inr-211, with the backup being inr-210. If you look at any of the other ports on those routers, or any of the UCB routers actually, none of them are updating.
The network graphs are generated by the campus IS/IT department for their use and the use of their colo customers. Since it is an infrastructure monitoring tool, it is most likely down because they disabled it for some purpose, such as moving equipment.
IIRC it was down for about a day a few months ago as well.

LOL....I am actually relieved to find that out. I just got home from work and the first thing I did was refresh the Cricket graph. I was a bit dismayed and thought that the project was dead in the water. After checking further, I was happy to find that MB work is flowing and the rigs have their caches full again.

Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1631155 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1631288 - Posted: 22 Jan 2015, 17:36:19 UTC - in response to Message 1631143.  

I thought tasks that get marked as invalid would reset "consecutive valid" back to zero?

And I'm also going to be facing a few "maximum elapsed time exceeded (-177)" with the recent batch of MBs that I downloaded, so I know those will reset consecutive valid as well.

Like I said, when I edited the client_state to get realistic estimates for some APs to avoid that same error, I guess my DCF got scaled way down in the process (I edited the estimates to make them ~3x what they were assigned as.. so it makes sense now that the estimates on new tasks are about 1/3 what they should be). I need to just let this problem fix itself, and that's going to mean getting some errors.

Not necessarily. The "maximum elapsed time exceeded (-177)" errors are based on rsc_fpops_bound, not rsc_fpops_est, and for MB tasks the bound is set at 20x the est. Also, DCF is not used in the calculation of the time limit so although the estimated run time has been affected by the excursion in DCF, the limit has not.
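
In round numbers that works out like the worked sketch below; the fpops, flops, and DCF figures are invented purely for illustration.

[code]
# Worked sketch of the -177 elapsed-time limit vs. the displayed estimate.
# All numbers are made up for illustration.
rsc_fpops_est = 3.0e13                   # server's estimate of the task's work
rsc_fpops_bound = 20 * rsc_fpops_est     # for MB tasks the bound is ~20x the est
host_flops = 8.0e9                       # host's estimated speed for this app
dcf = 0.33                               # a DCF dragged way down by the AP edit

shown_estimate = dcf * rsc_fpops_est / host_flops   # what the estimate display uses
time_limit = rsc_fpops_bound / host_flops           # -177 trips past this; DCF not used

print(f"displayed estimate: {shown_estimate / 3600:.1f} h")   # ~0.3 h (misleading)
print(f"elapsed-time limit: {time_limit / 3600:.1f} h")       # ~20.8 h (unaffected)
[/code]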

I can probably micro-manage and gradually shift the estimates until they can sort themselves out without generating errors, but that's too much effort for the reward of keeping the consecutive valid value from being reset. Wouldn't matter if that "doomed to fail" task is going to reset it anyway. There's no getting out of that one, other than resetting the project and having the lost tasks resent using the stock app (if the stock app does in fact do it properly).

All app versions are doing those tasks properly; the autocorr parameters say not to do that search, and they don't. The stock CPU SaH v7 app happens to show an Autocorr count of zero under those conditions; that's a correct value and does not imply that Autocorr processing was done. The reason those correctly processed tasks are being judged invalid is that the SaH v7 Validator code requires a best_autocorr for tasks which don't overflow.
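
Schematically, the check being described amounts to the sketch below. This is only an illustration of the logic as stated here, not the actual SaH v7 validator source.

[code]
# Illustrative sketch of the best_autocorr requirement described above;
# not the real SETI@home v7 validator code.
def has_required_autocorr(overflowed: bool, best_autocorr) -> bool:
    if overflowed:
        return True                     # the requirement only applies to non-overflow results
    return best_autocorr is not None    # non-overflow results must report a best_autocorr
[/code]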
                                                                  Joe
ID: 1631288 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1631346 - Posted: 22 Jan 2015, 19:09:33 UTC

Well, MB seems to be flowing just fine, but I wish the Crickets would wake up.....
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1631346 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1631384 - Posted: 22 Jan 2015, 20:28:33 UTC
Last modified: 22 Jan 2015, 21:13:28 UTC

I caught one of those blips on the Status Page and saved the pages for comparison, to figure out what is going on.

When wrong info is displayed on the Server Status Page and the HaveLand pages, I found that it is not random info, but info from the wrong data fields showing up on the pages.

When page displays WRONG info:
(Field) = (What is Shown)

Results ready to send = Results out in the field
Current result creation rate ... Seems Correct
Results out in the field = Workunits waiting for validation
Results received in last hour ... Correct
Result turnaround time ... Correct
Results returned and awaiting validation = (Not Sure)
MB shows 0, should be 2,772,932
AP shows 58, should be 2,039,924
Workunits waiting for validation = Workunits waiting for assimilation
Workunits waiting for assimilation = Shows 0, Definitely showing a different field
Workunit files waiting for deletion = Shows 0, Could be showing different field
Result files waiting for deletion = Workunits waiting for db purging
Workunits waiting for db purging = Results waiting for db purging
Results waiting for db purging = Results returned and awaiting validation

On the small numbers (i.e. 0-58) it is hard to tell which values should be where.

To me it looks like the info is intermittently being read from the wrong fields, or worse, data is being written into the wrong fields in the database.

EDIT: Forgot Results in Field
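
Written out as a lookup, the mappings above (just the ones that looked certain) would let a saved snapshot be re-read under the right labels. A small convenience sketch, nothing more:

[code]
# Observed "displayed label" -> "field the value actually belongs to" mapping,
# taken from the list above.  Labels not listed were observed to be correct.
SHIFTED = {
    "Results ready to send":             "Results out in the field",
    "Results out in the field":          "Workunits waiting for validation",
    "Workunits waiting for validation":  "Workunits waiting for assimilation",
    "Result files waiting for deletion": "Workunits waiting for db purging",
    "Workunits waiting for db purging":  "Results waiting for db purging",
    "Results waiting for db purging":    "Results returned and awaiting validation",
}

def actual_field(displayed_label: str) -> str:
    """Return the field a displayed value most likely belongs to."""
    return SHIFTED.get(displayed_label, displayed_label)
[/code]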
ID: 1631384 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1631399 - Posted: 22 Jan 2015, 21:01:07 UTC - in response to Message 1631288.  

Not necessarily. The "maximum elapsed time exceeded (-177)" errors are based on rsc_fpops_bound, not rsc_fpops_est, and for MB tasks the bound is set at 20x the est.

Thanks, Joe. You always have the answers that explain everything.

The first of those MBs ran through to completion properly. The original estimate was 1:04:23 or something very close to that, and it ended up taking over 7 hours instead (normal for that machine).

Also, DCF is not used in the calculation of the time limit so although the estimated run time has been affected by the excursion in DCF, the limit has not.


That's also good to know. So _bound is more or less static, and _est is determined by a combination of APR and DCF? Since the above task finished, the estimates for the rest of the cache have increased quite a bit, but still not to where they should be. Currently, they're showing in the 2:35:00 range--up from 1:04:00, but not quite near 7:30:00 yet. It'll take a few more tasks for that to happen.

But at the same time, I thought that if a task took more than 10 or 20% longer than the average estimate, the remaining tasks would all have their estimates changed to what that long-running task took, and if it was less than 10% over, then something like 10% would be added to all the others. You've explained it before, long, long ago. I was just thinking the rest of the cache should have all become 7:30:00 upon completion of that first one.
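
That roughly matches how the client's DCF adjustment is usually described: a big overrun pulls DCF straight up to the observed ratio, while anything close to the estimate only nudges it. A rough sketch of that rule follows; the threshold and weights here are approximations, not the authoritative BOINC client code.

[code]
# Rough sketch of a per-project duration correction factor (DCF) update,
# following the behaviour described above.  Constants are approximate.
def update_dcf(dcf: float, elapsed: float, estimate_uncorrected: float) -> float:
    ratio = elapsed / estimate_uncorrected            # actual runtime vs. raw estimate
    if elapsed > 1.1 * dcf * estimate_uncorrected:
        return ratio                                  # big overrun: jump straight up
    # near or under the corrected estimate: move only part of the way,
    # so one unusually fast task can't drag every other estimate down
    return 0.9 * dcf + 0.1 * ratio
[/code]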
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1631399 · Report as offensive
OTS
Volunteer tester

Send message
Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1631419 - Posted: 22 Jan 2015, 22:02:53 UTC - in response to Message 1631402.  

I wonder if poor Eric is feeling a lot like Clifford Stoll back in the late 80s: a PhD in astronomy, and all he seems to do is spend his time working on computers. It must be getting a little frustrating.
ID: 1631419 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1631477 - Posted: 23 Jan 2015, 2:08:08 UTC - in response to Message 1631399.  

I thought that if a task took more than 10 or 20% longer than the average estimates, the remaining tasks would all have their estimates changed to what that long-running task took. If less than 10%, then do something like add 10% to all the others. You've explained it before..long, long ago. I was just thinking the rest of the cache should have all become 7:30:00 upon completion of that first one.

This seems to have corrected itself when the second MB task completed. All estimates look right now, and I allowed new tasks and filled the 3.5-day cache and all the new tasks have correct estimates, as well. Back to letting that machine be on auto-pilot for a while now, I suppose.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1631477 · Report as offensive
Aurora Borealis
Volunteer tester
Avatar

Send message
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1631500 - Posted: 23 Jan 2015, 4:11:27 UTC

Cricket is alive. Someone found an ink cartridge.
ID: 1631500 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1631523 - Posted: 23 Jan 2015, 5:21:57 UTC - in response to Message 1631500.  

Cricket is alive. Someone found an ink cartridge.

Cause for celebration.

Still getting random weirdness on the Haveland graphs though.
Grant
Darwin NT
ID: 1631523 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1631531 - Posted: 23 Jan 2015, 6:13:14 UTC

I'm curious to see what the splitters will do when they get to "tape" 24se13ad. My records show that my machines already processed 287 tasks from that file just last March. They were all MB v7 tasks, however. I didn't get any APs from it. Looks like it should be the next "tape" in line, but I think I'll have to wait until morning to see what happens.
ID: 1631531 · Report as offensive
Profile S@NL Etienne Dokkum
Volunteer tester
Avatar

Send message
Joined: 11 Jun 99
Posts: 212
Credit: 43,822,095
RAC: 0
Netherlands
Message 1631545 - Posted: 23 Jan 2015, 7:15:02 UTC

Something's alive. AP's started validating again...
ID: 1631545 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1631549 - Posted: 23 Jan 2015, 7:23:40 UTC - in response to Message 1631545.  

Something's alive. AP's started validating again...

Holy crap, did they ever....
And the kitties have some crickets to chase again!
Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1631549 · Report as offensive
