Panic Mode On (95) Server Problems?

Profile Wiggo
Joined: 24 Jan 00
Posts: 36809
Credit: 261,360,520
RAC: 489
Australia
Message 1640070 - Posted: 11 Feb 2015, 7:48:12 UTC

Oh well, my main rig's GPUs are now on their backup work and the 2nd rig's GPUs have about 8hrs of work left before they may have to do the same. :-(

Cheers.
ID: 1640070 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1640071 - Posted: 11 Feb 2015, 7:51:29 UTC - in response to Message 1640062.  
Last modified: 11 Feb 2015, 7:54:00 UTC

@Grant, I was referring to the fact that there were over 3 million results waiting to be deleted; there was no backlog and the servers weren't running behind. I think the purging totals are higher at the moment because there are/were shorter work units going through. As I write (11 Feb 2015, 7:20:03 UTC), it looks like parts of the server are running about 3 hours behind in displaying information.

I looked at the graphs at the time you posted, and there were no WUs at all waiting to be deleted- none. Hence my question.
The only thing that came close to the 3 million you mentioned was the number of results (WUs) that were in progress at the time.
And they are what they sound like- all the WUs people have downloaded, the system waiting for results to be returned. They aren't there waiting for deletion.
Grant
Darwin NT
ID: 1640071 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1640074 - Posted: 11 Feb 2015, 7:55:07 UTC - in response to Message 1640070.  

Oh well, my main rig's GPUs are now on their backup work and the 2nd rig's GPUs have about 8hrs of work left before they may have to do the same. :-(

Cheers.

Got a couple of hours of GPU work left myself.
Add to the present Scheduler woes & frozen server status, the forums are very slow to load at present.
Grant
Darwin NT
ID: 1640074 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1640116 - Posted: 11 Feb 2015, 9:35:06 UTC - in response to Message 1640074.  

In the last 30min or so both of my machines have been able to report the work they had; it took over a minute for the Scheduler to finally respond & accept them.
However, requests for work, even when the Scheduler does respond, result in "Project has no tasks available" messages.
Another half hour & 1 system will be out of GPU work, another half hour or so after that & the other system will be out of GPU work as well.
Luckily the CPUs have quite a few VLARs lined up, so they should keep going for most of the night.
Grant
Darwin NT
ID: 1640116 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1640120 - Posted: 11 Feb 2015, 9:51:08 UTC

Well there's two things I wanted to point out as interesting:

1) "MB results/WUs waiting for DB purge" did not drop after the maintenance.

2) A little after midnight (Berkeley time (UTC-8)), there is the fairly-predictable slight surge on the cricket graph that only lasts for a short period of time. In addition to that, I have noticed that the RTS buffer has dropped a bit. So that means there ARE, in fact, WUs being assigned, but something is very, very significantly hindering DB queries.

Methinks that is why most people get scheduler time-outs or long delays before a reply, and why the successful replies result in "no tasks available": A) whatever is impeding the DB also impedes the feeder's ability to run its query to get a new batch of WUs to assign, which leads to B) what little there was in the feeder (even if it was a full helping of 200 (? is that what it still is these days?)) having already been assigned by the time a successful scheduler contact went through.
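
To make that reasoning concrete, here is a rough toy model in Python. It is only a sketch of the idea, not how the BOINC feeder or scheduler is actually implemented: the function name `simulate`, the request rate and the refill intervals are all made-up assumptions, and only the 200-slot figure comes from the guess above.

```python
def simulate(ticks, slots=200, requests_per_tick=50, refill_every=1):
    """Toy feeder/scheduler model (all numbers are illustrative assumptions).

    slots             -- size of the feeder's cache of ready-to-send WUs
    requests_per_tick -- scheduler requests that get through per tick
    refill_every      -- ticks between completed feeder DB queries
                         (1 = healthy DB, larger = slow DB)
    Returns (granted, refused) request counts.
    """
    cache = slots          # start with a full helping
    granted = refused = 0
    for tick in range(1, ticks + 1):
        for _ in range(requests_per_tick):
            if cache > 0:
                cache -= 1
                granted += 1
            else:
                refused += 1   # "Project has no tasks available"
        if tick % refill_every == 0:
            cache = slots      # feeder query finished, cache topped up
    return granted, refused


if __name__ == "__main__":
    print("healthy DB:", simulate(60, refill_every=1))
    print("slow DB:   ", simulate(60, refill_every=30))
```

With these invented numbers, the healthy case grants all 3000 requests, while the slow-DB case grants only 400 and refuses 2600, even though work is technically still being handed out.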

Hopefully whatever query/operation is being done on the DB finishes up before long. I wonder if it's another massive ~16 TB backup operation? Last week, that operation didn't seem to hinder performance much, if at all, but it was going out over the network and limited by either the gigabit link itself, or the ability to write the data at the destination. Maybe this week, the same thing is being done, but on a local volume, so it can read/write much faster because I'm sure it is at least (in one way or another) a 5-disk stripe.
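
For a sense of scale, a back-of-the-envelope calculation of how long ~16 TB would take over a gigabit link. This is a sketch under stated assumptions: decimal terabytes, a perfectly saturated link unless `efficiency` is lowered, and a hypothetical helper name `transfer_hours`.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link_gbps link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s).
    efficiency is the fraction of the raw link rate actually achieved.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    print(f"{transfer_hours(16, 1):.1f} hours at full gigabit line rate")
```

At full line rate that comes to roughly a day and a half, so a network copy would be bounded by the link, whereas a local copy would be bounded by disk throughput instead.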

Oh well, I guess we'll have to see how this plays out.

btw.. if you couldn't tell by now, I like trying to sort-out logic puzzles with minimal information.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1640120 · Report as offensive
Profile Brent Norman
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1640132 - Posted: 11 Feb 2015, 10:37:27 UTC - in response to Message 1640120.  
Last modified: 11 Feb 2015, 10:42:30 UTC

I have a feeling they are doing a database merge or manipulation of data on the master AB db causing the slowdown. As long as they still have the original WU results somewhere they will be able to rebuild everything from there once they sort the data out.

My old computer grabbed 3 WUs 3 hours ago so yes some work is making it out. The buffer has dropped at least 5,000 plus the resends so there is a little trickle happening. But likely those are ending up in the bottom of someone's 10 day cache :(

I imagine the 0.3-0.6 creation rate is just the resends that are being generated from timeouts.
ID: 1640132 · Report as offensive
Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 31012
Credit: 53,134,872
RAC: 32
United States
Message 1640172 - Posted: 11 Feb 2015, 14:38:56 UTC

ID: 1640172 · Report as offensive
OTS
Volunteer tester

Joined: 6 Jan 08
Posts: 371
Credit: 20,533,537
RAC: 0
United States
Message 1640190 - Posted: 11 Feb 2015, 16:02:59 UTC - in response to Message 1640172.  

Cricket back to 120 Mbps - hope in Mudville - at least for MBs?
ID: 1640190 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1640213 - Posted: 11 Feb 2015, 16:52:09 UTC

I'm not getting scheduler time-outs, but I'm also not getting any WUs assigned to any of my seven computers! (but only on production: Beta's fine...) (yes, one computer tried at least three times...)
.

Hello, from Albany, CA!...
ID: 1640213 · Report as offensive
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22535
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1640217 - Posted: 11 Feb 2015, 16:57:13 UTC

All the hundreds of tasks sitting around to report have reported, now just waiting for some nice shiny new tasks to crunch....

(And as has already been reported the Crickets have come back to life)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1640217 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1640258 - Posted: 11 Feb 2015, 18:38:04 UTC - in response to Message 1640217.  

Both my systems now have some GPU work.
Most requests for work result in "Project has no tasks available" messages. Given the length of the outage & the lack of any AP work I think that will be the case for a fair while yet as the feeder struggles to meet the demand.
Grant
Darwin NT
ID: 1640258 · Report as offensive
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1640337 - Posted: 11 Feb 2015, 20:46:25 UTC - in response to Message 1640258.  

Both my systems now have some GPU work.
Most requests for work result in "Project has no tasks available" messages. Given the length of the outage & the lack of any AP work I think that will be the case for a fair while yet as the feeder struggles to meet the demand.

Well, as of 2042 UTC, all my machines have their full quota of MB tasks, except for my Celeron J1900 which I switched from Linux to Windows 10 on Monday. It's not done enough GPU jobs yet to have its performance quantified (currently its GPU is running at 70 or 80% of a single CPU core, but the CPU is reportedly running at 2.4 GHz, for a 2 GHz part!).
ID: 1640337 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1640720 - Posted: 12 Feb 2015, 15:54:23 UTC

When I saw that the crickets jumped up to ~400 again, I was hoping AP was running. No such luck, I see. Oh well.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1640720 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1640816 - Posted: 12 Feb 2015, 18:00:51 UTC - in response to Message 1640720.  

When I saw that the crickets jumped up to ~400 again, I was hoping AP was running. No such luck, I see. Oh well.

Oh, meow.
When Eric does get things sorted and AP starts to validate again, the kitties are gonna have a fine day....LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1640816 · Report as offensive
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1640884 - Posted: 12 Feb 2015, 19:46:21 UTC - in response to Message 1640816.  

Oh, meow.
When Eric does get things sorted and AP starts to validate again, the kitties are gonna have a fine day....LOL.


Have you counted your kitties today? There was some loose talk at Rockie's Cafe yesterday concerning a 'catserole'........

:Dg

"Sour Grapes make a bitter Whine." <(0)>
ID: 1640884 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1641070 - Posted: 13 Feb 2015, 3:30:15 UTC
Last modified: 13 Feb 2015, 3:33:46 UTC

So....Why are VLARS being sent to my GPUs?
I see them on All 3 of my machines. Beware, the VLARS are coming!
ID: 1641070 · Report as offensive
Profile Zalster
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1641082 - Posted: 13 Feb 2015, 4:29:39 UTC - in response to Message 1641070.  
Last modified: 13 Feb 2015, 4:46:14 UTC

Yup, got a whole bunch of those as well.

Haven't tried processing any of them yet.

Will have to wait and see what they do.


Edit..

yup, they are acting just like they did on Beta...

Going to take longer time to process each of those.

Looks like the majority are coming from 28no12ab and 28no12ad; there are a couple from some others.
ID: 1641082 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1641109 - Posted: 13 Feb 2015, 6:23:10 UTC - in response to Message 1641070.  
Last modified: 13 Feb 2015, 6:23:50 UTC

So....Why are VLARS being sent to my GPUs?
I see them on All 3 of my machines. Beware, the VLARS are coming!

Noticed that myself. Will be interesting to see how my GTX 750Tis handle them, as they do better at processing longer running WUs than they do the shorties.

EDIT- I also noticed some interesting recurring spikes in the network traffic.
Grant
Darwin NT
ID: 1641109 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1641132 - Posted: 13 Feb 2015, 7:35:38 UTC - in response to Message 1641109.  
Last modified: 13 Feb 2015, 7:49:37 UTC

.. Will be interesting to see how my GTX 750Tis handle them, as they do better at processing longer running WUs than they do the shorties.

Ok, I got bored.
Suspended all the other tasks to get the VLARs running.
End result- longer run times than the estimates- 33min estimated, actual run times around 40 min. Also slows down processing of other GPU WUs if they are running while a VLAR is being done (I've got 2 * GTX 750Tis & run 2 WUs at a time.) However no effect on other programmes, no screen lag, display stuttering etc.


Interestingly, GPU load is increased.
On my Win7 system, GPU load is generally around mid to high 80%. With the VLARs it hits 95-99% often, but it is very bursty- without the VLARs with the lower GPU utilisation it does vary, but not as much, or by nearly as large a range as with the VLARs. Also the drop in GPU utilisation corresponds with large jumps in Memory Controller load.
On my Vista system, GPU load is usually in the high 90s, so the slowdown caused by the VLARs is more pronounced. Also, whereas on this system the variation in GPU load is usually very slight, the VLARs cause similar drops in GPU utilisation as on the Win7 system, although the overall GPU utilisation remains higher.
These GPU utilisation drops also coincide with Memory Controller load increases.


EDIT- I spoke too soon.
33min estimated, more like 60min actual run time.
Usually with the longer running WUs a 30min estimated time will be done in 25min or less.
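
Putting those numbers side by side, as a quick sketch (the helper name `runtime_ratio` is just for illustration; the figures are the ones quoted above):

```python
def runtime_ratio(estimated_min, actual_min):
    """Actual run time as a multiple of the estimate (1.0 = spot on)."""
    return actual_min / estimated_min


if __name__ == "__main__":
    vlar = runtime_ratio(33, 60)     # VLAR on the GPU: 33min estimated, ~60min actual
    normal = runtime_ratio(30, 25)   # usual longer WU: 30min estimated, ~25min actual
    print(f"VLAR:   {vlar:.2f}x the estimate")
    print(f"normal: {normal:.2f}x the estimate")
    print(f"VLARs run about {vlar / normal:.1f}x slower relative to their estimates")
```

So where a normal longer WU finishes in under its estimate, the VLARs take close to double theirs, which works out to a bit over twice as slow relative to what the estimate would suggest.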
Grant
Darwin NT
ID: 1641132 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1641135 - Posted: 13 Feb 2015, 7:45:02 UTC - in response to Message 1640884.  

Oh, meow.
When Eric does get things sorted and AP starts to validate again, the kitties are gonna have a fine day....LOL.


Have you counted your kitties today? There was some loose talk at Rockie's Cafe yesterday concerning a 'catserole'........

:Dg

LOL....Yes, all kitties are accounted for. I saw the pic and banter...
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1641135 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.