Panic Mode On (93) Server Problems?

Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1616457 - Posted: 19 Dec 2014, 22:22:05 UTC - in response to Message 1616451.  

Do they ever move tapes from Main to Beta for crunching? Just asking...
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1616460 - Posted: 19 Dec 2014, 22:25:12 UTC - in response to Message 1616457.  

Do they ever move tapes from Main to Beta for crunching? Just asking...

I would think the same few tapes would be permanently re-used on Beta... if the same data has been processed a few hundred times, you KNOW what the results should be when testing new apps.

I could be wrong though.
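
A minimal sketch of that idea in Python (everything here, from the workunit name to the canonical score, is made up for illustration; this isn't the Beta validator):

# Re-splitting the same tapes on Beta would mean every result from a new
# app build can be checked against a long-established, known-good answer.
CANONICAL_RESULTS = {"ap_test_tape_chunk_001": 1.2345}  # invented reference score

def validate_new_app(wu_name, new_score, tolerance=1e-3):
    """Pass only if the new build reproduces the established result."""
    return abs(new_score - CANONICAL_RESULTS[wu_name]) <= tolerance

assert validate_new_app("ap_test_tape_chunk_001", 1.23448)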
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1616476 - Posted: 19 Dec 2014, 23:34:38 UTC

Eight APs left; stocking the cache with MB, since a fix is obviously not in the works for a few days(?). There's still worthwhile science to be done without AP.

May we all live in interesting times.........

"Sour Grapes make a bitter Whine." <(0)>
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1616494 - Posted: 20 Dec 2014, 0:29:34 UTC

Well, if the slapping around doesn't work, it may be time to take it out behind the shed...

Here is a 26my14ab from three days ago:
name: ap_26my14ab_B0_P1_00238_20141216_05061.wu
application: AstroPulse v7
created: 16 Dec 2014, 14:44:12 UTC
http://setiathome.berkeley.edu/workunit.php?wuid=1652212000

It's been around for a while.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1616553 - Posted: 20 Dec 2014, 4:44:01 UTC - in response to Message 1616494.  

AP in progress & returned per hour continue to decline.
AP Awaiting validation continues to grow. AP assimilators are disabled.
Grant
Darwin NT
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1616565 - Posted: 20 Dec 2014, 5:56:50 UTC - in response to Message 1616553.  

AP in progress & returned per hour continue to decline.
AP Awaiting validation continues to grow. AP assimilators are disabled.

And now it's back up and poking along. I think I've about given up on trying to read the tea leaves on the SSP (Server Status Page); it doesn't seem to resemble reality when things go sideways...
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1616627 - Posted: 20 Dec 2014, 14:17:07 UTC

I just watched Inside a Google data center. Talk about 'server problems'... Wouldn't we all wish that the colo looked like that?
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1616644 - Posted: 20 Dec 2014, 15:40:40 UTC

Oh no, I have loads of "*no14ac*" tasks again, of which the 13 to 20 range causes severe stuttering when they run on my AMD HD7870. What is it with these November tasks that causes this problem? The older tasks, 21 to 26, give no problems.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1616649 - Posted: 20 Dec 2014, 16:05:53 UTC - in response to Message 1616620.  
Last modified: 20 Dec 2014, 16:18:02 UTC

Well, AP started flowing here pretty good, but that stopped at the moment when three splitters started working on the same file: 25my14ae

At that moment, AP delivery became molasses delivery again.

Now it's up to four splitters on 26se14as, with the creation rate down to 0.5585/sec and six completed channels. I think it was up to 0.8 or 0.9 when there were just two splitters on a single file earlier. Last night someone was testing things and disabled four of the seven splitters; the three remaining still jumped on the same file, but the creation rate was near 0.5, close to where it is now. It seems that when three or four splitters jump on the same file, the creation rate is the same as if the other splitters were disabled.

Now my one machine that was building a cache is out of work again. The other two still have work available... for now.
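
One way to read those numbers, sketched in Python (purely illustrative, not the real splitter code): if each tape file has some per-file bottleneck that only one splitter can use at a time, then three or four splitters on the same file serialize and produce no faster than one.

import threading, time

file_lock = threading.Lock()          # stand-in for a per-tape bottleneck
produced = 0

def splitter(stop_at):
    global produced
    while time.time() < stop_at:
        with file_lock:               # splitters on the same file serialize here
            time.sleep(0.01)          # stand-in for splitting one chunk
            produced += 1

stop = time.time() + 2.0
threads = [threading.Thread(target=splitter, args=(stop,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(produced)                       # roughly the same with 1 thread or 4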
WezH
Volunteer tester

Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1616651 - Posted: 20 Dec 2014, 16:09:52 UTC - in response to Message 1616620.  
Last modified: 20 Dec 2014, 16:10:38 UTC

Well, AP started flowing here pretty good, but that stopped at the moment when three splitters started working on the same file: 25my14ae

At that moment, AP delivery became molasses delivery again.


That one is gone now, but we have a new troublemaker, 26se14as.

Four splitters working on it...

Edit: TBar was faster.
"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1616668 - Posted: 20 Dec 2014, 16:45:47 UTC - in response to Message 1616651.  

Well, AP started flowing here pretty good, but that stopped at the moment when three splitters started working on the same file: 25my14ae

At that moment, AP delivery became molasses delivery again.


That one is gone now, but we have a new troublemaker, 26se14as.

Four splitters working on it...

At 16:30:05 UTC, the display for 26se14as shows four channels in progress, and the current result creation rate as 1.1107/sec.

I think it's too simplistic to equate 'channel in progress' with 'splitter is working', and to equate multiple splitters with slow working in every case.
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1616688 - Posted: 20 Dec 2014, 17:40:05 UTC

Despite the barely-crawling-along AP creation rate, my cache continues to slowly grow (I guess it also helps that I complete ~6 APs/day, which is molasses-slow compared to you guys who do them in 30 minutes on a GPU).
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1616710 - Posted: 20 Dec 2014, 19:03:16 UTC

Hi Folks,
Well, the S@H server problems seem to be contagious :-( Just had a 2nd CPU drop dead on me...

Now on my 3rd FX9370, so now I have to RMA another one...

So definitely NO CPU tasks for some time to come...

It's damn strange: I can run my GPUs at 50C or more, but the dratted CPU croaks at less than 40C core temps, and even then only for limited periods.

As for APs: since they became available I've had 4 total; all the rest are MB.

Regards,
Cliff,
Been there, Done that, Still no damn T shirt!
JohnDK Crowdfunding Project Donor, Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1616719 - Posted: 20 Dec 2014, 19:23:18 UTC - in response to Message 1616688.  

Despite the barely-crawling-along AP creation rate, my cache continues to slowly grow (I guess it also helps that I complete ~6 APs/day, which is molasses-slow compared to you guys who do them in 30 minutes on a GPU).

Even on my notebook, where APs take about 2h25m running two at a time, I keep running out of APs.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1616736 - Posted: 20 Dec 2014, 20:09:08 UTC

OK, how's this for a theory about the 'many splitters work the same tape' issue?

I think the algorithm in practice is:

"When a splitter finishes a tape, start work on what was the next tape in sequence when the split started"

It should be "start work on what is the next tape now".

Example: I've been watching the MB column for my data distribution charts. MB has just completed 26no14af, which was the last tape in the last batch loaded. There are currently two splitters working on 03oc14aa, which is the first tape in the current batch: we would have preferred it to start on 24oc14ac, which is the next unstarted tape in the batch.

Corroboration, or contradiction, anyone?
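
The difference between the two policies, sketched in Python (illustrative only; Tape and both functions are invented names, not the actual splitter source):

from dataclasses import dataclass

@dataclass
class Tape:
    name: str
    started: bool = False
    done: bool = False

def next_tape_stale(snapshot, finished_name):
    """Suspected behavior: walk the tape list as it looked when the split
    began, so a second splitter can land on a tape already being worked."""
    names = [t.name for t in snapshot]
    for tape in snapshot[names.index(finished_name) + 1:]:
        if not tape.done:             # ignores 'started' -> pile-ups
            return tape
    return None

def next_tape_live(tapes, finished_name):
    """Preferred behavior: consult the current state and claim the next
    tape nobody has started (24oc14ac rather than 03oc14aa above)."""
    for tape in tapes:
        if not tape.started and not tape.done:
            tape.started = True       # claim it so other splitters skip it
            return tape
    return None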
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1616742 - Posted: 20 Dec 2014, 20:36:06 UTC - in response to Message 1616736.  

OK, how's this for a theory about the 'many splitters work the same tape' issue?

I think the algorithm in practice is:

"When a splitter finishes a tape, start work on what was the next tape in sequence when the split started"

It should be "start work on what is the next tape now".

Example: I've been watching the MB column for my data distribution charts. MB has just completed 26no14af, which was the last tape in the last batch loaded. There are currently two splitters working on 03oc14aa, which is the first tape in the current batch: we would have preferred it to start on 24oc14ac, which is the next unstarted tape in the batch.

Corroboration, or contradiction, anyone?

That would agree with what I've been able to determine. It seems the longest-running file is the one that attracts the splitters. I have a WU from ap_26se14as dated the 19th; as far as I can tell, that makes it the longest-running file. Same as with the previous splitter attractor, it had been running for three days.

Still kinda strange the way the creation rate seems to slow down after a while without any noticeable changes; it's back down to 0.5/sec now.
WezH
Volunteer tester

Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1616744 - Posted: 20 Dec 2014, 20:41:29 UTC - in response to Message 1616736.  

I don't have answers to Your questions, Richard, but I do have my own question:

Did this business of 3 or 4 splitters working on the same tape happen before the AP database crash?
"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1616747 - Posted: 20 Dec 2014, 20:46:13 UTC - in response to Message 1616744.  

I don't have answers to Your questions, Richard, but I do have my own question:

Did this business of 3 or 4 splitters working on the same tape happen before the AP database crash?

I monitor MB rather than AP. But I think the answer is yes.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1616748 - Posted: 20 Dec 2014, 20:53:47 UTC

And then, when the surplus splitter finished a channel (rather than the whole tape) on 03oc14aa, it started the first channel on 24oc14ac as required.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1616762 - Posted: 20 Dec 2014, 21:45:35 UTC
Last modified: 20 Dec 2014, 22:07:00 UTC

Just a little reminder. I've been running the cache at 2 days because that is less than the amount of CPU VLAR work I have. I was hoping I would be sent enough GPU APs to run my machine; that isn't happening. So I raised the cache by a day and was immediately sent CPU APs that I don't need. I have days of CPU work and about an hour of GPU work, so the server sends me CPU work. Brilliant. Here are the numbers the server was working with:
Sat Dec 20 16:28:01 2014 | SETI@home | [sched_op] Starting scheduler request
Sat Dec 20 16:28:01 2014 | SETI@home | Sending scheduler request: To fetch work.
Sat Dec 20 16:28:01 2014 | SETI@home | Requesting new tasks for CPU and ATI
Sat Dec 20 16:28:01 2014 | SETI@home | [sched_op] CPU work request: 169169.51 seconds; 0.00 devices
Sat Dec 20 16:28:01 2014 | SETI@home | [sched_op] ATI work request: 1299211.44 seconds; 0.00 devices
Sat Dec 20 16:28:04 2014 | SETI@home | Scheduler request completed: got 4 new tasks
Sat Dec 20 16:28:04 2014 | SETI@home | [sched_op] Server version 705
Sat Dec 20 16:28:04 2014 | SETI@home | Project requested delay of 303 seconds
Sat Dec 20 16:28:04 2014 | SETI@home | [sched_op] estimated total CPU task duration: 158476 seconds
Sat Dec 20 16:28:04 2014 | SETI@home | [sched_op] estimated total ATI task duration: 0 seconds

It SHOULD have sent those to the GPUs, but instead it sent them to the CPUs. When I looked at the Projects tab, I found that instead of asking for GPU tasks, which I'm just about out of, it was just sitting there without any time deferral. As soon as I raised the cache so that CPU work was needed, BOOM, it requested work.

It sure would be nice if the scheduler were as concerned about the GPU work as it is about the CPU work.

More CPU work while I have 2 GPU tasks left:
Sat Dec 20 17:03:59 2014 | SETI@home | Sending scheduler request: To fetch work.
Sat Dec 20 17:03:59 2014 | SETI@home | Requesting new tasks for CPU and ATI
Sat Dec 20 17:03:59 2014 | SETI@home | [sched_op] CPU work request: 799965.62 seconds; 0.00 devices
Sat Dec 20 17:03:59 2014 | SETI@home | [sched_op] ATI work request: 2596991.90 seconds; 0.00 devices
Sat Dec 20 17:04:01 2014 | SETI@home | Scheduler request completed: got 4 new tasks
Sat Dec 20 17:04:01 2014 | SETI@home | [sched_op] Server version 705
Sat Dec 20 17:04:01 2014 | SETI@home | Project requested delay of 303 seconds
Sat Dec 20 17:04:01 2014 | SETI@home | [sched_op] estimated total CPU task duration: 158419 seconds
Sat Dec 20 17:04:01 2014 | SETI@home | [sched_op] estimated total ATI task duration: 0 seconds
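
For what it's worth, a small Python helper (illustrative only, not part of BOINC) can tally those [sched_op] lines into asked-versus-sent per device, which makes the mismatch obvious:

import re

REQUEST = re.compile(r"\[sched_op\] (\w+) work request: ([\d.]+) seconds")
SENT = re.compile(r"\[sched_op\] estimated total (\w+) task duration: ([\d.]+) seconds")

def summarize(log_text):
    """Print seconds of work requested vs. estimated seconds sent, per device."""
    requested, sent = {}, {}
    for line in log_text.splitlines():
        if (m := REQUEST.search(line)):
            requested[m.group(1)] = float(m.group(2))
        elif (m := SENT.search(line)):
            sent[m.group(1)] = float(m.group(2))
    for device, asked in requested.items():
        print(f"{device}: asked {asked:,.0f}s, sent {sent.get(device, 0.0):,.0f}s")

# Fed the first request above, this prints:
#   CPU: asked 169,170s, sent 158,476s
#   ATI: asked 1,299,211s, sent 0s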