Panic Mode On (80) Server Problems?


Message boards : Number crunching : Panic Mode On (80) Server Problems?

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 2
United Kingdom
Message 1321923 - Posted: 30 Dec 2012, 3:35:11 UTC

I know this latest mess seemed to start about the same time as the AP splitters got `up to speed`.
So, instead of switching some of them off to see if that fixes anything
(it's been done, so why flog a dead horse again),
let's try something crazy and turn off the multibeam splitters instead.
I know it sounds insane, but sometimes you've just got to take a walk on the wild side to get a look at the problem from a different angle.
Fault finding in a complex system is a git to do,
and mostly all you can hope for is to make the problem react to something you did. Even if you can't see directly where the problem is, at least it did something different from the last ten times you tried to poke it with a sharp stick and missed.
The project is kind of stuffed anyway, so what have we got to lose?

OK, so I may be mad, or so far from the truth or the real problem that I may be nearer to finding ET, cos I am so far out I end up being closer to them.
Whatever,
just a frustrated, head-scratching kind of idea.

Team kizb
Joined: 8 Mar 01
Posts: 219
Credit: 3,709,162
RAC: 0
Germany
Message 1321988 - Posted: 30 Dec 2012, 8:16:34 UTC

Things seem to be working better now, I woke up this morning and all my uploads had finally completed and I had 102 to crunch.
____________
My Computers:
Blue Offline
Green Offline
Red Offline

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 6972
Credit: 57,215,787
RAC: 22,297
Germany
Message 1322031 - Posted: 30 Dec 2012, 9:57:20 UTC
Last modified: 30 Dec 2012, 10:40:45 UTC

Maybe the maxed-out internet connection of SAH is because of the stock AP GPU app for NV and ATI?

Now all GPUs out there can crunch AP WUs.

Does the app work correctly on all systems?

Maybe the AP WUs fail (or the results don't match the wingman's results) and need to be sent to another PC.

If this happens more than once, you can imagine how the internet connection gets maxed out, because the same AP WUs are sent again and again to different PCs.

Just an idea.

(8 MB/AP WU)


[EDIT: 27 % of the AP WUs in my BOINC are > x_1 (x_2 & x_3).]
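
To put a rough number on the resend idea, here is a minimal Python sketch. It assumes every AP task costs one 8 MB download and that the 27 % share of x_2/x_3 tasks seen in one cache holds project-wide; both are assumptions, not measured server figures, and the function name is made up for illustration.

```python
AP_WU_SIZE_MB = 8.0  # per-task download size quoted in the post


def resend_overhead(resend_fraction: float) -> float:
    """Return total traffic relative to a resend-free baseline.

    If a fraction f of all tasks sent out are resends, only (1 - f) of the
    traffic is first-time work, so the total is 1 / (1 - f) times the baseline.
    """
    if not 0.0 <= resend_fraction < 1.0:
        raise ValueError("resend_fraction must be in [0, 1)")
    return 1.0 / (1.0 - resend_fraction)


# Assumed: the 27 % figure from one host's cache applies to the whole project.
factor = resend_overhead(0.27)
resend_mb_per_100 = 100 * AP_WU_SIZE_MB * 0.27
print(f"traffic factor: {factor:.2f}x")
print(f"resend traffic per 100 tasks sent: {resend_mb_per_100:.0f} MB")
```

Under those assumptions the resends alone would add roughly a third on top of the first-time download traffic.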


* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
____________
BR



>Das Deutsche Cafe. The German Cafe.<

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 6972
Credit: 57,215,787
RAC: 22,297
Germany
Message 1322118 - Posted: 30 Dec 2012, 11:13:10 UTC - in response to Message 1322031.
Last modified: 30 Dec 2012, 11:17:16 UTC

Sutaru Tsureku wrote:
(...)
[EDIT: 27 % of the AP WUs in my BOINC are > x_1 (x_2 & x_3).]


OK, I looked at all the x_2 and x_3 AP WUs and found the following wingmen whose PCs produce very many errors, or only errors, with the stock AP GPU app:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5304693
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5369208
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5465293
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5810180
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6028483
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6201705
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6247733
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6616302
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6704517
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6757607
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6795698

Why do all these PCs produce errors with the stock AP GPU app?

If there are many more PCs out there like the examples above, no wonder the SAH internet connection is continuously maxed out.


Two with a wrong CPU app?
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2750818
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6708061



TBar
Volunteer tester
Joined: 22 May 99
Posts: 1080
Credit: 31,089,831
RAC: 84,548
United States
Message 1322148 - Posted: 30 Dec 2012, 12:51:09 UTC - in response to Message 1321988.

Things seem to be working better now, I woke up this morning and all my uploads had finally completed and I had 102 to crunch.

Just as I was about to go to bed, I noticed this new error:
12/29/2012 10:59:52 PM | SETI@home | Computation for task 08oc12ab.18183.8656.7.10.195_1 finished
12/29/2012 10:59:52 PM | SETI@home | Starting task 08oc12ab.18183.8656.7.10.76_0 using setiathome_enhanced version 609 (cuda23) in slot 3
12/29/2012 10:59:54 PM | SETI@home | Started upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:00:10 PM | | Project communication failed: attempting access to reference site
12/29/2012 11:00:10 PM | SETI@home | Temporarily failed upload of 08oc12ab.18183.8656.7.10.195_1_0: can't resolve hostname
12/29/2012 11:00:10 PM | SETI@home | Backing off 3 min 22 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:00:11 PM | | Internet access OK - project servers may be temporarily down.
12/29/2012 11:03:04 PM | SETI@home | Started upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:03:20 PM | | Project communication failed: attempting access to reference site
12/29/2012 11:03:20 PM | SETI@home | Temporarily failed upload of 08oc12ab.18183.8656.7.10.195_1_0: can't resolve hostname
12/29/2012 11:03:20 PM | SETI@home | Backing off 4 min 19 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:03:21 PM | | Internet access OK - project servers may be temporarily down.
12/29/2012 11:03:29 PM | SETI@home | Started upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:03:45 PM | | Project communication failed: attempting access to reference site
12/29/2012 11:03:45 PM | SETI@home | Temporarily failed upload of 08oc12ab.18183.8656.7.10.195_1_0: can't resolve hostname
12/29/2012 11:03:45 PM | SETI@home | Backing off 13 min 17 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:03:46 PM | | Internet access OK - project servers may be temporarily down.
12/29/2012 11:06:44 PM | SETI@home | Started upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:07:15 PM | | Project communication failed: attempting access to reference site
12/29/2012 11:07:15 PM | SETI@home | Temporarily failed upload of 08oc12ab.18183.8656.7.10.195_1_0: can't resolve hostname
12/29/2012 11:07:15 PM | SETI@home | Backing off 16 min 32 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:07:16 PM | | Internet access OK - project servers may be temporarily down.
12/29/2012 11:09:33 PM | SETI@home | Started upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:10:21 PM | SETI@home | Finished upload of 08oc12ab.18183.8656.7.10.195_1_0
12/29/2012 11:10:21 PM | SETI@home | Sending scheduler request: To fetch work.
12/29/2012 11:10:21 PM | SETI@home | Reporting 1 completed tasks, requesting new tasks for CPU and NVIDIA and ATI
12/29/2012 11:10:23 PM | SETI@home | Computation for task ap_26no12ad_B1_P0_00062_20121227_08468.wu_0 finished
12/29/2012 11:10:23 PM | SETI@home | Starting task ap_25no12ad_B6_P1_00044_20121227_21707.wu_0 using astropulse_v6 version 604 (ati_opencl_100) in slot 2
12/29/2012 11:10:25 PM | SETI@home | Started upload of ap_26no12ad_B1_P0_00062_20121227_08468.wu_0_0
12/29/2012 11:11:02 PM | SETI@home | Finished upload of ap_26no12ad_B1_P0_00062_20121227_08468.wu_0_0
12/29/2012 11:11:15 PM | SETI@home | Scheduler request completed: got 1 new tasks
.....

Since then, all the Uploads have completed in around 30 seconds or less. It's almost as if someone gave Bruno the reboot. So far all is well.
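
For anyone who wants to pull the back-off delays out of a log like the one above, a minimal Python sketch follows. It assumes the "Backing off N min M sec on upload of ..." wording seen in these lines; longer delays that include hours are not handled, and the helper name is hypothetical.

```python
import re

# Matches lines such as "... | Backing off 3 min 22 sec on upload of <name>".
# The "N min" part is treated as optional; hours are not handled here.
BACKOFF_RE = re.compile(r"Backing off (?:(\d+) min )?(\d+) sec on upload of (\S+)")


def backoff_seconds(log_lines):
    """Yield (file_name, seconds) for every upload back-off line found."""
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match:
            minutes = int(match.group(1) or 0)
            seconds = int(match.group(2))
            yield match.group(3), minutes * 60 + seconds


# Four back-off lines copied from the log above.
sample = [
    "12/29/2012 11:00:10 PM | SETI@home | Backing off 3 min 22 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0",
    "12/29/2012 11:03:20 PM | SETI@home | Backing off 4 min 19 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0",
    "12/29/2012 11:03:45 PM | SETI@home | Backing off 13 min 17 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0",
    "12/29/2012 11:07:15 PM | SETI@home | Backing off 16 min 32 sec on upload of 08oc12ab.18183.8656.7.10.195_1_0",
]
delays = [secs for _, secs in backoff_seconds(sample)]
print(delays)
```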

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,637,488
RAC: 44,463
Australia
Message 1322403 - Posted: 30 Dec 2012, 21:29:40 UTC - in response to Message 1322148.
Last modified: 30 Dec 2012, 21:29:58 UTC

Still getting lots of Scheduler timeouts & the occasional no header or data, but not nearly as many as before.
The upload problem appears to be no more- uploads start within a couple of seconds & are at 10-15kB/s.
____________
Grant
Darwin NT.

EdwardPF
Volunteer tester
Joined: 26 Jul 99
Posts: 228
Credit: 42,445,420
RAC: 50,737
United States
Message 1322504 - Posted: 31 Dec 2012, 3:26:00 UTC
Last modified: 31 Dec 2012, 3:27:44 UTC

my 2 cents ...

It looked like a "shortie" storm to me ... all my local computers were running SHORT WU's

One of my computers had 100 WU's, all with an estimated run time of 4 min, down from the usual 24 min.

That would be a .... 6x i/o load increase??

Maybe a bad tape or 2 or 3 ...

Ed F
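
The 6x figure can be checked with a tiny Python sketch. It assumes the host crunches at a constant rate and that every task needs the same server contacts (one download, one upload, one report) regardless of run time; the function name is illustrative.

```python
def io_load_factor(normal_minutes: float, short_minutes: float) -> float:
    """How many times more often a host finishes, and so transfers, tasks."""
    if normal_minutes <= 0 or short_minutes <= 0:
        raise ValueError("run times must be positive")
    return normal_minutes / short_minutes


# 24 min is the usual estimate and 4 min the shortie estimate from the post.
print(io_load_factor(24, 4))
```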

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,637,488
RAC: 44,463
Australia
Message 1322506 - Posted: 31 Dec 2012, 3:43:00 UTC - in response to Message 1322504.

It looked like a "shortie" storm to me ... all my local computers were running SHORT WU's

That just exacerbated the problems that already existed before the shorties started coming through.

____________
Grant
Darwin NT.

Rolf
Joined: 16 Jun 09
Posts: 114
Credit: 7,800,622
RAC: 1
Switzerland
Message 1322549 - Posted: 31 Dec 2012, 9:38:43 UTC

Very good news!
Everything works as it "should":
- Uploads without timeouts
- Downloads as requested
31.12.2012 10:32:19 | SETI@home | Sending scheduler request: To fetch work.
31.12.2012 10:32:19 | SETI@home | Requesting new tasks for ATI
31.12.2012 10:34:10 | SETI@home | Scheduler request completed: got 20 new tasks
31.12.2012 10:34:12 | SETI@home | Started download of 09oc12ad.8746.67.12.10.98

Great last day of this year. Let the next year start like this!

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 2
United Kingdom
Message 1322644 - Posted: 31 Dec 2012, 13:26:47 UTC - in response to Message 1322549.

Great last day of this year. Let the next year start like this!

You are dreaming, next year starts like this :( NEWS
At least the servers will not be sending us any weird and unintelligible messages ......

Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 533
Credit: 1,578,457
RAC: 505
Greece
Message 1322658 - Posted: 31 Dec 2012, 14:16:47 UTC - in response to Message 1322644.

...next year starts like this :( NEWS


If next year started any differently, I'd think I had entered the Twilight Zone!

Situation Normal;)

Happy New Year everybody!

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1080
Credit: 31,089,831
RAC: 84,548
United States
Message 1322671 - Posted: 31 Dec 2012, 15:03:14 UTC
Last modified: 31 Dec 2012, 15:14:42 UTC

Even though the Upload Stalls appear to have been corrected, transfer stalls are still a pain. This morning I woke up to a page of 'Ready to Reports'. I had six stalled downloads with a 32 minute wait time and 22 files waiting to be reported and replaced. The machine had gone through over 20% of its GPU cache in a few hours waiting on stalled downloads. I don't think there should be a transfer Timeout of more than about 10 minutes. After 10 minutes the stalled activity starts becoming a problem. Maybe a rework of the transfer timeouts is in order? I kinda like timeouts of 2, 4, 6, and 10, with 10 minutes being the maximum timeout period.

Fortunately, everything corrected itself rather quickly after the Retry button was used a few times...
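
The suggested schedule could be sketched in Python as below. This only illustrates the proposal (steps of 2, 4, 6 and 10 minutes, capped at 10); it is not the BOINC client's actual back-off logic, and the names are hypothetical.

```python
PROPOSED_STEPS_MIN = (2, 4, 6, 10)  # schedule suggested in the post


def retry_delay_minutes(failed_attempts: int) -> int:
    """Delay before the next retry after `failed_attempts` failures in a row."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    # Walk up the schedule, then stay at the last (maximum) step.
    index = min(failed_attempts, len(PROPOSED_STEPS_MIN)) - 1
    return PROPOSED_STEPS_MIN[index]


print([retry_delay_minutes(n) for n in range(1, 7)])
```

With a cap like this, a stalled transfer would never wait longer than 10 minutes between retries, however many attempts have failed.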

CLYDE
Volunteer tester
Joined: 9 Aug 99
Posts: 801
Credit: 17,675,720
RAC: 34,252
United States
Message 1322693 - Posted: 31 Dec 2012, 16:06:00 UTC
Last modified: 31 Dec 2012, 16:07:15 UTC

Seti@Home = SNAFU

Just a little fun.

HAPPY NEW YEAR EVERYONE!!!
____________

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 11,478,609
RAC: 5,135
United States
Message 1322721 - Posted: 31 Dec 2012, 17:16:02 UTC - in response to Message 1322693.

One of the troublesome things for me with this project is that when the scheduler gets sick (which it does periodically) one of its 'sick modes' can be obstructive of other projects in terms of reporting, updating, uploading, etc. That is, the scheduler sometimes in its failure mode holds the workstation in 'reporting mode' exclusively (no other project can communicate with the workstation) for as much as 10 minutes. Ideally when the scheduler is in 'I'm confused' mode, it would simply issue a quick time out (say at 1 minute or 2 minutes) instead of putting things on hold for 10 minutes.

There have been times when, if I have no SETI work on a workstation to work on or report, I simply detach after several minutes so other projects can report without SETI in obstructive mode. When I do that, eventually I'll rejoin that workstation to SETI. However, 'eventually' is defined as a solid week for SETI -- and that often doesn't happen. Looks like it might not happen at all this coming month with the nth effort to correct lab electrical issues, plus the nth effort to correct the air conditioning in the server closet.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8278
Credit: 45,027,516
RAC: 13,616
United Kingdom
Message 1322742 - Posted: 31 Dec 2012, 17:51:59 UTC - in response to Message 1322721.

One of the troublesome things for me with this project is that when the scheduler gets sick (which it does periodically) one of its 'sick modes' can be obstructive of other projects in terms of reporting, updating, uploading, etc. That is, the scheduler sometimes in its failure mode holds the workstation in 'reporting mode' exclusively (no other project can communicate with the workstation) for as much as 10 minutes. Ideally when the scheduler is in 'I'm confused' mode, it would simply issue a quick time out (say at 1 minute or 2 minutes) instead of putting things on hold for 10 minutes.

Actually, the server doesn't hold on to anything - it simply doesn't send a reply at all. The timeout is how long your client is prepared to wait - and (in recent versions), it's configurable.

If you're running v6.12.27 or later, check out <http_transfer_timeout> in client configuration - options.

Note that this will affect uploads/downloads as well, and that sometimes both scheduler contacts and data transfers do eventually work after a long pause. Use at your own discretion.
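
As a rough illustration of the setting, here is a short Python sketch that builds the relevant fragment. It assumes the usual cc_config.xml layout (a `<cc_config>` root with an `<options>` section); the 120-second value and the helper name `build_cc_config` are illustrative only, not recommendations.

```python
import xml.etree.ElementTree as ET


def build_cc_config(timeout_seconds: int) -> str:
    """Return a cc_config.xml fragment that sets <http_transfer_timeout>."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    timeout = ET.SubElement(options, "http_transfer_timeout")
    timeout.text = str(timeout_seconds)
    return ET.tostring(root, encoding="unicode")


# 120 seconds is an arbitrary example value, not a tested recommendation.
print(build_cc_config(120))
```

If a cc_config.xml already exists, the new element would need to be merged into its `<options>` section rather than replacing the whole file.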

EdwardPF
Volunteer tester
Joined: 26 Jul 99
Posts: 228
Credit: 42,445,420
RAC: 50,737
United States
Message 1322974 - Posted: 1 Jan 2013, 3:13:54 UTC
Last modified: 1 Jan 2013, 3:18:44 UTC

are we up??

Ed F

[edit] I guess the post got here ok but the graph looks like we are down

Ed F

12/31/2012 10:17:05 PM | SETI@home | Reporting 2 completed tasks, requesting new tasks for NVIDIA GPU
12/31/2012 10:17:11 PM | | Project communication failed: attempting access to reference site
12/31/2012 10:17:11 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer
12/31/2012 10:17:12 PM | | Internet access OK - project servers may be temporarily down.

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 2
United Kingdom
Message 1323003 - Posted: 1 Jan 2013, 4:01:07 UTC

It's broken.
Yup, the cricket graph has run out of green ink.
The cricket is dead.
Is that the server's way of saying `happy new year` #'"^&:(*&!.......

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8220
Credit: 21,849,058
RAC: 11,039
United Kingdom
Message 1323023 - Posted: 1 Jan 2013, 4:31:39 UTC - in response to Message 1323003.

It's broken.
Yup, the cricket graph has run out of green ink.
The cricket is dead.
Is that the server's way of saying `happy new year` #'"^&:(*&!.......

But the servers stopped speaking at 02:50; they couldn't last the course at the New Year's party.

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 2
United Kingdom
Message 1323026 - Posted: 1 Jan 2013, 4:39:23 UTC

The SSP still looks good so eye dunO what's up with it all.
It's the millennium bug just a bit late
or the unix bug or excel
or

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4143
Credit: 1,005,763
RAC: 271
United States
Message 1323033 - Posted: 1 Jan 2013, 4:56:37 UTC - in response to Message 1323026.

The SSP still looks good so eye dunO what's up with it all.
It's the millennium bug just a bit late
or the unix bug or excel
or

The SSP hasn't updated since 2:50:23 UTC; subtracting the 8 hour Berkeley offset gives 18:50:23 PST. That closely matches the time when Cricket fell.

Maybe it's the end of the world a little late?
Joe
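
The UTC-to-PST subtraction can be double-checked with a few lines of Python. This sketch assumes the last SSP update fell on 1 Jan 2013 (the post gives only the time) and uses a fixed UTC-8 offset, which is fine in winter when no daylight saving applies.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset for Pacific Standard Time (no daylight saving handling).
PST = timezone(timedelta(hours=-8), "PST")

# Assumed date: the post gives only the time of the last SSP update.
last_update_utc = datetime(2013, 1, 1, 2, 50, 23, tzinfo=timezone.utc)
last_update_pst = last_update_utc.astimezone(PST)
print(last_update_pst.isoformat())
```

Note that the converted time lands on the evening of 31 Dec 2012 in Berkeley.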


Copyright © 2014 University of California