Panic Mode On (20) Server problems

Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 918537 - Posted: 16 Jul 2009, 20:42:22 UTC


Since this morning the GPU cruncher has been sitting idle
(for maybe 12 hours).

And it had a ~4-day WU cache.

I now have ~1,600 results ready for UL.
But they can't go home.

Maybe one result gets through every few minutes.

Very well.. if it continues like this.. the PC will request new work in a few days (one week?).

Ohh well..


Will this UL traffic continue? For how long?

OTOH, will the UL traffic be better in future.. or not? The GPU cruncher wants to send ~800 results / day to Berkeley. [normal ARs]
If that isn't possible, I can switch off the GPU cruncher..
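
A rough back-of-the-envelope check of those numbers, as a sketch only. The 1,600 and 800 figures come from the post; "every few minutes" is taken as one upload per 3 minutes, which is purely an assumption, so the results are only as good as that guess.

```python
# Figures from the post, plus one labeled assumption.
READY = 1_600            # results waiting to upload (from the post)
PRODUCED_PER_DAY = 800   # GPU output per day at normal ARs (from the post)
MINUTES_PER_UPLOAD = 3   # ASSUMPTION: "every few minutes" read as 3 minutes

# Uploads that get through per day at the assumed rate.
uploads_per_day = 24 * 60 / MINUTES_PER_UPLOAD

# Days needed to clear the existing pile if no new results are added.
days_to_clear = READY / uploads_per_day

# Daily backlog growth if the GPU kept producing at its normal pace.
net_growth_per_day = PRODUCED_PER_DAY - uploads_per_day

print(f"uploads per day:        {uploads_per_day:.0f}")
print(f"days to clear backlog:  {days_to_clear:.1f}")
print(f"backlog growth per day: {net_growth_per_day:.0f}")
```

Under that assumption the upload rate is below what the GPU produces per day, so the pile would keep growing for as long as the cruncher stays busy.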

ID: 918537 · Report as offensive
Profile jay_e

Send message
Joined: 6 Apr 03
Posts: 62
Credit: 1,072,112
RAC: 0
United States
Message 918591 - Posted: 17 Jul 2009, 0:26:27 UTC - in response to Message 918168.  

So far, I've waited since Sunday.

Don't know why that should be the case.
From what I can recall, on Sat, Sun & most of Monday there were no problems with uploads. Late Monday (for reasons unknown) the upload server went offline & it only came back online a couple of hours ago (if that).


Yes, I understand about traffic - just want to know what the average number of days one should wait for a job to upload.

Even when things are congested, most uploads will go through in a few hours. If it's really bad it might take 12-24 hrs. Usually they go through on the first attempt.



Hi Grant,

Thanks for the info!!
One WU made it through overnight.

Now I know that I should try to force the uploads....

I got the rest to go by using the BOINC Manager advanced view:
Advanced-> "Do Network Communication" Over and over and over.
For every three or four cycles, maybe one WU made it.

I run SETI@home on another two laptops in two cities -
one on cable modem, the other on DSL.
Both had the problem of WUs not uploading.

Same solution worked: "Do Network Communication" over and over.


Jay
ID: 918591 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 918635 - Posted: 17 Jul 2009, 5:34:28 UTC - in response to Message 914643.  


Things appear to be working ok at the moment, but the network traffic for the past 18 hours or so looks rather odd to say the least.
Outbound traffic- a couple of short, sharp drops, then nice & steady until it took a huge dive for a few hours. Came back up again, but still a few sharp drops here & there.
As for inbound traffic- started off ok & gradually increased, then it became a sine wave of gradually increasing amplitude. Leveled off for a short while but then started to become a bit jagged.
Grant
Darwin NT
ID: 918635 · Report as offensive
Profile Geek@Play
Volunteer tester
Avatar

Send message
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 918637 - Posted: 17 Jul 2009, 5:45:33 UTC
Last modified: 17 Jul 2009, 5:56:25 UTC

Grrrr...................

Uploads failing again. Bandwidth is maxxed out again but AP work is NOT being generated or sent out. So what gives????

[edit]In fact never did get all my uploads in. They seem to error out instantly and go back into the delayed time backoff mode.
Boinc....Boinc....Boinc....Boinc....
ID: 918637 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19093
Credit: 40,757,560
RAC: 67
United Kingdom
Message 918639 - Posted: 17 Jul 2009, 6:27:30 UTC - in response to Message 918637.  

Grrrr...................

Uploads failing again. Bandwidth is maxxed out again but AP work is NOT being generated or sent out. So what gives????

[edit]In fact never did get all my uploads in. They seem to error out instantly and go back into the delayed time backoff mode.

See Eric K's post 918637.
ID: 918639 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 918640 - Posted: 17 Jul 2009, 6:30:51 UTC - in response to Message 918639.  


Inbound traffic just hit 32.48Mb/s. I think we have a new record.
Grant
Darwin NT
ID: 918640 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 918669 - Posted: 17 Jul 2009, 11:05:49 UTC - in response to Message 918640.  


Boy that upload server is copping a hammering.
Usual rate is around 45-50,000 results per hour. It's been averaging 100,000 for 14 hours now (hitting a peak of 250,590).
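
To put those hourly figures in per-second terms, here is a small arithmetic sketch. It only uses the numbers quoted in the post; the helper name `per_second` is made up for the example.

```python
# Figures quoted in the post; nothing here is measured independently.
USUAL_PER_HOUR = (45_000, 50_000)   # "usual rate" range
STORM_PER_HOUR = 100_000            # average over the last 14 hours
STORM_HOURS = 14
PEAK_PER_HOUR = 250_590             # quoted peak

def per_second(results_per_hour: float) -> float:
    """Convert an hourly result count to results per second."""
    return results_per_hour / 3600

# Total results handled during the 14-hour stretch.
storm_total = STORM_PER_HOUR * STORM_HOURS

# Midpoint of the usual range, used as a single comparison value.
usual_mid = sum(USUAL_PER_HOUR) / 2
storm_ratio = STORM_PER_HOUR / usual_mid

print(f"results in {STORM_HOURS} h: {storm_total:,}")
print(f"storm average: {per_second(STORM_PER_HOUR):.1f} results/s")
print(f"peak:          {per_second(PEAK_PER_HOUR):.1f} results/s")
print(f"storm vs usual midpoint: {storm_ratio:.2f}x")
```

So the server has been running at roughly double its usual load for the whole period, with the peak hour at about five times the usual rate.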
Grant
Darwin NT
ID: 918669 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918675 - Posted: 17 Jul 2009, 11:15:58 UTC - in response to Message 918669.  
Last modified: 17 Jul 2009, 11:18:03 UTC

Boy that upload server is copping a hammering.
Usual rate is around 45-50,000 results per hour. It's been averaging 100,000 for 14 hours now (hitting a peak of 250,590).

It always happens. I think they keep a special stash of "shorty only" tapes on the shelf, so they can slip a few on after an outage and really gum the works up.

Edit - did you see that blip earlier when the main database was doing over 1,000 queries a second? That was the validators playing catchup.
ID: 918675 · Report as offensive
Joseph Monk

Send message
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 918683 - Posted: 17 Jul 2009, 11:27:41 UTC

I now have a nice backlog of work, enough to last me a few days at least, and uploads seem to be going in on a regular basis... just wish I could get some AP units so my CPUs get busy too.
ID: 918683 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 918713 - Posted: 17 Jul 2009, 13:13:53 UTC - in response to Message 918639.  
Last modified: 17 Jul 2009, 13:14:42 UTC

Grrrr...................

Uploads failing again. Bandwidth is maxxed out again but AP work is NOT being generated or sent out. So what gives????

[edit]In fact never did get all my uploads in. They seem to error out instantly and go back into the delayed time backoff mode.

See Eric K's post 918637.


. . . WK - THAT Link seems to go to 'POST' [must be because of ALL the 'Moving' around of Posts / Messages DOH!]
BOINC Wiki . . .

Science Status Page . . .
ID: 918713 · Report as offensive
Profile Gundolf Jahn

Send message
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 918715 - Posted: 17 Jul 2009, 13:30:24 UTC - in response to Message 918713.  
Last modified: 17 Jul 2009, 13:34:41 UTC

Grrrr...................

Uploads failing again. Bandwidth is maxxed out again but AP work is NOT being generated or sent out. So what gives????

[edit]In fact never did get all my uploads in. They seem to error out instantly and go back into the delayed time backoff mode.

See Eric K's post 918637.


. . . WK - THAT Link seems to go to 'POST' [must be because of ALL the 'Moving' around of Posts / Messages DOH!]

Then it was perhaps this one (918297). (They only differ in two digits ;-)

Regards,
Gundolf
[edit]Ohhh, and Eric K embedded the live diagram, very naughty :-) [/edit]
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 918715 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918716 - Posted: 17 Jul 2009, 13:52:38 UTC - in response to Message 918715.  

[edit]Ohhh, and Eric K embedded the live diagram, very naughty :-) [/edit]

I think we can trust Eric to know how much of the Cricket Admin's (and the campus's) bandwidth SETI can afford to use - you can't embed that graph by mistake, it takes a fair amount of effort. It's only regenerated once every four minutes, so it isn't exactly live streaming. Even if it was, it would have precisely zero effect on the Hurricane Electric link carrying our data.
ID: 918716 · Report as offensive
Profile Geek@Play
Volunteer tester
Avatar

Send message
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 918718 - Posted: 17 Jul 2009, 14:12:25 UTC

I will attempt to explain what is happening here again. Only about 25% of my uploads are going through. The rest end up with................

Upload pending, retry in xx:xx:xx

I know that this is normal during a recovery phase. Now, 3 days after the normal Tuesday outage, are we still in a recovery phase?

I check my computers with Boinc View. It shows 50 or so in this backoff state. I highlight all of them and tell boinc view to retry the uploads. 1 in 4 is uploaded, the rest INSTANTLY go into a deeper backoff. Normally at this point in time all my work is uploaded and I am not experiencing any backoffs.

So I continue to massage the retry option in boinc view. Within a few minutes they are all uploaded again. My point is that if uploads are so easily done now manually, why were these work units in a backoff state to begin with? They should have uploaded themselves on the first try.

Even now the transfers tab is loading up with retries again. Yet if I retry manually 25% will upload and again in a few minutes all transfers are done.

So my basic contention is that 75% of uploads are now failing that normally get through fine at this point in the recovery period.

Eric's change has dramatically improved the uploads but now I see a different problem.
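
The "1 in 4 per retry, all done in a few minutes" pattern is what a fixed per-attempt success chance would produce. A minimal sketch, assuming every retry round is independent and succeeds 25% of the time (that independence is an assumption, and the function names are invented for the example):

```python
import math

P_SUCCESS = 0.25   # "1 in 4 is uploaded" per retry round (from the post)
PENDING = 50       # "50 or so" transfers in backoff (from the post)

def expected_remaining(pending: int, p: float, rounds: int) -> float:
    """Expected number of transfers still pending after `rounds`
    retry-everything rounds, each succeeding with probability `p`."""
    return pending * (1 - p) ** rounds

def rounds_until_at_most(pending: int, p: float, threshold: float = 1.0) -> int:
    """Smallest number of rounds after which no more than `threshold`
    transfers are expected to remain."""
    return math.ceil(math.log(threshold / pending) / math.log(1 - p))

# Average number of attempts one transfer needs (geometric distribution).
expected_attempts = 1 / P_SUCCESS

print(f"expected attempts per transfer: {expected_attempts:.0f}")
for n in (1, 5, 10):
    print(f"after {n:2d} rounds: {expected_remaining(PENDING, P_SUCCESS, n):.1f} left")
print(f"rounds until about one is left: {rounds_until_at_most(PENDING, P_SUCCESS)}")
```

Under these assumptions a dozen or so clicks of the retry button would empty the list, which fits with everything clearing within a few minutes even though three quarters of the attempts fail.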
Boinc....Boinc....Boinc....Boinc....
ID: 918718 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918720 - Posted: 17 Jul 2009, 14:20:06 UTC - in response to Message 918718.  

We're still showing historically high data rates (86 Mb/s out, 26 Mb/s in), so some retries are to be expected: Eric's change means that they now happen instantly instead of after 21 seconds.

This isn't recovery, this is an old-fashioned shorty storm, aided and abetted by CUDA.

ID: 918720 · Report as offensive
Profile Vistro
Avatar

Send message
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 918758 - Posted: 17 Jul 2009, 16:55:41 UTC

My parents just (reluctantly) let me install SETI in the editing suite, but the server claims it has no jobs.

and it won't look at cuda. But I think that might be a reb00t thing.

ID: 918758 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918766 - Posted: 17 Jul 2009, 17:57:29 UTC - in response to Message 918675.  

Boy that upload server is copping a hammering.
Usual rate is around 45-50,000 results per hour. It's been averaging 100,000 for 14 hours now (hitting a peak of 250,590).

It always happens. I think they keep a special stash of "shorty only" tapes on the shelf, so they can slip a few on after an outage and really gum the works up.

I think they have a new assistant helping select the next tapes: his name is Murphy.

ID: 918766 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918767 - Posted: 17 Jul 2009, 18:02:45 UTC - in response to Message 918718.  

Even now the transfers tab is loading up with retries again. Yet if I retry manually 25% will upload and again in a few minutes all transfers are done.

So my basic contention is that 75% of uploads are now failing that normally get through fine at this point in the recovery period.

Eric's change has dramatically improved the uploads but now I see a different problem.

The change improves uploads by getting rid of the ones that can't be serviced promptly instead of having the overhead of trying to hang on and hope the server can get to them in time.

This works because every time an upload succeeds, there is a little less traffic.

Eventually, enough will get through that the traffic will drop to the point that every upload will succeed, and then life will be good.
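
A toy model of that draining behaviour may help. It assumes, purely for illustration, that the server accepts a fixed number of uploads per time step and turns the rest away at once. The function name `drain` and all of the numbers are hypothetical, not measurements from the SETI@home servers.

```python
def drain(backlog: int, capacity: int, arrivals: int, max_ticks: int = 10_000):
    """Simulate a server that accepts at most `capacity` uploads per tick
    and immediately rejects the rest.

    Each tick, `arrivals` new results join the `backlog` and everything
    pending is attempted. Returns a list of (attempts, success_rate)
    tuples, stopping at the first tick where every attempt succeeds.
    """
    history = []
    for _ in range(max_ticks):
        attempts = backlog + arrivals
        accepted = min(attempts, capacity)
        success_rate = accepted / attempts if attempts else 1.0
        history.append((attempts, success_rate))
        backlog = attempts - accepted   # rejected uploads retry next tick
        if backlog == 0:
            break
    return history

# Hypothetical numbers: 300 queued uploads, room for 100 per tick,
# 25 fresh results arriving per tick.
for tick, (attempts, rate) in enumerate(drain(300, 100, 25), start=1):
    print(f"tick {tick}: {attempts} attempts, {rate:.0%} succeed")
```

As long as the capacity is larger than the arrival rate, the success rate climbs every tick until nothing is rejected. If arrivals exceeded capacity, the backlog would never clear in this model.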
ID: 918767 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918768 - Posted: 17 Jul 2009, 18:04:36 UTC - in response to Message 918766.  

Boy that upload server is copping a hammering.
Usual rate is around 45-50,000 results per hour. It's been averaging 100,000 for 14 hours now (hitting a peak of 250,590).

It always happens. I think they keep a special stash of "shorty only" tapes on the shelf, so they can slip a few on after an outage and really gum the works up.

I think they have a new assistant helping select the next tapes: his name is Murphy.

Either that, or a grad student with a typical student sense of humour.
ID: 918768 · Report as offensive
Profile Heflin

Send message
Joined: 22 Sep 99
Posts: 81
Credit: 640,242
RAC: 0
United States
Message 918770 - Posted: 17 Jul 2009, 18:06:19 UTC - in response to Message 918718.  

It shows 50 or so in this backoff state. I highlight all of them and tell boinc view to retry the uploads. 1 in 4 is uploaded, the rest INSTANTLY go into a deeper backoff.


So you are the reason that others' upload requests are failing.
The fail and backoff process is there for a reason.
Manually forcing retries REPEATEDLY just hammers the servers, making it worse for everyone.
As long as you have more work to process, chill out and let the backoff process work.
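
For anyone wondering what a backoff process buys the server, here is a minimal sketch of capped exponential backoff with random jitter. It is a generic illustration, not BOINC's actual client code: the function names, the 60-second base and the 4-hour cap are all assumptions picked for the example.

```python
import random
from typing import Optional

BASE_SECONDS = 60.0        # ASSUMPTION: first retry window, not a BOINC value
CAP_SECONDS = 4 * 3600.0   # ASSUMPTION: longest retry window, not a BOINC value

def backoff_ceiling(failures: int) -> float:
    """Upper bound of the wait after `failures` consecutive failed uploads.
    The window doubles with each failure until it reaches the cap."""
    exponent = min(failures, 32)   # keep 2 ** exponent small
    return min(CAP_SECONDS, BASE_SECONDS * 2 ** exponent)

def backoff_delay(failures: int, rng: Optional[random.Random] = None) -> float:
    """Pick a random wait in [0, ceiling] so that clients which failed
    together do not all come back at the same instant."""
    rng = rng or random.Random()
    return rng.uniform(0.0, backoff_ceiling(failures))

# Show how the retry window grows with consecutive failures.
for n in range(6):
    print(f"{n} failures: wait up to {backoff_ceiling(n):.0f} s")
```

The point of the growing, randomised wait is that a crowd of clients spreads itself out over time instead of arriving in one burst. Hitting retry by hand resets that spacing for one client.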

SETI@home since 1999
"Set it, and Forget it!"
ID: 918770 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918772 - Posted: 17 Jul 2009, 18:08:02 UTC - in response to Message 918770.  
Last modified: 17 Jul 2009, 18:08:53 UTC

It shows 50 or so in this backoff state. I highlight all of them and tell boinc view to retry the uploads. 1 in 4 is uploaded, the rest INSTANTLY go into a deeper backoff.


So you are the reason that others' upload requests are failing.
The fail and backoff process is there for a reason.
Manually forcing retries REPEATEDLY just hammers the servers, making it worse for everyone.
As long as you have more work to process, chill out and let the backoff process work.

There are two "saving graces" for a case like this:

1) There aren't enough of us worrying over this to hit the retry button over and over.

2) Eventually, all of his work will upload, and he'll be out of the way again.

Edit: unfortunately, we have to accept human behavior for what it is. It isn't a violation of the rules, but it isn't exactly good form either.
ID: 918772 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.