Panic Mode On (78) Server Problems?

Message boards : Number crunching : Panic Mode On (78) Server Problems?

Cherokee150
Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 1306212 - Posted: 14 Nov 2012, 21:05:53 UTC - in response to Message 1306107.  

Regarding my earlier post, Message 1306107, I will also add that I did notice the massive increase in tasks sent me in the days leading up to all our current problems. At the time, I thought it might be due to the shorties, but now I realize the schedulers were sending me all those tasks regardless of the estimated run time.

Perhaps I'm making an incorrect assumption here, but it would seem to me that, if the schedulers were sending thousands of hosts nearly three times the correct number of units, it would cause some of the problems we have been getting.

There have always been problems with getting uploads and downloads, and there have also been many times APs were sent out. The real disaster seems to be something newly introduced, along with the newly imposed limits of 100/CPU-100/GPU.

Please, let's get some discussion going on this observation of mine. Can anyone else find a similar history in your logs? It might not be readily apparent to the really big crunchers because you usually have problems keeping your caches full. What about others, though?

Thanks! :)
ID: 1306212
Brother Frank
Joined: 10 Dec 11
Posts: 26
Credit: 15,142,410
RAC: 0
United States
Message 1306261 - Posted: 14 Nov 2012, 23:32:18 UTC - in response to Message 1306198.  

Cherokee150,

I know something went very seriously wrong after an update or maintenance period three or four weeks ago. Things went very well for a few hours and then I stopped getting work. Seti@Home has never been right since.

My notebooks without dedicated graphics cards are getting work in and then back out fine. One has a RAC of about 700, another has a RAC of about 1,250, and a single-core machine has a RAC of about 175. (That's an older Pentium machine, about ten years old.)

My desktops with Nvidia GTX 550 Ti cards are the ones having a lot of trouble. My Core Duo with one Nvidia GTX 550 Ti is about to run out of work and is getting lots of timeouts and some transient https failures. It has a RAC of about 7,400, but normally has been around 8,000. My i7 2600k machine with two Nvidia GTX 550 Ti cards normally has a RAC of 22,000 to 24,000, but is down to about 18,000 now. Again, lots of timeouts, "no work available" messages, transient https failures, and backoffs. I have had to play with the NNT button a lot with the desktops to keep them in work, but that tends to foul up the getting-work routine.

I also have a newer notebook with an i7-2670qm processor and an Nvidia 525m graphics processor. It tends to keep its queue full, but sometimes I've had to play with the NNT and update buttons. I think I get upset with it all and kinda forget for a few minutes that it runs pretty much OK if I leave it unattended and just check in twice a day. It has a RAC of just about 5,000.

I don't know for sure about the number of ghosts, but I believe both the desktops have a lot of them because I see a message every so often about recovering lost work or jobs. With the new notebook running at about 5,000 RAC I should have an overall RAC of about 37,000 or close, but I am running several thousand behind that even though we run 24/7. This is another indication to me that something might be pretty wrong in getting work credit or sending work back. I don't know what is wrong.
I am not a technical or database expert, but like I said near the start of this message something very wrong happened with a maintenance cycle around 4 or 5 weeks ago and the problem has remained in both sending work out and getting new work. Brother Frank.
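
As a rough check on the RAC figures quoted above, here is a small Python sketch that adds them up. The machine labels are made up for illustration, and the low end of the 22,000 to 24,000 range is assumed for the i7 desktop; the numbers themselves are the approximate ones from this post.

```python
# Approximate per-machine RAC figures quoted in the post.
# The dictionary keys are hypothetical labels, not real host names.
normal_rac = {
    "notebook_a": 700,
    "notebook_b": 1250,
    "single_core_pentium": 175,
    "core_duo_one_gtx550ti": 8000,
    "i7_2600k_two_gtx550ti": 22000,  # assumed: low end of 22,000-24,000
    "i7_2670qm_notebook": 5000,
}

# Same machines, with the two desktops at their currently reported RAC.
current_rac = dict(
    normal_rac,
    core_duo_one_gtx550ti=7400,
    i7_2600k_two_gtx550ti=18000,
)


def total_rac(per_machine):
    """Sum the per-machine RAC values."""
    return sum(per_machine.values())


expected_total = total_rac(normal_rac)
current_total = total_rac(current_rac)
shortfall = expected_total - current_total

print(f"expected ~{expected_total}, current ~{current_total}, behind by ~{shortfall}")
# prints "expected ~37125, current ~32525, behind by ~4600"
```

Under those assumptions the totals line up with the "about 37,000" and "several thousand behind" estimates.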

Black, white, and tabby kitties here are worried about me. Henri, my big tabby, lies on my chest comforting me from PSA (Post Seti Anxiety).
ID: 1306261
chromespringer
Joined: 3 Dec 05
Posts: 296
Credit: 55,183,482
RAC: 0
United States
Message 1306293 - Posted: 15 Nov 2012, 1:51:12 UTC

at the time of this posting, with 0 tasks to process and 0 tasks to report in BOINC Manager, i have 134 ghost tasks that have a deadline of 11/17 and 178 ghost tasks that have a deadline of 11/18. these tasks are allocated to mach xxxx033 at a rate of 20 per scheduler request.
i have only successfully contacted project 3 times today due to scheduler time outs, for a total of 60 cpu/gpu tasks downloaded and completed .. at this rate i will have approx. 14 error-ed time outs on 11/17 and 118 error-ed time outs on 11/18 (if my math is correct) :)
the last successful download i received was at 2:41 pm mst today. all uploads are painfully completed utilizing NNT.
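
A minimal Python sketch of the arithmetic above, for anyone who wants to plug in their own ghost counts. It assumes the post was written on 14 Nov 2012 local time, that 60 tasks per day (3 successful contacts at 20 tasks each) keep coming in on every remaining full day, and that tasks with the earliest deadline are recovered first. The function name and the model itself are illustrative assumptions, not anything taken from BOINC.

```python
from datetime import date, timedelta


def projected_timeouts(ghosts_by_deadline, tasks_per_day, today):
    """Estimate how many ghost tasks are still outstanding at each deadline.

    Assumes tasks_per_day tasks are recovered on each full day after `today`
    and before a deadline, earliest deadline first.
    """
    expired = {}
    spare = 0                          # unused capacity carried forward
    day = today + timedelta(days=1)    # first full day still available
    for deadline in sorted(ghosts_by_deadline):
        days_left = max((deadline - day).days, 0)
        capacity = spare + days_left * tasks_per_day
        pending = ghosts_by_deadline[deadline]
        expired[deadline] = max(pending - capacity, 0)
        spare = max(capacity - pending, 0)
        day = max(day, deadline)
    return expired


ghosts = {date(2012, 11, 17): 134, date(2012, 11, 18): 178}
projection = projected_timeouts(ghosts, tasks_per_day=60, today=date(2012, 11, 14))
for deadline, count in projection.items():
    print(deadline.isoformat(), count)
# prints "2012-11-17 14" then "2012-11-18 118"
```

With those assumptions the sketch reproduces the 14 and 118 figures.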
ID: 1306293
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1306308 - Posted: 15 Nov 2012, 3:49:29 UTC

Oops... for about 4 hours now there has been no U/L or D/L activity shown on the Cricket graph.


My BOINC:
U/L - OK
D/L - a few tries needed for complete
Report & Request - successful, a few not, successful, a few not ..


* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
ID: 1306308
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1306319 - Posted: 15 Nov 2012, 4:36:13 UTC

.
.
Scheduler request failed: Timeout was reached
.
.
.
.
Scheduler request failed: HTTP Internal Server Error
.
.
.
.
Scheduler request failed: Failure when receiving data from peer
.
.

ID: 1306319
Lee Gresham
Joined: 12 Aug 03
Posts: 159
Credit: 130,116,228
RAC: 0
United States
Message 1306334 - Posted: 15 Nov 2012, 5:31:42 UTC

Can someone point me to a forum entry detailing how to limit the number of files being reported? I've looked thru the boinc & seti XMLs and didn't see a likely entry. I know I've done this before, but as they say, "the memory is the second thing to go". I forget what's first!

Thanks..............
Delta-V
ID: 1306334
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1306336 - Posted: 15 Nov 2012, 5:35:14 UTC - in response to Message 1306334.  

Can someone point me to a forum entry detailing how to limit the number of files being reported? I've looked thru the boinc & seti XMLs and didn't see a likely entry. I know I've done this before, but as they say, "the memory is the second thing to go". I forget what's first!

Thanks..............


<cc_config>
  <options>
    <max_tasks_reported>N</max_tasks_reported>
  </options>
</cc_config>

N=whatever number you set.


You have to add the line.
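
If you'd rather not edit the file by hand, the same change can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only a sketch: the function name is made up, it works on the file contents as a string, and reading or writing the actual `cc_config.xml` in your BOINC data directory (whose location varies by system) is left to you.

```python
import xml.etree.ElementTree as ET


def set_max_tasks_reported(cc_config_xml, limit):
    """Return cc_config XML text with <max_tasks_reported> set to `limit`.

    Creates <cc_config>, <options>, and the entry itself if they are missing.
    """
    if cc_config_xml.strip():
        root = ET.fromstring(cc_config_xml)
    else:
        root = ET.Element("cc_config")  # start from an empty config
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    entry = options.find("max_tasks_reported")
    if entry is None:
        entry = ET.SubElement(options, "max_tasks_reported")
    entry.text = str(limit)

    return ET.tostring(root, encoding="unicode")


existing = "<cc_config><options></options></cc_config>"
updated = set_max_tasks_reported(existing, 20)  # 20 is just an example value
print(updated)
```

Other entries already in the file are kept as they are; only the one setting is added or overwritten.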
ID: 1306336
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 1306338 - Posted: 15 Nov 2012, 5:42:28 UTC - in response to Message 1306334.  
Last modified: 15 Nov 2012, 5:43:54 UTC

I got back from work. One system completely out of work, the other running out of CPU work (GPU ran out ages ago).
Every single Scheduler request while i was at work timed out.

I set No New Tasks & managed to report them all, although one machine took almost 5 min before it got a response. I then allowed new work & one machine then requested work & got some, the other requested work and got a Scheduler Timeout.
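
For anyone who wants to script that No New Tasks, report, then allow-work sequence instead of clicking through BOINC Manager, here is a rough Python sketch built around the `boinccmd` project operations `nomorework`, `update` and `allowmorework`. The project URL, the 5 minute pause and the function names are assumptions for illustration, and `boinccmd` has to be on the PATH with access to the running client.

```python
import subprocess
import time

# Assumed project URL; use whatever URL your client lists for the project.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def nnt_report_commands(project_url):
    """Commands that set No New Tasks and trigger a report-only update."""
    return [
        ["boinccmd", "--project", project_url, "nomorework"],
        ["boinccmd", "--project", project_url, "update"],
    ]


def allow_work_commands(project_url):
    """Commands that allow new work again and trigger a work request."""
    return [
        ["boinccmd", "--project", project_url, "allowmorework"],
        ["boinccmd", "--project", project_url, "update"],
    ]


def run_cycle(project_url, pause_seconds=300, runner=subprocess.run, sleep=time.sleep):
    """Report finished tasks with NNT set, wait, then ask for new work."""
    for command in nnt_report_commands(project_url):
        runner(command, check=True)
    sleep(pause_seconds)  # give the slow scheduler time to answer the report
    for command in allow_work_commands(project_url):
        runner(command, check=True)


# Usage (not run here): run_cycle(PROJECT_URL)
```

The pause is there because the report itself can take minutes to get a response; the script does not check whether the report actually succeeded before asking for work.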

The graphs show AP work still going out. I'm quite sure if the AP work stopped going out, i'd then be able to contact the Scheduler again.


As someone pointed out some time ago, Synergy has a lot of processes, and i think it's got more than it can handle.
Grant
Darwin NT
ID: 1306338
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22707
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1306347 - Posted: 15 Nov 2012, 6:31:36 UTC

Grant, in another thread you suggested using a proxy. Have you considered that this problem may be a routing problem between your empty cruncher and the lab, and that the cruncher concerned is attempting to use a different route from the ones whose connection is OK? I have two crunchers sharing the same internet connection, but when I do a traceroute they show that they actually, consistently, use different paths through the internet to almost every server I look at....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1306347
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 1306352 - Posted: 15 Nov 2012, 6:43:39 UTC - in response to Message 1306347.  

Grant, in another thread you suggested using a proxy - have you considered that this problem may be a routing problem between your empty cruncher and the lab and that the cruncher concerned is attempting to use a different one to those whose connection is OK

I used the proxy for both systems.
Proxy- both get a response, no proxy- neither get a response when trying to report & request work. No proxy- both sometimes get work if not reporting tasks.

So we'll go with the proxy, till it gets canned & i have to find another one.
Grant
Darwin NT
ID: 1306352
Lee Gresham
Joined: 12 Aug 03
Posts: 159
Credit: 130,116,228
RAC: 0
United States
Message 1306357 - Posted: 15 Nov 2012, 6:58:58 UTC - in response to Message 1306336.  

<cc_config>
  <options>
    <max_tasks_reported>N</max_tasks_reported>
  </options>
</cc_config>

N=whatever number you set.


You have to add the line.



Thanks Again
Delta-V
ID: 1306357
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1306360 - Posted: 15 Nov 2012, 7:05:45 UTC

My requests in the past 24 hours have been about 50/50 time-out and success. Most of the time-outs end up with a "resent lost task" on the next successful contact. This has been how it has been working all day and evening, until I managed to be issued an AP with a six minute deadline. Well there goes my consecutive valid streak of ~1100. "didn't resend lost task..(expired)" *sigh*
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1306360
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1306370 - Posted: 15 Nov 2012, 7:57:07 UTC
Last modified: 15 Nov 2012, 7:59:32 UTC

All my tasks got auto-abandoned on the server side over night, once again!!!
http://setiathome.berkeley.edu/results.php?hostid=5323998&offset=0&show_names=0&state=6&appid=
I've had it with this project now.
ID: 1306370
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 1306371 - Posted: 15 Nov 2012, 8:00:09 UTC - in response to Message 1306370.  

All my tasks got auto-aborted over night, once again!!!
http://setiathome.berkeley.edu/results.php?hostid=5323998&offset=0&show_names=0&state=6&appid=
I've had it with this project now.

Even with all the weirdness going on, i've yet to have that happen- not even once.
You haven't got any software that mucks about with your system clock at all?
Grant
Darwin NT
ID: 1306371
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1306372 - Posted: 15 Nov 2012, 8:02:43 UTC - in response to Message 1306371.  

All my tasks got auto-aborted over night, once again!!!
http://setiathome.berkeley.edu/results.php?hostid=5323998&offset=0&show_names=0&state=6&appid=
I've had it with this project now.

Even with all the weirdness going on, i've yet to have that happen- not even once.
You haven't got any software that mucks about with your system clock at all?

No, I don't. Like I said, it happened on the server side! My host didn't even succeed in making a scheduler request. This is the second time it happened. I'm not the only one this happened to; there were other users reporting the same thing.
ID: 1306372
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 1306374 - Posted: 15 Nov 2012, 8:35:29 UTC - in response to Message 1306372.  
Last modified: 15 Nov 2012, 8:36:45 UTC

No, I don't. Like I said, it happened on the server side! My host didn't even succeed in making a scheduler request. This is the second time it happened. I'm not the only one this happened to; there were other users reporting the same thing.

What version of BOINC?
Which OS & version?

EDIT- ie, were they the same as yours or different?
Grant
Darwin NT
ID: 1306374
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1306420 - Posted: 15 Nov 2012, 12:50:40 UTC - in response to Message 1306374.  

No, I don't. Like I said, it happened on the server side! My host didn't even succeed in making a scheduler request. This is the second time it happened. I'm not the only one this happened to; there were other users reporting the same thing.

What version of BOINC?
Which OS & version?

EDIT- ie, were they the same as yours or different?

Linux/x64, BOINC 7.0.39
Others have had it happen on Windows (don't know about BOINC version, most definitely not 7.0.39).
I don't think it's the client's fault but the server's. I assume it receives a malformed request (due to networking problems) and thinks the project was reset.
The client wasn't even notified about that and continued to crunch already abandoned tasks, so I had to abort them manually (OK, I accidentally aborted a few more tasks that weren't "abandoned"). I had to use a proxy for that; otherwise that computer on that ISP rarely manages to contact the scheduler without timing out.
ID: 1306420
fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1306421 - Posted: 15 Nov 2012, 12:58:25 UTC

The last few days I have been getting a few "error while downloading" messages.
Any ideas as to what is causing this?

Frank
ID: 1306421
W-K 666 (Project Donor)
Volunteer tester
Joined: 18 May 99
Posts: 19525
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1306438 - Posted: 15 Nov 2012, 14:23:14 UTC

Reading this, "New technology can improve public WiFi connections by 700 per cent", it looks like we could do with these people from South Carolina U to try and speed up the Seti pipe.
ID: 1306438
Brother Frank
Joined: 10 Dec 11
Posts: 26
Credit: 15,142,410
RAC: 0
United States
Message 1306440 - Posted: 15 Nov 2012, 14:28:10 UTC - in response to Message 1306421.  

Frank, I've been having a lot of this too, and all my desktops with Nvidia GTX 550 Ti graphics processors have been slowly running out of work. Over the last week or so I have gone through the same routine again and again, with gradually decreasing success: set No New Tasks and update, let the tasks upload, then allow new tasks and update again a few minutes later. As of about 6 a.m. this morning I was out of all work on both desktops. I've switched over to my old standbys: GPUGrid, World Community Grid, and a few other cosmology projects.

Some of us believe the problem is with the scheduler not being able to keep track of tasks completed and sending out way too many tasks. Many of us have many dozens, even many hundreds, of ghost tasks in the system. There also seems to be an association with AstroPulse work being split and sent out, which may be fouling the rest of the scheduling work. The internet bandwidth of the 100 meg line from the lab to BOINC seems to be far beyond capacity.

Some of us have noticed this big problem of getting work and reporting work since a maintenance downtime about 3 or 4 weeks ago. I noticed that the system seemed to come out of that maintenance fine, and all my computers were happily sending work out and getting new work without timeouts or failures. My recollection is that it (Seti at Home) stopped running well after just a few hours.

My notebooks, even the one with an Nvidia 525m graphics processor and an i7 2670qm processor, are still getting some work and reporting it, but they are having dozens and dozens of reporting failures and timeouts every day. My i3 notebook with Intel integrated graphics is doing fine. My little Core Duo notebook, whose Radeon 2600 series processor doesn't qualify to run jobs, is working fine too. As I wrote earlier, I am gradually switching over to alternative projects with my desktops beginning today.

I have never seen it this bad in my year here and am rethinking my priorities. Right now I am thinking a new project mix will be 2 parts medical discovery and disease fighting projects along with some Seti at Home work with my notebooks. I have fought this kind of chronic frustration at work before and it is too stressful for many people to handle well. The low limits on work per cpu and graphics processor are already hitting my desktops even though they were both down to just a few dozen short work units each. The limits will not help with the problem at all according to what I have read here. From my point of view, the download/upload issue became much more severe after that maintenance downtime around 4 weeks ago. I remember it all happened not too long after my wife and I got back from a memorial service for a close family member around mid October. We had just returned from visiting our families for an extended period and were building RAC up again slowly. Momentum stopped. Sorry, I don't have enough data to track it back to an exact date. I hope the Seti folks address this very, very soon. They are way understaffed, but I know they have the project's interests at heart. I know too that there are times when a project just has to step back and solve serious issues that may negatively affect project morale if left without at least a partial or interim solution. Brother Frank on Seti at Home.
ID: 1306440


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.