Message boards : Number crunching : Panic Mode On (77) Server Problems?
Slavac (Joined: 27 Apr 11, Posts: 1932, Credit: 17,952,639, RAC: 0)
Now if only I had a large stack of money for more bandwidth. One day, maybe. Correct for the most part. The only tie-in to the large line is over 2 miles away and would run under a large section of the University, so installing such a line would likely be very expensive. The current gigabit line feeds the entire SSL lab. SETI is currently utilizing 10% of the line, and as I understand it, gaining a larger percentage of the connection is largely political.
Executive Director, GPU Users Group Inc. - brad@gpuug.org
Slavac (Joined: 27 Apr 11, Posts: 1932, Credit: 17,952,639, RAC: 0)
The plan right now, pending specs, is to build a dedicated upload and download server soon. This one will be slated for nothing but replacing our two remaining old servers. Combine that with a load balancer, the new switch, George and the JBOD array, and we should be heading in the right direction. I don't know, but I'll ask one of the guys. Eric did confirm that if we get the load balancer working like we want, we could likely stop the round-robin dead-connection issues. I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software. I'll let you guys know what I find out when I hear something back.
Executive Director, GPU Users Group Inc. - brad@gpuug.org
Donald L. Johnson (Joined: 5 Aug 02, Posts: 8240, Credit: 14,654,533, RAC: 20)
"Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again, I agree, is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do."
Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests to me that he IS aware of the problems. But I also believe that much of the difficulty is the increased traffic due to the "shortie storm", which continues unabated, and high-performance crunchers, with caches set for more than 2-3 days' worth of work, trying to fill those caches through the 100 Mbit/s pipe and the 100-tasks-per-5-seconds Feeder process. The system is just swamped, and will be until the "shortie storm" abates.
Donald
Infernal Optimist / Submariner, retired
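A quick back-of-envelope check of the Feeder ceiling Donald mentions, as a minimal C++ sketch; the 100-tasks-per-5-seconds figure comes from the post, the rest is plain arithmetic:

```cpp
// Feeder ceiling implied above: 100 tasks released every 5 seconds.
#include <iostream>

int main() {
    const double tasks_per_refill = 100.0;
    const double refill_seconds   = 5.0;
    double rate = tasks_per_refill / refill_seconds;   // 20 tasks/s
    std::cout << "Feeder ceiling: " << rate << " tasks/s, "
              << rate * 3600.0 << " tasks/hr\n";       // 72,000 tasks/hr
    return 0;
}
```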
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
"I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software."
My understanding is there are 100 "slots" in the feeder: it can hold 100 WUs at a time. When we get "Project has no tasks available" while the server status shows there are hundreds of thousands ready to go, it's usually because the feeder was empty at the time of the request. In the past, no matter the level of demand for work, even after an extended outage, you didn't get that response very often at all. But over the last few months, and the last 3 weeks in particular, it has become more and more frequent. Looking back through my client logs, I've been getting more "Project has no tasks available" responses than actual allocations of work.
I'm not sure what limits the feeder to 100 slots, but I don't think that needs increasing (at this stage). As I said, in the past it was a very infrequent response to a work request, but it would appear that the Scheduler/feeder system has reached some sort of limit and can't actually feed the feeder anywhere near as quickly as it used to. Add to that the "No tasks sent" message becoming more frequent (once again, I expect, due to the system not being able to feed the feeder), and now all of the Scheduler timeouts. Maybe more RAM or disks to improve I/O on the Scheduler and feeder systems?
And just to add to the woes: since the outage the MB splitters have been limited to 40/s, and a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s. End result: the Ready-to-Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours), and now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download.
Grant
Darwin NT
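To make the slot mechanism Grant describes concrete, here is a minimal sketch, assuming the feeder behaves like a fixed-size staging buffer between the database and the scheduler. The class and names are hypothetical, not the actual BOINC server source:

```cpp
// Minimal model of a 100-slot feeder: requests that arrive while the
// slots are empty get "Project has no tasks available", even though
// the ready-to-send pool behind it may hold hundreds of thousands.
#include <cstddef>
#include <deque>
#include <iostream>
#include <optional>
#include <string>

constexpr std::size_t SLOT_COUNT = 100;  // fixed number of feeder slots

class Feeder {
    std::deque<std::string> slots;       // work units currently staged
public:
    // Periodically refill the slots from the ready-to-send pool; if this
    // step falls behind demand, requests fail despite plentiful work.
    void refill(std::deque<std::string>& ready_to_send) {
        while (slots.size() < SLOT_COUNT && !ready_to_send.empty()) {
            slots.push_back(ready_to_send.front());
            ready_to_send.pop_front();
        }
    }
    // A scheduler request takes a staged task, or comes away empty.
    std::optional<std::string> handle_request() {
        if (slots.empty()) return std::nullopt;
        std::string wu = slots.front();
        slots.pop_front();
        return wu;
    }
};

int main() {
    std::deque<std::string> ready = {"wu_001", "wu_002"};  // huge in reality
    Feeder feeder;
    feeder.refill(ready);
    for (int i = 0; i < 4; ++i) {
        auto wu = feeder.handle_request();
        std::cout << (wu ? "sent " + *wu
                         : std::string("Project has no tasks available"))
                  << '\n';
    }
    return 0;
}
```

The point of the sketch is the failure mode described above: raising SLOT_COUNT would not help if the refill step itself cannot keep pace with requests.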
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
"Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again, I agree, is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do."
That's the download problems, which we've had for a year or two now. The new problems relate to not being able to upload, and to problems getting work from the Scheduler, or the Scheduler just timing out.
Grant
Darwin NT
Donald L. Johnson (Joined: 5 Aug 02, Posts: 8240, Credit: 14,654,533, RAC: 20)
"Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again, I agree, is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do."
Just looked at the Server Status page: the Master Database shows 1100+ queries/second. That is a lot of traffic, most of it (I presume) Scheduler-related. And that is just what's getting through the pipe. The pipe is swamped, and assuming Matt's changes to the download servers solve, or at least improve, that issue, it will still take a while to relieve the congestion. And as long as the high-performance crunchers are getting Tasks that take less time to crunch than to upload and report...
Donald
Infernal Optimist / Submariner, retired
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
Results Ready to Send is now less than 300, and the result creation rate is 30/s (it needs to be at least 40/s to build up any sort of buffer under the present load).
Grant
Darwin NT
Slavac (Joined: 27 Apr 11, Posts: 1932, Credit: 17,952,639, RAC: 0)
"I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software."
Thanks very much, Grant. I'll pass this along as well to see if we can hunt down what the underlying issue is.
Executive Director, GPU Users Group Inc. - brad@gpuug.org
MusicGod (Joined: 7 Dec 02, Posts: 97, Credit: 24,782,870, RAC: 0)
Only my iMac and Asus laptop are getting CPU work; my desktop units are only getting GPU work.
Eric Korpela (Joined: 3 Apr 99, Posts: 1382, Credit: 54,506,847, RAC: 60)
We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?
@SETIEric@qoto.org (Mastodon)
Wiggo (Joined: 24 Jan 00, Posts: 34744, Credit: 261,360,520, RAC: 489)
"We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?"
Things have been going well here for the last 10 hrs, but who can say how long that will last; this is the best it's been in the last few weeks.
Cheers.
Horacio (Joined: 14 Jan 00, Posts: 536, Credit: 75,967,266, RAC: 0)
"We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?"
I had a very busy "retry" day for uploads until around 10 hours ago... I don't think there is any geographic similarity between me and the rest of the users in this forum, but I don't know if I'm on the same "internet path"... Now it seems "normal", i.e. with the usual retries and backoffs that BOINC can handle without (ab)using the retry button...
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
"We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?"
It's been OK since about 09:00 hours your time. Prior to that, since the weekly outage, uploads were pretty much impossible. Your server stats bear it out: prior to the outage, 100,000/hr were being returned. After the outage it quickly peaked at 120,000, then dropped down to barely 40,000. It gradually crept up to 60,000. Once the dam broke it hit 160,000, and it has leveled off at around 100,000-110,000 per hour since then.
EDIT: it looks like the Ready to Send, Result Creation Rate and Average Result Turnaround Time updates all died at about the same time; those numbers have been stale for a few hours now.
The problem now is getting work: only about 1 in 3 to 1 in 5 requests results in work. The rest get "Project has no tasks available", "No tasks sent" or "Timeout was reached" messages. It's not as bad as it was, but it's still occurring.
Grant
Darwin NT
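As an aside, those hourly return counts are consistent with Eric's server-side figure; a quick check (all numbers come from the two posts above):

```cpp
// Cross-check: Grant's returns-per-hour versus Eric's 20-30 uploads/s.
#include <iostream>

int main() {
    const double per_hour[] = {40000, 100000, 110000, 160000};
    for (double h : per_hour)
        std::cout << h << "/hr = " << h / 3600.0 << " uploads/s\n";
    // 100,000/hr is ~27.8/s, right at the top of Eric's 20-30/s range.
    return 0;
}
```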
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14650, Credit: 200,643,578, RAC: 874)
Not having any new tapes loaded for splitting might account for the lack of new work.
shizaru (Joined: 14 Jun 04, Posts: 1130, Credit: 1,967,904, RAC: 0)
"Only getting GPU units, not getting any at all for CPU, been like this for a couple of days."
There's a workaround if you're willing to jump through a few hoops. Click Account (at the bottom and/or top of this page), then click SETI@home preferences and set "Use Nvidia GPU" to No. Next, go to BOINC Manager in Advanced view, select SETI@home in the Projects tab, and hit Update. Open the Event Log (the Messages tab, for others with older clients) and wait for the next request (it should come within five minutes). The next request should be for CPU work. Just remember to go back and re-enable the GPU in your preferences when you've filled up :)
Link (Joined: 18 Sep 03, Posts: 834, Credit: 1,807,369, RAC: 0)
"Not having any new tapes loaded for splitting might account for the lack of new work."
There were quite a few tapes there, but they disappeared at the same moment the SETI@home science database was disabled (about 1 hour ago).
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
"Not having any new tapes loaded for splitting might account for the lack of new work."
As Link noted above, there were several "tapes" still to be split at the time I posted; the problem was that the rate of splitting was considerably less than the demand.
Grant
Darwin NT
juan BFP (Joined: 16 Mar 07, Posts: 9786, Credit: 572,710,851, RAC: 3,799)
"We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?"
DL/UL are now normal, but the problem, at least from our side, always returns when the AP splitters start; they are off now.
Wiggo (Joined: 24 Jan 00, Posts: 34744, Credit: 261,360,520, RAC: 489)
"We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?"
The AP splitters are fine as long as they're only doing 1 or 2 new files at a time. Once over that is when things start falling apart, as I've noticed over several months now (in fact, about as far back as when "synergy" took over a lot of the AP splitting).
Cheers.
Alaun (Joined: 29 Nov 05, Posts: 18, Credit: 9,310,773, RAC: 0)
Bandwidth is obviously an issue here. I've been wondering why it's restricted, but the last few posts have helped. So if I understand it right:
1) SETI@home's servers are in the SSL building on the UC Berkeley campus. The SSL building is way up on a hill.
2) SETI@home has purchased a gigabit connection to the outside world through Hurricane Electric.
3) The Hurricane Electric line terminates somewhere across campus, and all our traffic must move through the University's network, specifically through a single fiber going up the hill to the SSL building.
4) Right now the University is giving SETI@home 10% of that line, or 100 Mbit/s (a quick check of that figure follows below).
5) In order to get more bandwidth down to the Hurricane Electric switch, the University needs to grant permission to use more of its network.
6) This is tricky because of politics and the need to serve people on campus.
Right?
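For point 4, the arithmetic behind the 100 Mbit/s figure, as a minimal sketch; only the line size and the 10% share come from the thread, the byte conversion is standard:

```cpp
// Point 4 above: SETI@home's share of the gigabit line, in both units.
#include <iostream>

int main() {
    const double line_mbit = 1000.0;   // the gigabit Hurricane Electric line
    const double share     = 0.10;     // the University's current allocation
    double mbit = line_mbit * share;   // 100 Mbit/s
    std::cout << mbit << " Mbit/s = " << mbit / 8.0
              << " MB/s for all uploads and downloads combined\n";  // 12.5 MB/s
    return 0;
}
```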