Message boards : Number crunching : Panic Mode On (66) Server problems?
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65745 · Credit: 55,293,173 · RAC: 49

Dunno what's up with the scheduler/feeder again... Lots of tasks to download, but they're just out of reach. Maybe the server needs a medic alert bracelet? ;) Night all.

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
Wiggo · Joined: 24 Jan 00 · Posts: 34744 · Credit: 261,360,520 · RAC: 489

I'm sucking down everything I can get, when I'm not hitting the limits. You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Cheers.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304

> You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Nope. The problem isn't downloading the work; the problem is getting the work allocated to be downloaded.

Grant
Darwin NT
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004

> You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Correctamundo.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57

> You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Indeed. I noticed this morning that several of my machines are listing MD5 and download errors in the past 24 hours. I haven't done any looking into it, as there are many more successes than failures. However, I wonder if it's one of the download servers having an issue, both of them, or a line issue. So many places to look for troubleshooting.

SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
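For context, BOINC guards each download with an MD5 checksum, which is why a corrupt transfer surfaces as an MD5 error rather than a silent failure. A minimal sketch of that kind of check (the function name and usage are illustrative, not the client's actual code):

```python
import hashlib

def md5_matches(path, expected_hex):
    """Compare a file's MD5 digest with the checksum the server advertised."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large workunit files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

A failed comparison is the kind of event the client logs as an MD5 error, after which it treats the transfer as failed.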
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65745 · Credit: 55,293,173 · RAC: 49

Well let's see: can't download 4 WUs, and I have 185 WUs to report. No worries about those, as they're uploading just fine...

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65745 · Credit: 55,293,173 · RAC: 49

> Well let's see: can't download 4 WUs, and I have 185 WUs to report. No worries about those, as they're uploading just fine...

No worries, they finally downloaded; the last 3 found a hole and went that way <-- -->...

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
hbomber · Joined: 2 May 01 · Posts: 437 · Credit: 50,852,854 · RAC: 0

No complaining here :) Since I hammered the IP of boinc2.ssl.berkeley.edu into my hosts file, ALL problems with the connection went away completely.
hbomber · Joined: 2 May 01 · Posts: 437 · Credit: 50,852,854 · RAC: 0

Seriously, I did it once per host, the hammering. Got tired of losing time finding a working, non-stammering proxy, only to find a few hours later that my host was dry because it chews units faster than they can be downloaded. It very nearly made me whine :)
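For anyone wanting to try the same trick: "hammering" the IP means adding a static entry to the hosts file (/etc/hosts on Linux, C:\Windows\System32\drivers\etc\hosts on Windows) so the download host resolves without a DNS lookup. The address below is an illustrative placeholder from the documentation range, not the project's real IP:

```
# hosts-file entry pinning the download server (example address only)
198.51.100.42    boinc2.ssl.berkeley.edu
```

The trade-off is that if the project ever moves the server to a new address, the stale entry silently breaks downloads until it is removed by hand.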
Dave Stegner · Joined: 20 Oct 04 · Posts: 540 · Credit: 65,583,328 · RAC: 27

+1

Dave
Dave Stegner · Joined: 20 Oct 04 · Posts: 540 · Credit: 65,583,328 · RAC: 27

Comms are really trashed. SLWS009:

4500 SETI@home 2/3/2012 10:43:30 Requesting new tasks for CPU
4501 SETI@home 2/3/2012 10:44:03 Scheduler request failed: Transferred a partial file
4502 2/3/2012 10:44:04 Project communication failed: attempting access to reference site
4503 2/3/2012 10:44:05 Internet access OK - project servers may be temporarily down.

Dave
Sunny129 · Joined: 7 Nov 00 · Posts: 190 · Credit: 3,163,755 · RAC: 0

> Initial AP-only cache building is a painful process, but it'll get there eventually.

Any chance of building up an AP cache right now if crunching on a GPU (specifically an HD 5870)? Or will such a GPU plow through AP tasks at a rate that far exceeds the current AP task production rate? The reason I ask is because I used to be able to maintain a cache of ~50 AP tasks for my HD 5870, but that was ages (~6 months) ago... Back then I used to see 6,000-12,000 AP "results ready to send" on the server status page regularly. I occasionally see small amounts of AP work ready to send, but that stat reads zero most of the time these days...
Mike · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80

I stopped processing APs a month ago. My 5850 can finish 24-30 APs a day. No chance under these conditions. I'll just wait.

With each crime and every kindness we birth our future.
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

It's likely not to work out very well for you. They don't pile up when they are being created; they are assigned and sent out nearly instantly. The problem is that there is nothing for a day or so, and then all of a sudden a few new tapes show up, get chewed through in 2-4 hours, and then it's back to a day or so of no new APs being made. If you happen to catch the cycle just right, you could probably gather about 50 or so, but you have to make sure you are requesting work every 5 minutes, which means babysitting the "update" button.

For me, since the DCF on this new machine is still stabilizing, a 4-day cache with 3 allowed cores means 3 APs. As the DCF comes down, the number of tasks will increase. Until then, I'm empty most of the time.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
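The cache arithmetic above works roughly like this: the client multiplies its runtime estimate for a task by the Duration Correction Factor (DCF) and fetches only enough tasks to fill the cache at that corrected estimate. A toy model of the idea (my own simplification, not the client's actual work-fetch code; the 24-hour AP runtime and the DCF values are made-up numbers):

```python
def tasks_in_cache(cache_days, cores, est_task_hours, dcf):
    """How many tasks fit a cache when each task's runtime estimate
    is inflated by the Duration Correction Factor (DCF)."""
    corrected_hours = est_task_hours * dcf      # client's adjusted estimate
    per_core = int(cache_days * 24 / corrected_hours)
    return per_core * cores

# A freshly inflated DCF of 4.0 on long AP tasks: only 3 tasks fit.
print(tasks_in_cache(cache_days=4, cores=3, est_task_hours=24, dcf=4.0))
# As the DCF settles toward 1.0, the same cache asks for 12.
print(tasks_in_cache(cache_days=4, cores=3, est_task_hours=24, dcf=1.0))
```

This is why a host with a high DCF sits nearly empty: every task looks four times longer than it really is, so the cache "fills" after only a handful of downloads.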
Sunny129 · Joined: 7 Nov 00 · Posts: 190 · Credit: 3,163,755 · RAC: 0

I wonder if Astropulse will ever return to semi-regular work production. I mean, I've always been aware that AP work production has always paled in comparison to Multibeam work production... but as I said before, I was at one time able to maintain a cache of ~50 AP tasks for my HD 5870 GPU... and by "one time" I mean several weeks, perhaps even a few months, during which AP work was being produced regularly... and this was no more than a year ago. I imagine there is no real answer to my inquiry, and that we'll just have to wait it out and see what happens...
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

Actually, there is sort of an answer.

Way back around the beginning of GPU crunching, AP was easy to acquire. At the time there were two AP splitters and five MB splitters, and the feeder was arranged so that of every 100 tasks it held, 97 were MB and 3 were AP. With this setup, AP was chewing through tapes at roughly twice the speed of MB. At some point, disk space became an issue when there were somewhere around 60 50GB "tapes" waiting for MB to split that had already been chewed through by AP. In addition, GPU crunching came along and added much more demand on the project as a whole. The servers started having I/O issues and couldn't keep up, so the precision of MB was doubled, doubling the FFTs and therefore the crunching time. This was a stop-gap to reduce stress on the servers, as some of the super-crunchers would then only need 10,000 MBs/day instead of 20,000.

I'm not entirely sure when it happened, but that 97/3 feeder allocation has since gone away. My proof: about a year ago, when the new servers were installed and brought online, I personally got 20+ APs in one work request more than once. If 97/3 were still in effect, nobody should ever get more than 3 APs in a single work request.

There was a time with the cricket graph where you could watch and see when AP was being assigned, even though it had been split and stockpiled for a while. Through some detective work, it was discovered that AP would be split and stockpiled in the database, and then every ~6 hours there was about a 75% chance that any work being issued was AP. The task ID numbers and the splitter information in the file names suggested that AP was being split at nearly the same time as MB, but there would be several tens of thousands of consecutive wuIDs in a row that were nothing but AP. So the database was stockpiling, and when the stockpile hit a certain point or time interval, that was nearly the only thing that got issued. Outside of that mass exodus of APs, getting one was nearly impossible.

At some point after that, GPU crunching became faster and more of a database strain, this periodic assignment of APs went away, and with that, APs became truly luck of the draw. Couple that with several straight weeks of the pipe being maxed out just for MB downloads (which it never used to be, except for 3-10 hours after the weekly maintenance), plus the limits that have been put in place to fix some DCF issue, and AP is a rare find.

As I said previously, if you look at which tapes are being split, AP is nearly always idle, because it has already chewed through everything. It will sit for 6-36 hours without any new tapes while MB either catches up or scheduling actually works without issues. See, the MB splitters slow down and stop splitting once "ready to send" hits somewhere in the 200-250k range. With the limits in place this happens frequently, so there are 250k MBs ready to be assigned, but the majority of hosts are at their limits, so there is nobody to take them. That means the creation rate is low (less than 1/sec), and it also means no new APs until more tapes get completed.

I have also mentioned in the past few weeks that the number of splitters can and should be adjusted. We should go back down to one or two AP splitters and have at least 6 for MB. That would slow AP's chewing down significantly, and the tapes could be gone through efficiently if MB and AP chew on the same tape at the same time. There may be I/O issues with this, but there are ways to offset and balance them. Work assignment could probably be smoothed out if we went back to the 97/3 scheme and dropped the number of AP splitters. That would make APs less difficult to get, but shouldn't significantly increase "results in the field".

[/history lesson]

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
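The 97/3 feeder split described in that history lesson can be pictured with a toy model; this is an illustration of the idea, not the project's scheduler code, with the slot count and AP share taken from the post:

```python
import random

def fill_feeder(n_slots=100, ap_share=3):
    """Fill the feeder's slot array: ap_share AP tasks per n_slots,
    the rest MB, shuffled so a work request hits a random mix."""
    slots = ["AP"] * ap_share + ["MB"] * (n_slots - ap_share)
    random.shuffle(slots)
    return slots

slots = fill_feeder()
print(slots.count("MB"), slots.count("AP"))  # 97 3
```

Under such a scheme, a single scheduler pass over the feeder could never hand one host more than 3 APs, which is why receiving 20+ in one request implied the ratio was no longer in force.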
Dave · Joined: 29 Mar 02 · Posts: 778 · Credit: 25,001,396 · RAC: 0

Can't report. It's cold in here.
Sunny129 · Joined: 7 Nov 00 · Posts: 190 · Credit: 3,163,755 · RAC: 0

Thanks for the detailed response, Cosmic. Though I do recall that for me, it wasn't quite that long ago that I was able to get AP tasks regularly. In fact, back when I was able to maintain a cache of ~50 AP tasks for the GPU at all times, the number of AP splitters was already at 6... and it was working fine that way for a while, until the next major project outage/server maintenance period. Things just haven't been the same since...
James Sotherden · Joined: 16 May 99 · Posts: 10436 · Credit: 110,373,059 · RAC: 54

Anyone else getting timeouts on work units? I have 19 so far.

Old James
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65745 · Credit: 55,293,173 · RAC: 49

> Anyone else getting timeouts on work units? I have 19 so far.

Downloading? Lots. I have one left; I had, I think, about 20 last night before hitting the sack, so I let the PC handle them instead, as they'd keep going to retry almost immediately if not sooner.

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.