Panic Mode On (66) Server problems?

zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65745
Credit: 55,293,173
RAC: 49
United States
Message 1190887 - Posted: 2 Feb 2012, 7:57:35 UTC - in response to Message 1190885.  
Last modified: 2 Feb 2012, 7:58:00 UTC

Dunno what's up with the scheduler/feeder again...
Result creation rate is very low... bandwidth is not saturated, and yet I'm having little success filling caches even to the limits currently in place.

Not flowing well at the moment.

Lots of tasks to download, but they're just out of reach; maybe the server needs a medic alert bracelet? ;)

Night all.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1190887
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1190888 - Posted: 2 Feb 2012, 8:09:19 UTC - in response to Message 1190887.  

I'm sucking down everything I can get, when I'm not hitting the limits.

You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Cheers.
ID: 1190888
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1190899 - Posted: 2 Feb 2012, 9:23:25 UTC - in response to Message 1190888.  

You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Nope.
The problem isn't downloading the work, the problem is getting the work allocated to be downloaded.
Grant
Darwin NT
ID: 1190899
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1190922 - Posted: 2 Feb 2012, 12:10:58 UTC - in response to Message 1190899.  

You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Nope.
The problem isn't downloading the work, the problem is getting the work allocated to be downloaded.

Correctamundo.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1190922
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1190938 - Posted: 2 Feb 2012, 13:34:24 UTC - in response to Message 1190922.  

You people having problems are likely falling foul of that bit of as-yet-unidentified hardware along the connection that is playing up.

Nope.
The problem isn't downloading the work, the problem is getting the work allocated to be downloaded.

Correctamundo.

Indeed. I noticed this morning that several of my machines are showing MD5 and download errors in the past 24 hours.
I haven't done any looking into it, as there are many more successes than failures. However, I wonder if it's one of the download servers having an issue, both of them, or a line issue. So many places to look when troubleshooting.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1190938
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65745
Credit: 55,293,173
RAC: 49
United States
Message 1190950 - Posted: 2 Feb 2012, 14:12:42 UTC

Well, let's see: I can't download 4 WUs, and I have 185 WUs to report. Don't worry about them, as they're uploading just fine...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1190950
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65745
Credit: 55,293,173
RAC: 49
United States
Message 1190992 - Posted: 2 Feb 2012, 16:53:19 UTC - in response to Message 1190990.  

Well, let's see: I can't download 4 WUs, and I have 185 WUs to report. Don't worry about them, as they're uploading just fine...


Without proxy, no downloads. With proxy, full speed with downloads. Use the proxy, Luke :-)

No worries, they finally downloaded; the last 3 found a hole and went that way <-- -->...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1190992
hbomber
Volunteer tester
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1191011 - Posted: 2 Feb 2012, 18:17:01 UTC
Last modified: 2 Feb 2012, 18:17:28 UTC

No complaints here :)

Since I hard-coded the IP of boinc2.ssl.berkeley.edu in my hosts file, ALL connection problems went away completely.
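
For anyone wanting to try the same workaround, this is roughly what the hosts-file entry looks like. The address below is only a documentation placeholder, not the server's real IP; look the real one up yourself (e.g. with nslookup boinc2.ssl.berkeley.edu) before adding it. The file lives at /etc/hosts on Linux, or C:\Windows\System32\drivers\etc\hosts on Windows:

    # Pin the download server to a fixed address so flaky DNS is bypassed.
    # 203.0.113.7 is a placeholder - substitute the address nslookup returns.
    203.0.113.7    boinc2.ssl.berkeley.edu

One caveat: if the project ever moves the server to a new address, downloads will fail until the entry is updated or removed.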
ID: 1191011
hbomber
Volunteer tester
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1191034 - Posted: 2 Feb 2012, 20:04:45 UTC
Last modified: 2 Feb 2012, 20:07:06 UTC

Seriously, I only did it once per host, the hard-coding. I got tired of losing time hunting for a working, non-stuttering proxy, only to find a few hours later that my host was dry, because it chews units faster than they can be downloaded.

It came very close to making me "whine" :)
ID: 1191034
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1191254 - Posted: 3 Feb 2012, 18:27:40 UTC

+1
Dave

ID: 1191254
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1191255 - Posted: 3 Feb 2012, 18:45:10 UTC

Comms are really trashed.

SLWS009

4500 SETI@home 2/3/2012 10:43:30 Requesting new tasks for CPU
4501 SETI@home 2/3/2012 10:44:03 Scheduler request failed: Transferred a partial file
4502 2/3/2012 10:44:04 Project communication failed: attempting access to reference site
4503 2/3/2012 10:44:05 Internet access OK - project servers may be temporarily down.

Dave

ID: 1191255
Sunny129
Joined: 7 Nov 00
Posts: 190
Credit: 3,163,755
RAC: 0
United States
Message 1191256 - Posted: 3 Feb 2012, 18:49:45 UTC - in response to Message 1190245.  

Initial AP-only cache building is a painful process, but it'll get there eventually.

Any chance of building up an AP cache right now if crunching on a GPU (specifically an HD 5870)? Or will such a GPU plow through AP tasks at a rate that far exceeds the current AP task production rate? The reason I ask is because I used to be able to maintain a cache of ~50 AP tasks for my HD 5870, but that was ages (~6 months) ago... back then I used to see 6,000-12,000 AP "results ready to send" on the server status page regularly. I occasionally see small amounts of AP work ready to send, but that stat reads zero most of the time these days...
ID: 1191256
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1191309 - Posted: 3 Feb 2012, 22:13:24 UTC

I stopped processing APs a month ago.
My 5850 can finish 24-30 APs a day.
No chance under these conditions.

I'll just wait.



With each crime and every kindness we birth our future.
ID: 1191309
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1191320 - Posted: 3 Feb 2012, 22:27:12 UTC

It's likely not to work out very well for you. They don't pile up as they're created; they're assigned and sent out nearly instantly. The problem is that there is nothing for a day or so, then all of a sudden a few new tapes show up, they get chewed through in 2-4 hours, and then it's back to a day or so of no new APs being made.

If you happen to catch the cycle just right, you could probably gather about 50 or so, but you have to make sure you are requesting work every 5 minutes, which means babysitting the "update" button.
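
A sketch of how that babysitting could be automated with boinccmd, the command-line tool that ships with the BOINC client (the project URL, and the client running on the same machine, are assumptions; adjust for your setup):

    # babysit_update.py - a sketch, not official BOINC tooling.
    import subprocess
    import time

    PROJECT_URL = "http://setiathome.berkeley.edu/"  # assumed attached-project URL

    while True:
        # "boinccmd --project <URL> update" triggers a scheduler request,
        # the same as pressing the manager's "update" button.
        subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"])
        time.sleep(300)  # the 5-minute request interval mentioned above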

For me, since the DCF on this new machine is still stabilizing, a 4-day cache with 3 allowed cores means 3 APs. As the DCF comes down, the number of tasks will increase. Until then, I'm empty most of the time.
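
Rough arithmetic of why an inflated DCF starves the cache (a simplified model of the client's work-fetch estimate; all numbers are made up for illustration):

    # The client estimates runtime as raw_estimate * DCF, so a high DCF
    # makes tasks look longer and fewer of them fill the cache.
    cache_days = 4
    cores = 3
    raw_estimate_hours = 8.0   # assumed nominal per-task estimate
    dcf = 12.0                 # still inflated on a new host

    est_hours = raw_estimate_hours * dcf            # 96 hours per task
    tasks = (cache_days * 24 * cores) / est_hours   # -> 3.0 tasks
    print(f"tasks fetched: {tasks:.0f}")

As the DCF settles toward 1.0, the same arithmetic allows 36 tasks for the same cache settings.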
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1191320
Sunny129
Joined: 7 Nov 00
Posts: 190
Credit: 3,163,755
RAC: 0
United States
Message 1191331 - Posted: 3 Feb 2012, 23:18:26 UTC

I wonder if Astropulse will ever return to semi-regular work production. I mean, I've always been aware that AP work production has paled in comparison to Multibeam work production... but as I said before, I was at one time able to maintain a cache of ~50 AP tasks for my HD 5870 GPU... and by "one time" I mean several weeks, perhaps even a few months, during which AP work was being produced regularly... and this was no more than a year ago.

I imagine there is no real answer to my inquiry, and that we'll just have to wait it out and see what happens...
ID: 1191331
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1191406 - Posted: 4 Feb 2012, 7:09:25 UTC

Actually, there is sort of an answer. Way back, right around the beginning of GPU crunching, AP was easy to acquire. At the time, there were two AP splitters and five MB splitters. The feeder was arranged in such a way that of every 100 tasks it held, 97 were MB and 3 were AP.
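
As a toy model of that allocation (this is only an illustration of the scheme described above, with invented names, not actual BOINC server code):

    import random

    FEEDER_SLOTS = 100
    AP_SHARE = 3                       # 3 of every 100 slots held AP work
    MB_SHARE = FEEDER_SLOTS - AP_SHARE

    def fill_feeder(mb_queue, ap_queue):
        """Refill the feeder's slot window from the two per-app queues."""
        slots = [mb_queue.pop(0) for _ in range(min(MB_SHARE, len(mb_queue)))]
        slots += [ap_queue.pop(0) for _ in range(min(AP_SHARE, len(ap_queue)))]
        random.shuffle(slots)          # a host's request scans a mixed window
        return slots

Under such a scheme a single work request should never see more than a few APs at once, which matters for the detective work further down.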

With this setup, AP was chewing through tapes at roughly twice the speed of MB. At some point, disk space became an issue when there were around sixty 50 GB "tapes" waiting for MB to split that had already been chewed through by AP.

In addition to that, GPU crunching came along and added much more demand on the project as a whole. The servers started having I/O issues and couldn't keep up, so the precision of MB was doubled, doubling the FFT lengths and therefore the crunching time. This was a stop-gap to reduce stress on the servers, as some of the super-crunchers would now only need 10,000 MBs/day instead of 20,000.

I'm not entirely sure when it happened, but that 97/3 feeder allocation has since gone away. My proof: about a year ago, when the new servers were installed and brought online, I personally got 20+ APs in a single work request more than once. If 97/3 were still in effect, nobody should ever get more than 3 APs in one work request. There was a time with the cricket graph where you could watch and see when AP was being assigned, even though it had been split and stockpiled for a while.

Through some detective work, it was discovered that AP would be split and stockpiled in the database, and then every ~6 hours there was about a 75% chance that any work being issued was AP. The task IDs and the splitter information in the file names suggested that AP was being split at nearly the same time as MB, but there would be runs of several tens of thousands of consecutive wuIDs that were nothing but AP. So the database was stockpiling, and when the stockpile hit a threshold or a time interval, AP was nearly the only thing that got issued.

Outside of those mass exoduses of APs, getting one was nearly impossible. At some point after that, GPU crunching became faster and more of a database strain, and this periodic assignment of APs went away; with that, APs became truly the luck of the draw. Couple that with several straight weeks of the pipe being maxed out just by MB downloads (which never used to happen except for 3-10 hours after the weekly maintenance), plus the limits that were put in place to fix a DCF issue, and AP is a rare find.

As I said previously, if you look at which tapes are being split, AP is nearly always idle, because it has already chewed through everything. It will sit for 6-36 hours without any new tapes while MB either catches up or scheduling actually works without any issues. See, MB splitters slow down and stop splitting once "ready to send" hits somewhere in the 200-250k range. With the limits in place, this happens frequently, so there are 250k MBs ready to be assigned, but the majority of hosts are at the limits, so there is nobody to take them. That means the creation rate is low (less than 1/sec), and it also means that until more tapes get completed... no new APs.

I have also mentioned in the past few weeks that the number of splitters can and should be adjusted. We should go back down to one or two AP splitters and have at least six MB splitters. That would slow AP chewing down significantly, and the tapes could be gone through efficiently if MB and AP chew on the same tape at the same time. There may be I/O issues with this, but there are ways to offset and balance them.

Work assignment could probably be smoothed out if we went back to the 97/3 scheme and dropped the number of AP splitters. That would make APs less difficult to get, but shouldn't significantly increase "results in the field".


[/history lesson]
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1191406
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1191418 - Posted: 4 Feb 2012, 9:33:17 UTC

Can't report. It's cold in here.
ID: 1191418
Sunny129
Joined: 7 Nov 00
Posts: 190
Credit: 3,163,755
RAC: 0
United States
Message 1191438 - Posted: 4 Feb 2012, 13:38:19 UTC

Thanks for the detailed response, Cosmic. Though I recall that, for me, it wasn't quite that long ago that I was able to get AP tasks regularly. In fact, back when I was able to maintain a cache of ~50 AP tasks for the GPU at all times, the number of AP splitters was already at 6... and it was working fine that way for a while, until the next major project outage/server maintenance period. Things just haven't been the same since...
ID: 1191438
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1191443 - Posted: 4 Feb 2012, 14:11:47 UTC

Anyone else getting timeouts on work units? I have 19 so far.

Old James
ID: 1191443
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65745
Credit: 55,293,173
RAC: 49
United States
Message 1191453 - Posted: 4 Feb 2012, 14:47:59 UTC - in response to Message 1191443.  

Anyone else getting timeouts on work units? I have 19 so far.

Downloading? Lots. I have one left; I had, I think, about 20 last night before hitting the sack, so I let the PC handle them instead, as they'd keep going to retry almost immediately if not sooner.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1191453