CPU work units download stuck for 2 days?

Message boards : Number crunching : CPU work units download stuck for 2 days?
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132229 - Posted: 26 Jul 2011, 15:12:08 UTC

I've been unable to get CPU work units for 2 days. GPU tasks downloaded OK in the same time frame. Work units are stuck in "Downloading" status: 0 KB transferred and the speed is always 0 KBps. Downloads time out, retry, and never make any progress.

Checking the server status page shows that the download servers, anakin & vader, are up. The upload server, bruno, is shown as disabled, but uploads work fine here.

Is this a temporary server problem, or has my BOINC gone bonkers?
ID: 1132229
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1132249 - Posted: 26 Jul 2011, 19:55:50 UTC

There is no difference between MB tasks destined for either CPU or GPU, so it's going to be down to BOINC, at your end, not requesting CPU tasks. A quick look suggests that a couple of your crunchers have had a lot of errors recently, and if the error rate gets too high BOINC does cut down the request rate for the affected processor.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1132249
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132254 - Posted: 26 Jul 2011, 20:09:54 UTC - in response to Message 1132249.  

Thanks for your reply. This is the cruncher I'm having problems with:

http://setiathome.berkeley.edu/results.php?hostid=5047831

The only errors reported are the 8 work units I aborted after waiting for a day without a byte of transfer. After aborting those, my cruncher was given 9 work units this morning and not a byte of data has yet been downloaded.

I thought this problem might clear up after project maintenance, but no such luck. 9 stuck.
ID: 1132254
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1132255 - Posted: 26 Jul 2011, 20:10:54 UTC - in response to Message 1132229.  
Last modified: 26 Jul 2011, 20:12:02 UTC

I've been unable to get CPU work units for 2 days. GPU tasks downloaded OK in the same time frame. Work units are stuck in "Downloading" status: 0 KB transferred and the speed is always 0 KBps. Downloads time out, retry, and never make any progress.

Checking the server status page shows that the download servers, anakin & vader, are up. The upload server, bruno, is shown as disabled, but uploads work fine here.

Is this a temporary server problem, or has my BOINC gone bonkers?

The servers can be up and doing their best and still have slow/stuck downloads. You may have heard mention of the cricket graph. Having a look at it, you will notice that the bandwidth has been maxed out for a while. The green shows traffic going out of the lab to the world. The dip in activity today was while the servers were down for weekly maintenance.
If your machine is on 24/7, the downloads should get taken care of eventually. If you only have it on for a limited time, you may have to resort to hitting the retry button on the Transfers tab to get them to complete.

Sometimes it can be the luck of the draw whether some of them download while others stay stuck.
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1132255
Profile Cliff Harding
Volunteer tester
Avatar

Send message
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1132279 - Posted: 26 Jul 2011, 20:46:49 UTC - in response to Message 1132229.  

I've been unable to get CPU work units for 2 days. GPU tasks downloaded OK in the same time frame. Work units are stuck in "Downloading" status: 0 KB transferred and the speed is always 0 KBps. Downloads time out, retry, and never make any progress.

Checking the server status page shows that the download servers, anakin & vader, are up. The upload server, bruno, is shown as disabled, but uploads work fine here.

Is this a temporary server problem, or has my BOINC gone bonkers?



If there were a way to send you some of my VLARs, I would be happy to send them to you. My A-SYS has not had any CUDA work since last Wednesday (7/20), just VLAR & AP work. It keeps asking for CUDA, but the scheduler keeps saying no joy. On the other hand, the B-SYS keeps sucking them up like a vacuum cleaner on steroids, getting over 200 today alone, with 47 since the servers came back online. There have been traffic problems, but I keep abusing the retry button and eventually everything gets downloaded.


I don't buy computers, I build them!!
ID: 1132279
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1132304 - Posted: 26 Jul 2011, 21:30:23 UTC - in response to Message 1132229.  

I've been unable to get CPU work units for 2 days. GPU tasks downloaded OK in the same time frame. Work units are stuck in "Downloading" status: 0 KB transferred and the speed is always 0 KBps. Downloads time out, retry, and never make any progress.

Hmm... I've been having the same problem for about a day on all my machines; on two of them I get connect() failed, and on my laptop an HTTP error. Even using the retry button during the outage, trying to connect through the university proxy, and all the usual stuff like rebooting didn't help. I can ping and tracert both download servers, but I don't get even a single byte of a WU from them, nor the Fedora test page.
ID: 1132304
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132307 - Posted: 26 Jul 2011, 21:39:46 UTC - in response to Message 1132255.  

The servers can be up and doing their best and still have slow/stuck downloads. You may have heard mention of the cricket graph. Having a look at it, you will notice that the bandwidth has been maxed out for a while. The green shows traffic going out of the lab to the world. The dip in activity today was while the servers were down for weekly maintenance.
If your machine is on 24/7, the downloads should get taken care of eventually. If you only have it on for a limited time, you may have to resort to hitting the retry button on the Transfers tab to get them to complete.

Sometimes it can be the luck of the draw whether some of them download while others stay stuck.


Yeah, I've looked at cricket a few times and see the servers are maxed. But I've had problems with downloads before, and the symptoms have been different. In the past, the download would stall after a few bytes. Now I'm getting the big goose egg, as in zero bytes for all work units. I haven't seen this before and was wondering if there are possibly other problems. My machine is nearly out of work, so I wanted to get ahead of the eventuality of sitting idle.
ID: 1132307
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1132324 - Posted: 26 Jul 2011, 22:19:55 UTC - in response to Message 1132307.  
Last modified: 26 Jul 2011, 22:24:09 UTC

But did I get you right that the tasks are assigned to you by the scheduler but aren't downloaded? That means they show up in the Tasks tab as "Downloading", and in the Transfers tab as what: "Suspended", "Download pending", or something else?

That sounds very suspicious; lately I've been getting my downloads through with only a few retries, if any.

When did you last reboot the machine(s) [edit]and the router, as Richard says :-)[/edit]?

Do you have any SETI-related entries in your etc\hosts file?

Did you try some logging flags in cc_config.xml (like <file_xfer_debug>, <http_xfer_debug>)?

Gruß,
Gundolf
ID: 1132324
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1132325 - Posted: 26 Jul 2011, 22:20:36 UTC - in response to Message 1132307.  

The servers can be up and doing their best and still have slow/stuck downloads. You may have heard mention of the cricket graph. Having a look at it, you will notice that the bandwidth has been maxed out for a while. The green shows traffic going out of the lab to the world. The dip in activity today was while the servers were down for weekly maintenance.
If your machine is on 24/7, the downloads should get taken care of eventually. If you only have it on for a limited time, you may have to resort to hitting the retry button on the Transfers tab to get them to complete.

Sometimes it can be the luck of the draw whether some of them download while others stay stuck.

Yeah, I've looked at cricket a few times and see the servers are maxed. But I've had problems with downloads before, and the symptoms have been different. In the past, the download would stall after a few bytes. Now I'm getting the big goose egg, as in zero bytes for all work units. I haven't seen this before and was wondering if there are possibly other problems. My machine is nearly out of work, so I wanted to get ahead of the eventuality of sitting idle.

I get that sometimes. Rebooting my router (a combination ADSL modem/router/switch) seems to wake up the embedded DNS server, which seems to be causing most of the grief these days.
ID: 1132325
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132384 - Posted: 27 Jul 2011, 2:36:37 UTC - in response to Message 1132325.  

I get that sometimes. Rebooting my router (a combination ADSL modem/router/switch) seems to wake up the embedded DNS server, which seems to be causing most of the grief these days.


Rebooted my router, no joy. Rebooted my computer... still glum. I did update the project to report a finished task, and that went through immediately. I'm talking to the servers, but they're not giving me any bits.

But did I get you right that the tasks are assigned to you by the scheduler but aren't downloaded? That means they show up in the Tasks tab as "Downloading", and in the Transfers tab as what: "Suspended", "Download pending", or something else?

That sounds very suspicious; lately I've been getting my downloads through with only a few retries, if any.


Yes, you got that exactly right, and that suspicious behaviour is why I'm posting about it. The scheduler for sure assigned me tasks. You can see them in my task list at:

http://setiathome.berkeley.edu/results.php?hostid=5047831

On the above status page, all of the tasks are shown as "In progress" even though they are still stuck downloading.

In my Tasks tab, the tasks say "Downloading". In the Transfers tab, the status is "Download pending", then "Downloading", and finally "Retry in...".

Ummm, I just noticed something, and I don't know if it's significant. The task file names on the SETI task details web page don't match the file names on my Transfers tab: the SETI file names have a "_1" appended to the end. My file names match except there is no trailing "_1".
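(If I understand BOINC naming right, that "_1" is just the replication suffix on the result name shown by the website; the data file that actually gets downloaded is named after the workunit, without the suffix. A toy sketch of the relationship; the helper name here is made up, not part of BOINC:)

```python
def workunit_file(result_name):
    """Strip the trailing _N replication suffix from a BOINC result (task)
    name to get the name of the workunit data file that is downloaded."""
    base, sep, suffix = result_name.rpartition("_")
    return base if sep and suffix.isdigit() else result_name

print(workunit_file("21mr11af.31268.237100.13.10.36_1"))
# 21mr11af.31268.237100.13.10.36
```

So the names in the Transfers tab and the web page should always differ by exactly that suffix, which matches what I'm seeing.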

Do you have any SETI-related entries in your etc\hosts file?

Did you try some logging flags in cc_config.xml (like <file_xfer_debug>, <http_xfer_debug>)?


There's only my localhost in my hosts file.

I tried adding the debug log levels. I'm not familiar with that file format, but I looked it up. Here's what I did:

<cc_config>
	<log_flags>
		<file_xfer_debug>
		<http_xfer_debug>
	</log_flags>
	
	<options>
		
	</options>
</cc_config>


Here's what my 6.6.9 version of BOINC complained about:

Tue 26 Jul 2011 10:30:05 PM EDT		Unrecognized tag in cc_config.xml: <file_xfer_debug>
Tue 26 Jul 2011 10:30:05 PM EDT		Missing end tag in cc_config.xml
Tue 26 Jul 2011 10:30:05 PM EDT		Starting BOINC client version 6.6.9 for i686-pc-linux-gnu


ID: 1132384
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132400 - Posted: 27 Jul 2011, 3:09:39 UTC

So I thought it might be useful to post the message log:

Tue 26 Jul 2011 10:40:13 PM EDT	SETI@home	Started download of 21mr11af.31268.237100.13.10.36
Tue 26 Jul 2011 10:42:13 PM EDT		Project communication failed: attempting access to reference site
Tue 26 Jul 2011 10:42:13 PM EDT	SETI@home	Temporarily failed download of 21ap11ac.3769.1860.16.10.163: HTTP error
Tue 26 Jul 2011 10:42:13 PM EDT	SETI@home	Backing off 3 hr 55 min 12 sec on download of 21ap11ac.3769.1860.16.10.163
Tue 26 Jul 2011 10:42:13 PM EDT	SETI@home	Started download of 21mr11af.30646.238327.12.10.94
Tue 26 Jul 2011 10:42:14 PM EDT		Internet access OK - project servers may be temporarily down.
Tue 26 Jul 2011 10:42:14 PM EDT	SETI@home	Temporarily failed download of 21mr11af.31268.237100.13.10.36: HTTP error
Tue 26 Jul 2011 10:42:14 PM EDT	SETI@home	Backing off 3 hr 34 min 26 sec on download of 21mr11af.31268.237100.13.10.36
Tue 26 Jul 2011 10:44:15 PM EDT		Project communication failed: attempting access to reference site
Tue 26 Jul 2011 10:44:15 PM EDT	SETI@home	Temporarily failed download of 21mr11af.30646.238327.12.10.94: HTTP error
Tue 26 Jul 2011 10:44:15 PM EDT	SETI@home	Backing off 1 hr 28 min 18 sec on download of 21mr11af.30646.238327.12.10.94
Tue 26 Jul 2011 10:44:17 PM EDT		Internet access OK - project servers may be temporarily down.


But you see, the file names in the above log are missing the "_1". I noticed that the correct file names are shown in the Tasks tab. Hope this is helpful.

This problem started all of a sudden. I haven't changed anything on my end.
ID: 1132400
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132410 - Posted: 27 Jul 2011, 3:57:14 UTC

Just got handed 7 more work units. Same deal, stuck in my craw.
ID: 1132410
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132414 - Posted: 27 Jul 2011, 4:04:32 UTC

Joy!

Here's what I did: Advanced -> Preferences -> Clear.

This reset my preferences to the global defaults. Stopped & restarted the client, and wham! All the work units downloaded. The only difference I see is that "Use GPU while computer is in use" is no longer checked. This computer doesn't have a CUDA GPU, so I can't see how that made any difference. Anyway, I'm up and running and did not run out of work units.

Thanks for everyone's help.
ID: 1132414
Bernd Noessler

Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1132438 - Posted: 27 Jul 2011, 5:36:23 UTC

I think the web server on 208.68.240.13 has been down for more than a day now.

If your client does not try the second one (208.68.240.18), you cannot download.
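The missing fallback can be sketched like this: a client that stops at the first address never reaches the working server. (The is_up check here is a made-up stand-in for a real connection attempt, purely for illustration.)

```python
def first_reachable(addresses, is_up):
    """Return the first address that answers, i.e. the fallback a client
    ought to perform when a hostname resolves to several IPs."""
    for addr in addresses:
        if is_up(addr):
            return addr
    return None  # nothing answered

# Pretend only the .18 server is answering, as observed in this thread.
status = {"208.68.240.13": False, "208.68.240.18": True}
print(first_reachable(["208.68.240.13", "208.68.240.18"], status.get))
# 208.68.240.18
```

A client that only ever tries 208.68.240.13 behaves like the single-element list here: it returns nothing and the download stalls at 0 KB.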
ID: 1132438
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1132474 - Posted: 27 Jul 2011, 8:30:32 UTC - in response to Message 1132438.  

I think the web server on 208.68.240.13 has been down for more than a day now.

If your client does not try the second one (208.68.240.18), you cannot download.

That seems to be true. According to http debug, all my computers were trying *.13 all the time, and inserting "208.68.240.18 boinc2.ssl.berkeley.edu" in the hosts file solved the problem.
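For anyone wanting to try the same workaround, the entry would look something like this in /etc/hosts on Linux (C:\Windows\System32\drivers\etc\hosts on Windows). Remember to remove it again once both download servers are reachable, or you lose the other server entirely:

```text
# temporary pin: force boinc2.ssl.berkeley.edu to the working download server
208.68.240.18   boinc2.ssl.berkeley.edu
```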
ID: 1132474
Profile woodyrox
Volunteer tester

Joined: 7 Apr 01
Posts: 34
Credit: 16,069,169
RAC: 0
United States
Message 1132552 - Posted: 27 Jul 2011, 13:50:47 UTC - in response to Message 1132474.  

I think the web server on 208.68.240.13 has been down for more than a day now.

If your client does not try the second one (208.68.240.18), you cannot download.

That seems to be true. According to http debug, all my computers were trying *.13 all the time, and inserting "208.68.240.18 boinc2.ssl.berkeley.edu" in the hosts file solved the problem.


This is good to know. I figured out the cc_config.xml file format and got the communications logs working. My file looks like this:

<cc_config>
	<log_flags>
		<file_xfer_debug>1</file_xfer_debug>
		<http_xfer_debug>1</http_xfer_debug>
	</log_flags>
	
	<options>
		
	</options>
</cc_config>


So I will look for failed host attempts and will edit the hosts file if needed.

thanks
ID: 1132552
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1132560 - Posted: 27 Jul 2011, 14:09:43 UTC - in response to Message 1132414.  
Last modified: 27 Jul 2011, 14:10:31 UTC

Joy!

Here's what I did: Advanced -> Preferences -> Clear.

This reset my preferences to the global defaults. Stopped & restarted the client, and wham! All the work units downloaded. The only difference I see is that "Use GPU while computer is in use" is no longer checked. This computer doesn't have a CUDA GPU, so I can't see how that made any difference. Anyway, I'm up and running and did not run out of work units.

Thanks for everyone's help.

Computers... sometimes you just want to toss them out a window. Glad it wants to play now.
For future reference, your cc_config.xml would look something like this for the logging flags, where 1 turns an option on and 0 turns it off.

<cc_config>
	<log_flags>
		<task>0</task>
		<file_xfer>0</file_xfer>
		<sched_ops>1</sched_ops>
		<coproc_debug>0</coproc_debug>
		<cpu_sched>0</cpu_sched>
		<cpu_sched_debug>0</cpu_sched_debug>
		<dcf_debug>0</dcf_debug>
		<sched_op_debug>0</sched_op_debug>
		<state_debug>0</state_debug>
		<http_debug>0</http_debug>
		<http_xfer_debug>0</http_xfer_debug>
	</log_flags>
	<options>
		<no_gpus>0</no_gpus>
		<allow_remote_gui_rpc>1</allow_remote_gui_rpc>
		<save_stats_days>180</save_stats_days>
	</options>
</cc_config>

Edit: OK, fine, you already figured it out. :)
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1132560
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1132611 - Posted: 27 Jul 2011, 16:59:49 UTC

Since it's not the first time that we've had problems like this here, I wonder if it would cause fewer problems if SETI had two different download server URLs, for example dl1.ssl.berkeley.edu and dl2.ssl.berkeley.edu, and sent both as possible download locations, like Rosetta does:
<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>


So for a SETI WU it could be:
<url>http://dl1.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>
<url>http://dl2.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>


I don't know how the load balancing would work in that case. If the BOINC client just picks one of them, then it would be pretty easy, with no need for any big server-side changes. If the client starts from the top and tries one after the other, then the scheduler would have to send dl1,dl2 to all even-numbered results (_0, _2, ...) and dl2,dl1 to all odd-numbered results. I think it might work better than the current way... but I might be wrong, of course.
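The even/odd idea boils down to rotating the mirror list by the result's replication number. A toy sketch, assuming two hypothetical mirror URLs; nothing like this exists server-side today:

```python
MIRRORS = ["http://dl1.ssl.berkeley.edu", "http://dl2.ssl.berkeley.edu"]

def mirror_order(result_name, mirrors=MIRRORS):
    """Rotate the mirror list by the result's replication number (_0, _1, ...),
    so even-numbered results try dl1 first and odd-numbered ones try dl2."""
    n = int(result_name.rsplit("_", 1)[1])
    k = n % len(mirrors)
    return mirrors[k:] + mirrors[:k]

print(mirror_order("08ap11ae.3480.1703.14.10.29_0"))
# ['http://dl1.ssl.berkeley.edu', 'http://dl2.ssl.berkeley.edu']
print(mirror_order("08ap11ae.3480.1703.14.10.29_1"))
# ['http://dl2.ssl.berkeley.edu', 'http://dl1.ssl.berkeley.edu']
```

That way the first-choice load splits roughly in half, and each client still has the other server to fall back on if its first choice is down.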
ID: 1132611



 
©2024 University of California
 
SETI@home and AstroPulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.