increase '> CPUs x2' in UL for work request

Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 919051 - Posted: 18 Jul 2009, 16:48:10 UTC


O.K., everyone around here knows: more than 'CPUs x 2' results in the UL overview in BOINC means no work request.

IIRC, back when I ran my old QX6700 at 4 WU/h, this would have meant no work request after ~ 2 hours.

Now, with my GPU cruncher at the same AR, it's 4 WUs per ~ 6.75 minutes, which means no work request after ~ 13 minutes.

~ 2 hours vs. ~ 13 minutes.. I think this isn't fair.. ;-)


Isn't it possible to increase this value for GPU crunchers,
so that they have a chance to 'bridge' UL failures?


Just now I could download 24 WUs, but they are all in the UL overview already, BOINC doesn't ask for new work, and the GPU cruncher sits idle.
If the value were higher, I could crunch for ~ 2 hours, and maybe in that time the UL server would be well again and BOINC could UL and DL.
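
The arithmetic behind the "~ 2 hours vs. ~ 13 minutes" comparison can be sketched as below. This is only a rough model: it assumes the limit is exactly 'CPUs x 2' pending uploads as described in this thread, that no upload succeeds in the meantime, and the function name is made up for illustration.

```python
def minutes_until_requests_stop(ncpus, results_per_minute, factor=2):
    """Minutes of crunching until `ncpus * factor` results are waiting to upload.

    Simplified model (an assumption, not BOINC's actual code): uploads are
    completely blocked, and results pile up at a constant rate.
    """
    limit = ncpus * factor
    return limit / results_per_minute


# Old QX6700: 4 cores, about 4 WUs per hour.
cpu_minutes = minutes_until_requests_stop(4, 4 / 60)

# GPU cruncher: BOINC still counts 4 CPUs, but 4 WUs finish every ~6.75 minutes.
gpu_minutes = minutes_until_requests_stop(4, 4 / 6.75)

print(f"CPU quad: about {cpu_minutes:.0f} minutes")  # roughly 2 hours
print(f"GPU quad: about {gpu_minutes:.1f} minutes")  # roughly 13 minutes
```

The limit is the same for both machines, so the faster host simply reaches it sooner.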

ID: 919051 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919052 - Posted: 18 Jul 2009, 16:54:47 UTC - in response to Message 919051.  
Last modified: 18 Jul 2009, 16:55:41 UTC


IIRC, back when I ran my old QX6700 at 4 WU/h, this would have meant no work request after ~ 2 hours.

Actually, if you're able to upload, and you have a 2 day cache, you should be able to keep two days of work -- this is only a limit when the upload servers are in trouble.

This only becomes an issue if there are extended problems -- lasting more than two days.

... but I'm thinking that something is wrong if you haven't been able to upload.

Have you tried restarting BOINC?

Edit: remember that this is a BOINC "feature" and not something that SETI@Home can change.
ID: 919052 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 919061 - Posted: 18 Jul 2009, 17:12:16 UTC
Last modified: 18 Jul 2009, 17:22:53 UTC


In my opening post I meant the case where the UL server has probs and an idle PC was able to download 8 WUs.


My QX6700 could crunch for ~ 2 hours before BOINC stopped asking for new work.
In that time BOINC could maybe DL a 10-day WU cache.

My current GPU cruncher would stop asking for new work after ~ 13 minutes.
In that time BOINC could DL only a few WUs.. maybe enough for 20 minutes of crunching.

So why is a fast GPU cruncher at such a disadvantage?
If it could also ask for new work for 2 hours, it could have WUs for maybe 1/4 day.
Never up to a 10-day WU cache.


It would be better/fairer if a CPU quad had the same 'brake' as a GPU quad.
'No work request' at equal performance.


This would maybe mean, if I compare my above-mentioned PCs:
CPU-Quad: > CPUs x2
GPU-Quad: > GPUs x10 (or much more)


My GPU cruncher has run out of work many times since I've had it.

The max. was a ~ 4 - 5 day WU cache.
Then again server probs at Berkeley.
With good luck down to 1/2 day.. and again fill up the cache to ~ 4 - 5 days.
Then again server probs at Berkeley.
Maybe fill up to 1 day..
Then again server probs at Berkeley.
..idle GPU cruncher..
..maybe for one to 3 days.

Then a fresh start again.. download some WUs.. UL server down.. and idle again until the UL server is well again..

ID: 919061 · Report as offensive
Profile Zeus Fab3r
Joined: 17 Jan 01
Posts: 649
Credit: 275,335,635
RAC: 597
Serbia
Message 919065 - Posted: 18 Jul 2009, 17:24:03 UTC - in response to Message 919051.  


O.K., everyone around here knows: more than 'CPUs x 2' results in the UL overview in BOINC means no work request.

This would maybe mean, if I compare my above-mentioned PCs:
CPU-Quad: > CPUs x2
GPU-Quad: > GPUs x10 (or much more)


I suggest that the following formula be considered:
'CUDA Stream Processors x2'
That'll do the trick if Sutaru agrees... :)

Who the hell is General Failure and why is he reading my harddisk?¿
ID: 919065 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 919121 - Posted: 18 Jul 2009, 21:26:02 UTC - in response to Message 919065.  
Last modified: 18 Jul 2009, 21:38:26 UTC

I suggest that the following formula be considered:
'CUDA Stream Processors x2'
That'll do the trick if Sutaru agrees... :)


Yes, this would be a good idea! :-)

But.. I think it's currently not possible, because the SETI@home application and the BOINC client can't communicate this way.

Or?

Maybe we could take the GFLOPS of the BOINC benchmarks?

1 GFLOPS = 1 WU

For my GPU cruncher, where one OCed GPU has 112 GFLOPS, this would mean 4 x 112 = 448 results in the UL overview with work requests still possible.
> 448 and no work request.

Hmm.. if these 448 are normal ARs [0.44x], my GPU cruncher reaches this value after 448 WUs x 6:45 [m:s] / 4 GPUs = ~ 12.5 hours
And after ~ 12.5 hours with no UL possible - no work request.

If the server plays along with the GPU cruncher, it does ~ 860 'normal' MB WUs/day.

If these were only shorties.. shorties are finished after ~ 2:30..
448 WUs x 2:30 [m:s] / 4 GPUs = ~ 4 1/2 hours
And after ~ 4 1/2 hours with no UL possible - no work request.


These calculations assume an idle GPU cruncher that starts to crunch.
In the short time (~ 2 months) that I've had all 4 GPUs installed, it couldn't crunch continuously.. it was idle several times..


This would mean that, with a new limit of > 'GPUs x X', 'CUDA Stream Processors x X' or 'GFLOPS x X' for GPU crunchers, they could 'bridge' longer UL server outages.
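
The GFLOPS-based proposal can be checked with a few lines of arithmetic. The sketch below is only an illustration of the figures quoted in this post; it assumes the four GPUs each crunch one WU at a time in parallel and that uploads are blocked the whole time, and the function name is hypothetical.

```python
def hours_until_limit(limit, seconds_per_wu, gpus):
    """Hours until `limit` finished results have piled up with uploads blocked."""
    return limit * seconds_per_wu / gpus / 3600


GPUS = 4
GFLOPS_PER_GPU = 112                 # benchmark figure quoted in the post
limit = GPUS * GFLOPS_PER_GPU        # proposed rule: 1 GFLOPS = 1 WU -> 448

normal_hours = hours_until_limit(limit, 6 * 60 + 45, GPUS)   # 6:45 per WU
shorty_hours = hours_until_limit(limit, 2 * 60 + 30, GPUS)   # 2:30 per WU
wus_per_day = 24 * 3600 / (6 * 60 + 45) * GPUS

print(f"limit: {limit} results")
print(f"normal AR: about {normal_hours:.1f} h until the limit")
print(f"shorties:  about {shorty_hours:.1f} h until the limit")
print(f"throughput: about {wus_per_day:.0f} normal WUs/day")
```

With these inputs the numbers land close to the ~ 12.5 hours, ~ 4 1/2 hours and ~ 860 WUs/day given above.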

ID: 919121 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919146 - Posted: 18 Jul 2009, 22:40:02 UTC - in response to Message 919061.  
Last modified: 18 Jul 2009, 22:40:47 UTC

In my opening post I meant the case where the UL server has probs and an idle PC was able to download 8 WUs.

I understood that.

What I said was that this should be a pretty exceptional case, and that it is in exactly those exceptional times when the upload server is going to be down for a long time that stopping uploads prevents an impossible situation when the server is back up.

But I think you need to be looking for other problems....

I don't have nearly the number of uploads you have, but for the past couple of days, I've been able to get uploads through more often than not, and I wonder if there isn't something going on at your end.

I just had three go through on the first try.

I'd sure take a look and see if I could find some other problem. At a minimum I'd probably shutdown and reboot.
ID: 919146 · Report as offensive
john deneer
Volunteer tester
Joined: 16 Nov 06
Posts: 331
Credit: 20,996,606
RAC: 0
Netherlands
Message 919150 - Posted: 18 Jul 2009, 22:52:57 UTC - in response to Message 919146.  
Last modified: 18 Jul 2009, 22:53:23 UTC

I just had three go through on the first try.

I'd sure take a look and see if I could find some other problem. At a minimum I'd probably shutdown and reboot.

Hi Ned,

That remark (I just had three go through ...) sure triggered something in my feet :-)

When I read that, I ran up the stairs and turned my machine on immediately. And what do you know: those uploads lingering here all day went through immediately!

I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well.

Now I'm going to try and get some new units to crunch :-)

Regards,
John.
ID: 919150 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919155 - Posted: 18 Jul 2009, 23:03:17 UTC - in response to Message 919150.  

I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well.

I haven't seen enough to prove it one way or another, but I think there is a long-term bug in LIBCURL (which BOINC uses, along with a whole bunch of other packages) that hangs on to outdated DNS for a long time.

Your machine was off, now it is on, and it works. Could it be that it worked because it did not have some bad info stored somewhere, or a screwed up IP stack that was reloaded on start-up?

I haven't seen it enough times personally to have much of a diagnosis, just that I think something might be going on.
ID: 919155 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 919157 - Posted: 18 Jul 2009, 23:15:37 UTC - in response to Message 919155.  
Last modified: 18 Jul 2009, 23:17:18 UTC

AP stopped splitting work for a while: downloads dropped by 10Mb/s, uploads increased by 30Mb/s. AP appears to have fired up again: downloads jumped 8Mb/s, uploads dropped by about 15Mb/s.


EDIT- during that drop in download traffic I had about 70 uploads go through one after the other. Since AP fired up again, nothing's gone through.
Grant
Darwin NT
ID: 919157 · Report as offensive
john deneer
Volunteer tester
Joined: 16 Nov 06
Posts: 331
Credit: 20,996,606
RAC: 0
Netherlands
Message 919159 - Posted: 18 Jul 2009, 23:20:54 UTC - in response to Message 919155.  

I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well.

I haven't seen enough to prove it one way or another, but I think there is a long-term bug in LIBCURL (which BOINC uses, along with a whole bunch of other packages) that hangs on to outdated DNS for a long time.

Your machine was off, now it is on, and it works. Could it be that it worked because it did not have some bad info stored somewhere, or a screwed up IP stack that was reloaded on start-up?

I haven't seen it enough times personally to have much of a diagnosis, just that I think something might be going on.

That machine was turned off and on (and off again :-) several times over the last three days. Reboots, flushing the DNS, restarting BOINC and what have you over the last 24 hours or so. Uploads went gaga sometime yesterday, and I kept getting messages that indicated things weren't going all too well. I haven't changed any of the software on that machine for weeks, and all of a sudden a couple of days ago it went into a state where it wasn't able to upload or download continuously (even when the 2xcpu+1 limit didn't apply). It mostly correlates with high readings on the cricket graph, but is sometimes worse than you would expect from high traffic alone. Most of it is caused by high traffic, but some of it is probably caused by the fiddling going on at Berkeley :-)

And indeed, I got my first 20 or so units just a few seconds ago. Hurray, and thanks again for mentioning those successful uploads :-)

Regards,
John.
ID: 919159 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 919160 - Posted: 18 Jul 2009, 23:22:30 UTC

As Ned mentioned, there are times when the TCP stack can become corrupt. The longer the machine runs, the more browser windows get opened and closed, and the more things contact the network (email etc., and BOINC), the more likely it gets. There are many possibilities for something to come in and corrupt the network stack (e.g. continuous retries to send work with errors returned).

The easiest way for a user is to "Shutdown", wait 15 seconds and restart the computer.


Please consider a Donation to the Seti Project.

ID: 919160 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 919162 - Posted: 18 Jul 2009, 23:30:34 UTC - in response to Message 919155.  

I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well.

I haven't seen enough to prove it one way or another, but I think there is a long-term bug in LIBCURL (which BOINC uses, along with a whole bunch of other packages) that hangs on to outdated DNS for a long time.

Your machine was off, now it is on, and it works. Could it be that it worked because it did not have some bad info stored somewhere, or a screwed up IP stack that was reloaded on start-up?

I haven't seen it enough times personally to have much of a diagnosis, just that I think something might be going on.

Ned,

You know about DNS, TCP/IP and suchlike. Could you have a look at a rather old BOINC thread, please? (So old that the highlighting coding is no longer compatible) DNS caching in 6.2.18 and onwards.

I saw a number of cases where a libcurl retry reversed the order of the IP elements. Nobody else took any notice, but it seemed like a bug at the time.
ID: 919162 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919166 - Posted: 18 Jul 2009, 23:41:02 UTC - in response to Message 919162.  


Ned,

You know about DNS, TCP/IP and suchlike. Could you have a look at a rather old BOINC thread, please? (So old that the highlighting coding is no longer compatible) DNS caching in 6.2.18 and onwards.

I saw a number of cases where a libcurl retry reversed the order of the IP elements. Nobody else took any notice, but it seemed like a bug at the time.

It'd take me a minute to find, since RFC-1034 and RFC-1035 don't use the word "random" to describe the behaviour.

Whenever you query DNS, the order in which records are returned should be randomized.

The idea is that if you have one DNS name pointing to two IP addresses that half of the traffic will go to one, and half will go to the other, due to that randomization.

The biggest single problem: the RFC is unclear who should randomize. Some DNS servers (Microsoft's, unless it has been recently fixed) do not, they assume the resolver or the client will randomize. Some resolvers assume that the server or the client will take care of it. I think most clients assume that the randomization happened elsewhere.

What *SHOULD* happen is that everyone should assume that nothing else randomized, and reshuffle the answers.

Two consecutive queries should never return the same records in the same order, unless the record types themselves are all different (the special case where randomizing doesn't help). Never.

The case in the other thread is not a bug.

I'm not sure about LIBCURL (since I don't normally use it), but my questions are about how it honors TTL (which it does not get from the resolver if it calls "gethostbyname()") and how it handles failed responses.

In my opinion, it would probably be better to just turn DNS caching in LIBCURL off, and let the underlying OS handle it (since all the modern ones cache DNS locally anyway).
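
The "assume nothing else randomized, and reshuffle" idea can be sketched as follows. This is a minimal illustration, not BOINC or LIBCURL code: the function names and the second address are made up, and `try_connect` stands in for a real connection attempt.

```python
import random


def shuffle_records(addresses, rng=None):
    """Return the address records in a fresh random order (input is untouched)."""
    rng = rng or random.Random()
    shuffled = list(addresses)
    rng.shuffle(shuffled)
    return shuffled


def first_reachable(addresses, try_connect, rng=None):
    """Try every address in shuffled order; return the first that connects."""
    for address in shuffle_records(addresses, rng):
        if try_connect(address):
            return address
    return None


# 208.68.240.13 appears in this thread; the second address is hypothetical.
records = ["208.68.240.13", "208.68.240.16"]

# Pretend the first server is down: only the second one accepts connections.
reachable = first_reachable(records, lambda addr: addr == "208.68.240.16")
print(reachable)
```

Shuffling on the client spreads load even if the server and resolver return a fixed order, and trying every record means one dead address does not fail the whole transfer.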
ID: 919166 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 919167 - Posted: 18 Jul 2009, 23:41:24 UTC
Last modified: 18 Jul 2009, 23:44:48 UTC


For ~ 25 minutes BOINC could UL.. and request new work.

BOINC has DLed / will now DL ~ 360 WUs.


The 'problem' is that my GPU cruncher does ~ 860 normal AR WUs / day.

This means.. 860 result ULs / day.

If only shorties, ~ 2.7 x more ULs / day.


I also have only 'DSL light' 384/64 DL/UL [kbit/s].
More isn't possible because of T-Com's* thin cable in our village.

So I have a cable prob like Berkeley too.. ;-)


And if > 8 results are in the UL overview in BOINC, no work request.

And by my calculation above this can happen after ~ 13 min. or less..
And if the UL server is offline for some hours and the cache is empty as well, the GPU cruncher will idle again.


Now ~ 20 minutes later..
Probs with ULs again. ~ 21 results in the UL overview and no new work request.

Like I said.. if the UL server is now offline for ~ 10 hours.. my GPU cruncher will idle again.. (with only ~ 360 normal AR WUs; earlier if they are shorties and/or killed VLARs)


It would be fairer if the GPU cruncher could have a higher '> GPUs x X' (or something similar: CUDA/shader cores, GFLOPS) than '> CPUs x 2'.


BTW.
The GPU cruncher is a pure crunching machine. Only BOINC is active.
No reboot in the last ~ two days.


[* T-Com owns (nearly) all telephone/DSL connections in Germany; it is the former monopolist]

ID: 919167 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919170 - Posted: 18 Jul 2009, 23:45:20 UTC - in response to Message 919160.  

As Ned mentioned, there are times when the TCP stack can become corrupt. The longer the machine runs, the more browser windows get opened and closed, and the more things contact the network (email etc., and BOINC), the more likely it gets. There are many possibilities for something to come in and corrupt the network stack (e.g. continuous retries to send work with errors returned).

The easiest way for a user is to "Shutdown", wait 15 seconds and restart the computer.


I lost track of the friend who taught me this, but there are two reasons to try anything. First is because you think it'll fix the problem, and the second is because you don't think it'll fix the problem, but it's quick and easy to try, and you might get lucky.

Rebooting is a "quick, I might get lucky" kind of thing. I find that it works more often than not, and that sometimes luck is better than skill.
ID: 919170 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919171 - Posted: 18 Jul 2009, 23:46:33 UTC - in response to Message 919167.  

I also have only 'DSL light' 384/64 DL/UL [kbit/s].

I wonder what would happen if you told BOINC to limit to 48 kbit/sec., upload and download.

ID: 919171 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 919175 - Posted: 18 Jul 2009, 23:57:06 UTC - in response to Message 919166.  

Whenever you query DNS, the order in which records are returned should be randomized.

No, no - that wasn't it.

Libcurl looked up a domain - say boinc2.ssl.berkeley.edu

It got an address (any address) - say 208.68.240.13

Some time later, it tried to re-use the same address - but actually attempted connection to:

13.240.68.208

(there was colour back in those days....)
ID: 919175 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919178 - Posted: 19 Jul 2009, 0:04:29 UTC - in response to Message 919162.  


You know about DNS, TCP/IP and suchlike. Could you have a look at a rather old BOINC thread, please? (So old that the highlighting coding is no longer compatible) DNS caching in 6.2.18 and onwards.

I saw a number of cases where a libcurl retry reversed the order of the IP elements. Nobody else took any notice, but it seemed like a bug at the time.

Actually, this looks like a different bug.

If there are two "A" records, LIBCURL should try to connect to both addresses, regardless of the order.

... at least in your example, it appears to have only tried one.

It appears that a call to res_init() after a failure would correct a lot of sins, but I'm not an expert on LIBCURL, and don't claim to play one on TV.
ID: 919178 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 919182 - Posted: 19 Jul 2009, 0:16:04 UTC - in response to Message 919178.  

But what changed 208.68.240.13 to 13.240.68.208 ?

It wasn't DNS.
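
One way such a reversal can arise in general is a byte-order mix-up: the four address bytes get read as an integer in the wrong endianness and are then written back out. Whether that is what happened in LIBCURL or BOINC is not established in this thread; the sketch below only shows that the mechanism reproduces the exact pattern, and the function name is made up.

```python
import socket
import struct


def misread_byte_order(dotted):
    """Show how reading network-order bytes as little-endian reverses the octets."""
    packed = socket.inet_aton(dotted)           # 4 bytes in network (big-endian) order
    (value,) = struct.unpack("<I", packed)      # wrongly interpreted as little-endian
    return socket.inet_ntoa(struct.pack(">I", value))  # written back as big-endian


print(misread_byte_order("208.68.240.13"))
```

Applying the same mix-up twice restores the original address, which is why a missing or doubled byte-order conversion is a common suspect when octets appear mirrored.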
ID: 919182 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919183 - Posted: 19 Jul 2009, 0:16:09 UTC - in response to Message 919175.  

Whenever you query DNS, the order in which records are returned should be randomized.

No, no - that wasn't it.

Libcurl looked up a domain - say boinc2.ssl.berkeley.edu

It got an address (any address) - say 208.68.240.13

Some time later, it tried to re-use the same address - but actually attempted connection to:

13.240.68.208

(there was colour back in those days....)

Oh, wow. That's a huge, huge bug. Can't imagine that hasn't been fixed.

There is a part of me that says that BOINC should set CURLOPT_DNS_CACHE_TIMEOUT to 0 (disable caching). There is also a call (res_init()) that would pick up DNS server changes if they were changed by DHCP, which I suspect would be a good idea after any failure -- mentioned in the LIBCURL documentation.
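
To make the effect of a cache timeout of 0 concrete, here is a small stand-alone sketch of a name cache with a lifetime setting. It is not LIBCURL's implementation (the library option for the cache lifetime is CURLOPT_DNS_CACHE_TIMEOUT); the class, its methods and the fake resolver are all hypothetical.

```python
import time


class DnsCache:
    """Tiny name cache; a timeout of 0 means every lookup hits the resolver."""

    def __init__(self, resolve, timeout_s, clock=time.monotonic):
        self._resolve = resolve        # function: host name -> address
        self._timeout_s = timeout_s    # how long a cached answer stays valid
        self._clock = clock
        self._entries = {}             # host -> (address, time stored)

    def lookup(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[1] < self._timeout_s:
            return entry[0]            # still fresh: reuse the cached answer
        address = self._resolve(host)
        if self._timeout_s > 0:
            self._entries[host] = (address, now)
        return address

    def forget(self, host):
        """Drop a cached answer, e.g. after a failed connection attempt."""
        self._entries.pop(host, None)


# Fake resolver that counts how often it is asked (no network needed).
queries = []


def fake_resolve(host):
    queries.append(host)
    return "208.68.240.13"


cached = DnsCache(fake_resolve, timeout_s=60)
cached.lookup("boinc2.ssl.berkeley.edu")
cached.lookup("boinc2.ssl.berkeley.edu")
queries_with_cache = len(queries)      # second lookup came from the cache

uncached = DnsCache(fake_resolve, timeout_s=0)
uncached.lookup("boinc2.ssl.berkeley.edu")
uncached.lookup("boinc2.ssl.berkeley.edu")
queries_without_cache = len(queries) - queries_with_cache

print(queries_with_cache, queries_without_cache)
```

With caching off, a stale or wrong answer cannot outlive a single lookup, at the cost of asking the OS resolver every time; calling `forget` after a failure is a middle ground.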
ID: 919183 · Report as offensive


 