Message boards :
Number crunching :
increase '> CPUs x2' in UL for work request
Author | Message |
---|---|
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
O.K., everyone around here knows: once more than 'CPUs x 2' results are sitting in the upload (UL) list in BOINC, no more work requests. IIRC, back when I ran my old QX6700 at ~ 4 WUs/hour, that meant no work requests after ~ 2 hours. Now, with my GPU cruncher at the same AR, 4 WUs per ~ 6.75 minutes, it means no work requests after ~ 13 minutes. ~ 2 hours versus ~ 13 minutes.. I don't think that's fair.. ;-) Wouldn't it be possible to increase this value for GPU crunchers, so they have a chance to 'bridge' UL failures? Right now I could download 24 WUs.. but they are all stuck in the UL list, BOINC doesn't ask for new work, and the GPU cruncher sits idle. If the value were higher, I could crunch for ~ 2 hours.. and maybe in that time the UL server would recover, and BOINC could UL and DL again.. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Actually, if you're able to upload, and you have a 2 day cache, you should be able to keep two days of work -- this is only a limit when the upload servers are in trouble. This only becomes an issue if there are extended problems -- lasting more than two days. ... but I'm thinking that something is wrong if you haven't been able to upload. Have you tried restarting BOINC? Edit: remember that this is a BOINC "feature" and not something that SETI@Home can change. |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
In my opening post I meant the case where the UL server has problems and an otherwise idle PC can only hold 8 pending results. My QX6700 could crunch for ~ 2 hours before BOINC stopped asking for new work; in that time BOINC could DL maybe a 10-day WU cache. My current GPU cruncher stops asking for new work after ~ 13 minutes; in that time BOINC can DL only a few WUs.. maybe 20 minutes of crunching time. So why is a fast GPU cruncher at such a disadvantage? If it could also keep asking for 2 hours, it could hold WUs for maybe 1/4 of a day - never the full 10-day cache. It would be better/fairer if a CPU quad and a GPU quad had the same 'brake': 'no work request' at equal performance. Comparing my two PCs above, that would mean roughly: CPU quad: > CPUs x2; GPU quad: > GPUs x10 (or much more). I have run out of work on my GPU cruncher many times since I got it. The best I managed was a ~ 4 - 5 day WU cache; then server problems at Berkeley again. With good luck it drops to 1/2 day.. then I fill the cache back up to ~ 4 - 5 days; then server problems at Berkeley again. Maybe fill up to 1 day.. then server problems again.. idle GPU cruncher.. maybe for one to three days. Then a fresh start: download some WUs, UL server down, and idle again until the UL server recovers.. |
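The two-hours-versus-thirteen-minutes comparison in the post above can be checked with a little arithmetic. A sketch, not BOINC code: the limit of 2 pending results per CPU is the rule discussed in this thread, and the throughput figures are the ones Sutaru quotes (4 WUs/hour on the QX6700, 4 WUs per ~ 6.75 minutes across the GPU host).

```python
def minutes_until_stall(ncpus: int, wus_per_minute: float) -> float:
    """Minutes until pending uploads hit the 'CPUs x 2' limit.

    BOINC (as described in this thread) stops requesting work once
    more than 2 * ncpus results are stuck in the upload queue.
    """
    limit = 2 * ncpus
    return limit / wus_per_minute

# QX6700: quad core, ~4 WUs per hour
cpu_quad = minutes_until_stall(4, 4 / 60)      # -> 120.0 minutes
# GPU cruncher: same 4-core host, 4 WUs every ~6.75 minutes
gpu_quad = minutes_until_stall(4, 4 / 6.75)    # -> 13.5 minutes

print(f"CPU quad stalls after {cpu_quad:.0f} min, "
      f"GPU quad after {gpu_quad:.1f} min")
```

With identical hardware limits, the faster host simply hits the wall roughly nine times sooner.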
Zeus Fab3r Send message Joined: 17 Jan 01 Posts: 649 Credit: 275,335,635 RAC: 597 |
I suggest that following formula should be considered: 'CUDA Stream Processors x2' That'll do the trick if Sutaru agrees... :) Who the hell is General Failure and why is he reading my harddisk?¿ |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I suggest that following formula should be considered: Yes, that would be a good idea! :-) But I don't think it's currently possible, because the SETI@home application and the BOINC client can't communicate that way. Or can they? Maybe we could take the GFLOPS from the BOINC benchmarks instead? 1 GFLOPS = 1 WU. For my GPU cruncher, one OCed GPU has 112 GFLOPS -> 4 x 112 = 448 results allowed in the UL list with work requests still possible; above 448, no work request. Hmm.. if those 448 are normal-AR WUs [0.44x], my GPU cruncher reaches that limit after 448 WUs x 6:45 [m:s] across 4 GPUs = ~ 12.5 hours. So only after ~ 12.5 hours with no UL possible would work requests stop. If the servers play along, it makes ~ 860 'normal' MB WUs/day. If they were all shorties (finished after ~ 2:30): 448 WUs x 2:30 [m:s] across 4 GPUs = ~ 4.5 hours before work requests stop. These calculations assume an idle GPU cruncher that starts crunching. In the short time (~ 2 months) I've had all 4 GPUs installed, it couldn't crunch continuously.. it was idle more often than not. The point is: with a new limit of '> GPUs x X', 'CUDA Stream Processors x X' or 'GFLOPS x X' for GPU crunchers, they could 'bridge' longer UL server outages. |
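The GFLOPS-based limit proposed above can be sketched the same way. All figures come from the post (112 GFLOPS per overclocked GPU, 4 GPUs, ~ 6:45 per normal-AR WU, ~ 2:30 per shorty); the one-result-per-benchmark-GFLOPS rule is the poster's suggestion, not anything BOINC implements.

```python
def bridge_hours(gflops_per_gpu: float, ngpus: int,
                 minutes_per_wu: float) -> float:
    """Hours of crunching before pending uploads hit a
    hypothetical 'one result per benchmark GFLOPS' limit."""
    limit = gflops_per_gpu * ngpus        # allowed pending results
    throughput = ngpus / minutes_per_wu   # WUs finished per minute
    return limit / throughput / 60

normal = bridge_hours(112, 4, 6.75)   # normal AR (~6:45) -> ~12.6 h
shorty = bridge_hours(112, 4, 2.5)    # shorties (~2:30)  -> ~4.7 h
print(f"normal AR: {normal:.1f} h, shorties: {shorty:.1f} h")
```

The results match the figures in the post: roughly half a day of 'bridge' time for normal-AR work, and about 4.5 hours if everything is a shorty.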
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I meant my opening post, if the UL server have probs and with an idle PC which could download 8 WUs. I understood that. What I said was that this should be a pretty exceptional case, and that it is in exactly those exceptional times, when the upload server is going to be down for a long time, that stopping uploads prevents an impossible situation when the server comes back up. But I think you need to be looking for other problems... I don't have nearly the number of uploads you have, but for the past couple of days I've been able to get uploads through more often than not, and I wonder if there isn't something going on at your end. I just had three go through on the first try. I'd sure take a look and see if I could find some other problem. At a minimum I'd probably shut down and reboot. |
john deneer Send message Joined: 16 Nov 06 Posts: 331 Credit: 20,996,606 RAC: 0 |
I just had three go through on the first try. Hi Ned, That remark (I just had three go through ...) sure triggered something in my feet :-) When I read that, I ran up the stairs and turned my machine on immediately. And what do you know: those uploads lingering here all day went through immediately! I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well. Now I'm going to try and get some new units to crunch :-) Regards, John. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well. I haven't seen enough to prove it one way or another, but I think there is a long-term bug in LIBCURL (which BOINC uses, along with a whole bunch of other packages) that hangs on to outdated DNS for a long time. Your machine was off, now it is on, and it works. Could it be that it worked because it did not have some bad info stored somewhere, or a screwed up IP stack that was reloaded on start-up? I haven't seen it enough times personally to have much of a diagnosis, just that I think something might be going on. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
AP stopped splitting work for a while - downloads dropped by 10Mb/s, uploads increased by 30Mb/s. AP appears to have fired up again - downloads jumped 8Mb/s, uploads dropped by about 15Mb/s. EDIT - during that drop in download traffic I had about 70 uploads go through one after the other. Since AP fired up again, nothing's gone through. Grant Darwin NT |
john deneer Send message Joined: 16 Nov 06 Posts: 331 Credit: 20,996,606 RAC: 0 |
I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well. That machine was turned off and on (and off again :-) several times over the last three days. Reboots, flushing the DNS, restarting BOINC and whatever else over the last 24 hours or so. Uploads went gaga somewhere yesterday, and I kept getting messages that indicated things weren't going all too well. I hadn't changed any of the software on that machine for weeks, and all of a sudden a couple of days ago it went into a state where it couldn't upload or download continuously (even when the 2xcpu+1 limit wasn't in play). Mostly it correlates with high readings on the cricket graph, but sometimes it's worse than high traffic alone would explain - some of it is probably the fiddling going on at Berkeley :-) And indeed, I got my first 20 or so units just a few seconds ago. Hurray, and thanks again for mentioning those successful uploads :-) Regards, John. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
As Ned mentioned, there are times when the TCP stack can become corrupt: the longer the machine runs, the more browser windows are opened and closed, the more things contact the network (email etc.), plus BOINC itself... There are many ways something could come in and corrupt the network stack (continuous retries to send work, with errors returned). The easiest fix for a user is to shut down, wait 15 seconds, and restart the computer. Please consider a Donation to the Seti Project. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I'm pretty sure there's nothing wrong with Sutaru's system, I have been having the same problems all day (and the day before that) as well. Ned, you know about DNS, TCP/IP and suchlike. Could you have a look at a rather old BOINC thread, please? (So old that the highlighting markup is no longer compatible.) DNS caching in 6.2.18 and onwards. I saw a number of cases where a libcurl retry reversed the order of the IP elements. Nobody else took any notice, but it seemed like a bug at the time. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
It'd take me a minute to find, since RFC-1034 and RFC-1035 don't use the word "random" to describe the behaviour. Whenever you query DNS, the order in which records are returned should be randomized. The idea is that if one DNS name points to two IP addresses, half of the traffic goes to one and half to the other, thanks to that randomization. The biggest single problem: the RFC is unclear about who should randomize. Some DNS servers (Microsoft's, unless it has been fixed recently) do not - they assume the resolver or the client will randomize. Some resolvers assume the server or the client will take care of it. I think most clients assume the randomization happened elsewhere. What *SHOULD* happen is that everyone assumes nothing else randomized, and reshuffles the answers. Two consecutive queries should never return the same records in the same order, unless the record types themselves are all different (the special case where randomizing doesn't help). Never. The case in the other thread is not a bug. What I'm not sure about in LIBCURL (since I don't normally use it) is how it honors TTL (which it does not get from the resolver if it calls "gethostbyname()") and how it handles failed responses. In my opinion, it would probably be better to just turn DNS caching in LIBCURL off and let the underlying OS handle it (since all the modern ones cache DNS locally anyway). |
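The "assume nothing else randomized, and reshuffle" rule described above can be illustrated with a toy helper. This is a sketch, not libcurl's behaviour; the second address in the example pool is made up for illustration, only 208.68.240.13 appears later in the thread.

```python
import random

def shuffled_answers(records: list[str]) -> list[str]:
    """Return the A records in a fresh random order.

    A client following the rule above assumes neither the DNS
    server nor the resolver randomized the answer set, and
    reshuffles it on every query so that consecutive lookups
    spread connections across all the addresses.
    """
    answers = list(records)   # never mutate the cached copy
    random.shuffle(answers)
    return answers

# Two addresses behind one name, round-robin style:
pool = ["208.68.240.13", "208.68.240.14"]
print(shuffled_answers(pool))
```

Because each caller reshuffles independently, it does not matter how many layers skipped their turn; the load still ends up spread.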
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
For ~ 25 minutes BOINC could UL.. and request new work. BOINC has downloaded ~ 360 WUs by now. The 'problem' is that my GPU cruncher does ~ 860 normal-AR WUs/day, which means ~ 860 result ULs/day - and if only shorties, ~ 2.7x more ULs/day. I also have only 'DSL light': 384/64 kbit/s DL/UL. More isn't possible because of the thin cable of the T-Com* in our village - I have a cable problem just like Berkeley.. ;-) And if > 8 results sit in the UL list in BOINC, no work request. By my calculation above, that can happen after ~ 13 minutes or less.. So if the UL server goes offline for some hours while the cache is low, the GPU cruncher will be idle again. Now, ~ 20 minutes later: problems with ULs again. ~ 21 results in the UL list and no new work request. Like I said, if the UL server stays offline for ~ 10 hours now, my GPU cruncher will idle again (with only ~ 360 normal-AR WUs; with shorties and/or killed VLARs, even earlier). It would be fairer if GPU crunchers had a higher limit - '> GPUs x X' (or something similar based on CUDA/shader cores or GFLOPS) instead of '> CPUs x 2'. BTW, the GPU cruncher is a pure crunching machine: only BOINC active, no reboot in the last ~ two days. [* T-Com owns (nearly) all telephone/DSL connections in Germany - the former monopolist.] |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
As Ned mentioned there are times that the TCP stack can become corrupt. The longer it runs and the more browser windows open and closed. Things that contact the network (email etc) and the Boinc... There are many possibilites that something could come in corrupt the network stack (continous retries to send work and having errors returned). I lost track of the friend who taught me this, but there are two reasons to try anything. First is because you think it'll fix the problem, and the second is because you don't think it'll fix the problem, but it's quick and easy to try, and you might get lucky. Rebooting is a "quick, I might get lucky" kind of thing. I find that it works more often than not, and that sometimes luck is better than skill. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I have also only 'DSL light' 384/64 DL/UL. [kbit/s] I wonder what would happen if you told BOINC to limit to 48 kbit/sec., upload and download. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Whenever you query DNS, the order in which records are returned should be randomized. No, no - that wasn't it. Libcurl looked up a domain - say boinc2.ssl.berkeley.edu. It got an address (any address) - say 208.68.240.13. Some time later, it tried to re-use the same address - but actually attempted connection to: 13.240.68.208 (there was colour back in those days....) |
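The reversal Richard describes is exactly the signature of a byte-order (endianness) mix-up on an IPv4 address: the four octets come back in reverse. A minimal demonstration, using only the address from his example:

```python
import socket

def reverse_octets(dotted: str) -> str:
    """Re-render an IPv4 address as if its four bytes had been
    stored in the wrong byte order (a classic endianness bug)."""
    packed = socket.inet_aton(dotted)       # 4 bytes, network order
    return socket.inet_ntoa(packed[::-1])   # read them backwards

print(reverse_octets("208.68.240.13"))  # -> 13.240.68.208
```

That the bad address is a perfect octet reversal of the good one points at code handling the cached 32-bit address, not at anything DNS returned.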
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Actually, this looks like a different bug. If there are two "A" records, LIBCURL should try to connect to both addresses, regardless of the order. ... at least in your example, it appears to have only tried one. It appears that a call to res_init() after a failure would correct a lot of sins, but I'm not an expert on LIBCURL, and don't claim to play one on TV. |
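Trying every returned address instead of only the first, as Ned says a client should, is the standard getaddrinfo loop. A sketch of the pattern in Python rather than C, with names of my own choosing:

```python
import socket

def connect_any(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Try every address DNS returns for `host`, in order, and
    return a socket for the first one that accepts the connection.
    Only if all of them fail does the last error propagate."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock            # first reachable address wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError(f"no addresses for {host}")
```

With two "A" records, a client built this way survives one dead address; a client that only ever tries the first record does not.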
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
But what changed 208.68.240.13 to 13.240.68.208 ? It wasn't DNS. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Some time later, it tried to re-use the same address - but actually attempted connection to: 13.240.68.208 Oh, wow. That's a huge, huge bug. Can't imagine that hasn't been fixed. There is a part of me that says that BOINC should set CURLOPT_DNS_CACHE_TIMEOUT to 0 (disable caching). There is also a call, res_init(), that would pick up DNS server changes made by DHCP; I suspect calling it after any failure would be a good idea - it is mentioned in the LIBCURL documentation. |
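Setting a DNS cache timeout to 0 simply means "expire every cached lookup immediately and ask the OS resolver each time". A toy cache with the same timeout semantics - the class and its names are mine for illustration, not libcurl's internals:

```python
import socket
import time

class DnsCache:
    """Tiny lookup cache with curl-style timeout semantics:
    timeout > 0 keeps entries that many seconds; timeout 0
    disables caching (every call hits the real resolver)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._cache: dict[str, tuple[float, str]] = {}

    def resolve(self, host: str) -> str:
        now = time.monotonic()
        hit = self._cache.get(host)
        if hit and self.timeout > 0 and now - hit[0] < self.timeout:
            return hit[1]                    # entry still fresh, reuse it
        addr = socket.gethostbyname(host)    # ask the OS resolver
        self._cache[host] = (now, addr)
        return addr

cache = DnsCache(timeout=0)   # the 'let the OS handle it' setting
print(cache.resolve("localhost"))
```

With timeout=0 a stale address can never be re-used after a failure, at the cost of one resolver call per transfer; since modern systems cache DNS locally anyway, that cost is small.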
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.