Message boards : Number crunching : Panic Mode On (18) Server problems
OzzFan · Joined: 9 Apr 02 · Posts: 15692 · Credit: 84,761,841 · RAC: 28

Ned, I couldn't agree more. People see "bandwidth" and they think "size", and since everyone knows AP is bigger, they rest their case that AP is the problem. That logic is overly simplistic; it involves no investigation or sleuthing to find the actual cause.
TCP JESUS · Joined: 19 Jan 03 · Posts: 205 · Credit: 1,248,845 · RAC: 0

The Seti@Home Project should then BAN all CUDA devices from the project, as well as CPUs capable of greater than 1553.93 million ops/sec floating point speed or 3313.65 million ops/sec integer speed. If the network clearly can't handle the increase in users and the technology that THEY (the new users) bring to the table, we should all be forced to use single-core Pentium 4 class CPUs... with Hyper-Threading being a grey area open for discussion.

Calling work unit sizes the 'cure' is like arguing over which is heavier: a pound of feathers or a pound of bricks.

In another thread, it was pointed out to me that the whole purpose of Seti@home is to show what can be done on a limited budget, yet while Berkeley retains its 'limited budget', there are some here who have built CRUNCH-ONLY machines (in multiples, sometimes) worth more than $2,000 USD each... and for what? For THEIR enjoyment first and foremost (and the reward of RAC standings), and the project's use second.

I for one LOVE to contribute to a cause. I DO NOT, however, like being told HOW to contribute... that in itself is nearly a deal breaker.

&lt;putting on flame suit&gt;

Allan

I am TCP JESUS... The Carpenter Phenom Jesus... and HAMMERING is what I do best! Formerly known as... MC Hammer.
DPRGI - Luivul · Joined: 24 Jan 03 · Posts: 17 · Credit: 20,639,801 · RAC: 0

The problem isn't AP; it's the sheer number of WUs out in the field. The volume of WUs processed by the new systems, like CUDA and faster multi-core CPUs, creates a very large amount of data to move, and the bandwidth is not enough for that kind of traffic. The number of active crunchers is very high.
OzzFan · Joined: 9 Apr 02 · Posts: 15692 · Credit: 84,761,841 · RAC: 28

> The Seti@Home Project should then BAN all CUDA devices from the project as well as CPUs that are capable of greater than 1553.93 million ops/sec Floating Point speed as well as 3313.65 million ops/sec Integer speeds.

You make this statement, then you close by saying you don't like being told HOW to contribute. Wouldn't this statement be the same thing as telling others HOW to contribute?

In the recent past, when AP v5.03 was plentiful out in the field and MB was plentiful for CUDA and CPU alike, the servers had no problem keeping up with demand. The problem is that AP dried up, pushing everyone onto MB instead. The shorter workunits mean that everyone is asking for more work, more often. Once the crunchers are saturated with all the AP they can handle, things will return to "normal" (if you can define that). The servers simply over-stress themselves when everyone is asking for a lot at once. Once those requests are satisfied, and people spend more time crunching AP than asking for more work, the connection issues will go away.

> Calling work unit sizes the 'cure' is like arguing what is heavier ? a pound of feathers or a pound of bricks ?

Not entirely. If I have three students asking for work, and I give the first two only a single short task each while I give the third student a longer task, the first two students will come back asking for more work more often. If I want to stop answering their requests so frequently, it would be wise to give all three students the longer tasks so they don't ask as often.

> In another thread, it was pointed out to me that the whole purpose of Seti@home is to show what can be done on a limited budget, yet while Berkley retains it's 'limited budget', there are some here that have build CRUNCH-ONLY machines (in the multiples sometimes) that are worth more than $2,000 USD each.....and for what ? for THEIR enjoyment 1st and foremost (and the reward of RAC standings) and the Project's use 2nd.

Each volunteer is allowed to donate as many spare cycles as they wish. This is another example of giving people their choice of HOW they want to contribute. Personally, I have a computer-museum type hobby, and I'd like each machine to be doing something useful, since I can't stand to see all that power go to waste. However, since I can no longer afford the electricity, the majority of my machines have been powered down; you can see this from the last connection attempt they made, weeks ago. Also note that half the active computers are friends' machines attached to my account.

> I for one LOVE to contribute to a cause. I DO NOT however like to be told HOW to contribute......that in itself is nearly a deal breaker.

Just remember those statements when making suggestions. :)

> &lt;putting on flame suite&gt;

Only those who seek controversy need to wear flame-retardant suits. ;)
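To put rough numbers on the classroom analogy above, here is a back-of-the-envelope sketch in Python. The task durations and the single-task-per-request assumption are illustrative guesses, not project figures; the point is only how strongly task length drives scheduler contact rates.

```python
# Back-of-the-envelope: scheduler requests per day as a function of task length.
# All durations below are illustrative assumptions, not measured project figures.

SECONDS_PER_DAY = 86_400

def requests_per_day(task_seconds: float, tasks_per_request: int = 1) -> float:
    """A host that finishes one task every `task_seconds` and fetches
    `tasks_per_request` tasks per scheduler contact must contact the
    scheduler this many times per day just to stay busy."""
    return SECONDS_PER_DAY / (task_seconds * tasks_per_request)

# Hypothetical durations: a short MB task on a fast GPU vs. a long AP task on a CPU.
mb_gpu = requests_per_day(task_seconds=600)      # ~10-minute MB "shorty"
ap_cpu = requests_per_day(task_seconds=72_000)   # ~20-hour AP unit

print(f"MB on GPU: {mb_gpu:.0f} requests/day")   # 144 requests/day
print(f"AP on CPU: {ap_cpu:.1f} requests/day")   # 1.2 requests/day
```

Under these assumptions the same host generates over a hundred times more scheduler traffic on short tasks, which is the argument being made: the per-request overhead, not the payload size, is what saturates the servers.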
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

Ned,

People see bandwidth and they think "wire speed", and they don't realize that every component has something analogous to bandwidth.

My personal thought is that the real issue is the sheer number of TCP "SYN" packets hitting the upload and download servers. There is no practical way, in the current BOINC client, to tell the client to sit down, shut up, be quiet, and wait its turn.

Am I right? I don't know. But banning people or technology without actually analyzing the data is not an answer. The truth, no matter how much we'd like to think otherwise, is that we simply don't have the metrics, and we don't have the access to get them.
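For a sense of what a "wait your turn" mechanism could look like, here is a minimal sketch of randomized exponential backoff, a standard way for clients to avoid stampeding a recovering server. This is purely illustrative; it is not the BOINC client's actual retry logic, and `try_scheduler_rpc` is a hypothetical stub.

```python
import random
import time

def try_scheduler_rpc() -> bool:
    """Stand-in for one scheduler contact; returns True on success.
    (Hypothetical stub so the sketch is self-contained.)"""
    return random.random() < 0.2   # pretend the server is overloaded 80% of the time

def backoff_delay(attempt: int, base: float = 60.0, cap: float = 3600.0) -> float:
    """Randomized ("jittered") exponential backoff: each failure doubles the
    average wait, and the jitter spreads clients out so a recovering server
    isn't hit by everyone in the same second."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in range(6):
    if try_scheduler_rpc():
        break
    time.sleep(backoff_delay(attempt))   # use a much smaller base to demo quickly
```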
TCP JESUS · Joined: 19 Jan 03 · Posts: 205 · Credit: 1,248,845 · RAC: 0

> The Seti@Home Project should then BAN all CUDA devices from the project as well as CPUs that are capable of greater than 1553.93 million ops/sec Floating Point speed as well as 3313.65 million ops/sec Integer speeds.

No, that was my poor attempt at sarcasm..... lol

I am TCP JESUS... The Carpenter Phenom Jesus... and HAMMERING is what I do best! Formerly known as... MC Hammer.
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

> In another thread, it was pointed out to me that the whole purpose of Seti@home is to show what can be done on a limited budget, yet while Berkley retains it's 'limited budget', there are some here that have build CRUNCH-ONLY machines (in the multiples sometimes) that are worth more than $2,000 USD each.....and for what ? for THEIR enjoyment 1st and foremost (and the reward of RAC standings) and the Project's use 2nd.

While I admire those who have spent money to build the biggest, baddest, fastest cruncher possible, it's important to note that Berkeley does not ask for anything but spare CPU cycles.

... and while it's true that BOINC is supposed to make distributed processing on a very limited budget possible, I don't think anyone thought budgets would be this tight. In another post, there is a graph showing 180,000 active crunchers. Could that possibly be a 60,000:1 ratio of participants to staff (180,000 volunteers to roughly three paid staff)? A couple more people would likely make a big difference, and still show what can be done on a vanishingly small budget.

When you get right down to it, this is an exercise in engineering: how much can we squeeze into how little, what problems will emerge, and what can be done to mitigate the issues found in the process?
TCP JESUS · Joined: 19 Jan 03 · Posts: 205 · Credit: 1,248,845 · RAC: 0

Fair enough, but perhaps the scientists need to take into account "The Human Condition"? They say auto racing was born the moment the second automobile rolled off the assembly line... and something similar rings true here in BOINC-Land. The minute that STATS pages and engines were implemented, it became MORE than a simple 'donation' of CPU cycles: IT BECAME A GAME ;)

Without STATS, the membership might be WAY down... but I can bet the level of malcontent would be down as well among 'crunchers' either running out of work or failing to upload completed work ;)

Sound plausible?

Allan.

I am TCP JESUS... The Carpenter Phenom Jesus... and HAMMERING is what I do best! Formerly known as... MC Hammer.
Westsail and *Pyxey* · Joined: 26 Jul 99 · Posts: 338 · Credit: 20,544,999 · RAC: 0

ruh roh! lol, just checked my pendings...

Pending credit: 5,518,866,941.93

ermm... Houston?

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

Allan,

This isn't a case of what the project should do, but of how long it takes to do all the things that need to be done, in the time available and with the available staff.

I see two common assumptions that really aren't fair:

1) "If I don't see it DONE, nobody at the project wants to do it."
2) "If I don't get my (uploads/downloads/stats), it is because the project doesn't care about the crunchers."

From reading what Matt says in the technical news, I don't see a simple solution to the current issues -- and as observers, we've seen bumpy times, and we've seen that bumpiness smooth out. It's possible that anything (disruptive) the SETI staff might do now would simply take away resources and make the recovery take longer. Even more bandwidth, or a (needed) mega-server or two, would help later, but would only add to the problems now.

But to take "things aren't working well" and turn it into "don't they understand that we're IMPORTANT!" is both narcissistic and unfair. They're doing the best they can with what they have available, and things will get better, hopefully sooner rather than later.

-- Ned
Terror Australis · Joined: 14 Feb 04 · Posts: 1817 · Credit: 262,693,308 · RAC: 44

> The Seti@Home Project should then BAN all CUDA devices from the project as well as CPUs that are capable of greater than 1553.93 million ops/sec Floating Point speed as well as 3313.65 million ops/sec Integer speeds.

Irony
JimHilty2 · Joined: 30 Apr 03 · Posts: 75 · Credit: 7,199,464 · RAC: 0

Weeee! Flamed by General Ludd and Ozzy in the same evening. I'm honoured lol. Pity I missed it, but it was after midnight here.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 14023 · Credit: 208,696,464 · RAC: 304

> Weeee! Flamed by General Ludd and Ozzy in the same evening. I'm honoured lol. Pity I missed it, but it was after midnight here.

It would appear you're confused as to what flaming is. Flaming.

Grant
Darwin NT
JimHilty2 · Joined: 30 Apr 03 · Posts: 75 · Credit: 7,199,464 · RAC: 0

> Weeee! Flamed by General Ludd and Ozzy in the same evening. I'm honoured lol. Pity I missed it, but it was after midnight here.

Can absolutely nothing be said tongue in cheek on this board lately?
ML1 · Joined: 25 Nov 01 · Posts: 21985 · Credit: 7,508,002 · RAC: 20

> Can absolutely nothing be said tongue in cheek on this board lately?

Must be the heat and Global Warming!

Happy hot crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3

> There is no practical way in the current BOINC client, to tell the client to sit down, shut up, be quiet, and wait its turn.

Maybe not in the client, but there is in the back-end.

Very strict:

`<one_result_per_host_per_wu/>`
If present, send at most one result of a given workunit to a given host.

Less strict:

`<max_wus_in_progress> N </max_wus_in_progress>`
`<max_wus_in_progress_gpu> M </max_wus_in_progress_gpu>`
Limit the number of jobs in progress on a given host (and thus limit average turnaround time). (Needs a BOINC 6 client.)

And of course:

`<next_rpc_delay>x</next_rpc_delay>`
In each scheduler reply, tell the clients to do another scheduler RPC after at most X seconds, regardless of whether they need work. Just set it to more than the 11 seconds it is now.
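As a rough upper-bound model of why the reply-delay knob matters, here is a hedged Python sketch. It assumes the delay acts as a minimum wait before a host contacts the scheduler again (strictly true of the `<request_delay>` discussed later in the thread, while `<next_rpc_delay>` is an "at most"); the 180,000-host figure comes from earlier in this thread, and the delay values are illustrative.

```python
# Rough ceiling on aggregate scheduler RPC rate imposed by a per-reply delay.
# Assumes every reply makes the client wait at least `min_delay_s` seconds.

def aggregate_rpcs_per_second(active_hosts: int, min_delay_s: float) -> float:
    """If each host waits at least `min_delay_s` between scheduler contacts,
    the whole population can generate at most this many RPCs per second."""
    return active_hosts / min_delay_s

hosts = 180_000   # active-host figure cited earlier in the thread
for delay in (11, 100, 600):
    print(f"delay {delay:>3} s -> at most {aggregate_rpcs_per_second(hosts, delay):,.0f} RPCs/s")
# delay  11 s -> at most 16,364 RPCs/s
# delay 100 s -> at most 1,800 RPCs/s
# delay 600 s -> at most 300 RPCs/s
```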
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

> Maybe not in the client, but there is in the back-end.

I would agree. I don't think the BOINC infrastructure can cope with 4,422,597 results out in the field, and the corresponding traffic loads, for what are, in very many cases, very short computation durations. By definition, BOINC can't have been pre-tested at this sort of stress level; we're performing that test now. But do these controls provide the necessary level of "KEEP CALM and CARRY ON"? I don't think they do.

> Very strict:

This basically says "no self-validation", and I think it's been in place at SETI since the very beginning. [We were very surprised when it was relaxed at Beta recently.] It will very rarely slow down the rate at which work is issued, because there are so many other active hosts to choose from; on the other hand, the additional rule-checking probably adds to the server CPU and database load. Nothing to be gained here.

> Less strict:

This might well help, especially in the GPU case (where you could be sure of a BOINC 6 client), but what value would you choose for M? If M is an absolute fixed number (which it seems to be, from the wording), then IMO BOINC has been short-sighted and provided an inappropriate tool. Vyper's top host, with 8 GPUs, would be very hard hit by a value of M that would be completely irrelevant to, and have no effect on, a slower host. For the project to impose a value of M low enough to curb the majority of GPU hosts would be a difficult political decision. It would have been better if BOINC had provided a tool that operated more proportionately across the wide range of GPU speeds, from 8400GT to Tesla.

> And of course:

Those sound like two separate and distinct settings. Increasing the delay between RPC attempts -- from the current 11 seconds back to the 10 minutes it was when I first ran SETI/BOINC -- might well soothe things a bit, but we sure as heck don't want an "at most" value forcing unnecessary redundant RPCs onto an already overcrowded pipe (BOINC v6.6 supplies more than enough of those already).

Personally, I still feel that a lower cap than 10 days on the maximum amount of work cached would be the fairest way of reducing the amount of work held 'live' on the project's servers and in the project's database. If that could be made a server-side project configuration parameter, then people who have a genuine reason for large caches could still choose, and use, the maximum client setting on other projects (we occasionally hear from mariners who crunch BOINC but are unable to report results for months at a time; they tend to run CPDN).
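For a sense of what the "proportional tool" asked for above might look like, here is a hedged Python sketch: scale the per-host in-progress cap by the host's measured throughput instead of using one fixed M. The function name, the target-hours figure, and the scaling rule are all invented for illustration; nothing like this is claimed to exist in the BOINC server code.

```python
# Hypothetical proportional in-progress cap: instead of a fixed M per GPU,
# scale the limit by how fast the host actually turns work around.
# Every name and constant here is an illustrative assumption.

def max_in_progress(results_per_hour: float,
                    target_hours_buffered: float = 4.0,
                    floor: int = 2,
                    ceiling: int = 400) -> int:
    """Allow roughly `target_hours_buffered` hours of work in flight:
    a slow host gets a small cap, an 8-GPU cruncher a large one, but
    nobody holds more than a few hours of 'live' results."""
    cap = round(results_per_hour * target_hours_buffered)
    return max(floor, min(ceiling, cap))

print(max_in_progress(0.5))    # 8400GT-class host:   2
print(max_in_progress(6.0))    # mid-range GPU:      24
print(max_in_progress(60.0))   # 8-GPU cruncher:    240
```

The design point is that the same rule throttles everyone by the same yardstick (hours of buffered work) rather than by a count that is harsh for fast hosts and meaningless for slow ones.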
W-K 666 · Joined: 18 May 99 · Posts: 20033 · Credit: 40,757,560 · RAC: 67

Instead of being annoyed at the present situation, maybe it would be better if we put our combined brains together and came up with a better system that would reduce traffic in times of server stress. My suggestion (sketched in code below) is that when making a request, the host tells the server:

- what processors it has,
- the classes of tasks it can process (MB-CUDA, MB-CPU, AP-V503, AP-V505, etc.),
- and, for each class, the number of tasks on hand and their total predicted completion time.

Based on that info, and without further checking, during stressful periods the server either tells the host to come back later, if it can keep crunching for the next X hours without further tasks (with "come back later" being ~0.5 * X hours), or allocates enough tasks in the classes where work is required to keep the host crunching for the next X hours.

It seems utterly stupid to me to see a request for 86 sec of work when I have a two-day cache, i.e. my cache is 99.95% full. And at the opposite end of the scale, to get no tasks when I have only two SETI tasks on my computer, both already started, because my cache is totally filled by a project that was already heavily into LTD before the stress period. I now have an Einstein cache of nearly 5 days. IMHO, BOINC does not work as intended.

[edit] I agree with Richard that the max cache needs to be changed; my suggestion is to make it one day (24 hours) less than the shortest deadline among the projects attached.
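Here is a hedged Python sketch of what that request/response exchange might look like. The message fields, the X-hour threshold, and the ~0.5 * X deferral follow the suggestion above; every identifier, plus the interpretation that a host "can keep crunching" only if every task class has work, is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ClassStatus:
    task_class: str          # e.g. "MB-CUDA", "MB-CPU", "AP-V505"
    tasks_on_hand: int
    predicted_hours: float   # total predicted completion time for those tasks

@dataclass
class WorkRequest:
    host_id: int
    processors: str          # free-form hardware summary
    classes: list[ClassStatus]

def handle_request(req: WorkRequest, stressed: bool, x_hours: float = 12.0):
    """Server side of the suggested protocol: during stress, if the host can
    keep crunching for X hours, defer it for ~0.5 * X hours; otherwise top
    it up to X hours in the classes that are short of work."""
    # Assumption: the host stays busy only if *every* class has work queued.
    buffered = min(c.predicted_hours for c in req.classes)
    if stressed and buffered >= x_hours:
        return {"action": "come_back_later", "delay_hours": 0.5 * x_hours}
    shortfall = {c.task_class: x_hours - c.predicted_hours
                 for c in req.classes if c.predicted_hours < x_hours}
    return {"action": "send_work", "hours_by_class": shortfall}

req = WorkRequest(1, "4-core CPU + 1 CUDA GPU",
                  [ClassStatus("MB-CUDA", 3, 0.5), ClassStatus("MB-CPU", 40, 30.0)])
print(handle_request(req, stressed=True))
# {'action': 'send_work', 'hours_by_class': {'MB-CUDA': 11.5}}
```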
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3

> Less strict:

The full text of that option is: "Limit the number of jobs in progress on a given host (and thus limit average turnaround time). Starting with 6.8, the BOINC client reports the resources used by in-progress jobs; in this case, the max CPU jobs in progress is N*NCPUS and the max GPU jobs in progress is M*NGPUS. Otherwise, the overall maximum is N*NCPUS + M*NGPUS."

> It would have been better if BOINC had provided a tool which operated more proportionately across the wide range of GPU speeds, from 8400GT to Tesla.

Since the ATI cards aren't detected yet, this may still be an option to add. Why not write an enhancement ticket, so the devs can think about how to do it, come 6.10?

But mind that restricting when clients may contact home base is a bit difficult, because there are not enough seconds in a day. With 86,400 seconds in a day and 180,000 hosts out there, even on a normal day an average of 2.1 hosts per second will request work. And that's only in an ideal world, where a supercomputer with a super database keeps track of which 2.1 hosts may contact the server this second and gets their work transported to them in the remaining 0.9 seconds, before the next batch is allowed to contact home base. :)
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

Ageless wrote:
> ...

Right idea, wrong parameter. That one tells each client it should contact the server periodically even if not otherwise needed, and it isn't used here. The relevant setting is:

`<min_sendwork_interval> N </min_sendwork_interval>`
Minimum number of seconds between sending jobs to a given host. You can use this to limit the impact of faulty hosts.

It is set to 10 seconds for this project, which is what causes scheduler replies to carry the 11-second `<request_delay>` value. I think increasing min_sendwork_interval to 100 seconds might be helpful.

What would probably be better is added code to lengthen the `<request_delay>` proportionally as the available bandwidth limit is approached. But I don't know what reliable indicator they have which could be used as input to that kind of scheme.

Joe
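A hedged Python sketch of what that proportional scheme might look like. The 11 s floor matches the reply delay mentioned above; the utilization input, the knee, the ceiling, and the linear ramp are all invented, and, as noted, the hard part in practice is finding a reliable load indicator to feed in.

```python
# Hypothetical dynamic request_delay: stretch the delay as the pipe fills up.
# The 11 s floor matches the project's current reply delay; everything else
# (thresholds, ceiling, the utilization measurement itself) is assumed.

def request_delay(bandwidth_utilization: float,
                  base_s: float = 11.0,
                  knee: float = 0.7,
                  max_s: float = 600.0) -> float:
    """Below `knee` (70% utilization) reply with the normal delay; above it,
    ramp the delay linearly so that at 100% utilization clients are told
    to wait `max_s` seconds before asking again."""
    if bandwidth_utilization <= knee:
        return base_s
    frac = (bandwidth_utilization - knee) / (1.0 - knee)
    return base_s + frac * (max_s - base_s)

for u in (0.5, 0.8, 0.95, 1.0):
    print(f"{u:.0%} utilized -> request_delay {request_delay(u):.0f} s")
# 50% -> 11 s, 80% -> 207 s, 95% -> 502 s, 100% -> 600 s
```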