The Server Issues / Outages Thread - Panic Mode On! (119)
Author | Message |
---|---|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Now, from the point of view of the servers, a WU can be considered a result from the Splitters. But it tends to confuse things (more than they already are) when you send out a result to get a result. A result for the server is a row in the database. It is created by the splitters, or by the validator when it needs to do a resend to resolve an inconclusive. Think of it as an empty form for the result, to be filled in when the host has crunched the task. There is no point in tracking results and tasks in separate tables in the database. They always have a one-to-one mapping between them, so one table is enough. The servers are interested in the results, so a result table it is. SSPs of some other projects use the word 'task' where our SSP uses 'result'. But it is a distinct thing from the workunits, just like here. Colloquial use of workunit to refer to tasks comes from Seti@Home classic. BOINC has used different terminology from the start. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874 |
In the BOINC Manager, there is the Tasks tab. There each Task is synonymous with Work Unit (it has the WU name there as the identifier). Not true. A task name is distinguished from the original workunit name by having the replication number (_0, _1 and so on) appended. This may not always be visible because of the length of the BLC names, but it's there. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
In terms of the database, "task" = "result" and "workunit" = "workunit" = the collection of all replications; the replicated workunits are the tasks or results. Just look at anyone's host "task" list. There are two columns: one labelled "task", which is the individual result, and the other labelled "workunit"; clicking that shows all the replicated tasks/results spawned from that one workunit. Ville is correct about the relationship between the SSP's results returned and waiting for validation and workunits waiting for assimilation. There is no category for "results waiting for assimilation". These are in the ["results returned" AND "awaiting validation"] category. They have been returned and validated, but do not get removed from this category until the entire workunit is assimilated. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago, IIRC with no testing done beforehand; since then everything has been a complete mess. The solution to the problem is well known, at least to those who have a minimal knowledge of how DBs and servers work, but I strongly believe that solution will never happen. The shutdown (or hibernation, as some insist on calling it) is too close to take any real measures to fix the problem. So we are going to live with all these ups & downs until the end. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304 |
Yep. In the BOINC Manager, there is the Tasks tab. There each Task is synonymous with Work Unit (it has the WU name there as the identifier). Not true. A task name is distinguished from the original workunit name by having the replication number (_0, _1 and so on) appended. This may not always be visible because of the length of the BLC names, but it's there. I think I've finally got my head around it. If I think of "Results ready to send" as "Tasks ready to send" and "Current result creation rate" as "Current Task creation rate", it makes more sense. The Splitters produce Work Units, but no Work Units are ever sent out. Only Tasks. Tasks are copies of the WU, with an _0, _1, _2 etc. added on to its name. The data is the same, but each Task is unique. When a WU is split (WUname), that results in 2 "Results ready to send" (i.e. for me "Tasks ready to send"): WUname_0 and WUname_1. So the Current result creation rate (i.e. for me Task creation rate) is actually 2 × the actual splitter output rate (2 Results (Tasks) are produced for each WU produced). When a Task is completed, a Result is produced & returned. By having the Tasks "go out" in the result table, it isn't necessary to move the result from one table (e.g. a Tasks table) to another table (e.g. a Result table) when the Result is returned, as it's already in the Result table. It just gets the appropriate fields filled in (hence processing a result produces a result). Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304 |
Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago. Nope. That had a very slight impact, but it was the combination of file after file of noise bombs, coupled with the increase in Quorum numbers (to keep bad results from the RX 5000 driver issue out of the science database), that blew everything out of the water (check out the graphs to confirm). Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874 |
Yep. https://www.youtube.com/watch?v=uVmU3iANbgk :-) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304 |
I wish it had occurred that quickly. Yep. https://www.youtube.com/watch?v=uVmU3iANbgk :-) Way past my bed time. Grant Darwin NT |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago. Nope. Sorry, I don't agree. The problems started well before the driver issue. The driver problem just made it worse by creating the perfect storm! If you go back to October/November (before the increase of the server limits), everything worked fine and the work was flowing normally. After the change, the problems started. Obviously it took a few days to be noticed, while the number of WUs distributed/assimilated/crunched/whatever reached the edge of the servers' capacity. But any further discussion about who is right or wrong is futile. The real fix will never happen. In my case I will keep crunching until the end, but it is a shame to close (or hibernate) the project after >20 years of running with all this mess happening. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Task is what the SSP means when it says 'result'. It uses the database terminology. The server is interested in the results so it calls the table that has one row for each individual task the result table. . . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. Stephen :( |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
The task limit increase would have made at most a 50% increase (probably a lot less, because not every user has a cache that hits the limit) in the results out in the field and the results waiting for validation (the true validation, not the assimilation queue). The rate of the stuff transitioning between the states shouldn't have changed at all. That can't explain the assimilator problem or the difficulty the replica has keeping up. The remaining explanations are that there is some 'software rot' or that seti@home has simply grown too big. The number of users has climbed slowly and the crunching power of the hardware has grown too. Perhaps we just recently hit the limits of the server hardware. Overflow results now automatically getting triple or quadruple replication certainly contributes to this problem. Also remember the event that happened exactly when the current ongoing problems started: Setiathome upgraded their BOINC server software to a version that had a bug (or misconfiguration) that prevented it from giving any work to anonymous platform hosts, and then rolled back this change. Perhaps that version left some stuff around when the old version was restored, and this stuff is now the cause of the software rot? The perfect storm was a transient thing. It can't explain the problems still remaining months later. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago, IIRC with no testing done beforehand; since then everything has been a complete mess. The solution to the problem is well known, at least to those who have a minimal knowledge of how DBs and servers work, but I strongly believe that solution will never happen. The shutdown (or hibernation, as some insist on calling it) is too close to take any real measures to fix the problem. So we are going to live with all these ups & downs until the end. . . It is easy to blame the server limit increases, and they did complicate the problem, as did the concurrent flood of noise bombs, which all came at a very bad time, but I remain convinced it was the ill-advised OS upgrade to 7.15 that really screwed the pooch. After it was rolled back the servers never behaved quite right again. . . We each have our theory I guess ... Stephen :( |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
. . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. Result is consistently used to mean the task that a single host is crunching, or failing to crunch: either as a synonym for task or, more narrowly, to mean the result the task produced. Some people just seem to use workunit inconsistently, sometimes meaning the actual workunit and sometimes meaning the task/result. Two tasks produce two results. There are exactly as many tasks as there are results. A workunit has multiple (usually two) results. A validated workunit has exactly one canonical result, which the assimilator inserts into the science database. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874 |
. . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. Oh dear. No, no, no. A WU is a workunit, and a workunit is a WU. Think of it as the data file produced by the splitter. A result is a task, and a task is a result. Two words for the same thing. If there are two of them in the field, they are both tasks, and they are both results. There are, however, two tasks/results (at least) for each WU. Each of these words has a precise, specific, scientific meaning. OK, colloquially, they sometimes get muddled up in the general chit-chat here, but when we're looking in detail at the figures (this conversation arose from discussion of the statistics on the SSP), we have to be careful and precise with the words we use. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Replica seconds behind master: 59,870. At what level can we push the panic button? |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Speedy Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery. Maybe when you posted you did something, because the splitters are currently running at over 94 a second. It is certainly better than the 3-point-something that they were running at. At most, we can only have another 2 weekly outages before hibernation, assuming that they decide to do maintenance. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Replica seconds behind master: 66,057. Did we reach the panic threshold? |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Replica seconds behind master: 66,057. I think we passed the panic stage a long time ago. With the replica so far behind, you are operating on pure faith. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304 |
The perfect storm was a transient thing. It can't explain the problems still remaining months later. The Quorum settings weren't transient, they are still in effect. Hence we are still having database bloat, as the bloat was never cleared and the current settings are maintaining that bloat. Grant Darwin NT |
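
For anyone still untangling the terminology argued over in this thread, here is a minimal Python sketch of the workunit/result relationship as the posters describe it: one workunit per splitter output, one result (task) row per replication with the _0, _1, ... suffix, extra replications on a resend, and a result creation rate equal to the splitter rate times the replication count. The class names, field names, and the `example_wu` name are illustrative assumptions, not the actual BOINC server schema or code. The last helper just converts the "Replica seconds behind master" figures quoted above into hours.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Result:
    """One row in the (hypothetical) result table: one task sent to one host."""
    name: str                # workunit name plus replication suffix
    returned: bool = False   # the "empty form" is filled in when the host reports


@dataclass
class Workunit:
    """One splitter output. It is never sent out itself; only its results are."""
    name: str
    results: List[Result] = field(default_factory=list)
    canonical: Optional[Result] = None  # the one result that gets assimilated

    def replicate(self, count: int = 2) -> List[Result]:
        """Create `count` more results, continuing the _0, _1, ... numbering."""
        start = len(self.results)
        new = [Result(f"{self.name}_{i}") for i in range(start, start + count)]
        self.results.extend(new)
        return new


def result_creation_rate(workunit_rate: float, replication: int = 2) -> float:
    """Results (tasks) created per second for a given splitter workunit rate."""
    return workunit_rate * replication


def lag_hours(seconds_behind: int) -> float:
    """Convert a replica lag in seconds into hours."""
    return seconds_behind / 3600


wu = Workunit("example_wu")          # hypothetical workunit name
wu.replicate(2)                      # initial replication: two tasks
print([r.name for r in wu.results])  # ['example_wu_0', 'example_wu_1']
wu.replicate(1)                      # a resend to resolve an inconclusive
print(wu.results[-1].name)           # example_wu_2
print(round(lag_hours(59870), 1))    # 16.6
print(round(lag_hours(66057), 1))    # 18.3
```

On these assumptions the two replica lag figures posted above work out to roughly 16.6 and 18.3 hours behind the master.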