1)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2030638)
Posted 3 Feb 2020. Post:

> My Inconclusive results are going up too, even though I've only had a handful of tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now it's very difficult to bring up my Inconclusive tasks list, but it seems those tasks are now listed as: https://setiathome.berkeley.edu/workunit.php?wuid=3862758806

I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q. = 1, but they are much harder to spot.
https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=
2)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2030121)
Posted 31 Jan 2020. Post:

In my opinion this project needs a new splitting/validation process which can handle the ultra-high performance of present and future GPUs as well as the oldest CPUs. It could be achieved by sending larger chunks of data to fast hosts (expanding in powers of 2, limited by the processing speed of the slowest device (GPU/CPU) in the given system). It needs a new client app as well, as the app should omit the parts of the data poisoned by RFI. I think the transition to that adaptive splitting algorithm is needed now. Please share your ideas! (Besides that it can't be done.)
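The power-of-2 scaling idea above can be sketched in a few lines. The function name, the baseline speed, and the 256x cap (a figure mentioned in a later post in this thread) are illustrative assumptions, not the project's actual splitter logic:

```python
# Sketch of the adaptive splitting idea: scale the chunk length by a power
# of 2 chosen from the measured speed of the slowest device on the host.
# The baseline of 50 GFLOPS and the 256x cap are hypothetical values.

def chunk_multiplier(slowest_device_flops: float,
                     baseline_flops: float = 50e9,
                     max_multiplier: int = 256) -> int:
    """Return a power-of-2 multiplier for the workunit length."""
    ratio = max(1.0, slowest_device_flops / baseline_flops)
    mult = 1
    while mult * 2 <= ratio and mult * 2 <= max_multiplier:
        mult *= 2
    return mult

# A host whose slowest device is ~400x the baseline hits the cap:
print(chunk_multiplier(400 * 50e9))  # 256
```

Keying the multiplier to the slowest device in the system mirrors the constraint stated above: the whole host must still finish its chunks in a reasonable time.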
3)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2030119)
Posted 31 Jan 2020. Post:

> If you have really long work units, then increasing the limits for returned signals is not enough. An RFI spike will fill any reasonable limit and this will then mask all the good parts of the data. Bigger time windows mean more observation time is lost due to these events.

RFI spikes can be easily detected and omitted from the result by the app, so no observation time would be lost. This is why I suggested in another post that the clients process the long workunits in multiple parts that match the size of the current workunits, and produce result data separately for each part. So you could have a result overflow for one part but good results for the rest. This would leave the load on the servers unchanged. Further tweaking and optimizing of client behavior would make the servers' job harder; this isn't the right way. There's no easy way to fix the problems we face.
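A minimal sketch of the per-part idea described above, so an RFI overflow in one part no longer masks the rest of a long workunit. The names and the per-part limit of 30 (the science-database figure mentioned elsewhere in this thread) are illustrative assumptions, not the real client's code:

```python
# Process a long workunit in current-sized parts; each part gets its own
# result record, so an overflowed part doesn't hide its neighbors' signals.
# OVERFLOW_LIMIT mirrors the per-workunit limit discussed in this thread.
OVERFLOW_LIMIT = 30

def process_parts(parts):
    """Return one result record per part; overflowed parts are flagged."""
    results = []
    for idx, signals in enumerate(parts):
        overflow = len(signals) > OVERFLOW_LIMIT
        results.append({
            "part": idx,
            "overflow": overflow,
            "signals": signals[:OVERFLOW_LIMIT] if overflow else signals,
        })
    return results
```

With this layout a spike-filled part reports an overflow, while the parts around it still return their full set of good signals.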
4)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029785)
Posted 29 Jan 2020. Post:

> Beta IS a project in and of itself and does not need to be created. It already exists on the same servers as SETI Prime does. It is there to test new apps and server software, hence the name Beta. I don't see messing with Beta when Prime needs more fixing.

We're discussing ideas this project needs to adopt to get fixed for good. That's what Beta could be used for. Tinkering with the old stuff couldn't achieve that in the long term.
5)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029784)
Posted 29 Jan 2020. Post:

> What do you actually mean by "doubling task size"?

I would go for the 1st option. A task which covers a longer period of time would also mean that less overlap (= less network traffic, less disk space) is necessary for data processing/transfer. The ideal solution would be to send as much data to a host as the actual device (CPU/GPU) could process in 1-2 hours. For example, a very fast host would receive up to 256 times longer chunks of data to process. I can easily spot tasks which were processed by my wingman over 400 times slower; in other words, my host puts a 400 times higher load on the servers than the other host does. This is not necessary. The ability to reduce the workload on the servers should be built into the way the data is split between hosts, as future GPUs will be even faster. I'm aware that the storage limits of a given workunit for the found spikes/pulses/triplets/Gaussians should be increased as well. The 2nd option is also viable, but the 3rd wouldn't change things much. The point is to reduce the number of tasks out in the field, and the number of server-client transactions, to make it easier for the servers to handle their job.
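A back-of-the-envelope check of the load argument above, using the numbers from the post (~30 s per task on a fast host, a wingman ~400 times slower, and the 256x chunk multiplier, here taken as an assumption):

```python
# Rough arithmetic only; the figures come from the post above or are
# illustrative assumptions (the 256x multiplier).
fast_task_s = 30                     # fast host with the special app
slow_task_s = fast_task_s * 400      # wingman ~400x slower
day_s = 86400

fast_results_per_day = day_s / fast_task_s          # 2880.0
slow_results_per_day = day_s / slow_task_s          # 7.2
fast_with_256x_chunks = fast_results_per_day / 256  # 11.25

print(fast_results_per_day, slow_results_per_day, fast_with_256x_chunks)
```

The fast host generates thousands of server transactions per day against the slow host's handful; giving it 256x longer chunks would bring its transaction rate down to the same order of magnitude.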
6)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029776)
Posted 29 Jan 2020. Post:

That trend will follow the uptime/downtime ratio of this project (plus many other aspects).

> This amount is exponentially decaying as we go back in time, but the volunteers of this project can provide the computing power to convert (even re-calculate) that amount of data (as the computing power is exponentially growing), but I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.

> Not as exponential as you might think, as the number of active users has diminished over that period. For several reasons: BOINC and credit screw, to name but two.

The goal should be to reduce downtime (ideally to 0), as the frequent and extended downtime periods resulted in counterproductive user action.
7)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029775)
Posted 29 Jan 2020. Post:

> Greetings,

Perhaps the Beta should be set up to handle a few task-size doublings now, and several more in the future.
8)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029623)
Posted 28 Jan 2020. Post:

> This amount is exponentially decaying as we go back in time, but the volunteers of this project can provide the computing power to convert (even re-calculate) that amount of data (as the computing power is exponentially growing), but I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.

> How many results does the Science database hold, considering it has been running for over 20 years?

> The science database only allows 30 items of interest to be recorded; this is not going to change.

What forbids it to change?
9)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029621)
Posted 28 Jan 2020. Post:

> The science database only allows 30 items of interest to be recorded; this is not going to change.

What law of nature forbids it to change?
10)
Message boards :
Number crunching :
How to Fix the current Issues - One man's opinion
(Message 2029585)
Posted 27 Jan 2020. Post:

This is not only one man's opinion. See my post* regarding this matter in the server issues thread. I thought of starting a new thread about it myself, but here it is.

*EDIT: let me quote myself, as we should discuss it in this thread.

Retvari Zoltan wrote:
> ... this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You could name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing the limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large that
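The table-size arithmetic behind the quoted proposal can be illustrated as follows; the starting row count and the 100+100 current limit are assumptions for the illustration, not published server figures:

```python
# Each doubling of the workunit length halves the per-host task limit and
# (roughly) the number of result rows the server must track for the same
# amount of cached work. The starting row count is hypothetical.
rows = 600_000  # assumed current result-table size

for doublings, limit in [(0, "100+100"), (1, "50+50"), (2, "25+25")]:
    print(f"{doublings} doubling(s), limit {limit}: ~{rows // 2**doublings} rows")
```

The total cached compute time per host stays the same at every step; only the number of database entries representing it shrinks.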
11)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2029359)
Posted 26 Jan 2020. Post:

> The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow.

If the problems persist after all of the effort described above, this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You could name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing the limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large that
12)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2028652)
Posted 20 Jan 2020. Post:

> So E@H doesn't generate as much "heat"?

GPU projects/apps (that I am running), in order of heat generation (i3-4160 3.6GHz, 2x4GB DDR3 1333MHz, RTX 2080 Ti PCIe3.0x16, RAM@13600MHz):
1. GPUGrid / Acemd3 (cuda10) GPU@1700MHz 331W
2. SETI@home / GPU special app (cuda10.2) GPU@1875MHz 325W
3. Einstein@home / O2MDF 2.07 GW-OpenCL-NVidia GPU@1875MHz 295W
4. Einstein@home / FGRPB1G 1.20 OpenCL-NVidia GPU@1875MHz 293W

The power consumption shown is the peak average power consumption while the task is running. The long-term heat output is the highest for GPUGrid, as it runs for 1h41m without significant change in power consumption. The SETI@home special app has a lower heat output in the long term, as it frequently drops to 95W during the workunit change (this can be fixed by using the mutex build). Einstein@home also has a lower long-term heat output than 295W, as it drops to ~130W at 99% with ~250W spikes.
13)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2028531)
Posted 19 Jan 2020. Post:

> Off topic, but IMHO the next bottleneck of the project in the following years is the growth of GPU capacity. Today a top GPU can crunch a WU in less than 30 secs. So a host with 10 of these GPUs produces 100 WUs in each 5-minute ask-for-new-work cycle. With the arrival of the Ampere GPUs that number will rise even more. Feeding them on this 5-minute cycle will be an impossible task on such coming multi-GPU monsters, which will probably also run with a lot of CPU cores (maybe more than 1 CPU).

My point was that this bottleneck is present in the system right now. The overhead on the crunchers' computers is one thing; the other is that the servers are crushed on a daily basis by the overwhelming number of results they have to deal with. It is clear that the fastest hosts need more work to survive the outages without running dry, but making the queues longer by allowing more elements in them made this situation worse, so it's quite logical to make the elements longer instead. That would be a real win-win situation: less administration on the server side, more time before the work queue runs dry on the client side. Fewer client-server transactions mean faster recovery after an outage. Even if the server hardware is upgraded, the increase in computing power out in the field (with the arrival of new GPUs) could have the same effect on the new servers very soon.
14)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2028472)
Posted 19 Jan 2020. Post:

> Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.

The other way to catch up with the computing power that state-of-the-art computers provide is to make the workunits longer, provided that their length is not hard-coded into the apps. (Is the length of the tasks hard-coded into the apps?) The state-of-the-art GPUs can process a workunit (with the special app) in less than a minute (~30 secs), so the overhead of getting this workunit to be actually processed takes time (~3 sec) comparable to the processing itself. This approach would lower the impact of this overhead and make the tables shorter at the same time. The number of max queued tasks per GPU/CPU could be reduced as well.
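The overhead argument above in numbers: with ~3 s of fetch/report overhead per ~30 s task, roughly 9% of wall time goes to overhead, and longer tasks shrink that fraction. Treating the per-task overhead as constant across task lengths is an assumption for the illustration, not a measurement:

```python
# Overhead fraction of wall time for increasingly long tasks.
# Both base figures come from the post above; the constancy of the
# overhead across task lengths is assumed.
overhead_s = 3.0      # ~3 s to get a workunit to actually start processing
base_crunch_s = 30.0  # ~30 s per workunit on a state-of-the-art GPU

for mult in (1, 2, 4, 8):
    crunch_s = base_crunch_s * mult
    frac = overhead_s / (overhead_s + crunch_s)
    print(f"{mult:>2}x length: overhead is {frac:.1%} of wall time")
```

Already at 4x the overhead share drops from ~9% to ~2.4%, while the server also tracks a quarter as many rows for the same observation time.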
15)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2027933)
Posted 16 Jan 2020. Post:

I get:

Database Error
Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 757
Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 766

when I try to access my "All tasks" list.
16)
Message boards :
Number crunching :
New CUDA 10.2 Linux App Available
(Message 2024948)
Posted 25 Dec 2019. Post:

> As per the ReadMe, the 10.2 version in the All-In-One requires Pascal and higher. Maxwells run just about as well on CUDA 9.0. If you add Maxwell to the 10.2 app, the app will be around 220MB instead of around 200MB. The larger you make the app, the slower it will run for everyone. For the newer GPUs it's best to keep Maxwell out of the app. That's why it says Maxwell can use the CUDA 9.0 app; it won't work with the smaller app.

Thanks for this clarification. I got lost where you said "...that means Maxwell will still work with 10.2, but MAY be removed on Newer ToolKits". In that sentence 10.2 refers to the CUDA version, not to your app. Now it's clear.
17)
Message boards :
Number crunching :
New CUDA 10.2 Linux App Available
(Message 2024945)
Posted 25 Dec 2019. Post:

Your version (the CUDA 10.2 mutex build) is working fine on the same host. (I can't give a link to a finished WU at the moment, as the servers are struggling.)
18)
Message boards :
Number crunching :
New CUDA 10.2 Linux App Available
(Message 2024936)
Posted 25 Dec 2019. Post:

> As far as the Release Notes, this is all you need to read: "Note that support for these compute capabilities may be removed in a future release of CUDA." That means Maxwell will still work with 10.2, but MAY be removed on newer ToolKits. nVidia has been complaining about Clang 6.0 since ToolKit 10.0, but it still works with ToolKit 10.2 even though it has been listed as 'deprecated' for some time now.

The CUDA 10.2 special app throws "SIGSEGV: segmentation violation" errors on one of my systems:

Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
CUDA: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 440.44, CUDA version 10.2, compute capability 5.2, 4096MB, 4006MB available, 6611 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 440.44, device version OpenCL 1.2 CUDA, 12210MB, 4006MB available, 6611 GFLOPS peak)
OS: Linux Ubuntu: Ubuntu 18.04.3 LTS [5.0.0-37-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]

The Nvidia driver is from the graphics-drivers PPA. The CUDA 10.1 special app is working fine on the same system.
19)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2024669)
Posted 23 Dec 2019. Post:

The plan class of the last app should be cuda60, as there's no official cuda90 app.
20)
Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
(Message 2024611)
Posted 23 Dec 2019. Post:

> Bad news: it's not received the instruction to run on device 0 / device 1

Please try to force your system to ask for CUDA tasks only. I think you can achieve that by uninstalling OpenCL. Judging by the stderr output, if the special app runs in place of the original CUDA60 app, it will run on the designated GPU.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.