Posts by Retvari Zoltan

1) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2030638)
Posted 3 Feb 2020 by Profile Retvari Zoltan
Post:
My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single-host Inconclusive task could ever validate. Now it's very difficult to bring up my Inconclusive task list, but it seems those tasks are now listed as: https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3
   Task    Computer            Sent                  Time reported                 Status        Runtime (s) CPU time (s) Credit             Application
8495599283  1473578  31 Jan 2020, 5:02:48 UTC  31 Jan 2020, 21:47:15 UTC  Completed and validated  15.36  12.61   3.59  SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin
8498611906  6796479   1 Feb 2020, 3:00:50 UTC   1 Feb 2020, 4:00:03 UTC   Completed and validated   4.10   1.93   3.59  SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin
8498669733  8673543   1 Feb 2020, 4:01:52 UTC   1 Feb 2020, 5:29:49 UTC   Completed and validated  15.11  13.09   3.59  SETI@home v8 v8.22 (opencl_nvidia_SoG)
So the single-host workunits now have three hosts, but they are still just sitting there, with a number of them showing one or two 'Completed, waiting for validation' hosts, and some with one or two Inconclusive hosts.
I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q.=1, but they are much harder to spot.
https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=
2) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2030121)
Posted 31 Jan 2020 by Profile Retvari Zoltan
Post:
In my opinion this project needs a new splitting / validation process which is able to handle the ultra-high performance of present and future GPUs as well as the oldest CPUs. It could be achieved by sending larger chunks of data to fast hosts (expanding in powers of 2, limited by the actual processing speed of the slowest device (GPU/CPU) in the given system).
It also needs a new client app, as it should omit the parts of the data poisoned by RFI.
I think the transition to that adaptive splitting algorithm is needed now.
Please share your ideas! (Besides that it can't be done.)
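A minimal sketch of how such an adaptive splitter could pick chunk sizes (hypothetical Python; the 1-2 hour target, the 256x cap and the per-device timing figures come from this discussion, not from any existing project code):

def chunk_multiplier(slowest_device_secs_per_wu, target_secs=3600.0, max_mult=256):
    # Pick a power-of-2 multiple of the current workunit size such that the
    # slowest device (CPU or GPU) in the host still finishes one task within
    # roughly the target time. Figures are illustrative only.
    mult = 1
    while (mult * 2 * slowest_device_secs_per_wu <= target_secs
           and mult * 2 <= max_mult):
        mult *= 2
    return mult

print(chunk_multiplier(30))    # fast GPU, ~30 s per current workunit -> 64x chunks
print(chunk_multiplier(2700))  # slow CPU, ~45 min per current workunit -> stays at 1x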
3) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2030119)
Posted 31 Jan 2020 by Profile Retvari Zoltan
Post:
If you have really long work units, then increasing the limits for returned signals is not enough. An RFI spike will fill any reasonable limit and will then mask all the good parts of the data. Bigger time windows mean more observation time is lost due to these events.
RFI spikes can easily be detected by the app and omitted from the result, so no observation time would be lost.

This is why I suggested in another post that the clients process the long workunits in multiple parts matching the size of the current workunits, and produce result data separately for each part. So you could have a result overflow for one part but good results for the rest.
This would leave the load on the servers unchanged. Further tweaking and optimizing of client behavior would make the servers' job harder; that isn't the right way. There's no easy way to fix the problems we face.
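A rough sketch of the per-part idea (hypothetical Python, not the actual client code; the per-part signal limit and the result-record layout are assumptions for illustration):

SIGNAL_LIMIT_PER_PART = 30   # assumed per-part reporting limit, as discussed in this thread

def process_long_workunit(parts, search):
    # 'parts' are current-sized slices of one long recording; 'search' is the
    # existing single-workunit analysis. An RFI-saturated part overflows on its
    # own, while the remaining parts still return usable signals.
    results = []
    for i, part in enumerate(parts):
        signals = search(part)
        overflow = len(signals) > SIGNAL_LIMIT_PER_PART
        results.append({"part": i, "overflow": overflow,
                        "signals": [] if overflow else signals})
    return results   # one result record per part, reported back in a single upload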
4) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029785)
Posted 29 Jan 2020 by Profile Retvari Zoltan
Post:
Beta IS a project in and of itself and does not need to be created. It already exists on the same servers as SETI Prime does. It is there to test new apps and server software, hence the name Beta. I don't see the point of messing with Beta when Prime needs more fixing.
We're discussing ideas this project needs to adopt to get fixed for good. That's what beta could be used for. Tinkering with the old stuff couldn't achieve that in the long term.
5) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029784)
Posted 29 Jan 2020 by Profile Retvari Zoltan
Post:
What do you actually mean by "doubling task size"?
Do you mean just adding more data points to increase the file size from 700k to 1400k?
Do you mean doubling the resolution, so doubling the file size?
Do you mean putting two data sets into one file, so doubling the file size?
I would go for the 1st option. A task which covers a longer period of time would also mean that less overlap (= less network traffic, less disk space) is needed for data processing / transfer.
The ideal solution would be to send each host as much data as its actual device (CPU/GPU) can process in 1~2 hours. For example, a very fast host would receive up to 256 times longer chunks of data to process. I can easily spot tasks that were processed by my wingman over 400 times more slowly; in other words, my host puts a 400 times higher load on the servers than the other host does. This is not necessary. The way the data is split between hosts should be adapted to reduce the workload on the servers, as future GPUs will be even faster.
I'm aware that the per-workunit storage limits for the found spikes / pulses / triplets / Gaussians should be increased as well.
The 2nd option is also viable, but the 3rd wouldn't change things much.
The point is to reduce the number of tasks out in the field, and the number of server-client transactions to make it easier for the servers to handle their job.
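To put rough numbers on the server-side argument (illustrative arithmetic only; the ~30 s per-workunit figure for a top GPU comes from elsewhere in this thread):

secs_per_day = 24 * 3600
fast_gpu_wu_secs = 30                                          # ~30 s per current workunit on a top GPU
tasks_per_day_now  = secs_per_day / fast_gpu_wu_secs           # ~2880 results/day per GPU
tasks_per_day_256x = secs_per_day / (fast_gpu_wu_secs * 256)   # ~11 results/day per GPU
print(round(tasks_per_day_now), round(tasks_per_day_256x))     # 2880 11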
6) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029776)
Posted 29 Jan 2020 by Profile Retvari Zoltan
Post:
This amount is exponentially decaying as we go back in time, but the volunteers of this project can provide the computing power to convert (even re-calculate) that amount of data (as the computing power is growing exponentially), but I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.
Not as exponential as you might think, as the number of active users has diminished over that period - for several reasons, BOINC and credit screw to name but two.
That trend will follow the uptime/downtime ratio of this project (plus many other aspects).
The goal should be to reduce downtime (ideally to 0), as the frequent and extended downtime periods resulted in counterproductive user action.
7) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029775)
Posted 29 Jan 2020 by Profile Retvari Zoltan
Post:
Greetings,

What you guys may or may not know or understand is that there is already a 2nd project here at SETI. It's called Beta. That project resides on the same servers as SETI Prime does. What you are suggesting is for a 3rd project to be installed. I don't see the point. Just one man's opinion on this topic. ;)

Have a great day! :)

Siran
Perhaps Beta should be set up to be able to handle a few task-size doublings now, and several more in the future.
8) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029623)
Posted 28 Jan 2020 by Profile Retvari Zoltan
Post:
The science database only allows 30 items of interest to be recorded, and this is not going to change.
What forbids it from changing?
How many results does the science database hold, considering it has been running for over 20 years?
Conservative estimate 3 million/day * 365 days * 20 years = ...
This amount is exponentially decaying as we go back in time, but the volunteers of this project can provide the computing power to convert (even re-calculate) that amount of data (as the computing power is growing exponentially), but I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.
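For scale, the conservative estimate quoted above works out to roughly (illustrative arithmetic only, not an official figure):

print(3_000_000 * 365 * 20)   # 21,900,000,000 -> ~2.2e10 results over 20 years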
9) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029621)
Posted 28 Jan 2020 by Profile Retvari Zoltan
Post:
The science database only allows 30 items of interest to be recorded, and this is not going to change.
What law of nature forbids it from changing?
10) Message boards : Number crunching : How to Fix the current Issues - One man's opinion (Message 2029585)
Posted 27 Jan 2020 by Profile Retvari Zoltan
Post:
This is not only one man's opinion.
See my post* regarding this matter in the server issues thread.
I thought of starting a new thread about it myself, but here it is.
*EDIT: let me quote myself, as we should discuss it in this thread.
Retvari Zoltan wrote:
... this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the servers need to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing the limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large that it could be seen even from the Moon, and it makes it reasonable for this project to let go of its "one size fits all" attitude, because this is the root cause of the server crashes. The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it up for a while, but the time spent on it could instead be put into making the project more future-proof. The outages won't go away as long as the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most), and therefore it hurts the performance of the whole project.
11) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2029359)
Posted 26 Jan 2020 by Profile Retvari Zoltan
Post:
The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow.

I think they're just going to have to stop all work production, and let the servers sit for a week (or more) and let systems return the odd resend they get in order for the Validation backlog to clear, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog).
Then reset the server-side limits back to 100 + 100, pull all the BLC35 files and not re-release them until both the extra replication added to handle the RX 5000 series has been reduced back to just 2 and their new storage server is running, which will hopefully perform well enough even if all the data isn't cached.
Then re-release the BLC35s and see whether the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads SETI will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server - supports 2 CPUs and 128GB of RAM per CPU?).
If the problems persist after all of the effort described above, this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the servers need to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing the limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large that it could be seen even from the Moon, and it makes it reasonable for this project to let go of its "one size fits all" attitude, because this is the root cause of the server crashes. The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it up for a while, but the time spent on it could instead be put into making the project more future-proof. The outages won't go away as long as the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most), and therefore it hurts the performance of the whole project.
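A quick illustration of the bookkeeping argument (hypothetical figures; "rows" stands for the per-host in-progress entries the server tables must hold):

wu_len = 1.0            # relative length of a current (v8) workunit
rows_now = 100 + 100    # current per-host CPU + GPU task limits
work_now = rows_now * wu_len

rows_v9 = 50 + 50       # proposed limits for double-length "v9" workunits
work_v9 = rows_v9 * (2 * wu_len)

print(rows_now, work_now)   # 200 rows, 200 units of buffered work
print(rows_v9, work_v9)     # 100 rows, 200 units of buffered work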
12) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2028652)
Posted 20 Jan 2020 by Profile Retvari Zoltan
Post:
So E@H doesn't generate as much "heat"?
GPU projects / apps (I am running) in the order of heat generation:
(i3-4160 3.6GHz, 2x4GB DDR3 1333MHz, RTX2080Ti PCIe3.0x16 RAM@13600MHz)
1.       GPUGrid / Acemd3 (cuda10)              GPU@1700MHz 331W
2.     SETI@home / GPU special app (cuda10.2)   GPU@1875MHz 325W
3. Einstein@home / O2MDF 2.07 GW-OpenCL-NVidia  GPU@1875MHz 295W
4. Einstein@home / FGRPB1G 1.20 OpenCL-NVidia   GPU@1875MHz 293W
The power consumption shown is the peak average power consumption while the task is running.
The long-term heat output is highest for GPUGrid, as a task runs for 1h41m without significant change in power consumption.
The SETI@home special app has a lower heat output in the long term, as it frequently drops to 95W during workunit changes (this can be fixed by using the mutex build).
Einstein@home's long-term heat output is also lower than 295W, as it drops to ~130W at 99% completion, with ~250W spikes.
13) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2028531)
Posted 19 Jan 2020 by Profile Retvari Zoltan
Post:
Off topic, but IMHO the next bottleneck of the project in the coming years is the growth of GPU capacity. Today a top GPU can crunch a WU in less than 30 secs, so a host with 10 of these GPUs produces 100 WUs every 5 minutes - the ask-for-new-work cycle. With the arrival of the Ampere GPUs that number will rise even more. Feeding them on this 5-minute cycle will be an impossible task on the coming multi-GPU monsters, which will probably also run with a lot of CPU cores (maybe more than one CPU).
My point was that this bottleneck is present in the system right now.
The overhead on the crunchers' computers is one thing; the other is that the servers are crushed on a daily basis by the overwhelming amount of results they have to deal with.
It is clear that the fastest hosts need more work to survive the outages without running dry, but making the queues longer by allowing more elements in them made this situation worse, so it's quite logical to make the elements longer instead. That would be a real win-win: less administration on the server side, and more time before the work queue runs dry on the client side. Fewer client-server transactions means faster recovery after an outage.
Even if the server hardware is upgraded, the increase in computing power out in the field (from the arrival of new GPUs) could have the same effect on the new servers very soon.
14) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2028472)
Posted 19 Jan 2020 by Profile Retvari Zoltan
Post:
Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.
I grab a task from Einstein, the deadline is around 10-14 days.
I grab a task here, the deadline is 1-2 months. That's too high.
They're very upfront at e@h about not wanting to raise their deadlines, specifically due to database server loading issues.
Perhaps lowering it here only makes sense.
SETI is arguably the most visible BOINC project.
As a result, I suspect this project gets the largest percentage of people new to the concept, and thus more likely to decide it isn't for them and go away, leaving work in the db to time out.
The long deadlines made sense in the earlier days of the project when task run times were high, and computers weaker. Perhaps it's time to revisit the decision.
The other way to catch up with the computing power that state-of-the-art computers provide is to make the workunits longer.
Provided that their length is not hard-coded into the apps. (Is the length of the tasks hard-coded into the apps?)
State-of-the-art GPUs can process a workunit (with the special app) in less than a minute (~30 secs), so the overhead of getting that workunit to the point of actually being processed (~3 sec) is no longer negligible compared to the processing itself. This approach would lower the impact of that overhead, and make the tables shorter at the same time.
The number of the max queued tasks per GPU/CPU could be reduced as well.
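The same point in numbers (illustrative only; the ~30 s and ~3 s figures are the ones quoted above):

proc, overhead = 30.0, 3.0                 # processing vs per-task handling time, seconds
print(overhead / (proc + overhead))        # ~0.09 -> roughly 9% of wall time lost today
proc_long = 30.0 * 16                      # a hypothetical 16x longer workunit
print(overhead / (proc_long + overhead))   # ~0.006 -> well under 1%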
15) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2027933)
Posted 16 Jan 2020 by Profile Retvari Zoltan
Post:
I get:
Database Error
Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 757 Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 766
When I try to access my "All tasks" list.
16) Message boards : Number crunching : New CUDA 10.2 Linux App Available (Message 2024948)
Posted 25 Dec 2019 by Profile Retvari Zoltan
Post:
As per the ReadMe, the 10.2 version in the All-In-One requires Pascal and Higher. Maxwells run just about as well on CUDA 9.0. If you add Maxwell to the 10.2 App, the App will be around 220MBs instead of around 200MBs. The larger you make the App, the Slower it will run for everyone. For the Newer GPUs it's best to keep Maxwell out of the App. That's why it says Maxwell can use the CUDA 9.0 App, it won't work with the smaller App.
Thanks for this clarification.
I got lost where you said "...that means Maxwell will still work with 10.2, but MAY be removed on Newer ToolKits". In that sentence, 10.2 refers to the CUDA version, not to your app. Now it's clear.
17) Message boards : Number crunching : New CUDA 10.2 Linux App Available (Message 2024945)
Posted 25 Dec 2019 by Profile Retvari Zoltan
Post:
Your version (the CUDA10.2 mutex build) is working fine on the same host.
(I can't give a link to a finished wu atm, as the servers are struggling.)
18) Message boards : Number crunching : New CUDA 10.2 Linux App Available (Message 2024936)
Posted 25 Dec 2019 by Profile Retvari Zoltan
Post:
As far as the Release Notes, this is All you need to read, "Note that support for these compute capabilities may be removed in a future release of CUDA." That means Maxwell will still work with 10.2 , but MAY be removed on Newer ToolKits. nVidia has been complaining about Clang 6.0 since ToolKit 10.0, but, it still works with ToolKit 10.2 even though it has been listed as 'deprecated' for some time now.
That doesn't matter for this CUDA 10.2 App though, if you Read the First Post you will see, "The 10.2 App will need driver 440.33 or Higher, Ubuntu 15.04 or Higher, and Pascal or Higher GPU." Maxwell users can use the CUDA 9.0 App, it works just about as well on Maxwell.
The CUDA 10.2 special app throws
SIGSEGV: segmentation violation
errors on one of my systems:
Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
CUDA: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 440.44, CUDA version 10.2, compute capability 5.2, 4096MB, 4006MB available, 6611 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 440.44, device version OpenCL 1.2 CUDA, 12210MB, 4006MB available, 6611 GFLOPS peak)
OS: Linux Ubuntu: Ubuntu 18.04.3 LTS [5.0.0-37-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
The Nvidia driver is from the graphics-drivers ppa.

The CUDA 10.1 special app is working fine on the same system.
19) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2024669)
Posted 23 Dec 2019 by Profile Retvari Zoltan
Post:
The plan class of the last app should be cuda60, as there's no official cuda90 app.
20) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118) (Message 2024611)
Posted 23 Dec 2019 by Profile Retvari Zoltan
Post:
Bad news: it hasn't received the instruction to run on device 0 / device 1.
Please try to force your system to ask for CUDA tasks only. I think you can achieve that by uninstalling OpenCL. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU.

