Aborted: run time limit exceeded -> gpu task

Tom M Volunteer tester Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462
Message 2012072 - Posted: 15 Sep 2019, 18:30:47 UTC

I am running Lubuntu 18.04 (I think), and I had a very busy, exhausting week. Since I thought I had everything on "set and forget", I didn't check on the system https://setiathome.berkeley.edu/show_host_detail.php?hostid=8684146 during that time. I came home and collapsed all week.

The other gotcha is an unreliable LAN cable that I have just replaced. That cable was in place the last two times the system locked up. I have a temporary fix in place until I get another longish LAN cable.

I am guessing that the BOINC/SETI lockup caused the above error message because I didn't catch it in time. Is this right, or do I need to bark up another troubleshooting path?

Tom
A proud member of the OFA (Old Farts Association).

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22186 Credit: 416,307,556 RAC: 380
Message 2012186 - Posted: 16 Sep 2019, 9:29:04 UTC

Well, maybe, but there are other issues with that computer, as it has more recently dumped something over 400 "Error 11" tasks. So it most definitely needs more investigation to get to the bottom of that issue. (The last time I had a dump of abandoned tasks was when I cold-restarted one of my computers, which has a flaky BIOS back-up battery; that is another job on my Tuit list.)

None of your aborted tasks that I looked at shows either a valid "estimated run time" or an elapsed time, so there is obviously something very wrong with that one.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 2012225 - Posted: 16 Sep 2019, 19:04:52 UTC - in response to Message 2012072.

That is an application error code, and the codes are listed here: https://boinc.mundayweb.com/wiki/index.php?title=Project_application_errors

Exit code -11 is a floating point rounding error. Likely due to overheating or overclocking. https://boinc.mundayweb.com/wiki/index.php?title=Exit_code_-11

Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)

Tom M Volunteer tester Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462
Message 2012254 - Posted: 16 Sep 2019, 23:34:59 UTC - in response to Message 2012186. Last modified: 16 Sep 2019, 23:36:04 UTC

> Well, maybe, but there are other issues with that computer, as it has more recently dumped something over 400 "Error 11" tasks. So it most definitely needs more investigation to get to the bottom of that issue. (The last time I had a dump of abandoned tasks was when I cold-restarted one of my computers, which has a flaky BIOS back-up battery; that is another job on my Tuit list.)
> None of your aborted tasks that I looked at shows either a valid "estimated run time" or an elapsed time, so there is obviously something very wrong with that one.

Thank you both for your expertise. I was scratching my head.

After resetting "everything" to BIOS defaults, I have applied "4.0 Presets" and "Die-B, 3200MHz (fast)" from Stilt(?), and I can't remember whether I turned up the phase control/current delivery or not. I did not switch the LLC at all. And I did a manual CPU voltage override of 1.31V.

When I just looked at it, it was running a bunch cooler (how big is a bunch.... well, hold out your hand and let me see how much I can get on there)... Psensor is reporting 69C, and the max it has reported is 71C. It was running hotter than that.

The CPU watch command is still reporting 4.0 to just under that. As many as 4 threads are hitting 4GHz at the same time.

Again, thank you..... Heat is the computer's enemy.... (and strikes again...).

Tom
A proud member of the OFA (Old Farts Association).

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 2012258 - Posted: 17 Sep 2019, 0:56:02 UTC - in response to Message 2012254.

> After resetting "everything" to BIOS defaults, I have applied "4.0 Presets" and "Die-B, 3200MHz (fast)" from Stilt(?), and I can't remember whether I turned up the phase control/current delivery or not. I did not switch the LLC at all. And I did a manual CPU voltage override of 1.31V.

What does psensor show for CPU voltage at load? How much does it sag from the manual 1.31V BIOS setting? How many CPU tasks are running?

I have LLC on Auto, and it sags down to 1.33V from the Auto BIOS setting of 1.35V. That is with 8 CPU tasks and 3 GPU tasks running. CPU is set to 4050MHz manually. Temps are normally around 68-73 °C under load for my 2700X.

That 1.31V might not be enough for 4GHz. It depends on how much it sags under load. If the CPU loading is minimal, it is probably OK. I would feel better with a voltage closer to 1.35V, which is the default for 4GHz. But that would raise temps, and I don't know your cooling solution. I have AIO or custom loop cooling on all my CPUs.

Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)

Tom M Volunteer tester Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462
Message 2012359 - Posted: 17 Sep 2019, 23:53:34 UTC - in response to Message 2012258.

> > After resetting "everything" to BIOS defaults, I have applied "4.0 Presets" and "Die-B, 3200MHz (fast)" from Stilt(?), and I can't remember whether I turned up the phase control/current delivery or not. I did not switch the LLC at all. And I did a manual CPU voltage override of 1.31V.
> What does psensor show for CPU voltage at load? How much does it sag from the manual 1.31V BIOS setting? How many CPU tasks are running?
> I have LLC on Auto, and it sags down to 1.33V from the Auto BIOS setting of 1.35V. That is with 8 CPU tasks and 3 GPU tasks running. CPU is set to 4050MHz manually. Temps are normally around 68-73 °C under load for my 2700X.
> That 1.31V might not be enough for 4GHz. It depends on how much it sags under load. If the CPU loading is minimal, it is probably OK. I would feel better with a voltage closer to 1.35V, which is the default for 4GHz. But that would raise temps, and I don't know your cooling solution. I have AIO or custom loop cooling on all my CPUs.

Those are great questions.

The cooling solution I am using is the biggest two-fan CPU air cooler from Noctua and a 40mm fan under the CPU socket. The MB is installed on a modest mining rack. I have 6 GPUs running on riser cards but nothing else cooling them. The rack is on the floor under a table where another system may grow.

Here is a specific answer to the data question. Thank you for helping get it set up so I can answer these kinds of questions.
tom@EJS-GIFT:~$ sensors
asuswmisensors-isa-0000
Adapter: ISA adapter
CPU Core Voltage:         +1.30 V
CPU SOC Voltage:          +1.04 V
DRAM Voltage:             +1.40 V
VDDP Voltage:             +0.63 V
1.8V PLL Voltage:         +1.96 V
+12V Voltage:            +11.83 V
+5V Voltage:              +4.99 V
3VSB Voltage:             +3.36 V
VBAT Voltage:             +3.23 V
AVCC3 Voltage:            +3.38 V
SB 1.05V Voltage:         +1.07 V
CPU Core Voltage:         +1.30 V
CPU SOC Voltage:          +1.04 V
DRAM Voltage:             +1.42 V
CPU Fan:                 1424 RPM
Chassis Fan 1:           4720 RPM
Chassis Fan 2:              0 RPM
Chassis Fan 3:              0 RPM
HAMP Fan:                   0 RPM
Water Pump:                 0 RPM
CPU OPT:                    0 RPM
Water Flow:                 0 RPM
AIO Pump:                1448 RPM
CPU Temperature:          +71.0°C
CPU Socket Temperature:   +47.0°C
Motherboard Temperature:  +32.0°C
Chipset Temperature:      +48.0°C
Tsensor 1 Temperature:   +216.0°C
CPU VRM Temperature:      +50.0°C
Water In:                +216.0°C
Water Out:               +216.0°C
CPU VRM Output Current:  +80.00 A

asus-isa-0000
Adapter: ISA adapter
cpu_fan:                    0 RPM

k10temp-pci-00c3
Adapter: PCI adapter
temp1:                    +71.2°C  (high = +70.0°C)

tom@EJS-GIFT:~$ ^C

Given my low temps, the fact that I am running about as hard as possible (90% of available threads are processing something), and that it hasn't crashed yet, I am loath to touch it. It had crashed twice previously (the app locked up the OS).

I assume the "AIO Pump" reading is because I am plugged into the wrong fan socket.

Tom
A proud member of the OFA (Old Farts Association).
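For anyone following along, the sag Keith asked about can be worked out from the listing above. The snippet below is only a rough sketch in Python: it parses a short excerpt of the posted `sensors` text rather than calling the tool, and the helper names `read_value` and `voltage_sag` are made up for illustration. The 1.31 V figure is the manual BIOS override mentioned earlier in the thread.

```python
import re

# Excerpt of the `sensors` output posted above (not the full listing).
SENSORS_TEXT = """\
asuswmisensors-isa-0000
Adapter: ISA adapter
CPU Core Voltage:         +1.30 V
CPU SOC Voltage:          +1.04 V
k10temp-pci-00c3
Adapter: PCI adapter
temp1:                    +71.2°C  (high = +70.0°C)
"""


def read_value(text, label):
    """Return the first numeric reading for `label`, or None if it is absent."""
    pattern = re.compile(
        r"^" + re.escape(label) + r":\s*([+-]?\d+(?:\.\d+)?)", re.MULTILINE
    )
    match = pattern.search(text)
    return float(match.group(1)) if match else None


def voltage_sag(bios_volts, measured_volts):
    """Return the sag in volts and as a percentage of the BIOS setting."""
    sag = bios_volts - measured_volts
    return round(sag, 3), round(100.0 * sag / bios_volts, 2)


BIOS_VCORE = 1.31  # manual override from the thread
measured = read_value(SENSORS_TEXT, "CPU Core Voltage")
sag_volts, sag_percent = voltage_sag(BIOS_VCORE, measured)
print(f"Vcore under load: {measured:.2f} V, sag: {sag_volts:.2f} V ({sag_percent:.2f}%)")
```

By the same arithmetic, the 1.35V to 1.33V drop Keith reports on his own system is a sag of 0.02 V. The sensor only reports two decimal places, so these figures are coarse.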

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 2012368 - Posted: 18 Sep 2019, 2:30:37 UTC - in response to Message 2012359.

Everything that sensors reports looks perfectly normal to me. If I had those sensor readings, I would be loath to change them.

The header labelled AIO pump is just a normal fan header, only with extra temperature and PWM controls on it. It normally just puts out a constant +12V unless you set it up for PWM control and set temperature/PWM percentage thresholds in the BIOS. No problem running a simple fan from it.

If you run sensors with the -u parameter, you can see what the native name in the BIOS is for all the sensor outputs.

All I'm saying is that if for some reason the system crashes after a few days, the CPU voltage may not be sufficient. I would throw some LLC2-3 on that 1.31V in the BIOS and see if the system stays up longer.

I have my systems running for as long as I ignore them. Sometimes there is over a month of system uptime before I come visit and see that there are pending updates that I allow to go through.

Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)
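As a rough illustration of what `sensors -u` gives you, here is a small Python sketch that turns raw-format output into a nested dictionary. The sample text is an assumption: it is modelled on the k10temp reading posted earlier and on the usual raw layout (a label line followed by indented `name: value` lines), not copied from Tom's machine, and `parse_raw_sensors` is a made-up helper name.

```python
# Illustrative sample in the usual `sensors -u` layout (assumed, not real output
# from the host in this thread).
SAMPLE_RAW = """\
k10temp-pci-00c3
Adapter: PCI adapter
temp1:
  temp1_input: 71.250
  temp1_max: 70.000
"""


def parse_raw_sensors(text):
    """Map chip name -> label -> raw subfeature name -> numeric value."""
    chips = {}
    chip = label = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("Adapter:"):
            continue  # skip blank lines and the adapter description
        if line[0].isspace():
            # Indented line: a raw subfeature such as "temp1_input: 71.250".
            if chip is not None and label is not None:
                name, _, value = stripped.partition(":")
                chips[chip][label][name] = float(value)
        elif stripped.endswith(":"):
            # Unindented line ending in a colon: a sensor label.
            if chip is not None:
                label = stripped[:-1]
                chips[chip][label] = {}
        else:
            # Anything else unindented starts a new chip section.
            chip, label = stripped, None
            chips[chip] = {}
    return chips


readings = parse_raw_sensors(SAMPLE_RAW)
for chip_name, labels in readings.items():
    for label_name, raw in labels.items():
        print(chip_name, label_name, raw)
```

The raw names (such as `temp1_input`) are the ones a sensors configuration file refers to, so this view helps when matching a friendly label to the underlying reading.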

Tom M Volunteer tester Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462
Message 2012477 - Posted: 18 Sep 2019, 22:55:07 UTC - in response to Message 2012368.

> All I'm saying is that if for some reason the system crashes after a few days, the cpu voltage may not be sufficient. I would throw some LLC2-3 on that 1.31V in the BIOS and see if the system stays up longer.
> I have my systems running for as long as I ignore them. Sometimes over a month of system uptime before I come visit and see that there are pending updates that I allow to go through.

I will update next week. It seems to be perfectly happy. And I am currently too busy to be hand holding anyway.

Tom
A proud member of the OFA (Old Farts Association).

Tom M Volunteer tester Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462
Message 2012851 - Posted: 22 Sep 2019, 11:42:03 UTC - in response to Message 2012477.

> > All I'm saying is that if for some reason the system crashes after a few days, the cpu voltage may not be sufficient. I would throw some LLC2-3 on that 1.31V in the BIOS and see if the system stays up longer.
> > I have my systems running for as long as I ignore them. Sometimes over a month of system uptime before I come visit and see that there are pending updates that I allow to go through.
> I will update next week. It seems to be perfectly happy. And I am currently too busy to be hand holding anyway.
> Tom

The screen and keyboard do seem to have become unresponsive. That was with CPU voltage at about 1.30(?)/auto. So I am going to reset the LLC to 3.

Tom
A proud member of the OFA (Old Farts Association).

Jimbocous Volunteer tester Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349
Message 2013205 - Posted: 24 Sep 2019, 23:46:08 UTC

I would echo others in that my Linux box was throwing those errors on GPUs from time to time, as you may recall me mentioning in the other thread. Since I found and corrected a problem with a power splitter feeding a 6-pin on one 980, I have not had a recurrence since 9 Sep. YMMV ...
