Aborted: run time limit exceeded -> gpu task

Message boards : Number crunching : Aborted: run time limit exceeded -> gpu task

Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2012072 - Posted: 15 Sep 2019, 18:30:47 UTC

I am running Lubuntu 18.04 (I think), and I had a very busy, exhausting week. Since I thought I had everything on "set and forget", I didn't check on the system https://setiathome.berkeley.edu/show_host_detail.php?hostid=8684146 during that time. I came home and collapsed all week.

The other gotcha is an unreliable LAN cable that I have just replaced. That cable was in place for the last two times it locked up. I have a temporary fix in place till I get another longish LAN cable.

I am guessing that BOINC/SETI locking up caused the above error message because I didn't catch it in time. Is this right, or do I need to bark up another troubleshooting tree?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2012072 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22325
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2012186 - Posted: 16 Sep 2019, 9:29:04 UTC

Well, maybe, but there are other issues with that computer as it has more recently dumped something over 400 "Error 11" tasks.
So it most definitely needs more investigation to get to the bottom of that issue.

(The last time I had a dump of abandoned tasks was when I cold-restarted one of my computers, which has a flaky BIOS back-up battery. Replacing it is another job on my Tuit list.)
None of your aborted tasks that I looked at shows either a valid "estimated run time" or an elapsed time, so there is obviously something very wrong with that one.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2012186 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2012225 - Posted: 16 Sep 2019, 19:04:52 UTC - in response to Message 2012072.  

That is an application error code and the codes are listed here:

https://boinc.mundayweb.com/wiki/index.php?title=Project_application_errors

Exit code -11 is a floating point rounding error, likely due to overheating or overclocking.

https://boinc.mundayweb.com/wiki/index.php?title=Exit_code_-11
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2012225 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2012254 - Posted: 16 Sep 2019, 23:34:59 UTC - in response to Message 2012186.  
Last modified: 16 Sep 2019, 23:36:04 UTC


Thank you both for your expertise. I was scratching my head.

After resetting "everything" to BIOS defaults, I applied the "4.0 Presets" and "Die-B, 3200MHz (fast)" (from Stilt?), and I can't remember whether I turned up the phase control/current delivery. I did not change the LLC at all, and I set a manual CPU voltage override of 1.31 V.

When I just looked at it, it was running a bunch cooler (how big is a bunch.... well, hold out your hand and let me see how much I can get on there)...

Psensor is reporting 69C, and the max it has reported is 71C. It was running hotter than that before.
The cpu watch command is still reporting 4.0 GHz or just under. As many as 4 threads are hitting 4 GHz at the same time.

Again thank you.....
Heat is the computer's enemy.... (and strikes again...).

Tom
A proud member of the OFA (Old Farts Association).
ID: 2012254 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2012258 - Posted: 17 Sep 2019, 0:56:02 UTC - in response to Message 2012254.  

After resetting "everything" to bios defaults I have applied "4.0 Presets", "Die-B, 3200Mhz (fast)" from Stilt?, and can't remember if I turned up the phase control/current delivery or not. I did not switch the LLC at all. And I did a manual cpu voltage override of 1.31

What does psensor show for CPU voltage at load? How much does it sag from the manual 1.31V BIOS setting? How many CPU tasks are running? I have LLC on Auto, and it sags down to 1.33V from the Auto BIOS setting of 1.35V. That is with 8 CPU tasks and 3 GPU tasks running. The CPU is set to 4050 MHz manually. Temps are normally around 68-73 °C under load for my 2700X.

That 1.31V might not be enough for 4 GHz. It depends on how much it sags under load. If the CPU loading is minimal, it probably is OK. I would feel better with a voltage closer to 1.35V, which is the default for 4 GHz. But that would raise temps, and I don't know your cooling solution. I have AIO or custom loop cooling on all my CPUs.
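
For anyone wanting to put a number on the sag, here is a minimal sketch of the arithmetic, using the 1.35V setpoint and 1.33V load reading mentioned above. The function name `vdroop` is made up for illustration; it is not part of any tool discussed in this thread.

```python
def vdroop(set_volts: float, load_volts: float) -> tuple[float, float]:
    """Return (sag in volts, sag as a percentage of the set voltage)."""
    sag = set_volts - load_volts
    return round(sag, 3), round(100.0 * sag / set_volts, 2)


# Figures from the post above: 1.35 V set in the BIOS, 1.33 V seen under load.
sag_v, sag_pct = vdroop(1.35, 1.33)
print(f"sag: {sag_v:.3f} V ({sag_pct:.2f}% of setpoint)")
```

Feeding in a manual setpoint and the loaded "CPU Core Voltage" reading from `sensors` gives the same comparison for any other box.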
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2012258 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2012359 - Posted: 17 Sep 2019, 23:53:34 UTC - in response to Message 2012258.  


Those are great questions. The cooling solution I am using is the biggest two-fan CPU air cooler from Noctua and a 40mm fan under the CPU socket. The MB is installed on a modest mining rack. I have 6 GPUs running on riser cards, but nothing else is cooling them. The rack is on the floor under a table where another system may grow.
Here is a specific answer to the data question. Thank you for helping me get it set up so I can answer these kinds of questions.

tom@EJS-GIFT:~$ sensors
asuswmisensors-isa-0000
Adapter: ISA adapter
CPU Core Voltage:         +1.30 V  
CPU SOC Voltage:          +1.04 V  
DRAM Voltage:             +1.40 V  
VDDP Voltage:             +0.63 V  
1.8V PLL Voltage:         +1.96 V  
+12V Voltage:            +11.83 V  
+5V Voltage:              +4.99 V  
3VSB Voltage:             +3.36 V  
VBAT Voltage:             +3.23 V  
AVCC3 Voltage:            +3.38 V  
SB 1.05V Voltage:         +1.07 V  
CPU Core Voltage:         +1.30 V  
CPU SOC Voltage:          +1.04 V  
DRAM Voltage:             +1.42 V  
CPU Fan:                 1424 RPM
Chassis Fan 1:           4720 RPM
Chassis Fan 2:              0 RPM
Chassis Fan 3:              0 RPM
HAMP Fan:                   0 RPM
Water Pump:                 0 RPM
CPU OPT:                    0 RPM
Water Flow:                 0 RPM
AIO Pump:                1448 RPM
CPU Temperature:          +71.0°C  
CPU Socket Temperature:   +47.0°C  
Motherboard Temperature:  +32.0°C  
Chipset Temperature:      +48.0°C  
Tsensor 1 Temperature:   +216.0°C  
CPU VRM Temperature:      +50.0°C  
Water In:                +216.0°C  
Water Out:               +216.0°C  
CPU VRM Output Current:  +80.00 A  

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +71.2°C  (high = +70.0°C)

tom@EJS-GIFT:~$ ^C
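
As a side note on reading that output: the +216.0°C values are presumably headers with no probe attached rather than real temperatures. The sketch below, offered only as an illustration, separates plausible temperature lines from suspect ones. The 125 °C cutoff, the function name `split_temperatures`, and the regular expression are assumptions, not anything defined by lm-sensors; the sample lines are copied from the output above.

```python
import re

# A few temperature lines copied from the `sensors` output above.
SAMPLE = """\
CPU Temperature:          +71.0°C
CPU Socket Temperature:   +47.0°C
Tsensor 1 Temperature:   +216.0°C
CPU VRM Temperature:      +50.0°C
Water In:                +216.0°C
Water Out:               +216.0°C
"""

# Assumption: anything hotter than this is a disconnected probe, not a reading.
PLAUSIBLE_MAX_C = 125.0

# Matches "Label:   +71.0°C"; lines without a °C value are skipped.
LINE_RE = re.compile(r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)°C")


def split_temperatures(text, plausible_max=PLAUSIBLE_MAX_C):
    """Return (plausible, suspect) dicts mapping sensor label -> degrees C."""
    plausible, suspect = {}, {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # voltages, fan speeds, adapter headers, blank lines
        label = match.group("label").strip()
        value = float(match.group("value"))
        if value <= plausible_max:
            plausible[label] = value
        else:
            suspect[label] = value
    return plausible, suspect


plausible, suspect = split_temperatures(SAMPLE)
print("hottest plausible:", max(plausible.items(), key=lambda kv: kv[1]))
print("probably unconnected:", sorted(suspect))
```

On the sample above, this reports "CPU Temperature" at 71.0 as the hottest plausible reading and sets aside "Tsensor 1 Temperature", "Water In", and "Water Out".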


Given my low temps, the fact that I am running about as hard as possible (90% of available threads are processing something), and the fact that it hasn't crashed yet, I am loath to touch it. It had crashed twice previously (the app locked up the OS).

I assume the "AIO pump" is because I am plugged into the wrong fan socket.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2012359 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2012368 - Posted: 18 Sep 2019, 2:30:37 UTC - in response to Message 2012359.  

Everything that sensors reports looks perfectly normal to me. If I had those sensor readings, I would be loath to change them. The header labelled AIO pump is just a normal fan header, only with extra temperature and PWM controls on it. It normally just puts out a constant +12V unless you set it up for PWM control and set temperature/PWM percentage thresholds in the BIOS. There is no problem running a simple fan from it. If you run sensors with the -u parameter, you can see what the native name in the BIOS is for all the sensor outputs.

All I'm saying is that if for some reason the system crashes after a few days, the cpu voltage may not be sufficient. I would throw some LLC2-3 on that 1.31V in the BIOS and see if the system stays up longer. I have my systems running for as long as I ignore them. Sometimes over a month of system uptime before I come visit and see that there are pending updates that I allow to go through.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2012368 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2012477 - Posted: 18 Sep 2019, 22:55:07 UTC - in response to Message 2012368.  


All I'm saying is that if for some reason the system crashes after a few days, the cpu voltage may not be sufficient. I would throw some LLC2-3 on that 1.31V in the BIOS and see if the system stays up longer. I have my systems running for as long as I ignore them. Sometimes over a month of system uptime before I come visit and see that there are pending updates that I allow to go through.


I will update next week. It seems to be perfectly happy. And I am currently too busy to be hand holding anyway.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2012477 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2012851 - Posted: 22 Sep 2019, 11:42:03 UTC - in response to Message 2012477.  



The screen/keyboard does seem to become unresponsive. That was with the CPU voltage at 1.30~?/auto. So I am going to set the LLC to 3.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2012851 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2013205 - Posted: 24 Sep 2019, 23:46:08 UTC

I would echo others in that my Linux box was throwing those errors on GPUs from time to time, as you may recall me mentioning in the other thread. Since I found and corrected a problem with a power splitter feeding a 6-pin on one 980, I have not had a recurrence since 9 Sep.
YMMV ...
ID: 2013205 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.