You might want to check this one Again...
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Now it's one of the other machines. This is the host I had the problem with when I inserted a card BOINC didn't recognize correctly. It has never given an 'Invalid', except when it tried to run two tasks at once; that was also caused by the 'other' card a while back. I've been keeping an eye on it, especially after BOINC decided to send it 100 APs when it averages about 4 a day. Must be the other card again... I'm trying to download the task using the instructions from this post, WOW! Where on earth did that come from? I can't seem to get the fanout correct. The name is ap_20jl12aa_B0_P1_00201_20130325_22943.wu, and I get 65ded978f0f467e770370f8a526415da from here: http://www.miraclesalad.com/webtools/md5.php. That's as far as I get; the next step goes over my head: "by taking 'modulo 4' of the first character"? Whut? I'm having a hard time believing the card could be that far off when all its other tasks seem fine. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... Modulo 4 means the remainder when you divide by 4. The 9 has a remainder of 1 when divided by 4, so the fanout is 178. Joe |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Thanks, maybe I should have paid more attention to the diagram below and not made it into something more difficult than it was. It's really simple: look for the first character in the 4 rows and use the corresponding number...okay.
[ 0, 4, 8, c ] --> (0)
[ 1, 5, 9, d ] --> 1
[ 2, 6, a, e ] --> 2
[ 3, 7, b, f ] --> 3
So, http://boinc2.ssl.berkeley.edu/sah/download_fanout/178/ap_20jl12aa_B0_P1_00201_20130325_22943.wu gives me a 404. I had tried 178 before I posted; I ran through the possibilities of the first character since there aren't that many...I received a 404 then as well. Any ideas? On the 404, as well as the extreme differences in the results? It's almost as if I ran a different task. If you look at the Application details on that Host, they are completely borked. That Host is closing in on 1 Million, and a good percentage came from that card running around 4 APs a day since November. You wouldn't know it by looking at the Applications page. |
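For anyone else who trips over the 'modulo 4' step, the lookup can be sketched in a few lines of Python. This is only a sketch under assumptions drawn from the single worked example in this thread: the fanout is taken from the 6th to 8th hex characters of the file name's MD5 ("978" here), and the first of those is reduced modulo 4, which is the same as taking the three-digit value modulo 0x400. Reading the "(0)" in the diagram as "a leading zero is dropped" is also an assumption. The function names are made up for illustration and are not from the BOINC source.

```python
import hashlib

# Base URL exactly as quoted in the thread.
DOWNLOAD_BASE = "http://boinc2.ssl.berkeley.edu/sah/download_fanout"


def fanout_from_digest(md5_hex):
    """Fanout directory for an MD5 hex digest (rule inferred from the thread)."""
    three_digits = int(md5_hex[5:8], 16)      # "978" -> 0x978
    # Modulo 0x400 keeps the last two hex digits and reduces the first
    # one modulo 4, matching the four-row diagram above.
    return format(three_digits % 0x400, "x")  # 0x178 -> "178"


def download_url(wu_name, md5_hex=None):
    """Build the download URL; hash the name only if no digest is supplied."""
    if md5_hex is None:
        md5_hex = hashlib.md5(wu_name.encode("ascii")).hexdigest()
    return f"{DOWNLOAD_BASE}/{fanout_from_digest(md5_hex)}/{wu_name}"


name = "ap_20jl12aa_B0_P1_00201_20130325_22943.wu"
digest = "65ded978f0f467e770370f8a526415da"  # digest quoted earlier in the thread
print(fanout_from_digest(digest))  # 178
print(download_url(name, digest))
```

A correct path still returns a 404 once the server has deleted the file, so the fanout can be right even when the download fails.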
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
That's the right filepath. Or rather it was the right filepath. Checking the WU link you quoted, I see that the last task was handed in and validated 8 Apr 2013, 2:52:43 UTC. It will have been purged from the download dir shortly afterwards, which is why you are getting a 404 - file not found. It's already been deleted. If you want to run tasks offline you need to get them (at least here on main) before they validate. A person who won't read has no advantage over one who can't read. (Mark Twain) |
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
If the server loses the detail for that app_version and starts guesstimating from scratch and estimates far too small (for whatever reason) you may well end up with a far too large work request. Are you running with <flops>? How closely do the estimates match the real runtimes? Or rather was there a large difference between the two when you got those 100 tasks? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Less than an hour is rather shortly, I agree. Back when Richard first explained how to find the file, I don't think it was that short. I had tried it within an hour after the last task validated. I just tried another one within an hour, it too seems to be history, http://boinc2.ssl.berkeley.edu/sah/download_fanout/3bd/ap_31my12ae_B6_P1_00148_20130325_28395.wu, from here 8 Apr 2013, 15:16:20 UTC, it was gone by 8 Apr 2013, 16:10:00 UTC. Not much time to check a questionable result, is it....less than an hour. |
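To put a number on "less than an hour", the two timestamps quoted above can be subtracted directly. This is a small sketch using Python's standard datetime module; the format string is an assumption matched to how the site prints times, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamp layout as shown on the task pages, e.g. "8 Apr 2013, 15:16:20 UTC".
FMT = "%d %b %Y, %H:%M:%S UTC"


def gap(earlier, later):
    """Time between two site-formatted timestamps."""
    return datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)


# Validation time and the time the file was found to be gone (from the post above).
window = gap("8 Apr 2013, 15:16:20 UTC", "8 Apr 2013, 16:10:00 UTC")
print(window)  # 0:53:40
```

So the file was already gone no more than 53 minutes 40 seconds after validation; the actual deletion may have happened earlier in that window.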
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Quoting William: "If the server loses the detail for that app_version and starts guesstimating from scratch and estimates far too small (for whatever reason) you may well end up with a far too large work request." There is a flops setting, and it is close to reality, 6.7 hours. It has been close for a very long time. What didn't take very long was for BOINC to send those tasks after the second card was inserted, basically shortly after I allowed new tasks. Even between both cards, it shouldn't have sent tasks bringing the total to 100. The card was a trooper though; between Mar 26th & April 8th it had completed 75 AP tasks, most of them overflows. It still takes around 6.4 hours on the normal ones. Only one of those 75 was an Invalid, and that one is highly questionable. One host found 30/30 and the other two found 1/0? Like I said, most of those prior 75 were overflows, and it didn't seem to have a problem with those. It has just found another overflow that is being listed as 'Inconclusive', just as many overflows are listed. This one seems to be normal though; the other host also found 30/30. The only logical explanation is that the three hosts worked a different task. The other host that was sent the task when it was issued got a 'Download error'. The two after that apparently worked a different task. I decided to run a Project reset, the only way to be sure another task isn't what it seems. After the reset, BOINC decided to send me all ATI APs and then run an ATI AP on my CPU. Run an ATI AP on my CPU? Something wrong with that picture. I did another Reset. Same thing; this time I noticed the ATI AP being run by the CPU had a 'Normal' CPU run-time of 8.5 hours instead of the ATI 6.7 hour time. Another Reset, then Exit BOINC and let it think about it. I decided to try BOINC 7.0.60 since 7.0.58 seemed to be acting strange. After restarting, I was sent the two different classes, 601 & 604. The correct devices were assigned the correct class task. Progress.
Then, BOINC started sending more ATI APs, even though I already had more than enough. I set it to NNT, as it stands at this moment. I guess I might as well make a backup of the Project folder since I will be running the existing tasks for a while. Seems to be the only way to obtain a copy of the WU if something else goes wrong. Deleted in less than an hour, even if the results seem questionable...okay. |
andybutt Joined: 18 Mar 03 Posts: 262 Credit: 164,205,187 RAC: 516 |
Replaced the bad 580 today with a 690. Not sure the problem has gone away, as it now seems the other 580 is throwing up a few invalids. Maybe PSU or RAM? I will sit on it for a couple of days to see what happens. Andy |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Quoting TBar: "Less than an hour is rather shortly, I agree. Back when Richard first explained how to find the file, I don't think it was that short. I had tried it within an hour after the last task validated. I just tried another one within an hour, it too seems to be history, http://boinc2.ssl.berkeley.edu/sah/download_fanout/3bd/ap_31my12ae_B6_P1_00148_20130325_28395.wu, from here 8 Apr 2013, 15:16:20 UTC, it was gone by 8 Apr 2013, 16:10:00 UTC. Not much time to check a questionable result, is it....less than an hour." Once a WU has a canonical result, assimilation and file deletion normally happen as quickly as possible. Game over. An exception is when there's another unfinished task not past deadline. Of course, if the Transitioner, Assimilator, or File Deleter is not keeping up, as happens sometimes, there may be a delay. Purging of the database is one day after file deletion (the BOINC default is 7 days). That's why you can see the WU and Task details even after the files are gone. Joe |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I'm still waiting for someone to attempt to explain the missing 500 AstroPulse v6 (anonymous platform, ATI GPU) tasks that disappeared from the Application details for host 6796475 page. That card had around 500 'Number of tasks completed' there the last time I checked. It now has 10 listed. It's almost as if the history for that card has been deleted. 10 is a very interesting number. At around 4 to 5 a day, that would place the deletion about the same time as that suspicious Invalid task:
2893216598 | 6796475 | 26 Mar 2013, 1:44:53 UTC | 6 Apr 2013, 11:06:33 UTC | Completed, marked as invalid | 19,661.57 | 731.94 | 0.00 | AstroPulse v6 Anonymous platform (ATI GPU)
What's up with that? Too much of a coincidence for me. The Game Continues... :-) |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
That is a bit odd. Usually the "consecutive valid" count goes back to zero, but not "number of tasks completed." That is always supposed to go only up. I don't know why it would reset back to zero. Maybe someone like Richard or Josef could have a theory for that one. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
I'd say for some unfathomable reason that entry got deleted and a new entry was started. I find it unlikely to be linked in any way to the false overflow. False overflows just occasionally happen - something gets corrupted, the task does a false overflow, the next task is fine. If you still had the task and were to run it offline, you'd probably find it to report the 1/0 the others found. We tend to blame Eddy. Or Cosmic Rays. Or Sunspots. You say the problem occurred when you introduced a second card into the system? Looks like Boinc got itself all tangled over that one. Not much use trying to dissect the logs afterwards if you weren't running the appropriate debug log flags at the time (e.g. work_fetch_debug). If, however, your total estimated runtime on board exceeds your cache settings and it is still asking for work, something is fishy. That may need investigating - 7.0.60 is a release candidate; if you've found a bug it needs to be reported sooner rather than later. Could you possibly enable work_fetch_debug and post the output? [one instance is enough] |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Quoting William: "I'd say for some unfathomable reason that entry got deleted and a new entry was started. I find it unlikely to be linked in any way to the false overflow. False overflows just occasionally happen - something gets corrupted, the task does a false overflow, the next task is fine. If you still had the task and were to run it offline, you'd probably find it to report the 1/0 the others found. We tend to blame Eddy. Or Cosmic Rays. Or Sunspots." As of now, the AstroPulse v6 (anonymous platform, ATI GPU) 'Number of tasks completed' seems to be hung at 10. For all I know, it could have been stuck at 10 since I inserted the other ATI card back on the 21st. Great, 5 months' worth of work has been vaporized. I also find it odd the Cosmic Ray chose to wait until this other problem developed before striking... It appears the work fetch for the CPU tasks is normal. It's just the ATI work fetch that has gone astray. My guess is it has to do with the nasty bug identified back here, ..."we had believed that all CAL-capable GPUs were also OpenCL capable".... I also now suspect the Number 10 mentioned previously is probably related to the ATI 3650 that has never completed any tasks, and won't anytime soon. How it arrived at 10 is another mystery. Maybe BOINC just flips a coin to determine which ATI card it should track? This problem has surfaced in the new SETI V.7 as well. I disagree though, I think it's going to get Fugly, not simply Ugly: "OpenCL version detection is very limited and OpenCL device driver version detection is non-existent. The original code was written assuming a single version number would identify the driver and all its components." The link apparently doesn't work correctly if you're not a member there; you'll have to scroll down to Message 45488. Bad BOINC, Bad... Of course, people with nVidia cards will probably be unaffected. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Here's the output from Work Fetch. The current state is 14 CPU AstroPulses, 49 ATI AstroPulses. At around 4 ATI APs a day, that's about 12 days' worth of work for the 4670 on the Host.
4/8/2013 4:02:56 AM | | Starting BOINC client version 7.0.60 for windows_intelx86
4/8/2013 4:02:56 AM | | OS: Microsoft Windows 8: Professional with Media Center x86 Edition, (06.02.9200.00)
4/8/2013 4:02:56 AM | | CAL: ATI GPU 0: ATI Radeon HD 4600 series (R730) (CAL version 1.4.1734, 1024MB, 992MB available, 992 GFLOPS peak)
4/8/2013 4:02:56 AM | | CAL: ATI GPU 1: ATI Radeon HD 2600 (RV630) (CAL version 1.4.1734, 1024MB, 992MB available, 348 GFLOPS peak)
4/8/2013 4:02:56 AM | | OpenCL: AMD/ATI GPU 0: ATI Radeon HD 4600 series (R730) (driver version CAL 1.4.1734, device version OpenCL 1.0 AMD-APP (937.2), 1024MB, 992MB available, 992 GFLOPS peak)
4/8/2013 4:02:56 AM | SETI@home | Found app_info.xml; using anonymous platform
4/8/2013 4:02:56 AM | SETI@home | Config: excluded GPU. Type: ATI. App: astropulse_v6. Device: 1
4/8/2013 4:02:56 AM | | Version change (7.0.58 -> 7.0.60)
4/8/2013 4:02:56 AM | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 6796475; resource share 100
------
4/9/2013 11:49:34 PM | | Re-reading cc_config.xml
4/9/2013 11:49:34 PM | | Using proxy info from GUI
4/9/2013 11:49:34 PM | | Using HTTP proxy 192.168.1.3:5555
4/9/2013 11:49:34 PM | SETI@home | Config: excluded GPU. Type: ATI. App: astropulse_v6. Device: 1
4/9/2013 11:49:34 PM | | log flags: file_xfer, sched_ops, task, work_fetch_debug
3 Day Buffer
4/10/2013 12:08:34 AM | | [work_fetch] work fetch start
4/10/2013 12:08:34 AM | | [work_fetch] choose_project() for ATI: buffer_low: yes; sim_excluded_instances 2
4/10/2013 12:08:34 AM | | [work_fetch] no eligible project for ATI
4/10/2013 12:08:34 AM | | [work_fetch] choose_project() for CPU: buffer_low: no; sim_excluded_instances 0
4/10/2013 12:08:34 AM | | [work_fetch] ------- start work fetch state -------
4/10/2013 12:08:34 AM | | [work_fetch] target work buffer: 259200.00 + 8640.00 sec
4/10/2013 12:08:34 AM | | [work_fetch] --- project states ---
4/10/2013 12:08:34 AM | SETI@home | [work_fetch] REC 70205.664 prio -1.507817 can req work
4/10/2013 12:08:34 AM | | [work_fetch] --- state for CPU ---
4/10/2013 12:08:34 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 442948.16 busy 0.00
4/10/2013 12:08:34 AM | SETI@home | [work_fetch] fetch share 1.000
4/10/2013 12:08:34 AM | | [work_fetch] --- state for ATI ---
4/10/2013 12:08:34 AM | | [work_fetch] shortfall 267840.00 nidle 1.00 saturated 0.00 busy 0.00
4/10/2013 12:08:34 AM | SETI@home | [work_fetch] fetch share 1.000
4/10/2013 12:08:34 AM | | [work_fetch] ------- end work fetch state -------
4/10/2013 12:08:34 AM | | [work_fetch] No project chosen for work fetch
4 Day Buffer
4/10/2013 12:11:59 AM | | [work_fetch] work fetch start
4/10/2013 12:11:59 AM | | [work_fetch] choose_project() for ATI: buffer_low: yes; sim_excluded_instances 2
4/10/2013 12:11:59 AM | | [work_fetch] no eligible project for ATI
4/10/2013 12:11:59 AM | | [work_fetch] choose_project() for CPU: buffer_low: no; sim_excluded_instances 0
4/10/2013 12:11:59 AM | | [work_fetch] ------- start work fetch state -------
4/10/2013 12:11:59 AM | | [work_fetch] target work buffer: 345600.00 + 8640.00 sec
4/10/2013 12:11:59 AM | | [work_fetch] --- project states ---
4/10/2013 12:11:59 AM | SETI@home | [work_fetch] REC 70207.385 prio -1.383871 can req work
4/10/2013 12:11:59 AM | | [work_fetch] --- state for CPU ---
4/10/2013 12:11:59 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 442752.26 busy 0.00
4/10/2013 12:11:59 AM | SETI@home | [work_fetch] fetch share 1.000
4/10/2013 12:11:59 AM | | [work_fetch] --- state for ATI ---
4/10/2013 12:11:59 AM | | [work_fetch] shortfall 354240.00 nidle 1.00 saturated 0.00 busy 0.00
4/10/2013 12:11:59 AM | SETI@home | [work_fetch] fetch share 1.000
4/10/2013 12:11:59 AM | | [work_fetch] ------- end work fetch state -------
4/10/2013 12:11:59 AM | | [work_fetch] No project chosen for work fetch |
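The numbers in that log can be cross-checked with a short script. The sketch below parses only the message part of a few lines copied from the 3 Day Buffer run; the regular expressions and the `summarize` helper are illustrative assumptions, not part of BOINC. It shows that the ATI shortfall equals the whole target buffer (259200 + 8640 = 267840 seconds), while the 49 ATI tasks on hand at about 4 a day come to roughly 12 days of work.

```python
import re

# Message text copied from the 3 Day Buffer section of the log above.
LOG = """\
[work_fetch] target work buffer: 259200.00 + 8640.00 sec
[work_fetch] --- state for CPU ---
[work_fetch] shortfall 0.00 nidle 0.00 saturated 442948.16 busy 0.00
[work_fetch] --- state for ATI ---
[work_fetch] shortfall 267840.00 nidle 1.00 saturated 0.00 busy 0.00
"""


def summarize(log):
    """Return (target buffer in seconds, per-resource shortfall/saturated)."""
    m = re.search(r"target work buffer: ([\d.]+) \+ ([\d.]+) sec", log)
    target = float(m.group(1)) + float(m.group(2))
    states, resource = {}, None
    for line in log.splitlines():
        head = re.search(r"--- state for (\w+) ---", line)
        if head:
            resource = head.group(1)
            continue
        vals = re.search(r"shortfall ([\d.]+) nidle [\d.]+ saturated ([\d.]+)", line)
        if vals and resource:
            states[resource] = {
                "shortfall": float(vals.group(1)),
                "saturated": float(vals.group(2)),
            }
    return target, states


target, states = summarize(LOG)
queued_days = 49 / 4  # 49 ATI tasks at roughly 4 per day, as stated in the post

print(f"target buffer: {target:.0f} s = {target / 86400:.1f} days")
print(f"ATI shortfall: {states['ATI']['shortfall']:.0f} s")
print(f"ATI work actually on hand: about {queued_days:.1f} days")
```

The 4 Day Buffer run shows the same pattern: 345600 + 8640 = 354240, again identical to the ATI shortfall.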
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
<insert favourite swearword here> Quoting TBar: "Here's the output from Work Fetch;" I'll PM. I need a copy of client_state.xml and of app_info.xml, please. At this point it's either a malformed app_info.xml or you've found a bug ;) I really hope it's just a malformed AI. Quoting the log: "4/10/2013 12:11:59 AM | | [work_fetch] work fetch start" There's your culprit for the excess workfetch. For some reason boinc thinks you don't have any ATI work at all and therefore keeps asking for it. I need to see client_state.xml and app_info.xml to know what's going wrong there. I'll PM an email addy... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Add a fresh set of startup messages from the Event Log to the file-set you're going to email to William. |
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
On a hunch - can you undo that 'excluded GPU' and, with NNT set (so you don't ask for work), run WFD (work_fetch_debug) again? It should show how much work it would ask for. Excluding the GPUs may get in the way of properly calculating saturation. |
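For that with/without comparison, the two variants of cc_config.xml can be generated rather than edited by hand. The sketch below uses Python's standard xml.etree.ElementTree; the element names (log_flags, work_fetch_debug, options, exclude_gpu, url, device_num, type, app) are as I recall them from the BOINC client configuration documentation and should be checked against the docs for the client version in use. The values come from the "Config: excluded GPU" line in the Event Log above, and `build_cc_config` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclude_second_ati):
    """Return cc_config.xml text with work_fetch_debug on, with or without the exclusion."""
    root = ET.Element("cc_config")

    flags = ET.SubElement(root, "log_flags")
    ET.SubElement(flags, "work_fetch_debug").text = "1"

    options = ET.SubElement(root, "options")
    if exclude_second_ati:
        # Mirrors the log line: Type: ATI. App: astropulse_v6. Device: 1
        excl = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(excl, "url").text = "http://setiathome.berkeley.edu/"
        ET.SubElement(excl, "device_num").text = "1"
        ET.SubElement(excl, "type").text = "ATI"
        ET.SubElement(excl, "app").text = "astropulse_v6"

    return ET.tostring(root, encoding="unicode")


with_exclusion = build_cc_config(True)
without_exclusion = build_cc_config(False)
print(with_exclusion)
print(without_exclusion)
```

Keeping No New Tasks set while the second variant is active should let the debug output show what the client would request without it actually fetching anything.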
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Sorry guys, I'm in Windows 'ell at the moment. This is going to have to wait. The Windows 8 host is working fine at the moment with NNT. You might want to reread the thread about the two ATI cards in the meantime. The only way to keep BOINC from trying to run an AP task on the 3650 is to set it to Exclude the 3650; it's a BOINC thing. Otherwise, I can't keep the 3650 in the machine as BOINC keeps starting tasks on it...it's in the other two threads, e.g. "Continuing SETI Problems with 2 ATI Cards Installed". This part, 'sim_excluded_instances 2', doesn't that suggest BOINC thinks there are two OpenCL cards installed? To me, the problem is BOINC is trying to download tasks for two cards, and one of those two has never completed a task, so BOINC doesn't have a history on it. That appears to be why the Application details on that host are borked. Why aren't the Application details listing the actual number of ATI tasks that host is completing? It's STILL stuck at 10.... Right now I'm trying to repair the damage Ubuntu did to my Mac XP Bootcamp partition. Getting XP going again is on the front burner. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
That was a pain. I had to swap the data around on three different HDs, reformat the drives, and completely reinstall XP on a new Bootcamp partition. OSX does not like what Ubuntu does to the partition maps. Works fine on my PCs. At least I now have a fresh XP, on a faster drive... I'm going to try the New BOINC Charlie has. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Another one that makes you say Whut? I had the 6850 in that machine to see how it worked with Windows 8. It completed close to 200 APs in those few days without a problem, until:
Workunit 1228052924
Task | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
2959989611 | 6686127 | 27 Apr 2013, 2:31:36 UTC | 27 Apr 2013, 11:31:54 UTC | Completed and validated | 20,933.70 | 4,127.53 | 780.63 | AstroPulse v6 v6.04 (opencl_nvidia_100)
2959989612 | 6796475 | 27 Apr 2013, 2:31:40 UTC | 28 Apr 2013, 17:32:17 UTC | Completed, marked as invalid | 2,227.31 | 778.26 | 0.00 | AstroPulse v6 Anonymous platform (ATI GPU)
2963067135 | 6887239 | 28 Apr 2013, 21:52:18 UTC | 5 May 2013, 8:56:24 UTC | Completed and validated | 426,448.30 | 312,739.90 | 780.63 | AstroPulse v6 v6.01
I worked out 426,448.30 seconds: it's about the same amount of time it took the 6850 to complete those ~200 Successes...that all validated...except for that one. |
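The "I worked out 426,448.30 seconds" remark is a plain unit conversion, sketched below in Python. The run times are the ones listed for workunit 1228052924; the dictionary labels and the helper name are only illustrative.

```python
SECONDS_PER_DAY = 86_400


def seconds_to_days(seconds):
    """Convert a run time in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Run times (sec) reported for the three tasks of workunit 1228052924.
run_times = {
    "6686127 (opencl_nvidia_100)": 20933.70,
    "6796475 (ATI GPU, marked invalid)": 2227.31,
    "6887239 (v6.01)": 426448.30,
}

for host, secs in run_times.items():
    print(f"{host}: {secs / 3600:.1f} h = {seconds_to_days(secs):.2f} days")

longest_days = seconds_to_days(426448.30)
print(f"longest run: about {longest_days:.2f} days")
```

That is just under five days for the slowest host, which is the span TBar compares with the 6850's run of roughly 200 tasks.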
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.