v7 cuda23 WUs getting ERR_TOO_MANY_EXITS

Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1380743 - Posted: 13 Jun 2013, 16:56:19 UTC

Just following up and checking to see if any progress has been made on a solution in the last 24 hours. Following reboots on both machines yesterday (it was that time of the month for Windows updates), BOINC again downloaded the cudart.dll and cufft.dll files, which I promptly deleted again. Since the 8600GT was in the middle of a cuda22 at the time, I suspended the project before rebooting and only resumed after deleting the "new" dll's. That worked fine, but a more hands-free solution would sure be nice.

I just spent about 15 minutes doing an admittedly unscientific survey of 100 computers belonging to current or former wingmen, just to see how many had older GPUs on stock apps and might be getting hit with cuda22 and cuda23 tasks, and to see if any were being successfully processed. Out of the 100, I found only 5 that had received those tasks, 4 of which were getting errors on all of them and only 1 of which was successfully processing them. The one that was successful appears to be the only one that had been running Anonymous Platform prior to v7, so that would fit with Claggy's scenario.

The failing hosts that I saw were:
http://setiathome.berkeley.edu/results.php?hostid=851796
http://setiathome.berkeley.edu/results.php?hostid=1934182
http://setiathome.berkeley.edu/results.php?hostid=2612987
http://setiathome.berkeley.edu/results.php?hostid=2713778

and the successful one was:
http://setiathome.berkeley.edu/results.php?hostid=2693912

So, in a way, I suppose that if only 4% of your user base is experiencing this problem, that doesn't seem too bad, but considering the size of S@H's community, perhaps it is.

Currently my 8600GT is crunching the last cuda22 in its queue, so the problem will go away for a while for me, but...

On a slightly different note, on that same machine I received a v6.10 cuda_fermi task in a download overnight. I've never had one of those on that host before (so the app and dll's, etc., all were downloaded, too). Will that even run on an 8600GT? It won't bubble to the top of the queue for 4 or 5 hours or so, but it has such a ridiculously long estimated run time (73 hours) that BOINC won't download any GPU tasks behind it until it either gets most of the way through it or fails.
ID: 1380743 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1380749 - Posted: 13 Jun 2013, 17:04:33 UTC - in response to Message 1380743.  

Just following up and checking to see if any progress has been made on a solution in the last 24 hours.

Heads are being scratched on three continents, but we're not there yet, sorry.

On a slightly different note, on that same machine I received a v6.10 cuda_fermi task in a download overnight. I've never had one of those on that host before (so the app and dll's, etc., all were downloaded, too). Will that even run on an 8600GT? It won't bubble to the top of the queue for 4 or 5 hours or so, but it has such a ridiculously long estimated run time (73 hours) that BOINC won't download any GPU tasks behind it until it either gets most of the way through it or fails.

It should be OK, and a lot quicker than that. The newer app will run on older hardware, but not the other way round - 608 and 609 can't run on a Fermi GPU, but yours would be OK for them, too.
ID: 1380749 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1380750 - Posted: 13 Jun 2013, 17:05:36 UTC

No, we are still fighting BOINC.

For a manual solution, untick v6 (so you won't be reissued 6.08), then reset the project. Afterwards, check the project folder to confirm that cudart.dll and cufft.dll have been deleted by the reset.

AFAIK 6.10 should run on the 8600 - it's got enough memory.
And yes, initial runtime estimates are often a PITA.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1380750 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1380756 - Posted: 13 Jun 2013, 17:11:46 UTC - in response to Message 1380750.  

Okay, thanks for the quick responses, Richard and William. Will just keep on keeping on!
ID: 1380756 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1381136 - Posted: 14 Jun 2013, 16:22:21 UTC

The cuda20/22/23 problem is a very much older one. All three of those main programs expect to find files named, exactly, cudart.dll and cufft.dll: but they are different files, and they have to be kept separate.

So Eric distributes them with different filenames, and copies them to the 'slot' (temporary working scratchpad) directory as the application starts, and renames them on the way. If the app finds the right dll in the slot, it should run.

It occurs to me that if the application is smart enough to copy and rename those dll's before it tries to load them, it could also be smart enough to temporarily rename the cudart.dll and cufft.dll files in the project directory to something obscure at the same time, then rename them back again once the dll's from the slot directory have been loaded. Or do I have too simplistic a view of the problem?
ID: 1381136 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1381157 - Posted: 14 Jun 2013, 17:28:40 UTC - in response to Message 1381136.  

The cuda20/22/23 problem is a very much older one. All three of those main programs expect to find files named, exactly, cudart.dll and cufft.dll: but they are different files, and they have to be kept separate.

So Eric distributes them with different filenames, and copies them to the 'slot' (temporary working scratchpad) directory as the application starts, and renames them on the way. If the app finds the right dll in the slot, it should run.

It occurs to me that if the application is smart enough to copy and rename those dll's before it tries to load them, it could also be smart enough to temporarily rename the cudart.dll and cufft.dll files in the project directory to something obscure at the same time, then rename them back again once the dll's from the slot directory have been loaded. Or do I have too simplistic a view of the problem?

It isn't the application which copies and renames those files, it's the BOINC infrastructure. And it isn't BOINC's fault that the application doesn't find the renamed files - that's behaviour governed by Windows itself (if the app finds DLLs with the names it expects in the first place it looks, it stops looking - and nobody's found a way to override that).

And the app itself can't move or rename them - by the time it's started up enough to do anything useful, it's already loaded (or tried to load) the wrong DLLs. We did consider having a separate little app to shuffle things about and then call the main app - but writing that would have been a lot of work for what we hope will be a relatively small group of users (didn't you sample a few, and find it was about 4% ?).

Still looking and thinking (and scratching heads)
ID: 1381157 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1381177 - Posted: 14 Jun 2013, 18:27:19 UTC - in response to Message 1381157.  

It isn't the application which copies and renames those files, it's the BOINC infrastructure.

Okay, didn't realize it was BOINC doing that. Oh, well, it was a thought.

for what we hope will be a relatively small group of users (didn't you sample a few, and find it was about 4% ?).

Yes, 4% in the sample I did yesterday. I just tried sampling another small group of 100 and just found 6 hosts today that were failing (bringing the average up to 5%), but none that were successfully completing cuda22/cuda23. Just out of curiosity, I got a count of those error tasks in today's sample, which of course only represents those that are still in the database and haven't already been purged after other hosts have completed and validated them. Among just those 6 hosts, I found 132 cuda22/cuda23 tasks that had failed.
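
As a rough check on those figures, the two samples can be pooled. This is only a back-of-envelope sketch: the counts are the ones reported in this thread for the two informal 100-host surveys, and the variable names are made up for illustration.

```python
# Counts from the two informal 100-host samples reported in this thread.
samples = [
    {"hosts": 100, "failing": 4, "succeeding": 1},  # first sample
    {"hosts": 100, "failing": 6, "succeeding": 0},  # second sample
]

total_hosts = sum(s["hosts"] for s in samples)
total_failing = sum(s["failing"] for s in samples)
total_succeeding = sum(s["succeeding"] for s in samples)

# Share of all sampled hosts that are failing cuda22/cuda23 tasks.
failing_share = total_failing / total_hosts

# Among hosts that received those tasks at all, the share that fail them.
affected = total_failing + total_succeeding
failure_rate_among_affected = total_failing / affected

print(f"{failing_share:.0%} of sampled hosts are failing")
print(f"{failure_rate_among_affected:.0%} of affected hosts are failing")
```

With samples this small the percentages are indicative at best, but they show why the share of all hosts looks small while the share of affected hosts is very high.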

Again, I would suggest that for the time being the scheduler should be prevented from sending out cuda22 and cuda23 tasks until/unless a solution is found. I mean, what's the point in going through that whole cycle for those WUs, when better than 90% of them are going to fail? Couldn't the same WUs just as easily be sent out as cuda32 tasks?
ID: 1381177 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1381445 - Posted: 15 Jun 2013, 9:55:24 UTC - in response to Message 1381157.  

And it isn't BOINC's fault that the application doesn't find the renamed files - that's behaviour governed by Windows itself (if the app finds DLLs with the names it expects in the first place it looks, it stops looking - and nobody's found a way to override that).

Looking with Process Explorer I see the Current directory is in slots:

Path: H:\BOINC-Data\projects\setiathome.berkeley.edu\AKv8c_Bb_r1846_winx86_SSE2x.exe
Command line: projects/setiathome.berkeley.edu/AKv8c_Bb_r1846_winx86_SSE2x.exe
Current directory: H:\BOINC-Data\slots\5\

Path: H:\BOINC-Data\projects\setiathome.berkeley.edu\AP6_win_x86_SSE2_OpenCL_ATI_r1843.exe
Command line: projects/setiathome.berkeley.edu/AP6_win_x86_SSE2_OpenCL_ATI_r1843.exe --device 0
Current directory: H:\BOINC-Data\slots\0\


So why is 'the first place it looks' not slots\?\ but projects\setiathome.berkeley.edu\ ?


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1381445 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1381473 - Posted: 15 Jun 2013, 11:27:34 UTC - in response to Message 1381445.  

And it isn't BOINC's fault that the application doesn't find the renamed files - that's behaviour governed by Windows itself (if the app finds DLLs with the names it expects in the first place it looks, it stops looking - and nobody's found a way to override that).

Looking with Process Explorer I see the Current directory is in slots:

Path: H:\BOINC-Data\projects\setiathome.berkeley.edu\AKv8c_Bb_r1846_winx86_SSE2x.exe
Command line: projects/setiathome.berkeley.edu/AKv8c_Bb_r1846_winx86_SSE2x.exe
Current directory: H:\BOINC-Data\slots\5\

Path: H:\BOINC-Data\projects\setiathome.berkeley.edu\AP6_win_x86_SSE2_OpenCL_ATI_r1843.exe
Command line: projects/setiathome.berkeley.edu/AP6_win_x86_SSE2_OpenCL_ATI_r1843.exe --device 0
Current directory: H:\BOINC-Data\slots\0\


So why is 'the first place it looks' not slots\?\ but projects\setiathome.berkeley.edu\ ?

Because of the Windows DLL search order ("Search Order for Desktop Applications"):

the search order is as follows:
1. The directory from which the application loaded.
...
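
The effect of that rule can be illustrated with a small simulation. This is only a sketch: it does not call the Windows loader, the `resolve_dll` helper is made up for illustration, and it models just the two locations discussed here (the directory the .exe was loaded from and the current directory), assuming the former is checked first.

```python
import os
import tempfile


def resolve_dll(name, search_dirs):
    """Return the first path in search_dirs that contains name, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    project_dir = os.path.join(root, "projects", "setiathome.berkeley.edu")
    slot_dir = os.path.join(root, "slots", "0")
    os.makedirs(project_dir)
    os.makedirs(slot_dir)

    # Stale v6.08 DLL left in the project directory, next to the .exe.
    with open(os.path.join(project_dir, "cudart.dll"), "w") as f:
        f.write("old 6.08 runtime")
    # Correct DLL copied and renamed into the slot (current) directory.
    with open(os.path.join(slot_dir, "cudart.dll"), "w") as f:
        f.write("cuda23 runtime")

    # Application directory is searched before the current directory.
    search_dirs = [project_dir, slot_dir]
    with open(resolve_dll("cudart.dll", search_dirs)) as f:
        loaded = f.read()

    # Once the stale copy is deleted, the slot copy is found instead.
    os.remove(os.path.join(project_dir, "cudart.dll"))
    with open(resolve_dll("cudart.dll", search_dirs)) as f:
        loaded_after = f.read()

print(loaded)
print(loaded_after)
```

This matches the manual workaround used earlier in the thread: with cudart.dll and cufft.dll removed from the project directory, the copies in the slot directory are the first ones found.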
ID: 1381473 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1381517 - Posted: 15 Jun 2013, 15:39:20 UTC

Another thought (1 per day is my limit), although you're unable to do a rename of the cudart and cufft dll's in the project directory before the cuda22 or cuda23 app loads them, would it still be possible to do such a rename before it exits? What I'm thinking is that, since BOINC (at least in 7.0.64) makes the effort to restart the app multiple times (I think I saw about 80 retries on one) before issuing the ERR_TOO_MANY_EXITS message, a rename on the first pass would ensure that the second pass would find the proper dll's, sort of like leaving yourself a note to read on your next trip through the time loop. Then, on the second pass, once the app gets past the point where it failed the first time, it could rename the dll's in the project directory back to their original names. It's a bit hokey, but seems like it might be feasible.
ID: 1381517 · Report as offensive
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1381538 - Posted: 15 Jun 2013, 17:18:20 UTC

Richard, do you mind if I ask what kind of solution you tried?
ID: 1381538 · Report as offensive
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1381548 - Posted: 15 Jun 2013, 17:47:12 UTC - in response to Message 1381517.  

...although you're unable to do a rename of the cudart and cufft dll's in the project directory before the cuda22 or cuda23 app loads them, would it still be possible to do such a rename before it exits?

The application doesn't start (correctly) because of the wrong DLLs, so it can't do any renaming, either on start or on exit.

Gruß,
Gundolf
ID: 1381548 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1381557 - Posted: 15 Jun 2013, 18:17:26 UTC - in response to Message 1381548.  

...although you're unable to do a rename of the cudart and cufft dll's in the project directory before the cuda22 or cuda23 app loads them, would it still be possible to do such a rename before it exits?

The application doesn't start (correctly) because of the wrong DLLs, so it can't do any renaming, either on start or on exit.

Gruß,
Gundolf

I don't think Jeff's idea has been considered, and it's unclear whether the app actually performs some portion of its initialization before crashing. I think the idea may be a possibility still, but anything which requires rebuilding and testing a science application wouldn't be a very quick fix.
Joe
ID: 1381557 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1381566 - Posted: 15 Jun 2013, 18:41:05 UTC - in response to Message 1381538.  

Richard, do you mind if I ask what kind of solution you tried?

We have three file dependencies to consider.

1) app --> cudart
2) app --> cufft
3) cufft --> cudart

We can rename both cudart and cufft, and we can change dependencies (1) and (2) legally (well, Jason could do it legally with a compiler - I took the quick and dirty route with a hex editor)

But we don't have legal rights to modify cufft.dll, or the source code to recompile it. Offline, and strictly in the privacy of my own testbench, dependency (3) can be hacked with the hex editor, and the app then works. But there's no way that UCB could be seen to be circulating hacked DLLs without the permission of the rights holder. I believe they've been asked.
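
The three dependencies can be written down as a small graph to show why renaming the DLLs drags cufft.dll into the fix. This is an illustrative sketch only: the dictionary layout and the `binaries_to_patch` helper are assumptions made for the example, not anything taken from the real binaries.

```python
# Import dependencies as listed above: importer -> DLL names it asks for.
imports = {
    "app": ["cudart.dll", "cufft.dll"],  # dependencies (1) and (2)
    "cufft.dll": ["cudart.dll"],         # dependency (3)
}


def binaries_to_patch(renamed, imports):
    """Return the binaries whose import lists mention any renamed DLL."""
    return sorted(
        importer
        for importer, needed in imports.items()
        if any(name in renamed for name in needed)
    )


renamed = {"cudart.dll", "cufft.dll"}
needs_patch = binaries_to_patch(renamed, imports)

# Only the app can be rebuilt by the project; cufft.dll is third-party.
project_owned = {"app"}
blocked = [b for b in needs_patch if b not in project_owned]

print(needs_patch)
print(blocked)
```

Renaming cudart.dll alone is enough to put cufft.dll on the list, because of dependency (3).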
ID: 1381566 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1382913 - Posted: 20 Jun 2013, 2:59:03 UTC

Okay, I can certainly understand that it might be difficult, impossible, or not worth the effort to overcome the incompatibility between v6.08 cuda and v7 cuda22/cuda23. In that case perhaps the focus should be to phase out v6.08 as quickly as possible and have BOINC simply delete all the v6.08 app and dll files. (Can BOINC be told by the project to do that?)

In the meantime, I simply cannot understand why v7 cuda22 and cuda23 tasks are continuing to be sent out when it appears that the vast majority are just going to crash and burn, wasting resources at both ends and in between, and tying up what everybody seems to say is valuable space in the project's database. I just got 15 more cuda23's tonight (12 for my 8600GT and 3 for my 405). I get the sense that I'm probably one of a very small group that's successfully processing them (and only thanks to the advice provided here by Richard and William). Hopefully there are also some who've read this thread and followed that same advice (though certainly no one who's mentioned it so far ;-)). Then there's the category of users that Claggy mentioned, who ran Anonymous Platform under v6 and don't have the dll conflict if they've chosen to run v7 with stock apps. And then there are brand new users, who've just signed up since the v7 rollout (and I would have to guess that very few of them have the older GPUs affected by this problem). For everybody else with the older GPUs, sending out v7 cuda22/23 tasks is just foolish and futile.

This afternoon I took a look at 50 of my recent v7 WUs which had required more than a single wingman for reasons other than "validation inconclusive". I found 39 of those had been paired with at least one host who was simply thrashing and trashing cuda22/23 tasks. Looking at the list of Error tasks for the 26 different hosts represented in that sample, which of course only includes those tasks that had failed but not yet been validated by others and purged from the database, I counted a total of 821 cuda22 or cuda23 tasks (with one host alone accounting for 181 of them), most that had failed just within the last 3 or 4 days, but some that have been sitting in the database since shortly after the v7 rollout. Of course, a number of those hosts have since been sent more of the same, which are "in progress", i.e., guaranteed errors of the future.

Although my own 8600GT and 405 will happily crunch cuda32, cuda42, and cuda50, I noted many of these other hosts had cards, or at least very old drivers, that seemed to be only able to handle cuda22 and cuda23 under v7, so cutting them off would mean that they'd only get v6.08 resends, but...so what! If they're all failing anyway, the project isn't getting any contribution from them as it is.

I'm sorry if this comes off as a bit of a rant, but it just kind of offends my sensibilities to see resources of any kind going to waste like this. Sigh....
ID: 1382913 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1382994 - Posted: 20 Jun 2013, 11:48:59 UTC - in response to Message 1382913.  

The problem is that even when 6.08 gets deprecated, the files remain. BOINC doesn't clean up unless you reset/detach.

Eric was trying to issue a delete on beta to see how it goes - but it will only reach newer BOINC clients, AFAIU.

As I told David, unless he's got something up his sleeve, we are well and truly f******. He didn't reply. Any new client code is useless for our purpose.

Sloppy coders >:(
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1382994 · Report as offensive
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1383221 - Posted: 20 Jun 2013, 22:28:40 UTC
Last modified: 20 Jun 2013, 22:37:53 UTC

Weell...

As far as I can tell, BOINC has two mechanisms for managing files it keeps for longer periods.

The first is Locality scheduling / Volunteer storage. Unfortunately, abusing that doesn't work here. When it receives a command to delete a file, the client just unsets the file's sticky bit and leaves the actual deletion to the garbage collector. The file is still referenced from the app_version, so it doesn't get deleted.

The other is app_version handling. If BOINC receives a newer app_version for an (app, plan class, platform) combo than the one it already has, BOINC will delete the old app_version as soon as no workunit references it.
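
A minimal sketch of that second rule, as described here, might look like the following. It is not the client's actual code: the `AppVersion` class, the `garbage_collect` function and the version number 699 standing in for "6.08b" are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppVersion:
    app: str
    plan_class: str
    platform: str
    version_num: int


def garbage_collect(app_versions, referenced):
    """Drop versions superseded within the same (app, plan_class, platform)
    key, unless some workunit still references them."""
    newest = {}
    for av in app_versions:
        key = (av.app, av.plan_class, av.platform)
        if key not in newest or av.version_num > newest[key].version_num:
            newest[key] = av
    return [
        av for av in app_versions
        if av in referenced
        or newest[(av.app, av.plan_class, av.platform)] == av
    ]


# Illustrative values only; 699 is a hypothetical number for "6.08b".
old = AppVersion("setiathome_enhanced", "cuda", "windows_intelx86", 608)
new = AppVersion("setiathome_enhanced", "cuda", "windows_intelx86", 699)
fermi = AppVersion("setiathome_enhanced", "cuda_fermi", "windows_intelx86", 610)

# While a workunit still references the old version, it is kept.
kept_while_referenced = garbage_collect([old, new, fermi], referenced={old})
# Once nothing references it, the superseded version goes away.
kept_after = garbage_collect([old, new, fermi], referenced=set())
```

Note that the cuda_fermi version survives in both cases: it has a different plan class, so the newer cuda version does not supersede it.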

So what could be done is to release a 6.08b cuda version. It doesn't have to be a really new version. It can be otherwise exactly like 6.08 cuda except for the DLL names, which would need to be handled with open_name/copy_file magic.

Now, BOINC only sends information about an app_version when sending a workunit to a client. The remaining v6 workunits are unlikely to provide a reliable way to deliver 6.08b to the clients that need it, and continuing to split work for v6 is not a very good idea either. As it happens, BOINC has a feature that allows assigning work to specific hosts. So what can be done is to assign a v6 workunit to every host that trashes all the v7 cuda22/23 work it receives, and that workunit then pulls the updated 6.08b with it to the client.

In its current operating mode the scheduler sends any work to the fastest app_version the host can run. For hosts that can run it, the fastest is probably 6.10 cuda_fermi. To get the scheduler to pick 6.08b cuda it would be necessary to deprecate 6.10 cuda_fermi. The plan fails for any host that reports an exceptionally slow GPU or fast CPU.
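
The selection rule described above can be sketched roughly as follows. This is a toy model, not the scheduler's code: the `pick_app_version` function, the dictionary fields and all the speed figures are invented for the example.

```python
def pick_app_version(candidates):
    """Pick the non-deprecated candidate with the highest projected speed.

    candidates: list of dicts with 'name', 'projected_flops', 'deprecated'.
    Returns the chosen name, or None if nothing is usable.
    """
    usable = [c for c in candidates if not c["deprecated"]]
    if not usable:
        return None
    return max(usable, key=lambda c: c["projected_flops"])["name"]


# Made-up speed figures for a host with an older GPU.
host = [
    {"name": "6.10 cuda_fermi", "projected_flops": 30e9, "deprecated": False},
    {"name": "6.08b cuda", "projected_flops": 25e9, "deprecated": False},
    {"name": "6.03 cpu", "projected_flops": 3e9, "deprecated": False},
]
before = pick_app_version(host)

# Deprecating 6.10 cuda_fermi leaves 6.08b cuda as the fastest choice.
host[0]["deprecated"] = True
after = pick_app_version(host)

# A host reporting a very slow GPU (or very fast CPU) defeats the plan.
host[1]["projected_flops"] = 1e9
slow_gpu = pick_app_version(host)
```

The last case is the weak point mentioned above: if the CPU version is projected to be faster, the host never receives 6.08b cuda.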

As a small bonus the scheduler ignores the resource type in the client's request and just picks the fastest app_version so the app_version gets updated as soon as possible.

As for what work to send: the assigned work goes through validation and assimilation just like any other work. Assigned workunits are sent to one host in the target set (here the set is just one host) at a time until the workunit validates OK. The last time I looked into the MB validator it was set up so that it can do single validations, so that's not a problem in itself. The target host, however, could be as unreliable as a host can be, which makes single validation less desirable. Also, it would be necessary either to split work manually or to modify the splitter to assign work.

So, instead of sending normal work, let's make up a workunit and send copies of that. Since such a result doesn't have much scientific value, it should be one that completes fast, like a -9. There are plenty of -9's around, so it's just a matter of picking one. Or, if it doesn't matter that users get an invalid result in their task list, make the workunit fail immediately with e.g. a parsing error. Empty files are good for that.

Then there's still the assimilator. Anything that gets past that ends up in the science database. If we are sending basically trash workunits to clients, I don't think we would want the results from those added to the science db. In the past we have had workunits from the "jazz" test tape. I think at least once those were sent by accident. I'd say it's reasonable to assume that the gang has a way to remove those results from the science db, or at least filter them out before the data gets further processing. So sending a -9 workunit from the jazz tape could be a good option.

The workunit record in BOINC database contains a pointer to a science database record. It would be possible to insert a record to the science db and have all the assigned workunits point to that record. And then put a note next to the science db to remind to remove that record. Or the pointer could be set to point to a record that doesn't exist or make the pointer otherwise invalid. That way the assimilator would fail to insert the result to the science db.


There's still that uncertainty in choosing the app_version, so here's plan B.

In BOINC's homogeneous app version mechanism each workunit is sent to just one app_version. Abusing that, if v6 is set as homogeneous app version that would allow creating the assigned workunits so that they are sent to v6.08b cuda and no other app_version.

As a very nice bonus, hr workunits are always sent to the hr class / app_version that was used when the workunit was sent for the first time, even if that app_version is deprecated.

So if 6.08b cuda is released and immediately after that deprecated, so that it's not used for any normal workunit, it wouldn't matter what it does. All it needs to do is get downloaded successfully so that the client can replace the older app_version with it.

This means that instead of releasing a full-blown science application it would be possible to release an application that only calls exit(FAILURE) (or even an application that doesn't start at all, or anything that makes the client move on to the next normal workunit). That would give a very nice saving in bandwidth usage.

This does have some side effects. If, and when, any normal v6 workunit needs to be re-tried on another host it gets locked to the app_version that was used when it was sent to the backup wingman. That could suck a bit if that lone Solaris host grabs a large bunch of work and then goes awol.

Also, the scheduler doesn't send hr workunits to clients using anonymous platform which could upset some people.


Both of these plans have weak points.

The lesser problem is the comment "for now, only look for user assignments" next to the code that sends assigned work. Fortunately that's in the scheduler, so it can be fixed, and it's even reasonably easy to fix.

The bigger problem is deleting the DLLs. If BOINC doesn't manage to delete the files, it won't try again. That could happen if BOINC tries to start a v7 cuda22/23 workunit just before it tries to delete the files. In that case Windows could have the files locked, so the delete could fail.


I probably forgot something important. We'll know once half the BOINC database goes poof.


Crazy? Most definitely.

edit: typos
ID: 1383221 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1383234 - Posted: 20 Jun 2013, 23:53:52 UTC - in response to Message 1383221.  

The other is app_version handling. If BOINC receives a newer app_version for an (app, plan class, platform) combo than the one it already has, BOINC will delete the old app_version as soon as no workunit references it.

I had Richard check the CS of a long-running (Einstein) host for app_version entries of deprecated apps (BOINC 6.12.34). The CS has entries for every single app it ever ran (since the last reset/attach).

Even if you make an app that deletes the files, they stay referenced in CS, and on startup BOINC will notice they are missing and (try to) download them.

That said, maybe if the files were no longer available from the server - that might work. I think I missed that option in my emails. We'd still need to deliver a 'delete app' to the hosts.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1383234 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 1383240 - Posted: 21 Jun 2013, 0:51:22 UTC - in response to Message 1383234.  

I'm ready to try anything at this point. Unfortunately I think we need to wait until we're out of S@H v6 work before we try to replace the v6 cuda apps with something that doesn't have dlls.

I need to convince David of the following

1) if an app gets a file size error or a signature error, BOINC should try to download it again once.

2) BOINC really needs a delete_app_version and a redownload_app_version message to the host.


@SETIEric@qoto.org (Mastodon)

ID: 1383240 · Report as offensive
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1383330 - Posted: 21 Jun 2013, 10:22:15 UTC - in response to Message 1383234.  

The other is app_version handling. If BOINC receives a newer app_version for an (app, plan class, platform) combo than the one it already has, BOINC will delete the old app_version as soon as no workunit references it.

I had Richard check the CS of a long-running (Einstein) host for app_version entries of deprecated apps (BOINC 6.12.34). The CS has entries for every single app it ever ran (since the last reset/attach).

Even if you make an app that deletes the files, they stay referenced in CS, and on startup BOINC will notice they are missing and (try to) download them.

I'm not very familiar with Einstein. Don't they release a new app + app_version each time they start new research or something, whatever it is that they do?

Their Gravitational Wave S6 Directed Search (CasA) has id=24 so they have released quite a few apps in the past.


The app_version garbage collection is in client_state.cpp, lines 1469-1495. Git blame says the oldest parts of it were written in April 2005, with updates in February 2009 and June 2012.

Also, like I said, the code checks for the app, plan_class and platform combination. (I think I saw a commit message somewhere that implied older versions of the client didn't check for plan_class and/or platform.)

So the idea was not to release a "science" application that deletes files, but to make the client garbage collect the older version.


Just to make it clear: by app and app_version I mean what the BOINC internals mean by them. Neither means the actual science application executable file.
ID: 1383330 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.