"Our greatest responsibility is to be good ancestors."

-Jonas Salk

Sunday, February 7, 2010

What does openness in science mean?

Kooiti Masuda has some very interesting points in response to the question of openness in science, especially regarding data and code. Please read them first before reading my response, which follows.

1) Commercial codes used as part of scientific process

I agree with KM's point that matters will be complicated by the fact that much of the software scientists use is closed-source and expensive. In any case, the commercial software community will strongly object to any claim that, by virtue of its use in science, their code must be released or even made auditable, or that publicly funded competitors to their products should be developed.

But none of this prevents code developed in-house at public expense from being released as open source. Such codes are substantial and are crucial to much of the controversy. Nobody is suggesting that any significant bias is introduced via bugs in Matlab, so the fact that Matlab is licensed is not practically important for these purposes.

2) Complications due to institutionally expected commercialization of academic codes

In fields far from climate, commercialization of academic codes is the norm. Institutions, for their own interests, presume that codes developed there have commercial value unless demonstrated otherwise. Publicly funded institutions act very much like corporations in this regard, except perhaps with less agility.

Should these limitations be applied only to particular fields? How can one reasonably establish boundaries between fields where publication is expressly required and others where it is discouraged?

I am actually gearing up for negotiations with my university to permit me to release a modest piece of general-purpose code I have written. The default position of the intellectual property office of the university is "no, this code belongs to the institution, and if there is potential for outside use, they should pay us for it".

As in a corporation, at least in America, the case for open source must be made explicitly, and it must focus on the needs of the university or public laboratory, not of the science or the general public. Support for the opposite position may end up coming from the funding agencies, the principal investigators and/or the general public. The general public, especially people working in small, closely-held businesses, has difficulty understanding the bureaucratic barriers to open source.

3) Informal coding experiments

As for the difficulties posed by informally developed code, I actually have some technical ideas that would greatly reduce them, which (ironically, I am afraid) I need to keep somewhat private at present. Hopefully I can find funding for this work and release it into the public domain. Wouldn't it be bizarre to have to close the source of a tool for facilitating the open sourcing of academic software?

4) Supercomputing

Another problem you do not raise is the difficulty with very large calculations, which tend to be performed on one-of-a-kind machines. Here, the results may not be repeatable in practice even within the given research team, because the constantly shifting experimental platforms subvert exact repeatability and require occasional adaptation of the codes just to keep up with the machine's requirements. Since the machine is unique not only as an instance but as a configuration, supercomputing undermines reproducibility.
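
To make this concrete, here is a minimal sketch (in Python, purely illustrative, and not drawn from any actual model code) of the numerical issue underneath: floating-point addition is not associative, so when a sum is reduced across a different number of processors, the partial sums combine in a different order and the low-order bits of the answer change. Bit-for-bit repeatability thus depends on the exact shape of the machine.

    # Purely illustrative: floating-point sums depend on evaluation order.
    # Splitting the same data across a different number of "processors"
    # changes the reduction order and hence the low-order bits of the result.
    import random

    random.seed(42)
    values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
              for _ in range(100000)]

    def chunked_sum(xs, nchunks):
        # Sum xs as nchunks partial sums, mimicking a parallel reduction.
        size = len(xs) // nchunks
        partials = [sum(xs[i * size:(i + 1) * size]) for i in range(nchunks)]
        partials.append(sum(xs[nchunks * size:]))  # any leftover elements
        return sum(partials)

    for n in (1, 4, 16, 64):
        print(n, repr(chunked_sum(values, n)))

Run this and the printed sums will typically differ in the trailing digits, even though the data and the mathematics are identical. Now imagine that effect compounded over millions of timesteps on a machine whose configuration keeps changing under you.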

5) Still, Open Science is Always Better

All this said, I remain a strong supporter of publication, documentation and reproducibility in dramatically more detail than was possible in the past. There are far more difficulties than the rather vicious critics of the field acknowledge. However, arguments from within the field that openness is technically impossible or socially undesirable are very unhelpful and, in my opinion, very wrong.

6) Can openness backfire?

There is certainly the risk that open science will appear to facilitate misunderstanding. The widespread misuse of the web-based portal to the MODTRAN program, by people who don't understand the precise nature of the problem it solves, is very illustrative. My feeling is that people who get things wrong will get things wrong no matter how much or how little information you give them.

In the end, the only defense of genuine science remains peer review, though the structure of peer review may also need to change in the future. But that's another topic.

11 comments:

David B. Benson said...

MT --- I don't agree. The data is owned by the agency collecting it or by a sponsor thereof, depending upon contractual agreements. If government sponsored and the people, through legislation, agree to make the data freely available, fine. Similarly for private sponsors such as the Sloan Foundation.

Code? Much more problematic. The tradition was (still is) to publish enough detail so that another researcher, in her own lab, could reproduce the result. Not to simply let her into one's own lab to repeat the technician's work by simple rote, etc.

In a setting having nothing to do with climatology, a software V&V type wanted me to let him have my research code to do his thing on. I refused. I informed him that he knew the applicable equations and that if he had trouble understanding the parameter estimation methods employed, I'd be happy to explain, but that the actual research code had to be his, this to maintain the concept of scientific reproducibility.

Hank Roberts said...

http://scienceblogs.com/effectmeasure/2010/02/all_the_science_news_thats_not.php

Hank Roberts said...

Useful, varied, blogging on scientists and journalists:
http://scienceblogs.com/clock/2010/02/journalism_wrap-up_from_scienc.php

Anonymous said...

I've spent the last 10 years being part of a team making climate model output and code available to the community and the public, the IPCC AR4 model output archive being the most visible example.

It's not a trivial undertaking. There are a huge number of factors involved - metadata and metadata standards (so many to choose from), registration, metrics, security, versioning, access to data in deep storage, user support (a biggie), maintaining the site, maintaining funding for many years... The list is quite long.

It's a lot more complicated than putting a bunch of files on an ftp server.

David B. Benson said...

G-Man wrote "... maintaining funding for many years..."

How is that accomplished?

G-Man said...

David Benson asked how we keep funding going - and the answer is that we have to keep re-applying for grant money every 3 years. If we don't get it, the project runs on fumes until it breaks, and then it's done.

Doesn't make sense at all. Bureaucracies rarely make sense.

David B. Benson said...

G-Man --- Astronomers have good cooperation and data sharing. Sponsors include the Sloan and Keck Foundations. I don't know the details, but there seems to be an expectation of continuity.

Rattus Norvegicus said...

BTW, G-Man, have you heard anything from Ken Burnside at data-n-demagogouges.com? They were going to try to run CCM3 on a PC, but they haven't posted any progress reports in a couple of months now.

I have my guesses as to why, but I'd like to know your take...

G-Man said...

David, that's a possibility for funding - find a foundation. I hadn't thought of that. I'll pass the idea along.

G-Man said...

Yes, Rattus, I've been in touch with Burnside. I think they won't have much success at all, but we'll see. GCMs of the class of CCSM3 really aren't designed to run on PCs.

Michael Tobis said...

They aren't "designed" at all. They are sort of emergent properties of the universe...