Five suggestions for Open Science evangelists
A historical IT perspective on how to put science raw-data acquisition on auto-pilot.
This week I attended the @boraz lecture at UNC-Charlotte, and the debate-topic of Open Science was in the air. After some thought, and after composing an email to Bora, I decided I would add to the public debate. First off: while I'm hardly even a citizen-scientist, my goal here is to offer, as best I can, a software-industry perspective on the important mission of Open Science.
Background:
At #scio11 I met a number of wonderful and passionate sci-geeks under the Science Online event umbrella nurtured by Bora and Anton Zuiker (@mistersugar). There I learned a bit about Open Science, and the concept of scientists publishing, and even live-pushing, their raw data to the web (a topic I recall also hearing about at an earlier Science Online event in Chapel Hill).
In Bora's talk this week he drew historical comparisons between 2012 and the 1800s, when science data was kept far less under lock and key. To anyone paying attention to tech and civic trends and the progress of mankind, there really is no debate here. Science data is headed out of the closet, just like open source software (code), just like citizen journalism (blogs), and just like civil discourse (Arab Spring). The only real political questions are how to get there faster, and maybe a bit about orderly transitions.
So How About A Strategy:
Below I've outlined a group of interrelated ideas to help drive Open Science, drawn from my experience working on various Internet and IT projects.
Place = KNOWN
Open Science needs one place, or one standard, for raw data.
The grant-funding process often drives science developments. Groups such as NIH.gov and NSF.gov now require some researchers to publish portions of their data. A quick G-search shows lots of activity surrounding Open Science, including some messy debates I won't go into, but here's the nut: there does not appear to be any single or organized "Place" to put open science data, in particular raw data that in some cases could even be captured and disseminated live.
By Place I mean: I set up my instrument, or even a smartphone, to capture a data set, and I'm immediately faced with the challenge of where to put and organize my data. Of course storing locally and uploading later is fine, but this issue of where to put the live/raw data is, in my opinion, holding things back.
Brewster Kahle is fond of, and famous for, driving the meme "universal access to all human knowledge". It would seem the #openscience version is something like universal storage of raw science data. Sorta like the seed people, there just needs to be a place.
Container = SIMPLE
The standard container for Open Science data should be as simple as a single blog post.
At the Scio11 event I found lots of open science brainiacs mired in categorization and containers. From a data-standards, repository, and evolution-of-a-system perspective, that just seemed backwards to me. When I think about packing raw science data into XML containers, and the sheer diversity of structures and forms of data, I immediately think of the evolution of RSS, and RSS 0.91.
Yes kids, Uncle Dave Winer (scripting.com) and his passion in the late '90s for structuring unstructured data into what later became blog syndication (though he was not alone in that mission, according to the 'pedia article).
So what is it about raw data, like observation-notes or data-threads produced by instrumentation, that is so hard to pack-n-store? For 90-something percent of the data sets, even RSS would likely work just fine. Thus, my suggestion here is to just vote-and-go with some raw container format, while keeping in mind my next point about instrumentation.
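To make that concrete, here is a minimal sketch (Python, with a made-up instrument name and hypothetical URLs) of packing a raw reading series into a plain RSS 2.0-style container. Nothing here is a proposed standard; the point is just how little structure a vote-and-go container actually needs.

```python
# A minimal sketch: packing a raw data series into an RSS 2.0-style container.
# The field choices (instrument title, CSV-in-description) are my own
# illustration, not a proposed Open Science standard.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

readings = [(1330300800, 21.4), (1330300860, 21.6), (1330300920, 21.5)]  # (unix_ts, value)

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Bench thermometer #3 (raw log)"
ET.SubElement(channel, "link").text = "http://example.org/lab/thermometer-3"  # hypothetical
ET.SubElement(channel, "description").text = "Unreviewed raw instrument output"

item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "Temperature series, 2012-02-26"
ET.SubElement(item, "pubDate").text = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S +0000")
# The payload itself is just CSV text; RSS doesn't care what the data looks like.
ET.SubElement(item, "description").text = "\n".join(f"{ts},{val}" for ts, val in readings)

rss_bytes = ET.tostring(rss, encoding="utf-8")
print(rss_bytes.decode())
```

That's the whole container. Anything fancier (richer metadata, domain vocabularies) can layer on later, the same way blog syndication did.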
Default = ON
Most people don't turn off what is already turned on.
Remember the (PC) days before plug-n-play? Or when you set up your first Internet access and had to answer the question "Obtain an IP address automatically"? That's the world of science instrumentation today (OK, I'm way out on a limb here; I have zero 21st-century instrument experience, but please play along). I figure most modern science devices are IP-based, can connect to the web, and likely have the ability to "push" data "somewhere".
To drive Open Science, how about this device configuration model (a rough sketch follows the list):
- The publish point should be: Place=KNOWN (agreement on the repository)
- Pack the data into a core universal format: Container=SIMPLE (stop sweating the small stuff first)
- Then set up instruments to Default=ON (make it a step harder to NOT publish)
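Here is that rough sketch, on the instrument side. The repository URL and config keys are hypothetical and this is not any vendor's API; the one idea it encodes is that auto-publish is the default, and opting out is the extra step.

```python
# A sketch of Default=ON on the instrument side. The endpoint URL and config
# keys are hypothetical; the point is only that auto_publish defaults to True
# and opting out is the step that requires effort.
import urllib.request

DEFAULT_CONFIG = {
    "repository_url": "http://repo.example.org/raw",  # Place = KNOWN (hypothetical)
    "container": "rss",                               # Container = SIMPLE
    "auto_publish": True,                             # Default = ON
}

def publish(payload: bytes, config=DEFAULT_CONFIG):
    """Push a packed data container to the repository unless the owner opted out."""
    if not config["auto_publish"]:
        return None  # the owner had to go out of their way to get here
    req = urllib.request.Request(
        config["repository_url"],
        data=payload,
        headers={"Content-Type": "application/rss+xml"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# publish(rss_bytes)  # would run on every capture, with no per-dataset setup
```

Whether that default lives in firmware, an OS settings panel, or vendor software matters less than the fact that it ships turned on.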
The Internet would have like, no users if everyone had to manually configure their IP address. WordPress.com would have like, zero percent of websites if everyone had to set up their own domain name before they built a site. Well? Open Science has like, zero public data since there is no equivalent default=Auto-Publish methodology within reach.
App = OS-INSTALLED
Every smartphone has a calculator, and could just as easily be an instrument for raw-data.
What if, in a science-logging world, every smartphone was an instrument (it is)? And every device came with a "calculator-grade" app to turn on and publish raw data to a known repository?
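Purely as a sketch (I don't know what sensor APIs any given phone OS actually exposes, so read_sensor() below is a placeholder), a "calculator-grade" logging app is not much more than a capture loop feeding the pack-and-publish steps sketched above.

```python
# A sketch of a "calculator-grade" logging app: read a sensor, timestamp it,
# pack the batch, publish it. read_sensor() is a stand-in for whatever the
# phone OS actually exposes.
import random
import time

def read_sensor():
    # Placeholder: pretend this is an accelerometer/magnetometer/thermometer call.
    return random.gauss(21.5, 0.1)

def capture(duration_s=60, interval_s=5):
    """Collect (timestamp, value) pairs for duration_s seconds."""
    readings = []
    start = time.time()
    while time.time() - start < duration_s:
        readings.append((int(time.time()), read_sensor()))
        time.sleep(interval_s)
    return readings

# readings = capture()
# publish(pack_as_rss(readings))  # pack + push as in the earlier sketches
#                                 # (pack_as_rss is a stand-in name)
```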
I'm sure there are some fine people out there working on various science-logging methods in the app space. I know when I wanted a compass for my Samsung I had a boatload of options, though I can't say they were instrument-grade. Regardless, quality generally matures with market reach, and the core point is that IF the app kids wanted to create a robust instrument-suite for smartphones, they are, in a word, relegated to the marketplaces, and to years of slogging it out until a few standard-bearers take root.
Someone at the OS level needs to take note and fold instrument-to-raw-data capabilities into the various devices available to consumers. Clearly instrumentation is not a marketplace differentiator on Carrier and OS-team roadmaps, but in a slightly-more-perfect world, I would hope the Sci-Geeks and the Phone-Mates could actually collaborate on instruments and repository features.
Announcement = RAW-LOG-bub:
What are you raw-publishing now?
So let's pull all this together. The iPhones and the instruments collect data; the format is RSS/XML-raw-ish; the repository is Archive.org / Apache / Wikipedia.org-ish; and default=ON (hopefully) turns the Open Data stream into 10x the Twitter fire hose (Moore's Law for instrument data, anybody?).
How about we make all this data discoverable via an announcement system, a "just the facts, kid" approach to announcing these data sets? How about a custom Tweet-style platform for Open Science data? Yes, Twitter can serve that purpose, but as the pro-catalogers get engaged and the meta about raw science data grows, the data model of Twitter easily under-serves the Open Science mission.
Now, if you're playing along at home it's easy to see a lot of people lining up to debate the features, the necessities, and so on of micro-content systems and raw-sci announcements. Undoubtedly, there is already a batch of these systems out there somewhere (that I've never seen), but commonality-wise, what to do?
Technically, this starts to look a lot like what Google tried to do / did with real time and blog-comm in their attempt to normalize Twitter with WordPress (and I suppose other micro-content systems). PubSubHubbub, RSSCloud, that stuff (and I have no idea who ran what, or who should be attributed across that spectrum).
Point being, Twitter got big fast, and the idea at the time was that others who do "similar" things should be encouraged. I.e., let's leave some room for a real-time, multi-source solution and not make Twitter-proper the global standard for real-time science data announcements. That seems to be what G did when faced with Twitter's success, and, openly, that's probably what should happen with real-time sci-logs. (Plus, or maybe, a way to collapse the "repository" and the "announcement system" together via some common standard.)
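For the announcement step, a publisher ping in the PubSubHubbub style is about as small as it gets: after a new raw data set lands in the repository, tell a hub the feed changed and let subscribers fetch it. A rough sketch, with hypothetical hub and feed URLs:

```python
# A sketch of the announcement step, modeled on the PubSubHubbub publisher ping:
# POST hub.mode=publish plus the updated feed URL to a hub, so subscribers know
# to re-fetch. The hub and feed URLs below are hypothetical.
import urllib.parse
import urllib.request

HUB_URL = "http://hub.example.org/"                         # hypothetical Open Science hub
FEED_URL = "http://repo.example.org/raw/thermometer-3.rss"  # the feed that just updated

def announce(feed_url=FEED_URL, hub_url=HUB_URL):
    data = urllib.parse.urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode()
    with urllib.request.urlopen(hub_url, data=data) as resp:
        return resp.status  # 204 No Content means the hub accepted the ping

# announce()  # called right after publish(), so discovery is as automatic as capture
```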
Wrap up (please).
Data is money. Science is important. Science data is often public-funded. The move to Open Science is inevitable. Maybe even something of a data-right in the years to come.
Bora reports from the trenches that less-competitive (less-patentable) science verticals, like dinosaur digs and astro-firsts, are leading the open-data movement. But the larger Open Science debate is likely moving from science-esoteric to political hot potato, since the Patent crowd is charged with, and will want to, govern every single "bit".
Meanwhile, the core Open Science developments are mired in categorization and storage debates that, to this IT professional, somewhat mirror the early days of blog standards, where a Just-Get-Over-It mentality governed initial advancements quite well.
Thus, I believe #openscience solution-thinkers have a nice historical model to reference, one that could help drive the important mission of publishing raw science data in the clear, and for the long, long term.
My hope here is that these notes help the Open Science evangelists, and that the debates begin to take deeper root with the we-can-do-something-about-that Internet folks like Wales, Kahle, and the OS-houses (hint, hint, hint, hint), where some small but strategic decisions can help push the majority of today's Open Science data challenges into history.
Attribution Links:
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001195
http://www.isitopendata.org/about/
http://en.wikipedia.org/wiki/Data_sharing
Posted on February 26, 2012, in Uncategorized.

Comments:
Publicly available data is a great resource (that I use myself), but what you describe won't work without the active participation of academic scientists. A very small proportion of biology data is collected in a way that's both automated AND doesn't require massive amounts of storage space. Even if you could handle the massive raw data files from next-gen seq (or even LC-MS), it'd be useless without comprehensive and time-consuming annotation of the data (sample type, the way it was prepared). Digital lab notebooks could reconcile this, but everyone I know still uses paper (can get wet in the lab, can tape in gel photos, etc.). More importantly, the career of an academic is incredibly precarious. I don't think anyone would risk allowing their own data to get scooped unless they got some real tangible benefit (which I can't imagine).
As a scientist who uses public data, I'm happy to wait until a paper is published to get access. This also makes it much more likely that data you find was actually collected carefully. I'd like to see more support for things like the GEO omnibus and short-read databases.
Thank you Matt for the comment, and insight. I’m at the fringes of these challenges, and am only hoping to help bring additional IT thinking into the discussion.
Good piece, though I don't know what the relationship between open science and patents is. Sharing data, that makes sense, but that is exactly the motivation for patents: full disclosure in exchange for the rights to profit from one's own work.
Nice point on patents.
There is a silly story being told that open access does not result in winners and losers. It certainly does. It is just done in a different way. But make no mistake, the winners in the open environment are also protected by a thicket. It may not be an IP thicket, but it is a thicket. My sense is that the story being told, that the competitive IP-based environment is less efficient than the non-IP environment, needs a little more intellectual honesty.
As in, just ask Red Hat about the other OSS: open source software.
All,
We have started a repository for cell image data, thus creating the place. The Cell: An Image Library. http://www.cellimagelibrary.org. We can deal with 106 different file formats using Bio-Formats.
We are now working on ways to have the community assist in the annotation process.
In addition to raw data sets we are also a place to put the image data not published in the journal article.
brilliant