The Role of Research Funding and Policy Community in Data Citation — Rewards, Incentives, and Infrastructure

August 25, 2016

Infrastructure and practices for data citation have made substantial progress over the last decade. This progress increases the potential rewards for data publication and reproducible science; however, overall incentives remain relatively weak for many researchers.

This blog post summarizes a presentation given at the National Academies of Sciences as part of the Data Citation Workshop: Developing Policy and Practice. The slides from the talk are embedded below:

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Principles

Academic researchers as a class are drawn to research and scholarship through an interest in puzzle-solving, but they are also substantially incented by recognition and money. Typically, these incentives are shaped and channeled through the processes and institutions of tenure and review; publication; grants, awards, and prizes; industry consulting; and professional collaboration and mentoring. [1]

Citations have been described as academic “currency”, and while this is not literally true, they are a particularly visible form of recognition in the academy, and increasingly tied to monetary incentives as well. [2] Thus rules, norms, and institutions that affect citation practices have a substantial potential to change incentives.

When effort is invisible it is apt to be undervalued. Data has been the “dark matter” of the scholarly ecosystem — data citation aims to make the role of data visible. While the citation of data is not entirely novel, there has been a concerted effort across researchers, funders, and publishers over approximately the last decade to reengineer data citation standards and tools to create more rational incentives for reusable and reproducible research. [3] In more formal terms, the proximate aim of the current data citation movement is to make transparent the linkages among research claims, the evidence base on which those claims rest, and the contributors who are responsible for that evidence base. The longer term aim is to shift the equilibrium of incentives so that building the common scientific evidence base is rewarded in proportion to its benefit to the overall scientific community.

Progress

There has been notable progress in standards, policies, and tools for data citation since the ‘bad old days’ of 2007, which Gary King and I grimly characterized at the time [4]:

How much slower would scientific progress be if the near universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original. How many discoveries would never have been made if the titles of books and articles in libraries changed unpredictably, with no link back to the old title; if printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors’ works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed? …

Unfortunately, no such universal standards exist for citing quantitative data, and so all the problems listed above exist now. Practices vary from field to field, archive to archive, and often from article to article. The data cited may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. Data listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. Sometimes URLs are given, but they often do not persist. In recent years, a major archive renumbered all its acquisitions, rendering all citations to data it held invalid; identical data was distributed in different archives with different identifiers; data sets have been expanded or corrected and the old data, on which prior literature is based, was destroyed or renumbered and so is inaccessible; and modified versions of data are routinely distributed under the same name, without any standard for versioning. Copyeditors have no fixed rules, and often no rules whatsoever. Data are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures even without having to rerun the original experiment, is often difficult or impossible.

A decade ago, while some publishers had data transparency policies, these were routinely honored in the breach. Now, a number of high-profile journals both require that authors cite or include the data on which their publications rest and have mechanisms to enforce this. PLOS is a notable example — its Data Availability policy [5] states not only that data should be shared, but that articles should provide the persistent identifiers of shared data, and that these should resolve to well-known repositories.
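As a concrete illustration of what such a citation involves, the standard proposed in [4] recommends including the author, date, title, a persistent identifier (such as a DOI) that resolves to the data, and a fingerprint (UNF) that verifies the exact version cited. The sketch below assembles these elements mechanically; the dataset, repository, DOI, and UNF values are invented placeholders, not a prescribed format.

```python
# Illustrative sketch: assembling a data citation from the elements recommended
# in Altman & King (2007) [4]. All values below are invented placeholders.

def format_data_citation(authors, year, title, version, repository, doi, unf=None):
    """Return a human-readable data citation string with a resolvable identifier."""
    citation = (f"{authors} ({year}). {title} ({version}) [Data set]. "
                f"{repository}. https://doi.org/{doi}")
    if unf:  # a Universal Numerical Fingerprint, used to verify the exact version cited
        citation += f"; {unf}"
    return citation

print(format_data_citation(
    authors="Doe, J.; Roe, R.",
    year=2016,
    title="Replication Data for: Example Study",
    version="V2",
    repository="Example Dataverse Repository",
    doi="10.1234/EXAMPLE/ABC123",          # hypothetical DOI
    unf="UNF:6:placeholderhashvalue==",    # hypothetical fingerprint
))
```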

A decade ago, the only major funder with an organization-wide data sharing policy was NIH [6] — and this policy had notable limitations: it was limited to large grants, and the resource sharing statements it required were brief, not peer reviewed, and not monitored. Today, as Jerry Sheehan noted in his presentation on Increasing Access to the Results of Federally Funded Scientific Research: Data Management and Citation, almost all federal research funding is now covered by the Holdren memo, which requires policies and data management plans “describing how they will provide for long-term preservation of, and access to, scientific data”. [7] A number of foundation funders have adopted similar policies. Furthermore, as panelist Patricia Knezek noted, data management plans are now part of the peer review process at the National Science Foundation, and datasets may be included in the biosketches that are part of the funding application process.

A decade ago, few journals published replication data, and there were no high-profile journals dedicated to publishing data. Over the last several years the number of data journals has increased, and Nature Research’s launch of Scientific Data has substantially raised the visibility of data publications.

A decade ago, tools for data citation were non-existent, and the infrastructure for general open data sharing outside of specific community collections was essentially limited to ICPSR’s publication-related archive [8] and Harvard’s Virtual Data Center [9] (which later became the Dataverse Network). Today, as panelists throughout the day noted, [10] infrastructure such as CKAN, Figshare, and close to a dozen Dataverse-based archives accept open data from any field; [11] there are rich public directories of archives such as re3data; and large data citation indices such as DataCite and the TR Data Citation Index enable data citations to be discovered and evaluated. [12]
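To give a sense of how such an index can be used programmatically, the sketch below queries the public DataCite REST API for dataset records matching a keyword. It assumes the https://api.datacite.org/dois endpoint and its query parameters behave as publicly documented at the time of writing; treat it as an illustration rather than a supported client.

```python
# Illustrative sketch: discovering dataset records (and their DOIs) via the
# DataCite REST API. Assumes the public endpoint https://api.datacite.org/dois
# accepts a free-text "query" parameter, as documented at the time of writing.
import requests

def search_datacite(keyword, max_records=5):
    """Return (DOI, title) pairs for records matching a keyword."""
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={"query": keyword, "page[size]": max_records},
        timeout=30,
    )
    resp.raise_for_status()
    results = []
    for record in resp.json().get("data", []):
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        results.append((attrs.get("doi"), titles[0].get("title", "")))
    return results

if __name__ == "__main__":
    for doi, title in search_datacite("data citation"):
        print(doi, "-", title)
```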

These new tools are critical for creating data sharing incentives and rewards. They allow data to be shared and discovered for reuse, reuse to be attributed, and that attribution to be incorporated into metrics of scholarly productivity and impact. Moreover, much of this infrastructure exists in large part because it received substantial startup support from the research funding community.

Perforations

While open repositories and data citation indices enable researchers to more effectively get credit for data that is cited directly, and there is limited evidence that sharing research data is associated with higher citation rates, data sharing and citation remain quite limited. [13] As Marcia McNutt noted in her talk on Data Sharing: Some Cultural Perspectives, progress likely depends at least as much on cultural and organizational change as on technical advances.

Formally, the indexing of data citations enables citations to data to contribute to a researcher’s h-index and other measures of scholarly productivity. Speaker Dianne Martin noted in the panel on Reward/Incentive Structures that her institution (George Washington University) has begun to recognize data sharing and citation in the tenure and review process.
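For concreteness, the h-index is simply the largest h such that a researcher has h outputs cited at least h times each; once data citations are indexed, published datasets can enter that count alongside articles. A minimal sketch, with invented citation counts:

```python
# Minimal sketch: the h-index is the largest h such that the researcher has
# h outputs with at least h citations each. Once data citations are indexed,
# cited datasets can contribute to this count alongside articles.

def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented example: five articles and two published datasets.
article_citations = [45, 22, 9, 4, 1]
dataset_citations = [12, 6]
print(h_index(article_citations))                      # 4
print(h_index(article_citations + dataset_citations))  # 5
```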

Despite the substantial progress over the last decade, there is little evidence that the incorporation of data citation and publication into tenure and review is yet either systematic or widespread. Overall, positive incentives for data citation and publication remain relatively weak:

  1. It remains the case that data is often used without being cited.[14]
  2. Even where data is cited, most individual data publications (with notable exceptions, primarily large community databases) are neither published in high-impact venues nor highly cited. Since scientists achieve significant recognition most often through publication in “high-impact” journals, and increasingly through publishing articles that are highly cited, devoting effort to data publishing has a high opportunity cost.
  3. Even when cited, publishing one’s data is often perceived as increasing the likelihood that others will “leapfrog” the original research and publish high-impact articles with priority. Since scientific recognition relies strongly on priority of publication, this risk is a disincentive.
  4. While data citation and publication likely strengthens reproducibility, it also makes it easier for others to criticize published work. In the absence of strong positive rewards for reproducible research, this risk may be a disincentive overall.  


Funders and policy-makers have the potential to do more to strengthen positive incentives. Funders should support mechanisms to assess and assign “transitive” credit, which would provide some share of the credit for a publication to the other data and publications on which it relies. [15] And funders and policy makers should support strong positive incentives for reproducible research — such as funding and explicit recognition. [16]
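The “transitive credit” idea can be illustrated with a toy calculation: each research product declares what fraction of its credit should flow to the products it builds on, and credit received by a publication propagates down the dependency graph. The products and weights below are invented for illustration and are not drawn from the report cited in [15].

```python
# Toy sketch of "transitive credit" [15]: each research product declares what
# fraction of its credit it passes on to the products it builds on, and credit
# flows recursively down the dependency graph. Weights below are invented.

# product -> list of (dependency, share of this product's credit passed to it)
dependencies = {
    "articleA": [("datasetX", 0.20), ("softwareY", 0.10)],
    "softwareY": [("datasetX", 0.05)],
    "datasetX": [],
}

def transitive_credit(product, amount, ledger=None):
    """Distribute `amount` units of credit for `product` across its dependencies."""
    if ledger is None:
        ledger = {}
    ledger[product] = ledger.get(product, 0.0) + amount
    for dep, share in dependencies.get(product, []):
        transitive_credit(dep, amount * share, ledger)
    return ledger

# One unit of credit (e.g., one citation) for articleA also credits its inputs.
for product, credit in sorted(transitive_credit("articleA", 1.0).items()):
    print(f"{product}: {credit:.3f}")
```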

Thus far, much of the effort by funders, who are key stakeholders, has focused on compliance. And in general, compliance has substantial limits as a design principle:

  • Compliance generates incentives to follow a stated rule — but not generally to go beyond it to promote the values that motivated the rule.
  • Actors still need resources to comply, and as Chandler and other speakers on the panel on Supporting Organizations Facing The Challenges of Data Citation noted, compliance with data sharing is often viewed as an unfunded mandate.
  • Compliance-based incentives are prone to failure where the standards for compliance are ambiguous or conflicting.
  • Further, actors have incentives to comply with rules only when they have an expectation that behavior can be monitored, that the rule-maker will monitor behavior, and that violations of the rules will be penalized.
  • Moreover, external incentives, such as compliance, can displace existing internal motivations and social norms [17] — yielding a reduction in the desired behavior. Thus compliance alone should not be expected to promote the values the rule supports.

Journals have increased the monitoring of data transparency and sharing — primarily through policies like PLOS’s that require the author to supply, before publication, a replication data set and/or an explicit data citation with persistent identifiers that resolve to data in a well-known repository. This appears to be substantially increasing compliance with journal policies that had been on the books for over a decade.
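This kind of monitoring can be partially automated, since a persistent identifier is useful only if it actually resolves. The sketch below checks whether a DOI dereferences via the doi.org resolver; it assumes only standard HTTP redirect behavior and is illustrative, not any journal’s actual compliance tooling.

```python
# Illustrative sketch: checking that a cited persistent identifier actually
# resolves. Relies only on standard HTTP redirects from the doi.org resolver;
# this is not any journal's actual compliance tooling.
import requests

def doi_resolves(doi, timeout=15):
    """Return True if https://doi.org/<doi> dereferences to a landing page."""
    try:
        resp = requests.head(f"https://doi.org/{doi}",
                             allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False

# Hypothetical example DOI string; replace with identifiers from a manuscript.
print(doi_resolves("10.7910/DVN/EXAMPLE"))
```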

However, neither universities nor funders are routinely auditing or monitoring compliance with data management plans. As panelist Patricia Knezek emphasized, there are many questions about how funders will monitor compliance, how to incent compliance after the award is complete, and how responsibility for compliance is divided between the funded institution and the funded investigator. Further, as noted in the panelists’ discussion with the workshop audience: data management plans for funded research are not made available to the public along with award abstracts, which creates a barrier to community-based monitoring and norms; scientists in the federal government are not currently subject to the same data sharing and management requirements as scientists in academia; and there is a need to support ‘convening’ organizations such as FORCE11 and the Research Data Alliance to bring multiple stakeholders to the table to align strategies on incentives and compliance.

Finally, as Cliff Lynch noted in the final discussion session of the workshop, compliance with data sharing requirements often comes into conflict with confidentiality requirements for the protection of data obtained from individuals and businesses, especially in the social, behavioral, and health sciences.  This is not a fundamental conflict  — it is possible to enable access to data without any intellectual property restrictions while still maintaining privacy. [18] However, absent common policies and legal instruments for intellectually-open but personally-confidential data, confidentiality requirements are a barrier (or sometimes an excuse) to open data.

References

[1] See for a review: Stephan PE. How economics shapes science. Cambridge, MA: Harvard University Press; 2012 Jan 15.

[2] Cronin B. The citation process: the role and significance of citations in scientific communication. London: Taylor Graham; 1984.

[3] Altman M, Crosas M. The evolution of data citation: from principles to implementation. IAssist Quarterly. 2013 Mar 1;37(1-4):62-70.

[4] Altman, Micah, and Gary King. “A proposed standard for the scholarly citation of quantitative data.” D-lib Magazine 13.3/4 (2007).

[5] See: http://journals.plos.org/plosone/s/data-availability

[6] See Final NIH Statement on Sharing Research Data, 2003, NOT-OD-03-032. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html

[7]  Holdren, J.P. 2013, “Increasing Access to the Results of Federally Funded Scientific Research “, OSTP. Available from: https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

[8]  King, Gary. “Replication, replication.” PS: Political Science & Politics 28.03 (1995): 444-452.

[9]  Altman M. Open source software for Libraries: from Greenstone to the Virtual Data Center and beyond. IASSIST Quarterly. 2002;25.

[10]  See particularly the presentation and discussion on Tools and Connections; Supporting Organizations Facing The Challenges of Data Citation, and Reward/Incentive Structures.  

[11] See http://dataverse.org/ , http://ckan.org/ , https://figshare.com/

[12] See http://www.re3data.org/ , https://www.datacite.org/ , http://wokinfo.com/products_tools/multidisciplinary/dci/

[13]  Borgman CL. The conundrum of sharing research data. Journal of the American Society for Information Science and Technology. 2012 Jun 1;63(6):1059-78.

[14]  Read, Kevin B., Jerry R. Sheehan, Michael F. Huerta, Lou S. Knecht, James G. Mork, and Betsy L. Humphreys. “Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study.” PloS one10, no. 7 (2015): e0132735.

[15] See Katz, D.S., Choi, S.C.T., Wilkins-Diehr, N., Hong, N.C., Venters, C.C., Howison, J., Seinstra, F., Jones, M., Cranston, K., Clune, T.L. and de Val-Borro, M., 2015. Report on the second workshop on sustainable software for science: Practice and experiences (WSSSPE2). arXiv preprint arXiv:1507.01715.

[16] See Nosek BA, Spies JR, Motyl M. Scientific utopia II: restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science. 2012;7(6):615-31; and Brandon A, List JA. Markets for replication. Proceedings of the National Academy of Sciences. 2015;112(50):15267-15268.

[17] Gneezy U, Rustichini A. A fine is a price. Journal of Legal Studies. 2000;29(1):1-17.


“It’s Tough to Make Predictions, Especially About the Future”* (of the Internet)

July 21, 2016

Elon University’s Imagining the Internet Center aims to provide insights into emerging network innovations, global development, dynamics, diffusion, and governance. For over a decade, it has collaborated with the Pew Research Center to conduct regular expert surveys to support predictions in these areas.

Experts responding to the last survey, conducted in 2014, identified over twenty themes for the next decade, including:

  • “The spread of the Internet will enhance global connectivity that fosters more planetary relationships and less ignorance.”
  • “The Internet of Things, artificial intelligence, and big data will make people more aware of their world and their own behavior.”
  • “The spread of the ‘Ubernet’ will diminish the meaning of borders, and new ‘nations’ of those with shared interests may emerge and exist beyond the capacity of current nation-states to control. “
  • “Dangerous divides between haves and have-nots may expand, resulting in resentment and possible violence.”
  • “Abuses and abusers will ‘evolve and scale.’”
  • “Most people are not yet noticing the profound changes today’s communications networks are already bringing about; these networks will be even more disruptive in the future.”


The next wave of the survey is underway, and I was asked to contribute predictions for change in the next decade, as an expert respondent.  I look forward to seeing the cumulative results of the survey, which should emerge next year. In the interim, I share my formative responses to questions about the next decade:

… the next decade of public discourse… will public discourse online become more or less shaped by bad actors?

The design of current social media systems is heavily influenced by a funding model based on advertising revenue. As a consequence, these systems emphasize “viral” communication that allows a single communicator to reach a large, interested audience; they devalue privacy; and they are not designed to enable large-scale collaboration and discourse.

While the advertising model remains firmly in place, there has been increasing public attention to privacy and to the potential for manipulating attitudes enabled by algorithmic curation. Yet I am optimistic that in the next decade social media systems will give participants more authentic control over sharing their information and will begin to facilitate deliberation at scale.

… the next decade of online education, and credentialing… which skills will be most difficult to teach at scale?

Over the last fifteen years we have seen increasing success in making open course content available, followed by success teaching classes online at scale (e.g., Coursera, edX). The next part of this progression will be online credentialing — last year, Starbucks’s partnership with ASU to provide large numbers of its employees with the opportunity to earn a full degree online is indicative of this shift.

Progress in on-line credentialing will be slower than progress in online delivery, because of the need to comply with or modify regulation, establish reputation, and overcome entrenched institutional interests in residential education. Notwithstanding, I am optimistic we will see substantial progress in the next decade — including more rigorous and widely accepted competency-based credentialing.

Given the increasing rate of technical change and the regular disruptions this creates in established industries, the most important skills for workforces in developed countries are those that support adaptability, enable workers to engage with new technologies (especially information and communication technologies), and allow them to collaborate effectively in different organizational structures.

While specific technical skills are well suited to a self-directed learning experience, some skills that are particularly important for long-term success — metacognition, collaboration, and “soft” (emotional and social intelligence) skills — require individualized guidance, (currently) a human instructor in the loop, and the opportunity to interact richly with other learners.


… the next decade of algorithms —  will the net overall effect be positive or negative?

Algorithms are, essentially, mathematical tools designed to solve problems. Generally, improvements in problem-solving tools — especially in the mathematical and computational fields — have yielded huge benefits in science, technology, and health, and will most likely continue to do so.

The key policy question is really how we will choose to hold government and corporate actors responsible for the choices that they delegate to algorithms. There is increasing understanding that each choice of algorithm embodies a specific set of choices over what criteria are important to “solving” a problem, and what can be ignored. Incenting better choices of algorithms will likely require more transparency from the actors using them, explicit design of algorithms with privacy and fairness in mind, and holding actors who use algorithms meaningfully responsible for their consequences.


… the next decade of trust —  will people disengage with social networks, the internet, and the Internet of Things?

It appears very likely that, because of network effects, people’s general use of these systems will continue to increase — whether or not the systems themselves actually become more trustworthy. The value of online markets (and similar platforms) is often a growing function of their size, which creates a form of natural monopoly, making these systems increasingly valuable, ubiquitous, and unavoidable.

The trustworthiness of these systems remains in doubt. It could be greatly improved by providing users with more transparency, control, and accountability over such systems. Technologies such as secure multi-party computing, functional encryption, unalterable blockchain ledgers, and differential privacy have great potential to strengthen systems — but so far the incentives to deploy them at wide scale are missing.

Structurally, many of the same forces that drive the use of online networks will drive this as well. It appears very likely that, because of network effects, people’s general use of connected devices will continue to increase — whether or not the systems themselves actually become more trustworthy.

The network of IoT is at an earlier stage than that of social networks — there is less immediate value returned, and not yet a dominant “network” of these devices. It may take some time for a valuable network to emerge, so the incentives to use IoT so far seem small for the end consumer, while the security issues loom large, given the current lack of attention to systematic security engineering in the design and implementation of these systems. (The lack of visibility of security reduces the incentives for such design.) However, it seems likely that within the next decade the value of connected devices will become sufficient to drive people to use them, regardless of the security risks, which may remain serious but are often less immediately visible.

Reflecting on my own answers, I suspect I am reacting more as a “hedgehog” than as a “fox” — and thus am quite likely to be wrong (on this, see Philip Tetlock’s excellent book Expert Political Judgment). I will, in my defense, recall the phrase that physicist Dennis Gabor once wrote: “we cannot predict the future, but we can invent it” — which is very much in the spirit of MIT. And as we argue in Information Wants Someone Else to Pay for It, the future of scholarly communications and the information commons will be a happier one if libraries take their part in inventing it.

* This quote is most often attributed to Yogi Berra, but he denied it, at least in e-mail correspondence with me in 1997. It has also been attributed (with disputation) to Woody Allen, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, and Kerr L. White.


Can Computers be Feminist? — Comments on a Brown Bag talk by Gillian Smith

June 21, 2016


Gillian Smith, who is an Assistant Professor in Art+Design and Computer Science at Northeastern University, gave this talk, entitled Can Computers be Feminist? Procedural Politics and Computational Creativity, as part of the Program on Information Science Brown Bag Series.

In the talk, illustrated through the slides below, Gillian presented a perspective on computing viewed as a co-creation of developer, algorithm, data, and user. The talk argued that developers embed, through their selection of algorithms and data, specific ethical and epistemic commitments within the software they produce.

In her abstract, Gillian summarizes as follows:

Computers are increasingly taking on the role of a creator—making content for games, participating on twitter, generating paintings and sculptures. These computationally creative systems embody formal models of both the product they are creating and the process they follow. Like that of their human counterparts, the work of algorithmic artists is open to criticism and interpretation, but such analysis requires a framework for discussing the politics embedded in procedural systems. In this talk, I will examine the politics that are (typically implicitly) represented in computational models for creativity, and discuss the possibility for incorporating feminist perspectives into their underlying algorithmic design.

There were a wide range of provocative games, tools, and concepts referenced in the talk that were particularly intriguing, including:

  • A strategy game called ThreadSteading, developed by Smith and collaborators at Disney Research Pittsburgh, that is played with cloth tiles and which, at the end of the game, is mechanically sewn into a quilt — transforming strategy (gameplay) into information (the trace of the play, as reflected in the final state of the board) and then into art.
  • A work of computational art, developed by Smith … mapping from one space to another — from color to emotion to shape.
  • Alice & Kev — a model of a homeless, abusive family created within The Sims, which highlights both what behaviors can emerge from the model of “life” embedded in The Sims, and what that model clearly elides.
  • Instances of creative software such as TinyGallery and DeepForger, and software frameworks such as Tracery, which facilitate generating creative content that blends algorithmic and human choices.

The full range of topics covered is impossible to capture in a concise summary — I recommend that readers follow the links and references in the slides.

The talk raised a number of themes: how software must be understood as a complex interaction among information (data), structure (software), and behavior (use); how games and software embed epistemic and ethical models of the player/user, context, and society; and how authorship and labor entwine complex relationships among authors and pay, blame, credit, and work.

Dr Smith’s talk also raised a number of provocative questions (which I paraphrase): How can software support richer identity models incorporating a broad spectrum of gender and sexuality? How can diversity in software authorship be achieved? How do we surface and evaluate the biases implicit in software? And what mechanisms beyond content filtering can we use to mitigate these biases? How do we assign responsibility for software, algorithms, and the resulting outputs? How can we integrate empathy into software, algorithms, and data systems?

Smith’s talk claims that one of the societal goals that art serves is to transmit core values through creating emotion. And I have heard it said, and believe, that part of the power of art is its ability to engage us in, and communicate to us, the true emotional complexity of life.

Although art and information science have different goals, both act as a mirror and a lens to the ethical values and epistemic commitments of the cultures and institutions within which they are embedded. Broadly conceived, Smith’s provocations apply also to the development of library software, collections, and services. The research of the Program on Information Science has connected with these questions at a number of points, and we aspire to engage more generally in the future by furthering our field’s understanding of how library systems can reflect and support diverse perspectives; can shed light on the biases embedded in information systems, services, and collections; and can incorporate within them understandings of emotion, embodiment, and identity.



Visual Representations in Science — Comments on a Brown Bag Talk by Felice Frankel

May 21, 2016

Science photographer Felice Frankel, who is a research scientist in the Center for Materials Science and Engineering at the Massachusetts Institute of Technology, gave this talk on The Visual Component: More Than Pretty Pictures, as part of the Program on Information Science Brown Bag Series.

In the talk, illustrated through the slides below, Felice made the argument that images and figures are first-class intellectual objects — and should be considered just as important as words in publication, learning, and thinking.

In her abstract, Felice summarizes as follows:

Visual representation of all kinds are becoming more important in our ever growing image-based society, especially in science and technology.  Yet there has been little emphasis on developing standards in creating or critiquing those representations.  We must begin to consider images as more than tangential components of information and find ways to seamlessly search for accurate and honest depictions of complex scientific phenomena.  I will discuss a few ideas to that end and show my own process of making visual representations in sciences and engineering.  I will also make the case that representations are just as “intellectual” as text.

The talk presented many visual representations from a huge variety of scientific domains and projects. Across these projects, the talk returned to a number of key themes.

  • When you develop a visual representation it is vital to identify the overall purpose of the graphic (for example, whether it is explanatory or exploratory); the key ideas that the representation should communicate about the science; and the context in which the representation will be viewed.
  • There are a number of components of visual design that are universal across subject domains, including composition, abstraction, coloring, and layering. Small, incremental refinements can dramatically improve the quality of a representation.
  • The process of developing visual representations engages both students and researchers in critical thinking about science; and this process can be used as a mechanism for research collaboration.
  • Representations are not the science, they are communications of the science; and all representations involve design and manipulation. Maintaining scientific integrity requires transparency about what is included, what is excluded, and what manipulations were used in preparing the representation.


In my observation, information visualization is becoming increasingly popular, and tools for creating visualizations are increasingly accessible to a broad set of contributors. Universities would benefit from supporting students and faculty in visual design for research, publication, and teaching; and in supporting the discovery and curation of collections of representations.

Library engagement in this area is nascent, and there are many possible routes for engagement. Library support for scientific representations is often limited — especially compared to the support for PDF documents or bibliographic citations. I speculate that there are at least five productive avenues for involvement.

  1. Libraries could provide support for researchers in curating personal collections of representations; in sharing them for collaboration; and in publishing them as part of research and educational content. Further, researchers have increasing opportunities to cycle between physical and virtual representations of information, so support for curating information representations can dovetail with library support for making and makerspaces.
  2. Library information systems seldom incorporate information visualization effectively in support of resource discovery and navigation. New information and visualization technologies and methods offer increased opportunities to make libraries more accessible and more engaging.
  3. Image-based searching is another area that demonstrates that search is not a solved problem. Image-based search provides a powerful means of discovering content that is almost completely absent from current library information systems.
  4. Visual design and communication skills are seldom explicitly documented or transmitted in the academy. Libraries have a vital role to play in making accessible the body of hidden (“tacit”) knowledge and skills that are critical for success in developing careers.
  5. Libraries have a role in helping researchers to engage in evolving systems of credit and attribution. For example, the CRediT taxonomy (which we helped to develop, and which is being adopted by scholarly journals such as Cell and PLOS) provides a way to formally record attribution for those who contribute scientific visualizations; a minimal sketch of such a record appears below.
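A minimal sketch of what such a machine-readable attribution record might look like, using CRediT role names; the contributor names and ORCID iDs below are invented:

```python
# Minimal sketch of a machine-readable contributor record using CRediT role
# names. The contributors, ORCID iDs, and roles assigned below are invented.
contributors = [
    {
        "name": "Jane Doe",
        "orcid": "0000-0000-0000-0001",   # hypothetical ORCID iD
        "roles": ["Visualization", "Data curation"],
    },
    {
        "name": "Richard Roe",
        "orcid": "0000-0000-0000-0002",   # hypothetical ORCID iD
        "roles": ["Conceptualization", "Writing - original draft"],
    },
]

# List everyone formally credited for the scientific visualizations in a work.
visualizers = [c["name"] for c in contributors if "Visualization" in c["roles"]]
print(visualizers)
```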



Guest Post: Diana Hellyar on Library Use of New Visualization Technologies

April 26, 2016

Diana Hellyar, who is a Graduate Research Intern in the program, reflects on her investigations into augmented reality, virtual reality, and related technologies.

Libraries Can Use New Visualization Technology to Engage Readers

My research as a Research Intern for the MIT Libraries Program on Information Science is focused on the applications of emerging virtual reality and visualization technology to library information discovery. Virtual reality and other visualization technologies form a rapidly changing field. Staying on top of these technologies and applying them in libraries can be difficult, since there is little research on the topic. While I was researching the uses of virtual reality in libraries, I came across an example of how some libraries were able to incorporate augmented reality into their children’s departments. Out of a dozen examples, this one caught my attention for many reasons. It is not just a prototype; it was being used in multiple libraries. It was also easily adopted by non-technical librarians and was easy enough to be used by children.

The Mythical Maze app has been downloaded more than 10,000 times to date. Across the United Kingdom, children participated in the Reading Agency’s 2014 Summer Reading Challenge, Mythical Maze, by downloading the Mythical Maze app on their mobile devices. Liz McGettigan discusses the app in an article published on the Chartered Institute of Library and Information Professionals website, explaining how it uses augmented reality to make posters and legend cards around the library come to life. The article links to The Reading Agency’s promotional video, which discusses how mythical creatures are hidden around the library and how children can look for them with the app. If they find the creatures, they can use the app to unlock mini-games. The app also allows children to scan stickers they receive from reading books, which unlocks rewards and allows children to learn more about the mythical creatures.

Using apps and integrating augmented reality is a fun way to run a summer reading challenge. The Reading Agency reported that 2014 was a record-breaking year for their program. They state that participation increased by 3.6% and that 81,908 children joined the library to participate in the program, up 22.7% from the previous year. These statistics show that children are responding positively to augmented reality in their libraries.

I think that the best part about this app is that it allows the children’s room to come alive. Children can interact with the library in a way they never have before. Encouraging children to use their devices in the library in a fun and educational way is groundbreaking. They may never have been allowed to play with and learn from their devices at the library before.

The article about the summer reading challenge also discussed the idea of “transliteracy”. The author, Liz McGettigan, says that transliteracy is defined as the “ability to read, write and interact across a range of platforms and tools”. It’s important to encourage children to learn how to use their devices to find the information they are looking for. Encouraging children to use their devices for the summer reading challenge helps them to learn how to do this.

What can libraries do with this? I think that libraries can learn from this example and not just for a summer reading program. The librarians can create scavenger hunts for kids that are either for fun or to help them learn about the library and its services. Children can collect prizes for the things they find in the library using the app. Librarians can even use it to have kids react to and rate the books they read. An app can be designed so that if a child hovers their device over a book they can see other children’s ratings and comments about the book. They can do any of these things and more to create new excitement for their library.

One way for this to work would be if publishers teamed up with libraries to create content for similar apps. Then, there would be many more possibilities for interactive content without worrying about copyright issues. Libraries could create a small section of books that would be able to interact with the app. Then, with the device hovered over a book, the story comes to life and is read to them.

There are so many possibilities for teaching, learning, and reading  while using augmented reality in children’s departments of libraries. The Mythical Maze summer reading program is hopefully only the beginning in terms of using this technology to engage children. With the success of the summer reading challenge, I hope other libraries will consider including it in their programming. Using this technology will only enhance learning and will create fun new ways to get children excited about reading.

This example illustrates the possibility of using augmented reality and related visualization technologies to engage readers. Many types of libraries can implement this technology and allow their users to interact with physical materials in a way they never have before.



Guest Post: Lucy Taylor on LibrePlanet 2016, Software Curation, and Preservation

April 21, 2016

Lucy Taylor, who is a Graduate Research Intern in the program, reflects on software curation at the recent LibrePlanet conference:

LibrePlanet 2016, Software Curation and Preservation

This year’s LibrePlanet conference, organized by the Free Software Foundation, touched on a number of themes that relate to research on software curation and preservation taking place at MIT’s Program on Information Science.

The two-day conference, hosted at MIT, aimed to “examine how free software creates the opportunity of a new path for its users, allows developers to fight the restrictions of a system dominated by proprietary software by creating free replacements, and is the foundation of a philosophy of freedom, sharing, and change.” In a similar way, at the MIT Program on Information Science, we are investigating the ways in which sustainable software might positively impact academic communities and shape future scholarly research practices. This was a great opportunity to compare and contrast the concerns and goals of the Free Software movement with those of people who use software in research.

A number of recurring themes emerged over the course of the weekend that could inform research on software curation. The event kicked off with a conversation between Edward Snowden and Daniel Kahn Gillmor. They tackled privacy and security, and spoke at length about how current digital infrastructures limit our freedoms. Interestingly, they also touched on how to expand the Free Software community and raise awareness with non-technical folks about the need to create, and use, Free Software. A lack of incentives for “newbies” inhibits the growth of the Free Software movement; Free Software needs to compete with proprietary software’s low barriers to entry and polished user experience. Similarly, the growth of sustainable, reusable academic software through better documentation, storage, and visibility is inhibited by a lack of incentives for researchers and libraries to improve software development practices and create curation services.

The talks “Copyleft for the next decade: a comprehensive plan” by Bradley Kuhn and “Will there be a next great Copyright Act?” by Peter Higgins both examined the ways in which licensing and copyright are affecting the Free Software movement. The future seems somewhat bleak for GPL licensing and copyleft, with developers being discouraged from using this license and instead putting their work under more permissive licenses, which then allow companies to use and profit from others’ software. In comparison, research gateways like nanoHUB and HUBzero encounter the same difficulties in encouraging researchers to make their software freely available for others to use and modify. As both speakers touched on, the general lack of understanding, and also fear, surrounding copyright needs to be remedied. Sci-Hub was also mentioned as an example of a tool that, whilst breaking copyright law, is also revolutionary in nature, in that no library has ever aggregated more scientific literature on one platform. How can we create technologies that make scholarly communication more open in the future? Will the curation of software contribute to these aims? Within wider discussions on open access, it is also worthwhile to think about how software can often be a research object in its own right that merits the same curation and concern as journal papers and datasets.

The ideas discussed in the session “Getting the academy to support free software and open science” had many parallels to the research being carried out here at the MIT Program on Information Science. The three speakers spoke about Free Software activities within their home institutions and the barriers that are created by the heavy use of proprietary software at universities. Not only does the continued use of this software result in high costs and the perpetuation of the “centralized web” that relies on companies like Google, Microsoft, and Apple, but this also encourages students to think passively about the technologies they use. Instead, how can we encourage students to think of software as something they can build on and modify through the use of Free Software? Can we develop more engaged academic communities who think and use software critically through the development of software curation services and sustainable software practices? This was a really interesting discussion that explored problematic infrastructures in higher education.

Finally, Alison Macrina and Nima Fatemi’s talk on the “Library Freedom Project: the long overdue partnership between libraries and free software” put the library front and centre in the role of engaging the wider community in Free Software and advocating for better privacy and more freedom. The Library Freedom Project not only educates librarians and patrons on internet privacy but has also rolled out Tor browsers in a few public libraries. What can academic libraries do to build on this important work and to increase awareness about online freedom within our communities?

The conference was a great way to gain insight into the wider activities of the software community and to talk with others from a multitude of different disciplines. It was interesting to think about how research on software curation services could be informed by these broader discussions on the future of Free Software. Academic librarians should also think about how they can advocate for Free Software in their institutions to encourage better understanding of privacy and to foster environments in which software is critically evaluated to meet user needs. Can libraries embrace the Free Software movement as they have the Open Access movement?

Categories: Uncategorized

Why search is not a solved (by google) problem, and why Universities Should Care: Ophir Frieder’s Talk

March 18, 2016 Leave a comment

Ophir Frieder, who holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing at Georgetown University and is Professor of Biostatistics, Bioinformatics, and Biomathematics at the Georgetown University Medical Center,  gave this talk on  Searching in Harsh Environments as part of the Program on Information Science Brown Bag Series.

In the talk, illustrated by the slides below, Ophir rebuts the myth that “Google has solved search,” and discusses the challenges of searching for complex objects, through hidden collections, and in harsh environments.

In his abstract, Ophir summarizes as follows:

Many consider “searching” a solved problem, and for digital text processing, this belief is factually based.  The problem is that many “real world” search applications involve “complex documents”, and such applications are far from solved.  Complex documents, or less formally, “real world documents”, comprise of a mixture of images, text, signatures, tables, etc., and are often available only in scanned hardcopy formats.   Some of these documents are corrupted.  Some of these documents, particularly of historical nature, contain multiple languages.  Accurate search systems for such document collections are currently unavailable.

The talk discussed three projects. The first project involved developing methods to search collections of complex digitized documents which varied in format, length, genre, and digitization quality; contained diverse fonts, graphical elements, and handwritten annotations; and were subject to errors due to document deterioration and from the digitization process. A second project involved developing methods to enable searchers who arrive with sparse, fragmentary, error-ridden clues about places and people to successfully find relevant connected information in the Archives Section of the United States Holocaust Memorial Museum. A third project involved monitoring Twitter for public health events without relying on a prespecified hypothesis.

Across these projects, Frieder raised a number of themes:

  • Searching on complex objects is very different from searching the web. Substantial portions of complex objects are invisible to current search. And current search engines do not understand the semantics of relationships within and among objects — making the right answers hard to find.
  • Searching across most online content now depends on proprietary algorithms, indices, and logs.
  • Researchers need to be able to search collections of content that may never be made available publicly online by Google or other companies.

Despite the increasing amount of born-digital material, I speculate that these issues will become more salient to research, and that libraries have a role to play in addressing them.

While much of the “scholarly record” is currently produced in the form of PDFs, which are amenable to the Google search approach, much web-based content is dynamically generated and customized, and scholarly publications are increasingly incorporating dynamic and interactive features. Searching these effectively will require engaging with scientific output as complex objects.

Further, some areas of science, such as the social sciences, increasingly rely on proprietary collections of big data from commercial sources. Much of this growing evidence base is currently accessible only through proprietary APIs. To meet heightened requirements for transparency and reproducibility, stewards are needed for these data who can ensure nondiscriminatory long-term research access.

More generally, it is increasingly well recognized that the evidence base of science includes not only published articles and community datasets (and benchmarks), but may also extend to scientific software, replication data, workflows, and even electronic lab notebooks. The article produced at the end is simply a summary description of one pathway through the evidence reflected in these scientific objects. Validating, reproducing, and building on science may increasingly require access to, search over, and understanding of this entire complex set.
