Why Search Is Not a Solved (by Google) Problem, and Why Universities Should Care: Ophir Frieder’s Talk
Ophir Frieder, who holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing at Georgetown University and is Professor of Biostatistics, Bioinformatics, and Biomathematics at the Georgetown University Medical Center, gave this talk on Searching in Harsh Environments as part of the Program on Information Science Brown Bag Series.
In the talk, illustrated by the slides below, Ophir rebuts the myth that “Google has solved search” and discusses the challenges of searching for complex objects, through hidden collections, and in harsh environments.
In his abstract, Ophir summarizes as follows:
Many consider “searching” a solved problem, and for digital text processing, this belief is factually based. The problem is that many “real world” search applications involve “complex documents”, and such applications are far from solved. Complex documents, or less formally, “real world documents”, comprise a mixture of images, text, signatures, tables, etc., and are often available only in scanned hardcopy formats. Some of these documents are corrupted. Some of these documents, particularly those of a historical nature, contain multiple languages. Accurate search systems for such document collections are currently unavailable.
The talk discussed three projects. The first project involved developing methods to search collections of complex digitized documents which varied in format, length, genre, and digitization quality; contained diverse fonts, graphical elements, and handwritten annotations; and were subject to errors due to document deterioration and from the digitization process. A second project involved developing methods to enable searchers who arrive with sparse, fragmentary, error-ridden clues about places and people to successfully find relevant connected information in the Archives Section of the United States Holocaust Memorial Museum. A third project involved monitoring Twitter for public health events without relying on a prespecified hypothesis.
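To see why sparse, error-ridden clues defeat exact-match search, consider approximate string matching. The sketch below is purely illustrative (the names and the matching approach are my assumptions, not the Museum's actual system); it uses Python's standard-library difflib to show how a misspelled query can still retrieve a relevant index entry.

```python
import difflib

# A tiny index of place names as they might appear in archival records.
# These entries are illustrative only, not drawn from any actual archive.
index = ["Oswiecim", "Brzezinka", "Theresienstadt", "Westerbork"]

def fuzzy_lookup(query, names, cutoff=0.6):
    """Return index entries that approximately match an error-ridden query,
    ranked by similarity, using difflib's sequence-matching ratio."""
    return difflib.get_close_matches(query, names, n=3, cutoff=cutoff)

# A searcher's misspelled, fragmentary clue still finds the right entry,
# where an exact-match search would return nothing.
print(fuzzy_lookup("Osviecim", index))  # → ['Oswiecim']
```

Production archival search involves far more (transliteration variants, multilingual records, OCR noise), but the principle is the same: tolerate error in the query rather than demand an exact term.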
Across these projects, Frieder raised a number of themes:
- Searching on complex objects is very different from searching the web. Substantial portions of complex objects are invisible to current search. And current search engines do not understand the semantics of relationships within and among objects, making the right answers hard to find.
- Searching across most online content now depends on proprietary algorithms, indices, and logs.
- Researchers need to be able to search collections of content that may never be made available publicly online by Google or other companies.
Despite the increasing amount of born-digital material, I speculate that these issues will become more salient to research, and that libraries have a role to play in addressing them.
While much of the “scholarly record” is currently being produced in the form of PDFs, which are amenable to the Google searching approach, much web-based content is dynamically generated and customized, and scholarly publications are increasingly incorporating dynamic and interactive features. Searching these effectively will require engaging with scientific output as complex objects.
Further, some areas of science, such as the social sciences, increasingly rely on proprietary collections of big data from commercial sources. Much of this growing evidence base is currently accessible only through proprietary APIs. To meet heightened requirements for transparency and reproducibility, stewards are needed for these data who can ensure nondiscriminatory long-term research access.
More generally, it is increasingly well recognized that the evidence base of science includes not only published articles and community datasets (and benchmarks), but may also extend to scientific software, replication data, workflows, and even electronic lab notebooks. The article produced at the end is simply a summary description of one pathway through the evidence reflected in these scientific objects. Validating, reproducing, and building on science may increasingly require access to, search over, and understanding of this entire complex set.
Jobs, Roles, Skills, Tools: Working in the Digital Academy: Julia Flanders’s Talk

Julia Flanders, who is the Director of the Digital Scholarship Group in the Northeastern University Library and a Professor of Practice in Northeastern’s English Department, gave a talk on Jobs, Roles, Skills, Tools: Working in the Digital Academy as part of the Program on Information Science Brown Bag Series.
In the talk, illustrated by the slides below, Julia discusses the evolving landscape of digital humanities (and digital scholarship more broadly) and considers the relationship between technology, tool development, and professional roles.
In her abstract, Julia summarizes as follows:
Twenty-five years ago, jobs in humanities computing were largely interstitial: located in fortuitous, anomalous corners and annexes where specific people with idiosyncratic skill profiles happened to find a niche. One couldn’t train for such jobs, let alone locate them in a market. The emergence of the field of “digital humanities” since that time may appear to be a disciplinary and methodological phenomenon, but it also has to do with labor: with establishing a new set of jobs for which people can be trained and hired, and which define the contours of the work we define as “scholarship.”
In the research described in her talk, Julia identifies seven different roles involved in digital humanities scholarship: developer, administrator, manager, scholar, analyst, data creator, and information manager. She then describes the various skills and metaknowledge required for each and how these roles interact.
(I will note here that the libraries and press have conducted complementary research and engaged in standardization around describing contributorship roles. For more information on this see the Project CREDIT site.)
The talk notes the tensions that develop when these roles are out of balance in a project, and particularly the need for balance among the scholar, developer, and analyst roles. She notes that the combination of scholar, developer, and analyst in a single person is very productive but rare. More typically, early career researchers start as data creators/coders, learn a particular tool set, and evolve into scholars. In the absence of a strong analyst role this creates “a peculiar relationship with tools: a kind of distance (on the scholar’s part) and on the other hand an intensive proximity (on the coder’s part) that may not yet have critical distance or meta-knowledge: the awareness needed to use the tools in a fully knowing way.”
Having observed commercial and research software development projects over thirty years, I find that one of the most common causes of catastrophic failure is the gap between the developer’s understanding of the problem being solved and the customer’s understanding of the same problem. A good analyst (often holding a “product manager” title in the corporate world) has the skills to understand both the business and technical domains well enough to probe for these misunderstandings and ensure that discussion converges on a common understanding. In addition, the analyst helps abstract both the technical and domain problems so that the eventual software solution not only meets the needs of the small number of customers in the loop, but is broad enough for a target community. Moreover, librarians often have knowledge of components of both the technical domain and the subject domain, which can give libraries a particular competitive advantage in developing people for these critical bridge roles.