
Collecting Data: How Much do We Really Need?

This originally appeared on the ACRL TechConnect blog.

Many of us have had conversations in the past few weeks about data collection because of the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, to stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example at the New York Police Department. The program, called the Domain Awareness System, consists of live footage from almost every surveillance camera in New York City playing in one room, with the ability to search for features of individuals and even to detect people acting suspiciously. Add to that a demonstration of cutting-edge facial recognition software under development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.

Librarians focused on technical projects love to collect data and make decisions based on that data. We try to get data collection systems as close to real time as possible, and work hard to make sure we are collecting and analyzing as much data as possible. The idea of a series of cameras tracking exactly what our patrons are doing in the library in real time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (see Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis: inside analytics programs, spreadsheets, and server logs.

What kind of data do you collect?

Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.

Here are a few potential examples:

Website visitors and user experience

  • Google Analytics or some other web analytics tool
  • Heat map tool
  • Server logs
  • Surveys

Electronic resource access reports

  • Electronic resources management application
  • Vendor reports (COUNTER and other)
  • Link resolver click-through report
  • Proxy server logs

The next step may require a little digging. For library-created tools, do you have a privacy policy for this data? Has it gone through the Institutional Review Board? For third-party tools, is there a privacy policy? What are the terms of use or user license? (And how many people have ever read the entire terms of service?) We will return to this exercise in a moment.

How much is enough?

Think about what type of data these tools are collecting about your users. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session in which a patron opened the library website, clicked the Facebook icon on the page, and came back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and it wouldn’t have taken long to figure out whether it was a staff machine and who was working that day and time.) IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses to catch abusive or spamming behavior, and we block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets of the IP address (which is legally required in some places), so you can have less information collected. All similar tools should give you the same ability.
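If a tool does not offer this option, the same idea can be applied to data you store yourself. Here is a minimal sketch in Python of zeroing out the last two octets of an IPv4 address before saving it. The function name `mask_ip` and the `keep_bits` parameter are my own for illustration and are not part of Inspectlet or any other product.

```python
import ipaddress


def mask_ip(address: str, keep_bits: int = 16) -> str:
    """Return the IPv4 address with everything after the first keep_bits bits zeroed.

    keep_bits=16 keeps the first two octets and obscures the last two.
    """
    ip = ipaddress.ip_address(address)
    if ip.version != 4:
        # IPv6 needs its own masking policy; this sketch does not guess at one.
        raise ValueError("this sketch only handles IPv4 addresses")
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{address}/{keep_bits}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(mask_ip("192.0.2.57"))         # 192.0.0.0
    print(mask_ip("203.0.113.200", 24))  # 203.0.113.0
```

Masking two octets still tells you roughly which network a visit came from (on campus or off, for instance) without pointing at a single machine.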

Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?
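A short retention period is easier to keep if a script enforces it. The following is only a sketch, assuming the logs are plain files ending in `.log` in a single directory and that file modification time is a good enough measure of age. The directory layout and the function name `purge_old_logs` are assumptions for illustration, not anything EZProxy provides.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_logs(log_dir: str, max_age_days: float = 1.0,
                   now: Optional[float] = None) -> List[str]:
    """Delete *.log files in log_dir last modified more than max_age_days ago.

    Returns the names of the files that were removed.
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # 86400 seconds in a day
    removed = []
    for path in sorted(Path(log_dir).glob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Run from a daily scheduled job, something like this keeps yesterday's logs available for troubleshooting while making sure nothing older lingers "just in case."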

Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information each tool collects, whether or not you use it. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.
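As one way to obscure an unused field on a schedule, usernames in a log can be replaced with a keyed hash before the file is archived. The sketch below assumes a space-separated log line with the username in a known position. That layout, the sample line, and the function names are hypothetical, so check your own log format before adapting it.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Replace a username with a short keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return "user-" + digest.hexdigest()[:12]


def scrub_line(line: str, username_field: int, secret_key: bytes) -> str:
    """Pseudonymize the username at position username_field in a log line.

    Assumes fields are separated by single spaces; "-" means no username.
    """
    fields = line.split(" ")
    if username_field < len(fields) and fields[username_field] != "-":
        fields[username_field] = pseudonymize(fields[username_field], secret_key)
    return " ".join(fields)


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-off-the-log-server"
    sample = '192.0.2.57 - jdoe [01/Jun/2013:00:00:01 -0500] "GET /login HTTP/1.1" 200 512'
    print(scrub_line(sample, 2, key))
```

This is pseudonymization rather than true anonymization: the same person always maps to the same token, so usage patterns can still be followed and sometimes traced back to an individual. For a data set that really needs to be anonymous, dropping the field entirely is safer.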

A privacy and data management policy should include at least these items:

  • A statement about what data you are collecting and why.
  • Where the data is stored and who has access to it.
  • A retention timeline.

For example, in the past I collected all virtual reference transaction logs for studying the effectiveness of a new set of virtual reference services. I knew I wanted at least a year’s worth of logs, and ideally three years to track changes over time. I was able to save the logs with anonymized IP addresses, and once I had the data I needed I was able to delete the actual transcripts. The privacy policy described the process and where the data would be stored to ensure it was secure. In this case, I used the RUSA Guidelines for Implementing and Maintaining Virtual Reference Services as a guide to creating this policy. Read through the ALA Guidelines to Drafting a Library Privacy Policy for additional specific language and items you should include.

What we can do with data

In all this I don’t at all mean to imply that we shouldn’t be collecting this data. In both the examples I gave above, the data is extremely useful in improving the patron experience even while giving identifying details away. Not collecting data has trade-offs. For years, libraries have not retained a patron’s borrowing record to protect his or her privacy. But now patrons who want to have an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized automated suggestions about what to read next. Anyone who uses Amazon regularly knows that they will try to tempt you into purchases based on your past purchases or books you were reading the preview of–even if you would rather no one know that you were reading that book and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work and are logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or, is there another way to provide the service?

We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010, and has been suspended since September of 2010. In addition to daily tweet limits, this also was a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach–books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk and the information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since this advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.

In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project by Ian Chan at California State University San Marcos that uses student enrollment data to personalize the website based on a student’s field of study (slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. On top of that basic need, it adds a list of recommended resources for students, which they can choose to star as favorites.


In thinking about what type of data you collect, whether on purpose or accidentally, spend some time thinking about what is strictly necessary to accomplish the work that you need to do. If you don’t need a piece of data but can’t avoid collecting it (such as full IP addresses or usernames), make sure you have a privacy policy and retention schedule, and ensure that it is not accessible to more people than absolutely necessary.

Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. When you ask patrons to use a third-party service (such as a social network) or recommend a service (such as a book recommending site), make sure they are aware of what information they are sharing.

We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.

PeerJ: Could it Transform Open Access Publishing?

Open access publication makes access to research free for the end reader, but in many fields it is not free for the author of the article. When I told a friend in a scientific field I was working on this article, he replied “Open access is something you can only do if you have a grant.” PeerJ, a scholarly publishing venture that started up over the summer, aims to change this and make open access publication much easier for everyone involved.

While the first publication isn’t expected until December, in this post I want to examine in greater detail the variation on the “gold” open-access business model that PeerJ states will make it financially viable 1, and the open peer review that will drive it. Both of these models are still very new in the world of scholarly publishing, and require new mindsets for everyone involved. Because PeerJ comes out of funding and leadership from Silicon Valley, it can more easily break from traditional scholarly publishing and experiment with innovative practices. 2

PeerJ Basics

PeerJ is a platform that will host a scholarly journal called PeerJ and a pre-print server (similar to arXiv) that will publish biological and medical scientific research. Its founders are Peter Binfield (formerly of PLoS ONE) and Jason Hoyt (formerly of Mendeley), both of whom are familiar with disruptive models in academic publishing. Although the “J” in the title stands for Journal, Jason Hoyt explains on the PeerJ blog that the journal as such is no longer a necessary model for publication, even if we still hold on to it: “The journal is dead, but it’s nice to hold on to it for a little while.” 3 The project launched in June of this year, and while no major updates have been posted yet on the PeerJ website, they seem to be moving towards their goal of publishing in late 2012.

To submit a paper for consideration in PeerJ, authors must buy a “lifetime membership” starting at $99. (You can submit a paper without paying, but it costs more in the end to publish it). This would allow the author to publish one paper in the journal a year. The lifetime membership is only valid as long as you meet certain participation requirements, which at minimum is reviewing at least one article a year. Reviewing in this case can mean as little as posting a comment to a published article. Without that, the author might have to pay the $99 fee again (though as yet it is of course unclear how strictly PeerJ will enforce this rule). The idea behind this is to “incentivize” community participation, a practice that has met with limited success in other arenas. Each author on a paper, up to 12 authors, must pay the fee before the article can be published. The Scholarly Kitchen blog did some math and determined that for most lab setups, publication fees would come to about $1,124 4, which is equivalent to other similar open access journals. Of course, some of those researchers wouldn’t have to pay the fee again; for others, it might have to be paid again if they are unable to review other articles.
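To make the per-author arithmetic concrete, here is a small sketch in Python. It assumes every author buys the $99 membership and that none is already a member. The function and constant names are mine, and it does not try to reproduce the Scholarly Kitchen estimate, which presumably rests on its own assumptions about lab sizes and membership levels.

```python
BASIC_MEMBERSHIP_USD = 99   # lowest-priced lifetime membership, per the post
MAX_PAYING_AUTHORS = 12     # only the first 12 authors must pay, per the post


def first_paper_cost(num_authors: int) -> int:
    """Total membership cost for a paper whose authors are all new members."""
    if num_authors < 1:
        raise ValueError("a paper needs at least one author")
    paying_authors = min(num_authors, MAX_PAYING_AUTHORS)
    return paying_authors * BASIC_MEMBERSHIP_USD


if __name__ == "__main__":
    for n in (1, 5, 12, 20):
        print(n, "authors:", first_paper_cost(n))
```

Under these assumptions a five-author paper costs $495 and the cost tops out at $1,188 for twelve or more authors.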

Peer Review: Should it be open?

PeerJ, as the name and the lifetime membership model imply, will certainly be peer-reviewed. But, keeping with its innovative practices, it will use open peer review, a relatively new model. Peter Binfield explained in this interview PeerJ’s thinking behind open peer review.

…we believe in open peer review. That means, first, reviewer names are revealed to authors, and second, that the history of the peer review process is made public upon publication. However, we are also aware that this is a new concept. Therefore, we are initially going to encourage, but not require, open peer review. Specifically, we will be adopting a policy similar to The EMBO Journal: reviewers will be permitted to reveal their identities to authors, and authors will be given the choice of placing the peer review and revision history online when they are published. In the case of EMBO, the uptake by authors for this latter aspect has been greater than 90%, so we expect it to be well received. 5

In single blind peer review, the reviewers would know the name of the author(s) of the article, but the author would not know who reviewed the article. The reviewers could write whatever sorts of comments they wanted to without the author being able to communicate with them. For obvious reasons, this lends itself to abuse where reviewers might not accept articles by people they did not know or like, or tend to accept articles from people they did like 6. Even people who are trying to be fair can accidentally fall prey to bias when they know the names of the submitters.

Double blind peer review in theory takes away the ability for reviewers to abuse the system. A link that has been passed around library conference planning circles in the past few weeks is about JSConf EU 2012, which managed to improve its ratio of female presenters by going to a double-blind system. Double blind is the gold standard for peer review for many scholarly journals. Of course, it is not a perfect system either. It can be hard to obscure the identity of a researcher in a small field in which everyone is working on unique topics. It is also a much lengthier process, with more steps involved in the review. For this reason, it is less than ideal for breaking medical or technology research that needs to be made public as soon as possible.

In open peer review, the reviewers and the authors are known to each other. Allowing direct communication between reviewer and researcher speeds up the process of revisions and allows for greater clarity 7. Open peer review doesn’t negatively affect the quality of the reviews or the articles, but it does make it more difficult to find qualified reviewers to participate, and it might make a less well-known researcher more likely to accept the work of a senior colleague or well-known lab 8.

Given the experience of JSConf and a great deal of anecdotal evidence from women in technical fields, it seems likely that open peer review is open to the same potential abuse as single blind peer review. While open peer review might enable a rejected author to challenge unfair rejections, this would require that the rejected author feel empowered enough in that community to speak up. Junior scholars who know they have been rejected by senior colleagues may not want to cause a scene that could affect future employment or publication opportunities. On the other hand, if they can get useful feedback directly from respected senior colleagues, that could make all the difference in crafting a stronger article and going forward with a research agenda. Therein lies the dilemma of open peer review.

Who pays for open access?

A related problem for junior scholars exists in open access funding models, at least in STEM publishing. As open access stands now, there are a few different models that are still being fleshed out. Green open access is free to the author and free to the reader; it is usually funded by grants, institutions, or scholarly societies. Gold open access is free to the end reader but has a publication fee charged to the author(s).

This situation is very confusing for researchers, since when they are confronted with a gold open access journal they will have to be sure the journal is legitimate (Jeffrey Beall has a list of Predatory Open Access journals to aid in this) as well as secure funding for publication. While there are many schemes in place for paying publication fees, there are no well-defined practices in place that illustrate long-term viability. Often the fees are covered by grants for the research, but not always. The UK government recently approved a report that suggests that issuing “block grants” to institutions to pay these fees would ultimately cost less due to reduced library subscription fees. As one article suggests, “block grants” and other funding strategies are not likely to be advantageous to junior scholars or those in more marginal fields 9. A large research grant for millions of dollars, with a relatively small line item for publication fees, for a well-known PI is one thing–what about the junior humanities scholar who has to scramble for a research stipend of a few thousand dollars? If an institution only gets so much money for publication fees, who gets the money?

By offering a $99 lifetime membership for the lowest level of publication, PeerJ offers hope to the junior scholar or graduate student to pursue projects on their own or with a few partners without worrying about how to pay for open access publication. Institutions could more readily afford to pay even $250 a year for highly productive researchers who were not doing peer review than the $1000+ publication fee for several articles a year. As above, some are skeptical that PeerJ can afford to publish at those rates, but if it is possible, that would help make open access more fair and equitable for everyone.


Open access with low costs paid up front could be very advantageous to researchers and institutional bottom lines, but only if the quality of articles, peer reviews, and science is very good. It could provide a social model for publication that will take advantage of the web and the network effect for high quality reviewing and dissemination of information, but only if enough people participate. The network effect that made Wikipedia (for example) so successful relies on a high level of participation and engagement very early on 2. A community has to build around the idea of PeerJ.

In almost the opposite method, but looking to achieve the same effect, this last week the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) announced that after years of negotiations they are set to convert publishing in that field to open access starting in 2014. 10 This means that researchers (and their labs) would not have to do anything special to publish open access and would do so by default in the twelve journals in which most particle physics articles are published. The fees for publication will be paid upfront by libraries and funding agencies.

So is it better to start a whole new platform, or to work within the existing system to create open access? If open (and, through a commenting system, ongoing) peer review makes for a lively and engaging network and low-cost open access makes publication cheaper, then PeerJ could accomplish something extraordinary in scholarly publishing. But until then, it is encouraging that organizations are working from both sides.

  1. Brantley, Peter. “Scholarly Publishing 2012: Meet PeerJ.”, June 12, 2012.
  2. Davis, Phil. “PeerJ: Silicon Valley Culture Enters Academic Publishing.” The Scholarly Kitchen, June 14, 2012.
  3. Hoyt, Jason. “What Does the ‘J’ in ‘PeerJ’ Stand For?” PeerJ Blog, August 22, 2012.
  5. Brantley
  6. Wennerås, Christine, and Agnes Wold. “Nepotism and sexism in peer-review.” Nature 387, no. 6631 (May 22, 1997): 341–3.
  7. For an ingenious way of demonstrating this, see Leek, Jeffrey T., Margaret A. Taub, and Fernando J. Pineda. “Cooperation Between Referees and Authors Increases Peer Review Accuracy.” PLoS ONE 6, no. 11 (November 9, 2011): e26895.
  8. Mainguy, Gaell, Mohammad R Motamedi, and Daniel Mietchen. “Peer Review—The Newcomers’ Perspective.” PLoS Biology 3, no. 9 (September 2005).
  9. Crotty, David. “Are University Block Grants the Right Way to Fund Open Access Mandates?” The Scholarly Kitchen, September 13, 2012.
  10. Van Noorden, Richard. “Open-access Deal for Particle Physics.” Nature 489, no. 7417 (September 24, 2012): 486–486.