
Review of Using OpenRefine by Ruben Verborgh and Max De Wilde

Using OpenRefine by Ruben Verborgh and Max De Wilde

My rating: 4 of 5 stars

Disclosure: the publisher of this book provided me with a free copy in exchange for a review. The opinions expressed in the review are my own.

While OpenRefine is an extremely useful “power tool for messy data”, its power can be difficult to master without a great deal of trial and error on the part of the user. Part of this stems from the evolving nature of the tool. It began life as Freebase Gridworks, with the purpose of cleaning up data in order to run it against linked data in Freebase. When the Freebase parent organization was acquired by Google, the tool was rebranded as Google Refine, but as Google’s priorities shifted, work on it stopped and it became the open-source OpenRefine. This legacy means that the tool has many pieces created by different people for different purposes.

While there is quite a lot of good documentation on the OpenRefine site and elsewhere, this book puts it together in an easy-to-follow format. Like a lot of OpenRefine documentation, it is a series of “recipes”, each explaining how to do one specific task, but it is written with the cover-to-cover reader in mind as well. The Google-produced tutorial videos cover similar ground, but the book goes into more depth and, for readers coming from cultural institutions, has the advantage of using a museum data set for its examples. Another advantage is that the authors have a particular interest in named-entity recognition (part of the book covers a tool that one of them produced), which is particularly helpful for more abstract cultural data sets.

Using OpenRefine is useful for beginner and intermediate users of OpenRefine. As someone who has used OpenRefine for a while and written about its use in libraries, I found the book more helpful than I initially expected, since it covers pieces of functionality I had not yet encountered in my own experimentation or in the documentation. My one criticism is that much of the book promises a complete explanation, in an appendix, of regular expressions and of the Google Refine Expression Language (GREL) that powers the software, but I found the GREL section less useful than I had hoped, though I still learned from it. I would have preferred that material earlier in the book. That aside, I would recommend this book to anyone who has been using OpenRefine or is thinking about using it, and also for library and museum professional development collections.
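To give a flavor of what these expressions look like: OpenRefine transformations can be written in GREL or in Jython (Python). Here is a minimal sketch of my own (not an example from the book) of a cell transformation in the Jython dialect, assuming a column of names with stray whitespace and inconsistent capitalization:

```python
# OpenRefine "Edit cells -> Transform...", with the expression
# language set to Python / Jython. OpenRefine binds the current
# cell's contents to `value`; whatever the expression returns
# becomes the new cell value.
if value is None:
    return None
# Trim stray whitespace and normalize capitalization, e.g.
# "  verborgh, RUBEN " becomes "Verborgh, Ruben"
return value.strip().title()
```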



A Brief Look at Cryptography for Librarians

This originally appeared on the ACRL TechConnect blog.

You may not think much about cryptography on a daily basis, but it underpins your daily work and personal existence. In this post I want to discuss a few realms of cryptography that affect the work of academic librarians, and point out some interesting facets you may never have considered. I won’t discuss the math or computer science basis of cryptography, but will look at it from a historical and philosophical point of view. If you are interested in the math and computer science, I have a few resources listed at the end, in addition to a bibliography.

Note that while I will discuss some illegal activities in this post, neither I nor anyone connected with the ACRL TechConnect blog is suggesting that you actually do anything illegal. I think you’ll find the intellectual part of it stimulating enough.

What is cryptography?

Keeping information secret is as simple as hiding it from view in, say, an envelope, and trusting that only the person to whom it is addressed will read that information and then not tell anyone else. But we all know that this doesn’t actually work. A better system would allow only a person with secret credentials to open the envelope, and would put the information inside into a code that only she could read.

The idea of using codes to keep important information secret goes back thousands of years, but for the purposes of computer science, most of the major advances have been made since the 1970s. With the advent of computing for business and military uses in the 1960s, it became necessary to come up with ways to encrypt digital data. In 1976 the concept of public-key cryptography was proposed, but it wasn’t realized practically until the 1978 paper by Rivest, Shamir, and Adleman–if you’ve ever wondered what RSA stands for, there’s the answer. There were further advancements to this system, which resulted in the Digital Signature Algorithm becoming the standard used by the federal government.1 Public-key systems work by giving each user a pair of keys–the private one is known only to that individual user, and the public one is shared. Anything encrypted with the public key, however, can be decrypted only with the matching private key. See the resources below for more on the math that makes up these algorithms.
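As a concrete, simplified illustration, here is a sketch of that public/private split using the Python cryptography package; the message and variable names are my own inventions, not part of any system described above:

```python
# A minimal sketch of public-key encryption with the Python
# "cryptography" package (pip install cryptography).
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Each user generates a key pair; the private half never leaves them.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()  # this half is shared openly

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# Anyone can lock a message with the public key...
ciphertext = public_key.encrypt(b"meet me in the stacks", oaep)
# ...but only the holder of the private key can unlock it.
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == b"meet me in the stacks"
```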

Another important piece of cryptography is the cryptographic hash function, first developed in the late 1980s. A hash function takes a block of data and produces a fixed-length “fingerprint” of it that cannot feasibly be reversed. For instance, passwords stored in databases should be run through one of these functions rather than stored as plain text, so that even if someone unauthorized gets access to the sensitive data, they cannot read the original passwords. Hash functions can also be used to verify the integrity of a piece of digital content, which is probably how most librarians think about them, particularly if you work with a digital repository of any kind.
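For example, here is a minimal sketch, using Python’s standard hashlib module, of the kind of fixity check a digital repository might run (the file name is hypothetical):

```python
# Compute a SHA-256 "fingerprint" of a file, as a repository might
# do to verify that a digital object has not been altered or corrupted.
import hashlib

def sha256_checksum(path: str, chunk_size: int = 8192) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum at ingest; recompute it later and compare.
print(sha256_checksum("master_scan.tiff"))  # hypothetical file name
```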

Why do you care?

You probably send emails, log into servers, and otherwise transmit all kinds of confidential information over a network (whether a local network or the internet). Encrypted access to these services and the data being transmitted is the only way that anybody can trust that any of the information is secret. Anyone who has had a credit card number stolen and had to deal with fraudulent purchases knows first-hand how upsetting it can be when these systems fail. Without cryptography, the modern economy could not work.

Of course, we all know a recent example of cryptography not working as intended. It’s no secret by now (recall from above that keeping something secret requires that no one who knows the information tells anyone else) that the National Security Agency (NSA) has sophisticated ways of breaking codes, or of getting around cryptography through other methods.2 Continuing with our envelope analogy from above, the NSA coerced companies to allow it to view the content of messages before the envelopes were sealed. If the messages were encoded, it got the keys to decode the data, or broke the code using its vast resources. While these practices were supposedly limited to potential threats, there’s no denying that this makes it more difficult to trust any online communications.

Librarians certainly have a professional obligation to keep data about their patrons confidential, and so this is one area in which cryptography is on our side. But let’s now consider an example in which it is not so much.

Breaking DRM: e-books and DVDs

Librarians are exquisitely aware of the digital rights management (DRM) realm of cryptography (for more on this from the ALA, see the ALA Copyright Office page on digital rights). These are algorithms that encode media in such a way that you are unable to copy or modify the material. Of course, like any code, once you break it, you can extract the material and do whatever you like with it. As I covered in a recent post, if you purchase a book from Amazon or Apple, you aren’t purchasing the content itself, but a license to use it in certain prescribed ways, so legally you have no recourse to break the DRM to get at the content. That said, you might have an argument under fair use, or some other legitimate reason to break the DRM, and it’s quite simple to do once you have the tools. For e-books in proprietary formats, you can download a plug-in for the Calibre program and follow the step-by-step instructions on this site. This allows you to convert proprietary formats into more open ones.

As above, you shouldn’t use software like that if you don’t have the rights to convert formats, and you certainly shouldn’t use it to pirate media. But just because it can be used for illegal purposes, does that make the software itself illegal? Breaking DVD DRM offers a fascinating example (for a lengthy list of CD and DVD copy-protection schemes, see here, and for a list of DRM-breaking software, see here). The case of CSS (Content Scramble System) descramblers illustrates some of the strange philosophical territory into which this can lead. The original descrambling code was developed in 1999 and distributed widely, which was initially ruled to be illegal. This was protested in a variety of ways; the Gallery of CSS Descramblers has a lot more on this.3 One of my favorite protest CSS descramblers is the “illegal” prime number: a prime number that contains the entire program for breaking the CSS DRM. The first illegal prime number was discovered in 2001 by Phil Carmody (see his description here).4 This number is, of course, only illegal inasmuch as the information it represents is illegal–in this case, a secret code that helped break another secret code.
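If it seems strange that a number can “contain” a program, remember that any file is, at bottom, one very large integer. A toy sketch of the idea in Python follows; the file name is hypothetical, and Carmody’s actual prime encodes a compressed version of the descrambler’s source code:

```python
# Any file is a sequence of bytes, and any sequence of bytes can be
# read as one (very large) integer. Carmody then searched near such
# an integer for a prime, so the prime itself encodes the file.
with open("descrambler.c.gz", "rb") as f:  # hypothetical file name
    data = f.read()

number = int.from_bytes(data, "big")  # the file, as an integer

# The conversion is reversible (assuming no leading zero bytes),
# so the integer and the file carry exactly the same information.
recovered = number.to_bytes((number.bit_length() + 7) // 8, "big")
assert recovered == data
```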

In 2004, after years of court hearings, the California Court of Appeal overturned one of the major injunctions against posting the code, on the grounds that source code is protected speech under the First Amendment and that CSS was no longer a trade secret. So you’re no longer likely to get in trouble for posting this code–but again, you should only use it for reasons protected under fair use.5 One of the major reasons you might legitimately need to break the DRM on a DVD is to play DVDs on computers running the Linux operating system, which still has no free legal software that will play them (there is legal software with the appropriate license for $25, however). Given that DVDs are physical media and subject to the first-sale doctrine, it is unfair that they are manufactured with limitations on how they may be played, and therefore this is a code that seems reasonable for the end consumer to break. That said, as more and more media is streamed or otherwise licensed, that argument no longer applies, and the situation becomes analogous to e-book DRM.

Learning More

The Gambling With Secrets video series explains the basic concepts of cryptography, including the mathematical proofs, using colors and other visual devices that are easy to grasp. It comes highly recommended by all the ACRL TechConnect writers.

Since cryptography is a fairly basic part of computer science, you will not be surprised to learn that there are a few large open online courses available on the subject. This Coursera class from Stanford is currently running, and this Udacity class from the University of Virginia is self-paced. Neither requires a lot of computer science or math skills to get started, though of course you will need a great deal of math to really get anywhere with cryptography.

A surprising but fun way to learn a bit about cryptography is the NSA’s Kids website–I discovered it years ago when I was looking for content for my X-Files fan website, and it is worth a look if for no other reason than to see how the NSA markets itself to children. Here you can play games to learn the basics of codes and codebreaking.

  1. Menezes, A., P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996, pp. 1-2. http://cacr.uwaterloo.ca/hac/.
  2. See the New York Times and The Guardian for complete details.
  3. Touretzky, D. S. “Gallery of CSS Descramblers.” 2000. http://www.cs.cmu.edu/~dst/DeCSS/Gallery. Accessed September 18, 2013.
  4. Caldwell, Chris. “The Prime Glossary: Illegal Prime.” Accessed September 17, 2013. http://primes.utm.edu/glossary/xpage/Illegal.html.
  5. “DVDCCA v Bunner and DVDCCA v Pavlovich.” Electronic Frontier Foundation. Accessed September 23, 2013. https://www.eff.org/cases/dvdcca-v-bunner-and-dvdcca-v-pavlovich.

Reflections on Code4Lib 2013

This originally appeared on the ACRL TechConnect blog.

Disclaimer: I was on the planning committee for Code4Lib 2013, but this post reflects my own opinions and not those of the other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue for sharing very new technology and for having discussions with a wide variety of people who might not attend conferences geared more toward librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and lightning talks and breakout sessions additionally allow wide participation and exposure to extremely new projects that have not yet made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at the University of Illinois at Chicago. You can see a list of all the programs here, including links to the video archive of the conference.

While there were many types of projects presented, I want to focus on the talks that illustrated what I saw as a thread running through the conference–care and emotion. This is perhaps unexpected for a technical conference, yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) of the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed its own institutional repository application, called ScholarSphere, that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and developed a system for doing so. Corey Harper from New York University presented an extremely interesting, still experimental project that uses linked data to enrich interfaces for interacting with library collections.

Note that all of these projects combined various pieces of open source software and library/web standards to solve a problem facing academic or research libraries in a particular setting. I think an important lesson for most academic librarians reading descriptions of projects like these is that they take more than development staff; they take purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. That project started out as an extremely low-tech solution for crowdsourcing archival transcription, but it became so popular that it required a more robust platform. The team was able to adapt open source tools to meet their needs, keeping the project well within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who gave a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulty of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and in people’s lives. The kind of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about, or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories, could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking the time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, describing excellent techniques for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, however much we may want to invent cute names for our servers and classes. To preserve a spirit of fun, you can keep the cute name and attach a description of what the item actually does.
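In that spirit, here is a tiny sketch of my own (not an example from the talk) of keeping the cute name while documenting what the thing really is:

```python
class Gandalf:
    """Authorization gateway for the discovery layer.

    The cute name stays for fun; the docstring explains that this
    class inspects each incoming request and decides whether it
    shall pass.
    """

    def allow(self, request_token: str) -> bool:
        # A real implementation would validate the token against an
        # identity provider; this stub simply rejects empty tokens.
        return bool(request_token)
```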

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but she ended up giving a different talk, the last one of the conference, on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and at how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software: it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured and to allow contributions from multiple people without breaking; lawyer readable means that the project should have the correct structure and licensing to permit collaboration across institutions. Bess focused particularly on the “human readable” aspect: creating communities and understanding the “hacker epistemology,” as she so eloquently put it, that “[t]he truth is what works.” Part of that understanding requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and at creating a safe space. She encouraged everyone to keep working to do better and to “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder of why I do the work I do as an academic librarian in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know that this work makes access to the collections, and exploration of them, better.