My rating: 4 of 5 stars
Disclosure: the publisher of this book provided me with a free copy in exchange for a review. The opinions expressed in the review are my own.
While OpenRefine is an extremely useful “power tool for messy data”, its power can be difficult to master without a great deal of trial and error on the part of the user. Part of this stems from the evolving nature of the tool. It began life as Freebase Gridworks, with the purpose of cleaning up data in order to run it against linked data in Freebase. When the Freebase parent organization was acquired by Google, they rebranded the tool as Google Refine, but as Google’s priorities shifted, they stopped working on the tool and it became the open source OpenRefine. This legacy means that the tool has many pieces created by different people for different purposes. While there is quite a lot of good documentation out there on the OpenRefine site and elsewhere, this book puts it together in a easy to follow format. Like a lot of OpenRefine documentation, it is a series of “recipes” that explain how to do one specific task, but is written with the cover to cover reader in mind as well. The Google produced tutorial videos have similar coverage, but the book is more in depth, and has the advantage for readers coming from the cultural institution side of using a museum data set for examples. Another advantage is that the authors of the book have a particular interest in named entity recognition (part of the book covers the tool that one of them produced), which is particularly helpful for more abstract data sets with cultural data.
Using OpenRefine is useful for beginner or intermediate users of OpenRefine. As someone who has used OpenRefine for awhile and written about its use in libraries, this was more helpful than I expected initially, since there were pieces of functionality I’d not yet encountered in experimentation or documentation so far. My one criticism is that much of the book promises a complete explanation in the appendix of regular expressions and the Google Refine Expression Language that powers the software, but I found that the GREL documentation was less useful than I hoped, though I still learned from it. I would have preferred if that section had been earlier in the book. That aside, I would recommend this book to anyone who has been using OpenRefine or thinking about using it, and additionally for library and museum professional development collections.technology,What I've been reading lately
In honor of Ada Lovelace Day, #LibTechWomen is blogging about one of the great early library and information science theorists, Suzanne Briet. You can read all our blog posts on Twitter using the hash tag #briet. In a shocking but not surprising turn of events, a recent textbook gave Paul Otlet credit for the one thing that everyone should be able to remember from the first semester of library school: a wild antelope is not a document, but an antelope in a zoo is, since it was collected, cataloged, and provides evidence. Unfortunately textbooks have a way of perpetuating wrong information from generation to generation, and critical thinking and research skills are woefully lacking. So, we are remembering Madame Documentation on a day dedicated to remembering and celebrating women in STEM fields.
My personal forever favorite tribute to Suzanne Briet comes in the form of a critical puppet show put on by the Self Preservation Working Group at the Read/Write Library. Watch and enjoy.
This originally appeared on the ACRL TechConnect blog.
You may not think much about cryptography on a daily basis, but it underpins your daily work and personal existence. In this post I want to talk about a few realms of cryptography that affect the work of academic librarians, and talk about some interesting facets you may never have considered. I won’t discuss the math or computer science basis of cryptography, but look at it from a historical and philosophical point of view. If you are interested in the math and computer science, I have a few a resources listed at the end in addition to a bibliography.
Note that while I will discuss some illegal activities in this post, neither I nor anyone connected with the ACRL TechConnect blog is suggesting that you actually do anything illegal. I think you’ll find the intellectual part of it stimulation enough.
What is cryptography?
Keeping information secret is as simple as hiding it from view in, say, an envelope, and trusting that only the person to whom it is addressed will read that information and then not tell anyone else. But we all know that this doesn’t actually work. A better system would only allow a person with secret credentials to open the envelope, and then for the information inside to be in a code that only she could know.
The idea of codes to keep important information secret goes back thousands of years , but for the purposes of computer science, most of the major advances have been made since the 1970s. In the 1960s with the advent of computing for business and military uses, it was necessary to come up with ways to encrypt data. In 1976, the concept of public-key cryptography was developed, but it wasn’t realized practically until 1978 with the paper by Rivest, Shamir, and Adleman–if you’ve ever wondered what RSA stood for, there’s the answer. There were some advancements to this system, which resulted in the digital signature algorithm as the standard used by the federal government.[1. Menezes, A., P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. http://cacr.uwaterloo.ca/hac/. 1-2.] Public-key systems work basically by creating a private and a public key–the private one is known only to each individual user, and the public key is shared. Without the private key, however, the public key can’t open anything. See the resources below for more on the math that makes up these algorithms.
Another important piece of cryptography is that of cryptographic hash functions, which were first developed in the late 1980s. These are used to encrypt blocks of data– for instance, passwords stored in databases should be encrypted using one of these functions. These functions ensure that even if someone unauthorized gets access to sensitive data that they cannot read it. These can also be used to verify the identify of a piece of digital content, which is probably how most librarians think about these functions, particularly if you work with a digital repository of any kind.
Why do you care?
You probably send emails, log into servers, and otherwise transmit all kinds of confidential information over a network (whether a local network or the internet). Encrypted access to these services and the data being transmitted is the only way that anybody can trust that any of the information is secret. Anyone who has had a credit card number stolen and had to deal with fraudulent purchases knows first-hand how upsetting it can be when these systems fail. Without cryptography, the modern economy could not work.
Of course, we all know a recent example of cryptography not working as intended. It’s no secret (see above where keeping something a secret requires that no one who knows the information tells anyone else) by now that the National Security Agency (NSA) has sophisticated ways of breaking codes or getting around cryptography though other methods [2. See the New York Times and The Guardian for complete details.] Continuing with our envelope analogy from above, the NSA coerced companies to allow them to view the content of messages before the envelopes were sealed. If the messages were encoded, they got the keys to decode the data, or broke the code using their vast resources. While these practices were supposedly limited to potential threats, there’s no denying that this makes it more difficult to trust any online communications.
Librarians certainly have a professional obligation to keep data about their patrons confidential, and so this is one area in which cryptography is on our side. But let’s now consider an example in which it is not so much.
Breaking DRM: e-books and DVDs
Librarians are exquisitely aware of the digital rights management realm of cryptography (for more on this from the ALA, see The ALA Copyright Office page on digital rights ). These are algorithms that encode media in such a way that you are unable to copy or modify the material. Of course, like any code, once you break it, you can extract the material and do whatever you like with it. As I covered in a recent post, if you purchase a book from Amazon or Apple, you aren’t purchasing the content itself, but a license to use it in certain proscribed ways, so legally you have no recourse to break the DRM to get at the content. That said, you might have an argument under fair use, or some other legitimate reason to break the DRM. It’s quite simple to do once you have the tools to do so. For e-books in proprietary formats, you can download a plug-in for the Calibre program and follow step by step instructions on this site. This allows you to change proprietary formats into more open formats.
As above, you shouldn’t use software like that if you don’t have the rights to convert formats, and you certainly shouldn’t use it to pirate media. But just because it can be used for illegal purposes, does that make the software itself illegal? Breaking DVD DRM offers a fascinating example of this (for a lengthy list of CD and DVD copy protection schemes, see here and for a list of DRM breaking software see here). The case of CSS (Content Scramble System) descramblers illustrates some of the strange philosophical territory into which this can end up. The original code was developed in 1999, and distributed widely, which was initially ruled to be illegal. This was protested in a variety of ways; the Gallery of CSS Descramblers has a lot more on this [4. Touretzky, D. S. (2000) Gallery of CSS Descramblers. Available: http://www.cs.cmu.edu/~dst/DeCSS/Gallery, (September 18, 2013).]. One of my favorite protest CSS descramblers is the “illegal” prime number, which is a prime number that contains the entire code for breaking the CSS DRM. The first illegal prime number was discovered in 2001 by Phil Carmody (see his description here) [6. For more, see Caldwell, Chris. “The Prime Glossary: Illegal Prime.” Accessed September 17, 2013. http://primes.utm.edu/glossary/xpage/Illegal.html.]. This number is, of course, only illegal inasmuch as the information it represents is illegal–in this case it was a secret code that helped break another secret code.
In 2004, after years of court hearings, the California Court of Appeal overturned one of the major injunctions against posting the code, based on the fact that source code is protected speech under the first amendment , and that the CSS was no longer a trade secret. So you’re no longer likely to get in trouble for posting this code–but again, using it should only be done for reasons protected under fair use. [5.“DVDCCA v Bunner and DVDCCA v Pavlovich.” Electronic Frontier Foundation. Accessed September 23, 2013. https://www.eff.org/cases/dvdcca-v-bunner-and-dvdcca-v-pavlovich.] One of the major reasons you might legitimately need to break the DRM on a DVD is to play DVDs on computers running the Linux operating system, which still has no free legal software that will play DVDs (there is legal software with the appropriate license for $25, however). Given that DVDs are physical media and subject to the first sale doctrine, it is unfair that they are manufactured with limitations to how they may be played, and therefore this is a code that seems reasonable for the end consumer to break. That said, as more and more media is streamed or otherwise licensed, that argument no longer applies, and the situation becomes analogous to e-book DRM.
The Gambling With Secrets video series explains the basic concepts of cryptography, including the mathematical proofs using colors and other visual concepts that are easy to grasp. This comes highly recommended from all the ACRL TechConnect writers.
Since it’s a fairly basic part of computer science, you will not be surprised to learn that there are a few large open courses available about cryptography. This Cousera class from Stanford is currently running, and this Udacity class from University of Virginia is a self-paced course. These don’t require a lot of computer science or math skills to get started, though of course you will need a great deal of math to really get anywhere with cryptography.
A surprising but fun way to learn a bit about cryptography is from the NSA’s Kids website–I discovered this years ago when I was looking for content for my X-Files fan website, and it is worth a look if for nothing else than to see how the NSA markets itself to children. Here you can play games to learn basics about codes and codebreaking.Filed under: hacking,technology
Next Page »