Analyzing CVs for publisher copyrights and self-archiving with OpenRefine

This originally appeared on the ACRL TechConnect blog.

I started working on this project yesterday, but I wanted to write it up as quickly as possible so that I could see how others are approaching this issue. First of all, I should say that this approach was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, Marsha Miles.

The problem I had: a faculty member sent a CV to a liaison so the items could be added to the repository, but the citations were not showing up in the citation databases–and I now work at an institution with all of the resources I need for this. So I wanted to go the other way, and start with the CV and turn that into something that I could use to query SHERPA/RoMEO.

It occurred to me that the best tool for this might be Google Refine (now OpenRefine, I guess), which I’ve always wanted to play around with. I am sure there are lots of other ways to do this, but I found this pretty easy to get set up. Here’s the approach I’m taking, with a version of my own CV.

  1. Start with a CV, and identify the information you want–you could copy the whole thing or use screen scraping or what have you, but most people’s CVs are about 20 pages long and you only care about 1 or 2 pages of journal articles.
  2. Copy this into a text editor to remove weird formatting or spacing. You want to have each citation on its own line, so if you had a CV with hanging indents or similar you would have to remove those.
  3. Now (assuming you’ve installed and opened Google Refine), either import this text file, or copy in the text. Import it as a line-based text file, and don’t select anything else.
  4. Click on Create Project>>, and it will bring it into Google Refine. Note that each line from the text file has become a row in the set of data, but now you have to turn it into something useful.
  5. My tactic was to separate the author, date, title, journal title, and other bibliographic information into their own columns. The journal title is the only one that really matters for these purposes, but of course you want to hang on to all the information. There are probably any number of ways to accomplish this, but since citations all have a standard structure, it’s really easy to exploit that to make columns. The citations above are in APA style, and since I started with social work faculty to test this out, that’s what I am starting with, but I will adjust for Chicago or MLA in the future. Taking a look at one citation as example we see the following:

    Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240.

    Note that we always have a space after the name, an opening parenthesis, the date, a closing parenthesis, and a period followed by a space–this is the punctuation we can exploit. So I can use this information to split the columns. To do so, select “Split into several columns” from the Edit Column menu.
    Then in the menu, type in the separator you want to use, which in this case is a space followed by an opening parenthesis. Split into two columns, and leave the rest alone. Note that you can also put a regular expression in here if necessary. Since dates are always the same length you could get away with field lengths, but this way works fine.
    After this, you will end up with the following change to your data:
    Now the author is in the first column, and the opening parenthesis is gone.

  6. Following along with the same rationale for each additional field and renaming the columns we end up with (cooking show style) the following:
    Each piece of information is in its own column so we can really start to do something with it.
  7. I am sure there’s a better way to do this, but my next step was to use the journal title as the query term in the SHERPA/RoMEO API call. This was super easy once I watched the data augmentation screencast; there is documentation here as well. Open up the following option from the Edit Column menu:
    You get a box to fill in the information about your API call. You have all kinds of options, but all you really need to do for this is format your URL in the way required by SHERPA/RoMEO. You should get an API key, and can read all about this in the article I linked to above. There are probably several ways to do this, but I found that what I have below works really well. Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay. I found 1000 worked fine.
    In a copy-and-pastable format, here’s what I have in the box: '[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url')
  8. Now it will run the query and send back a column of XML that looks like this:
  9. From this you can extract anything you want, but in my case I want to quickly get the pre- and post-archiving allowances, plus any conditions. This took me a while to figure out, but you can use the Google Refine Expression Language parseHtml function to work on this. Click on Add column based on this column from the Edit Column menu, and you will get a menu to fill in an expression to do something to your data. After a bit of trial and error, I decided the following makes the most sense. This grabs the text from inside the <prearchiving> element in the XML and shows you the text. You can see from the preview that you are getting the right information. Do the same thing for post-archiving or any other columns you want to isolate. The code you want is value.parseHtml().select("elementname")[0].htmlText()
    Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")
  10. Now you have your data neatly structured and know the conditions for archiving, hooray! Again, cooking show style, here’s what you end up with. You can certainly remove the SHERPA/RoMEO column at this point, and export the data as Excel or whatever format you want it in.
  11. BUT WAIT, IT GETS BETTER. So that was a lot of work to do all that moving and renaming. Now you can save this for the future. Click on Undo/Redo and then the Extract option.
    Make sure to unclick any mistakes you made! I entered the information wrong the first time for the API call, so that added an unnecessary step. Copy and paste the JSON into a text editor and save it for later.
  12. From now on when you have your CV data, you can click on the Undo/Redo tab and then choose Apply. It will run through the steps for you and automatically spit out your nicely formatted and researched publications. Well… realistically the first time it will spit out something with multiple errors, and you will see all the typos that are messing up your plan. But since the entire program is built to clean up messy data, you’re all set on that end. Here’s the APA format I described above for you to copy and paste if you like–make sure to fill in your API key.
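
Outside of Refine, step 5’s split-by-punctuation logic can also be expressed as a single regular expression. This is a rough sketch of the same idea in Python (not part of the Refine recipe itself), assuming APA-style citations of the form “Author(s) (Year). rest”:

```python
import re

# Naive APA pattern: lazily grab the author(s), then a 4-digit year in
# parentheses, then everything after the "). " as the rest of the citation.
APA = re.compile(r"^(?P<author>.+?) \((?P<year>\d{4})\)\. (?P<rest>.+)$")

citation = ('Heller, M. (2011). A Review of "Strategic Planning for '
            'Social Media in Libraries". Journal of Electronic '
            'Resources Librarianship, 24 (4), 339-240.')
m = APA.match(citation)
print(m.group("author"), "|", m.group("year"))
```

As in Refine, the space-plus-opening-parenthesis is what separates author from date; splitting the rest into title and journal would key off the same “. ” boundaries.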
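The API call in step 7 just appends an escaped journal title to a fixed URL. Here is a sketch of the same query string in Python–note that the base URL is taken from the legacy SHERPA/RoMEO API documentation and the key is a placeholder, so verify both against the current docs before relying on this:

```python
from urllib.parse import quote

API_KEY = "YOUR-API-KEY"  # placeholder: register with SHERPA/RoMEO for a key
# Base URL assumed from the legacy SHERPA/RoMEO API docs -- verify it.
BASE = "http://www.sherparomeo.ac.uk/romeo/api29.php"

def romeo_url(journal_title):
    # Same shape as the Refine expression: key, qtype=starts, escaped title
    return (BASE + "?ak=" + API_KEY
            + "&qtype=starts&jtitle=" + quote(journal_title))

print(romeo_url("Journal of Electronic Resources Librarianship"))
```

quote() does the same job as Refine’s escape(value,'url'), percent-encoding the spaces in the journal title.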
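The parseHtml() extractions in step 9 have a straightforward equivalent outside Refine too. This sketch parses a trimmed, made-up response fragment in the shape the step describes (a real SHERPA/RoMEO response contains much more):

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only -- not a real API response.
xml = """<romeoapi>
  <publishers><publisher>
    <prearchiving>can</prearchiving>
    <postarchiving>can</postarchiving>
    <conditions>
      <condition>On author's personal website</condition>
      <condition>Publisher copyright must be acknowledged</condition>
    </conditions>
  </publisher></publishers>
</romeoapi>"""

root = ET.fromstring(xml)
# Equivalent of value.parseHtml().select("prearchiving")[0].htmlText()
pre = root.findtext(".//prearchiving")
# Equivalent of the forEach(...).join(". ") expression for conditions
conditions = ". ".join(c.text for c in root.iter("condition"))
print(pre)
print(conditions)
```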
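If you’d rather not click through the Apply dialog in step 12 every time, OpenRefine also exposes the same replay step over its local HTTP API. This sketch only builds the request rather than sending it; the endpoint and parameter names come from OpenRefine’s command API and may differ across versions, and the project id shown is made up:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def apply_operations_request(project_id, operations):
    """Build (but don't send) the POST that replays a saved JSON recipe."""
    body = urlencode({"operations": json.dumps(operations)}).encode()
    return Request(
        "http://127.0.0.1:3333/command/core/apply-operations?"
        + urlencode({"project": project_id}),
        data=body,
        method="POST",
    )

# Hypothetical project id; "operations" would be the JSON you saved earlier.
req = apply_operations_request("1234567890", [{"op": "core/column-rename"}])
print(req.full_url)
```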

I hope this is useful to some people. I know even in this protean form this procedure will save me a ton of time and allow library liaisons to concentrate on outreach to faculty rather than look up a ton of things in a database. Let me know if you are doing something similar and the way you are doing it.


Using GVMax with LibraryH3lp

“So you’ve really done it this time,” I said to myself yesterday as I wondered why no one had been using text message reference recently, and then remembered I’d changed the library’s Google password earlier in the semester and had never updated it in LibraryH3lp. “Oh well,” I thought, “I check the Google Voice inbox periodically and didn’t see anything missed in there. I’ll update the password now, and it will all be ok.” Then I deleted the Google Voice gateway in order to add in the correct password and discovered to my horror that I couldn’t add it back. Little did I realize that LibraryH3lp had decided that the Google Voice gateway had reached end of life since the Twilio solution works better. I am sure the people at LibraryH3lp would have helped me out of this, and they suggested using something called GVMax as a solution. I didn’t see anyone talking about how to set this up, so I wanted to show how I did this if anyone else is wondering how it works.

  • Gather your materials. You will need a Google Voice account. I assume you already have one. Note that you don’t need to give GVMax access to your Google account, but you should if you want to use all the functions. You will also want a queue for your Google Voice gateway in LibraryH3lp that is offline. You can do that by removing all logged-in users from the queue. If you forget, that’s ok, just make sure you take the queue offline and then online before testing this out.
  • Sign up for GVMax: type your Google account username and password, and accept the terms of service. Refer to the GVMax setup instructions for how to set up your account. Basically you need to set up a filter in your Google Voice email address to forward to a special GVMax email address that will send notifications to your Google Talk. If you had other forwards set up, you should remove them. You can do it all with GVMax. Setting up this filter in Gmail requires a confirmation code that GVMax will forward to your Gmail account, so it takes a moment to set up. I would suggest copying the GVMax email address into a text editor, since completing this process requires switching between screens in both Gmail and GVMax.
  • Note that you can have GVMax monitor incoming calls, voice mails, and SMS. For these purposes, SMS only is preferable.
  • Make sure that Google Talk will accept chats from the GVMax account. This doesn’t have to be the same as your Google Voice account. Depending on how you have your IM reference set up you may need to choose a different Google Talk account. We have a different Google Talk address for the very few occasions when a student used an IM client rather than the chat boxes. As far as I know, the main IM client users used the AIM account. Anyway, since we were already using that same account in our LibraryH3lp gateway, it made sense to me to use the same one. But if you use that account as an IM gateway already, you may want to get a new one to avoid confusion.
  • GVMax has a feature on the My Account page to send a test SMS. You will use that a lot.
  • At this point, if you send a text to your Google Voice number, you will get a message in Google Talk with the text. When you respond to it, using the magic of Google Voice SMS to Gmail to Talk and back again, the other end will get a response as a text. Test this out and make sure it works before you put it in LibraryH3lp.
  • Now go to your SMS queue in LibraryH3lp. Ours is called domusms. It previously used a Google Voice gateway; now you will pick Google Talk for the gateway. The account and password should be whatever you used with GVMax.
  • This is what the interaction will look like in LibraryH3lp:
  • The blurred out number is my phone number (though it’s more complex than that–I won’t get into it), followed by the GVMax address. There was a wait time of about 1-10 seconds between when I sent the text for it to appear in LibraryH3lp and then for my response to appear on my phone. Not bad!
  • I set up my SMS queue with the picture of the cell phone to remind people that this is the SMS queue. I suggest you do the same.
  • What happens if the queue is offline? This happens a lot. I have yet to see how this works in practice, but the main suggestion I have is to have alerts of text messages sent to an account that someone monitors regularly, such as the reference email. Then that person has to go into Google Voice to send a response. My theory is that there must be a way to send a notification only if the LibraryH3lp queue (and hence the Google Talk account) is offline, but I haven’t figured this one out yet. Will keep you posted!

Augmented Reality Hackfest Report from Code4Lib Midwest

Here’s the report on the discussion and experimentation session that my group had at Code4Lib Midwest about augmented reality applications. In my group were Erin Fisher and Kyle Felker of Grand Valley State University and Megan O’Neill of Albion College. We were interested in what augmented reality could do for marketing, public services, instruction, and other public areas of the library, and how it intersects with gamification.
We were not so interested in actually programming any augmented reality applications, but rather seeing what is available to the average consumer or library and whether it solves the problems inherent in QR codes. We defined these problems as follows: you have to know ahead of time what a QR code is to use it, you have to have the application to read it installed on your mobile device, and you have to have the ability to reach the internet with your device, which assumes cell coverage and/or wifi. For people who have all those in place, there are some additional problems of appropriate use.
We determined that for augmented reality to be truly useful to the average person, it should have the following features: it should provide an answer to a real need rather than simply trying to sell something, and should ideally answer that need right away rather than sending you to another website to find the answer. We also discussed the concept of a “subculture” aspect to these sorts of applications–for instance, people sharing uncensored information about public locations or institutions that only those with access to the app can see. But we struggled with what sort of information or services libraries have that fit into this mold. We do, after all, freely give away everything we have. What do we have that people really need and want? This ended up being a rather depressing line of conversation. One of the less depressing conclusions we drew was that not everything has to have an educational point or increase information literacy. For instance, the fairy doors in Ann Arbor (can we talk about how wonderfully late 90s this site is?) include one at the Ann Arbor District Library. This was a kind of goofy and whimsical thing that wasn’t explained in advance, but people started to catch on and wanted to find all the fairy doors in town. It gets people into locations they might not have visited otherwise, but doesn’t feel like it has an ulterior motive.
In libraries, we felt that there were a few obvious useful applications for augmented reality. First, wayfinding through large buildings is always helpful, though none of us work in buildings so large that this seemed useful. But we recalled our days in library school, where after two years working in the University of Illinois Main Library there were still plenty of unknown and unfindable corners. Certainly some quiet and creepy corners of the Main Stacks had a subculture aspect to them. How to do this is another matter entirely. GPS doesn’t necessarily work down to the foot, which is what you would need for finding a book. Once upon a time you could use wireless access points to triangulate someone’s position. But according to some people I talked to there, this really doesn’t work anymore because wireless networks saturate areas so heavily that you couldn’t pinpoint where someone was.
We also talked extensively about the concept of bridging the physical with the digital. While it may seem counter-intuitive, we could all offer examples of students finding their way to a physical space within the library without having any concept that they had access to many more digital resources. Finding a book on the shelf had an easy to imagine trajectory that wasn’t overwhelming–if augmented reality could offer a similar experience it could make the research experience more palatable. And of course, if it’s fun in itself, that’s even better. The University of Rochester’s Just Press Play is an example of something that does this very well. Another example was the fairly recent promotion of Jay-Z’s book using Bing, where people could “visit” the physical world virtually through Bing maps, and also use a mobile device to actually visit the places and see the digital content. This was a really well done and smart promotion that was very popular. But it was also wildly expensive, so while it might provide some inspiration, we can’t do it in libraries.
One of the promises of QR codes (and if you remember the CueCat you know this is going back awhile) is making print interactive. We discussed the Wonderbook, which is an odd hybrid between augmented reality toy, video game, and book. Personally, if I played video games, I could see something like that being attractive to me, but not something I would use otherwise. We discussed (and played around with) a lot of tools which make print or other physical objects do something interesting when you take a picture of them. But these have all the same problems that QR codes have: you have to have the right app, you have to know where to look, and you have to care enough to try to look. One of the ones we looked at could make a topless lady appear–sure, that might make you want to look, but perhaps less than appropriate for library wayfinding.

The main tools we played around with or researched include:

  • Layar (proprietary; iOS and Android)
  • Aurasma (proprietary; iOS and Android)
  • mixare (open source; iOS and Android)
  • ARIS (open source; iOS)
  • Argon (open source; iOS)

These didn’t always work perfectly on all our devices (we all had iOS devices, but of varying ages and capabilities)–in particular, the open source ones seemed to require many dependencies and were not as immediately useful. And basically all we ended up with was the ability to embed URLs or videos on magazines or other physical objects. It was fun, but ultimately didn’t seem to solve any of the problems we hoped it would. Still, something learned.

 Last but certainly not least, we would like to introduce you to the extremely important concepts of boozy popsicles and putting fruit such as blueberries and pineapple in lemon-lime pop (Diet 7-Up or Sprite, for instance). Once you have these things, the larger problems of the world tend to recede.