
Features extracted from the HTRC

Note that this is an alpha data release. Feedback is welcome.

A great deal of fruitful research can be performed using non-consumptive pre-extracted features. For this reason, HTRC has put together a select set of page-level features extracted from the HathiTrust's non-Google-digitized public domain volumes. The source texts for this set of feature files are primarily in English.

Features are notable or informative characteristics of the text. We have processed a number of useful features, including part-of-speech-tagged token counts, header and footer identification, and various line-level information, all provided per page. Providing token information at the page level makes it possible to separate text from paratext (for example, thirty pages of publishers’ ads at the back of a book). We have also broken each page into three parts: header, body, and footer. The specific features that we extract from the text are described in more detail below.

The primary pre-calculated feature that we provide is the token (unigram) count, on a per-page basis. Token counts are broken down by the part of speech in which each token is used, so a term used as both a noun and a verb, for example, has a separate count for each use. We also include line information, such as the number of lines with text on each page and counts of the characters that begin and end the lines on each page. This information can illuminate genre and volume structure: for instance, it helps distinguish poetry from prose, or body text from an index.

The present release is an alpha release, and we would love to hear about how you use it, or what else you would like to see!

Review sample data or jump straight to the downloads.

Sample JSON


{
   "id":"loc.ark:/13960/t1fj34w02",
   "metadata":{
      "schemaVersion":"1.0",
      "title":"Shakespeare's Romeo and Juliet,",
      "pubDate":1920,
      "htBibUrl":"http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02",
      "dateCreated":"2014-05-01T09:42",
      "handleUrl":"http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
      "oclc":"",
      "language":"eng",
      "imprint":"Chicago, Scott Foresman and company, [c1920]"
   },
   "features":{
      "schemaVersion":"1.0",
      "dateCreated":"2014-05-01T02:15",
      "pageCount":230,
      "pages":[
         {
            "seq":15,
            "tokenCount":212,
            "lineCount":38,
            "emptyLineCount":10,
            "sentenceCount":7,
            "header":{
               "tokenCount":1,
               "lineCount":1,
               "emptyLineCount":0,
               "sentenceCount":1,
               "tokens":{
                  "INTRODUCTION":{ "NN":1 }
               },
               "beginLineChars":{ "I":1 },
               "endLineChars":{ "N":1 }
            },
            "body":{
               "tokenCount":211,
               "lineCount":37,
               "emptyLineCount":10,
               "sentenceCount":6,
               "tokens":{
                  "priests":{ "NNS":1 },
                  "development":{ "NN":1 },
                  "extraordinary":{ "JJ":1 },
                  "striking":{ "JJ":1 },
                  "ceremonial":{ "NN":1 },
                  "which":{ "WDT":3 },
                  "sprang":{ "VBD":1 },
                  ".":{ ".":7 },
                  ",":{ ",":10 },
                  "1600":{ "CD":1 },
                  …,
                  "growth":{ "NN":1 }
               },
               "beginLineChars":{
                  "f":2, "d":2, "b":2, …,"S":1
               },
               "endLineChars":{
                  "f":1, "g":2, "d":2, …, "r":1
               }
            },
            "footer":{
               "tokenCount":0,
               "lineCount":0,
               "emptyLineCount":0,
               "sentenceCount":0,
               "tokens":{ },
               "beginLineChars":{ },
               "endLineChars":{ }
            }
         }
      ]
   }
}

Documentation

The extracted features data is provided in JSON form.
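
As the Rsync section below shows, each volume's features are distributed as a single bzip2-compressed JSON file. A minimal loading sketch in Python, using a filename built from the sample volume's id in its file-friendly form (an assumption based on the examples on this page):

import bz2
import json

# Open one volume's feature file; the data is bzip2-compressed JSON.
with bz2.open("loc.ark+=13960=t1fj34w02.json.bz2", "rt", encoding="utf-8") as f:
    volume = json.load(f)

print(volume["metadata"]["title"])       # "Shakespeare's Romeo and Juliet,"
print(volume["features"]["pageCount"])   # 230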

Volume

The volume represents a single work (e.g., a book) as it appears in the HathiTrust index.

id: A unique identifier for the current volume. This is the same identifier used in the HathiTrust and HathiTrust Research Center corpora.

Metadata

A small amount of bibliographic metadata for identifying the volume is included in this dataset. See also: “Where’s the bibliographic metadata?” below.

schemaVersion: A version identifier for the format and structure of this metadata object. metadata.schemaVersion is separate from features.schemaVersion below.

dateCreated: The time this metadata object was processed. metadata.dateCreated is not necessarily the same as features.dateCreated below.

title: Title of the given volume.

pubDate: The publication year.

language: The primary language of the given volume.

htBibUrl: The HathiTrust Bibliographic API call for the volume.

handleUrl: The persistent identifier for the given volume.

oclc: The OCLC number(s) for the volume; empty if none is recorded.

imprint: The publication place, publisher, and publication date of the given volume.

Features

The features extracted from the content of the volume.

schemaVersion: A version identifier for the format and structure of the feature data (HTRC generated).

dateCreated: The time this batch of feature data was processed and recorded (HTRC generated).

pageCount: The number of pages in the volume.

pages: An array of JSON objects, each representing a page of the volume.

Page

Pages are contained within volumes; each page has a sequence number and information about its header, body, and footer.

Page-level information

seq: The sequence number. See “Can I use the page sequence as a unique identifier?” below.

tokenCount: The total number of tokens in the page.

lineCount: The total number of non-empty lines in the page.

emptyLineCount: The total number of empty lines in the page.

sentenceCount: The total number of sentences identified in the page using OpenNLP (see “How are tokens parsed?” below).
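
In the sample JSON above, the page-level counts equal the sums of the corresponding header, body, and footer counts (212 = 1 + 211 + 0 tokens). A short sketch that checks this relationship across a volume, assuming a feature file loaded as in the earlier example:

import bz2, json

volume = json.load(bz2.open("loc.ark+=13960=t1fj34w02.json.bz2", "rt", encoding="utf-8"))

for page in volume["features"]["pages"]:
    # Sum the tokenCount of the three page sections and compare it to
    # the page-level tokenCount.
    section_total = sum(page[part]["tokenCount"] for part in ("header", "body", "footer"))
    if page["tokenCount"] != section_total:
        print("mismatch on page seq", page["seq"])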

The fields for header, body, and footer are the same, but apply to different parts of the page (see “What is the difference between the header, body, and footer sections?” below).

tokenCount: The total number of tokens in this page section.

lineCount: The number of lines containing characters of any kind in this page section. This reflects the layout of the page; for sentence counts, see the sentenceCount field.

emptyLineCount: The number of lines without text in this page section.

sentenceCount: The number of sentences found in the text in this page section, parsed using OpenNLP.

tokens: A map of every token in this page section (tagged by part of speech using OpenNLP) to its frequency counts; a short illustration follows this field list. Tokens are case-sensitive: there are separate counts, for instance, for “rose” (noun) and “rose” (verb), while a capitalized “Rose” is a separate token again. Words split by a hyphen across a line break are rejoined; no other data cleaning or OCR correction was performed. Details on POS parsing and types of tags used.

beginLineChars: Count of the initial character of each line in this page section (ignoring whitespace).

endLineChars: Count of the last character on each line in this page section (ignoring whitespace).
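
To make the shape of the tokens structure concrete, here is a hypothetical entry and how its part-of-speech counts combine (the values are invented for illustration):

# Each surface form maps to its per-POS counts; "rose" and "Rose" are
# distinct, case-sensitive keys (values invented for illustration).
tokens = {
    "rose": {"NN": 2, "VBD": 1},  # twice as a noun, once as a verb
    "Rose": {"NNP": 1},           # the capitalized form is a separate token
}

total_rose = sum(tokens["rose"].values())   # 3 occurrences of "rose"
rose_as_noun = tokens["rose"].get("NN", 0)  # 2 of them tagged NN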


Download Links

This feature dataset is licensed under a Creative Commons Attribution 4.0 International License.

Download

Below we provide the extracted feature data for download, in the form of chronologically sequential bundles of page-level features and metadata. While we attempted to keep the bundles similar in size, file sizes vary because we were careful not to split a year across bundles.

Data Sample (126M)

pre-1850 (4.2G), 1850–1879 (5.5G), 1880–1889 (3.3G), 1890–1899 (4.4G),
1900–1909 (5.7G), 1910–1919 (5.5G), post-1919 (2.0G)

Rsync

The data is also set up to be downloaded with rsync. This has the benefit of letting you download feature files individually and keep them in sync with our copies.

To sync all the feature files:

rsync -v sandbox.htrc.illinois.edu::ngpd-features/*/*.json.bz2 .

This command will download all 250k files. If you would like a listing of all the files to see what is available, or to selectively sync parts of the data:

rsync -v sandbox.htrc.illinois.edu::ngpd-features/*.txt .

To sync a single feature file for which you know the file-friendly volume id:

rsync -v sandbox.htrc.illinois.edu::ngpd-features/uiuo/uiuo.ark+=13960=t0jt0c18w.json.bz2 .
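
Comparing the sample volume id loc.ark:/13960/t1fj34w02 with the filename pattern above suggests that the file-friendly form replaces ':' with '+' and '/' with '='. A minimal sketch of that mapping (an inference from these examples, not a documented specification):

def file_friendly_id(volume_id: str) -> str:
    # Assumed mapping, based on the examples on this page:
    # ':' -> '+' and '/' -> '='.
    return volume_id.replace(":", "+").replace("/", "=")

# "loc.ark:/13960/t1fj34w02" -> "loc.ark+=13960=t1fj34w02"
print(file_friendly_id("loc.ark:/13960/t1fj34w02"))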

You can also sync any of the download files above, such as the random sample of 1000 volumes:

rsync -v sandbox.htrc.illinois.edu::ngpd-features/sample.tar .

About the Data

The data provided here represents non-consumptive features of pages from the HathiTrust's non-Google-digitized volumes. These are primarily English-language public domain volumes.

More information about HathiTrust datasets.

# of pages:   67,932,813
# of volumes: 250,178
Years:        1431–2010
Median date:  1899

Questions

How are tokens parsed?

Hyphenation of tokens at the end of a line was corrected using custom code. Apache OpenNLP was used for sentence segmentation, tokenization, and part-of-speech (POS) tagging. No additional data cleaning or OCR correction was performed.

OpenNLP uses the Penn Treebank POS tags.

Can I use the page sequence as a unique identifier?

The seq value is always sequential from the start of the volume. In this version of the processing, the seq value was extracted from the filename; in some limited cases, where the filename labeling does not align with the seq given in the METS file, our pages are therefore out of alignment with HathiTrust's. In the future we will use the METS file to specify the seq number instead. Each scanned page of a volume has a unique sequence number, but it is specific to the current version of the full text: in theory, updates to the OCR that add or remove pages will change the sequence. The practical likelihood of such changes is low, but use the page sequence as an identifier with caution.

A future release of this data will include persistent page identifiers that remain stable across such changes.

Where’s the bibliographic metadata? Who wrote the book, when is it from, etc.?

This dataset is foremost an extracted-features dataset, with minimal metadata included as a convenience. For additional metadata, such as subject classifications, HathiTrust offers Hathifiles, which can be paired with our feature dataset through the volume id field.

The metadata included in this dataset combines MARC metadata from HathiTrust with additional information from Hathifiles:

  • imprint: 260a from the HathiTrust MARC record, 260b and 260c from Hathifiles.
  • language: MARC control field 008, from Hathifiles.
  • pubDate: extracted from Hathifiles. See also: details on HathiTrust's rights-determination.
  • oclc: extracted from Hathifiles.

Additionally, schemaVersion and dateCreated are specific to this feature dataset.

What do I do with beginning- or end-of-line characters?

The characters at the start and end of a line can be used to differentiate text from paratext at the page level. For instance, index lines tend to begin with capitalized letters and end with numbers; likewise, lines in a table of contents can be identified through Arabic or Roman numerals at the start of a line. A sketch of one such heuristic follows.
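
As an illustration only (not part of the dataset or its tooling), one could flag index-like pages by the share of body lines that end in a digit, using the endLineChars field. A rough sketch, assuming a feature file loaded as in the earlier example:

import bz2, json

volume = json.load(bz2.open("loc.ark+=13960=t1fj34w02.json.bz2", "rt", encoding="utf-8"))

def digit_ending_ratio(section):
    # Fraction of lines whose last non-whitespace character is a digit,
    # a rough index or table-of-contents signal.
    end_chars = section.get("endLineChars", {})
    total = sum(end_chars.values())
    if not total:
        return 0.0
    return sum(n for ch, n in end_chars.items() if ch.isdigit()) / total

for page in volume["features"]["pages"]:
    if digit_ending_ratio(page["body"]) > 0.5:
        print("page seq", page["seq"], "looks index- or TOC-like")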

What is the difference between the header, body, and footer sections?

Because repeated headers and footers can distort word counts in a document, but also help identify document parts, we attempt to identify repeated lines at the top or bottom of a page and provide separate token counts for those forms of paratext. The “header” and “footer” sections will also include tokens that are page numbers, catchwords, or other short lines at the very top or bottom of a page. Users can of course ignore these divisions by aggregating the token counts for header, body, and footer sections.
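
A minimal merging sketch for whole-page token counts, assuming a feature file loaded as in the earlier example:

import bz2, json
from collections import Counter

volume = json.load(bz2.open("loc.ark+=13960=t1fj34w02.json.bz2", "rt", encoding="utf-8"))

def page_tokens(page):
    # Aggregate token counts across header, body, and footer, summing
    # over part-of-speech tags to get whole-page term frequencies.
    combined = Counter()
    for part in ("header", "body", "footer"):
        for token, pos_counts in page[part]["tokens"].items():
            combined[token] += sum(pos_counts.values())
    return combined

first_page = volume["features"]["pages"][0]
print(page_tokens(first_page).most_common(5))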

The current implementation uses a variety of heuristics to identify headers and footers, which can be seen in our code base. Note that we are considering methods for improving header/footer parsing, and this code is subject to change.



Tools

If you've built tools or scripts for processing our data, let us know and we'll feature them here!

Projects

Let us know about your projects and we'll link to them here.