JSTOR Text Analysis Support: Working with JSTOR Bibliographic Metadata – JSTOR Support

On this page:

Downloading the JSTOR bibliographic metadata file
Creating an item ID list
- Item ID list formatting requirements
- Example: Using Python to filter metadata and create an item ID list
Working with JSON Lines
- JSON data dictionary

To request the full text of items for analysis, you must submit a dataset request that includes a list of item IDs taken from the JSTOR bibliographic metadata file.

A document's bibliographic metadata provides information about the document, such as title and year of publication. You may use it to identify the items you need for your research.

Downloading the JSTOR bibliographic metadata file

To download the JSTOR bibliographic metadata file:

Log into the JSTOR text analysis support page with your personal JSTOR account.
- If you don't already have a JSTOR account, you may register for a JSTOR account for free.
Under Download metadata of JSTOR content available for text analysis, read and agree to the Terms and Conditions and then select the Download JSONL button.

Creating an item ID list

Once you've downloaded the JSTOR bibliographic metadata file and understand the data available in it, you'll need to create a list of JSTOR item IDs to submit as part of your dataset request.

Item ID list formatting requirements

Prior to submission, your ID list must meet the following requirements:

A UTF-8 plain text file with the extension .txt
One JSTOR item ID per line
No additional text (no header, commas, or other separators)
Maximum of 1,500,000 lines
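
Before you submit, you can check a finished list against these requirements. The following is a minimal sketch, not an official JSTOR tool: the function name check_id_list is hypothetical, and it assumes that every item ID is a UUID written in the hyphenated form shown in the data dictionary.

```python
import uuid
from pathlib import Path

MAX_LINES = 1_500_000  # line limit stated in the requirements above


def check_id_list(path):
    """Return a list of problems found in an item ID list file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".txt":
        problems.append("file extension is not .txt")
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines exceeds the maximum of {MAX_LINES}")
    for number, line in enumerate(lines, start=1):
        try:
            # Assumption: item IDs are UUIDs in the hyphenated form, so a header,
            # a trailing comma, or extra whitespace makes the line invalid
            if str(uuid.UUID(line)) != line.lower():
                raise ValueError
        except ValueError:
            problems.append(f"line {number} is not a single item ID: {line!r}")
    return problems
```

Calling check_id_list("rembrandt_item_ids.txt") returns an empty list when the file passes every check, and one message per problem otherwise.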

While you can use a variety of tools to create this list, because of the size of the metadata file when uncompressed, we recommend using tools that can stream the file rather than load it entirely into memory. See an example of stream processing using Python.

Other free and open source utilities that we recommend are:

R and the jsonlite package for streaming JSON data
The command line utility jq

Example: Using Python to filter metadata and create an item ID list

In the following example, we want to request all items that don’t require review to download, and whose titles contain the word “Rembrandt”.

import gzip
import json

item_ids = []

# Replace with the name of the most recent metadata file you downloaded
# from https://www.jstor.org/ta-support/metadata
metadata_file = "jstor_metadata_2025-06-20.jsonl.gz"

# gzip.open in text mode reads the file line by line without loading it all into memory
with gzip.open(metadata_file, "rt", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        # Print progress every 10,000 lines to avoid overwhelming Jupyter
        if line_number % 10000 == 0:
            print(f"\rProcessing line: {line_number}", end="", flush=True)

        data = json.loads(line)
        # Put your filtering logic here

        # Keep only items that will not require staff review
        if data.get("review_required") is False:
            # Check if the item has a title
            if data.get("title") is not None:
                # Check if the title contains the term "Rembrandt"
                if "rembrandt" in data["title"].lower():
                    item_ids.append(data["item_id"])
    print()  # Move to the next line after processing

# Write the item IDs to a UTF-8 text file, one per line
output_file = "rembrandt_item_ids.txt"
with open(output_file, "w", encoding="utf-8") as f:
    for item_id in item_ids:
        f.write(f"{item_id}\n")

print("Output file: " + output_file)

Save this script as jstor_filter.py in the same directory as the downloaded metadata file, update the file name in the Python code to match the file you downloaded (the metadata file is updated daily), and run it using:

python jstor_filter.py

This will produce the TXT file rembrandt_item_ids.txt, which is ready to upload with your JSTOR text analysis dataset request.
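
If you expect to try several filters, it may help to separate the filtering rule from the file handling. The sketch below is one possible restructuring of the example above, not a JSTOR-provided utility: the function names iter_item_ids, write_id_list, and rembrandt_without_review are hypothetical. It writes each ID as soon as it is found and stops at the 1,500,000-line limit, so a very broad filter is silently truncated rather than producing an oversized file.

```python
import gzip
import json


def iter_item_ids(metadata_path, keep):
    """Yield the item_id of every record for which keep(record) is true."""
    with gzip.open(metadata_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if keep(record):
                yield record["item_id"]


def write_id_list(item_ids, output_path, limit=1_500_000):
    """Write one ID per line, stopping at the line limit; return the count written."""
    count = 0
    with open(output_path, "w", encoding="utf-8", newline="\n") as out:
        for item_id in item_ids:
            if count >= limit:
                break
            out.write(f"{item_id}\n")
            count += 1
    return count


def rembrandt_without_review(record):
    """Same rule as the example above: no review required, title mentions Rembrandt."""
    title = record.get("title") or ""  # treat a missing or null title as empty
    return record.get("review_required") is False and "rembrandt" in title.lower()
```

To use it, pass the name of your downloaded metadata file, for example write_id_list(iter_item_ids("jstor_metadata_2025-06-20.jsonl.gz", rembrandt_without_review), "rembrandt_item_ids.txt"). The return value is the number of IDs written.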

Working with JSON Lines

The JSTOR bibliographic metadata file is made available in JSON Lines file format, compressed with gzip. JSON Lines is a file format for storing structured data, where each line is a valid JSON value.

The file contains one JSON object per line, each representing the bibliographic metadata for one JSTOR item. This allows programs to process the file one record at a time without needing to load the entire dataset into memory at once.

Visit JSON Lines for documentation on the JSON Lines file format.
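
Because each line is a complete JSON object, you can inspect a few records without decompressing or reading the whole file. The following is a minimal sketch; the function name peek is hypothetical.

```python
import gzip
import json
from itertools import islice


def peek(path, count=3):
    """Return the first `count` records of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after `count` lines, so the rest of the file is never read
        return [json.loads(line) for line in islice(f, count)]
```

For example, peek("jstor_metadata_2025-06-20.jsonl.gz") returns a list of three dictionaries whose keys are the fields described in the data dictionary.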

JSON data dictionary

You'll encounter the following fields in the JSTOR bibliographic metadata JSON Lines file. Use these fields to identify the content whose full text you want for your research.

Field	Type	Description	Example
item_id	UUID	Unique identifier for a JSTOR item.	2c0018ee-094b-3f3c-b677-2b56c0f73b7e
review_required	Boolean	If this document is included in a dataset request, will that request require ITHAKA staff review? Items where review_required is False are generally items that are Open Access.	true
ithaka_doi	String	A DOI-like identifier for a JSTOR item. Note that these are not guaranteed to be DOIs resolvable by CrossRef.	10.2307/25057135
identifiers	Dictionary with the following fields: print_isbn, online_isbn, print_issn, online_issn, ssid, catsid, journal_code, aluka_doi	Note that some fields are identifiers for a parent item, such as ISBNs for the book that contains a book chapter, or ISSNs for the serial title that contains a research article.	{"print_isbn":null,"online_isbn":null,"print_issn":"00128163","online_issn":null,"ssid":null,"catsid":null,"journal_code":"earlamerlite","aluka_doi":null}
title	String	Title of the item. Note that title strings may contain subtitles or whitespace such as line breaks.	Book Review : Captivity and Sentiment: Cultural Exchange in American Literature, 1682-1861 Michelle Burnham
isPartOf	String	Title of the parent item, if it exists. For book chapters, this is the book title. For research articles, this is the journal title.	Early American Literature
creators_string	String	Text string with names of the item creators, which may include authors as well as editors.	Christopher Castiglia, Michelle Burnham
creators	List of dictionaries with the following fields: first_name: string last_name: string order: integer	When possible, JSTOR attempts to parse the raw creator string information to separate out creators as structured data.	[{"first_name":"Christopher","last_name":"Castiglia","order":"1"},{"first_name":"Michelle","last_name":"Burnham","order":"2"}]
publishers	List of strings	Publisher(s) of the item.	["University of North Carolina Press"]
published_date	String date in the format YYYY-MM-DD	Publication date of the item.
languages	List of strings	Language(s) of the item. Note that these language codes are not fully normalized, and contain a blend of ISO 639-1 and ISO 639-2 codes.	["eng"]
discipline_names	List of strings	JSTOR discipline headings for the item, usually set at the serial title level. See JSTOR: Browse by Subject.	["Language & Literature","American Studies"]
issue_number	String	Issue number, if known. Note that this field can contain punctuation.	3
issue_volume	String	Volume number, if known. Note that this field can contain punctuation.	33
content_type	String	High-level categorization of the content types available on JSTOR (Books, Journals, Research Reports aka "Academic content" and Primary Sources), applied by JSTOR.	article
content_subtype	String	A more granular classification of article or book part types available on JSTOR, applied by JSTOR or by journal publishers who submitted content to JSTOR via CSP/JHP. Also used as a more granular classification of contributed primary source content published through JSTOR Digital Stewardship Services, applied by JSTOR.	book-review
c5_data_type	String	JSTOR maps content types to COUNTER data types for purposes of COUNTER reports provided to subscribing libraries. See Section 3.3.2 of the COUNTER Code of Practice Release 5.1.	Journal
c5_section_type	String	Section type was deprecated with COUNTER Release 5.1. For reference, see Section 3.3.3 of the COUNTER Code of Practice Release 5.0.3.	Article
ccda_resource_type	String	High-level categorization of the primary source content types available on JSTOR, applied by JSTOR. Each of the more granular Resource Types (see ccda_resource_subtype) also rolls up to a broad Content Type which enables the narrowing of search results through facets and determines the format for item downloads on JSTOR. See JSTOR Content and Resource Types.	null
ccda_resource_subtype	String	A more granular classification that reflects the nature or genre of the content of the resource. Cataloged by the contributing institution from a JSTOR controlled list. Each Resource Type also rolls up to a broad Content Type (ccda_resource_type) which enables the narrowing of search results through facets and determines the format for item downloads on JSTOR. See JSTOR Content and Resource Types.	null
contributed_content	Boolean	Is this item part of content contributed to JSTOR outside of the regular journal archive collections?	false
collections	List of strings	One or more JSTOR collections that this item belongs to.	["Arts & Sciences V Collection","Corporate & For-Profit Collection"]
licensing_status	String	More details about the licensing status of the item when it is known.	open_access:CC BY-NC
url	String	Stable URL for the item on JSTOR.	www.jstor.org/stable/10.2307/25057135
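
The fields above can be combined into more specific filters. The sketch below is illustrative only: the function name matches and its parameters are hypothetical, and it assumes that any field may be missing or null. Because language codes are not fully normalized, it accepts a set of equivalent codes (such as "en" and "eng"). The sample record uses the example values from the table, except for published_date, which is a made-up placeholder.

```python
def matches(record, discipline, language_codes, first_year, last_year):
    """Return True if a metadata record meets all of the given criteria."""
    # Assumption: any field may be absent or null, so fall back to empty values
    disciplines = record.get("discipline_names") or []
    languages = record.get("languages") or []
    published = record.get("published_date") or ""
    if discipline not in disciplines:
        return False
    # Codes mix ISO 639-1 and ISO 639-2, so accept any of the codes supplied
    if not set(languages) & set(language_codes):
        return False
    # published_date is documented as YYYY-MM-DD: the year is the first four characters
    year = published[:4]
    return year.isdecimal() and first_year <= int(year) <= last_year


sample = {
    "discipline_names": ["Language & Literature", "American Studies"],
    "languages": ["eng"],
    "published_date": "1998-01-01",  # placeholder value, not taken from the metadata file
}
print(matches(sample, "American Studies", {"en", "eng"}, 1990, 1999))  # True
```

A function like this can take the place of the nested if statements in the filtering example, called once for each record read from the file.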