Goodreads Datasets
The datasets were collected in late 2017 from goodreads.com. We scraped only users' public shelves, i.e., pages that anyone can view on the web without logging in. User IDs and review IDs are anonymized.
We collected these datasets for academic use only. Please do not redistribute them or use them for commercial purposes.
If you are using our datasets, please cite the following papers:
- Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18. [bibtex]
- Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19. [bibtex]
If you have any questions or find any bugs regarding these datasets, feel free to contact Mengting Wan (m5wan@ucsd.edu).
Latest Updates
We updated several files in May 2019. We really appreciate those who helped us identify duplicates and bugs in the previous version!
- A GitHub repo has been created; it includes a few Jupyter notebooks showing how to load the datasets and perform some basic data exploration.
- [May 2019] Review files are uploaded.
- [May 2019] Interaction files are updated: duplicates and mismatches are removed.
- [May 2019] Meta-data of books are updated: text descriptions are normalized; popular shelf names with negative counts are removed.
Overview
We collected three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged together by matching book/user/review ids.
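As a rough illustration of merging the three groups by id, the sketch below joins toy in-memory records. All field names (`book_id`, `user_id`, `review_id`, `title`, `rating`, `review_text`) and the helper `join_records` are assumptions made for this example; check the sample records on the dataset pages for the actual schema.

```python
# Toy records; every field name here is an assumption for illustration.
books = [
    {"book_id": "b1", "title": "Book One"},
    {"book_id": "b2", "title": "Book Two"},
]
interactions = [
    {"user_id": "u1", "book_id": "b1", "rating": 5},
    {"user_id": "u2", "book_id": "b1", "rating": 3},
    {"user_id": "u1", "book_id": "b2", "rating": 4},
]
reviews = [
    {"review_id": "r1", "user_id": "u1", "book_id": "b1", "review_text": "Loved it."},
]


def join_records(books, interactions, reviews):
    """Attach the book title and any matching review id to each interaction."""
    book_by_id = {b["book_id"]: b for b in books}
    # A review is matched to an interaction by its (user, book) pair.
    review_by_key = {(r["user_id"], r["book_id"]): r for r in reviews}
    merged = []
    for row in interactions:
        book = book_by_id.get(row["book_id"], {})
        review = review_by_key.get((row["user_id"], row["book_id"]))
        merged.append({
            **row,
            "title": book.get("title"),
            "review_id": review["review_id"] if review else None,
        })
    return merged


merged = join_records(books, interactions, reviews)
```

Building dictionaries keyed by id keeps the join linear in the number of records, which matters once the toy lists are replaced by the real files.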
Basic Statistics of the Complete Book Graph:
- 2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
- 876,145 users; 228,648,342 user-book interactions in users' shelves (including 112,131,203 reads and 104,551,549 ratings)
- Before the May 2019 update, which removed duplicates from the interaction files, the counts were 229,154,523 user-book interactions (including 112,310,716 reads and 104,713,520 ratings) for the same 876,145 users.
Note that the complete interaction dataset is very large! We extracted several medium-size subsets by genre, and we recommend using these subsets for experimentation first (see "By Genre" for details).
(Meta-Data of Books)
We collected detailed meta-data about 2.36M books. Please see the "Books" page for dataset details and sample records.
Quick links:
- Complete book graph: goodreads_books.json.gz
- Author information: goodreads_book_authors.json.gz
- Work information: goodreads_book_works.json.gz
- Book series: goodreads_book_series.json.gz
- Fuzzy book genres: gooreads_book_genres_initial.json.gz
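A minimal sketch of reading one of the `.json.gz` files without loading it fully into memory. It assumes each file stores one JSON object per line, which is not stated on this page; the helper name `iter_json_gz` is hypothetical.

```python
import gzip
import json
from itertools import islice


def iter_json_gz(path, limit=None):
    """Yield one dict per non-empty line of a gzipped JSON-lines file.

    `limit` caps the number of lines read; None reads the whole file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in islice(handle, limit):
            line = line.strip()
            if line:
                yield json.loads(line)


# Example use (assumes the file has been downloaded to the working directory):
# for book in iter_json_gz("goodreads_book_series.json.gz", limit=5):
#     print(sorted(book))
```

Peeking at the first few records this way is a cheap check of the field names before committing to a full pass over a multi-gigabyte file.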
(User-Book Interactions)
We collected more than 229M user-book interactions. Please see the "Shelves" page for dataset details and sample records.
Quick links (These files could be very large! Consider using genre-wise datasets if your resources are limited.):
- Complete 229M interactions in CSV format (~4.1 GB): goodreads_interactions.csv
- User IDs: user_id_map.csv
- Book IDs: book_id_map.csv
- Contact Mengting Wan (m5wan@ucsd.edu) if you need a detailed version
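Because the complete interaction file is large, one option is to stream it row by row rather than load it at once. The sketch below assumes the CSV has a header with `is_read` and `rating` columns, that a rating of `0` means "not rated", and that the map files hold two columns linking the CSV id to the original id. None of these column names are confirmed on this page, so verify them against the actual headers; `load_id_map` and `summarize_interactions` are hypothetical helpers.

```python
import csv


def load_id_map(path, csv_col, original_col):
    """Read a two-column mapping file into {csv_id: original_id}."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[csv_col]: row[original_col] for row in csv.DictReader(handle)}


def summarize_interactions(path):
    """Stream the interaction CSV once and count rows, reads, and ratings."""
    totals = {"interactions": 0, "reads": 0, "ratings": 0}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            totals["interactions"] += 1
            # Assumption: is_read is stored as "1" for read, "0" otherwise.
            totals["reads"] += row["is_read"] == "1"
            # Assumption: a rating of "0" means the user did not rate the book.
            totals["ratings"] += row["rating"] != "0"
    return totals


# Example use (column names are assumptions):
# user_ids = load_id_map("user_id_map.csv", "user_id_csv", "user_id")
# print(summarize_interactions("goodreads_interactions.csv"))
```

Only the running totals stay in memory, so the same pass works on the complete file and on the genre subsets.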
(Book Review Texts)
We further re-scraped more than 15M records with detailed review text. Please see the "Reviews" page for details and sample records.
Quick links:
- Complete 15.7M reviews (~5 GB): goodread_reviews_dedup.json.gz
- Review subset (~1.38M reviews) with parsed spoiler tags: goodreads_reviews_spoiler.json.gz
- Spoiler subset with original review text: goodreads_reviews_spoiler_raw.json.gz
(Operate the Datasets)
We created several Jupyter notebooks to illustrate how to download and read these datasets, and to provide some basic explorations of the data.
Quick links:
- README!
- Download datasets without GUI: download.ipynb
- Display sample records: samples.ipynb
- Calculate basic statistics: statistics.ipynb
- Explore the interaction data: distributions.ipynb
- Explore the review data: reviews.ipynb
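For readers who prefer a plain script to a notebook, downloading without a GUI could look roughly like the sketch below. This is not the content of download.ipynb; the helper `download` is hypothetical, and the URL must be copied from the download link on this page because no base URL is assumed here.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest_dir="."):
    """Stream `url` into `dest_dir`, skipping files that are already present."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.rename(target)
    return target


# Example use; replace the placeholder with a real link from this page:
# download("<link to goodreads_books_poetry.json.gz>", "data")
```

Streaming with `shutil.copyfileobj` avoids holding a multi-gigabyte response in memory.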
By Genre
- Interaction densities differ across the genre subsets.
- Books can overlap across genres (i.e., one book may belong to multiple genres).
- The (similar) book graph for each genre may not be self-contained; each one is just a subset of the nodes of the complete book graph (see the meta-data section).
- Detailed information about authors, works, book series etc. can be found in the meta-data section.
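The two structural points above (genre overlap, and book graphs that are not self-contained) can be checked with a few set operations. The sketch below uses toy records and assumes each book carries a `book_id` and a `similar_books` list of ids; these field names and both helper functions are assumptions for illustration.

```python
def shared_books(genre_a, genre_b):
    """Return ids of books that appear in both genre subsets."""
    return {b["book_id"] for b in genre_a} & {b["book_id"] for b in genre_b}


def outside_subset(genre_books):
    """Return similar-book ids that are referenced but absent from the subset."""
    ids_in_subset = {book["book_id"] for book in genre_books}
    referenced = set()
    for book in genre_books:
        referenced.update(book.get("similar_books", []))
    return referenced - ids_in_subset


# Toy subsets; the ids are made up.
poetry = [
    {"book_id": "1", "similar_books": ["2", "9"]},
    {"book_id": "2", "similar_books": []},
]
children = [
    {"book_id": "2", "similar_books": ["1"]},
    {"book_id": "3", "similar_books": []},
]

overlap = shared_books(poetry, children)
dangling = outside_subset(poetry)
```

Any id returned by `outside_subset` would need to be looked up in the complete book graph rather than in the genre file.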
Children
Download Links:
- goodreads_books_children.json.gz (124,082 books)
- goodreads_interactions_children.json.gz (10,059,349 interactions)
- goodreads_reviews_children.json.gz (734,640 detailed reviews)
Comics & Graphic
Download Links:
- goodreads_books_comics_graphic.json.gz (89,411 books)
- goodreads_interactions_comics_graphic.json.gz (7,347,630 interactions)
- goodreads_reviews_comics_graphic.json.gz (542,338 detailed reviews)
Fantasy & Paranormal
Download Links:
- goodreads_books_fantasy_paranormal.json.gz (258,585 books)
- goodreads_interactions_fantasy_paranormal.json.gz (55,397,550 interactions)
- goodreads_reviews_fantasy_paranormal.json.gz (3,424,641 detailed reviews)
History & Biography
Download Links:
- goodreads_books_history_biography.json.gz (302,935 books)
- goodreads_interactions_history_biography.json.gz (31,479,229 interactions)
- goodreads_reviews_history_biography.json.gz (2,066,193 detailed reviews)
Mystery, Thriller & Crime
Download Links:
- goodreads_books_mystery_thriller_crime.json.gz (219,235 books)
- goodreads_interactions_mystery_thriller_crime.json.gz (24,799,896 interactions)
- goodreads_reviews_mystery_thriller_crime.json.gz (1,849,236 detailed reviews)
Poetry
Download Links:
- goodreads_books_poetry.json.gz (36,514 books)
- goodreads_interactions_poetry.json.gz (2,734,350 interactions)
- goodreads_reviews_poetry.json.gz (154,555 detailed reviews)
Romance
Download Links:
- goodreads_books_romance.json.gz (335,449 books)
- goodreads_interactions_romance.json.gz (42,792,856 interactions)
- goodreads_reviews_romance.json.gz (3,565,378 detailed reviews)
Young Adult
Download Links:
- goodreads_books_young_adult.json.gz (93,398 books)
- goodreads_interactions_young_adult.json.gz (34,919,254 interactions)
- goodreads_reviews_young_adult.json.gz (2,389,900 detailed reviews)