xml - Load a table from Wikipedia into R
I'm trying to load the table of Supreme Court justices into R from the following URL: https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States
I'm using the following code:
scotusurl <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
scotusdata <- getURL(scotusurl, ssl.verifypeer = FALSE)
scotusdoc <- htmlParse(scotusdata)
scotusdata <- scotusdoc['//table[@class="wikitable"]']
scotustable <- readHTMLTable(scotusdata[[1]], stringsAsFactors = FALSE)
R returns scotustable as NULL. The goal here is to get a data.frame in R that I can use to make a ggplot of SCOTUS justice tenure on the court. I previously had the script working and it made an awesome plot, but after the recent decisions something changed on the page and now the script does not function. I went through the HTML on Wikipedia to try to find the changes, but I'm not a web developer, so whatever would break my script isn't apparent to me.
Additionally, is there a method in R that would allow me to cache the data from this page so I'm not constantly referencing the URL? That would seem to be the ideal way to avoid this issue in the future. I appreciate the help.
As an aside, SCOTUS is an ongoing hobby/side project of mine, so if there's another data source out there that's better than Wikipedia, I'm all ears.
Edit: Sorry, I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table, and ggplot2 libraries.
If you don't mind using a different package, you can try the "rvest" package.
library(rvest)
scotusurl <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
Option 1: Grab all the tables from the page and use the html_table function to extract the tables you're interested in.
## older rvest releases used html() where read_html() appears here
temp <- scotusurl %>% read_html %>% html_nodes("table")
html_table(temp[1]) ## the "legend" table
html_table(temp[2]) ## the table you're interested in
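Since the original script broke when the page layout changed, picking the table by position (temp[2]) may break again the same way. One possible safeguard is to select the table by the column names you expect. The sketch below uses only base R; pick_table is a hypothetical helper, and the stand-in data frames and their column names are made up for illustration rather than taken from the Wikipedia page.

```r
# Hypothetical helper: return the first data frame whose column names
# include every name in `required`, or NULL if none match.
pick_table <- function(tables, required) {
  for (tbl in tables) {
    if (is.data.frame(tbl) && all(required %in% names(tbl))) {
      return(tbl)
    }
  }
  NULL
}

# Stand-in data shaped like a list of scraped tables; the column names
# and rows are illustrative, not copied from the Wikipedia page.
tables <- list(
  data.frame(Symbol = c("a", "b"), Meaning = c("x", "y"),
             stringsAsFactors = FALSE),
  data.frame(Justice = c("Justice A", "Justice B"),
             Tenure = c("1 year", "2 years"),
             stringsAsFactors = FALSE)
)

# Select by header instead of by position in the page.
justices <- pick_table(tables, c("Justice", "Tenure"))
```

With rvest, the same helper could be applied to the list that html_table returns for the whole node set, assuming you first check which column names the live page actually uses.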
Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, Inspect Element, scroll to the relevant "table" tag, right-click on it, and select "Copy XPath").
## older rvest releases used html() where read_html() appears here
scotusurl %>%
  read_html %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  html_table
Another option would be loading the data into a Google Spreadsheet and reading it using the "googlesheets" package.
In Google Drive, create a new spreadsheet named, for instance, "Supreme Court". In the first worksheet, enter:
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)
This will automatically scrape the table into the Google Spreadsheet.
From there, in R you can do:
library(googlesheets)
sc <- gs_title("Supreme Court")
gs_read(sc)
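On the caching part of the question, one simple approach is to save the scraped table to a local file and read it back on later runs, so the URL is only hit when you explicitly ask for a refresh. This is a minimal base-R sketch, not a definitive implementation: cached and fake_scrape are hypothetical names, and fake_scrape stands in for whichever scraping code you end up using so the example runs offline.

```r
# Hypothetical helper: return the cached object if `cache_file` exists,
# otherwise call `fetch()` once, save the result, and return it.
cached <- function(cache_file, fetch, refresh = FALSE) {
  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Stand-in for the real scraping step (assumption: yours returns a
# data.frame); it counts how often it is actually called.
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  data.frame(Justice = c("Justice A", "Justice B"),
             stringsAsFactors = FALSE)
}

cache_file <- file.path(tempdir(), "scotus_table.rds")
unlink(cache_file)                          # start from an empty cache
first  <- cached(cache_file, fake_scrape)   # scrapes and writes the file
second <- cached(cache_file, fake_scrape)   # read from disk, no scrape
```

In real use you would point cache_file at a path inside your project rather than tempdir(), and pass refresh = TRUE whenever the page has been updated and you want fresh data.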