Title: Tidy Analysis of Wikipedia
Description: Access 'Wikipedia' through several 'MediaWiki' APIs (<https://www.mediawiki.org/wiki/API>), as well as through the 'XTools' API (<https://www.mediawiki.org/wiki/XTools/API>). Ensure your API calls are correct, and receive results in tidy tibbles.
Authors: Michael Falk [aut, cre, cph]
Maintainer: Michael Falk <[email protected]>
License: MIT + file LICENSE
Version: 0.1.14.9000
Built: 2025-02-13 21:21:08 UTC
Source: https://github.com/wikihistories/wikkitidy
Any two revisions of a Wikipedia page can be compared using the 'diff' tool. The tool compares the 'from' revision to the 'to' revision, looking for insertions, deletions or relocations of text. This operation can be performed in any order, across any span of revisions.
get_diff(from, to, language = "en", simplify = TRUE)
from | Vector of revision ids
to | Vector of revision ids
language | Vector of two-letter language codes (will be recycled if length == 1)
simplify | Logical: should R simplify the result (see Value)
The return value depends on the simplify parameter.

If simplify == TRUE: A list of tibble::tbl_df objects the same length as from and to. Most of the response data is stripped away, leaving just the textual differences between the revisions, their location, type and 'highlightRanges' if the textual differences are complicated.

If simplify == FALSE: A list the same length as from and to containing the full wikidiff2 response for each pair of revisions. This response includes additional data for displaying diffs onscreen.
# Compare revision 847170467 to 851733941 on English Wikipedia
get_diff(847170467, 851733941)

# The function is vectorised, so you can compare multiple pairs of revisions
# in a single call
# See diffs for the last two revisions of the Main Page
revisions <- wiki_action_request() %>%
  query_by_title("Main Page") %>%
  query_page_properties(
    "revisions",
    rvlimit = 2, rvprop = "ids", rvdir = "older"
  ) %>%
  gracefully(next_result)
if (tibble::is_tibble(revisions)) {
  revisions <- revisions %>%
    tidyr::unnest(cols = c(revisions)) %>%
    dplyr::mutate(diffs = get_diff(from = parentid, to = revid))
  print(revisions)
}
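Because simplify defaults to TRUE, the calls above return stripped-back tibbles. A minimal sketch of inspecting the full wikidiff2 response instead, reusing the revision ids from the example above:

# Fetch the raw wikidiff2 response for one pair of revisions
raw_diff <- get_diff(847170467, 851733941, simplify = FALSE)
# Each element is a plain R list mirroring the wikidiff2 JSON
str(raw_diff[[1]], max.level = 1)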
Count how many times Wikipedia articles have been edited
get_history_count(
  title,
  type = c("edits", "anonymous", "bot", "editors", "minor", "reverted"),
  from = NULL,
  to = NULL,
  language = "en",
  failure_mode = c("error", "quiet")
)
title | A vector of article titles
type | The type of edit to count: one of "edits", "anonymous", "bot", "editors", "minor" or "reverted"
from | Optional: a vector of revision ids
to | Optional: a vector of revision ids
language | Vector of two-letter language codes for Wikipedia editions
failure_mode | What to do if no data is found: either "error" (the default) or "quiet" (see get_rest_resource)
A tibble::tbl_df with two columns:
'count': integer, the number of edits of the given type
'limit': logical, whether the 'count' exceeds the API's limit. Each type of edit has a different limit. If the 'count' exceeds the limit, then the limit is returned as the count and 'limit' is set to TRUE.
# Get the number of edits made by auto-confirmed editors to a page between
# revisions 384955912 and 406217369
get_history_count(
  title = "Jupiter",
  type = "editors",
  from = 384955912,
  to = 406217369,
  failure_mode = "quiet"
)

# Compare which authors have the most edit activity
authors <- tibble::tribble(
  ~author,
  "Jane Austen",
  "William Shakespeare",
  "Emily Dickinson"
) %>%
  dplyr::mutate(get_history_count(author, failure_mode = "quiet"))
authors
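The 'limit' column matters when counting edits on heavily edited pages. A small sketch of flagging capped counts; the page titles are purely illustrative:

edit_counts <- tibble::tibble(title = c("Jupiter", "Saturn")) %>%
  dplyr::mutate(get_history_count(title, type = "edits", failure_mode = "quiet"))
# Rows where 'count' is the API's cap rather than an exact figure
dplyr::filter(edit_counts, limit)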
next_result() sends exactly one request to the server.
next_batch() requests results from the server until the data is complete for the latest batch of pages in the result.
retrieve_all() keeps requesting data until all the pages from the query have been returned.
next_result(x)

next_batch(x)

retrieve_all(x)
x | The query. Either a wiki_action_request or a query_tbl.
It is rare that a query can be fulfilled in a single request to the server. There are two ways a query can be incomplete. All queries return a list of pages as their result. The result may be incomplete because not all the data for each page has been returned: in this case the batch is incomplete. Or the data may be complete for all pages, but there are more pages available on the server: in this case the query can be continued. Hence the three functions next_result(), next_batch() and retrieve_all().
A query_tbl containing the results of the query. If x is a query_tbl, then the function will return a new query_tbl with the new data appended to it. If x is a wiki_action_request, then the returned query_tbl will contain the necessary data to supply future calls to next_result(), next_batch() or retrieve_all().
# Try out a request using next_result(), then retrieve the rest of the
# results. The cllimit limits the first request to 40 results.
preview <- wiki_action_request() %>%
  query_by_title("Steve Wozniak") %>%
  query_page_properties("categories", cllimit = 40) %>%
  gracefully(next_result)
preview

all_results <- preview %>%
  gracefully(retrieve_all)
all_results

# tidyr is useful for list-columns.
if (tibble::is_tibble(all_results)) {
  all_results %>%
    tidyr::unnest(cols = c(categories), names_sep = "_")
}
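A related sketch, using next_batch() to complete just the current batch before deciding whether to fetch everything; the query mirrors the one above:

first_batch <- wiki_action_request() %>%
  query_by_title("Steve Wozniak") %>%
  query_page_properties("categories", cllimit = 40) %>%
  gracefully(next_batch)
# first_batch now holds complete data for the current batch of pages;
# retrieve_all() would continue from here if further pages remain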
This function is intended for developer use. It makes it easy to quickly generate vectorised calls to the different APIs.
get_rest_resource(
  ...,
  language = "en",
  api = c("core", "wikimedia", "wikimedia_org", "xtools"),
  response_format = c("json", "html"),
  response_type = NULL,
  failure_mode = c("error", "quiet")
)
... | <dynamic-dots>
language | Character vector of two-letter language codes
api | The desired REST API: "core", "wikimedia", "wikimedia_org", or "xtools"
response_format | The expected Content-Type of the response. Currently "html" and "json" are supported.
response_type | The schema of the response. If supplied, the results will be parsed using the schema.
failure_mode | How to respond if a request fails: "error" (the default) raises an error; "quiet" silently returns NA and includes the http error code in the response
The key invariant to maintain is the number of rows. Users ought to be able to use this function with dplyr::mutate, which requires the number of rows to be invariant.
A list of responses. If response_format == "json", then the responses will be simple R lists. If response_format == "html", then the responses will be xml_document objects. If response_type is supplied, the response will be coerced into a tibble::tbl_df or vector using the relevant schema. If the response is a 'scalar list' (i.e. a list of length == 1), then it is silently unlisted, returning a simple list or vector.
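A hypothetical sketch of the dplyr::mutate pattern described above, assuming the unnamed arguments are treated as path components of the resource (here the Wikimedia REST 'page/summary' endpoint) and that the function is vectorised over them:

pages <- tibble::tibble(title = c("Earth", "Moon")) %>%
  dplyr::mutate(
    # one response per input row, as the row-count invariant requires
    summary = get_rest_resource(
      "page", "summary", title,
      api = "wikimedia", failure_mode = "quiet"
    )
  )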
The main purpose of this function is to enable examples using live resources in the documentation. Examples must not throw errors, according to CRAN policy. If you wrap a requesting method in gracefully, then any errors of type httr2_http will be caught and no error will be thrown.
gracefully(request_object, request_method)
request_object | A request object, e.g. built with httr2::request() or wiki_action_request()
request_method | The desired function for performing the request, typically one of those in get_query_results
The output of request_method called on request_object, if the request was successful. Otherwise an httr2_response object with details of the failed request.
# This fails without throwing an error
req <- httr2::request(httr2::example_url()) |>
  httr2::req_url_path("/status/404")
resp <- gracefully(req, httr2::req_perform)
print(resp)

# This request succeeds
req <- httr2::request(httr2::example_url())
resp <- gracefully(req, httr2::req_perform)
print(resp)
Construct a new query to a generator module of the Action API. This low-level constructor only performs basic type-checking. It is your responsibility to ensure that the chosen generator is an existing API endpoint, and that you have composed the query correctly. For a more user-friendly interface, use query_generate_pages.
new_generator_query(.req, generator, ...)
.req | A query/action_api/httr2_request object, e.g. generated by wiki_action_request()
generator | The generator to add to the query. If the generator is based on a property module, then the request must already specify which pages to start from (see new_prop_query)
... | <dynamic-dots>
The output type depends on the input. If .req is a query/action_api/httr2_request, then the output will be a generator/query/action_api/httr2_request. If .req is a prop/query/action_api/httr2_request, then the return object will be a subclass of the passed request, with "generator" as the first term in the class vector, i.e. generator/(titles|pageids|revids)/prop/query/action_api/httr2_request.
# Build a generator query using a list module
# List all members of Category:Physics on English Wikipedia
physics <- wiki_action_request() %>%
  new_generator_query("categorymembers", gcmtitle = "Category:Physics")

# Build a generator query on a property module
# Generate the pages that are linked to Albert Einstein's page on English
# Wikipedia
einstein_categories <- wiki_action_request() %>%
  new_prop_query("titles", "Albert Einstein") %>%
  new_generator_query("iwlinks")
This low-level constructor only performs basic type checking.
new_list_query(.req, list, ...)

## S3 method for class 'list'
new_list_query(.req, list, ...)

## S3 method for class 'generator'
new_list_query(.req, list, ...)

## S3 method for class 'prop'
new_list_query(.req, list, ...)

## S3 method for class 'query'
new_list_query(.req, list, ...)
.req | A query/action_api/httr2_request object, e.g. generated by wiki_action_request()
list | The list module to add to the query
... | <dynamic-dots>
An object of type list/query/action_api/httr2_request.
# Create a query to list all members of Category:Physics
physics_query <- wiki_action_request() %>%
  new_list_query("categorymembers", cmtitle = "Category:Physics")
The intended use for this query is to set the 'titles', 'pageids' or 'revids' parameter, and to enforce that only one of these is set. All property modules in the Action API require this parameter to be set, or they require a generator parameter to be set instead. The prop/query type is an abstract type representing the three possible kinds of property query that do not rely on a generator (see below on the return value). A complication is that a prop/query can itself be used as the basis for a generator.
new_prop_query(.req, by, pages, ...)
.req | A query/action_api/httr2_request object, e.g. generated by wiki_action_request()
by | The type of page. Allowed values are: pageids, titles, revids
pages | A string, the pages to query by, corresponding to the 'by' parameter. Multiple values should be separated with "|"
... | <dynamic-dots>
A properly qualified prop/query object. There are six possibilities:
titles/prop/query
pageids/prop/query
revids/prop/query
generator/titles/prop/query
generator/pageids/prop/query
generator/revids/prop/query
# Build a query on a set of pageids
# 963273 and 1159171 are Kate Bush albums
bush_albums_query <- wiki_action_request() %>%
  new_prop_query("pageids", "963273|1159171")
get_latest_revision() returns metadata about the latest revision of each page.
get_page_html() returns the rendered html for each page.
get_page_summary() returns metadata about the latest revision, along with the page description and a summary extracted from the opening paragraph.
get_page_related() returns summaries for 20 related pages for each passed page.
get_page_talk() returns structured talk page content for each title. You must use the title of the Talk page itself, e.g. "Talk:Earth" rather than "Earth".
get_page_langlinks() returns interwiki links for each title.
get_latest_revision(title, language = "en", failure_mode = "error")

get_page_html(title, language = "en", failure_mode = "error")

get_page_summary(title, language = "en", failure_mode = "error")

get_page_talk(title, language = "en", failure_mode = "error")

get_page_langlinks(title, language = "en", failure_mode = "error")
title | A character vector of page titles.
language | A character vector of two-letter language codes, either of length 1 or the same length as title
failure_mode | Either "quiet" or "error" (see get_rest_resource)
A list, vector or tibble, the same length as title, with the desired data.
# Get language links for a known page on English Wikipedia
get_page_langlinks("Charles Harpur", failure_mode = "quiet")

# The functions are vectorised over title and language
# Find all articles about Joanna Baillie, and retrieve summary data for
# the first two.
baillie <- get_page_langlinks("Joanna Baillie") %>%
  dplyr::slice(1:2) %>%
  dplyr::mutate(get_page_summary(title = title, language = code, failure_mode = "quiet"))
baillie
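A further sketch of the talk-page convention noted above: the title must name the Talk page itself.

# Retrieve structured talk page content; note the "Talk:" prefix
earth_talk <- get_page_talk("Talk:Earth", failure_mode = "quiet")
earth_talk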
These functions help you to build a query for the MediaWiki Action API if you already have a set of pages that you wish to investigate. These functions can be combined with query_page_properties to choose which properties to return for the passed pages.
query_by_title(.req, title)

query_by_pageid(.req, pageid)

query_by_revid(.req, revid)
.req | A wiki_action_request query to modify
title | A character vector of page titles
pageid | A character or numeric vector of page ids
revid | A character or numeric vector of revision ids
If you don't already know which pages you wish to examine, you can build a query to find pages that meet certain criteria using query_list_pages or query_generate_pages.
A request object of type pages/query/action_api/httr2_request. To perform the query, pass the object to next_batch or retrieve_all.
# Retrieve the categories for Charles Harpur's Wikipedia page
resp <- wiki_action_request() %>%
  query_by_title("Charles Harpur") %>%
  query_page_properties("categories") %>%
  gracefully(next_batch)
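The pageid and revid variants follow the same pattern. A brief sketch using the Kate Bush album pageids that appear elsewhere in this manual:

resp <- wiki_action_request() %>%
  query_by_pageid(c(963273, 1159171)) %>%
  query_page_properties("info") %>%
  gracefully(next_batch)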
These functions provide access to the CategoryMembers endpoint of the Action API.
query_category_members() builds a generator query to return the members of a given category.
build_category_tree() finds all the pages and subcategories beneath the passed category, then recursively finds all the pages and subcategories beneath them, until it can find no more subcategories.
query_category_members(
  .req,
  category,
  namespace = NULL,
  type = c("file", "page", "subcat"),
  limit = 10,
  sort = c("sortkey", "timestamp"),
  dir = c("ascending", "descending", "newer", "older"),
  start = NULL,
  end = NULL,
  language = "en"
)

build_category_tree(category, language = "en")
.req | A wiki_action_request query to modify
category | The category to start from.
namespace | Only return category members from the provided namespace
type | Alternative to namespace: the type of category member to return, one of "file", "page" or "subcat"
limit | The number to return each batch. Max 500.
sort | How to sort the returned category members. 'timestamp' sorts them by the date they were included in the category; 'sortkey' by the category member's unique hexadecimal code
dir | The direction in which to sort them
start | If sorting by 'timestamp', the timestamp to start enumerating from
end | If sorting by 'timestamp', the timestamp to stop enumerating at
language | The language edition of Wikipedia to query
query_category_members(): A request object of type generator/query/action_api/httr2_request, which can be passed to next_batch() or retrieve_all(). You can specify which properties to retrieve for each page using query_page_properties().

build_category_tree(): A list containing two dataframes. nodes lists all the subcategories and pages found underneath the passed categories. edges records the connections between them. The source column gives the pageid of the parent category, while the target column gives the pageid of any categories, pages or files contained within the source category. The timestamp records the moment when the target page or subcategory was included in the source category. The two dataframes in the list can be passed to igraph::graph_from_data_frame for network analysis.
# Get the first 10 pages in 'Category:Physics' on English Wikipedia
physics_members <- wiki_action_request() %>%
  query_category_members("Physics") %>%
  gracefully(next_batch)
physics_members

# Build the tree of all albums for the Melbourne band Custard
tree <- build_category_tree("Category:Custard_(band)_albums")
tree

# For network analysis and visualisation, you can pass the category tree
# to igraph
tree_graph <- igraph::graph_from_data_frame(tree$edges, vertices = tree$nodes)
tree_graph
Many of the endpoints on the Action API can be used as generators. Use list_all_generators() to see a complete list. The main advantage of using a generator is that you can chain it with calls to query_page_properties() to find out specific information about the pages. This is not possible for queries constructed using query_list_pages().
query_generate_pages(.req, generator, ...)

list_all_generators()
.req | A httr2_request, e.g. generated by wiki_action_request()
generator | The generator module you wish to use. Most list and property modules can be used, though not all.
... | <dynamic-dots>
There are two kinds of generator: list-generators and prop-generators. If using a prop-generator, then you need to use a query_by_() function to tell the API where to start from, as shown in the examples. To set additional parameters to a generator, prepend the parameter with "g". For instance, to set a limit of 10 on the number of pages returned by the categorymembers generator, set the parameter gcmlimit = 10.
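A brief sketch of the prop-generator pattern described above. It assumes the "categories" property module can be used as a generator here, with gcllimit illustrating the "g" prefix:

# Start from a known page, then generate its categories as pages
harpur_cats <- wiki_action_request() %>%
  query_by_title("Charles Harpur") %>%
  query_generate_pages("categories", gcllimit = 10) %>%
  query_page_properties("info") %>%
  gracefully(next_batch)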
query_generate_pages: The modified request, which can be passed to next_batch or retrieve_all as appropriate.
list_all_generators: a tibble of all the available generator modules. The name column gives the name of the generator, while the group column indicates whether the generator is based on a list module or a property module. Generators based on property modules can only be added to a query if you have already used query_by_ to specify which pages' properties should be generated.
# Search for articles about seagulls
seagulls <- wiki_action_request() %>%
  query_generate_pages("search", gsrsearch = "seagull") %>%
  gracefully(next_batch)
seagulls
See API:Lists for available list actions. Each list action returns a list of pages, typically including their pageid, namespace and title. Individual lists have particular properties that can be requested, which are usually prefixed with a short code based on the name of the list (e.g. properties specific to the categorymembers list action are prefixed with cm).
query_list_pages(.req, list, ...)

list_all_list_modules()
.req | A httr2_request, e.g. generated by wiki_action_request()
list | The type of list to return
... | <dynamic-dots>
When the request is performed, the data is returned in the body of the response under the query object, labeled by the chosen list action.
If you want to study the actual pages listed, it is advisable to retrieve the pages directly using a generator, rather than listing their IDs using a list action. When using a list action, a second request is required to get further information about each page. Using a generator, you can query pages and retrieve their relevant properties in a single API call.
A modified HTTP request: an S3 list with class httr2_request.
# Get the ten most recently added pages in Category:Physics
physics_pages <- wiki_action_request() %>%
  query_list_pages(
    "categorymembers",
    cmsort = "timestamp",
    cmdir = "desc",
    cmtitle = "Category:Physics"
  ) %>%
  gracefully(next_batch)
physics_pages
See API:Properties for a list of available properties. Many have additional parameters to control their behavior, which can be passed to this function as named arguments.
query_page_properties(.req, property, ...)

list_all_property_modules()
.req | A httr2_request, e.g. generated by wiki_action_request()
property | The property to request
... | <dynamic-dots>
query_page_properties is not useful on its own. It must be combined with a query_by_ function or query_generate_pages to specify which pages' properties are to be returned. It should be noted that many of the API:Properties modules can themselves be used as generators. If you wish to use a property module in this way, then you must use query_generate_pages, passing the name of the property module as the generator.
A modified HTTP request: an S3 list with class httr2_request.
# Search for articles about seagulls and retrieve their number of
# watchers
resp <- wiki_action_request() %>%
  query_generate_pages("search", gsrsearch = "seagull") %>%
  query_page_properties("info", inprop = "watchers") %>%
  gracefully(next_batch) %>%
  dplyr::select(pageid, ns, title, watchers)
resp
Representation of Wikipedia data returned from an Action API Query module as a tibble, with request metadata stored as attributes.
query_tbl(x, request, continue, batchcomplete)
x | A tibble
request | The httr2_request object used to generate the tibble
continue | The continue parameter returned by the API
batchcomplete | The batchcomplete parameter returned by the API
A tibble: an S3 data.frame with class query_tbl.
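A minimal sketch of the constructor; the tibble contents and the continue/batchcomplete values are purely illustrative assumptions about what the API might return:

req <- wiki_action_request()
tbl <- query_tbl(
  tibble::tibble(pageid = 1L, title = "Earth"),
  request = req,
  continue = list(),   # assumed shape of the API's continue parameter
  batchcomplete = TRUE # assumed shape of the batchcomplete flag
)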
Wikipedia exposes an Action API. To build up a query, you first call wiki_action_request() to create the basic request object, then use the helper functions query_page_properties(), query_list_pages() and query_generate_pages() to modify the request, before calling next_batch() or retrieve_all() to perform the query and download results from the server.
wiki_action_request(..., action = "query", language = "en")
... | <dynamic-dots>
action | The action to perform, typically 'query'
language | The language edition of Wikipedia to request, e.g. 'en' or 'fr'
wikkitidy provides an ergonomic API for the Action API's Query modules. These modules are most useful for researchers, because they allow you to explore the structure of Wikipedia and its back pages. You can obtain a list of available modules in your R console using list_all_property_modules(), list_all_list_modules() and list_all_generators().

An action_api object, an S3 list that subclasses httr2::request.
The dependencies between different aspects of the Action API are complex. At the time of writing, there are five major subclasses of action_api/httr2_request:
generator/action_api/httr2_request, returned (sometimes) by query_generate_pages
list/action_api/httr2_request, returned by query_list_pages
titles, pageids and revids/action_api/httr2_request, returned by the various query_by_ functions
You can use query_page_properties to modify any kind of query except for list queries: indeed, the central limitation of list queries is that you cannot choose what properties to return for the pages that meet the given criterion. The concept of a generator is complex. If the generator is based on a property module, then it must be combined with a query_by_ function to produce a valid query. If the generator is based on a list module, then it cannot be combined with a query_by_ query.
# List the first 10 pages in the category 'Australian historians'
historians <- wiki_action_request() %>%
  query_list_pages(
    "categorymembers",
    cmtitle = "Category:Australian_historians",
    cmlimit = 10
  ) %>%
  gracefully(next_batch)
historians
wikimedia_org_rest_request() builds a request for the wikimedia.org REST API, which provides statistical data about Wikimedia Foundation projects.
xtools_rest_request() builds a request to the XTools API, which provides additional statistical data about Wikimedia Foundation projects.
wikimedia_org_rest_request(endpoint, ..., language = "en")

xtools_rest_request(endpoint, ..., language = "en")
endpoint | The endpoint for the specific kind of request; for wikimedia APIs, this comprises the path components in between the general API endpoint and the component specifying the project to query
... | <dynamic-dots>
language | Two-letter language code for the desired Wikipedia edition.
A wikimedia_org/rest or xtools/rest object, an S3 vector that subclasses httr2::request.
# Build request for articleinfo about Kate Bush's page on English Wikipedia
request <- xtools_rest_request("page/articleinfo", "Kate_Bush")

# Build request for most-viewed pages on German Wikipedia in July 2020
request <- wikimedia_org_rest_request(
  "metrics/pageviews/top", "all-access", "2020", "07", "all-days",
  language = "de"
)
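These constructors only build the request; a short sketch of performing one, wrapped in gracefully() so a failed call does not raise an error:

request <- xtools_rest_request("page/articleinfo", "Kate_Bush")
response <- gracefully(request, httr2::req_perform)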
core_rest_request() builds a request for the MediaWiki Core REST API, the basic REST API available on all MediaWiki wikis.
wikimedia_rest_request() builds a request for the Wikimedia REST API, an additional API just for Wikipedia and other wikis managed by the Wikimedia Foundation.
core_rest_request(..., language = "en")

wikimedia_rest_request(..., language = "en")
... | <dynamic-dots>
language | The two-letter language code for the Wikipedia edition
A core/rest or wikimedia/rest object, an S3 vector that subclasses httr2_request (see httr2::request). The request needs to be passed to httr2::req_perform to retrieve data from the API.
# Get the html of the 'Earth' article on English Wikipedia
response <- core_rest_request("page", "Earth", "html") %>%
  httr2::req_perform()

response <- wikimedia_rest_request("page", "html", "Earth") %>%
  httr2::req_perform()

# Some REST requests take query parameters. Pass these as named arguments.
# To search German Wikipedia for articles about Goethe
response <- core_rest_request("search/page", q = "Goethe", limit = 2, language = "de") %>%
  httr2::req_perform() %>%
  httr2::resp_body_json()
wikkitidy comes bundled with a number of sample files in its inst/extdata directory. This function makes them easy to access.
wikkitidy_example(file = NULL)
file | Name of the file. If NULL, all available example files are listed.
A character vector, containing either the path of the chosen file, or the nicknames of all available example files.
wikkitidy_example()
wikkitidy_example("fatwiki_dump")
get_xtools_page_info() returns basic statistics about articles' history and quality, including their total edits, creation date, and assessment value (good, featured etc.).
get_xtools_page_prose() returns statistics about the word counts and referencing of articles.
get_xtools_page_links() returns the number of ingoing and outgoing links to articles, including redirects.
get_xtools_page_top_editors() returns the list of top editors for articles, with optional filters by date range and non-bot status.
get_xtools_page_assessment() returns more detailed statistics about articles' assessment status and Wikiproject importance levels.
get_xtools_page_info(title, language = "en", failure_mode = "error")

get_xtools_page_prose(
  title,
  language = "en",
  failure_mode = c("error", "quiet")
)

get_xtools_page_links(
  title,
  language = "en",
  failure_mode = c("error", "quiet")
)

get_xtools_page_top_editors(
  title,
  start = NULL,
  end = NULL,
  limit = 1000,
  nobots = FALSE,
  language = "en",
  failure_mode = c("error", "quiet")
)

get_xtools_page_assessment(
  title,
  classonly = FALSE,
  language = "en",
  failure_mode = c("error", "quiet")
)
title | Character vector of page titles
language | Language code for the version of Wikipedia to query
failure_mode | What to do if no data is found: either "error" (the default) or "quiet" (see get_rest_resource)
start | A character vector or date object (optional): the start date for calculating top editors
end | A character vector or date object (optional): the end date for calculating top editors
limit | An integer: the maximum number of top editors to return
nobots | TRUE or FALSE: if TRUE, bots are excluded from the top editor calculation
classonly | TRUE or FALSE: if TRUE, only return the article's assessment status, without Wikiproject information
A list or tbl of results, the same length as title. NB: the results for get_xtools_page_assessment are still not parsed properly.
# Get basic statistics about Erich Auerbach on German Wikipedia
auerbach <- get_xtools_page_info("Erich Auerbach", language = "de", failure_mode = "quiet")
auerbach
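A further sketch of the filters mentioned above for get_xtools_page_top_editors(); the date range and limit are purely illustrative:

# Top non-bot editors of the same article over 2020, capped at 25 editors
auerbach_editors <- get_xtools_page_top_editors(
  "Erich Auerbach",
  start = "2020-01-01",
  end = "2020-12-31",
  limit = 25,
  nobots = TRUE,
  language = "de",
  failure_mode = "quiet"
)
auerbach_editors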