MediaWiki::DumpFile::Compat::Revisions(3pm) | User Contributed Perl Documentation | MediaWiki::DumpFile::Compat::Revisions(3pm) |
NAME¶
Parse::MediaWikiDump::Revisions - Object capable of processing dump files with multiple revisions per article
ABOUT¶
This object is used to access the metadata associated with a MediaWiki instance and provide an iterative interface for extracting the individual article revisions out of the same. To guarantee that there is only a single revision per article use the Parse::MediaWikiDump::Pages object.
SYNOPSIS¶
use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; $revisions = $pmwd->revisions('pages-articles.xml'); $revisions = $pmwd->revisions(\*FILEHANDLE); #print the title and id of each article inside the dump file while(defined($page = $revisions->next)) { print "title '", $page->title, "' id ", $page->id, "\n"; }
METHODS¶
- $revisions->new
- Open the specified MediaWiki dump file. If the single argument to this method is a string it will be used as the path to the file to open. If the argument is a reference to a filehandle the contents will be read from the filehandle as specified.
- $revisions->next
- Returns an instance of the next available Parse::MediaWikiDump::page object or returns undef if there are no more articles left.
- $revisions->version
- Returns a plain text string of the dump file format revision number
- $revisions->sitename
- Returns a plain text string that is the name of the MediaWiki instance.
- $revisions->base
- Returns the URL to the instances main article in the form of a string.
- $revisions->generator
- Returns a string containing 'MediaWiki' and a version number of the instance that dumped this file. Example: 'MediaWiki 1.14alpha'
- $revisions->case
- Returns a string describing the case sensitivity configured in the instance.
- $revisions->namespaces
- Returns a reference to an array of references. Each reference is to another array with the first item being the unique identifier of the namespace and the second element containing a string that is the name of the namespace.
- $revisions->namespaces_names
- Returns an array reference the array contains strings of all the namespaces each as an element.
- $revisions->current_byte
- Returns the number of bytes that has been processed so far
- $revisions->size
- Returns the total size of the dump file in bytes.
EXAMPLE¶
Extract the article text of each revision of an article using a given title¶
#!/usr/bin/perl use strict; use warnings; use MediaWiki::DumpFile::Compat; my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages"; my $title = shift(@ARGV) or die "must specify an article title"; my $pmwd = Parse::MediaWikiDump->new; my $dump = $pmwd->revisions($file); my $found = 0; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); #this is the only currently known value but there could be more in the future if ($dump->case ne 'first-letter') { die "unable to handle any case setting besides 'first-letter'"; } $title = case_fixer($title); while(my $revision = $dump->next) { if ($revision->title eq $title) { print STDERR "Located text for $title revision ", $revision->revision_id, "\n"; my $text = $revision->text; print $$text; $found = 1; } } print STDERR "Unable to find article text for $title\n" unless $found; exit 1; #removes any case sensativity from the very first letter of the title #but not from the optional namespace name sub case_fixer { my $title = shift; #check for namespace if ($title =~ /^(.+?):(.+)/) { $title = $1 . ':' . ucfirst($2); } else { $title = ucfirst($title); } return $title; }
LIMITATIONS¶
Version 0.4¶
This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files.
2022-06-15 | perl v5.34.0 |