.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.42)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.
\}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "MediaWiki::DumpFile::Pages 3pm"
.TH MediaWiki::DumpFile::Pages 3pm "2022-06-15" "perl v5.34.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
MediaWiki::DumpFile::Pages \- Process an XML dump file of pages from a MediaWiki instance
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&  use MediaWiki::DumpFile::Pages;
\&
\&  #dump files up to version 0.5 are tested
\&  $input = \*(Aqfile\-name.xml\*(Aq;
\&  #many supported compression formats
\&  $input = \*(Aqfile\-name.xml.bz2\*(Aq;
\&  $input = \*(Aqfile\-name.xml.gz\*(Aq;
\&  $input = \e*FH;
\&
\&  $pages = MediaWiki::DumpFile::Pages\->new($input);
\&
\&  #default values
\&  %opts = (
\&    input => $input,
\&    fast_mode => 0,
\&    version_ignore => 1
\&  );
\&
\&  #override configuration options passed to constructor
\&  $ENV{MEDIAWIKI_DUMPFILE_VERSION_IGNORE} = 0;
\&  $ENV{MEDIAWIKI_DUMPFILE_FAST_MODE} = 1;
\&
\&  $pages = MediaWiki::DumpFile::Pages\->new(%opts);
\&  $version = $pages\->version;
\&
\&  #version 0.3 and later dump files only
\&  $sitename = $pages\->sitename;
\&  $base = $pages\->base;
\&  $generator = $pages\->generator;
\&  $case = $pages\->case;
\&  %namespaces = $pages\->namespaces;
\&
\&  #all versions
\&  while(defined($page = $pages\->next)) {
\&    print \*(AqTitle: \*(Aq, $page\->title, "\en";
\&  }
\&
\&  $title = $page\->title;
\&  $id = $page\->id;
\&  $revision = $page\->revision;
\&  @revisions = $page\->revision;
\&
\&  $text = $revision\->text;
\&  $id = $revision\->id;
\&  $timestamp = $revision\->timestamp;
\&  $comment = $revision\->comment;
\&  $contributor = $revision\->contributor;
\&  #version 0.4 and later dump files only
\&  $bool = $revision\->redirect;
\&
\&  $username = $contributor\->username;
\&  $id = $contributor\->id;
\&  $ip = $contributor\->ip;
\&  $username_or_ip = 
$contributor\->astext;
\&  $username_or_ip = "$contributor";
.Ve
.SH "METHODS"
.IX Header "METHODS"
.SS "new"
.IX Subsection "new"
This is the constructor for this package. If it is called with a single
parameter it must be the input to use for parsing. The input is specified as
either the location of a MediaWiki pages dump file or a reference to an
already open file handle.
.PP
If more than one argument is passed to new it must be a hash of options. The
keys are named
.IP "input" 4
.IX Item "input"
This is the input to parse as documented earlier.
.IP "fast_mode" 4
.IX Item "fast_mode"
Have the iterator run in fast mode by default; defaults to false. See the
section on fast mode below.
.IP "version_ignore" 4
.IX Item "version_ignore"
Do not enforce parsing of only tested schemas in the \s-1XML\s0 document;
defaults to true
.SS "version"
.IX Subsection "version"
Returns the version of the dump file.
.SS "sitename"
.IX Subsection "sitename"
Returns the sitename from the MediaWiki instance. Requires a dump file of at
least version 0.3.
.SS "base"
.IX Subsection "base"
Returns the \s-1URL\s0 used to access the MediaWiki instance. Requires a dump
file of at least version 0.3.
.SS "generator"
.IX Subsection "generator"
Returns the version of MediaWiki that generated the dump file. Requires a dump
file of at least version 0.3.
.SS "case"
.IX Subsection "case"
Returns the case sensitivity configuration of the MediaWiki instance. Requires
a dump file of at least version 0.3.
.SS "namespaces"
.IX Subsection "namespaces"
Returns a hash where the key is the numerical namespace id and the value is
the plain text namespace name. The main namespace has an id of 0 and an empty
string value. Requires a dump file of at least version 0.3.
.SS "next"
.IX Subsection "next"
Accepts an optional boolean argument to control fast mode. If the argument is
specified it forces fast mode on or off. Otherwise the mode is controlled by
the fast_mode configuration option.
See the section below on fast mode for more information.
.PP
It is safe to intermix calls between fast and normal mode in one parsing
session.
.PP
In all modes undef is returned if there is no more data to parse.
.PP
In normal mode an instance of MediaWiki::DumpFile::Pages::Page is returned
and the full \s-1API\s0 is available.
.PP
In fast mode an instance of MediaWiki::DumpFile::Pages::FastPage is returned;
the only methods supported are title, text, and revision. This class can act
as a stand-in for MediaWiki::DumpFile::Pages::Page except it will throw an
error if any attempt is made to access any other part of the \s-1API.\s0
.SS "size"
.IX Subsection "size"
Returns the size of the input file in bytes, or undef if the input specified
is a reference to a file handle.
.SS "current_byte"
.IX Subsection "current_byte"
Returns the number of bytes of \s-1XML\s0 that have been successfully parsed.
.SH "FAST MODE"
.IX Header "FAST MODE"
Fast mode is a way to get increased parsing performance while dropping some of
the features available in the parser. If you only require the titles and text
from a page then fast mode will decrease the amount of time required just to
parse the \s-1XML\s0 file, sometimes drastically.
.PP
When fast mode is used on a dump file that has more than one revision of a
single article in it, only the text of the first revision of that article in
the dump file will be returned; the other revisions of the article will be
silently skipped over.
.SH "MediaWiki::DumpFile::Pages::Page"
.IX Header "MediaWiki::DumpFile::Pages::Page"
This object represents a distinct MediaWiki page and is used to access the
page data and metadata.
The following methods are available:
.IP "title" 4
.IX Item "title"
Returns a string of the page title.
.IP "id" 4
.IX Item "id"
Returns the numerical page id.
.IP "revision" 4
.IX Item "revision"
In scalar context returns the last revision in the dump for this page; in
array context returns a list of all revisions made available for the page in
the same order as the dump file. Every returned object is an instance of
MediaWiki::DumpFile::Pages::Page::Revision.
.SH "MediaWiki::DumpFile::Pages::Page::Revision"
.IX Header "MediaWiki::DumpFile::Pages::Page::Revision"
This object represents a distinct revision of a page from the MediaWiki dump
file. The standard dump files contain only the most recent revision of each
page and the comprehensive dump files contain all revisions for each page. The
following methods are available:
.IP "text" 4
.IX Item "text"
Returns the page text for this specific revision of the page.
.IP "id" 4
.IX Item "id"
Returns the numerical revision id for this specific revision \- this is
independent of the page id.
.IP "timestamp" 4
.IX Item "timestamp"
Returns a string value representing the time the revision was created. The
string is in the format of \&\*(L"2008\-07\-09T18:41:10Z\*(R".
.IP "comment" 4
.IX Item "comment"
Returns the comment made about the revision when it was created.
.IP "contributor" 4
.IX Item "contributor"
Returns an instance of MediaWiki::DumpFile::Pages::Page::Revision::Contributor.
.IP "minor" 4
.IX Item "minor"
Returns true if the edit was marked as being minor or false otherwise.
.IP "redirect" 4
.IX Item "redirect"
Returns true if the page is a redirect to another page or false otherwise.
Requires a dump file of at least version 0.4.
.SH "MediaWiki::DumpFile::Pages::Page::Revision::Contributor"
.IX Header "MediaWiki::DumpFile::Pages::Page::Revision::Contributor"
This object provides access to the contributor of a specific revision of a
page.
When used in a string context it will return the username of the editor if the
editor was logged in or the \s-1IP\s0 address of the editor if the edit was
anonymous.
.IP "username" 4
.IX Item "username"
Returns the username of the editor if the editor was logged in when the edit
was made or undef otherwise.
.IP "id" 4
.IX Item "id"
Returns the numerical id of the editor if the editor was logged in or undef
otherwise.
.IP "ip" 4
.IX Item "ip"
Returns the \s-1IP\s0 address of the editor if the editor was anonymous or
undef otherwise.
.IP "astext" 4
.IX Item "astext"
Returns the username of the editor if they were logged in or the \s-1IP\s0
address if the editor was anonymous.
.SH "ERRORS"
.IX Header "ERRORS"
.SS "E_XML_CREATE_FAILED Error creating \s-1XML\s0 parser object"
.IX Subsection "E_XML_CREATE_FAILED Error creating XML parser object"
While trying to build the XML::TreePuller object a fatal error occurred; the
error message from the parser is included in the generated error output. At
the time of writing the parser's error messages are not very helpful, but they
do indicate that the \s-1XML\s0 parser rejected the document. Here is a list
of things to check:
.IP "Make sure the file exists and is readable" 4
.IX Item "Make sure the file exists and is readable"
.PD 0
.IP "Make sure the file is actually an \s-1XML\s0 file and is not compressed" 4
.IX Item "Make sure the file is actually an XML file and is not compressed"
.PD
.SS "E_XML_PARSE_FAILED \s-1XML\s0 parser failed during parsing"
.IX Subsection "E_XML_PARSE_FAILED XML parser failed during parsing"
Something went wrong with the \s-1XML\s0 parser \- the error from the parser
is included in the generated error message. This happens when there is a
severe error parsing the document, such as a syntax error.
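.PP
The errors in this section can be trapped with eval. What follows is a minimal
sketch, not part of the documented \s-1API\s0: it assumes the module reports
each error by calling die with a message that begins with the error code, as
the sample E_UNTESTED_DUMP_VERSION message in this section suggests. The file
name is a placeholder and matching on the leading code is an illustrative
choice.
.PP
.Vb 1
\&  use strict;
\&  use warnings;
\&  use MediaWiki::DumpFile::Pages;
\&
\&  #placeholder path; substitute your own dump file
\&  my $file = \*(Aqfile\-name.xml\*(Aq;
\&
\&  #the trailing 1 makes eval return true only if nothing died
\&  my $ok = eval {
\&    my $pages = MediaWiki::DumpFile::Pages\->new($file);
\&    while(defined(my $page = $pages\->next)) {
\&      print \*(AqTitle: \*(Aq, $page\->title, "\en";
\&    }
\&    1;
\&  };
\&
\&  if (! $ok) {
\&    my $error = $@;
\&    #assumption: the message starts with a code such as E_XML_PARSE_FAILED
\&    if ($error =~ /^(E_[A\-Z_]+)/) {
\&      warn "dump file problem ($1): $error";
\&    } else {
\&      #not one of the documented codes; pass it along unchanged
\&      die $error;
\&    }
\&  }
.Ve
.PP
Wrapping both the constructor and the iteration in one eval means a failure to
create the parser and a failure part way through the document are handled in
the same place.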
.SS "E_UNTESTED_DUMP_VERSION Untested dump file versions"
.IX Subsection "E_UNTESTED_DUMP_VERSION Untested dump file versions"
The dump files created by MediaWiki include a versioned \s-1XML\s0 schema.
This software is tested with the most recent known schema versions and can be
configured to enforce a specific tested schema. MediaWiki::DumpFile::Pages no
longer enforces the versions by default, but the author of the software using
this library has indicated that it should. When enforcement is on and an
untested version is encountered, the module dies with an error like the
following:
.PP
E_UNTESTED_DUMP_VERSION Version 0.4 dump file \*(L"t/simpleenglish\-wikipedia.xml\*(R"
has not been tested with MediaWiki::DumpFile::Pages version 0.1.9; see the
\&\s-1ERRORS\s0 section of the MediaWiki::DumpFile::Pages Perl module
documentation for what to do at lib/MediaWiki/DumpFile/Pages.pm line 148.
.PP
If you encounter this condition you can do the following:
.IP "Check your module version" 4
.IX Item "Check your module version"
The error message should have the version number of this module in it. Check
\&\s-1CPAN\s0 and see if there is a newer version with official support. The
web page
.Sp
.Vb 1
\& http://search.cpan.org/dist/MediaWiki\-DumpFile/lib/MediaWiki/DumpFile/Pages.pm
.Ve
.Sp
shows the highest supported dump file version near the top of the \s-1SYNOPSIS.\s0
.IP "Check the bug database" 4
.IX Item "Check the bug database"
It is possible the issue has been resolved already but the update has not made
it onto \s-1CPAN\s0 yet. See this web page
.Sp
.Vb 1
\& http://rt.cpan.org/Public/Dist/Display.html?Name=mediawiki\-dumpfile
.Ve
.Sp
and check for an open bug report relating to the version number changing.
.IP "Be adventurous" 4
.IX Item "Be adventurous"
If you just want to have the software run anyway and see what happens you can
set the environment variable \s-1MEDIAWIKI_DUMPFILE_VERSION_IGNORE\s0 to a
true value which will cause the module to silently ignore the case and
continue parsing the document.
You can set the environment variable and run your program at the same time
with a command like this:
.Sp
.Vb 1
\& MEDIAWIKI_DUMPFILE_VERSION_IGNORE=1 ./wikiscript.pl
.Ve
.Sp
This may work fine or it may fail silently in subtle ways \- there is no way
to know for sure without studying the schema to see if the changes are
backwards compatible.
.IP "Open a bug report" 4
.IX Item "Open a bug report"
You can use the same \s-1URL\s0 for rt.cpan.org above to create a new ticket
in MediaWiki-DumpFile or just send an email to
\&\*(L"bug-mediawiki-dumpfile at rt.cpan.org\*(R". Be sure to use a title for
the bug that others will be able to use to find this case as well and to
include the full text from the error message. Please also specify if you were
adventurous or not and if it was successful for you.
.SH "AUTHOR"
.IX Header "AUTHOR"
Tyler Riddle, \f(CW\*(C`\*(C'\fR
.SH "BUGS"
.IX Header "BUGS"
Please see MediaWiki::DumpFile for information on how to report bugs in this
software.
.SH "COPYRIGHT & LICENSE"
.IX Header "COPYRIGHT & LICENSE"
Copyright 2009 \*(L"Tyler Riddle\*(R".
.PP
This program is free software; you can redistribute it and/or modify it under
the terms of either: the \s-1GNU\s0 General Public License as published by the
Free Software Foundation; or the Artistic License.
.PP
See http://dev.perl.org/licenses/ for more information.