XTRACT(1) NCBI Entrez Direct User's Manual XTRACT(1)

NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-mimics str] [-excludes str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-consists-of str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-tag tag] [-att key str] [-atr key element] [-cls] [-slf] [-end tag] [-bkt] [-element element] [-first element] [-last element] [-even element] [-odd element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-geo element] [-hrm element] [-rms element] [-sqt element] [-lge element] [-lg2 element] [-log element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-decode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alpha element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-journal element] [-prose element] [-terms element] [-words element] [-pairs element] [-split element -with str] [-order element] [-reverse element] [-letters element] [-clauses element] [-pentamers element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-trim element] [-wct element] [-doi element] [-accession element] [-numeric element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-fasta] [-revcomp] [-nucleic] [-ncbi2na] [-ncbi4na] [-cds2prot [-gcode N] [-frame N]] [-molwt] [-molwt-m] [-molwt-f] [-pept] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-insdx] [-histogram] [-indexer element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-test] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.
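
As a rough illustration of the table model described above, the following Python sketch prints one row per record and one column per matching element value. It is not xtract's implementation. The sample XML, the function name xml_to_table, and its parameters are hypothetical; the tab and ret parameters only echo the roles of the -tab and -ret options.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the element names imitate the PubMed-style
# names used elsewhere in this manual.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1001</PMID>
    <LastName>Smith</LastName>
    <LastName>Jones</LastName>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>1002</PMID>
    <LastName>Garcia</LastName>
  </PubmedArticle>
</PubmedArticleSet>"""


def xml_to_table(xml_text, pattern, elements, tab="\t", ret="\n"):
    """Return one row per `pattern` record, one field per matching value."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        fields = []
        for name in elements:
            # Collect every descendant with this tag name, in document order.
            fields.extend((node.text or "").strip() for node in record.iter(name))
        rows.append(tab.join(fields))
    return ret.join(rows)


table = xml_to_table(SAMPLE, "PubmedArticle", ["PMID", "LastName"])
print(table)
```

Each record becomes one line, and repeated elements (here, two LastName values in the first record) become additional fields on that line.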

OPTIONS

Processing Flags

-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-self
Allow detection of empty self-closing tags.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.

Data Source

-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.
-aliases filename
Mappings file for -classify operation.

Exploration Argument Hierarchy

-pattern expr
Name of record within set.
-group expr, -block expr, -subset expr
Use of different argument names allows command-line control of nested looping.

Path Navigation

-path path
Explore by list of adjacent object names.

Exploration Constructs

DateRevised
Book/AuthorList
MedlineCitation/Article/Journal/JournalIssue/PubDate
"PubmedArticleSet/*"
"History/**"
"*/Taxon"

Conditional Execution

-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
first/last/outer/inner/even/odd/all.

String Constraints

-equals str
String must match exactly.
-contains str
Substring must be present.
-mimics str
Containment test after converting punctuation to space.
-excludes str
Substring must be absent.
-includes str
Substring must match at word boundaries.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
-is-before str
First string < second string.
-is-after str
First string > second string.
-consists-of str
String must only contain specified characters.
-matches str
Matches without commas or semicolons.
-resembles str
Requires all words, but in any order.

Object Constraints

-is-equal-to expr
Object values must match.
-differs-from expr
Object values must differ.

Numeric Constraints

-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.

Format Customization

-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.

XML Generation

-set tag
XML tag for entire set.
-rec tag
XML tag for each record.
-wrp tag
Wrap elements in XML object.
-enc tag
Encase instance in XML object.
-plg str
Prologue to print before instance.
-elg str
Epilogue to print after instance.
-pkg tag
Package subset in XML object.
-fwd str
Foreword to print before subset.
-awd str
Afterword to print after subset.

Tag and Attribute Construction

-tag tag
Start with <tag.
-att key str
Attribute key and literal string.
-atr key element
Attribute key and element name.
-cls
Close with >.
-slf
Self-close with />.
-end tag
End contents with </tag>.

FASTA Parsable Fields

-bkt
Wrap elements in bracketed fields.

Element Selection

-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-even element
Only print value of even items.
-odd element
Only print value of odd items.
-backward element
Print values in reverse order.
-NAME
Record value in named variable.
--STATS
Accumulate values into variable.

-element Constructs

Caption
Initials,LastName
MedlineCitation/PMID
"**/Gene-commentary_accession"
PubDate/*
DescriptorName@MajorTopicYN
MedlineDate[1:4]
"Title[phospholipase | rattlesnake]"
"[can contain ^ vertical bar]"
Object Count
"#Author"
"%Title"
"^PMID"
"&NAME"

Special -element Operations

"+"
Object Name
"?"
Object Value
"~"
"*"
"$"
"@"
"."
"%"

Numeric (Integer) Processing

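-num element
Count.
-len element
Length.
-sum element
Sum.
-acc element
Accumulator.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Arithmetic mean.
-dev element
Deviation.
-med element
Median.
-mul element
Product.
-div element
Quotient.
-mod element
Remainder.
-geo element
Geometric mean.
-hrm element
Harmonic mean.
-rms element
Root mean square.
-sqt element
Square root.
-lge element
Natural logarithm.
-lg2 element
Logarithm base two.
-log element
Logarithm base ten.
-bin element
Binary.
-oct element
Octal.
-hex element
Hexadecimal.
-bit element
Number of bits set.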
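
For readers unfamiliar with the less common statistics named above, the following Python sketch shows the textbook definitions of the geometric mean, harmonic mean, root mean square, and bit count. It is only an illustration: the function name summarize and the sample values are hypothetical, and xtract's own integer handling and rounding are not modeled here.

```python
import math
import statistics


def summarize(values):
    """Textbook definitions of several statistics (floating-point results)."""
    return {
        "avg": statistics.mean(values),            # arithmetic mean
        "geo": statistics.geometric_mean(values),  # n-th root of the product
        "hrm": statistics.harmonic_mean(values),   # n / sum of reciprocals
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
    }


def bits_set(n):
    """Number of 1 bits in the binary representation of a non-negative integer."""
    return bin(n).count("1")


stats = summarize([1, 2, 4])  # hypothetical sample values
print(stats)
print(bits_set(13))  # 13 is 1101 in binary
```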

Leading Zero Padding

-pad element
Zero-pad to eight digits.

Character Processing

-encode element
XML-encode <, >, &, ", and ' characters.
-decode element
Base64-decode object embedded in XML.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-chain element
Change spaces to underscores.
-title element
Capitalize initial letters of words.
-mirror element
Reverse order of letters.
-alpha element
Non-alphabetic characters to space.
-alnum element
Non-alphanumeric characters to space.

String Processing

-basic element
Convert superscripts and subscripts.
-plain element
Remove embedded mixed-content markup tags.
-simple element
Normalize accented letters; spell Greek letters.
-author element
Multi-step author cleanup.
-journal element
Journal capitalization and punctuation.
-prose element
Text conversion to ASCII.

Text Processing

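-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-split element -with str
Split using -with for delimiter.
-order element
Rearrange words in sorted order.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.
-pentamers element
Sliding window of pentamers.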

Citation Functions

-year element
Extract first 4-digit year from string.
-month element
Match first month name and return a corresponding integer.
-date element
YYYY/MM/DD from -unit "PubDate" -date "*"
-page element
Get digits (and letters) of first page number.
-auth element
Change GenBank authors to Medline form.
-initials element
Parse initials from forename or given name.
-trim element
Remove extra spaces and leading zeros.
-wct element
Count number of -words in a string.
-doi element
Add https://doi.org/ prefix, URL encode.
-accession element
Allow indexing of full accession.version.
-numeric element
Only accept items that are entirely digits.

Value Transformation

-translate element
Substitute values with -transform table.
-classify element
Substring word or phrase matches to -aliases table.

Regular Expression

-replace
Substitute text using regular expressions.
-reg target
Target expression.
-exp replacement
Replacement pattern.

Sequence Processing

-fasta
Split sequence into blocks of 70 uppercase letters.
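
The blocking described above can be sketched in a few lines of Python. This is an illustration rather than xtract's code: the function name fasta_lines is hypothetical, and removing whitespace before blocking is an assumption of this sketch.

```python
def fasta_lines(sequence, width=70):
    """Uppercase a sequence and cut it into lines of at most `width` letters."""
    # Assumption of this sketch: embedded whitespace is dropped first.
    seq = "".join(sequence.split()).upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


lines = fasta_lines("acgt" * 40)  # hypothetical 160-letter sequence
print("\n".join(lines))
```

With a 160-letter input, the result is two full 70-letter lines followed by a final line holding the remaining 20 letters.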

Nucleotide Processing

-revcomp
Reverse complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
-ncbi2na
Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na
Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-cds2prot [-gcode N] [-frame N]
Translate coding region using -gcode and (1-based) -frame (both 1 by default).
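
The truncation caveat above follows from the packed encoding: ncbi2na stores four bases per byte, so a sequence whose length is not a multiple of four is padded out to a whole byte. The Python sketch below illustrates this, along with a plain reverse complement. It is not xtract's implementation; the function names are hypothetical, reading the packed data as hexadecimal text is an assumption of this sketch, and IUPAC ambiguity codes are not handled by the complement table.

```python
NCBI2NA = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def expand_ncbi2na(hex_text, length=None):
    """Unpack 2-bit bases, first base in the high-order bits of each byte."""
    bases = []
    for byte in bytes.fromhex(hex_text):
        for shift in (6, 4, 2, 0):
            bases.append(NCBI2NA[(byte >> shift) & 0b11])
    seq = "".join(bases)
    # Padding bits decode as extra bases, so trim to the true length if known.
    return seq if length is None else seq[:length]


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


padded = expand_ncbi2na("1B40")             # two bytes always yield eight bases
trimmed = expand_ncbi2na("1B40", length=5)  # keep only the five real bases
print(padded, trimmed, reverse_complement(trimmed))
```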

Protein Processing

Calculate molecular weight of peptide.
Molecular weight retaining initial methionine.
Keep initial M residue as formyl-methionine.
Split amino acid runs at *, -, x, or X.

Sequence Coordinates

-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
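
The three conventions differ only in where counting starts and whether the stop position is included. The Python sketch below shows the same interval in each convention, starting from 1-based closed coordinates. It illustrates the conventions themselves and makes no claim about how xtract determines the input convention; the function name from_one_based is hypothetical.

```python
def from_one_based(start, stop):
    """Express a 1-based, closed interval in all three conventions."""
    return {
        "zero_based": (start - 1, stop - 1),  # both ends shifted down by one
        "one_based": (start, stop),           # unchanged
        "half_open": (start - 1, stop),       # 0-based start, exclusive stop
    }


first_ten = from_one_based(1, 10)  # the first ten bases of a sequence
print(first_ten)
```

In the half-open form, the interval length is simply stop minus start.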

Command Generator

-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
complete/partial
CDS/mRNA/...[,...]
INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
-insdx
Process -insd output table into XML.

Frequency Table

-histogram
Collects data for sort-uniq-count(1) on entire set of records.

Entrez Indexing

-indexer element
Positional index using -wrp for field name.

Output Organization

-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.

Record Selection

-select condition
Select record subset by conditions.
-in filename
File of identifiers to use for selection.

Record Rearrangement

-sort[-fwd] element
Element to use as sort key.
-sort-rev element
Sort records in reverse order.

Reformatting

Fast block copy (still applies processing flags).
Compress runs of spaces.
Suppress line indentation.
Indent according to nesting depth.
Place each attribute on a separate line.

Validation

-verify
Report XML data integrity problems.
-test
Check field for visible combining accents and invisible Unicode.

Summary

-outline
Display outline of XML structure.
-synopsis
Display individual XML paths.
-contour [delimiter]
Display XML paths to leaf nodes (delimited by / by default).

Full Exploration Command Precedence

Documentation

-help
Print usage information and some example argument combinations.
-examples
Complete usage examples, involving additional Entrez Direct tools.
-unix
Illustrate common Unix command arguments.
-version
Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lower case.

SEE ALSO

align-columns(1), archive-nihocc(1), archive-nlmnlp(1), archive-nmcds(1), archive-pids(1), archive-pmc(1), archive-pubmed(1), archive-taxonomy(1), asn2ref(1), between-two-genes(1), bsmp2info(1), csv2xml(1), custom-index(1), disambiguate-nucleotides(1), download-flatfile(1), download-ncbi-data(1), ds2pme(1), efetch(1), esample(1), filter-columns(1), find-in-gene(1), fuse-ranges(1), fuse-segments(1), gbf2facds(1), gbf2fsa(1), gbf2info(1), gbf2tbl(1), gene2range(1), gff2xml(1), gff-sort(1), gm2segs(1), hgvs2spdi(1), nquire(1), pm-collect(1), pm-refresh(1), pma2apa(1), pma2pme(1), pmc2bioc(1), pmc2info(1), print-columns(1), rchive(1), refseq-nm-cds(1), reorder-columns(1), snp2hgvs(1), snp2tbl(1), sort-table(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xfetch(1), xfilter(1), xinfo(1), xlink(1), xml2fsa(1), xml2tbl(1), xsearch(1), xy-plot(1).

2025-05-26 NCBI