XTRACT(1) | NCBI Entrez Direct User's Manual | XTRACT(1) |
NAME¶
xtract - NCBI Entrez Direct XML conversion and transformation tool
SYNOPSIS¶
xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-mimics str] [-excludes str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-consists-of str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-tag tag] [-att key str] [-atr key element] [-cls] [-slf] [-end tag] [-bkt] [-element element] [-first element] [-last element] [-even element] [-odd element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-geo element] [-hrm element] [-rms element] [-sqt element] [-lge element] [-lg2 element] [-log element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-decode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alpha element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-journal element] [-prose element] [-terms element] [-words element] [-pairs element] [-split element -with str] [-order element] [-reverse element] [-letters element] [-clauses element] [-pentamers element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-trim element] [-wct element] [-doi element] [-accession element] [-numeric element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-fasta] [-revcomp] [-nucleic] [-ncbi2na] [-ncbi4na] [-cds2prot [-gcode N] [-frame N]] [-molwt] [-molwt-m] [-molwt-f] [-pept] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-insdx] [-histogram] [-indexer element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-test] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]
DESCRIPTION¶
xtract converts an XML document into a table of data values according to user-specified rules.
OPTIONS¶
Processing Flags¶
- -strict
- Remove HTML and MathML tags.
- -mixed
- Allow mixed content XML.
- -self
- Allow detection of empty self-closing tags.
- -accent
- Delete Unicode accents and diacritical marks.
- -ascii
- Convert Unicode to numeric HTML character entities.
- -compress
- Compress runs of spaces.
- -stops
- Retain stop words in selected phrases.
Data Source¶
- -input filename
- Read XML from file instead of standard input.
- -transform filename
- File of substitutions for -translate.
- -aliases filename
- Mappings file for -classify operation.
Exploration Argument Hierarchy¶
- -pattern expr
- -group expr
- -block expr
- -subset expr
- Name of record within set. Use of different argument names allows command-line control of nested looping.
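As an illustration of nested looping, the sketch below visits each record with -pattern and each author inside it with -block. The sample XML and its element names (Set, Rec, Id, Author, Last) are invented for this example, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; the element names are made up for illustration.
xml='<Set><Rec><Id>1</Id><Author><Last>Smith</Last></Author><Author><Last>Jones</Last></Author></Rec><Rec><Id>2</Id><Author><Last>Lee</Last></Author></Rec></Set>'

# -pattern starts one output row per Rec;
# -block then loops over every Author inside that Rec.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -element Id -block Author -element Last)
printf '%s\n' "$out"
```

Each Rec should produce one tab-delimited row: the Id, followed by one Last value per Author.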
Path Navigation¶
- -path path
- Explore by list of adjacent object names.
Exploration Constructs¶
- Object
- DateRevised
- Parent/Child
- Book/AuthorList
- Path
- MedlineCitation/Article/Journal/JournalIssue/PubDate
- Heterogeneous
- "PubmedArticleSet/*"
- Exhaustive
- "History/**"
- Nested
- "*/Taxon"
Conditional Execution¶
- -if expr [constraint]
- Element (or @attribute) must exist and satisfy any specified constraint.
- -unless expr [constraint]
- Skip if element matches.
- -and condition
- Preceding and following tests must both pass.
- -or condition
- Any passing test suffices.
- -else
- Execute if conditional test failed.
- -position pos
- first/last/outer/inner/even/odd/all.
String Constraints¶
- -equals str
- String must match exactly.
- -contains str
- Substring must be present.
- -mimics str
- Containment test after converting punctuation to space.
- -excludes str
- Substring must be absent.
- -includes str
- Substring must match at word boundaries.
- -is-within str
- String must be present.
- -starts-with str
- Substring must be at beginning.
- -ends-with str
- Substring must be at end.
- -is-not str
- String must not match.
- -is-before str
- First string < second string.
- -is-after str
- First string > second string.
- -consists-of str
- String must only contain specified characters.
- -matches str
- Matches without commas or semicolons.
- -resembles str
- Requires all words, but in any order.
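A conditional test is typically paired with a string constraint. The sketch below keeps only records whose Type equals "review"; because string constraints are case-insensitive (see NOTES), "Review" should match as well. The sample XML and element names are invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; the element names are made up for illustration.
xml='<Set><Rec><Id>1</Id><Type>Review</Type></Rec><Rec><Id>2</Id><Type>Letter</Type></Rec><Rec><Id>3</Id><Type>review</Type></Rec></Set>'

# Print the Id only for records where Type exists and equals "review".
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -if Type -equals review -element Id)
printf '%s\n' "$out"
```

Records 1 and 3 should be printed, one Id per line; record 2 is skipped.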
Object Constraints¶
- -is-equal-to expr
- Object values must match.
- -differs-from expr
- Object values must differ.
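Numeric Constraints¶
- -gt N
- Greater than.
- -ge N
- Greater than or equal to.
- -lt N
- Less than.
- -le N
- Less than or equal to.
- -eq N
- Equal to.
- -ne N
- Not equal to.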
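The numeric constraints listed in the SYNOPSIS (-gt, -ge, -lt, -le, -eq, -ne) follow an -if or -unless test in the same way as string constraints and compare integer values (see NOTES). A minimal sketch, using invented sample XML and assuming xtract is on PATH:

```sh
# Hypothetical sample data; the element names are made up for illustration.
xml='<Set><Rec><Id>1</Id><Year>1995</Year></Rec><Rec><Id>2</Id><Year>2000</Year></Rec><Rec><Id>3</Id><Year>2010</Year></Rec></Set>'

# Keep records whose Year is greater than or equal to 2000.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -if Year -ge 2000 -element Id)
printf '%s\n' "$out"
```

Records 2 and 3 should pass the test.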
Format Customization¶
- -ret str
- Override line break between patterns.
- -tab str
- Replace tab character between fields.
- -sep str
- Separator between group members.
- -pfx str
- Prefix to print before group.
- -sfx str
- Suffix to print after group.
- -rst
- Reset -sep through -elg.
- -clr
- Clear queued tab separator.
- -pfc str
- Preface combines -clr and -pfx.
- -deq str
- Delete and replace queued tab separator.
- -def str
- Default placeholder for missing fields.
- -lbl str
- Insert arbitrary text.
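The formatting arguments take effect for the -element clauses that follow them. The sketch below replaces the tab between fields with a vertical bar, joins repeated values with a comma, and prints a dash for a missing field. The sample XML and element names (including the absent Note) are invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; Rec has two Kw values and no Note element.
xml='<Set><Rec><Id>1</Id><Kw>a</Kw><Kw>b</Kw></Rec></Set>'

# -tab: separator between fields; -sep: separator between repeated values;
# -def: placeholder printed when a requested field is missing.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -tab "|" -sep "," -def "-" -element Id Kw Note)
printf '%s\n' "$out"
```

With this input the row should read `1|a,b|-`.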
XML Generation¶
- -set tag
- XML tag for entire set.
- -rec tag
- XML tag for each record.
- -wrp tag
- Wrap elements in XML object.
- -enc tag
- Encase instance in XML object.
- -plg str
- Prologue to print before instance.
- -elg str
- Epilogue to print after instance.
- -pkg tag
- Package subset in XML object.
- -fwd str
- Foreword to print before subset.
- -awd str
- Afterword to print after subset.
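A common use of these arguments is to reshape one XML format into another. The sketch below, a round trip under stated assumptions, wraps selected values in new tags with -set, -rec, and -wrp, then reads the generated XML back with a second xtract call to confirm it is well-formed. The tag names Collection, Entry, Identifier, and Keyword are arbitrary choices for this example, the sample XML is invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; the element names are made up for illustration.
xml='<Set><Rec><Id>1</Id><Kw>a</Kw><Kw>b</Kw></Rec></Set>'

# Step 1: emit new XML, one Entry per Rec inside a Collection.
generated=$(printf '%s\n' "$xml" |
  xtract -set Collection -rec Entry -pattern Rec \
    -wrp Identifier -element Id -wrp Keyword -element Kw)
printf '%s\n' "$generated"

# Step 2: read the generated XML back as a table.
out=$(printf '%s\n' "$generated" |
  xtract -pattern Entry -sep "," -element Identifier Keyword)
printf '%s\n' "$out"
```

The second command should recover the Id and the comma-joined keywords from the regenerated records.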
Tag and Attribute Construction¶
- -tag tag
- Start with <tag.
- -att key str
- Attribute key and literal string.
- -atr key element
- Attribute key and element name.
- -cls
- Close with >.
- -slf
- Self-close with />.
- -end tag
- End contents with </tag>.
FASTA Parsable Fields¶
- -bkt
- Wrap elements in bracketed fields.
Element Selection¶
- -element element
- Print all items that match tag name.
- -first element
- Only print value of first item.
- -last element
- Only print value of last item.
- -even element
- Only print value of even items.
- -odd element
- Only print value of odd items.
- -backward element
- Print values in reverse order.
- -NAME
- Record value in named variable.
- --STATS
- Accumulate values into variable.
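To show how the positional selectors differ from plain -element, the sketch below prints only the first and last of three repeated values. The sample XML and element names are invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; Rec holds three Kw values.
xml='<Set><Rec><Kw>a</Kw><Kw>b</Kw><Kw>c</Kw></Rec></Set>'

# -first and -last each print a single value from the matching items.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -first Kw -last Kw)
printf '%s\n' "$out"
```

The row should contain `a` and `c`, separated by a tab.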
-element Constructs¶
- Tag
- Caption
- Group
- Initials,LastName
- Parent/Child
- MedlineCitation/PMID
- Recursive
- "**/Gene-commentary_accession"
- Unrestricted
- PubDate/*
- Attribute
- DescriptorName@MajorTopicYN
- Range
- MedlineDate[1:4]
- Substring
- "Title[phospholipase | rattlesnake]"
- Alternative
- "[can contain ^ vertical bar]"
- Object Count
- "#Author"
- Item Length
- "%Title"
- Element Depth
- "^PMID"
- Variable
- "&NAME"
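Several of these constructs can be combined in one -element clause. The sketch below uses an object count, an attribute, and a character range; the sample XML, the element names, and the abbr attribute are invented for illustration, and xtract is assumed to be on PATH. Constructs containing shell metacharacters are quoted.

```sh
# Hypothetical sample data; element and attribute names are made up.
xml='<Set><Rec><Journal abbr="JEx">Journal of Examples</Journal><Date>2024 Jan</Date><Author>A</Author><Author>B</Author></Rec></Set>'

# "#Author"      counts Author objects
# Journal@abbr   prints the abbr attribute of Journal
# "Date[1:4]"    prints characters 1 through 4 of Date
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -element "#Author" Journal@abbr "Date[1:4]")
printf '%s\n' "$out"
```

The row should hold the author count, the attribute value, and the four-character year prefix.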
Special -element Operations¶
- Parent Index
- "+"
- Object Name
- "?"
- Object Value
- "~"
- XML Subtree
- "*"
- Children
- "$"
- Attributes
- "@"
- ASN.1 Record
- "."
- JSON Record
- "%"
Numeric (Integer) Processing¶
- -num element
- Count.
- -len element
- Length.
- -sum element
- Sum.
- -acc element
- Accumulator.
- -min element
- Minimum.
- -max element
- Maximum.
- -inc element
- Increment.
- -dec element
- Decrement.
- -sub element
- Difference.
- -avg element
- Arithmetic mean.
- -dev element
- Deviation.
- -med element
- Median.
- -mul element
- Product.
- -div element
- Quotient.
- -mod element
- Remainder.
- -geo element
- Geometric mean.
- -hrm element
- Harmonic mean.
- -rms element
- Root mean square.
- -sqt element
- Square root.
- -lge element
- Natural logarithm.
- -lg2 element
- Logarithm base two.
- -log element
- Logarithm base ten.
- -bin element
- Binary.
- -oct element
- Octal.
- -hex element
- Hexadecimal.
- -bit element
- Number of bits set.
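The numeric arguments are used in place of -element and operate on all matching values within the current scope. A minimal sketch with three integer scores follows; the sample XML and element names are invented, and xtract is assumed to be on PATH. The values were chosen so that the mean is a whole number, since processing is integer-based.

```sh
# Hypothetical sample data; Rec holds the integer scores 3, 5, and 10.
xml='<Set><Rec><Score>3</Score><Score>5</Score><Score>10</Score></Rec></Set>'

# Count, sum, minimum, maximum, and arithmetic mean of the Score values.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -num Score -sum Score -min Score -max Score -avg Score)
printf '%s\n' "$out"
```

The expected row is 3, 18, 3, 10, and 6, separated by tabs.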
Leading Zero Padding¶
- -pad element
- Zero-pad to eight digits.
Character Processing¶
- -encode element
- XML-encode <, >, &, ", and ' characters.
- -decode element
- Base64-decode object embedded in XML.
- -upper element
- Convert text to uppercase.
- -lower element
- Convert text to lowercase.
- -chain element
- Change spaces to underscores.
- -title element
- Capitalize initial letters of words.
- -mirror element
- Reverse order of letters.
- -alpha element
- Non-alphabetic characters to space.
- -alnum element
- Non-alphanumeric characters to space.
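Character processing arguments likewise stand in for -element. The sketch below prints the same value in uppercase and lowercase; the sample XML and element names are invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; the element names are made up for illustration.
xml='<Set><Rec><Name>Hello World</Name></Rec></Set>'

# Print Name twice: once uppercased, once lowercased.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -upper Name -lower Name)
printf '%s\n' "$out"
```

The two columns should read `HELLO WORLD` and `hello world`.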
String Processing¶
- -basic element
- Convert superscripts and subscripts.
- -plain element
- Remove embedded mixed-content markup tags.
- -simple element
- Normalize accented letters; spell Greek letters.
- -author element
- Multi-step author cleanup.
- -journal element
- Journal capitalization and punctuation.
- -prose element
- Text conversion to ASCII.
Text Processing¶
- -terms element
- Partition text at spaces.
- -words element
- Split at punctuation marks.
- -pairs element
- Adjacent informative words.
- -split element
- Split using -with for delimiter.
- -order element
- Rearrange words in sorted order.
- -reverse element
- Reverse words in string.
- -letters element
- Separate individual letters.
- -clauses element
- Break at phrase separators.
- -pentamers element
- Sliding window of pentamers.
Citation Functions¶
- -year element
- Extract first 4-digit year from string.
- -month element
- Match first month name and return a corresponding integer.
- -date element
- YYYY/MM/DD from -unit "PubDate" -date "*"
- -page element
- Get digits (and letters) of first page number.
- -auth element
- Change GenBank authors to Medline form.
- -initials element
- Parse initials from forename or given name.
- -trim element
- Remove extra spaces and leading zeros.
- -wct element
- Count number of -words in a string.
- -doi element
- Add https://doi.org/ prefix, URL encode.
- -accession element
- Allow indexing of full accession.version.
- -numeric element
- Only accept items that are entirely digits.
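As a small example of a citation function, the sketch below uses -year to pull the first four-digit year out of a free-text date. The sample XML and element names are invented, and xtract is assumed to be on PATH.

```sh
# Hypothetical sample data; Date mixes a season name with a year.
xml='<Set><Rec><Date>Spring 2024</Date></Rec></Set>'

# -year extracts the first 4-digit year found in the Date string.
out=$(printf '%s\n' "$xml" |
  xtract -pattern Rec -year Date)
printf '%s\n' "$out"
```

Only the year portion of the string should be printed.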
Value Transformation¶
- -translate element
- Substitute values with -transform table.
- -classify element
- Substring word or phrase matches to -aliases table.
Regular Expression¶
- -replace
- Substitute text using regular expressions.
- -reg target
- Target expression.
- -exp pattern
- Replacement pattern.
Sequence Processing¶
- -fasta
- Split sequence into blocks of 70 uppercase letters.
Nucleotide Processing¶
- -revcomp
- Reverse complement nucleotide sequence.
- -nucleic
- Subrange determines forward or revcomp.
- -ncbi2na
- Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
- -ncbi4na
- Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
- -cds2prot [-gcode N] [-frame N]
- Translate coding region using -gcode and (1-based) -frame (both 1 by default).
Protein Processing¶
Sequence Coordinates¶
- -0-based element
- Zero-based.
- -1-based element
- One-based.
- -ucsc-based element
- Half-open.
Command Generator¶
- -insd arg ...
- Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
- Descriptor(s)
- INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
- Completeness
- complete/partial
- Feature(s)
- CDS/mRNA/...[,...]
- Qualifier(s)
- INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
- -insdx
- Process -insd output table into XML.
Frequency Table¶
- -histogram
- Collects data for sort-uniq-count(1) on entire set of records.
Entrez Indexing¶
- -indexer element
- Positional index using -wrp for field name.
Output Organization¶
Record Selection¶
- -select condition
- Select record subset by conditions.
- -in filename
- File of identifiers to use for selection.
Record Rearrangement¶
- -sort[-fwd] element
- Element to use as sort key.
- -sort-rev element
- Sort records in reverse order.
Reformatting¶
Validation¶
Summary¶
- -outline
- Display outline of XML structure.
- -synopsis
- Display individual XML paths.
- -contour [delimiter]
- Display XML paths to leaf nodes (delimited by / by default).
Full Exploration Command Precedence¶
Documentation¶
NOTES¶
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
SEE ALSO¶
align-columns(1), archive-nihocc(1), archive-nlmnlp(1), archive-nmcds(1), archive-pids(1), archive-pmc(1), archive-pubmed(1), archive-taxonomy(1), asn2ref(1), between-two-genes(1), bsmp2info(1), csv2xml(1), custom-index(1), disambiguate-nucleotides(1), download-flatfile(1), download-ncbi-data(1), ds2pme(1), efetch(1), esample(1), filter-columns(1), find-in-gene(1), fuse-ranges(1), fuse-segments(1), gbf2facds(1), gbf2fsa(1), gbf2info(1), gbf2tbl(1), gene2range(1), gff2xml(1), gff-sort(1), gm2segs(1), hgvs2spdi(1), nquire(1), pm-collect(1), pm-refresh(1), pma2apa(1), pma2pme(1), pmc2bioc(1), pmc2info(1), print-columns(1), rchive(1), refseq-nm-cds(1), reorder-columns(1), snp2hgvs(1), snp2tbl(1), sort-table(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xfetch(1), xfilter(1), xinfo(1), xlink(1), xml2fsa(1), xml2tbl(1), xsearch(1), xy-plot(1).
2025-05-26 | NCBI |