XTRACT(1) | NCBI Entrez Direct User's Manual | XTRACT(1) |
NAME¶
xtract - NCBI Entrez Direct XML conversion and transformation tool
SYNOPSIS¶
xtract [-help] [-strict] [-mixed] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-element element] [-first element] [-last element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-bit element] [-encode element] [-plain element] [-upper element] [-lower element] [-chain element] [-title element] [-year element] [-doi element] [-translate element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-histogram] [-e2index] [-indices element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]
DESCRIPTION¶
xtract converts an XML document into a table of data values according to user-specified rules.
OPTIONS¶
Processing Flags¶
Data Source¶
- -input filename
- Read XML from file instead of standard input.
- -transform filename
- File of substitutions for -translate.
Exploration Argument Hierarchy¶
- -pattern expr
- -group expr
- -block expr
- -subset expr
- Name of record within set. Use of different argument names allows command-line control of nested looping.
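The nested looping described above can be pictured with a minimal Python sketch. It is not how xtract is implemented; it only illustrates an outer loop over records (in the role of -pattern) with an inner loop over sub-objects (in the role of -block). The sample XML, its tag names, and the helper name rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the tag names are made up for illustration.
SAMPLE = """
<Set>
  <Rec><Id>1</Id><Author><Last>Smith</Last></Author><Author><Last>Jones</Last></Author></Rec>
  <Rec><Id>2</Id><Author><Last>Lee</Last></Author></Rec>
</Set>
"""

def rows(xml_text, pattern, block, outer_field, inner_field):
    """Outer loop over `pattern` records, inner loop over `block` objects.

    Returns one tab-delimited line per record, loosely mirroring a
    tab-between-fields, newline-between-records layout.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for rec in root.iter(pattern):          # plays the role of -pattern
        fields = [rec.findtext(outer_field, default="")]
        for blk in rec.iter(block):         # plays the role of -block, nested in the record
            fields.append(blk.findtext(inner_field, default=""))
        lines.append("\t".join(fields))
    return lines

for line in rows(SAMPLE, "Rec", "Author", "Id", "Last"):
    print(line)
```

Each level of looping gets its own name here (pattern, block), which is the same idea as using a different argument name for each nesting level on the command line.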
Path Navigation¶
- -path path
- Explore by list of adjacent object names.
Exploration Constructs¶
- Object
- DateRevised
- Parent/Child
- Book/AuthorList
- Path
- MedlineCitation/Article/Journal/JournalIssue/PubDate
- Heterogeneous
- "PubmedArticleSet/*"
- Exhaustive
- "History/**"
- Nested
- "*/Taxon"
- Recursive
- "**/Gene-commentary"
Conditional Execution¶
- -if expr [constraint]
- Element (or @attribute) must exist and satisfy any specified constraint.
- -unless expr [constraint]
- Skip if element matches.
- -and condition
- Preceding and following tests must both pass.
- -or condition
- Any passing test suffices.
- -else
- Execute if conditional test failed.
- -position pos
- first/last/outer/inner/even/odd/all.
String Constraints¶
- -equals str
- String must match exactly.
- -contains str
- Substring must be present.
- -is-within str
- String must be present.
- -starts-with str
- Substring must be at beginning.
- -ends-with str
- Substring must be at end.
- -is-not str
- String must not match.
- -is-before str
- First string < second string.
- -is-after str
- First string > second string.
- -matches str
- Matches without commas or semicolons.
- -resembles str
- Requires all words, but in any order.
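As a rough illustration of several of these constraints, the following Python sketch applies them with case-insensitive comparison, as stated under NOTES. The function names are hypothetical, and the treatment of -resembles (every word of the constraint must occur among the words of the value) is only one possible reading of the description above, not a statement of how xtract behaves.

```python
def equals(value, s):
    """String must match exactly (ignoring case)."""
    return value.casefold() == s.casefold()

def contains(value, s):
    """Substring must be present."""
    return s.casefold() in value.casefold()

def starts_with(value, s):
    """Substring must be at beginning."""
    return value.casefold().startswith(s.casefold())

def ends_with(value, s):
    """Substring must be at end."""
    return value.casefold().endswith(s.casefold())

def is_not(value, s):
    """String must not match."""
    return not equals(value, s)

def resembles(value, s):
    """Assumed reading: all words of `s` occur in `value`, in any order."""
    have = set(value.casefold().split())
    return all(word in have for word in s.casefold().split())
```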
Object Constraints¶
- -is-equal-to expr
- Object values must match.
- -differs-from expr
- Object values must differ.
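Numeric Constraints¶
- -gt N
- Greater than.
- -ge N
- Greater than or equal to.
- -lt N
- Less than.
- -le N
- Less than or equal to.
- -eq N
- Equal to.
- -ne N
- Not equal to.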
Format Customization¶
- -ret str
- Override line break between patterns.
- -tab str
- Replace tab character between fields.
- -sep str
- Separator between group members.
- -pfx str
- Prefix to print before group.
- -sfx str
- Suffix to print after group.
- -rst
- Reset -sep through -elg.
- -clr
- Clear queued tab separator.
- -pfc str
- Preface combines -clr and -pfx.
- -deq str
- Delete and replace queued tab separator.
- -def str
- Default placeholder for missing fields.
- -lbl str
- Insert arbitrary text.
XML Generation¶
- -set tag
- XML tag for entire set.
- -rec tag
- XML tag for each record.
- -wrp tag
- Wrap elements in XML object.
- -enc tag
- Encase instance in XML object.
- -plg str
- Prologue to print before instance.
- -elg str
- Epilogue to print after instance.
- -pkg tag
- Package subset in XML object.
- -fwd str
- Foreword to print before subset.
- -awd str
- Afterword to print after subset.
Element Selection¶
- -element element
- Print all items that match tag name.
- -first element
- Only print value of first item.
- -last element
- Only print value of last item.
- -NAME
- Record value in named variable.
- --STATS
- Accumulate values into variable.
-element Constructs¶
- Tag
- Caption
- Group
- Initials,LastName
- Parent/Child
- MedlineCitation/PMID
- Recursive
- "**/Gene-commentary_accession"
- Unrestricted
- PubDate/*
- Attribute
- DescriptorName@MajorTopicYN
- Range
- MedlineDate[1:4]
- Substring
- "Title[phospholipase | rattlesnake]"
- Object Count
- "#Author"
- Item Length
- "%Title"
- Element Depth
- "^PMID"
- Variable
- "&NAME"
Special -element Operations¶
- Parent Index
- "+"
- Object Name
- "?"
- XML Subtree
- "*"
- Children
- "$"
- Attributes
- "@"
Numeric Processing¶
- -num element
- Count.
- -len element
- Length.
- -sum element
- Sum.
- -min element
- Minimum.
- -max element
- Maximum.
- -inc element
- Increment.
- -dec element
- Decrement.
- -sub element
- Difference.
- -avg element
- Average.
- -dev element
- Deviation.
- -med element
- Median.
- -mul element
- Product.
- -div element
- Quotient.
- -mod element
- Remainder.
- -bin element
- Binary.
- -bit element
- Bit count.
String Processing¶
- -encode element
- XML-encode <, >, &, ", and ' characters.
- -plain element
- Remove embedded mixed-content markup tags.
- -upper element
- Convert text to uppercase.
- -lower element
- Convert text to lowercase.
- -chain element
- Change spaces to underscores.
- -title element
- Capitalize initial letters of words.
- -year element
- Extract first 4-digit year from string.
- -doi element
- Add https://doi.org/ prefix, URL encode.
- -translate element
- Substitute values with -transform table.
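A few of these transformations are simple enough to sketch in Python. The code below is an approximation for illustration only, not xtract's implementation. The function names are hypothetical, the year helper assumes that "first 4-digit year" means the first run of four digits, and the choice of &quot; and &apos; as the quote entities is an assumption.

```python
import re
from xml.sax.saxutils import escape

def chain(text):
    """Change spaces to underscores."""
    return text.replace(" ", "_")

def year(text):
    """Return the first run of four digits, or "" if there is none."""
    m = re.search(r"\d{4}", text)
    return m.group(0) if m else ""

def xml_encode(text):
    """Encode <, >, &, double quote, and single quote as XML entities."""
    # escape() handles &, <, and > itself; the quote entities are supplied explicitly.
    return escape(text, {'"': "&quot;", "'": "&apos;"})
```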
Text Processing¶
- -terms element
- Partition text at spaces.
- -words element
- Split at punctuation marks.
- -pairs element
- Adjacent informative words.
- -order element
- Rearrange words in sorted order.
- -reverse element
- Reverse words in string.
- -letters element
- Separate individual letters.
- -clauses element
- Break at phrase separators.
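The difference between partitioning at spaces and splitting at punctuation can be seen in a short Python sketch. It is illustrative only. The function names are hypothetical, and the sketch assumes that -words also splits at whitespace and, per NOTES, converts to lower case. The exact tokenization rules xtract uses are not given here.

```python
import re

def terms(text):
    """Partition text at spaces, leaving punctuation attached."""
    return text.split()

def words(text):
    """Split at non-word characters (assumed) and convert to lower case."""
    return [w for w in re.split(r"\W+", text.lower()) if w]

def reverse_words(text):
    """Reverse the order of words in a string."""
    return " ".join(reversed(text.split()))

def letters(text):
    """Separate individual letters, skipping whitespace."""
    return [c for c in text if not c.isspace()]
```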
Regular Expression¶
- -replace
- Substitute text using regular expressions.
- -reg target
- Target expression.
- -exp replacement
- Replacement pattern.
Sequence Processing¶
- -revcomp
- Reverse complement nucleotide sequence.
- -nucleic
- Subrange determines forward or revcomp.
- -fasta
- Split sequence into blocks of 50 letters.
- -ncbi2na
- Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
- -ncbi4na
- Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
- -molwt
- Calculate molecular weight of peptide.
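The following Python sketch shows what three of these operations compute. It is not xtract's code, and the function names are hypothetical. The reverse complement handles only unambiguous A, C, G, and T. The ncbi2na decoder assumes the input is available as raw bytes holding four bases per byte, two bits each, most significant bits first, with A=0, C=1, G=2, T=3; how xtract actually receives the encoded data is not specified here. Because the last byte may be padded, the result may need truncating to the actual sequence length, as noted above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of an unambiguous nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fasta_blocks(seq, width=50):
    """Split a sequence into blocks of `width` letters (50 by default)."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]

def expand_ncbi2na(data, length=None):
    """Expand 2-bit packed bases to letters, optionally truncating."""
    bases = "ACGT"
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):          # high-order base first
            out.append(bases[(byte >> shift) & 3])
    seq = "".join(out)
    return seq if length is None else seq[:length]
```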
Sequence Coordinates¶
- -0-based element
- Zero-based.
- -1-based element
- One-based.
- -ucsc-based element
- Half-open.
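The three conventions differ only in where counting starts and whether the end position is included. The Python sketch below converts a one-based, inclusive interval to the other two forms. It is a general illustration of the conventions, with hypothetical function names; which convention xtract assumes for its input element is not stated here.

```python
def one_to_zero_based(start, stop):
    """One-based inclusive -> zero-based inclusive."""
    return start - 1, stop - 1

def one_to_half_open(start, stop):
    """One-based inclusive -> half-open (zero-based start, exclusive end)."""
    return start - 1, stop

def half_open_length(start, stop):
    """Length of a half-open interval is simply stop - start."""
    return stop - start
```

A convenient property of the half-open form is that the interval length needs no plus-one correction.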
Command Generator¶
- -insd arg ...
- Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
- Descriptor(s)
- INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
- Completeness
- complete/partial
- Feature(s)
- CDS/mRNA/...[,...]
- Qualifier(s)
- INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
Frequency Table¶
- -histogram
- Collects data for sort-uniq-count(1) on entire set of records.
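Conceptually, a frequency table counts how often each distinct value occurs across all records. A minimal Python sketch of that idea follows. The function name is hypothetical, and returning (count, value) pairs ordered by value is an assumption made for illustration, not a description of the actual output format.

```python
from collections import Counter

def histogram(values):
    """Return (count, value) pairs for each distinct value, sorted by value."""
    counts = Counter(values)
    return [(counts[v], v) for v in sorted(counts)]
```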
Entrez Indexing¶
- -e2index
- Create Entrez index XML.
- -indices element
- Index normalized words.
Output Organization¶
Record Selection¶
- -select condition
- Select record subset by conditions.
- -in filename
- File of identifiers to use for selection.
Record Rearrangement¶
- -sort element
- Element to use as sort key.
Reformatting¶
Validation¶
- -verify
- Report XML data integrity problems.
Summary¶
- -outline
- Display outline of XML structure.
- -synopsis
- Display individual XML paths.
- -contour [delimiter]
- Display XML paths to leaf nodes (delimited by / by default).
Documentation¶
- -help
- Print usage information and some example argument combinations.
- -examples
- Complete examples of edirect(1) and xtract usage.
- -unix
- Illustrate common Unix command arguments.
- -version
- Print version number.
NOTES¶
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
SEE ALSO¶
download-ncbi-data(1), edirect(1), esample(1), index-extras(1), index-pubmed(1), pm-index(1), pm-invert(1), pm-stash(1), rchive(1), sort-uniq-count(1), transmute(1), xml2tbl(1), xy-plot(1).
2021-03-07 | NCBI |