aext is a tool for extracting full XML documents from CLAIMS Direct. It is installed as part of the CLAIMS Direct repository. See the Client Tools Installation Instructions for details on installing this tool.
Detailed Description of the Parameters
|As configured in
/etc/alexandria.xml, the database entry pointing to the on-site CLAIMS Direct PostgreSQL instance. The default value is
alexandria as this value is pre-configured in
|Available only with the optional on-site Solr installation. This is the URL of the standalone CLAIMS Direct Solr instance or, if used, the URL of the load balancer. Although there is a default value, this parameter is mandatory if you specify --solrq.
The following parameters determine the source criteria for extracting CLAIMS Direct XML. Only one may be specified.
modified_load_id of the table xml.t_patent_document_values. Please see the documentation on content updates, which describes the various load-ids.
|The name of a user-created table with a minimum required column
|Any raw SQL that returns one or more
|Any raw Solr query.
Extract Naming and Destination
|The output location of either the batches or, if --archive is specified, the root directory for files in the predictable path structure. The default is the current working directory.
|The standard extract is run in batches. This parameter specifies the prefix for each output file. The default is
Archive the XML into a predictable path structure. The structure is as follows:
|By default, data is extracted using parallel processes for increased speed. This parameter specifies exactly how many parallel processes will be used. A general rule of thumb is to set this parameter to the number of CPU cores the machine has.
|This parameter specifies the number of documents to extract per thread. If you know the nature of the content you are extracting, this parameter can be tuned to increase speed. For example, bibliographic-only content benefits from a larger value, while full-text content benefits from a smaller value.
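The rule of thumb for the number of parallel processes depends on knowing how many CPU cores the machine has. As a minimal sketch, the core count can be read with Python's standard library. The function name below is hypothetical and is not part of aext; it only reports a number you could then pass to the parallel-processes parameter.

```python
import os


def suggested_process_count() -> int:
    """Return the number of CPU cores visible to Python, falling back to 1."""
    # os.cpu_count() returns None when the count cannot be determined,
    # so fall back to a single process in that case.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(suggested_process_count())
```

On a shared machine, a lower value may be more appropriate, since the database server and other workloads compete for the same cores.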
Output XML Filtering
aext uses the internal PostgreSQL function xml.f_patent_document_s to extract full XML documents. This parameter allows you to specify a custom extract function.
Extracting Using a Specific load-id
The following example uses modified_load_id 261358. The resulting XML batches will be in /tmp and will be prefixed with TEST. The logging output may differ depending on your logging configuration.
Extracting Using a Table
The following example uses the
table parameter. A user-defined table is created with a subset of documents which are then extracted using
First we create the table in a private
Next, we load the table with publication-ids. For the sake of example, all documents associated with modified_load_id 261358 will be selected.
Finally, extract the documents into a predictable path structure in the current directory.
Extracting Using SQL
This example takes the raw SQL used to populate the private table in the example above, and uses it directly as a parameter to
Extracting Using Solr
If the optional CLAIMS Direct Solr instance is installed, the power of Solr can be used to search, filter, and extract documents. This example simply pulls the same set of documents as above using Solr query syntax.
Extracting Using a Custom Database Function
The following example describes a use-case in which only CPC classifications are of interest. It makes use of a custom extract function created in a private schema.
Manipulating the content of the XML risks producing invalid XML. If you validate the XML against the CLAIMS Direct DTD, beware of required elements.
First, we create the function that extracts only publication information and classification information.
Using this function together with the --loadid parameter, we can now extract XML that includes only publication and CPC information.
To determine the current status of the data extraction, check the log output for the batch number currently being extracted, then insert it into the following formula:
For example, given 17000000 total documents, a batch size of 500, and a current batch number of 31000, the formula would determine that there are 1500000 documents left to extract:
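The figures in this example are consistent with the calculation remaining = total documents - (batch size x current batch number). The following Python sketch works under that assumption. The function name is hypothetical and not part of aext, and the current batch number is taken to mean the count of batches already written.

```python
def remaining_documents(total_documents: int, batch_size: int, current_batch: int) -> int:
    """Estimate how many documents are still to be extracted.

    Assumes every batch before the current one held exactly batch_size
    documents, which matches the worked example in the text.
    """
    extracted = batch_size * current_batch
    # Clamp at zero in case the final batch is only partially filled.
    return max(total_documents - extracted, 0)


if __name__ == "__main__":
    # 17000000 total, batch size 500, currently on batch 31000
    print(remaining_documents(17_000_000, 500, 31_000))  # prints 1500000
```

Dividing the result by the total gives the fraction of the extract still outstanding, which can be handy for rough time estimates.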