Solr Indexing Process Explained
Processes
The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into Solr XML, and submitting it via HTTP POST to Solr for indexing. The conversion from CLAIMS Direct XML to Solr XML is handled by the indexer class (default: Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.
Process | Package |
---|---|
aidx | Alexandria::Library |
aidxd | Alexandria::Client::Tools |
apgup | Alexandria::Client::Tools |
Source Data
Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the Alexandria::Library module Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:
- load-id: modified-load-id of xml.t_patent_document_values
- table: any table name that has a publication_id (int) column
- SQL: raw SQL selecting desired documents by publication_id
Regardless of extraction criteria, Alexandria::DWH::Extract uses an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization and the number of documents per select are controlled by the parameters nthreads and batchsize, respectively. aidx also accepts a dbfunc parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
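To make the roles of batchsize and nthreads concrete, the following is a minimal sketch in plain Perl. It is not the implementation used by Alexandria::DWH::Extract: the helpers make_batches and assign_batches are hypothetical, and round-robin assignment is an assumption chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of ids into batches of at most $batchsize.
sub make_batches {
    my ( $batchsize, @ids ) = @_;
    die "batchsize must be a positive integer\n" if $batchsize < 1;
    my @batches;
    while (@ids) {
        push @batches, [ splice( @ids, 0, $batchsize ) ];
    }
    return @batches;
}

# Hypothetical helper: deal batches round-robin to $nthreads workers.
# (Assumption: the real extractor may schedule batches differently.)
sub assign_batches {
    my ( $nthreads, @batches ) = @_;
    my @workers = map { [] } 1 .. $nthreads;
    for my $i ( 0 .. $#batches ) {
        push @{ $workers[ $i % $nthreads ] }, $batches[$i];
    }
    return @workers;
}

my @ids     = ( 1 .. 10 );                      # stand-ins for publication_id values
my @batches = make_batches( 4, @ids );          # batchsize = 4
my @workers = assign_batches( 2, @batches );    # nthreads  = 2

printf "%d batches across %d workers\n", scalar @batches, scalar @workers;
```

With ten ids, a batchsize of 4, and two workers, this prints "3 batches across 2 workers": the last batch holds the two remaining ids, and the first worker receives two of the three batches.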
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on ( t1.publication_id=t2.publication_id) |
aidx --loadid=y | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) |
aidx --sqlq=USER_SQL | execute SQL into t1 | select xml.f_patent_document_values_s(t2.publication_id) |
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x --dbfunc=f_my_function | select publication_id | select f_my_function(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on ( t1.publication_id=t2.publication_id) |
Indexer Class
Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from aidx serves to illustrate the process.

    #!/usr/bin/perl

    use Alexandria::DWH::Extract;
    use Alexandria::DWH::Index;
    use Alexandria::DWH::Index::Document;

    my $idxcls = shift(@ARGV);    # indexer class name, from the command line

    # callback: transform one source XML document into Solr XML
    sub _create_solr_document {
        my ( $batch, $xml ) = @_;

        # double quotes so the class name is interpolated and resolved as a
        # module name; fail loudly if the class cannot be loaded
        eval "require $idxcls; 1" or die "cannot load $idxcls: $@";

        return $idxcls->new( document => $xml )->toNode()->toString(1);
    }

    my $ex = Alexandria::DWH::Extract->new(
        ...
        callbacks => { on_document_processed => \&_create_solr_document }
    );
    $ex->prepare();
    $ex->run();    # every document extracted is sent through _create_solr_document()
    $ex->finalize();

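The callback flow can be tried without a database or the Alexandria modules. The sketch below is only an illustration under stated assumptions: MockExtract and MockIndexDocument are hypothetical stand-ins, not part of Alexandria::Library, and the mock passes a fixed batch number of 0 to the callback.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an indexer class; the real one builds Solr XML.
package MockIndexDocument;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class; }

sub toString {
    my $self = shift;
    return '<doc><field name="raw">' . length( $self->{document} ) . '</field></doc>';
}

# Hypothetical stand-in for the extractor: calls the callback once per document.
package MockExtract;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class; }

sub run {
    my $self  = shift;
    my $cb    = $self->{callbacks}{on_document_processed};
    my $batch = 0;    # assumption: a single batch
    return map { $cb->( $batch, $_ ) } @{ $self->{documents} };
}

package main;

my $idxcls = 'MockIndexDocument';    # would normally come from the command line

my $ex = MockExtract->new(
    documents => [ '<patent-document/>', '<patent-document ucid="X"/>' ],
    callbacks => {
        on_document_processed => sub {
            my ( $batch, $xml ) = @_;
            return $idxcls->new( document => $xml )->toString();
        },
    },
);

my @solr_docs = $ex->run();
print "$_\n" for @solr_docs;
```

The point of the pattern is that the extractor never needs to know which indexer class is in use; swapping the class name changes the output without touching the extraction loop.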
Creating a Custom Indexing Class
Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the Solr document representation by adding, deleting, or modifying fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following shell module serves as the basis for the use cases detailed below.

    package MyCustomIndexingClass;
    use Moose;

    ### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
    extends 'Alexandria::DWH::Index::DocumentEx';

    # override _process_source
    sub _process_source {
        my $self = shift;

        # we want to process the standard way ...
        $self->SUPER::_process_source();

        # do nothing else
    }

    1;

You can now specify MyCustomIndexingClass as the command line argument --idxcls to the indexing utility aidx.

    aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

Use Cases
Assumptions
The following use cases assume:
- a valid index entry in /etc/alexandria.xml (this will differ from the default if you have a custom Solr installation)
- custom indexing class modules are either in the directory from which you run aidx or in your PERL5LIB path
(1) Adding (Injecting), Modifying, and Deleting Fields
For this use case, you will need to modify your Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:
<field name="customInteger" type="tint" indexed="true" stored="true" />
Below is example code to inject customInteger into the Solr document. Additionally, it shows how to modify the contents of anseries, and how to delete anseries if the publication country is US and the publication date is 2015-01-01 or later.

    package MyCustomIndexingClass;    # subclass of Alexandria::DWH::Index::Document
    use Moose;

    ### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
    extends 'Alexandria::DWH::Index::DocumentEx';

    # override _process_source
    sub _process_source {
        my $self = shift;

        # even though we are overriding _process_source(), we still
        # want the parent class to do all the work for us
        # by calling the parent method (SUPER) ...
        $self->SUPER::_process_source();

        # the _fields member of $self contains all the
        # Solr content as a hash reference of array references
        # e.g.
        #   _fields => {
        #     pn       => [ 'US-5551212-A' ],
        #     anseries => [ '07' ],
        #     icl1     => [ 'A', 'F', 'H' ]
        #   }
        # NOTE: multiValued=false fields are still represented as an array
        #       but only have one member
        my $flds = $self->{_fields} || return;    # nothing to do

        # inject a new field
        push( @{ $flds->{customInteger} }, 1 );

        # we want to make certain that anseries is not padded, i.e.,
        # we need to be sure it is an integer
        if ( $flds->{anseries} && @{ $flds->{anseries} } ) {
            $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

            # lastly, we don't want to index anseries for US documents
            # published on or after 20150101
            my $ctry = $flds->{pnctry}->[0];
            my $date = $flds->{pd}->[0];
            if ( $ctry eq 'US' && $date > 20141231 ) {
                delete $flds->{anseries};
            }
        }
    }

    1;

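The field manipulation itself is ordinary Perl data-structure work and can be exercised outside the indexer. The following sketch applies the same three steps to a plain hash reference; adjust_fields is a hypothetical helper written for this illustration, and the sample field values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the inject / modify / delete steps to a
# hash reference shaped like the _fields member described above.
sub adjust_fields {
    my $flds = shift;

    # inject a new field
    push @{ $flds->{customInteger} }, 1;

    if ( $flds->{anseries} && @{ $flds->{anseries} } ) {

        # modify: strip zero padding so the value is a plain integer
        $flds->{anseries}[0] = sprintf( "%d", $flds->{anseries}[0] );

        # delete: drop anseries for US documents dated 20150101 or later
        my $ctry = $flds->{pnctry}[0] // '';
        my $date = $flds->{pd}[0]     // 0;
        delete $flds->{anseries} if $ctry eq 'US' && $date > 20141231;
    }
    return $flds;
}

# made-up sample documents
my $old_us = adjust_fields( { pnctry => ['US'], pd => [19960903], anseries => ['07'] } );
my $new_us = adjust_fields( { pnctry => ['US'], pd => [20150507], anseries => ['14'] } );

print "old US anseries: $old_us->{anseries}[0]\n";
print "new US has anseries: ", ( exists $new_us->{anseries} ? 'yes' : 'no' ), "\n";
```

This prints "old US anseries: 7" followed by "new US has anseries: no". Both documents receive customInteger.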
(2) Accessing the CLAIMS Direct Source XML Document
This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:
<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" multiValued="true" />
We first need to define what a related ucid (rel_ucid) is. For this example, it will be defined as:
- any related document with a relation of @type=related-publication
- any pct-or-regional-publishing-data
The parts of the XML document that are of interest:

    <related-documents>
      <relation type="related-publication">
        <document-id>
          <country>US</country>
          <doc-number>20150126456</doc-number>
          <kind>A1</kind>
          <date>20150507</date>
        </document-id>
      </relation>
    </related-documents>
    <!-- ... -->
    <pct-or-regional-publishing-data ucid="WO-2013182650-A1">
      <document-id>
        <country>WO</country>
        <doc-number>2013182650</doc-number>
        <kind>A1</kind>
        <date>20131212</date>
      </document-id>
    </pct-or-regional-publishing-data>

As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.

    ### routine to parse related documents
    sub _parse_related_documents {
        my $self = shift;

        # the root of the source XML
        # as an XML::LibXML::Node
        my $patdoc = shift;

        my @a = ();    # stores any related-publications

        # if there are no related documents, return empty array
        my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
        return \@a if !$reldoc_node;

        foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
            if ( $relation->getAttribute('type') eq 'related-publication' ) {
                push( @a,
                    sprintf( "%s-%s-%s",
                        $relation->findvalue('./document-id/country'),
                        $relation->findvalue('./document-id/doc-number'),
                        $relation->findvalue('./document-id/kind') ) );
            }
        }
        return \@a;
    }

Points to consider with _parse_related_documents:
- The source document (XML) representation is an XML::LibXML::Node, named $patdoc above.
- Utilizing available methods, it is relatively simple to access particular parts of the XML tree.
- The calls to findvalue are not error-checked, i.e., we assume every value is present and that combining the values in sprintf will return a correctly formatted ucid.
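If the missing error checking is a concern, the three parts can be validated before they are combined. The sketch below is one possible approach, not part of the indexer class: format_ucid is a hypothetical helper, and it relies on findvalue returning an empty string when the XPath expression matches nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: build a ucid only when all three parts are non-empty;
# otherwise return undef so the caller can skip the entry.
sub format_ucid {
    my ( $country, $number, $kind ) = @_;
    for my $part ( $country, $number, $kind ) {
        return undef unless defined $part && length $part;
    }
    return sprintf( "%s-%s-%s", $country, $number, $kind );
}

my $ok  = format_ucid( 'US', '20150126456', 'A1' );
my $bad = format_ucid( 'US', '20150126456', '' );    # e.g. <kind> missing in the source

print defined $ok  ? "$ok\n"  : "skipped\n";
print defined $bad ? "$bad\n" : "skipped\n";
```

This prints "US-20150126456-A1" and then "skipped". Inside _parse_related_documents, the push would then be made only when the helper returns a defined value.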

    ### routine to parse pct publication information
    sub _parse_pct_publishing_data {
        my $self = shift;

        # the root of the source XML
        # as an XML::LibXML::Node
        my $patdoc = shift;

        # if there is no pct publishing node, return undef
        my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
        return undef if !$pct_node;

        # return ucid
        return $pct_node->getAttribute('ucid');
    }

Points to consider:
- According to the DTD, there is only ever one related pct document, hence the single-value return.
- The ucid attribute is available here, unlike in related-documents above.
The complete listing:

    package MyCustomIndexingClass;    # subclass of Alexandria::DWH::Index::Document
    use Moose;

    ### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
    extends 'Alexandria::DWH::Index::DocumentEx';

    # override _process_source
    sub _process_source {
        my $self = shift;

        # even though we are overriding _process_source(), we still
        # want the parent class to do all the work for us
        # by calling the parent method (SUPER) ...
        $self->SUPER::_process_source();

        my $flds = $self->{_fields} || return;    # nothing to do

        my $reldocs = $self->_parse_related_documents( $self->{_source_root} );
        my $pctdoc  = $self->_parse_pct_publishing_data( $self->{_source_root} );

        if ( scalar( @{$reldocs} ) ) {
            foreach my $r ( @{$reldocs} ) {
                push( @{ $flds->{rel_ucids} }, $r );
            }
        }
        if ( $pctdoc ) {
            push( @{ $flds->{rel_ucids} }, $pctdoc );
        }
    }

    ### routine to parse related documents
    sub _parse_related_documents {
        my $self = shift;

        # the root of the source XML
        # as an XML::LibXML::Node
        my $patdoc = shift;

        my @a = ();    # stores any related-publications

        # if there are no related documents, return empty array
        my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
        return \@a if !$reldoc_node;

        foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
            if ( $relation->getAttribute('type') eq 'related-publication' ) {
                push( @a,
                    sprintf( "%s-%s-%s",
                        $relation->findvalue('./document-id/country'),
                        $relation->findvalue('./document-id/doc-number'),
                        $relation->findvalue('./document-id/kind') ) );
            }
        }
        return \@a;
    }

    ### routine to parse pct publication information
    sub _parse_pct_publishing_data {
        my $self = shift;

        # the root of the source XML
        # as an XML::LibXML::Node
        my $patdoc = shift;

        # if there is no pct publishing node, return undef
        my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
        return undef if !$pct_node;

        # return ucid (only one value available, or none)
        return $pct_node->getAttribute('ucid');
    }

    1;
