Within a document collection, searching for relevant documents via keywords is the most common and accepted method. Sifting through the results, reworking the query, and collecting and organizing what was found make up a process that most researchers have become familiar with. It is still quite analogous to manually investigating a collection of printed documents. Software essentially just helps to perform that job more efficiently.
The advent of the “search engine” is a cornerstone in the evolution of information research. In its simplest form, a search engine is used to find documents that contain some specific words. Advanced search engines such as Google are forgiving in the sense that they can yield results that don’t literally match the keywords. With such search engines usually comes the baggage of “page rank” skewing the results, which may or may not be desirable. Most database search engines, e.g. those of Wikipedia and the United States Patent and Trademark Office, offer the familiar “boolean keyword search” that is very literal (which, of course, has its own distinct value and applicability). If a researcher types in too many keywords, they end up with no matches at all. If they type in too few, the results are too numerous and too varied. So begins the reworking of the query: adding some complex combination of “AND”, “OR”, and “NOT”, parentheses, phrases, etc.
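To make the too-many/too-few trade-off concrete, here is a minimal sketch of a literal boolean “AND” search in Python. The toy corpus, the document ids, and the `boolean_and_search` helper are all invented for illustration; real database search engines are far more elaborate.

```python
# Toy corpus; the documents and ids are invented for illustration only.
DOCS = {
    "d1": "wireless charging pad for electric vehicles",
    "d2": "inductive power transfer to a moving vehicle",
    "d3": "battery charging circuit for mobile phones",
    "d4": "vehicle battery thermal management system",
}


def boolean_and_search(docs, keywords):
    """Return ids of documents that literally contain every keyword."""
    wanted = {k.lower() for k in keywords}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted <= set(text.lower().split())
    )


# One keyword: matches unrelated topics (vehicles and phones alike).
broad = boolean_and_search(DOCS, ["charging"])

# Four keywords: no single document contains all of them literally.
narrow = boolean_and_search(DOCS, ["wireless", "charging", "vehicle", "inductive"])

print(broad)
print(narrow)
```

Note that the literal match also misses `d2`, which describes vehicle charging without ever using the word “charging”. That is the kind of gap the researcher ends up patching by hand with ever more elaborate queries.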
Another key requirement for semantic search is access to clean and accurate content. Fortunately, in our case of patent data, we have been able to access IFI CLAIMS’ vast repository via their API to semantically index the entire text of each patent, while also utilizing several key structured elements such as CPC codes, citations and key dates to help with document filtering, analysis and presentation. Having access to a live and ever-evolving patent dataset is key to keeping our search index up to date and accurate.
So, assuming good data and improved algorithms, what is the best way to build the appropriate search criteria? Let’s try to answer that by first acknowledging the following scenario: a researcher enters some keywords that yield an unsatisfactory set of documents. After struggling for a while, the researcher comes upon a document that at least comes close to what they are looking for, and they discover some words in the document itself that help them develop their search criteria. Now, if the researcher could somehow leverage the entirety of that particular document as the criteria for the search, it is extremely likely that many more relevant documents could be found. A pure boolean keyword search on the whole body of text would be unlikely to yield any other matches; a completely different type of “search” is warranted.
In many document collections, the highest-quality search criterion is actually the entire text of one of the documents in the corpus. A real document in the collection (or a new one that the researcher pastes or types in full) contains far more information than what a researcher typically types as keywords. The natural language of the document and all of its inherent properties tend to shine through, if analyzed with appropriate algorithms. The effect is that the result of the search is the set of documents that are most similar or most closely related to the original document. In “complexity theory”, such a phenomenon is known as “emergence”. This emergence is the key to a natural stepping-stone in the evolution of information research: a “discovery engine”. Discovery engines can already be found in one form or another, but awareness and use of the concept are still in their infancy. To be complete, it should be noted that the search for relevant documents may still begin with a small set of keywords, but they can simply be treated as a mini document.
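As a rough illustration of using a whole document as the search criteria, the sketch below ranks a toy corpus by TF-IDF cosine similarity to one of its own documents. This is a common textbook baseline chosen here as an assumption; the text does not say which algorithms a discovery engine actually uses, and the corpus, ids, and function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus; the documents and ids are invented for illustration only.
DOCS = {
    "a": "inductive coil transfers power to a moving electric vehicle",
    "b": "a coil under the road transfers power to the vehicle battery",
    "c": "recipe for baking sourdough bread with a cast iron pot",
    "d": "electric vehicle battery is cooled by a liquid loop",
}


def tfidf_vectors(docs):
    """Map each doc id to a sparse {term: weight} TF-IDF vector."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed IDF, so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return {
        doc_id: {t: count * idf[t] for t, count in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(docs, query_id, top_k=3):
    """Rank the other documents by similarity to the document query_id."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


ranked = most_similar(DOCS, "a")
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Here the query is the full text of document `a`, with no hand-picked keywords. The ranking falls out of the overlap in vocabulary across whole documents, so the other power-transfer document comes first and the unrelated recipe comes last.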
One good example of leveraging a “discovery engine” as opposed to a “search engine” is performing a patent search. In our scenario, the researcher already has a full description of their patent. The description is submitted as the “search criteria” and the top related documents are returned. One of the top results looks relevant, so the researcher clicks “Related” on it in order to see its own top related documents; from there, they click “Related” on another, all the while accumulating relevant documents. Notice that the “search criteria” is effectively changing on the fly each time, which is very much unlike having to rework a query manually. Also notice how closely this process resembles the job of an old-time patent analyst: sorting through paper documents, reviewing each of them, acquiring others that are referenced, placing good candidates in one pile, and placing irrelevant documents in another pile. The major difference is that, with a discovery engine, a given electronic document effectively points to all of its related documents and is never out of date, unlike a paper document, which, at best, has some relevant backward document references.
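The click-through workflow described above can be sketched as a walk over precomputed “Related” lists. Everything in this example is an assumption for illustration: the `RELATED` table, the patent ids, and the `is_relevant` callback, which stands in for the researcher’s own judgment of each document.

```python
from collections import deque

# Hypothetical "Related" lists: each id maps to its top related ids.
RELATED = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1", "p5"],
    "p4": ["p2"],
    "p5": ["p3"],
}


def explore(related, start, is_relevant, max_docs=10):
    """Follow "Related" links from start, sorting documents into two piles."""
    keep, discard = [], []
    seen = {start}
    queue = deque([start])
    # Stop expanding further documents once max_docs have been kept.
    while queue and len(keep) < max_docs:
        current = queue.popleft()
        for doc_id in related.get(current, []):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            if is_relevant(doc_id):
                keep.append(doc_id)
                queue.append(doc_id)  # its own "Related" list is visited later
            else:
                discard.append(doc_id)
    return keep, discard


# The lambda simulates a researcher who rejects p3 and accepts the rest.
keep, discard = explore(RELATED, "p1", lambda doc_id: doc_id != "p3")
print(keep, discard)
```

Each accepted document becomes the next “search criteria” simply by having its own related list followed, so no query is ever rewritten by hand. Documents reachable only through a rejected one (here `p5`) are never visited.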
Enlyton’s Skylight product, through its use of document discovery, distinguishes itself from other methodologies in at least the following ways:
- It confronts the real problem, which requires an extreme amount of computing resources. (For a collection of 10 million documents, the number of document pairs whose relationship must be considered is approximately 50 trillion, which is far from trivial.)
- It forms a paradigm that is fundamentally sound and extensible. An example of extensibility is that different combinations of text can be used as the search criteria, e.g. a) multiple documents taken as a whole, b) an existing document that is augmented with some text supplied by the researcher, c) subsections of documents.
- First-rate algorithms are used so that the resulting set of related documents is of extremely high quality.
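
The pair count quoted in the first bullet can be checked with a few lines of Python. The calculation assumes that every unordered pair of distinct documents counts as one relationship, which gives n(n-1)/2 pairs for n documents.

```python
import math

n = 10_000_000  # documents in the collection

# Number of unordered pairs of distinct documents: n * (n - 1) / 2.
pairs = math.comb(n, 2)

print(f"{pairs:,}")  # 49,999,995,000,000, i.e. roughly 50 trillion
```

Because the count grows quadratically, doubling the collection roughly quadruples the number of pairs.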