\midcom\datamanager\indexerdocument

This class eases the indexing of datamanager-driven documents. The user invoking the indexing must have full read permissions to the object.

Basic indexing operation

This class uses a number of conventions (see below) to merge an existing datamanager-driven document into an indexing-capable document. It requires the caller to instantiate the datamanager, as this class has no way of knowing where to take the schema database from.

The RI (the GUID) from the base class is left untouched.

Indexing field defaults:

Unless you specify anything else explicitly in the schema, the class will merge all text based fields together to form the content field of the index record, to allow for easy searching of the document. This will not include any metadata like keywords or summaries.

If the schema contains a field abstract, it will also be used as the abstract field for the indexing process. In the same way, fields named title or author will be used for the index document's title or author respectively. The contents of abstract, title and author will also be appended to the content field at the end of the object construction, which eases searching over these fields.

If no abstract field is present, the first 200 characters of the content area are used instead.
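
The defaults described above can be pictured with a small standalone sketch. It does not use the MidCOM classes: the function names resolve_auto_method_sketch() and default_abstract_sketch() are hypothetical and only mirror the rules stated in this section, not the class's actual code.

```php
<?php
// Hypothetical helper, not part of MidCOM: maps a schema field name to the
// document field it would feed under the default rules described above.
function resolve_auto_method_sketch(string $field_name): string
{
    // Fields named abstract, title or author go to the matching document
    // field; every other field is merged into content.
    if (in_array($field_name, ['abstract', 'title', 'author'], true)) {
        return $field_name;
    }
    return 'content';
}

// Hypothetical helper: falls back to the first 200 characters of the
// content when no abstract is available.
function default_abstract_sketch(string $abstract, string $content): string
{
    if ($abstract !== '') {
        return $abstract;
    }
    // Prefer the multibyte-safe variant when the mbstring extension is loaded.
    return function_exists('mb_substr')
        ? mb_substr($content, 0, 200)
        : substr($content, 0, 200);
}
```

With these helpers, a field named "description" would resolve to content, while a field named "title" would resolve to title.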

Not all types can be indexed; check the documentation of the types in question for their indexing capabilities. In general, if the system has to index a non-text field, it will use the CSV representation for implicit conversion.

Metadata processing is done by the base class.

Document title:

You should either have an auto-indexed title field, or an assortment of other fields manually assigned to index to the title field.

Configurability using the Datamanager schema:

You can decorate datamanager fields with various directives influencing the indexing. See the Datamanager's schema documentation for details. Basically, you can choose from the following indexing methods using the key 'index_method' for each field:

  • The default auto mode will use the above guidelines to determine the indexing destination automatically, adding data to the content, abstract, title and author fields respectively.
  • You can specify abstract, content, title or author to indicate that the field should be used for the indicated document fields. The content selector may be specified more than once, indicating that the content of the relevant fields should be merged.
  • Any date field can be indexed into its own, range-filterable field using the date method. In this case, two document fields are created: one named directly after the schema field, containing the filterable timestamp, and a second one with the _TS postfix, set as noindex and containing the plain timestamp.
  • Finally, you can explicitly index a field as a separate document field using one of the field types keyword, unindexed, unstored or text. You can further control whether the content of these fields is also added to the main content field. This is useful if you want fields to be searchable both by explicit field specification and through the default field for simpler searches. This is controlled by the boolean key 'index_merge_with_content' in the field, which defaults to true.
  • noindex will prevent indexing of this field.
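
As a rough illustration of the directives listed above, the following fragment shows how they might look in a set of field definitions. Only the keys 'index_method' and 'index_merge_with_content' and their values come from this page; the field names and the surrounding array layout are assumptions, so consult the Datamanager schema documentation for the real structure.

```php
<?php
// Hypothetical field definitions. Only 'index_method' and
// 'index_merge_with_content' are taken from the text above; the field
// names and the array layout are made up for illustration.
$fields = [
    'headline'      => ['index_method' => 'title'],
    'teaser'        => ['index_method' => 'abstract'],
    'body'          => ['index_method' => 'content'],
    // A second content field: its text is merged with that of 'body'.
    'sidebar'       => ['index_method' => 'content'],
    // Creates the document fields published and published_TS.
    'published'     => ['index_method' => 'date'],
    // Separate keyword field that is kept out of the main content field.
    'category'      => [
        'index_method' => 'keyword',
        'index_merge_with_content' => false,
    ],
    // Never indexed.
    'internal_note' => ['index_method' => 'noindex'],
];
```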

The document's type is "midcom_datamanager".

Summary

Methods: __construct(), members_to_fields(), get_field(), get_field_record(), list_fields(), remove_field(), add_date(), add_date_pair(), add_keyword(), add_unindexed(), add_unstored(), add_text(), add_result(), fields_to_members(), html2text(), is_a(), read_metadata_from_object(), _add_field(), _set_type(), process_topic(), _process_metadata(), add_person(), read_unixtime(), read_person(), read_authorname(), complete_fields(), process_datamanager(), add_as_date_field(), resolve_auto_method()

Properties: $score, $RI, $lang, $topic_guid, $component, $document_url, $created, $edited, $indexed, $creator, $editor, $title, $content, $abstract, $author, $source, $topic_url, $type, $actually_index, $_metadata, $_i18n, $_fields, $datamanager

Constants: No constants found

Properties

$score

$score : double

This is the score of this document. Only populated on resultset documents, of course.

Type

double

$RI

$RI : string

The Resource Identifier of this document.

Must be UTF-8 on assignment already.

This field is mandatory.

Type

string

$lang

$lang : string

Two letter language code of the document content

This field is optional.

Type

string

$topic_guid

$topic_guid : string

The GUID of the topic the document is assigned to.

May be empty for non-midgard resources.

This field is mandatory.

Type

string — GUID

$component

$component : string

The name of the component responsible for the document.

May be empty for non-midgard resources.

This field is mandatory.

Type

string

$document_url

$document_url : string

The fully qualified URL to the document. This should be a PermaLink.

This field is mandatory.

Type

string

$created

$created : integer

The time of document creation, this is a UNIX timestamp.

This field is mandatory.

Type

integer

$edited

$edited : integer

The time of the last document modification, this is a UNIX timestamp.

This field is mandatory.

Type

integer

$indexed

$indexed : integer

The timestamp of indexing.

This field is added automatically and to be considered read-only.

Type

integer

$creator

$creator : \midcom_db_person

The MidgardPerson who created the object.

This is optional.

Type

\midcom_db_person

$editor

$editor : \midcom_db_person

The MidgardPerson who modified the object the last time.

This is optional.

Type

\midcom_db_person

$title

$title : string

The title of the document

This is mandatory.

Type

string

$content

$content : string

The content of the document

This is mandatory.

This field is empty on documents retrieved from the index.

Type

string

$abstract

$abstract : string

The abstract of the document

This is optional.

Type

string

$author

$author : string

The author of the document

This is optional.

Type

string

$source

$source : string

An additional tag indicating the source of the document for use by the component doing the indexing.

This value is not indexed and should not be used by anybody except the component doing the indexing.

This is optional.

Type

string

$topic_url

$topic_url : string

The full path to the topic that houses the document.

For external resources, this should be either a MidCOM topic to which the resource is associated, or some "directory" by which you could filter. You may also leave it empty, which prevents the document from appearing in any topic-specific search.

The value should be fully qualified, as returned by MIDCOM_NAV_FULLURL, including a trailing slash, for example https://host/path/to/topic/

This is optional.

Type

string

$type

$type : string

The type of the document, set by subclasses and added to the index automatically.

The type must reflect the original type hierarchy. It is to be set using the $this->_set_type call after initializing the base class.

Type

string

$actually_index

$actually_index : boolean

This flag exists to support #651 without rewriting all components' index methods.

If set to false the indexer backend will silently skip this document.

Type

boolean

$_metadata

$_metadata : \midcom_helper_metadata

The metadata instance attached to the object to be indexed.

Type

\midcom_helper_metadata

$_i18n

$_i18n : \midcom_services_i18n

The i18n service, used for charset conversion.

Type

\midcom_services_i18n

$_fields

$_fields : Array

An associative array containing all fields of the current document.

Each field is indexed by its name (a string). The value is another array containing the fields "name", "type" and "content".

Type

Array
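
To make the record layout concrete, here is a minimal sketch of what such an array could look like. The keys "name", "type" and "content" come from the description above; the sample values, and the assumption that type names match the add_* method names, are illustrative only.

```php
<?php
// Illustrative layout of the field list; all values are made up.
$fields = [
    'title' => [
        'name'    => 'title',
        'type'    => 'text',       // assumed to mirror add_text()
        'content' => 'Annual report',
    ],
    'category' => [
        'name'    => 'category',
        'type'    => 'keyword',    // assumed to mirror add_keyword()
        'content' => 'finance',
    ],
];

// The field names are simply the keys of the outer array.
$names = array_keys($fields);
```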

$datamanager

$datamanager : \midcom\datamanager\datamanager

The datamanager instance of the document we need to index.

Type

\midcom\datamanager\datamanager

Methods

__construct()

__construct(\midcom\datamanager\datamanager  $datamanager) 

The constructor initializes the member variables and invokes process_datamanager, which will read and process the information out of that instance.

The document is ready for indexing after construction. On any critical error, midcom_error is triggered.

Parameters

\midcom\datamanager\datamanager $datamanager

The fully initialized datamanager instance to use

members_to_fields()

members_to_fields() 

This will translate all member variables into appropriate field records. The backend should call this immediately before indexing.

This call will automatically populate indexed with time() and author with the name of the creator (if set).

get_field()

get_field(string  $name) : mixed

Returns the contents of the field name or false on failure.

Parameters

string $name

The name of the field.

Returns

mixed —

The content of the field or false on failure.

get_field_record()

get_field_record(string  $name) : Array

Returns the complete internal field record, including type and UTF-8 encoded content.

This should normally not be used from the outside, it is geared towards the indexer backends, which need the full field information on indexing.

Parameters

string $name

The name of the field.

Returns

Array —

The full content record.

list_fields()

list_fields() : Array

Returns a list of all defined fields.

Returns

Array —

Fieldname list.

remove_field()

remove_field(string  $name) 

Remove a field from the list. Nonexistent fields are ignored silently.

Parameters

string $name

The name of the field.

add_date()

add_date(string  $name, integer  $timestamp) 

Add a date field. A timestamp is expected, which is automatically converted to a suitable ISO timestamp before storage.

Direct specification of the ISO timestamp is not yet possible due to lacking validation outside the timestamp range.

If a field of the same name is already present, it is overwritten silently.

Parameters

string $name

The field's name.

integer $timestamp

The timestamp to store.

add_date_pair()

add_date_pair(string  $name, integer  $timestamp) 

Create a normal date field and an unindexed _TS-postfixed timestamp field at the same time.

This is useful because date fields are not stored in a readable format; it cannot even be determined that they were dates in the first place. The _TS field is therefore quite useful if you need the original value of the timestamp.

Parameters

string $name

The field's name, "_TS" is appended for the plain-timestamp field.

integer $timestamp

The timestamp to store.
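
The pairing can be sketched as a standalone function. This is not the class's code: date_pair_sketch() is a hypothetical name, the ISO 8601 UTC format is an assumption (the exact stored format is not specified here), and the type names are guessed from the add_* method names.

```php
<?php
// Hypothetical stand-in for the pair of fields described above.
// Assumption: the date field holds an ISO 8601 UTC string; the exact
// format stored by the real class is not specified on this page.
function date_pair_sketch(string $name, int $timestamp): array
{
    return [
        // The date field, named directly after $name.
        $name => [
            'name'    => $name,
            'type'    => 'date',
            'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
        ],
        // The companion field keeps the plain timestamp for later use.
        $name . '_TS' => [
            'name'    => $name . '_TS',
            'type'    => 'unindexed',
            'content' => (string) $timestamp,
        ],
    ];
}
```

Calling date_pair_sketch('published', $timestamp) yields two records keyed published and published_TS.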

add_keyword()

add_keyword(string  $name, string  $content) 

Add a keyword field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_unindexed()

add_unindexed(string  $name, string  $content) 

Add an unindexed field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_unstored()

add_unstored(string  $name, string  $content) 

Add an unstored field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_text()

add_text(string  $name, string  $content) 

Add a text field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_result()

add_result(string  $name, string  $content) 

Add a search result field. This should normally not be done manually; the indexer will call this function when creating a document out of a search result.

Parameters

string $name

The field's name.

string $content

The field's content, which is assumed to be UTF-8 already

fields_to_members()

fields_to_members() 

Populate all relevant members with the respective values after retrieving a document from the index

html2text()

html2text(string  $text) : string

Convert HTML to plain text (relatively simple):

Basically, JavaScript blocks and HTML Tags are stripped, and all HTML Entities are converted to their native equivalents.

Don't replace with an empty string but with a space, so that constructs like <li>torben</li><li>nehmer</li> are recognized correctly.

Parameters

string $text

The HTML text to convert.

Returns

string —

The converted text.
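
The conversion described above might look roughly like the following. This is a hypothetical re-implementation for illustration, not the method's actual code; html2text_sketch() is an invented name, and the final whitespace collapsing is an extra step added in this sketch.

```php
<?php
// Hypothetical re-implementation, for illustration only.
function html2text_sketch(string $text): string
{
    // Drop JavaScript blocks together with their content.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $text);
    // Replace every remaining tag with a space, so that adjacent list
    // items or paragraphs do not run together into one word.
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    // Convert entities such as &amp; to their native characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace runs (a tidy-up step specific to this sketch).
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Replacing tags with a space rather than an empty string is what keeps the two list items in the example above as separate words.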

is_a()

is_a(string  $document_type) : boolean

Checks whether the given document is an instance of given document type.

This is equivalent to the is_a object hierarchy check, except that it works with MidCOM documents.

Parameters

string $document_type

The base type to search for.

Returns

boolean —

Indicating relationship.
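
One way such a check could work is sketched below. It rests on an assumption that is not confirmed on this page: that a type string starts with the names of its parent types, as the type "midcom_datamanager" suggests. The function name document_is_a_sketch() is hypothetical.

```php
<?php
// Hypothetical check. Assumption: a subtype's type string begins with the
// type string of its parent, for example "midcom" -> "midcom_datamanager".
function document_is_a_sketch(string $type, string $base_type): bool
{
    // True when $type starts with $base_type.
    return strncmp($type, $base_type, strlen($base_type)) === 0;
}
```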

read_metadata_from_object()

read_metadata_from_object(\midgard_object  $object) 

Tries to resolve created, revised, author, editor and creator for the document from the Midgard object.

Parameters

\midgard_object $object

The object to use as the source for the information.

_add_field()

_add_field(string  $name, string  $type, string  $content, boolean  $is_utf8 = false) 

Internal helper which actually stores a field.

Parameters

string $name

The field's name.

string $type

The field's type.

string $content

The field's content.

boolean $is_utf8

Set this to true explicitly, to override charset conversion and assume $content is UTF-8 already.

_set_type()

_set_type(string  $type) 

Sets the type of the object, reflecting the inheritance hierarchy.

Parameters

string $type

The name of this document type

process_topic()

process_topic() 

Tries to determine the topic GUID and component using NAP's reverse-lookup capabilities.

If this fails, you have to set the members $topic_guid, $topic_url and $component manually.

_process_metadata()

_process_metadata() 

Processes the information contained in the metadata instance.

add_person()

add_person(string  $name, \midcom_db_person  $person) 

Add a person field.

Parameters

string $name

The field's name.

\midcom_db_person $person

The field's content.

read_unixtime()

read_unixtime(string  $stamp) : integer

Uses heuristics to determine how to convert the given timestamp to a local unixtime.

Parameters

string $stamp

ISO or unix datetime

Returns

integer —

unixtime
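
A heuristic of this kind could be sketched as follows. The function read_unixtime_sketch() is hypothetical; the rule that purely numeric input is already a UNIX timestamp, and the fallback to 0 for unparseable input, are choices made for this sketch rather than documented behavior of the method.

```php
<?php
// Hypothetical heuristic, shown for illustration only.
function read_unixtime_sketch(string $stamp): int
{
    // Purely numeric input is taken to be a UNIX timestamp already.
    if (preg_match('/^\d+$/', $stamp) === 1) {
        return (int) $stamp;
    }
    // Anything else is handed to strtotime(); in this sketch,
    // unparseable input results in 0.
    $parsed = strtotime($stamp);
    return $parsed === false ? 0 : $parsed;
}
```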

read_person()

read_person(string  $id) : \midcom_db_person

Gets the person for the given ID and caches the result.

Parameters

string $id

GUID or ID to get person for

Returns

\midcom_db_person

object

read_authorname()

read_authorname(string  $id) : string

Gets the person's name for the given ID. If the ID is an imploded_wrapped list of multiple GUIDs, the first one is used.

Parameters

string $id

GUID or ID to get person for

Returns

string —

$author->name

complete_fields()

complete_fields() 

Completes all fields which are not yet complete:

content is completed with author, title and, if necessary, abstract.

The title is set to the document's URL if no title is set yet. The title is not added to the content field in that case.
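
The completion rules can be illustrated on a plain array. This sketch is not the method's implementation: complete_fields_sketch() is a hypothetical name, the array keys stand in for the document members, and the newline separator is an assumption.

```php
<?php
// Hypothetical illustration of the completion rules on a plain array
// with the keys content, author, title, abstract and document_url.
function complete_fields_sketch(array $doc): array
{
    // Append author, title and abstract to content so that they become
    // searchable through the default field. Empty values are skipped.
    foreach (['author', 'title', 'abstract'] as $key) {
        if (($doc[$key] ?? '') !== '') {
            $doc['content'] .= "\n" . $doc[$key];
        }
    }
    // Fall back to the URL as title only after merging, so that the
    // substituted title stays out of the content field.
    if (($doc['title'] ?? '') === '') {
        $doc['title'] = $doc['document_url'];
    }
    return $doc;
}
```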

process_datamanager()

process_datamanager() 

Processes the information contained in the datamanager instance.

The function iterates over the fields in the schema, and processes them according to the rules given in the introduction.

add_as_date_field()

add_as_date_field(\Symfony\Component\Form\FormView  $field) 

This function tries to convert the $field into a date representation. Unixdate fields are used directly (localtime is used, not GMT); other fields are parsed with strtotime.

Invalid strings which are not parseable using strtotime will be stored as a "0" timestamp.

Be aware that this only works for dates within the range of a UNIX timestamp. For all other cases you should use an ISO 8601 representation, which should work as well with Lucene range queries.

Parameters

\Symfony\Component\Form\FormView $field

The field that should be stored

resolve_auto_method()

resolve_auto_method(string  $name) : string

Parameters

string $name

The field name

Returns

string —

index method