This class encapsulates a single indexer document.

It is used for both indexing and retrieval.

A document consists of a number of fields. Each field has different properties when handled by the indexer (the exact behavior depends, as always, on the indexer backend in use). On retrieval, this field information is lost, and all fields are of the same type. The core indexer backend supports these field types:

  • date is a date-wrapped field suitable for use with the Date Filter.
  • keyword is stored and indexed, but not tokenized.
  • unindexed is stored but neither indexed nor tokenized.
  • unstored is not stored, but indexed and tokenized.
  • text is stored, indexed and tokenized.
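
The storage properties of these types can be summarized as a lookup table. This is only an illustrative sketch: the array name and its keys are hypothetical and not part of MidCOM, and the date type is left out because its indexed and tokenized flags are not stated here.

```php
<?php
// Hypothetical summary of the field type properties described above.
// The date type is omitted because its flags are not fully specified.
$field_types = [
    'keyword'   => ['stored' => true,  'indexed' => true,  'tokenized' => false],
    'unindexed' => ['stored' => true,  'indexed' => false, 'tokenized' => false],
    'unstored'  => ['stored' => false, 'indexed' => true,  'tokenized' => true],
    'text'      => ['stored' => true,  'indexed' => true,  'tokenized' => true],
];

// List the types whose content is stored and can therefore be read back.
$readable = array_keys(array_filter($field_types, function ($props) {
    return $props['stored'];
}));

echo implode(', ', $readable), "\n"; // keyword, unindexed, text
```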

This class should not be instantiated directly. A new instance of this class can be obtained using the midcom_services_indexer class.

A number of predefined fields are available as member fields. These fields are all meta-fields. See their individual documentation for details. All fields are mandatory unless explicitly stated otherwise and are, as always, assumed to be in the local charset.

Remember that neither date nor unstored fields are available on retrieval. For the core fields, all timestamps are therefore stored twice: once as a searchable field and once as a readable timestamp.

The class automatically passes all data to the i18n charset conversion functions, so you can work in your site's charset as usual. UTF-8 conversion is done implicitly.

package midcom.services
see \global\midcom_services_indexer
todo The Type field is not yet handled properly.

 Methods

__construct ()

Initialize the object, nothing fancy here.

_set_type (string $type)

Sets the type of the object, reflecting the inheritance hierarchy.
see \$type
see \is_a()
access protected

Parameters

$type

string: The name of this document type.

add_date (string $name, int $timestamp)

Add a date field.

A timestamp is expected, which is automatically converted to a suitable ISO timestamp before storage.

Direct specification of the ISO timestamp is not yet possible due to lacking validation outside the timestamp range.

If a field of the same name is already present, it is overwritten silently.

Parameters

$name

string: The field's name.

$timestamp

int: The timestamp to store.
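
The conversion and overwrite behavior described for add_date can be sketched as follows. This is not the MidCOM implementation: `sketch_add_date()` and the plain `$fields` array are hypothetical stand-ins, and the ISO 8601 UTC format used here is an assumption, since the format expected by the backend is not stated.

```php
<?php
// Hypothetical stand-in for the document's field list; not the MidCOM class.
$fields = [];

// Assumed conversion: UNIX timestamp to an ISO 8601 string in UTC.
// The format actually used by the indexer backend may differ.
function sketch_add_date(array &$fields, string $name, int $timestamp): void
{
    $fields[$name] = [
        'name'    => $name,
        'type'    => 'date',
        'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
    ];
}

sketch_add_date($fields, 'created', 0);
sketch_add_date($fields, 'created', 86400); // silently overwrites the first value

echo $fields['created']['content'], "\n"; // 1970-01-02T00:00:00Z
```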

add_date_pair (string $name, int $timestamp)

This is a small helper which creates a normal date field and an unindexed _TS-postfixed timestamp field at the same time.

This is useful because date fields are not stored in a readable format; it cannot even be determined that they were a date in the first place. The _TS field is therefore quite useful if you need the original value of the timestamp.

Parameters

$name

string: The field's name, "_TS" is appended for the plain-timestamp field.

$timestamp

int: The timestamp to store.
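
As a rough illustration of the pairing, the sketch below writes both fields into a plain array. `sketch_add_date_pair()` is a hypothetical helper, and both the ISO format of the date field and the string form of the _TS content are assumptions.

```php
<?php
// Hypothetical helper mirroring the described behavior; not the MidCOM method.
function sketch_add_date_pair(array &$fields, string $name, int $timestamp): void
{
    // Searchable date field (the ISO format is an assumption).
    $fields[$name] = [
        'name'    => $name,
        'type'    => 'date',
        'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
    ];
    // Readable companion field holding the original timestamp.
    $fields[$name . '_TS'] = [
        'name'    => $name . '_TS',
        'type'    => 'unindexed',
        'content' => (string) $timestamp,
    ];
}

$fields = [];
sketch_add_date_pair($fields, 'edited', 1000000000);

echo implode(', ', array_keys($fields)), "\n"; // edited, edited_TS
echo $fields['edited_TS']['content'], "\n";    // 1000000000
```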

add_keyword (string $name, string $content)

Add a keyword field.

Parameters

$name

string: The field's name.

$content

string: The field's content.

add_result (string $name, string $content)

Add a search result field. This should normally not be done manually; the indexer calls this function when creating a document out of a search result.

Parameters

$name

string: The field's name.

$content

string: The field's content, which is assumed to be UTF-8 already.

add_text (string $name, string $content)

Add a text field.

Parameters

$name

string: The field's name.

$content

string: The field's content.

add_unindexed (string $name, string $content)

Add an unindexed field.

Parameters

$name

string: The field's name.

$content

string: The field's content.

add_unstored (string $name, string $content)

Add an unstored field.

Parameters

$name

string: The field's name.

$content

string: The field's content.

fields_to_members ()

This function should be called after retrieving a document from the index.

It will populate all relevant members with the corresponding values.

get_field (string $name)

Returns the contents of the named field, or false on failure.

Parameters

$name

string: The name of the field.

Returns

mixed: The content of the field or false on failure.
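
Because the return value is either the field content or false, callers may want a strict comparison. The sketch below illustrates why, using a hypothetical `sketch_get_field()` over a plain array. It assumes that "failure" means the field does not exist, which is not confirmed here.

```php
<?php
// Hypothetical stand-in for the lookup; the real method may define
// "failure" differently. Here it means the field does not exist.
function sketch_get_field(array $fields, string $name)
{
    return isset($fields[$name]) ? $fields[$name]['content'] : false;
}

$fields = [
    'count' => ['name' => 'count', 'type' => 'keyword', 'content' => '0'],
];

// '0' is a valid value, so compare against false strictly.
var_dump(sketch_get_field($fields, 'count') === false);   // bool(false)
var_dump(sketch_get_field($fields, 'missing') === false); // bool(true)
```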

get_field_record (string $name)

Returns the complete internal field record, including type and UTF-8 encoded content.

This should normally not be used from the outside. It is geared towards the indexer backends, which need the full field information on indexing.

Parameters

$name

string: The name of the field.

Returns

Array: The full content record.

html2text (string $text)

This is a small helper that converts HTML to plain text (relatively simple).

Basically, JavaScript blocks and HTML tags are stripped, and all HTML entities are converted to their native equivalents.

Tags are replaced with a space rather than an empty string, so that constructs like <li>torben</li><li>nehmer</li> are recognized correctly.

    Parameters

    $text

    string: The HTML text to convert to plain text.

    Returns

    string: The converted text.
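
A minimal sketch of such a conversion is shown below. It is a hypothetical re-implementation for illustration, not the actual helper: the function name `sketch_html2text()`, the regular expressions, and the final whitespace collapsing are all assumptions.

```php
<?php
// Hypothetical re-implementation for illustration; the real helper may differ.
function sketch_html2text(string $html): string
{
    // Drop script blocks entirely, including their content.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $html);
    // Replace every remaining tag with a space so adjacent words stay separate.
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    // Convert HTML entities to their native characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse the whitespace introduced by the replacements (an assumption).
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo sketch_html2text('<ul><li>torben</li><li>nehmer</li></ul> &amp; co'), "\n";
// torben nehmer & co
```

Replacing tags with an empty string instead would yield "torbennehmer", which is the case the space replacement avoids.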

    is_a (string $document_type)

    Checks whether the given document is an instance of the given document type.

    This is equivalent to the is_a object hierarchy check, except that it works with MidCOM documents.

    see \$type
    see \_set_type()

    Parameters

    $document_type

    string: The base type to search for.

    Returns

    boolean: Indicating relationship.

    list_fields ()

    Returns a list of all defined fields.

    Returns

    Array: Fieldname list.

    members_to_fields ()

    This will translate all member variables into appropriate field records. The backend should call this immediately before indexing.

    This call will automatically populate indexed with time() and author with the name of the creator (if set).

    remove_field (string $name)

    Remove a field from the list.

    Nonexistent fields are ignored silently.

    Parameters

    $name

    string: The name of the field.

    _add_field (string $name, string $type, string $content, boolean $is_utf8)

    Internal helper which actually stores a field.

    Parameters

    $name

    string: The field's name.

    $type

    string: The field's type.

    $content

    string: The field's content.

    $is_utf8

    boolean: Set this to true explicitly to override charset conversion and assume $content is UTF-8 already.

     Properties

     

    string $RI

    The Resource Identifier of this document.

    Must be UTF-8 on assignment already.

    This field is mandatory.

     

    string $abstract

    The abstract of the document

    This is optional.

     

    boolean $actually_index

    This exists to support #651 without rewriting all components' index methods.

    If set to false, the indexer backend will silently skip this document.

    see http://trac.midgard-project.org/ticket/651
     

    string $author

    The author of the document

    This is optional.

     

    string $component

    The name of the component responsible for the document.

    May be empty for non-midgard resources.

    This field is mandatory.

     

    string $content

    The content of the document

    This is mandatory.

    This field is empty on documents retrieved from the index.

     

    int $created

    The time of document creation, this is a UNIX timestamp.

    This field is mandatory.

     

    \MidgardPerson $creator

    The MidgardPerson who created the object.

    This is optional.

     

    string $document_url

    The fully qualified URL to the document, this should be a PermaLink.

    This field is mandatory.

     

    int $edited

    The time of the last document modification, this is a UNIX timestamp.

    This field is mandatory.

     

    \MidgardPerson $editor

    The MidgardPerson who modified the object the last time.

    This is optional.

     

    int $indexed

    The timestamp of indexing.

    This field is added automatically and is to be considered read-only.

     

    string $lang

    Two letter language code of the document content

    This field is optional.

     

    double $score

    This is the score of this document.

    Only populated on resultset documents, of course.

     

    string $security

    Security mechanism used to determine the availability of a search result.

    Can be one of:

    • 'default': Use only built-in processing (topic and metadata visibility checks), this is, as you might have guessed, the default.
    • 'component': Invoke the _on_check_document_visible component interface method of the component after doing default checks. This security class absolutely requires the document to contain a valid topic GUID, otherwise access control will fail anyway.
    • 'function:$function_name': Invoke the globally available function $function_name. Its signature is boolean $function_name ($document, $topic). If you don't change the document during the check, you don't need to pass it by reference, so this is up to you. The topic passed is the Return true if the document is visible, false otherwise.
    • 'class:$class_name': Like above, but using a class instead. The class must provide a statically callable get_instance() method, which returns a usable instance of the class (mostly, this should be a singleton, for performance reasons). The instance returned is assigned by-reference. On that object, the method check_document_permissions is invoked; its signature must be identical to that of the function callback.
    see \midcom_baseclasses_components_interface::_on_check_document_permissions()
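
The two callback variants could look roughly like the sketch below. The names `my_visibility_check` and `my_visibility_checker` and the title-based rule are invented for illustration. Only the signature, the get_instance() method, and the check_document_permissions method name come from the description above. The stdClass objects merely stand in for a real document and topic.

```php
<?php
// Function callback, referenced as 'function:my_visibility_check'.
// The name and the check logic are made up for this sketch.
function my_visibility_check($document, $topic): bool
{
    // Example rule: hide documents without a title.
    return $document->title !== '';
}

// Class callback, referenced as 'class:my_visibility_checker'.
class my_visibility_checker
{
    private static $instance = null;

    // Statically callable accessor returning a shared (singleton) instance.
    public static function get_instance(): self
    {
        if (self::$instance === null) {
            self::$instance = new self();
        }
        return self::$instance;
    }

    // Same signature as the function callback.
    public function check_document_permissions($document, $topic): bool
    {
        return my_visibility_check($document, $topic);
    }
}

// Stand-in objects; a real call would receive an indexer document and a topic.
$document = new stdClass();
$document->title = 'Hello';
$topic = new stdClass();

$checker = my_visibility_checker::get_instance();
var_dump($checker->check_document_permissions($document, $topic)); // bool(true)
```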
     

    string $source

    An additional tag indicating the source of the document for use by the component doing the indexing.

    This value is not indexed and should not be used by anybody except the component doing the indexing.

    This is optional.

     

    string $title

    The title of the document

    This is mandatory.

     

    string $topic_guid

    The GUID of the topic the document is assigned to.

    May be empty for non-midgard resources.

    This field is mandatory.

     

    string $topic_url

    The full path to the topic that houses the document.

    For external resources, this should be either a MidCOM topic to which this resource is associated, or some "directory" by which you could filter. You may also leave it empty, which prevents the document from appearing in any topic-specific search.

    The value should be fully qualified, as returned by MIDCOM_NAV_FULLURL, including a trailing slash, for example https://host/path/to/topic/

    This is optional.

     

    string $type

    The type of the document, set by subclasses and added to the index automatically.

    The type must reflect the original type hierarchy. It is to be set using the $this->_set_type call after initializing the base class.

    see \is_a()
    see \_set_type
     

    \midcom_services_i18n $_i18n

    A reference to the i18n service, used for charset conversion.
     

    Array $_fields

    An associative array containing all fields of the current document.

    Each field is indexed by its name (a string). The value is another array containing the fields "name", "type" and "content".
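
For orientation, the described layout might look like the following. The field names, types, and contents are made up for this sketch. Only the three record keys are taken from the description.

```php
<?php
// Illustrative shape only; the field names and contents are invented.
$_fields = [
    'title' => ['name' => 'title', 'type' => 'text',    'content' => 'Example page'],
    'lang'  => ['name' => 'lang',  'type' => 'keyword', 'content' => 'en'],
];

// Walk the records the way a backend might when building an index entry.
foreach ($_fields as $name => $record) {
    printf("%s (%s): %s\n", $name, $record['type'], $record['content']);
}
```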