\midcom_services_indexer_document

This class encapsulates a single indexer document. It is used for both indexing and retrieval.

A document consists of a number of fields. Each field has different properties when handled by the indexer; the exact behavior depends on the indexer backend in use. On retrieval, this field information is lost, and all fields are of the same type. The core indexer backend supports these field types:

date is a date-wrapped field suitable for use with the Date Filter.
keyword is stored and indexed, but not tokenized.
unindexed is stored but neither indexed nor tokenized.
unstored is not stored, but indexed and tokenized.
text is stored, indexed and tokenized.
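
The storage behavior listed above can be restated as a small lookup table. This is only a sketch: the variable names and flag keys are hypothetical and not part of MidCOM, and the date type is left out because its flags are not spelled out in the list.

```php
<?php
// Hypothetical lookup table, not part of MidCOM: it restates the field type
// list above as stored/indexed/tokenized flags.
$field_type_flags = [
    'keyword'   => ['stored' => true,  'indexed' => true,  'tokenized' => false],
    'unindexed' => ['stored' => true,  'indexed' => false, 'tokenized' => false],
    'unstored'  => ['stored' => false, 'indexed' => true,  'tokenized' => true],
    'text'      => ['stored' => true,  'indexed' => true,  'tokenized' => true],
];

// Field types whose content can be read back from a search result.
$retrievable = array_keys(array_filter(
    $field_type_flags,
    function (array $flags) {
        return $flags['stored'];
    }
));
```

Here $retrievable ends up holding keyword, unindexed and text, which matches the note that unstored fields are not available on retrieval.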

This class should not be instantiated directly. A new instance can be obtained using the midcom_services_indexer class.

A number of predefined fields are available as member variables. These fields are all meta-fields; see their individual documentation for details. Unless explicitly stated otherwise, all fields are mandatory and are assumed to be in the local charset.

Remember that neither date nor unstored fields are available on retrieval. For the core fields, all timestamps are therefore stored twice: once as a searchable field and once as a readable timestamp.

The class automatically passes all data through the i18n charset conversion functions, so you can work in your site's charset as usual. Conversion to UTF-8 is done implicitly.

Summary

Public methods

__construct()
get_field()
get_field_record()
list_fields()
remove_field()
add_date()
add_date_pair()
add_keyword()
add_unindexed()
add_unstored()
add_text()
add_result()
members_to_fields()
fields_to_members()
html2text()
is_a()
read_metadata_from_object()

Public properties

$score
$RI
$lang
$topic_guid
$component
$document_url
$created
$edited
$indexed
$creator
$editor
$title
$content
$abstract
$author
$source
$topic_url
$type
$actually_index

Constants

No constants found

Protected methods

_add_field()
_set_type()
process_topic()

Protected properties

$_i18n

Private methods

add_person()
read_unixtime()
read_person()
read_authorname()

Private properties

$_fields

File: lib/midcom/services/indexer/document.php
Package: midcom.services
Class hierarchy: \midcom_services_indexer_document
See also: \midcom_services_indexer

Properties

$score

$score : double

The score of this document. It is only populated on documents that come from a result set.

Type

double

$RI

$RI : string

The Resource Identifier of this document.

Must be UTF-8 on assignment already.

This field is mandatory.

Type

string

$lang

$lang : string

Two-letter language code of the document content.

This field is optional.

Type

string

$topic_guid

$topic_guid : string

The GUID of the topic the document is assigned to.

May be empty for non-midgard resources.

This field is mandatory.

Type

string — GUID

$component

$component : string

The name of the component responsible for the document.

May be empty for non-midgard resources.

This field is mandatory.

Type

string

$document_url

$document_url : string

The fully qualified URL to the document. This should be a PermaLink.

This field is mandatory.

Type

string

$created

$created : integer

The time of document creation, as a UNIX timestamp.

This field is mandatory.

Type

integer

$edited

$edited : integer

The time of the last document modification, as a UNIX timestamp.

This field is mandatory.

Type

integer

$indexed

$indexed : integer

The timestamp of indexing.

This field is added automatically and should be considered read-only.

Type

integer

$creator

$creator : \midcom_db_person

The MidgardPerson who created the object.

This is optional.

Type

\midcom_db_person

$editor

$editor : \midcom_db_person

The MidgardPerson who modified the object the last time.

This is optional.

Type

\midcom_db_person

$title

$title : string

The title of the document.

This is mandatory.

Type

string

$content

$content : string

The content of the document.

This is mandatory.

This field is empty on documents retrieved from the index.

Type

string

$abstract

$abstract : string

The abstract of the document.

This is optional.

Type

string

$author

$author : string

The author of the document.

This is optional.

Type

string

$source

$source : string

An additional tag indicating the source of the document for use by the component doing the indexing.

This value is not indexed and should not be used by anybody except the component doing the indexing.

This is optional.

Type

string

$topic_url

$topic_url : string

The full path to the topic that houses the document.

For external resources, this should be either a MidCOM topic with which the resource is associated, or some "directory" by which you could filter. You may also leave it empty, which prevents the document from appearing in any topic-specific search.

The value should be fully qualified, as returned by MIDCOM_NAV_FULLURL, including a trailing slash, for example https://host/path/to/topic/

This is optional.

Type

string

$type

$type : string

The type of the document, set by subclasses and added to the index automatically.

The type must reflect the original type hierarchy. It is to be set using the $this->_set_type call after initializing the base class.

Type

string

$actually_index

$actually_index : boolean

This exists to support #651 without rewriting all components' index methods.

If set to false the indexer backend will silently skip this document.

Type

boolean

$_i18n

$_i18n : \midcom_services_i18n

The i18n service, used for charset conversion.

Type

\midcom_services_i18n

$_fields

$_fields : Array

An associative array containing all fields of the current document.

Each field is indexed by its name (a string). The value is another array containing the fields "name", "type" and "content".

Type

Array
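
As an illustration of the record layout described above, the following sketch builds one field record and files it under its name. The helper function and the sample values are hypothetical; only the three keys "name", "type" and "content" come from the description.

```php
<?php
// Hypothetical helper, not part of MidCOM: builds one field record in the
// shape described above ("name", "type" and "content").
function sketch_make_field_record(string $name, string $type, string $content): array
{
    return ['name' => $name, 'type' => $type, 'content' => $content];
}

// The field map is keyed by field name.
$fields = [];
$record = sketch_make_field_record('title', 'text', 'Hello world');
$fields[$record['name']] = $record;
```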

Methods

__construct()

__construct()

Initialize the object, nothing fancy here.

get_field()

get_field(string  $name) : mixed

Returns the content of the named field, or false on failure.

Parameters

string	$name	The name of the field.

Returns

mixed	The content of the field, or false on failure.

get_field_record()

get_field_record(string  $name) : Array

Returns the complete internal field record, including type and UTF-8 encoded content.

This should normally not be used from the outside. It is geared towards the indexer backends, which need the full field information on indexing.

Parameters

string	$name	The name of the field.

Returns

Array	The full content record.

list_fields()

list_fields() : Array

Returns a list of all defined fields.

Returns

Array	Fieldname list.

remove_field()

remove_field(string  $name)

Remove a field from the list. Nonexistent fields are ignored silently.

Parameters

string	$name	The name of the field.
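
The lookup, listing and removal behavior described for get_field(), list_fields() and remove_field() can be sketched with a small stand-in class. Everything here is hypothetical and not the MidCOM implementation; in particular, treating "failure" in get_field() as "field does not exist" and overwriting on every add are assumptions of this sketch.

```php
<?php
// Hypothetical stand-in for the document class, reduced to field bookkeeping.
class sketch_field_store
{
    private $fields = [];

    // Overwrites an existing field of the same name silently (documented for
    // add_date(); assumed here for all field types).
    public function add(string $name, string $type, string $content): void
    {
        $this->fields[$name] = ['name' => $name, 'type' => $type, 'content' => $content];
    }

    // Content of the field, or false if it does not exist (assumption).
    public function get_field(string $name)
    {
        return isset($this->fields[$name]) ? $this->fields[$name]['content'] : false;
    }

    // Names of all fields currently defined.
    public function list_fields(): array
    {
        return array_keys($this->fields);
    }

    // Nonexistent names are ignored silently.
    public function remove_field(string $name): void
    {
        unset($this->fields[$name]);
    }
}

$store = new sketch_field_store();
$store->add('title', 'text', 'Hello');
$store->add('lang', 'keyword', 'en');
$store->remove_field('missing'); // no error for unknown names
$store->remove_field('lang');
```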

add_date()

add_date(string  $name, integer  $timestamp)

Add a date field. A timestamp is expected, which is automatically converted to a suitable ISO timestamp before storage.

Direct specification of the ISO timestamp is not yet possible due to lacking validation outside the timestamp range.

If a field of the same name is already present, it is overwritten silently.

Parameters

string	$name	The field's name.
integer	$timestamp	The timestamp to store.

add_date_pair()

add_date_pair(string  $name, integer  $timestamp)

Create a normal date field and an unindexed _TS-postfixed timestamp field at the same time.

This is useful because date fields are not stored in a readable format; it cannot even be determined that they were dates in the first place. The _TS field is therefore useful if you need the original timestamp value.

Parameters

string	$name	The field's name, "_TS" is appended for the plain-timestamp field.
integer	$timestamp	The timestamp to store.
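
A minimal sketch of the date pair idea follows. The function name is hypothetical, and the ISO layout used for the date field is an assumption, since the actual format depends on the indexer backend.

```php
<?php
// Hypothetical sketch of the "date pair" idea; the ISO layout used here is an
// assumption, the real format depends on the indexer backend.
function sketch_date_pair(string $name, int $timestamp): array
{
    return [
        // Searchable date field, stored as an ISO-style UTC string.
        $name => [
            'name' => $name,
            'type' => 'date',
            'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
        ],
        // Readable companion field holding the original UNIX timestamp.
        "{$name}_TS" => [
            'name' => "{$name}_TS",
            'type' => 'unindexed',
            'content' => (string) $timestamp,
        ],
    ];
}

$pair = sketch_date_pair('created', 0);
```

The companion field keeps the raw value, so a consumer of a search result can read created_TS without having to parse the date field.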

add_keyword()

add_keyword(string  $name, string  $content)

Add a keyword field.

Parameters

string	$name	The field's name.
string	$content	The field's content.

add_unindexed()

add_unindexed(string  $name, string  $content)

Add an unindexed field.

Parameters

string	$name	The field's name.
string	$content	The field's content.

add_unstored()

add_unstored(string  $name, string  $content)

Add an unstored field.

Parameters

string	$name	The field's name.
string	$content	The field's content.

add_text()

add_text(string  $name, string  $content)

Add a text field.

Parameters

string	$name	The field's name.
string	$content	The field's content.

add_result()

add_result(string  $name, string  $content)

Add a search result field. This should normally not be done manually; the indexer calls this function when creating a document out of a search result.

Parameters

string	$name	The field's name.
string	$content	The field's content, which is assumed to be UTF-8 already.

members_to_fields()

members_to_fields()

This translates all member variables into appropriate field records. The backend should call it immediately before indexing.

This call will automatically populate indexed with time() and author with the name of the creator (if set).

fields_to_members()

fields_to_members()

Populates all relevant members with the respective values after retrieving a document from the index.

html2text()

html2text(string  $text) : string

Converts HTML to plain text in a relatively simple way.

Basically, JavaScript blocks and HTML tags are stripped, and all HTML entities are converted to their native equivalents.

Tags are replaced with a space rather than an empty string, so that words in adjacent elements, such as "torben" and "nehmer", are still recognized as separate words.

Parameters

string	$text	The HTML text to convert to plain text.

Returns

string	The converted text.
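
The conversion steps described for html2text() can be sketched as follows. This is a hypothetical re-implementation for illustration, not the MidCOM code; the regular expressions, the final whitespace collapsing and the sample markup in the usage note are assumptions.

```php
<?php
// Hypothetical re-implementation for illustration, not the MidCOM code.
function sketch_html2text(string $text): string
{
    // Drop script blocks including their content.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $text);
    // Replace every remaining tag with a space so neighbouring words stay apart.
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    // Convert entities such as &amp; to the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse the whitespace introduced by the replacements above.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

With this sketch, two adjacent list items containing "torben" and "nehmer" come out as "torben nehmer" rather than as one fused word.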

is_a()

is_a(string  $document_type) : boolean

Checks whether this document is an instance of the given document type.

This is equivalent to the is_a object hierarchy check, except that it works with MidCOM documents.

Parameters

string	$document_type	The base type to search for.

Returns

boolean	True if the document is of the given type.

read_metadata_from_object()

read_metadata_from_object(\midgard_object  $object)

Tries to resolve created, revised, author, editor and creator for the document from a Midgard object.

Parameters

\midgard_object	$object	The object to use as the source of the information.

_add_field()

_add_field(string  $name, string  $type, string  $content, boolean  $is_utf8 = false)

Internal helper which actually stores a field.

Parameters

string	$name	The field's name.
string	$type	The field's type.
string	$content	The field's content.
boolean	$is_utf8	Set this to true explicitly, to override charset conversion and assume $content is UTF-8 already.
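
The effect of the $is_utf8 flag can be sketched like this. The function is hypothetical: the local charset is assumed to be ISO-8859-1, and PHP's iconv() stands in for the i18n service's conversion functions.

```php
<?php
// Hypothetical sketch: the local charset is assumed to be ISO-8859-1, and
// iconv() stands in for the i18n service's conversion functions.
function sketch_field_content(
    string $content,
    bool $is_utf8 = false,
    string $local_charset = 'ISO-8859-1'
): string {
    if ($is_utf8) {
        // Caller guarantees UTF-8, so the content is passed through untouched.
        return $content;
    }
    // Otherwise convert from the site's charset to UTF-8 before storing.
    return iconv($local_charset, 'UTF-8', $content);
}
```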

_set_type()

_set_type(string  $type)

Sets the type of the object, reflecting the inheritance hierarchy.

Parameters

string	$type	The name of this document type.
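
One way the interplay between _set_type() and is_a() could work is sketched below. The encoding of the hierarchy as an underscore-separated chain with the base type first is an assumption of this sketch, and all class and type names are hypothetical.

```php
<?php
// Hypothetical sketch: assumes the hierarchy is encoded as an
// underscore-separated chain, base type first.
class sketch_typed_document
{
    public $type = '';

    // Each subclass appends its own type name after the parent's.
    protected function _set_type(string $type): void
    {
        $this->type = ($this->type === '') ? $type : "{$this->type}_{$type}";
    }

    // True if the given base type is a prefix of this document's type chain.
    public function is_a(string $document_type): bool
    {
        return strpos($this->type, $document_type) === 0;
    }
}

class sketch_base_document extends sketch_typed_document
{
    public function __construct()
    {
        $this->_set_type('base');
    }
}

class sketch_attachment_document extends sketch_base_document
{
    public function __construct()
    {
        // Initialize the base class first, then add this class's own type.
        parent::__construct();
        $this->_set_type('attachment');
    }
}

$doc = new sketch_attachment_document();
```

Calling _set_type() only after the parent constructor has run is what keeps the chain in inheritance order.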

process_topic()

process_topic()

Tries to determine the topic GUID and component using NAP's reverse-lookup capabilities.

If this fails, you have to set the members $topic_guid, $topic_url and $component manually.

add_person()

add_person(string  $name, \midcom_db_person  $person)

Add a person field.

Parameters

string	$name	The field's name.
\midcom_db_person	$person	The field's content.

read_unixtime()

read_unixtime(string  $stamp) : integer

Heuristics to determine how to convert the given timestamp to local unixtime.

Parameters

string	$stamp	ISO or unix datetime.

Returns

integer	The unixtime.
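
A heuristic of this kind might look like the sketch below. It is not the MidCOM implementation: the function name, the digit-only test and the fallback value of 0 for unparseable input are all assumptions.

```php
<?php
// Hypothetical heuristic, not the MidCOM implementation.
function sketch_read_unixtime(string $stamp): int
{
    // A string of digits only is taken to be a UNIX timestamp already.
    if (ctype_digit($stamp)) {
        return (int) $stamp;
    }
    // Anything else is handed to strtotime(), which understands ISO 8601.
    $parsed = strtotime($stamp);
    // Returning 0 for unparseable input is an assumption of this sketch.
    return ($parsed === false) ? 0 : $parsed;
}
```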

read_person()

read_person(string  $id) : \midcom_db_person

Gets a person by the given ID and caches the result.

Parameters

string	$id	GUID or ID to get the person for.

Returns

\midcom_db_person	The person object.

read_authorname()

read_authorname(string  $id) : string

Gets the person name for the given ID. If the ID is an imploded, wrapped list of multiple GUIDs, the first one is used.

Parameters

string	$id	GUID or ID to get the person for.

Returns

string	$author->name
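
Picking the first entry from a list of GUIDs could be sketched as follows. The "|guid1|guid2|" layout is an assumption about how the GUIDs are imploded and wrapped, and the function name is hypothetical.

```php
<?php
// Hypothetical sketch: assumes multiple GUIDs arrive as "|guid1|guid2|".
function sketch_first_id(string $id): string
{
    // Strip the wrapping delimiters, then split on the inner ones.
    $parts = explode('|', trim($id, '|'));
    // A plain single ID has no delimiters and is returned unchanged.
    return $parts[0];
}
```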