\midcom_services_indexer_document_attachment

This class is geared at indexing attachments. It requires you to "assign" the attachment to a topic, which is used as TOPIC_URL for permission purposes. In addition, you may set another MidgardObject as the source object; its GUID is stored in the __SOURCE field of the index.

The document's type is "midcom_attachment", though for several reasons it is not derived directly from midcom. The two should nevertheless be compatible in terms of usage.

Example Usage:

$document = new midcom_services_indexer_document_attachment($attachment, $object);
$indexer->index($document);

Here $attachment is the attachment to be indexed and $object is the object the attachment is associated with. The corresponding topic will be detected using the object's GUID through NAP. If this fails, you have to set the members $topic_guid, $topic_url and $component manually.
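
If the NAP lookup fails, the manual fallback could look roughly like the sketch below. The stand-in class, the helper function and all values are hypothetical and exist only so the snippet runs without MidCOM; only the member names $topic_guid, $topic_url and $component come from this page.

```php
<?php
// Hypothetical stand-in for the document class, exposing only the three
// members named above so the sketch runs without MidCOM.
class sketch_attachment_document
{
    public $topic_guid = '';
    public $topic_url = '';
    public $component = '';
}

// Fill in topic information by hand, but only where detection left it empty.
function sketch_set_topic_manually($document, string $guid, string $url, string $component): void
{
    if (empty($document->topic_guid)) {
        $document->topic_guid = $guid;
    }
    if (empty($document->topic_url)) {
        // Normalise to exactly one trailing slash.
        $document->topic_url = rtrim($url, '/') . '/';
    }
    if (empty($document->component)) {
        $document->component = $component;
    }
}

$document = new sketch_attachment_document();
// All three values are placeholders, not real data.
sketch_set_topic_manually($document, 'placeholder-topic-guid', 'https://host/path/to/topic', 'placeholder.component');
```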

Summary

Public methods: __construct(), get_field(), get_field_record(), list_fields(), remove_field(), add_date(), add_date_pair(), add_keyword(), add_unindexed(), add_unstored(), add_text(), add_result(), members_to_fields(), fields_to_members(), html2text(), is_a(), read_metadata_from_object()
Public properties: $score, $RI, $lang, $topic_guid, $component, $document_url, $created, $edited, $indexed, $creator, $editor, $title, $content, $abstract, $author, $source, $topic_url, $type, $actually_index
Constants: No constants found

Protected methods: _add_field(), _set_type(), process_topic()
Protected properties: $_i18n

Private methods: add_person(), read_unixtime(), read_person(), read_authorname(), process_attachment(), process_mime_word(), process_mime_pdf(), process_mime_richtext(), process_mime_plaintext(), process_mime_html(), process_mime_binary(), get_attachment_content(), write_attachment_tmpfile()
Private properties: $_fields, $attachment

Properties

$score

$score : double

This is the score of this document. Only populated on resultset documents, of course.

Type

double

$RI

$RI : string

The Resource Identifier of this document.

Must be UTF-8 on assignment already.

This field is mandatory.

Type

string

$lang

$lang : string

Two-letter language code of the document content.

This field is optional.

Type

string

$topic_guid

$topic_guid : string

The GUID of the topic the document is assigned to.

May be empty for non-midgard resources.

This field is mandatory.

Type

string — GUID

$component

$component : string

The name of the component responsible for the document.

May be empty for non-midgard resources.

This field is mandatory.

Type

string

$document_url

$document_url : string

The fully qualified URL to the document. This should be a PermaLink.

This field is mandatory.

Type

string

$created

$created : integer

The time of document creation, as a UNIX timestamp.

This field is mandatory.

Type

integer

$edited

$edited : integer

The time of the last document modification, as a UNIX timestamp.

This field is mandatory.

Type

integer

$indexed

$indexed : integer

The timestamp of indexing.

This field is added automatically and is to be considered read-only.

Type

integer

$creator

$creator : \midcom_db_person

The MidgardPerson who created the object.

This is optional.

Type

\midcom_db_person

$editor

$editor : \midcom_db_person

The MidgardPerson who modified the object the last time.

This is optional.

Type

\midcom_db_person

$title

$title : string

The title of the document

This is mandatory.

Type

string

$content

$content : string

The content of the document

This is mandatory.

This field is empty on documents retrieved from the index.

Type

string

$abstract

$abstract : string

The abstract of the document

This is optional.

Type

string

$author

$author : string

The author of the document

This is optional.

Type

string

$source

$source : string

An additional tag indicating the source of the document for use by the component doing the indexing.

This value is not indexed and should not be used by anybody except the component doing the indexing.

This is optional.

Type

string

$topic_url

$topic_url : string

The full path to the topic that houses the document.

For external resources, this should be either a MidCOM topic to which the resource is associated, or some "directory" by which you could filter. You may also leave it empty, which prevents the document from appearing in any topic-specific search.

The value should be fully qualified, as returned by MIDCOM_NAV_FULLURL, including a trailing slash, for example https://host/path/to/topic/

This is optional.

Type

string

$type

$type : string

The type of the document, set by subclasses and added to the index automatically.

The type must reflect the original type hierarchy. It is to be set using the $this->_set_type call after initializing the base class.

Type

string

$actually_index

$actually_index : boolean

This is here to support #651 without rewriting all components' index methods.

If set to false the indexer backend will silently skip this document.

Type

boolean

$_i18n

$_i18n : \midcom_services_i18n

The i18n service, used for charset conversion.

Type

\midcom_services_i18n

$_fields

$_fields : Array

An associative array containing all fields of the current document.

Each field is indexed by its name (a string). The value is another array containing the fields "name", "type" and "content".

Type

Array
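
A field store of this shape might look like the following. This is a minimal sketch: the field names, type names and values are made up, and the helper functions are hypothetical stand-ins that only mimic the lookup behaviour documented for get_field(), list_fields() and remove_field().

```php
<?php
// Illustrative field store: field name => record with "name", "type" and "content".
// The field names, type names and values are invented for the example.
$fields = [
    'title' => ['name' => 'title', 'type' => 'text', 'content' => 'Annual report'],
    'lang'  => ['name' => 'lang', 'type' => 'keyword', 'content' => 'en'],
];

// Return the content of a field, or false if it is not defined.
function sketch_get_field(array $fields, string $name)
{
    return isset($fields[$name]) ? $fields[$name]['content'] : false;
}

// Return the names of all defined fields.
function sketch_list_fields(array $fields): array
{
    return array_keys($fields);
}

// Return a copy of the store without the named field.
// Removing a field that does not exist is a silent no-op.
function sketch_remove_field(array $fields, string $name): array
{
    unset($fields[$name]);
    return $fields;
}
```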

$attachment

$attachment : 

Type

Methods

__construct()

__construct(\midcom_db_attachment  $attachment) 

Create a new attachment document

Parameters

\midcom_db_attachment $attachment

The Attachment to index.

get_field()

get_field(string  $name) : mixed

Returns the contents of the field name or false on failure.

Parameters

string $name

The name of the field.

Returns

mixed —

The content of the field or false on failure.

get_field_record()

get_field_record(string  $name) : Array

Returns the complete internal field record, including type and UTF-8 encoded content.

This should normally not be used from the outside, it is geared towards the indexer backends, which need the full field information on indexing.

Parameters

string $name

The name of the field.

Returns

Array —

The full content record.

list_fields()

list_fields() : Array

Returns a list of all defined fields.

Returns

Array —

Fieldname list.

remove_field()

remove_field(string  $name) 

Remove a field from the list. Nonexistent fields are ignored silently.

Parameters

string $name

The name of the field.

add_date()

add_date(string  $name, integer  $timestamp) 

Add a date field. A timestamp is expected, which is automatically converted to a suitable ISO timestamp before storage.

Direct specification of the ISO timestamp is not yet possible due to lacking validation outside the timestamp range.

If a field of the same name is already present, it is overwritten silently.

Parameters

string $name

The field's name.

integer $timestamp

The timestamp to store.
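
The conversion from a UNIX timestamp to an ISO timestamp can be sketched as below. The exact format the indexer stores is not stated on this page, so the ISO 8601 UTC layout and the function name are assumptions.

```php
<?php
// Convert a UNIX timestamp to an ISO 8601 string in UTC.
// The layout is an assumption; the indexer's own format may differ.
function sketch_timestamp_to_iso(int $timestamp): string
{
    // \T and \Z are escaped so gmdate() emits them as literal characters.
    return gmdate('Y-m-d\TH:i:s\Z', $timestamp);
}
```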

add_date_pair()

add_date_pair(string  $name, integer  $timestamp) 

Create a normal date field and an unindexed _TS-postfixed timestamp field at the same time.

This is useful because date fields are not stored in a readable format; it cannot even be determined that they were dates in the first place. The _TS field is therefore useful if you need the original timestamp value.

Parameters

string $name

The field's name, "_TS" is appended for the plain-timestamp field.

integer $timestamp

The timestamp to store.
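
The pair of records this produces could be pictured as follows. This is only a sketch: the function name, the type names "date" and "unindexed", and the ISO layout are assumptions; only the "_TS" postfix comes from the description above.

```php
<?php
// Build the two records a date pair would produce: a date field and an
// unindexed "<name>_TS" field holding the raw timestamp.
// Type names and the ISO layout are assumptions of this sketch.
function sketch_date_pair(string $name, int $timestamp): array
{
    return [
        $name => [
            'name' => $name,
            'type' => 'date',
            'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
        ],
        $name . '_TS' => [
            'name' => $name . '_TS',
            'type' => 'unindexed',
            'content' => (string) $timestamp,
        ],
    ];
}
```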

add_keyword()

add_keyword(string  $name, string  $content) 

Add a keyword field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_unindexed()

add_unindexed(string  $name, string  $content) 

Add an unindexed field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_unstored()

add_unstored(string  $name, string  $content) 

Add an unstored field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_text()

add_text(string  $name, string  $content) 

Add a text field.

Parameters

string $name

The field's name.

string $content

The field's content.

add_result()

add_result(string  $name, string  $content) 

Add a search result field. This should normally not be done manually; the indexer calls this function when creating a document out of a search result.

Parameters

string $name

The field's name.

string $content

The field's content, which is assumed to be UTF-8 already

members_to_fields()

members_to_fields() 

This translates all member variables into appropriate field records. The backend should call this immediately before indexing.

This call will automatically populate indexed with time() and author with the name of the creator (if set).

fields_to_members()

fields_to_members() 

Populate all relevant members with the respective values after retrieving a document from the index.

html2text()

html2text(string  $text) : string

Convert HTML to plain text (relatively simple):

Basically, JavaScript blocks and HTML Tags are stripped, and all HTML Entities are converted to their native equivalents.

Tags are replaced with a space rather than an empty string, so that constructs like <li>torben</li><li>nehmer</li> are recognized correctly.

    Parameters

    string $text

    The HTML text to convert to plain text.

    Returns

    string —

    The converted text.
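
The conversion described above can be approximated with a few regular expressions. This is a minimal sketch under stated assumptions, not the class's own code: the function name is hypothetical, and the final whitespace collapsing is an addition made so the result is easy to compare.

```php
<?php
// Rough HTML-to-text conversion: drop script blocks, replace tags with a
// space, decode entities. Not the class's actual implementation.
function sketch_html2text(string $html): string
{
    // Remove <script> blocks together with their content.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $html);
    // Replace each remaining tag with a space so adjacent words stay apart.
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    // Convert HTML entities to their native characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (an extra step of this sketch).
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

With this sketch, two adjacent list items come out as two separate words instead of being fused into one.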

    is_a()

    is_a(string  $document_type) : boolean

    Checks whether the given document is an instance of the given document type.

    This is equivalent to the is_a object hierarchy check, except that it works with MidCOM documents.

    Parameters

    string $document_type

    The base type to search for.

    Returns

    boolean —

    Indicating relationship.
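
One way such a check could work is a prefix comparison on the type string. That the hierarchy is encoded as a prefix of the type name is an assumption of this sketch, as is the function name.

```php
<?php
// Check a document type string against a base type.
// Assumption: a derived type name starts with the name of its base type.
function sketch_document_is_a(string $type, string $base_type): bool
{
    return strpos($type, $base_type) === 0;
}
```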

    read_metadata_from_object()

    read_metadata_from_object(\midgard_object  $object) 

    Tries to resolve created, revised, author, editor and creator for the document from a Midgard object.

    Parameters

    \midgard_object $object

    object to use as source for the info

    _add_field()

    _add_field(string  $name, string  $type, string  $content, boolean  $is_utf8 = false) 

    Internal helper which actually stores a field.

    Parameters

    string $name

    The field's name.

    string $type

    The field's type.

    string $content

    The field's content.

    boolean $is_utf8

    Set this to true explicitly, to override charset conversion and assume $content is UTF-8 already.
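
The effect of $is_utf8 can be illustrated with a standalone sketch. The real method uses the i18n service for charset conversion; here mb_convert_encoding() from the mbstring extension stands in for it, and the function name, the $fields argument and the $source_charset parameter are hypothetical.

```php
<?php
// Store a field record, converting the content to UTF-8 unless the caller
// states that it already is UTF-8.
// $source_charset is a hypothetical parameter; requires the mbstring extension.
function sketch_add_field(array $fields, string $name, string $type, string $content, bool $is_utf8 = false, string $source_charset = 'ISO-8859-1'): array
{
    if (!$is_utf8) {
        $content = mb_convert_encoding($content, 'UTF-8', $source_charset);
    }
    // An existing record under the same name is replaced.
    $fields[$name] = ['name' => $name, 'type' => $type, 'content' => $content];
    return $fields;
}
```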

    _set_type()

    _set_type(string  $type) 

    Sets the type of the object, reflecting the inheritance hierarchy.

    Parameters

    string $type

    The name of this document type

    process_topic()

    process_topic() 

    Tries to determine the topic GUID and component using NAP's reverse-lookup capabilities.

    If this fails, you have to set the members $topic_guid, $topic_url and $component manually.

    add_person()

    add_person(string  $name, \midcom_db_person  $person) 

    Add a person field.

    Parameters

    string $name

    The field's name.

    \midcom_db_person $person

    The field's content.

    read_unixtime()

    read_unixtime(string  $stamp) : integer

    Heuristics to determine how to convert given timestamp to local unixtime

    Parameters

    string $stamp

    ISO or unix datetime

    Returns

    integer —

    unixtime
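
A heuristic of this kind might be sketched as follows. The rules shown (all-digit strings are taken as UNIX timestamps, everything else goes through strtotime(), failures yield 0) are assumptions, not the method's documented behaviour.

```php
<?php
// Heuristic: all-digit strings are taken as UNIX timestamps, anything else is
// handed to strtotime(). Returning 0 on failure is an assumption of this sketch.
function sketch_read_unixtime(string $stamp): int
{
    if (preg_match('/^\d+$/', $stamp) === 1) {
        return (int) $stamp;
    }
    $parsed = strtotime($stamp);
    return ($parsed === false) ? 0 : $parsed;
}
```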

    read_person()

    read_person(string  $id) : \midcom_db_person

    Get person by given ID, caches results.

    Parameters

    string $id

    GUID or ID to get person for

    Returns

    \midcom_db_person

    object

    read_authorname()

    read_authorname(string  $id) : string

    Gets the person name for the given ID. If the ID is an imploded, wrapped list of multiple GUIDs, the first one is used.

    Parameters

    string $id

    GUID or ID to get person for

    Returns

    string —

    $author->name

    process_attachment()

    process_attachment() 

    process_mime_word()

    process_mime_word() 

    Convert a Word attachment to plain text and index it.

    process_mime_pdf()

    process_mime_pdf() 

    Convert a PDF attachment to plain text and index it.

    process_mime_richtext()

    process_mime_richtext() 

    Convert an RTF attachment to plain text and index it.

    process_mime_plaintext()

    process_mime_plaintext() 

    Simple plain-text driver, just copies the attachment.

    process_mime_html()

    process_mime_html() 

    Processes HTML-style attachments (should therefore work with XML too), strips tags and resolves entities.

    process_mime_binary()

    process_mime_binary() 

    Any binary file will have its name in the abstract unless no title is defined, in which case the document's title already contains the file's name.

    get_attachment_content()

    get_attachment_content(resource  $handle = null) 

    Returns the first four megabytes of the File referenced by $handle.

    The limit is in place to avoid clashes with the PHP Memory limit, it should be enough for most text based attachments anyway.

    If you omit $handle, a handle to the document's attachment is created and automatically closed after reading the data. If you pass your own handle, you have to close it yourself afterwards.

    Parameters

    resource $handle

    A valid file-handle to read from, or null to automatically create a handle to the current attachment.
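
The handle ownership rule and the size limit can be sketched with plain file functions. The function name and the $path and $limit parameters are hypothetical; the real method opens the document's attachment rather than a file path.

```php
<?php
// Read at most $limit bytes (four megabytes by default). If no handle is
// passed, one is opened on $path and closed again after reading; a handle
// supplied by the caller is left open.
// $path and $limit are hypothetical parameters of this sketch.
function sketch_read_head(string $path, $handle = null, int $limit = 4194304): string
{
    $opened_here = false;
    if ($handle === null) {
        $handle = fopen($path, 'rb');
        if ($handle === false) {
            return '';
        }
        $opened_here = true;
    }
    $data = stream_get_contents($handle, $limit);
    if ($opened_here) {
        fclose($handle);
    }
    return ($data === false) ? '' : $data;
}
```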

    write_attachment_tmpfile()

    write_attachment_tmpfile() : string

    Creates a temporary copy of the attachment. The caller must delete it manually after processing is complete.

    Returns

    string —

    The name of the temporary file.
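
The create-then-clean-up contract might be used along these lines. This is a sketch only: the function name is hypothetical, and passing the content as a string replaces reading it from the attachment.

```php
<?php
// Copy the given content into a new temporary file and return its name.
// Passing the content as a string is a simplification of this sketch.
function sketch_write_tmpfile(string $content): string
{
    $tmpname = tempnam(sys_get_temp_dir(), 'attachment-');
    if ($tmpname === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }
    file_put_contents($tmpname, $content);
    return $tmpname;
}

$tmpname = sketch_write_tmpfile('example attachment data');
try {
    // ... hand $tmpname to a converter here ...
    $size = filesize($tmpname);
} finally {
    // The caller is responsible for removing the copy.
    unlink($tmpname);
}
```

Wrapping the processing in try/finally ensures the temporary copy is removed even if the conversion step throws.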