Tuesday, June 30, 2009

Writing KDE4 file analyzers

File analyzers extract data from files to display in the file dialogs and file managers. The data gathered this way is also used to search for files. KDE4 allows the use of multiple analyzers per file type. Analyzers can extract text which is used for indexing, but they can also retrieve other data such as song title, album title, recipient, md5 sum, the mimetype of a file, and much more.

This tutorial describes how you can write new analyzers.

Primer

What are file analyzers?

A file analyzer is a class that extracts metadata from a file or data stream. Some file analyzers are specific to a particular file type, such as an analyzer that extracts information from an Ogg Vorbis file. Others are more general and calculate, for example, the MD5 or SHA-1 checksum of a file.

File analyzers in KDE4

KDE4 uses stream-based file analyzers for retrieving text and metadata from files. This has a number of advantages over file-based methods. Stream-based access

  • is faster for 90% of the file types,
  • allows easy analysis of embedded files such as email attachments or entries from zip files, rpms and many other container file formats.

Writing stream-based analyzers requires a different approach from the usual file-based methods, and this tutorial explains how to go about it.

The current state of porting the KDE3 kfile plugins to KDE4 stream analyzers can be seen at

Look for existing code

If you want to see some code examples, take a look at the already implemented file analyzers at

Some examples of meta-data extraction from files can also be found.

Choosing the type of analyzer

There are two main types of analyzers: StreamThroughAnalyzer and StreamEndAnalyzer. The latter is more powerful and a bit easier to program, but has a limitation: only one StreamEndAnalyzer can analyze a particular resource, while you can use as many StreamThroughAnalyzers as you like. Most analyzers can be written as StreamThroughAnalyzers. The most important exception is for analyzers that extract embedded resources from a stream. Examples of this are the ZipEndAnalyzer, the MailEndAnalyzer and the RpmEndAnalyzer.
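The difference in cardinality can be illustrated with a small, self-contained sketch. It does not use the Strigi classes: `ThroughAnalyzer`, `EndAnalyzer`, `Pipeline` and the three concrete analyzers are hypothetical stand-ins invented for this illustration, and the chunks are simply pushed to every analyzer, which is a simplification. The only point shown is that any number of pass-through analyzers can observe the same data, while the pipeline has room for a single end analyzer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in: sees every chunk as it passes by.
struct ThroughAnalyzer {
    virtual ~ThroughAnalyzer() = default;
    virtual void observe(const char* data, std::size_t size) = 0;
};

// Hypothetical stand-in: the one analyzer that consumes the stream.
struct EndAnalyzer {
    virtual ~EndAnalyzer() = default;
    virtual void consume(const char* data, std::size_t size) = 0;
};

// General-purpose observer: counts the bytes that flow past.
struct ByteCounter : ThroughAnalyzer {
    std::size_t total = 0;
    void observe(const char*, std::size_t size) override { total += size; }
};

// Another observer: counts newline characters.
struct NewlineCounter : ThroughAnalyzer {
    std::size_t lines = 0;
    void observe(const char* data, std::size_t size) override {
        for (std::size_t i = 0; i < size; ++i) {
            if (data[i] == '\n') ++lines;
        }
    }
};

// Format-specific consumer: remembers the first two bytes of the stream.
struct MagicGrabber : EndAnalyzer {
    std::string magic;
    void consume(const char* data, std::size_t size) override {
        for (std::size_t i = 0; i < size && magic.size() < 2; ++i) {
            magic.push_back(data[i]);
        }
    }
};

// Many through analyzers, but only a single slot for an end analyzer.
struct Pipeline {
    std::vector<ThroughAnalyzer*> through;
    EndAnalyzer* end = nullptr;

    void run(const std::string& data, std::size_t chunkSize) {
        assert(chunkSize > 0);
        for (std::size_t pos = 0; pos < data.size(); pos += chunkSize) {
            const std::size_t n =
                (data.size() - pos < chunkSize) ? data.size() - pos : chunkSize;
            for (ThroughAnalyzer* t : through) t->observe(data.data() + pos, n);
            if (end) end->consume(data.data() + pos, n);
        }
    }
};
```

To try it, put two counters in `through`, assign one `MagicGrabber` to `end`, and call `run()` on a short string with a small chunk size.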

In this tutorial we focus on a simple example file type: BMP images. The information we will get from this file is located at the start of the file. It turns out that in this case, it is just as easy to implement the analyzer as a StreamEndAnalyzer as a StreamThroughAnalyzer. We will implement it as a StreamEndAnalyzer and point out how to do the same as a StreamThroughAnalyzer.
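To make "located at the start of the file" concrete, here is a minimal sketch of reading a few fields from the first bytes of a BMP file. It is independent of Strigi. The names `BmpInfo`, `readLE16`, `readLE32` and `parseBmpHeader` are made up for this example, and the sketch assumes the common layout with a 14-byte file header followed by an info header of at least 40 bytes (the older 12-byte OS/2 header is rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch.
struct BmpInfo {
    int32_t width;
    int32_t height;        // negative means the rows are stored top-down
    uint16_t bitsPerPixel;
    uint32_t compression;  // 0 means uncompressed
};

// BMP stores multi-byte integers in little-endian order.
static uint16_t readLE16(const unsigned char* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

static uint32_t readLE32(const unsigned char* p) {
    return static_cast<uint32_t>(p[0]) |
           (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) |
           (static_cast<uint32_t>(p[3]) << 24);
}

// Returns true and fills 'info' if the buffer starts with a usable BMP header.
bool parseBmpHeader(const unsigned char* data, std::size_t size, BmpInfo& info) {
    assert(data != nullptr || size == 0);
    if (size < 34) return false;                          // fields end at offset 33
    if (data[0] != 'B' || data[1] != 'M') return false;   // magic bytes
    if (readLE32(data + 14) < 40) return false;           // info header too small
    info.width = static_cast<int32_t>(readLE32(data + 18));
    info.height = static_cast<int32_t>(readLE32(data + 22));
    info.bitsPerPixel = readLE16(data + 28);
    info.compression = readLE32(data + 30);
    return true;
}
```

Because all of these fields sit within the first 34 bytes, an analyzer never has to read further into the stream, which is why both analyzer types work equally well for this format.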

Two other types of stream analyzers have been added to Strigi: StreamLineAnalyzer, for file formats based on lines of plain text, and StreamSaxAnalyzer, for XML-based files such as SVG.

StreamEndAnalyzer

Three functions need to be implemented in a StreamEndAnalyzer:

  • name()
  • checkHeader()
  • analyze()
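A rough skeleton of how these three functions could fit together for the BMP example is sketched below. It compiles on its own and deliberately does not include any Strigi headers: `EndAnalyzerBase` and `Result` are hypothetical stand-ins, the signatures are assumptions that may differ from the real StreamEndAnalyzer interface, and the return convention of `analyze()` (0 for success, -1 for failure) is also an assumption. Judging by the names, `name()` identifies the analyzer, `checkHeader()` decides from the first bytes whether the analyzer applies, and `analyze()` does the extraction.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the object that collects extracted metadata.
struct Result {
    std::map<std::string, std::string> fields;
};

// Hypothetical stand-in for the base class; signatures are assumptions.
class EndAnalyzerBase {
public:
    virtual ~EndAnalyzerBase() = default;
    virtual const char* name() const = 0;
    virtual bool checkHeader(const char* header, int32_t headersize) const = 0;
    virtual int analyze(Result& result, const std::string& stream) = 0;
};

// Reads a little-endian 32-bit value from raw bytes.
static uint32_t readLE32(const char* p) {
    const unsigned char* u = reinterpret_cast<const unsigned char*>(p);
    return static_cast<uint32_t>(u[0]) |
           (static_cast<uint32_t>(u[1]) << 8) |
           (static_cast<uint32_t>(u[2]) << 16) |
           (static_cast<uint32_t>(u[3]) << 24);
}

class BmpEndAnalyzer : public EndAnalyzerBase {
public:
    const char* name() const override { return "BmpEndAnalyzer"; }

    // Accept the stream only if it starts with the BMP magic bytes "BM".
    bool checkHeader(const char* header, int32_t headersize) const override {
        assert(headersize >= 0);
        return headersize >= 2 && header[0] == 'B' && header[1] == 'M';
    }

    // Extract width and height; assumes an info header of at least 40 bytes.
    int analyze(Result& result, const std::string& stream) override {
        if (stream.size() < 26) return -1;            // height ends at offset 25
        if (readLE32(stream.data() + 14) < 40) return -1;
        const int32_t width = static_cast<int32_t>(readLE32(stream.data() + 18));
        const int32_t height = static_cast<int32_t>(readLE32(stream.data() + 22));
        result.fields["width"] = std::to_string(width);
        result.fields["height"] = std::to_string(height);
        return 0;
    }
};
```

In this sketch the whole stream is passed as a `std::string` to keep the example short; a real stream analyzer would read from an input stream instead.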
