Extracting Keywords from PDF on Multimedia upload

I came across a nice library that allows you to create and manipulate PDF documents. Apart from creating PDF documents from scratch, you can also read existing ones, convert XML to PDF, fill out interactive PDF forms, stamp new content on existing PDF documents, split and merge existing PDF documents, and much more. Best of all, there is an open source C# port available, called iTextSharp. I haven’t explored all of its features yet (PDF creation, for example), but so far it looks very usable.

The first thing I did was create a small, simple event handler in SDL Tridion which, upon the first save of a Multimedia Component, extracts the keywords available in a PDF file and adds them to the metadata of the Multimedia Component. Nothing fancy: it doesn’t try to match the PDF keywords with actual SDL Tridion Keywords or the like. Still, I thought it was quite a useful start, and in my tests it worked without problems and was very fast.

In my example I simply subscribed to the Component Save event in the Initiated phase: EventSystem.Subscribe<Component, SaveEventArgs>(ExtractPdfKeywords, EventPhases.Initiated); This gives you access to the uploaded binary file, so it can be passed to the iTextSharp library, which can read a PDF file directly from a byte array. On top of that, we can change the metadata in this event without needing to call Save ourselves, since the save is carried out right after this event fires.
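For context, here is a minimal sketch of where such a subscription typically lives in a Tridion 2011 event system extension. The class name and extension name (PdfKeywordEvents) are made up for illustration, and the sketch assumes the Tridion.ContentManager assemblies are referenced; it is not necessarily how my own project is laid out.

```csharp
using Tridion.ContentManager.ContentManagement;
using Tridion.ContentManager.Extensibility;
using Tridion.ContentManager.Extensibility.Events;

// Hypothetical extension and class name, chosen for this example only.
[TcmExtension("PdfKeywordEvents")]
public class PdfKeywordEvents : TcmExtension
{
    public PdfKeywordEvents()
    {
        // Run the handler in the Initiated phase of every Component save,
        // i.e. before the item is actually persisted.
        EventSystem.Subscribe<Component, SaveEventArgs>(ExtractPdfKeywords, EventPhases.Initiated);
    }

    private static void ExtractPdfKeywords(Component subject, SaveEventArgs args, EventPhases phase)
    {
        // Handler body: see the full listing further down in this post.
    }
}
```

The compiled assembly still has to be registered with the Content Manager before the subscription takes effect.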

Using iTextSharp.text.pdf.PdfReader was simple after that: the keywords of a PDF file are directly exposed in its Info object. (You can check the Document Properties in Adobe Reader to see whether your PDF actually has keywords; they are the equivalent of Tags in a Microsoft Word document.)
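To illustrate the lookup on its own, here is a small sketch that treats the Info object as a plain string dictionary, which is how it is used in the handler. PdfInfoHelper and GetKeywords are hypothetical names introduced for this example and are not part of iTextSharp; the trimming and the blank check are my own additions, assuming you would rather skip empty keyword entries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp or Tridion.
public static class PdfInfoHelper
{
    // Returns the trimmed "Keywords" entry of a PDF info dictionary,
    // or null when the entry is missing or contains only whitespace.
    public static string GetKeywords(IDictionary<string, string> info)
    {
        if (info == null)
        {
            return null;
        }

        string value;
        if (!info.TryGetValue("Keywords", out value) || string.IsNullOrWhiteSpace(value))
        {
            return null;
        }

        return value.Trim();
    }
}
```

In the event handler this would be called as PdfInfoHelper.GetKeywords(reader.Info), with the metadata field only being set when the result is not null.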

Below is the code I used in my event handler; as you can see, it’s quite straightforward. It first tests whether this is the initial save of a Multimedia Component. Then, if the uploaded file has the extension “.pdf” and a Metadata Field is available for storing the PDF keywords, it tries to extract the keywords from the PDF and sets them in the metadata.

private static void ExtractPdfKeywords(Component subject, SaveEventArgs args, EventPhases phase)
{
    // only act upon first save of multimedia components
    if (subject.Version == 0 && subject.ComponentType == ComponentType.Multimedia)
    {
        ItemFields componentMetadataFields = new ItemFields(subject.Metadata, subject.MetadataSchema);

        // add keywords to metadata for pdf files (extension check ignores case, so ".PDF" matches too)
        if (subject.BinaryContent.Filename.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase) && componentMetadataFields.Contains("pdf_keywords"))
        {
            try
            {
                // extract Keywords from PDF and store in metadata
                PdfReader reader = new PdfReader(subject.BinaryContent.GetByteArray());
                if (reader.Info.ContainsKey("Keywords"))
                {
                    // use TextField so we support both SingleLineTextField and MultiLineTextField
                    TextField keywordsField = (TextField)componentMetadataFields["pdf_keywords"];
                    keywordsField.Value = reader.Info["Keywords"];

                    // save fields back to component (no need to call save as we are in the initiated phase and save will happen after this event)
                    subject.Metadata = componentMetadataFields.ToXml();
                }
            }
            catch (Exception e)
            {
                Logger.Write(string.Format("PDF keyword extraction failed:\n{0}", e.Message), Name, LoggingCategory.General);
            }
        }
    }
}

Posted in Community, Extensions, Tridion 2011 by Bart.

About Bart

Working as a Technical Product Manager, Bart is the evangelist of all SDL Web products and the Digital Experience Accelerator in particular. Bart has worked for SDL since 2003 as a technical support engineer, consultant and trainer, supporting both partners and customers with their implementations. Bart was one of the original developers of the first version of DXA, and currently determines its future direction.

One thought on “Extracting Keywords from PDF on Multimedia upload”

  1. Nice post, Bart!

    I used to assume such document or Web metadata keywords should always be Tridion keywords. But considering how companies have SEO teams that optimize such attributes down to common misspellings, text fields might be a better fit with “modern” implementations.

    I’ve seen iTextSharp work well in at least one other Tridion scenario: taking a Tridion page (rendered as XML) to convert into PDF via a batch script job and on-the-fly. Nifty library for sure.
