What is Java Content Repository

by Sunil Patil
10/04/2006

JSR-170 defines itself as "a standard, implementation independent way to access content bi-directionally on a granular level within a content repository," and goes on to define a content repository as "a high-level information management system that is a superset of traditional data repositories, [which] implements 'content services' such as: author based versioning, full textual searching, fine grained access control, content categorization and content event monitoring."

The Java Content Repository API (JSR-170) is an attempt to standardize an API that can be used for accessing a content repository. If you're not familiar with content management systems (CMS) such as Documentum, Vignette, or FileNet, then you must be wondering what a content repository is. Think of a content repository as a generic application "data store" that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.). One key feature of a content repository is that you don't have to worry about how the data is actually stored: data could be stored in an RDBMS or a filesystem or as an XML document. In addition to providing services for storing and retrieving your data, most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.

Various CMSs from different vendors have been on the market for quite some time, and all of these CMSs ship their own version of a content repository. The problem is, each CMS vendor provides its own API for interacting with the content repository shipped with that vendor's CMS. This is a problem for the application developer, since he has to learn a particular vendor's API and potentially tie his code to one particular CMS implementation.

JSR-170 tries to solve this problem by standardizing the API that should be used for connecting to any content repository. With JSR-170, you develop code using only the javax.jcr.* classes and interfaces. Code written this way should be able to work with any JSR-170 compliant content repository.

This article is a step-by-step tutorial for newcomers to JSR-170. I've decided to use Apache Jackrabbit, the reference implementation of JSR-170, as the content repository. I'll start the discussion by talking a little more about what a content repository is and why the content repository API needs to be standardized. After that I'll introduce you to JSR-170 by discussing the repository model it defines. Next I will talk about what Apache Jackrabbit is, how to build it, and how to configure it for use. Once Apache Jackrabbit is set up, I will develop a sample application that demonstrates the basic features of the JSR-170 API.

Need for Java Content Repository API

As the number of vendors offering proprietary content repositories has increased, the need for a common programmatic interface to these repositories has become apparent, and that's where JSR-170 comes into play. JSR-170 defines a programmatic interface that should be used for connecting to a content repository. You can think of JSR-170 as a JDBC-like API for content repositories, allowing you to develop your program independently of any particular content repository implementation. At runtime, you can configure this program to work with a natively JSR-170 compliant content repository (e.g., Communique or Apache Jackrabbit). If your repository is not natively JSR-170 compliant (e.g., Documentum or Vignette), then you can use a repository-specific JSR-170 driver that converts your JSR-170 method calls into repository-specific method calls.

CMSs are quite an old concept. Some of the common applications of CMSs include a web content management system used to manage content (static HTML files and images) on a company's web site, or a document management system where a company stores scanned copies of all sales orders. There are different CMS vendors in the market that provide this type of application. CMS vendors need a content repository as a backend, one that handles both structured and unstructured content efficiently. By "structured content," we mean content like a news item or press release that is posted in the system and retrieved by queries (e.g., your application's front page should display, say, the 3 latest press releases or 10 latest news items). An example of unstructured content is a scanned copy of a sales order or an image that should be displayed on your corporate website.

To support these CMSs, vendors have developed their own content repositories that ship with their products. They also provide proprietary APIs that can be used for accessing these repositories. As the number of CMS vendors increases, the need to standardize this API becomes apparent, and that's where JSR-170 comes into play.

Figure 1 describes the structure of an application developed using the JSR-170 API. At run time, this application can work with content repository 1, 2, or 3. Of these, only content repository 2 is natively JSR-170 compliant; the other two repositories need JSR-170 drivers for interacting with a JSR-170 application. Note one more thing: your application does not have to worry about how the actual content is stored. Content repository 1 may use an RDBMS as its underlying data store, whereas content repository 2 may use the filesystem, and some other repository could use a mix of these.

Figure 1. Structure of JSR-170 compliant application

The JSR-170 API has different advantages for different stakeholders in the content repository space.

  • Developers do not have to spend time learning each vendor's repository-specific API. Instead, once she is comfortable with JSR-170, a developer should be able to work with any JSR-170 compliant content repository. In the past, developers had to make a choice between a CMS with great features and poor development tools, or one with great development tools but poor features. Now that the interface between content repository and CMS applications is standardized, you can choose the best of both worlds.
  • Corporations won't have to face the problem of vendor lock-in. Many corporations have more than one CMS, either because different departments chose different CMSs in the past, or because an acquired company used a different CMS. In the past, corporations spent a lot of money getting these different systems to interact with each other. With JSR-170, they can expect the same application to work with all of their CMSs.
  • CMS vendors were forced to develop and maintain their own content repository implementations, which meant lots of infrastructure code. Now they can leave development of the content repository to some other vendor and concentrate more on their core competency: developing CMS applications.

Content Repository Model

JSR-170 says that a content repository is composed of one or more workspaces, which should normally contain similar content. Each workspace contains a single rooted tree of items. An item is either a node or a property. Each node may have zero or more child nodes and zero or more child properties. Every workspace has exactly one root node; the root node is the only node without a parent, and all other nodes have exactly one parent. Properties have one node as a parent and cannot have children; they are the leaves of the tree. All of the actual content in the repository is stored within the values of the properties.

Figure 2 describes a content repository model for a sample blogging application. Every child node of the root node represents one blog entry. Any actual data related to a blog entry is stored as properties of blogEntry. The properties blogTitle, blogAuthor, and creationTime should all be self-evident, while the blogContent property contains the actual entry text, and the blogAttachment property holds the binary image file attached to the entry:

Figure 2. Content repository model
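
The tree described above can be sketched in a few lines of plain Java. This is only an illustrative model of the node/property structure, not the javax.jcr API: the class SketchNode and all of its methods are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a repository node; NOT the javax.jcr.Node interface.
class SketchNode {
    private final String name;
    private final SketchNode parent;   // null only for the root node
    private final List<SketchNode> children = new ArrayList<>();
    private final Map<String, Object> properties = new LinkedHashMap<>();

    private SketchNode(String name, SketchNode parent) {
        this.name = name;
        this.parent = parent;
    }

    // Every workspace has exactly one root node, and it has no parent.
    static SketchNode root() {
        return new SketchNode("", null);
    }

    // A child node has exactly one parent: the node it is added to.
    SketchNode addNode(String childName) {
        SketchNode child = new SketchNode(childName, this);
        children.add(child);
        return child;
    }

    // Properties are leaves: they hold a value and cannot have children.
    void setProperty(String propertyName, Object value) {
        properties.put(propertyName, value);
    }

    Object getProperty(String propertyName) {
        return properties.get(propertyName);
    }

    List<SketchNode> getNodes() {
        return children;
    }

    // Absolute path from the root, for example "/blogEntry".
    String getPath() {
        if (parent == null) {
            return "/";
        }
        String parentPath = parent.getPath();
        return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
    }
}
```

With this model, the blog workspace of Figure 2 is a root node with one blogEntry child per entry, and calls such as entry.setProperty("blogTitle", "Hello JCR") store the content in the leaves.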

In addition to this repository model, JSR-170 also defines different features or operations that should be supported by a compliant repository. To make it easy for existing CMS vendors to adapt to the new standard, JSR-170 has brought in the concept of compliance levels, which define the set of features that must be supported for a given level of compliance. JSR-170 defines two compliance levels plus a set of optional feature blocks:

  • Level 1 defines a read-only repository: This includes functionality for reading repository content, exporting content to XML, and searching. This functionality should meet the needs of presentation templates and basic portal applications, which make up a large portion of the existing codebase of content-related applications. Level 1 is also designed to be easy to implement on top of an existing content repository.
  • Level 2 defines a writable repository: A Level 2 repository is a superset of Level 1. In addition to Level 1's functionality, it defines methods for writing content and importing content from XML. Applications written against Level 2 features include any application that generates data, information, or content, both structured and unstructured.
  • Advanced options: In addition to Level 1 or Level 2 features, the specification defines five additional functional blocks: Versioning, (JTA) Transactions, Query using SQL, Explicit Locking, and Content Observation. In addition to being either Level 1 or Level 2 compliant, any repository can decide to implement one or more of these functional blocks. A repository that implements all of these features in addition to being Level 2 compliant can be used as a general purpose off-the-shelf infrastructure for content management, document management, code management, or just about any other application that persists content.

So, if you are a CMS vendor, the first step is to make your repository Level 1 compliant. As time progresses, you can decide to move to Level 2 compliance and implement advanced features based on your needs or client base.

What Is Apache Jackrabbit?

Apache Jackrabbit is a fully JSR-170 compliant repository: it is Level 2 compliant and implements all optional feature blocks. Beyond the JSR-170 API, Jackrabbit features numerous extensions and administrative features that are needed to run a repository but are not specified by JSR-170.

We have decided to use Apache Jackrabbit as the content repository in our sample application. One problem with Apache Jackrabbit is that it doesn't offer a binary release, so developers need to build it from source code before installing it. See Building Jackrabbit for information on how to build Apache Jackrabbit from source code.

How to Configure Apache Jackrabbit

After downloading and building the Jackrabbit source code successfully, let's configure it. Jackrabbit needs two parameters at runtime to configure a content repository instance.

  1. Repository home directory: The filesystem path of the directory that usually contains all the repository content, search indexes, internal configuration, and other persistent information managed within the content repository. The directory structure of the content repository will look something like this:

       c:/temp
            |
            |--Blogging
                    |
                    |-repository
                    |       |
                    |       |-index
                    |       |-meta
                    |       |-namespaces
                    |       |-nodetypes             
                    |
                    |-version
                    |
                    |-workspace
                            |
                            |--default

    In this case, the value of the repository home directory parameter should be c:/temp/Blogging.

  2. Repository configuration file: The filesystem path of the repository configuration XML file. This file names the class that implements each Jackrabbit component (deciding which implementation we want to use) and carries the configuration information required by that component.

    In the repository configuration file, the <Repository> element is the top-most or root element. One <Repository> element holds the configuration information for one repository, and it contains the following elements:

    • <FileSystem>: The filesystem element represents the virtual filesystem implementation that is used for storing global data, that is, data that applies at the level of the repository, such as registered namespaces, custom node types, etc. Apache Jackrabbit provides a few options to store this data. One option is to store it on an underlying filesystem, which we are doing in our sample application by using LocalFileSystem. If you want this data to be stored in a database, then use DbFileSystem.
    • <Security>: The security element contains security configuration information for this repository. It has two child elements: <AccessManager> and <LoginModule>. The value of <AccessManager> indicates the class that should be queried to determine if a user has rights to perform a particular action on a particular item. The <LoginModule> element allows you to configure a class of LoginModule type, which is used for implementing authentication.
    • <Workspaces>: This element holds configuration that is common across all workspaces in the repository. Its rootPath attribute points to the root directory containing all workspace folders. In our sample directory configuration it would be c:/temp/Blogging/workspace. The defaultWorkspace attribute contains the name of the default workspace.
    • <Workspace>: This element represents the default template for all workspaces in this repository. So, when you create a new workspace in this repository, its workspace.xml file will look like this element. The element has three child elements. The first is <FileSystem>, which configures the virtual filesystem that should be used for storing data related to this workspace. The <PersistenceManager> element indicates how you want to persist the content of this workspace. Apache Jackrabbit gives you a choice of storing it on the filesystem, in a database, in memory as a hashtable, or as an XML file. In our sample we are planning to persist that content in a Derby database. The last element is <SearchIndex>, which is optional. The value of this element points to a class that is used for indexing as well as actual query execution.
    • <Versioning>: This element configures versioning storage. It contains the same child elements, <FileSystem> and <PersistenceManager>, as <Workspace>. That's because JSR-170 treats versions as nodes, and so the same structure can be reused.
    • <SearchIndex>: This element configures the index that is used for searching repository-wide content.
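
For orientation, a configuration containing the elements described above might look like the sketch below. It is modeled on the default repository.xml that Jackrabbit 1.x releases generate, adapted to this article's directory layout; it is not the article's own listing. Treat every class name, parameter, and path as an assumption to check against your Jackrabbit version (later releases moved or replaced some of these classes).

```xml
<Repository>
  <!-- Repository-wide data: namespaces, custom node types, ... -->
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${rep.home}/repository"/>
  </FileSystem>
  <Security appName="Jackrabbit">
    <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/>
    <LoginModule class="org.apache.jackrabbit.core.security.SimpleLoginModule">
      <param name="anonymousId" value="anonymous"/>
    </LoginModule>
  </Security>
  <Workspaces rootPath="${rep.home}/workspace" defaultWorkspace="default"/>
  <!-- Template copied into workspace.xml for each new workspace -->
  <Workspace name="${wsp.name}">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${wsp.home}"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.state.db.DerbyPersistenceManager">
      <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
      <param name="schemaObjectPrefix" value="${wsp.name}_"/>
    </PersistenceManager>
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
    </SearchIndex>
  </Workspace>
  <Versioning rootPath="${rep.home}/version">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${rep.home}/version"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.state.db.DerbyPersistenceManager">
      <param name="url" value="jdbc:derby:${rep.home}/version/db;create=true"/>
      <param name="schemaObjectPrefix" value="version_"/>
    </PersistenceManager>
  </Versioning>
  <!-- Repository-wide search index -->
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index"/>
  </SearchIndex>
</Repository>
```

The ${rep.home}, ${wsp.home}, and ${wsp.name} placeholders are filled in by Jackrabbit with the repository home directory, the workspace directory, and the workspace name respectively.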

The repository home directory and repository configuration file parameters are passed either directly to Jackrabbit when a repository instance is created or indirectly through settings for the JNDI object factory. You can set the value of the org.apache.jackrabbit.repository.home system property to point to the repository home directory. In our example, we will set it to c:/temp/Blogging. Likewise, if you have a repository.xml file and you want to use it for setting up the repository, you can set the value of the org.apache.jackrabbit.repository.conf system property to point to your repository.xml. In our case, we don't want to use an existing repository.xml; instead, we want Jackrabbit to generate a default repository.xml file for us. If you don't set either of these properties, then Jackrabbit will treat the current folder as the home directory and create a repository directory structure as well as a repository.xml file in it. Refer to the Apache Jackrabbit online documentation to configure Apache Tomcat to create a repository configuration object and bind it in the JNDI tree.
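
Because the home directory is an ordinary system property, it can also be supplied on the command line with -D. The following is a minimal sketch of a helper that applies a fallback only when nothing was supplied; the class and method names are hypothetical, and only the property name comes from the text above.

```java
// Hypothetical helper; only the property name is taken from the article.
class RepositoryHomeSketch {
    static final String HOME_PROPERTY = "org.apache.jackrabbit.repository.home";

    // Applies the fallback only when no value was supplied (for example via -D)
    // and returns the value that is in effect afterwards.
    static String ensureHome(String fallbackHome) {
        String configured = System.getProperty(HOME_PROPERTY);
        if (configured == null) {
            System.setProperty(HOME_PROPERTY, fallbackHome);
            return fallbackHome;
        }
        return configured;
    }
}
```

Calling RepositoryHomeSketch.ensureHome("c:/temp/Blogging") before the repository is created keeps the sample's default while still letting an administrator override it at startup.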

Developing a Blogging Application

With our Apache Jackrabbit installation built and configured, it's time to take the next step and build a sample application. In this section, we will develop a sample blogging application using the JSR-170 API. We need two things for developing this sample application: a backend to add, update, retrieve, and remove content in the content repository, and a client to provide a UI for performing these operations.

First we create a clear-cut separation between these two parts by defining a DAO interface for the backend layer. So, create the BlogEntryDAO.java interface like this:

import java.io.InputStream;
import java.util.ArrayList;

public interface BlogEntryDAO {
    public void insertBlogEntry(BlogEntryDTO blogEntryDTO)
        throws BlogApplicationException;
    public void updateBlogEntry(BlogEntryDTO blogEntryDTO)
        throws BlogApplicationException;
    public ArrayList getBlogList()
        throws BlogApplicationException;
    public BlogEntryDTO getBlogEntry(String blogTitle)
        throws BlogApplicationException;
    public void removeBlogEntry(String blogTitle)
        throws BlogApplicationException;
    public ArrayList searchBlogList(String userName)
        throws BlogApplicationException;
    public void attachFileToBlogEntry(String blogTitle, InputStream uploadInputStream)
        throws BlogApplicationException;
    public InputStream getAttachedFile(String blogTitle)
        throws BlogApplicationException;
}

As you can see, this interface has methods for adding, updating, removing, listing, and searching for blog entries, and two methods for dealing with binary content. Next, we need a DTO class that will be used for carrying data between the web layer and the backend layer. Create the BlogEntryDTO class like this:

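import java.util.Calendar;

public class BlogEntryDTO {

    private String userName;
    private String title;
    private String blogContent;
    private Calendar creationTime;

    // Constructors (the DAO listings use one taking userName, title,
    // blogContent, and creationTime) plus getter and setter methods
    // for each of these properties
}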

Every blog entry will have four properties associated with it: userName, title, blogContent, and creationTime. With this interface between the UI layer and backend layer in place, we can implement either layer. Because of space constraints, we have decided not to spend time talking about the UI layer; instead you can download sample code for this application from the resources section, where you'll find a sample Struts-based UI for BlogEntryDAO. The next section describes implementing the backend for the blogging application by implementing BlogEntryDAO.

Connecting to Jackrabbit

The first thing that we want to do in developing the backend is to write the component that gets a connection to Jackrabbit. To keep things simple, we will get the connection to Jackrabbit at application startup time and drop that connection when the application shuts down. Since we are developing a Struts application, we need to create our own PlugIn class that gets control at application startup and shutdown times, like this:

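import java.io.IOException;

import javax.jcr.LoginException;
import javax.jcr.Repository;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.servlet.ServletException;

import org.apache.jackrabbit.core.TransientRepository;
import org.apache.struts.action.ActionServlet;
import org.apache.struts.action.PlugIn;
import org.apache.struts.config.ModuleConfig;

public class JackrabbitPlugin implements PlugIn {
    // Single session shared by the whole application, opened in init()
    private static Session session;

    public void destroy() {
        // init() may have failed, so guard against a missing session
        if (session != null) {
            session.logout();
        }
    }

    public void init(ActionServlet actionServlet, ModuleConfig moduleConfig)
            throws ServletException {
        try {
            System.setProperty("org.apache.jackrabbit.repository.home",
                "c:/temp/Blogging");
            Repository repository = new TransientRepository();
            session = repository.login(new SimpleCredentials("username",
                    "password".toCharArray()));
        } catch (LoginException e) {
            throw new ServletException(e);
        } catch (IOException e) {
            throw new ServletException(e);
        } catch (RepositoryException e) {
            throw new ServletException(e);
        }
    }

    public static Session getSession() {
        return session;
    }
}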

The init() method of the JackrabbitPlugin class is called at application startup and the destroy() method is called at shutdown. The code inside the init() method gets the connection to Jackrabbit. The first thing we do is set the org.apache.jackrabbit.repository.home system property to point to c:/temp/Blogging, indicating where Jackrabbit should store its data. Next, we create a new instance of TransientRepository. This is a class provided by Apache Jackrabbit, offering a proxy to the repository. It starts up the repository automatically when the first session is opened, and automatically stops the repository when the last session is closed.

Once you have a repository object, you can call its login() method to open a connection. login() takes an object of type Credentials as an argument; if this is null, it is assumed that authentication is handled by a mechanism external to the repository itself (for example, the JAAS framework). Since we are not passing a workspace name argument to login(), Jackrabbit will create a default workspace and return a Session object for that particular workspace. The Session object encapsulates both the authorization settings of a particular user and a binding to the workspace specified by the workspaceName passed on login. Please note that each session is bound to exactly one workspace.

Add Content

With Apache Jackrabbit set up properly and our code to connect to it, we can implement the methods of BlogEntryDAO. The first method that we want to implement is insertBlogEntry(), which is used for adding new blogEntry nodes:

public void insertBlogEntry(BlogEntryDTO blogEntryDTO)
        throws BlogApplicationException {
    try {
        Session session = JackrabbitPlugin.getSession();
        Node rootNode = session.getRootNode();
        Node blogEntry = rootNode.addNode("blogEntry");
        blogEntry.setProperty("title", blogEntryDTO.getTitle());
        blogEntry.setProperty("blogContent", blogEntryDTO.getBlogContent());
        blogEntry.setProperty("creationTime", blogEntryDTO.getCreationTime());
        blogEntry.setProperty("userName", blogEntryDTO.getUserName());
        // Nothing is persisted until the session is saved
        session.save();
    } catch (RepositoryException e) {
        // Assumes BlogApplicationException offers a (Throwable) constructor
        throw new BlogApplicationException(e);
    }
}

The first thing that we are doing in this method is getting the session object initialized in the JackrabbitPlugin class. After that, we call getRootNode() on the session object, which returns the root node ("/") of the workspace. Once we have an object pointing to the root node, we can add a new child node to it by calling the addNode() method on rootNode; this will create a new child node named blogEntry. After that, we can set the actual content of blogEntry as properties of the node. You might remember from the discussion on the repository model that properties are leaves and are used for storing actual content. Every blogEntry will have four properties: title, blogContent, creationTime, and userName, each of which can be set by calling setProperty() on the newly created blogEntry node.

Notice the use of session.save() in the insertBlogEntry() method. This call is needed because changes made through methods of Session, Node, or Property are not immediately reflected in the persistent workspace. The changes are held in the transient storage associated with the Session object until they are persisted using either Session.save() or Item.save(). Session.save() validates the changes and, if this validation succeeds, persists all pending changes currently stored in the Session object. Until this is done, changes made using one session are not visible to other sessions. Conversely, Session.refresh(false) discards all pending changes currently stored in the session. For more fine-grained control over which changes are persisted or discarded, the methods Item.save() and Item.refresh() are also provided. Item.save() saves all pending changes in the Session that apply to a particular item or its subtree. Analogously, Item.refresh(false) discards all pending changes that apply to that item.
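
The split between transient and persistent state can be pictured with a small plain-Java model. This is a deliberately simplified sketch, not the javax.jcr.Session API: the class SketchSession and its methods are hypothetical, a flat map stands in for the workspace tree, and the validation step performed by a real save is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a session's transient storage; NOT javax.jcr.Session.
class SketchSession {
    // Stands in for the persistent workspace shared by all sessions.
    private final Map<String, String> workspace;
    // Changes made through this session that have not been saved yet.
    private final Map<String, String> pending = new HashMap<>();

    SketchSession(Map<String, String> sharedWorkspace) {
        this.workspace = sharedWorkspace;
    }

    // Records the change in transient storage only.
    void setProperty(String path, String value) {
        pending.put(path, value);
    }

    // A session sees its own pending changes layered over the persistent state.
    String getProperty(String path) {
        return pending.containsKey(path) ? pending.get(path) : workspace.get(path);
    }

    // Plays the role of Session.save(): persist all pending changes.
    void save() {
        workspace.putAll(pending);
        pending.clear();
    }

    // Plays the role of Session.refresh(false): discard all pending changes.
    void discardPending() {
        pending.clear();
    }
}
```

Two SketchSession objects created over the same map behave like two sessions on one workspace: a value set through the first is invisible to the second until the first calls save().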

In the web UI you can post a new blog entry by going to the http://localhost:8080/ page, clicking on the "Add node" link, filling out the form, and clicking submit to invoke the insertBlogEntry() method.

Traversal

How do we test that the node was actually added and persisted to the content repository? Implement the getBlogList() method of BlogEntryDAO, which returns a list of all child nodes of the root node whose name is equal to blogEntry. The following code listing demonstrates how to do that:

public ArrayList getBlogList() throws BlogApplicationException {
    ArrayList blogEntryList = new ArrayList();
    try {
        Session session = JackrabbitPlugin.getSession();
        Node rootNode = session.getRootNode();
        NodeIterator blogEntryNodeIterator = rootNode.getNodes();

        while (blogEntryNodeIterator.hasNext()) {
            Node blogEntry = blogEntryNodeIterator.nextNode();
            // Skip children of the root node that are not blog entries
            if (!"blogEntry".equals(blogEntry.getName())) {
                continue;
            }
            String title = blogEntry.getProperty("title").getString();
            String blogContent = blogEntry.getProperty("blogContent").getString();
            Value creationTimeValue = blogEntry.getProperty(
                    "creationTime").getValue();
            String userName = blogEntry.getProperty("userName").getString();
            BlogEntryDTO blogEntryDTO = new BlogEntryDTO(userName, title,
                    blogContent, creationTimeValue.getDate());
            blogEntryList.add(blogEntryDTO);
        }
    } catch (RepositoryException e) {
        // Assumes BlogApplicationException offers a (Throwable) constructor
        throw new BlogApplicationException(e);
    }
    return blogEntryList;
}

Once you have a root node object, you can call getNodes() on it to return all its child nodes. If the node does not have any children, then an empty NodeIterator is returned. We can iterate through the NodeIterator to get a list of blogEntry nodes. You can call the node's getProperty() to read a property with a supplied name. getProperty() returns a Property object, and calling getValue() on it returns an instance of Value, whose implementation class depends on the type of property stored. On either object you can call type-specific methods such as getString() for reading a string stored in the property, or getDate() for a stored date.

When you go to http://localhost:8080/ in the web UI, the index page of the blog application will call the getBlogList() method and display all entries on the index page.

Searching for Content (XPath)

JSR-170 defines two ways to search for content. One uses XPath syntax and the other uses SQL syntax. The specification mandates that every Level 1 compliant repository should provide support for XPath syntax, but support for SQL search is an optional feature that we will talk more about in the next part.

XPath is a search language originally designed for selecting elements from an XML document. Since a workspace, like an XML document, can be viewed as a tree structure, XPath provides a convenient syntax for searching workspace content.
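
To make the analogy concrete, the sketch below runs an XPath expression over a small XML document using only the JDK's javax.xml.xpath package. The XML is made-up sample data that mirrors the blog workspace (one element per blogEntry node, one attribute per property), and the class name XPathTreeSketch is hypothetical. It illustrates XPath over a tree, not the JSR-170 query API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XPathTreeSketch {
    // Made-up sample data: elements play the role of nodes,
    // attributes play the role of properties.
    static final String WORKSPACE_AS_XML =
        "<root>"
        + "<blogEntry blogAuthor='sunil' blogTitle='First post'/>"
        + "<blogEntry blogAuthor='alice' blogTitle='Hello'/>"
        + "<blogEntry blogAuthor='sunil' blogTitle='Second post'/>"
        + "</root>";

    // Returns how many elements of the sample document match the expression.
    static int countMatches(String xpathExpression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    WORKSPACE_AS_XML.getBytes(StandardCharsets.UTF_8)));
            NodeList matches = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpathExpression, document, XPathConstants.NODESET);
            return matches.getLength();
        } catch (Exception e) {
            // Parsing or evaluation failed; surface it as an unchecked error.
            throw new IllegalStateException(e);
        }
    }
}
```

Here XPathTreeSketch.countMatches("//blogEntry[@blogAuthor='sunil']") selects the two entries written by sunil, which is the same shape of query the blogging application sends to the repository.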

Let's change our blogging application so that it allows you to search for all blog entries posted by a particular user (i.e., all blog entries where blogAuthor is some user name). We need two things to implement this. The first is a change to the UI to accept the user name and display results to the user. This feature can be seen in the sample application, which displays an input box named Blogger Name at the top of the page, with a "Search" button. When you input text in this box and click search, control goes to SearchBlogEntriesAction.java, which calls the searchBlogList() method of BlogEntryDAO with the blogger name supplied by the user, and displays the results as a list. The second is the backend query itself, so the only thing that we have to do is implement the searchBlogList() method of the JackrabbitBlogEntryDAO class like this:

public ArrayList searchBlogList(String userName) throws BlogApplicationException {
    ArrayList blogEntryList = new ArrayList();
    try {
        Session session = JackrabbitPlugin.getSession();
        Workspace workSpace = session.getWorkspace();
        QueryManager queryManager = workSpace.getQueryManager();

        // Build the XPath query that matches entries by author
        StringBuffer queryStr = new StringBuffer(
                "//blogEntry[@" + PROP_BLOGAUTHOR + "= '");
        queryStr.append(userName);
        queryStr.append("']");
        Query query = queryManager.createQuery(queryStr.toString(),
                Query.XPATH);

        QueryResult queryResult = query.execute();

        NodeIterator queryResultNodeIterator = queryResult.getNodes();
        while (queryResultNodeIterator.hasNext()) {
            Node blogEntry = queryResultNodeIterator.nextNode();
            String title = blogEntry.getProperty(PROP_TITLE).getString();
            String blogContent = blogEntry.getProperty(PROP_BLOGCONTENT).getString();
            Value creationTimeValue = blogEntry.getProperty(
                    PROP_CREATIONTIME).getValue();
            BlogEntryDTO blogEntryDTO = new BlogEntryDTO(userName, title,
                    blogContent, creationTimeValue.getDate());
            blogEntryList.add(blogEntryDTO);
        }
    } catch (RepositoryException e) {
        // Assumes BlogApplicationException offers a (Throwable) constructor
        throw new BlogApplicationException(e);
    }
    return blogEntryList;
}

Once we have the session object, call the getWorkspace() method on it to retrieve the workspace attached to the current session. Remember, each session is bound to exactly one workspace. This workspace object can be used to retrieve the QueryManager associated with this workspace. The QueryManager interface encapsulates methods for management of search queries. The next thing that we want to do is to create a query string, which in our case would be "//blogEntry[@blogAuthor='userName']", where userName stands for the name typed in by the user. This means "find all nodes with a name equal to blogEntry and a blogAuthor property equal to the name supplied by the user". See the JSR-170 specification document for more details on the syntax of an XPath query.
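
One caveat with building the query by plain concatenation is that a user name containing an apostrophe (O'Neil, say) ends the string literal early and produces an invalid query. A minimal sketch of a guard follows, assuming the query parser accepts the XPath 2.0 convention of doubling an apostrophe inside a quoted literal; check this against your repository before relying on it. The class BlogQuerySketch is hypothetical, and the value of PROP_BLOGAUTHOR is assumed to be "blogAuthor".

```java
// Hypothetical helper for building the author query safely.
class BlogQuerySketch {
    // Assumed value of the PROP_BLOGAUTHOR constant used in the listing above.
    static final String PROP_BLOGAUTHOR = "blogAuthor";

    // Doubles each apostrophe so the value stays inside the quoted literal.
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    // Builds //blogEntry[@blogAuthor = '...'] for the given user name.
    static String authorQuery(String userName) {
        return "//blogEntry[@" + PROP_BLOGAUTHOR + " = '"
            + escapeLiteral(userName) + "']";
    }
}
```

The resulting string can be passed to createQuery() in place of the hand-built StringBuffer.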

You can create a new query object by calling the queryManager's createQuery() method, passing a query string and the name of the query language (XPath in our case). Once you have the Query object, you can call its execute() method to actually execute the query and return a QueryResult object. The results returned always respect the access restrictions of the current session. In other words, if the current session does not have read permission for a particular item, then that item will not be included in the result set, even if it would otherwise constitute a match. All queries are run against the persistent state of the workspace; pending changes stored in the session are not searched. Once you have the results, you can call the getNodes() method on them to get an iterator over the nodes that match the query.

Two more methods that we want to implement are updateBlogEntry() and removeBlogEntry(), which are actually very simple. In both of these methods, we retrieve the relevant node by using the title as the primary key. In the update case, set the new values for the properties to update; in the remove case, once you have the target node, call the remove() method on it. Don't forget to call session.save() to make sure that your changes are persisted.

Handling Binary Content

One of the primary requirements of a content repository is that it should be able to handle binary content, such as image files. Now let's assume that we want to allow the user to attach an image to a particular blogEntry and we also want to add a retrieval method. To do that, we have added two links to every blog entry in its title bar: "attach file" for attaching an image to a blog entry and "display attached file," which displays the image attached to that blogEntry. To get this feature working we have to implement two methods from BlogEntryDAO: attachFileToBlogEntry() and getAttachedFile() :

public void attachFileToBlogEntry(String blogTitle,
        InputStream uploadInputStream) throws BlogApplicationException {
    try {
        Session session = JackrabbitPlugin.getSession();
        Node blogEntryNode = getBlogEntryNode(blogTitle, session);
        // Setting an InputStream as the value creates a binary property
        blogEntryNode.setProperty(PROP_ATTACHMENT, uploadInputStream);
        session.save();
    } catch (RepositoryException e) {
        // Assumes BlogApplicationException offers a (Throwable) constructor
        throw new BlogApplicationException(e);
    }
}

public InputStream getAttachedFile(String blogTitle) throws BlogApplicationException {
    try {
        Session session = JackrabbitPlugin.getSession();
        Node blogEntryNode = getBlogEntryNode(blogTitle, session);
        Value attachFileValue = blogEntryNode.getProperty(PROP_ATTACHMENT).getValue();
        return attachFileValue.getStream();
    } catch (RepositoryException e) {
        throw new BlogApplicationException(e);
    }
}

As you can see from this code listing, the repository does not treat binary content any differently from any other type of content. The only difference is that you add binary content by setting an InputStream as the value of a property, and you get an InputStream back when retrieving a property with a binary value. Where this file is actually stored is determined by the value of the externalBLOBs attribute in your persistence manager configuration. If its value is true, then this image file will be stored on the filesystem, and if false it will be stored in the database as a BLOB. In our case this value is true, so the uploaded image file will be stored in the filesystem.
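
Whoever calls getAttachedFile() receives an open stream and is responsible for reading and closing it. The sketch below shows one way to do that with the standard library only; the class name AttachmentStreams is hypothetical, and buffering the whole attachment in memory is an assumption that suits small images but not large files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for consuming the stream returned by getAttachedFile().
class AttachmentStreams {
    // Reads the stream to the end, closes it, and returns everything that was read.
    static byte[] readFully(InputStream in) {
        try (InputStream source = in;
             ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = source.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For large attachments, the same read loop can write each chunk straight to the servlet response instead of an in-memory buffer.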

Summary

By now you should have a good understanding of JSR-170, Jackrabbit, and how to develop simple applications using JSR-170 API. In this article, our discussion was focused mostly on the basics of JSR-170 and Apache Jackrabbit. We started our discussion by talking about the Java Content Repository API and the benefits of standardization in this space. We then covered the repository model defined by JSR-170 and how to download, configure, and run Apache Jackrabbit. After that, we developed a sample blogging application for demonstrating the basic features of JSR-170 API.

Resources

Sunil Patil has worked on J2EE technologies for more than five years. His areas of interest include object relational mapping tools, UI frameworks, and portals.
