Archive for the ‘FAST ESP’ Category

FAST ESP to Lucene Solr Presentation: Open Call for Questions

by Karen Lynn

TNR Global is excited to be participating in the Apache Lucene EuroCon conference in Barcelona.  Our own Michael McIntosh is scheduled to present:  “Enterprise Search: FAST ESP to Lucene Solr” Here is your chance to pre-load the discussion. Before Michael puts the final touches on his talk, he wants to know what issues or questions you may be have.  In the following video, he touches on some of the highlights of his upcoming talk, and asks for your input.

Enterprise Search: FAST ESP to Lucene Solr pre-conferece video - Click to Watch

Enterprise Search: FAST ESP to Lucene Solr pre-conf video

To participate in advance, send you questions or comments to:  fast2solr@tnrglobal.com.  While Michael cannot promise he will include your question or commentary in his actual talk, he will work to address them in an upcoming White Paper, to be released after the conference in November 2011. We look forward to hearing from you!

Crawling Solr

by Karen Lynn

Recently there has been a lively discussion on Linked In’s Enterprise Search Engine Professionals Group started with this question:


“Is it an handicap for Solr to depend on third party solutions for crawling the Web like Nutch?


Our own Michael McIntosh felt compelled to respond. What follows is his post to this topic in it’s entirety.


“This topic makes me think of the saying “Write programs that do one thing and do it well.” The longer version of this philosophy, as expressed by Doug McIlroy, is this: “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” Solr stands very well on its own and, based upon my impression of the Solr community so far, more people currently use Solr for structured content vs unstructured like web documents. I think that Solr should have some ‘out of the box’ web crawler implementation available, but it should not be the core focus. It can serve to allow new users of Solr to focus more on the Solr/Lucene side of things and not have to worry about rolling their own crawler or figuring out which is the best third-party crawling solution to use. I suspect that many people who need to do crawling can get by with a fairly basic crawler. My impression of Nutch so far is that is more complicated than most Solr users need out of the starting gate. That said, if you have a business that deals with large amounts of crawled unstructured content, its very likely they will need something more robust than you can reasonably ship & support as part of the Solr project. For one of our clients, the size of our dataset has grown from needed just a couple boxes, to multiple clusters with many machines each. One of the newest developments is the growth of the amount of unstructured content has grown to a size where we now need a crawler CLUSTER. When we first started on this, it never occurred to us that we might need multiple machines for the crawling side of the equation, but it has happened. But I think our case its less common. All in all, I think Solr should have a bare-bones reference implementation of a crawler that can easily be expanded upon, but it is probably not an effective use of effort to Solr developers to focus on the crawling side. Let a third party focus on the issues of crawling, it is a deceptively complicated issue.”


After his post I caught him in the office and asked where he was going with this line of thinking. “We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing a ESP to Solr migration.” He revealed. Sounds like a very promising solution to a fairly big, and common problem for companies with vast amounts of metadata. And as for unstructured content? Well, it’s the proverbial elephant in the room, don’t you think?


To see the entire conversation, with contributions from experts in the field of search architecture, click here. To get in touch with Michael directly to discuss your architecture and crawling needs, contact us.

Open Source Search: Isn’t It Expensive?

by Karen Lynn

You’ve heard the debate on open source search vs. proprietary search. One question that constantly comes up for prospective clients is “What’s all this going to cost me?”

In these times, it’s a good question. Because proprietary has neatly packaged, practically shrink wrapped plans, it’s much easier to discern how much you will spend on a solution. But how much will it cost? That’s an entirely different question.

I see you cocking your head sideways.

Proprietary search has hidden costs. What if the software doesn’t perform the way you need it to? Does the software understand the nuances of your business? How adaptable is it? How much will it cost to adapt that software to get it to perform the way my business needs it to? Questions like this need to be asked, and answered. Eventually you will ask yourself….why am I paying for all of this? And your developer will ask, “why can’t I access the source code?”

What I’m getting at is this: It is a reassuring feeling for a customer to see what a package costs, to understand what services you will get with a solution, and to anticipate what the licensing fee will cost on an annual basis. If it’s your job to research a solution and present findings to your executive team to make a decision, then proprietary search, on the surface, seems a more secure choice. But rarely, if ever, are these solutions a perfect fit for the customer. It’s like buying a Ferrari, with all the brand recognition and polish a Ferrari offers, and not ever driving it past second gear, or cutting the wheel more than 15 degrees, or getting a chance to have your trusted mechanic look under the hood. This is why open source is such a good solution for businesses who want their IT to move quickly.

We’re hearing more buzz about companies waking up to the agility of an open source solution. Most recently, with the acquisition of Autonomy by HP, the industry is telling stories of ex Autonomy customers migrating to Solr (open source search) with only the annual licencing budget to finance the migration. Without an annual expenditure of cash for licensing, and the freedom of not being under a licensing agreement, companies quickly recoup the initial expenditure of a migration.

What kind of car does your company drive?

If you are examining the different choices for implementing search technology in your organization, contact us.  We’re happy to talk to you about the best solution for your business.


Migration Still Looms Large on the Horizon for FAST ESP Customers

by Karen Lynn

Microsoft acquired FAST all the way back in 2008 and then in early 2010 disclosed it’s plans to stop updating the FAST product on a Linux operating system after 2010, making FAST ESP 5.3 the latest and greatest, and very last update Linux users will see involving any improvements to the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner than later.

Buzz about migration ensued–an inevitable certainty for many companies, especially ones with huge amounts of data. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the industry, believed that the move from Microsoft was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means, when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, for the FAST ESP technical team with the timing of renewing a license and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy and the shift of HP from hardware to software could signal a recognition from Goliath HP the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs.” says McIntosh. If a need for migration is eminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in house talent, and growth expectations.

How is your company preparing for the discontinuation of support of FAST ESP?  Need guidance?  Contact us for pointers, analysis, or architecture for a full migration.

Open Source Search Engines vs. Proprietary Search Engines

by Karen Lynn

There are plenty of articles about the pros and cons of open source software vs. proprietary software. I sat down with our VP of Search Technologies Michael McIntosh to discuss the benefits of each in terms of search engines.

Karen: You’ve been working with proprietary search engines for some time now. Tell me your thoughts on that. What’s the upside of proprietary?

Michael: I’ve worked with proprietary search engines for several years, specifically with FAST ESP since 2003 back when it was known as FAST Data Search (FDS). Proprietary software products often have better documentation, better support; more thought out design and are more aggressively tested. Because the product supports an entire company—it must succeed. They have nicer tools, nicer interfaces.

Karen: And the downsides?

Michael: “Over the years, we’ve run into a number of difficulties with proprietary search engines. One thing that comes up is that if a problem isn’t outlined in user troubleshooting documentation, it can become incredibly difficult to diagnosis and correct, and doing that is a frustrating, time intensive problem. The black box nature of the product is very limiting. If it’s not in documentation, it might as well not exist at all.”

Karen: There are gaps in the user manual?

Michael: “Yes. But in defense of FAST ESP, the documentation has improved by leaps and bounds over the years. However, one anomaly we find is that the clean, easy to read PDF form of user documentation (the original) for ESP is often not as up-to-date or helpful as the searchable online documentation—which is harder to read, but usually more current and correct. Sometimes even the online documentation is wrong—which is also frustrating. But it has become something we cope with regard to FAST”

Karen: Give me an example of what kind of problems you run into when integrating the proprietary search engine into a client’s website.

Michael: “The enterprise search platform uses Service Oriented Architecture (SOA). That is a bunch of different components that are able communicate with each other as services. With this type of architecture, it doesn’t matter which language you write something in as long as it’s a service another language can communicate with via RESTful interfaces, SOAP or something like XML-RPC. These services can all work together, despite the fact you don’t have a unified api—and that’s actually awesome.

Karen: Why?

Michael: “Indexers for one—you generally don’t want them written in an interpreted language for performance reasons. Indexing can be a CPU intensive operation, which can be a weak spot for interpreted languages such as Python or Ruby compared to languages like C/C++ or Java. It is both CPU and disk intensive process so a scripting language can be great if you’re writing an application that’s not CPU intensive because code speed doesn’t matter so much. The true slow-down is something outside the script. You can optimize the speed of the script to make it run as fast as humanly possible but you’re limited because the disk can only rotate so fast. Your indexing service can be written in a low level language like C vs. some of the other services in languages like Python or Ruby or Java and get good performance. BUT if you don’t have documentation for compiled programs that make up the search engine product you’re going to have a terrible time trying to figure out how to fix issues when they arise.

Karen: “So basically, because you have so many different languages being used that you lack the source code for, and documentation is spotty, it becomes a needle in a haystack trying to figure out where the problem occurs.”

Michael: “Yes. And this is not such a problem for anyone using a search engine for really basic applications. The place where you run into problems is if you push the search engine technology to its limits or you are using it in ways outside its typical usage, which is almost everything we do. We are always trying to get the best possible performance out of the search engine. We’re trying to get the search engine to deal with features we need but it doesn’t natively support.

Karen: “What is it you’re pushing specifically for the search engine to do?”

Michael: “One thing we want is for FAST ESP to have is a feature to deal with creating a faceted search for arbitrary fields. The way ESP works is that it has an index profile which is a statically defined set of fields that it indexes. Inside its index profile you can mark certain fields to have navigators. One of our customers deals with product verticals. They have a whole bunch of products that aren’t unified—all with completely different attributes. We’ve managed to work around these roadblocks in ESP to create faceted navigation on arbitrary fields.

Karen: “So you get creative to make it better.”

Michael: “Yes we constantly get creative to make it better to use its strengths and find ways to work around its limitations.”

“Another issue we have with ESP is we have a number of websites we need crawled and each website has metadata associated with it. Unfortunately the way the ESP crawler works, there are not many straightforward ways to preserve metadata associated with the seed URL which we use to crawl a website and pass the meta information along to any associated links. We can’t do this easily inside the ESP crawler. Since its proprietary and black box, we can’t look at the source code to the crawler, and can’t modify the source code to the crawler. When it does something mysterious, we can have no idea why it might be behaving in an unwanted way. We had one instance when we had a number of websites the crawler was temporarily blacklisting for some reason. When the ESP crawler automatically blacklists a site, it stops crawling the site for 30 minutes and then begins again after 30 minutes. We learned one thing that triggers blacklisting is if a website has HTTP 503 errors. If a site has more than 20 or so of those errors, the crawler temporarily blacklists the site. The problem was that the documentation is too sparse on details for that topic. When we ran into that problem—it was really difficult to know what was going on so we could properly explain the issue to the client and address the problem. Conversely, if we had an open source search engine, I could have just searched the source code and speed up the diagnosis of the problem.”

Karen: So from a business prospective—using open source allows you to invest in a technology that gives you the power to modify the code to better meet your business needs.

Michael: “It certainly can. It can accelerate development time and speed of diagnosing problems when issues pop up. And issues always do pop up. If something is not working very well, we can look at the problem which a much higher degree of granularity.

If it’s a simple problem, we never contact support. We only contact support when we’re stumped. And we’re not easily stumped. Usually, they can’t answer they question immediately because if we’re asking for help, the problem is complex. Our ticket is escalated, and eventually we talk to someone who can help us. But it does take time. Even if a support staff is top notch, there is the time is costs to deal with that, and that costs us and our clients’ time. We have a highly customized ESP installation for one of our clients it always take an enormous amount of time to explain over and over how we have our systems set up, the different parts work, and it’s a big pain to go through that every time I run across a problem. If it were open source, I can simply look at the source code and solve the problem.”

Karen: Let’s talk in more detail about open source search engines. Upsides?

Michael: “If you choose a popular open source search engine solution like Lucene Solr, you have an active, passionate community behind that solution. There are several developers looking at that engine, working on it, and actively posting in publicly available forums. You can often get your questions answered there by top notch experts in search technology. You can potentially talk to the original coders and creators of the product—and they are often happy to help you. I’ve seen people post a Solr question on their twitter feed and within 7 seconds, the creator has responded with a link to a forum explaining the solution.

Karen: Wow, that’s amazing. Other advantages?

Michael: “It’s free. That’s attractive to most companies. The downside is the formal documentation isn’t usually as good as the proprietary, and there isn’t a dedicated support team for the product. But if you have some savvy software developers on your team, the open source community is robust and willing to share information about the product. And having access to the source code is extremely valuable.”

Karen: So in your opinion, what’s the bottom line on Open Source Search Engines vs. Proprietary Search Engines?

Michael: “If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.”

If you’d like more information on the pros and cons of Open Source Search Engines vs. Proprietary Search Engines that are specific to your business or organization’s needs, feel free to contact us for a free consultation.

Fast ESP Error: no doc procs registered to process a batch with priority 0

by admin

Just wanted to take this error message off of the, “Hey, we’ve seen this before… now how did we resolve this..?” pile.  This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s).  However, this won’t necessarily solve the problem.  Turns out that the a likely reason for this to pop up is a bad Document Processing Pipeline (DPP) Stage.  The docprocs fire up, hit the bad stage (e.g. python errors etc.) and don’t recover.

To debug your DPP Stage, take a look at the logs for the document processor(s).  They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught python exception lurking somewhere in there.

Integrate custom services with the Fast ESP Node Controller

by admin
Add your own services to ESP's Node Controller

Add your own services to ESP's Node Controller

Background

Integrating our own custom services with Fast ESP’s Node Controller provides us with several benefits:

  • Administrators without in-depth ESP knowledge can easily control services (e.g. start, stop, configure parameters)
  • Services can be started at boot time with the rest of ESP
  • espdeploy can be used to install our services in a multi-node cluster

The components required for this system are:

  1. The ESP Node Controller (with config file NodeConf.xml)
  2. The 3rd-party service (like a CherryPy server, log parser, etc.)
  3. A wrapper script (see below)

Steps for Integration

  1. Define the service you would like to integrate. It can be any script or binary that can be executed on the system. For example, the service might be a python script that takes command-line arguments and continues running itself (as is the case with a webserver).
  2. Create the wrapper script that sets up the proper environment and runs/stops the service properly. The wrapper script should be put in the $FASTSEARCH/bin directory (with executable permissions). Additionally, the wrapper script should pass $@ to your actual script so any/all arguments defined in $FASTSEARCH/etc/NodeConf.xml will be passed along properly from the Node Controller to your service. The following is an example of a wrapper script:
    #!/bin/sh
    
    # export the proper python path
    export PYTHONPATH=":/path/to/python"
    
    # run the script (backgrounded)
    python $FASTSEARCH/lib/python2.6/yourmodule/yourservice.py $@ &
    
    # determine the process id of the python script
    SCRIPT_PID="$!"
    
    # upon receiving a SIGTERM, forward it to the process
    trap "kill -TERM $SCRIPT_PID" SIGTERM
    
    # wait for SIGTERM from nctrl
    wait
  3. Define the service in $FASTSEARCH/etc/NodeConf.xml
    Add the following to the end of the <startorder> tag:
    <proc>servicename</proc>

    Add the following to the end of the <node> tag, customizing as appropriate:

    <!-- My Custom Service -->
    <process name="servicename" description="My Custom Service">
            <start>
                    <executable>binaryname</executable>
                    <parameters>-p 16940 -v</parameters>
                    <port base="4004"/>
            </start>
            <outfile>servicename.scrap</outfile>
    </process>
  4. Reload the Node Controller configuration with the following:
    nctrl reloadcfg

And that’s it!  Now you should be able to start, stop, configure, and deploy your services using Fast ESP tools.  Enjoy!

FAST ESP Overview

by Michael McIntosh

We use FAST ESP to power a large industrial search engine listing over 1 million companies and over 3 million indexed documents and receiving millions of visitors every month. I have been working with ESP since 2003 (then known as FDS 3.2).

FAST ESP is extremely flexible and can deal with indexing many document types (html, pdf, word, etc). It has a very robust crawler for web documents and you can use their intermediary FastXML format to load custom document formats into the system or use their Content APIs. Read the rest of this entry »