Adding some safeguard measures to Solr with a SearchComponent

Just to be clear, I won't be talking here about adding security to Solr; documentation on that topic can be found in several places. I'm going to talk about a different kind of problem: the problems that come up when a user of your application requests a huge amount of documents from your Solr cluster. To clarify, perhaps your particular use case requires that your users can fetch a great deal of documents from Solr; in that case you should have planned your CPU and RAM requirements accordingly, taking into account your particular schema, caches, and analyzers, and perhaps even using DocValues to improve query speed.

But in some cases you just want to put safety measures in place to prevent your users from fetching a lot of documents from Solr. Out of the box, Solr comes without any option to deal with this case, and sure, you could implement it in your app; but if you have several "clients" of your Solr data (this was my case), then you have to spread this change into all of those clients, and every time you change something in one place you'll need to change it in all the other pieces of code.

If you just want to download/compile/use the code, you can find a ready-to-use Maven project in this GitHub repo.

In my personal opinion, all this validation should be embedded in the Solr system itself, although in some cases where speed is a key aspect someone could decide to make the change in whichever component is simplest to modify, leading (in several cases) to problems later on.

So, the motivation for this post basically comes from @NickVasilyevv:

From the conversation held on Twitter, we could basically understand that he was struggling with some users requesting a lot of documents, and that the solution was to deploy a frontend patch limiting paging, which is a perfectly valid solution. A couple of tweets later, @NickVasilyevv posted the tweet referenced above, which got me thinking: there has to be some way of accomplishing this within Solr itself.

First I thought of a custom SearchHandler, but that would be overkill (unless your SearchHandler is going to do something totally different), and you would have to change every search request handler in your Solr configuration to use it. So the answer was clear: a SearchComponent. Using a SearchComponent means you can set some basic configuration on it and reuse it across all your search request handlers, or even define several "instances" of your SearchComponent with different configurations and use them separately in your request handlers as needed (I'll show an example of this a bit further down). This means a lot more flexibility and reusability.

So let's get to work. Implementing a SearchComponent isn't that complicated, and this particular case is even easier. Basically you just need to define a class that extends SearchComponent, which will force you to implement several methods:

  • void prepare(ResponseBuilder rb)
  • void process(ResponseBuilder rb)

Along with this pair of methods, which help you keep things organized (a minimal skeleton with all four methods follows the list):

  • String getDescription()
  • String getSource()
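
Putting those together, a bare-bones skeleton could look like the following; the package and class name are taken from the configuration snippet shown below, and the method bodies are just placeholders:

package my.company;

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class MaxParamsSearchComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // runs before any component's process(); our validation will live here
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at this stage for this component
  }

  @Override
  public String getDescription() {
    return "Enforces maximum values for the start and rows parameters";
  }

  @Override
  public String getSource() {
    return ""; // typically a URL pointing at the source of the component
  }
}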

So, before we start coding, let's first define how I would like to configure this search component. Basically, which options do I need to make available through the configuration in the solrconfig.xml file? This is how I would expect to configure it:

<searchComponent name="max-parameters" class="my.company.MaxParamsSearchComponent">
    <str name="rows">10</str>
    <str name="start">2</str>
    <str name="overwriteParams">true</str>
</searchComponent>

In this snippet you can see that what I wanted is basically to set max values for the rows and start parameters. With this I can limit the maximum start point a user can request, and also the number of documents to retrieve. One key aspect for flexibility is that you can configure both parameters independently, in any combination you can think of.
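
As an illustration of that flexibility, nothing stops you from declaring two differently configured instances of the same class and pointing different request handlers at them (the names strict-limits and relaxed-limits here are made up for the example):

<searchComponent name="strict-limits" class="my.company.MaxParamsSearchComponent">
    <str name="rows">10</str>
    <str name="start">100</str>
    <str name="overwriteParams">true</str>
</searchComponent>

<searchComponent name="relaxed-limits" class="my.company.MaxParamsSearchComponent">
    <str name="rows">500</str>
    <str name="start">100000</str>
    <str name="overwriteParams">false</str>
</searchComponent>

Each request handler can then reference whichever instance it needs (I'll show how to wire a component into a handler near the end of the post).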

As usual, your class can also implement the SolrCoreAware interface in case you need to read some additional parameters from your configuration files.
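
For the three options shown above, a minimal sketch of the loading code could look like the following; the field names (maxRowsParam, maxStartParam, overwriteParams) match the ones used in the prepare() method below, but the exact loading code is my assumption, not necessarily what the repo does:

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;

// Limits loaded from the searchComponent declaration in solrconfig.xml.
private int maxRowsParam = Integer.MAX_VALUE;
private int maxStartParam = Integer.MAX_VALUE;
private boolean overwriteParams = false;

@Override
public void init(NamedList args) {
  // The <str> elements of the declaration are handed to us as a NamedList.
  SolrParams config = SolrParams.toSolrParams(args);
  maxRowsParam = config.getInt("rows", Integer.MAX_VALUE);
  maxStartParam = config.getInt("start", Integer.MAX_VALUE);
  overwriteParams = config.getBool("overwriteParams", false);
}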

The code is pretty simple. After thinking a little about what @NickVasilyevv wanted, I came to the conclusion that returning the last page isn't really that good an idea: if you consider that you set these values as a safeguard to prevent your users from retrieving too much data from Solr (which can stress your servers), then you'll want to set values that guarantee your average user doesn't hit this wall, and in that case it's also logical to assume that Solr isn't keeping a cache of those documents in memory. So returning the last page could also put some stress on your Solr servers, mainly because you can't be sure that the last "page" will be in the cache.

So, first of all, I decided that instead of returning the last page, what I wanted was to return something that my app could catch and handle gracefully, so I chose to return a 400 (Bad Request) error indicating that there was actually something wrong with the query (while also providing decent logging of the issue in the Solr logs). In my opinion this also has some "semantic" meaning, because there really is something wrong with your query: you're exceeding the configured limits for your requests. After diving through the Solr source code, I came to the conclusion that returning a 400 error code wasn't just a "me thing"; it seems to be a common practice in the Solr codebase.

So, where to put your own custom logic? If you check out the source code of the org.apache.solr.handler.component.SearchComponent class, you'll see that the prepare() method is guaranteed to be executed before any component's process() method. So this is a great spot to add my custom logic. Actually, for this first step you could also use the process() method, but if you really want to implement the behavior requested by @NickVasilyevv, which requires changing the query sent to Solr so that it only returns the "last page", then you must do it in the prepare() method: make all the required changes to the request and then let the other components do their job.

Also, one key aspect of this method is that, since it is guaranteed to run before any component, you won't have access to the results that the query component has fetched from the underlying index. If you do need access to the results of the query component, then you must guarantee that your custom component gets executed after it: you could override the components section of your request handler, or add your component in the last-components section, which appends the components listed there after the default component list.
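
For example, a component that did need the fetched results could be wired in like this (the component name here is hypothetical):

<arr name="last-components">
    <str>my-results-aware-component</str>
</arr>

That's not what we need here, though: our checks belong before the query runs.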

The prepare() method of our component looks like this:

import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;

@Override
public void prepare(ResponseBuilder rb) throws IOException {
  SolrParams params = rb.req.getParams();

  // -1 acts as a sentinel value for "parameter not present in the request"
  int rows = params.getInt("rows", -1);
  int start = params.getInt("start", -1);

  if (rows == -1 && start == -1) {
    // nothing to do, there is no rows or start parameter;
    // the defaults configured on the handler will apply
    return;
  }

  if (rows > maxRowsParam || start > maxStartParam) {
    if (overwriteParams) {
      // overwrite the offending parameters, capping them at the
      // configured maximums, and hand the copy back to the request
      ModifiableSolrParams modifiableSolrParams = new ModifiableSolrParams(params);

      if (rows > maxRowsParam) {
        modifiableSolrParams.set("rows", maxRowsParam);
      }

      if (start > maxStartParam) {
        modifiableSolrParams.set("start", maxStartParam);
      }

      rb.req.setParams(modifiableSolrParams);
    } else {
      // reject the request outright with a 400 Bad Request
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Your start or rows parameter has exceeded the allowed values");
    }
  }
}

It's pretty simple: first I get the parameters out of the request. If no parameter of interest (rows or start) is set, I do nothing and let the default values set on the handler do their work. Let's forget about overwriteParams for a minute; I'll cover it a couple of paragraphs down. After checking the start/rows parameters of the query against the values loaded from the search component configuration (maxStartParam and maxRowsParam respectively), we throw a SolrException with the specified message. Basically, the output of this exception (the headers of the HTTP response) is:

HTTP/1.1 400 Bad Request
Cache-Control: no-cache, no-store
Pragma: no-cache
Expires: Sat, 01 Jan 2000 01:00:00 GMT
Last-Modified: Sat, 16 Aug 2014 06:05:12 GMT
ETag: "147dd6b7477"
Content-Type: application/xml; charset=UTF-8
Content-Length: 0

So we're returning a 400 Bad Request error, although the XML/JSON body of the response is a little more informative:

<lst name="error">
  <str name="msg">Your start or rows parameter has exceeded the allowed values</str>
  <int name="code">400</int>
</lst>

Which is exactly the message we put in our code. Now, the overwriteParams variable is also loaded from the configuration defined for the SearchComponent in question. If this option is set to true in the solrconfig.xml file, then what I want to do, to match the behavior desired by @NickVasilyevv, is to overwrite the start and/or rows parameter values. This means that if you ask for a larger number of rows (say 10,000) and I only allow 100 documents, then your query will be automatically rewritten and you'll get only those 100 documents, which is what @NickVasilyevv asked for in the first place.

For this behavior the ModifiableSolrParams class comes in very handy: the Solr developers allow this through a special class with the particular property of letting parameters be set after construction. In contrast, if you convert the params of the query into a NamedList object, it won't be as easy.

Although this works, there is one behavior that I still don't like but haven't been able to overcome. Say you make the following request to Solr (from the command line, using curl):

curl "http://localhost:8983/solr/select?q=*:*&start=10000000&rows=100"

In the responseHeader you'll see the original values you put in your query (start=10000000) instead of the values set in the configuration of the search component, which can be a little confusing. This happens because when you call the setParams method on the request, you actually change the parameters used to perform the query, but the class keeps a record of the original parameters that were submitted to Solr, and these are the parameters used to build the responseHeader information. I haven't found a way to change these original params, but this doesn't affect the functioning of the search component. However, if you use these parameters at all in your app, then this could be a serious problem.
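
For illustration, the responseHeader of a rewritten request would still echo the original values, along these lines (the status and QTime values are made up):

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="start">10000000</str>
    <str name="rows">100</str>
  </lst>
</lst>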

One more detail about using this custom SearchComponent: essentially you'll want to dismiss/rewrite harmful queries (those with start or rows parameters greater than the allowed values) as soon as possible, so my recommendation is to put this search component in the first-components section of your request handler:

<arr name="first-components">
    <str>max-parameters</str>
</arr>

Sure, since prepare() gets called on every component before any component's process() method runs (i.e., where the main logic of a SearchComponent usually happens), this is somewhat accomplished already. But there are components, like the one we just wrote, that also have logic in their prepare() method, so putting your component first in the chain can spare you a couple of operations; even more so in this case, where, if the conditions are met, the query won't be executed at all. So my advice is to discard the query as soon as possible.

I would like to write a little about the tests I've written to validate the implementation of this component, but this post has become longer than I initially expected, so I hope to write a second post to explain them. In the meantime, you should check the code; there is nothing too odd about it.


4 thoughts on “Adding some safeguard measures to Solr with a SearchComponent”

  1. Hi Jorge, first of all I would like to thank you for this post; it has been very helpful. However, when I apply it, the start value is always 0 and the rows value is the sum of rows and start. If rows > rowsMax, then the start value is correctly set.

    Can you help me here? My only goal is to not allow that search request handler to give back more than 50 results.
    Thanks

    1. Hi Nuno, thanks for your kind words!! In Solr, if you don’t specify the start parameter, it defaults to 0, and the rows parameter lets you set the number of documents that you want to get back from Solr (it defaults to 10 if not specified in the configuration/request); these two parameters let you paginate your results. This component lets you enforce max values on both parameters independently. The rows value is not the sum of rows and start; it’s more like: “I want ROWS results starting at the START document”. For example, start=40&rows=20 asks for documents 41 through 60 of the result set.

      If you want to prevent returning more than 50 documents, what does that mean: a) only 50 documents per page, or b) 50 documents in total? Think of it like this: should Solr return 50 documents starting at document 10,000, or always return just the first 50 documents and not allow paginating any further?

      1. Hi Jorge Luis

        Could you please help me out with this error?

        I am using Nutch 2.3.1, hbase-0.98.19-hadoop2 and Solr 4.10.3, and have copied the Nutch schema.xml file into the Solr conf folder. This is my Nutch log error

        when I am running this command:

        bin/crawl urls/seed.txt TestCrawl6 http://localhost:8983/solr/#/solr 5

        IndexingJob: starting

        SolrIndexerJob: java.lang.RuntimeException: job failed: name[TestCrawl]Indexer , jobid=
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

        Error running:
        /usr/local/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://0.0.0.0:8984/solr/#/solr -all -crawlId TestCrawl6
        Failed with exit value 255.

      2. Usually when this happens it means that there was some error on the Solr side: Nutch tried to push the documents into Solr, but Solr had some complaints about the data sent. Check the Solr logs; since you’re passing the localhost:8983 URL to Nutch, I’m guessing that you’re running Solr on your own machine, so look for any exception thrown. Usually this means that the schemas between Solr and Nutch are out of sync, so perhaps you’re sending some field that Solr is not expecting, or you’re sending a multivalued field without it being declared as such in the Solr schema. Without the actual error from Solr there is no way to tell. With this kind of error the Nutch output/log is not very helpful, mainly because there is nothing wrong on the Nutch side, and only the failed job is reported.
