[InterMine Dev] solr load fails, but post processing step succeeds.

daniela at intermine.org daniela at intermine.org
Thu Apr 1 12:04:07 BST 2021


Hi Joe

> Thanks for this. I’ve made some progress in figuring out what is
> wrong, but I’m not completely out of the woods yet.
Good! And thanks for sharing it.

> It still bothers me that the solr returns an error message when trying
> to insert documents and encounters a problem, but the gradle build
> task reports that it is successful. I think we had a discussion on
> this some time ago, but I cannot find any record of it. email?
> discord? I can’t remember.
Yes, we discussed it on GitHub. See here:
https://github.com/intermine/intermine/issues/2120
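Until the build itself reports the failure, one possible workaround is to scan the captured build output for the failure message. The sketch below is only an illustration: the pattern is taken from the "POSTPROCESS create-search-index FAILED." line quoted further down in this thread, and the function name failed_postprocesses is hypothetical, not part of InterMine.

```python
import re

# Pattern modelled on the message quoted in this thread:
#   "POSTPROCESS create-search-index FAILED."
# It is an assumption that other post-process steps use the same wording.
FAILED_RE = re.compile(r"^POSTPROCESS\s+(\S+)\s+FAILED\b")


def failed_postprocesses(lines):
    """Return the names of post-process steps reported as FAILED."""
    failed = []
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            failed.append(match.group(1))
    return failed


# Sample lines copied from the build output quoted in this thread.
sample_output = [
    "> Task :dbmodel:postProcess",
    "POSTPROCESS create-search-index FAILED.",
    "BUILD SUCCESSFUL in 4h 51m 41s",
]
print(failed_postprocesses(sample_output))
```

A wrapper script could run this over the saved Gradle output and exit with a non-zero status when the returned list is not empty, so that a failed index load is not mistaken for a successful build.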

> My configuration for solr when I set up the core was a little off. I
> don’t know if this was specifically the problem or not. But since i
> fixed it I have not seen an interrupted build. We have our solr server
> behind an nginx proxy and so we are susceptible to having proxy
> problems and getting a 500 return if the proxy isn’t up to handling
> the request. This would interrupt a build if it happened during an
> insertion. This also could have been the problem.
Good that it ended with success.

> I was able to create an index by excluding all classes except
> organism. This was quick and easy and I think demonstrates that our
> infrastructure is OK. Now I’m trying with genes, mrnas, proteins and
> a few other classes.
> Here is what I see in intermine.log:
> 
>> 2021-03-31 09:41:47 INFO
>> org.intermine.api.searchengine.solr.SolrObjectHandler  [] - QUERY:
>> SELECT DISTINCT a1_ FROM org.intermine.model.InterMineObject AS a1_
>> WHERE a1_.class NOT IN ? 1: [interface
>> org.intermine.model.bio.OntologyAnnotationEvidenceCode, interface
>> org.intermine.model.bio.Chromosome, interface
>> org.intermine.model.bio.Comment, interface
>> org.intermine.model.bio.GOEvidenceCode, interface
>> org.intermine.model.bio.TFBindingSite, interface
>> org.intermine.model.bio.MeshTerm, interface
>> org.intermine.model.bio.TransposableElement, interface
>> org.intermine.model.bio.SyntenyBlock, interface
>> org.intermine.model.bio.MiRNA, interface
>> org.intermine.model.bio.GOAnnotation, interface
>> org.intermine.model.bio.SOTerm, interface
>> org.intermine.model.bio.Primer, interface
>> org.intermine.model.bio.ThreePrimeUTR, interface
>> org.intermine.model.bio.MobileGeneticElement, interface
>> org.intermine.model.bio.OntologyTermSynonym, interface
>> org.intermine.model.bio.ChromosomalDuplication, interface
>> org.intermine.model.bio.ForwardPrimer, interface
>> org.intermine.model.bio.Sequence, interface
>> org.intermine.model.bio.PointMutation, interface
>> org.intermine.model.bio.AlternateDescription, interface
>> org.intermine.model.bio.ProteinDomain, interface
>> org.intermine.model.bio.Oligo, interface
>> org.intermine.model.bio.ChromosomalInversion, interface
>> org.intermine.model.bio.ReversePrimer, interface
>> org.intermine.model.bio.Author, interface
>> org.intermine.model.bio.OntologyRelation, interface
>> org.intermine.model.bio.ProteinFamilyMember, interface
>> org.intermine.model.bio.Ontology, interface
>> org.intermine.model.bio.ChromosomeStructureVariation, interface
>> org.intermine.model.bio.RRNA, class
>> org.intermine.model.bio.RNASeqEnrichment, interface
>> org.intermine.model.bio.CRM, interface
>> org.intermine.model.bio.SnoRNA, interface
>> org.intermine.model.bio.OntologyAnnotation, interface
>> org.intermine.model.bio.FivePrimeUTR, interface
>> org.intermine.model.bio.SequenceVariant, interface
>> org.intermine.model.bio.ProteinFeature, interface
>> org.intermine.model.bio.ChromosomalTransposition, interface
>> org.intermine.model.bio.PathwayComponent, interface
>> org.intermine.model.bio.Enhancer, interface
>> org.intermine.model.bio.CrossReference, interface
>> org.intermine.model.bio.DataSource, interface
>> org.intermine.model.bio.SyntenicRegion, interface
>> org.intermine.model.bio.CDS, interface org.intermine.model.bio.EST,
>> interface org.intermine.model.bio.Exon, interface
>> org.intermine.model.bio.MSA, interface
>> org.intermine.model.bio.DataSet, interface
>> org.intermine.model.bio.PathwayInfo, interface
>> org.intermine.model.bio.Publication, interface
>> org.intermine.model.bio.Strain, class
>> org.intermine.model.bio.CoexpressionJSON, interface
>> org.intermine.model.bio.ChromosomalTranslocation, interface
>> org.intermine.model.bio.Allele, interface
>> org.intermine.model.bio.GeneFlankingRegion, interface
>> org.intermine.model.bio.NaturalTransposableElement, interface
>> org.intermine.model.bio.BindingSite, interface
>> org.intermine.model.bio.MicroarrayOligo, interface
>> org.intermine.model.bio.NcRNA, interface
>> org.intermine.model.bio.SnRNA, interface
>> org.intermine.model.bio.ChromosomeBand, interface
>> org.intermine.model.bio.TRNA, interface
>> org.intermine.model.bio.SequenceCollection, interface
>> org.intermine.model.bio.ProteinFamily, class
>> org.intermine.model.bio.RNASeqExpression, interface
>> org.intermine.model.bio.ChromosomalDeletion, interface
>> org.intermine.model.bio.TransposableElementInsertionSite, interface
>> org.intermine.model.bio.RegulatoryRegion, interface
>> org.intermine.model.bio.IntergenicRegion, interface
>> org.intermine.model.bio.OntologyEvidence, interface
>> org.intermine.model.bio.Location, interface
>> org.intermine.model.bio.GOEvidence, interface
>> org.intermine.model.bio.GoldenPathFragment, class
>> org.intermine.model.bio.RNASeqExperiment, interface
>> org.intermine.model.bio.PCRProduct, interface
>> org.intermine.model.bio.Intron, interface
>> org.intermine.model.bio.Synonym, interface
>> org.intermine.model.bio.OverlappingESTSet, interface
>> org.intermine.model.bio.CDNAClone, interface
>> org.intermine.model.bio.UTR]
> 
>> …..  <deleted rows> …..
> 
>> 2021-03-31 10:51:46 INFO
>> org.intermine.api.searchengine.solr.SolrObjectHandler  [] - Query
>> returned 36770935 results
> 
> Yeah. It took more than an hour to return 37M results. But that’s
> not the bad part. We have 9M genes in the mine so 37M indexable items
> is about right. And having it take an hour is not so bad in the big
> picture. The bad thing is the rate of document creation. Here is a
> plot I made of the number of documents indexed over time:
> 
> It was screaming along just fine at first: a few thousand per second.
> At that rate it would have only taken less than 2 hours to complete.
> Then it hit a wall and slowed down to ~1k/minute. The slowdown
> happened as soon as it hit the collection part of the indexing.
> 
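To put the two rates side by side, here is a rough sketch of the arithmetic. It assumes 5000 documents per second as a stand-in for "a few thousand per second" and 1000 documents per minute for the slow phase; both figures are illustrative readings of the description above, not measurements. The helper names parse_progress and hours_to_index are hypothetical. The progress-line pattern follows the SolrIndexHandler log line quoted later in this thread.

```python
import re

# Matches the progress lines written by SolrIndexHandler, as quoted in this
# thread: "docs indexed=22160000; ... time=17492081ms".
PROGRESS_RE = re.compile(r"docs indexed=(\d+);.*?time=(\d+)ms")


def parse_progress(line):
    """Return (docs_indexed, elapsed_ms) from a progress line, or None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def hours_to_index(total_docs, docs_per_second):
    """Hours needed to index total_docs at a constant rate."""
    return total_docs / docs_per_second / 3600.0


TOTAL_DOCS = 36770935        # "Query returned 36770935 results"
FAST_RATE = 5000.0           # assumed reading of "a few thousand per second"
SLOW_RATE = 1000.0 / 60.0    # "~1k/minute" expressed as docs per second

print(f"fast phase: {hours_to_index(TOTAL_DOCS, FAST_RATE):.1f} hours")
print(f"slow phase: {hours_to_index(TOTAL_DOCS, SLOW_RATE) / 24:.1f} days")
```

Under these assumptions the whole index would take about two hours at the fast rate and roughly 25 days at the slow rate. Feeding consecutive progress lines from intermine.log through parse_progress would show where the rate changes.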
> So what I’m wondering is can I simplify the index creation. What
> exactly is getting indexed here? Suppose I only wanted to index the
> names of bioentities and not worry about anything else. In my
> imagination what’s happening is that the index is hoping to return
> gene results if someone types in a GO term (I’m assuming these would
> be picked up in the collection queries.) We have another search
> mechanism on our website that we’ll use for that type of query.
> I’d just like to give someone the ability to locate a gene by name.
> 
> Or, do you think it’s a problem with our db? It’s big (3.2 TB,
> 1.9G items from 249 organisms). It looks to me like all the indexes
> are in place.
> 
> By the way, I was trying some searching on flymine to see if my
> supposition about the role of collections in the indexing was correct.
> I’m getting an internal error page.
> 
> Thanks,

Joe, I haven't looked at the code in detail, but I ran the following
test on FlyMine, which confirms your assumption.
There are 375 genes annotated with the GO term GO:0000978. This is the
query I ran:
<query name="" model="genomic" view="Gene.secondaryIdentifier 
Gene.symbol Gene.goAnnotation.ontologyTerm.name 
Gene.goAnnotation.ontologyTerm.identifier" longDescription="Find all 
genes that are associated with a particular  GO term in a specific 
organism. This template will return genes that have been assigned the 
given GO term as well as genes that have a more specific GO term." 
sortOrder="Gene.secondaryIdentifier asc" constraintLogic="B and A">
   <pathDescription pathString="Gene.goAnnotation.ontologyTerm" 
description="GO annotation > child term"/>
   <constraint path="Gene.goAnnotation.ontologyTerm" type="GOTerm"/>
   <constraint path="Gene.organism.name" code="B" op="=" 
value="Drosophila melanogaster"/>
   <constraint path="Gene.goAnnotation.ontologyTerm.identifier" code="A" 
op="=" value="GO:0000978"/>
</query>

And when I quick-search for the term GO:0000978, I get back the
term itself and the 375 genes.

Daniela

> 
> Joe
> 
>> On Mar 29, 2021, at 8:24 AM, daniela at intermine.org wrote:
>> 
>> Hi Joe,
>> sorry for the issue.
>> I have just run create-search-index in biotestmine, where there are
>> 195915 InterMineObjects but only 121786 are indexed (because we
>> exclude Comment, CrossReference, Location, OntologyAnnotation,
>> OntologyRelation, Sequence and Synonym, which are 74129 in total).
>> 
>> Please have a look in the intermine.log, where you should have
>> something like:
>> 
>> 2021-03-29 16:00:10 INFO
>> org.intermine.api.searchengine.solr.SolrObjectHandler  [] - QUERY:
>> SELECT DISTINCT a1_ FROM org.intermine.model.InterMineObject AS a1_
>> WHERE a1_.class NOT IN ? 1: [interface
>> org.intermine.model.bio.OntologyRelation, interface
>> org.intermine.model.bio.OntologyAnnotation, interface
>> org.intermine.model.bio.Sequence, interface
>> org.intermine.model.bio.Comment, interface
>> org.intermine.model.bio.Location, interface
>> org.intermine.model.bio.GOAnnotation, interface
>> org.intermine.model.bio.Synonym, interface
>> org.intermine.model.bio.CrossReference]
>> 2021-03-29 16:00:11 INFO
>> org.intermine.sql.precompute.PrecomputedTableManager  [] - Loaded 15
>> precomputed table descriptions (plus 0 failed) in 316 ms
>> 2021-03-29 16:00:12 INFO
>> org.intermine.api.searchengine.solr.SolrObjectHandler  [] - Query
>> returned 121786 results
>> 
>> Here is the query where we exclude the ignored classes:
>>
>> https://github.com/intermine/intermine/blob/dev/intermine/api/src/main/java/org/intermine/api/searchengine/solr/SolrObjectHandler.java#L190
>> 
>> What does your QUERY look like in the intermine.log file?
>> 
>> Daniela
>> 
>> On Mar 29, 2021, at 1:42 AM, contrino at intermine.org wrote:
>> hi joe,
>> the output is somewhat misleading, but it means that the creation of
>> solr indexes failed at a certain point and that you should find the
>> problem and run again just the offending process.
>> regarding the reason for the failure, could it be that there was an
>> issue writing to the intended machine?
>> please let me know if i misunderstood your question. are you using
>> solr7 or  8?
>> Hi Sergio,
>> I’m using solr 7.2.1. I’ve been successful in the past without any
>> troubles. It’s just this most recent build that has been having
>> troubles.
>> The index is partially populated, so I don’t think it’s an issue of
>> not having created the core or permissions for inserting records. As a
>> test I tried to build an index after naming just about every class in
>> the index.ignore list of classes. But I still have a HUGE number of
>> documents in the search index (4,260,000). This is compared to 149,000
>> in my previous build. (Granted, there are more organisms in this
>> build, but not THAT many more.) At first I thought it may have been a
>> disk space problem but I see no signs of that.
>> And also it’s a mystery why index.ignore seems to have had no
>> effect.
>> I’m still poking at it. Going to run it through the debugger to see if
>> I can get more details on the exception.
>> Joe
>> thanks!
>> sergio
>> On 2021-03-28 03:05, Joe Carlson wrote:
>> I’ve seen this a few times in this recent build. My solr index is
>> missing many searchable items, but it does have some. What is strange
>> is that even though the solr load failed, the post processing
>> operation was “successful”.
>> The output from the post processing step is:
>>> Task :dbmodel:postProcess
>> POSTPROCESS create-search-index FAILED.
>> org.apache.solr.client.solrj.SolrServerException: IOException
>> occured when talking to server at:
>> https://njp-spin.jgi.doe.gov/solr/phytomine-search-13
>> Please correct the error and run again ONLY THE POSTPROCESS
>> ./gradlew postprocess -Pprocess=create-search-index
>> NO NEED TO RE-RUN THE ENTIRE BUILD
>> BUILD SUCCESSFUL in 4h 51m 41s
>> The contents of intermine.dev when this happens is:
>> 2021-03-27 13:40:14 INFO
>> org.intermine.api.searchengine.solr.SolrIndexHandler  [] - docs
>> indexed=22160000; thread state=RUNNABLE; docs/ms=75.02607;
>> memory=14271451k/59652608k; time=17492081ms
>> 2021-03-27 13:40:14 ERROR
>> org.intermine.api.searchengine.solr.SolrIndexHandler  [] - Error
>> while commiting the SolrInputdocuments to the Solrclient. Make sure
>> the Solr instance is up
>> org.apache.solr.client.solrj.SolrServerException: IOException
>> occured when talking to server at:
>> https://njp-spin.jgi.doe.gov/solr/phytomine-search-13
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657)
>> ~[solr-solrj-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:54:23]
>> So the solr post operation is not able to handle the request. I see
>> nothing in the solr logs that shows an error. Has anyone else seen
>> this behavior? I seem to recall that I’ve seen the behavior of “build
>> successful but the operation failed” before. I cannot recall if I
>> asked the group about that, or what the resolution was.
>> Has anyone seen this behavior and have pointers on what to look for?
>> Thanks,
>> joe
>> _______________________________________________
>> dev mailing list
>> dev at lists.intermine.org
>> https://lists.intermine.org/mailman/listinfo/dev