Tuesday, March 27, 2012

Advice Needed On Very Large Full Text Indexes

I'm currently building a product that will need to index and search roughly
500 million documents (perhaps more). Currently I have about 60 million
documents in the index (table size about 100GB and index size about 35GB).
Performance is (obviously) getting worse and I doubt my current architecture
will support the needed growth and still provide the required speed.
Documents are added 24x7 at the rate of hundreds per minute (roughly 250,000
new ones each day). Plus, due to the search requirements, I had to clear
the noise file. I've been thinking of doing table partitioning (using
SQL Server 2005), creating indexed views and then using multiple full-text indexes to
query the data (however, of course, I will not be able to rank effectively
then).
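
To put the figures above in perspective, here is a rough back-of-the-envelope projection in Python. It is only a sketch: it assumes the ingest rate stays constant at the stated 250,000 documents per day and that table and index size grow linearly with document count, neither of which is confirmed in the post. The function name and its parameters are made up for illustration.

```python
def project_growth(current_docs, target_docs, docs_per_day, table_gb, index_gb):
    """Project ingest rate, time to target, and storage under a linear-growth assumption."""
    docs_per_minute = docs_per_day / (24 * 60)
    # Days needed to reach the target if the daily rate never changes (assumption).
    days_to_target = (target_docs - current_docs) / docs_per_day
    # Assumption: storage scales in proportion to the number of documents.
    scale = target_docs / current_docs
    return {
        "docs_per_minute": docs_per_minute,
        "days_to_target": days_to_target,
        "projected_table_gb": table_gb * scale,
        "projected_index_gb": index_gb * scale,
    }


# Figures taken from the post: 60 million documents now, 500 million expected,
# 250,000 new documents per day, 100GB table, 35GB full-text index.
projection = project_growth(
    current_docs=60_000_000,
    target_docs=500_000_000,
    docs_per_day=250_000,
    table_gb=100,
    index_gb=35,
)

for name, value in projection.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions the index would be several times its current size at 500 million documents, which is why the partitioning question matters.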
So, I'm curious whether anyone else has this type of volume and has come up with
a solid solution. Appreciate any ideas.
I should also mention that I'm looking for short term consulting help on
this, as well as a full-time position - so while I'm grateful for free
suggestions from this board, should anyone be looking for work, please
contact me as well (company located in New Jersey).
Thanks!
Joel
Partitioning works, but you need well-defined partitions that both the
data and the queries align with.
For example, if you are querying by a search phrase and your
buckets/partitions are by date, partitioning won't help you. If you are
partitioning by date, you will get some advantages if your queries have time
dependencies. For example, in Google, if you query on your name
http://www.google.com/search?sourcei...=Joel+Macaluso
you'll notice 70,100 hits. Then as you move to other pages you'll notice
that the number goes down to 69,900. What has happened is that the first query
hits one server, which holds mainly new material. The subsequent queries hit the
bigger archive, which represents the totality. The time partition works for
them even though most of their queries have no real time dependency, e.g.
web pages from 2001.
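
The point about partitions aligning with queries can be sketched in Python as follows. This is an illustration only, not how SQL Server or Google route queries: the Partition class, the partitions_for_query function and the yearly boundaries are all hypothetical. A query that carries a date range touches only the overlapping partitions, while a phrase-only query has to touch every one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    name: str
    start: date  # inclusive lower bound
    end: date    # exclusive upper bound


def partitions_for_query(
    partitions: List[Partition],
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
) -> List[Partition]:
    """Return the partitions whose date range overlaps the query's date filter."""
    selected = []
    for p in partitions:
        if date_from is not None and p.end <= date_from:
            continue  # partition ends before the requested range starts
        if date_to is not None and p.start > date_to:
            continue  # partition starts after the requested range ends
        selected.append(p)
    return selected


# Hypothetical yearly partitions.
partitions = [
    Partition("docs_2010", date(2010, 1, 1), date(2011, 1, 1)),
    Partition("docs_2011", date(2011, 1, 1), date(2012, 1, 1)),
    Partition("docs_2012", date(2012, 1, 1), date(2013, 1, 1)),
]

# A query restricted to recent documents only needs the newest partition.
recent = partitions_for_query(partitions, date_from=date(2012, 1, 1))

# A phrase-only query carries no date filter, so every partition is searched.
everything = partitions_for_query(partitions)

print([p.name for p in recent])
print([p.name for p in everything])
```

If most queries look like the second one, date partitioning buys little on the query side, although it may still help with index maintenance.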
You might want to analyze your queries looking for collocations - words which
are searched for as a unit - Bacon and Eggs, Sonny and Cher, George Bush.
Likewise you might choose not to index hapax legomena, dis legomena, tris
legomena or even tetrakis legomena (words that occur only once, twice, three
times or four times in the corpus). You also might want to look at words
over a certain length and not index them.
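
The suggestions above could be prototyped along these lines in Python before touching the full-text catalog. This is a minimal sketch under simplifying assumptions: tokens are split on whitespace, collocations are approximated by adjacent word pairs in a query log, and the function names and thresholds (min_count, max_length) are made up for illustration rather than taken from any SQL Server feature.

```python
from collections import Counter


def frequent_bigrams(queries, min_count=2):
    """Count adjacent word pairs across logged queries and keep the common ones."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


def prune_vocabulary(documents, min_count=2, max_length=30):
    """Return the terms worth indexing: frequent enough and not overly long.

    min_count=2 drops hapax legomena (terms seen once in the corpus);
    min_count=5 would also drop terms seen two, three or four times.
    """
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    return {
        term
        for term, n in counts.items()
        if n >= min_count and len(term) <= max_length
    }


# Hypothetical query log and document sample.
queries = ["sonny and cher", "Sonny and Cher tickets", "george bush"]
documents = [
    "bacon and eggs",
    "eggs and toast",
    "eggs " + "x" * 40 + " " + "x" * 40,
]

print(frequent_bigrams(queries))
print(sorted(prune_vocabulary(documents)))
```

Running something like this over a sample of real queries and documents would show how much of the vocabulary such rules would remove before committing to them.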
One other point you haven't addressed: I know about your
indexing requirements, but what are your querying requirements like? I.e., how many
queries per day/min/sec?
I am facing some of the same problems where I work now. I'd be interested in
following up with you on this one. I'm in NJ as well.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
