Search doesn't find content with exactly matching title #150
Reference: LBRYCommunity/lighthouse.js#150
@eggplantbren commented on Sun Mar 10 2019
Feel free to close this if I've done something silly or misunderstood something.
The Issue
Sometimes, when I search for something I know the exact title of, the search results cannot find that item (or it appears really low in the ranking).
Steps to Reproduce
The expected claim (lbry://jp-DtiRzQMgBDM#461f1b1b421ac2f2198e8a918a90a775978b9931) is not returned.
Suggested Solutions
I don't know anything about search, to be honest, but perhaps matches in the title should be prioritized in the results?
System Configuration
Screenshots
That first result looks dodgy at first but I think it's a bald guy's head, hahaha
Thanks for opening the issue @eggplantbren! Search results are weighted by a variety of factors, such as where a hit is found (title vs. description), how many times it occurs, the LBC on the claim, and possibly a few others. We can look into giving more weight to exact matches.
Unfortunately, we have adjusted the search algorithm to remove common words, to prevent different aspects of the weighting from forcing unrelated claims to the top of search.
So "Psychology of Redemption in Christianity" becomes "Psychology Redemption Christianity". "of" and "in" are considered terms we exclude from search queries. Since there would no longer be a "perfect match", it does not find that exact title, even though a perfect match is something it looks for.
I would suggest we remove the "washing" part of the match phrase sub-query and set it to 0 slop. This way it only checks for an exact match. Currently it allows some slop, but that is kind of useless if we intend it to catch exact phrases. We could also add another sub-query just for an unwashed search.
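As a rough illustration of the two ideas in this comment (washing a query, and an exact-phrase sub-query with 0 slop built from the unwashed input), here is a minimal sketch. The stop-word list, the function names `washQuery` and `exactTitlePhrase`, and the `title` field name are assumptions for illustration, not taken from the lighthouse code. The `match_phrase` query with a `slop` parameter is standard Elasticsearch query DSL.

```javascript
// Hypothetical stop-word list; the real excluded terms are not shown in this thread.
const STOP_WORDS = new Set(['of', 'in', 'the', 'a', 'and']);

// "Wash" a query by dropping common words (case-insensitive).
function washQuery(query) {
  return query
    .split(/\s+/)
    .filter((word) => word && !STOP_WORDS.has(word.toLowerCase()))
    .join(' ');
}

// Build an exact-phrase sub-query from the raw (unwashed) input.
// slop: 0 asks for the analyzed terms to appear adjacent and in order.
// The "title" field name is an assumption.
function exactTitlePhrase(rawQuery) {
  return {
    match_phrase: {
      title: { query: rawQuery, slop: 0 },
    },
  };
}

const raw = 'Psychology of Redemption in Christianity';
console.log(washQuery(raw)); // Psychology Redemption Christianity
console.log(JSON.stringify(exactTitlePhrase(raw)));
```

With washing applied before the phrase match, the phrase being searched no longer equals the stored title, which is why the exact title can fail to match.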
I don't understand the technical parts, but even without "of" and "in", why wouldn't that be a good match?
Good question. There are lots of claims with "Psychology" and "Christianity" in the name, title and description. Different weights are assigned to different sub-queries: name is most valuable, then title, then description. However, a claim with "Psychology" in the description 4 times would score higher than a claim with it just once. Elasticsearch, in its most basic form (more complex queries are used too), works on hits. Hits are matches found for terms in a query, and they can have weights assigned to signify importance. The "formula", so to speak, gets pretty complex, and ours is still pretty immature. I found some mature ones where the search query was well over 2K lines.
However, if you are really interested in how weights are assigned, below is the responsible function. Suggestions are always welcome! (I see you are a statistician 🥇)
4056fb86cf/server/controllers/lighthouse.js (L25-L251)
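The linked function is not reproduced here. Purely as an illustration of per-field weighting (name over title over description), a sketch of a `bool` query with boosted `should` clauses might look like the following. The boost values, the function name `weightedQuery`, and the field names are all assumptions and do not reflect the real weights.

```javascript
// Hypothetical boosts: name > title > description, as described in the comment above.
const FIELD_BOOSTS = { name: 10, title: 5, description: 1 };

// Build one boosted match clause per field and combine them with "should".
function weightedQuery(text) {
  return {
    bool: {
      // Each matching "should" clause contributes to the document's score.
      should: Object.entries(FIELD_BOOSTS).map(([field, boost]) => ({
        match: { [field]: { query: text, boost } },
      })),
      // Require at least one of the clauses to match.
      minimum_should_match: 1,
    },
  };
}

console.log(JSON.stringify(weightedQuery('psychology christianity'), null, 2));
```

Because term frequency also feeds into the score, a low-boost field with many occurrences can still outrank a high-boost field with one, which matches the behaviour described above.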
It looks quite complex to weight the different factors in a search. However, for reference, Google, Duckduckgo, Bing, and Youtube search all gave that video as the first result, and I think searches that include a few uncommon keywords that are all in a single title should generally return that title as a top result.
I can find other examples if you like.
Thanks for the feedback. It's the main reason we also added the thumbsdown button:
(cc: @tiger5226 )
Bahahahahaha...I love this!
@eggplantbren I agree. Search is very complex unfortunately, especially when you get into weights, where there is an optimization that needs to happen. Google had the novel idea of using backlinks to show the importance of content. We have made great strides, but there is still much more that can be done, as you correctly point out. Another idea I have been looking into is to use something akin to backlinks, like views. Views sort of tell us the importance or relevance of content. Big search engines have their own proprietary software and algorithms, and they are really good at it.
Elasticsearch is really great and helps a lot to make this task possible. Hooray for open source. I probably would not venture to say we could be as good as them quickly, though. Elasticsearch basically uses something akin to what Google succeeded with, which is assigning a weight to a "hit", which is a search term that was found in an elastic document.
Examples are always welcome too, because they give us the equivalent of a unit test to base our expectations on for the query after we make changes. They are also easy to pinpoint. Unfortunately, what happens is that we adjust a weight to give us what we want in a particular result, but then many other undesired consequences arise.
That's unsurprising.
Sorry if this is a silly issue.
Certainly isn't. Not all scenarios end up that way. Actually, as I noted above, this particular case should be resolvable without side effects which is great. Identifying an issue is the 1st step to making the software better! Thanks for reporting it!
Here's another pretty bizarre search result. "levitation baby mackenzie" fails to return
lbry://six#441c77b2dcd6cc344904d3746f04060a02414a5c
Can confirm the above, but it does return under just "mackenzie". Also returns for "Mackenzie levitation" but not "Mackenzie levitation baby".
Here's another example:
If I search Final Fantasy VIII Remastered: Nintendo Switch trailer E3 2019 I should get a result leading to this claim:
lbry://y2matecom-finalfantasyviiiremasterednintendoswitchtrailernintendoe32019ywNYKWQEbZI1080p#ca73086e3d897c8a77935fd63f7ef5f48b0d34f8
But that claim is nowhere to be found in the search results.
I just realised this (MH's example) could be because it's a recent publish and chainquery is playing up and I think search might use chainquery.
--
Dr Brendon J. Brewer
Department of Statistics, The University of Auckland, New Zealand
Ph: +64 27 500 1336
Web: https://www.brendonbrewer.com/
Similar to the example above. If I search morgonaut, the channel @morgonaut should come up as a result (lbry://morgonaut/#118d5abf71473407d12eed67802daa3193d4b330). Instead the channel @Hackintosh comes up (lbry://@Hackintosh#f07599446da48a01e6836c307cbfdbe5a547827c).
Both of these channels are made by the same person and have the same content on them. Not sure why only one of them comes up.
Mark is already looking into that one, thanks.
All of these cases listed are resolved except for the spent one. This is now fixed with the latest push to master. We have removed query washing for phrase matching and partial string comparison.
Fantastic, thanks!
"psychology of redemption in christianity" still doesn't return that JBP vid for me, nor does "psychology redemption christianity".
"levitation baby mackenzie" works (finds my new claim for the same content)
It does show up now https://lighthouse.lbry.com/search?s=Psychology%20of%20Redemption%20in%20Christianity&size=225
it's just not in the top 10 results. The claim name holds more prominence than the title. I think this is just a case of SEO. They should claim a better name than "jp-DtiRzQMgBDM". Thoughts?
Interesting, thanks. I am slightly surprised about the name counting more than the title. On the one hand I understand how important names are in the LBRY system, but I think people tend to put more thought into the title than the claim name. Might be worth experimenting with different weightings?
I increased the weighting of names because there were 2 other issues dealing with Channels not being returned in search results. Weights venture into a very sensitive area for search. There are many gives and takes. Like if we think the title is more important, then common words in titles will push channels down in the search results since they are based on the name.
I am certainly open to tweaking the weights though. Ideally, we have a test to pass. Even then it can be nearly impossible to meet all the requirements. Maybe I make the phrase match hold a lot more weight, so the title phrase match query gets the highest weight.
@tiger5226 can channels and streams be given different treatments?
Either way, I think it would be a good idea to write a utility that searches the names and titles of the top n streams and channels and reports on what percentage appear in the top m results. This could then be used to tune parameters to find the values that hit the highest percent.
So I did some research on this. We do leverage filters for the api. A "bool" query is what can have the "filter" sub-query. So right now the additional queries that were added can be put inside a bool query with a "should" along with a "filter". Then we can use the filter to make sure these additional queries, which were added for channel names anyway, only search channels.
Regarding the utility... you are right, we should just bang this out really quick. We should keep the KPI simple too. What defines "top n streams"? n is the parameter, but what is top? We can use internal-apis to get the channels with the highest subscribers on youtube. We can also use views to get the top streams in the last 7 days so the KPI is dynamic, and then output the results to slack so we can see it every day.
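To make the bool-query idea above more concrete, here is a hedged sketch of name sub-queries wrapped in a `bool` with `should` plus a `filter` that limits them to channels. The field `claim_type`, its value `'channel'`, the boosts, and the function name `channelNameQuery` are assumptions for illustration. The `bool`, `should`, `filter`, `term`, `match`, and `match_phrase` constructs are standard Elasticsearch query DSL.

```javascript
// Build name-based sub-queries that only apply to channel documents.
function channelNameQuery(text) {
  return {
    bool: {
      should: [
        // Loose match on the claim name (hypothetical boost).
        { match: { name: { query: text, boost: 10 } } },
        // Exact phrase match on the claim name (hypothetical boost).
        { match_phrase: { name: { query: text, slop: 0, boost: 20 } } },
      ],
      // Without this, a bool with a filter could match on the filter alone.
      minimum_should_match: 1,
      // Filter clauses restrict which documents match but do not affect scoring.
      // "claim_type" and "channel" are assumed names.
      filter: [{ term: { claim_type: 'channel' } }],
    },
  };
}

console.log(JSON.stringify(channelNameQuery('morgonaut'), null, 2));
```

A query like this could sit as one `should` clause next to the stream-oriented sub-queries, so raising the weight of names for channels would not change how streams are ranked.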
maybe we scale the KPI score:
1 - it does not appear in the results ( max is 10K )
2 - appears in the top 500 results
3 - appears in the top 100 results
4 - appears in the top 50 results
5 - appears in the top 10 results
Then have a score for name and title
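A minimal sketch of the 1 to 5 scale above, with one score for a search by name and one for a search by title, could look like this. The function names and the shape of the result lists (ordered arrays of claim ids) are assumptions. The scale does not say how to score a claim that appears past the top 500, so this sketch treats that the same as not appearing.

```javascript
// rank is the 1-based position of the expected claim, or null if it was not returned.
function scaledScore(rank) {
  if (rank === null) return 1;
  if (rank <= 10) return 5;
  if (rank <= 50) return 4;
  if (rank <= 100) return 3;
  if (rank <= 500) return 2;
  return 1; // assumption: ranks past 500 are not covered by the scale
}

// Find the 1-based rank of a claim id in an ordered list of result ids.
function rankOf(claimId, results) {
  const index = results.indexOf(claimId);
  return index === -1 ? null : index + 1;
}

// One score for the search by name and one for the search by title.
function claimKpi(claimId, nameResults, titleResults) {
  return {
    name: scaledScore(rankOf(claimId, nameResults)),
    title: scaledScore(rankOf(claimId, titleResults)),
  };
}

console.log(claimKpi('abc', ['abc', 'def'], ['x', 'y'])); // { name: 5, title: 1 }
```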
@tiger5226 nice!
Yes, I was proposing to take the top YouTube creators by subscribers or views (maybe both) from our database.
In terms of where to output the KPI, I think it'd be better to write it to something readable by metabase.
A more dynamic scoring is a cool idea, but I would weight it more strongly towards needing to be in the top, e.g.
10 - 1st result
8 - Top 3 results
5 - First page
2 - Second page
1 - First 3-5 pages
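The top-weighted scale above, together with the earlier idea of reporting what percentage of a sample appears in the top m results, can be sketched as follows. The page size of 10, the reading of "first 3-5 pages" as anything up to page 5, a score of 0 beyond that, and the function names are all assumptions.

```javascript
const PAGE_SIZE = 10; // assumption: ten results per page

// rank is the 1-based position of the expected claim, or null if it was not returned.
function topWeightedScore(rank) {
  if (rank === null) return 0;
  if (rank === 1) return 10;
  if (rank <= 3) return 8;
  if (rank <= PAGE_SIZE) return 5; // first page
  if (rank <= 2 * PAGE_SIZE) return 2; // second page
  if (rank <= 5 * PAGE_SIZE) return 1; // assumed reading of "first 3-5 pages"
  return 0;
}

// Summarize a non-empty sample of ranks: mean score and share found in the top m.
function summarize(ranks, m) {
  const total = ranks.reduce((sum, rank) => sum + topWeightedScore(rank), 0);
  const inTopM = ranks.filter((rank) => rank !== null && rank <= m).length;
  return {
    meanScore: total / ranks.length,
    percentInTopM: (100 * inTopM) / ranks.length,
  };
}

console.log(summarize([1, 3, 7, 15, null], 10)); // { meanScore: 5, percentInTopM: 60 }
```

Tracking both numbers over time would show whether a weight change helps the sample overall or only fixes one query at the expense of others.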
I forget the stats, but a very large percentage (> 90) do not page search results.
Are you proposing using youtube subscribers as part of LBRY's search metric? That sounds terrible but hopefully I just misunderstood.
No not as part of the search metric but for the sampling. When taking a KPI ( key performance indicator ), we need a sample. Ideally we have a meaningful sample rather than a random sample. Most viewers are looking for specific creators on the platform and use search to find them. So grabbing the sample from the creators with high views or subscribers gives us an idea of how likely someone is to search for the creator. When sampling I want to make sure what people are likely to search for are actually being returned in the results.
That makes sense, thanks for clarifying.
@kauffj expecting things to be in the top 10 will probably be volatile and less meaningful because the results are 10K and it makes the KPI very binary. However, I see it's where we want to be so we should probably try it first and if it does turn out to be volatile, we can easily tweak it.