Better Lookup
DaveDaveFind is now returning pretty good results, but there are a few problems I’ve put off that I still need to fix:
- Searches including stopwords return no results, because the stopwords aren’t in the index. The search code should ignore stopwords, too.
- Since multiword lookup still isn’t perfect, it would be nice if the line above each search that says “You searched for:” included separate links for each word in the search query.
- Everything is ordered by Daverank/views. Search results that match the exact query but have low Daverank show up below less relevant pages.
Here’s the code I added to deal with stopwords. First, I added them in a list at the very top of main.py. App Engine won’t allow file reading and writing, so I can’t get them from an external file. I considered storing them in the database, but that means writing a new model and performing an extra database query with each search. Hardcoding them might not be the best solution, but it works. This block (below the Python term stuff) breaks the query into a list of words and gets rid of stopwords:
query_words = query.split()
# Loop over a copy so that removing items doesn't skip the next word
for word in query_words[:]:
    if word in stopwords:
        query_words.remove(word)
These few extra lines were all it took to solve the stopword problem. Improving the results links was pretty easy, too. Before passing all the information to the template, I split the query string into a list called query_string_words and passed this in place of query_words. We can’t pass in query_words because the stopwords are gone and the string won’t make sense when it’s printed for the user. And we can’t create this list earlier by assigning query_words to it, because both names would refer to the same list and the changes made to query_words would also affect it (remember Secret Agent Man?).
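To make the aliasing issue concrete, here’s a minimal sketch. It isn’t the actual DaveDaveFind code: the STOPWORDS set and the prepare_query helper are made up for illustration. It builds the stopword-free list as a new list instead of editing the original in place, so the display words and the search words can never interfere with each other.

```python
# Hypothetical stopword set for illustration; the real list lives in main.py.
STOPWORDS = {"the", "a", "of", "and"}


def prepare_query(query, stopwords=STOPWORDS):
    """Return (display_words, search_words) for a raw query string."""
    # Every word, in order, for the "You searched for:" line.
    display_words = query.split()
    # A new list, so display_words is left untouched.
    search_words = [w for w in display_words if w not in stopwords]
    return display_words, search_words


# The aliasing pitfall: plain assignment copies the reference, not the list.
words = "the art of search".split()
alias = words        # same list object as words
copy = list(words)   # independent shallow copy
words.remove("the")  # changes words and alias, but not copy
```

After the last line, alias has lost "the" along with words, while copy still has all four words.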
Here’s how the results template handles the new information (remember, it’s now a list instead of a string):
<h2>You searched for: <strong class="orange">
  %for word in query_string_words:
    <a href="/search?search_query={{ word }}">{{ word }}</a>
  %end
</strong></h2>
Pretty simple: it prints each word as a link, with a URL that points to a DaveDaveFind search for that word. This means that stopwords will be links, but this isn’t a big problem—they just won’t return any results.
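One caveat with building links this way: a word containing characters such as & or + would change the meaning of the URL. Here’s a sketch of one way to handle that, assuming Python 3’s urllib.parse (in Python 2 the same function lives at urllib.quote_plus). The search_link helper is hypothetical and not part of the DaveDaveFind code.

```python
from urllib.parse import quote_plus


def search_link(word):
    """Build a search URL for one query word (hypothetical helper)."""
    # quote_plus percent-encodes reserved characters such as '&' and '+'.
    return "/search?search_query=" + quote_plus(word)


# One link per word, as in the "You searched for:" line.
links = [search_link(word) for word in "c++ & python".split()]
```

The links could then be built in main.py and passed to the template alongside the words, instead of assembling the URL in the template itself.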
Last (but definitely not least) is fixing the order of results. I solved this by adding an extra entry to the dictionary associated with each page: True if it contained an exact match, and False otherwise. Here’s the block that iterates through each result and stores its information:
for page in page_results:
    page_info = {}
    query_index = page.text.find(query)
    if query_index != -1:
        # Start the excerpt about 25 characters before the match; clamp at 0
        # so a negative start doesn't count from the end of the text
        i = page.text.find(' ', max(query_index - 25, 0))
        excerpt_words = page.text[i:].split(' ')
        page_info['exact_match'] = True
    else:
        excerpt_words = page.text.split(' ')
        page_info['exact_match'] = False
    excerpt = ' '.join(excerpt_words[:50])
    page_info['text'] = excerpt
    page_info['title'] = page.title
    page_info['url'] = page.url
    page_info['daverank'] = page.dave_rank
    page_info['doc'] = page.doc
    page_dicts.append(page_info)
# Exact matches move to the front; Daverank order is otherwise kept
page_dicts.sort(key=itemgetter('exact_match'), reverse=True)
The last line is the most important. Before the pages are passed to the template, it sorts the list of dictionaries by the value of the key 'exact_match'. The itemgetter function from the operator module is responsible for this wizardry (make sure you add from operator import itemgetter!). After trying this out, I added reverse=True, which reverses the order, so that pages with an exact match are put at the beginning of the list (by default, False sorts before True). One cool thing about this approach is that Python’s sort is stable, so it doesn’t change the relative order of elements that have the same key. Pages without an exact match stay at the end of the list, still ordered by Daverank, and those with exact matches get pulled to the beginning, also in Daverank order. The template doesn’t use the 'exact_match' key, and the structure of the list hasn’t changed, so this is all we need to fix the result rankings.
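Here’s a small self-contained sketch of that behavior. The page titles and Daverank values are invented for illustration. The list starts out ordered by Daverank, and the sort only moves exact matches forward.

```python
from operator import itemgetter

# Hypothetical results, already ordered by Daverank (highest first).
pages = [
    {"title": "A", "daverank": 9, "exact_match": False},
    {"title": "B", "daverank": 7, "exact_match": True},
    {"title": "C", "daverank": 5, "exact_match": False},
    {"title": "D", "daverank": 2, "exact_match": True},
]

# The sort is stable, even with reverse=True: pages with equal keys
# keep their original relative order.
pages.sort(key=itemgetter("exact_match"), reverse=True)

titles = [page["title"] for page in pages]
# titles is now ['B', 'D', 'A', 'C']
```

B and D jump ahead of A and C, but B still comes before D and A before C, just as they did in the Daverank ordering.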
It feels more like a whimper than a bang, but I think I’m done with the code for DaveDaveFind! The rest of the changes I need to make involve cleaning up the HTML and styles and performing one last big crawl to provide a detailed search index. I might make minor changes to the code, but all the essentials are done. If I change anything major, I’ll make sure to note it in a post here. To see the latest version, click here.
