Wagtail default search does not work with non-English fields

3

I am using the default database backend for search in my project:

from __future__ import absolute_import, unicode_literals

from django.core.paginator import EmptyPage, PageNotAnInteger, Paginator
from django.shortcuts import render

from home.models import BlogPage, get_all_tags
from wagtail.wagtailsearch.models import Query


def search(request):
    search_query = request.GET.get('query', None)
    page = request.GET.get('page', 1)

    # Search
    if search_query:
        search_results = BlogPage.objects.live().search(search_query)
        query = Query.get(search_query)

        # Record hit
        query.add_hit()
    else:
        search_results = BlogPage.objects.none()

    # Pagination
    paginator = Paginator(search_results, 10)
    try:
        search_results = paginator.page(page)
    except PageNotAnInteger:
        search_results = paginator.page(1)
    except EmptyPage:
        search_results = paginator.page(paginator.num_pages)

    return render(request, 'search/search.html', {
        'search_query': search_query,
        'blogpages': search_results,
        'tags': get_all_tags()
    })

The BlogPage model:

class BlogPage(Page):
    date = models.DateField("Post date")
    intro = models.CharField(max_length=250)
    body = StreamField([
        ('heading', blocks.CharBlock(classname="full title")),
        ('paragraph', blocks.RichTextBlock()),
        ('image', ImageChooserBlock()),
        ('code', CodeBlock()),
    ])
    tags = ClusterTaggableManager(through=BlogPageTag, blank=True)

    search_fields = Page.search_fields + [
        index.SearchField('intro'),
        index.SearchField('body'),
    ]
    ...

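Search only works when the body field of the BlogPage model is in English. If I put some Russian words in the body field, nothing can be found. I looked in the database and saw that the BlogPage body field is stored like this: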

[{"value": "\u0442\u0435\u0441\u0442\u043e\u0432\u044b\u0439", "id": "3343151a-edbc-4165-89f2-ce766922d68e", "type": "heading"}, {"value": "<p>\u0442\u0435\u0441\u0442\u0438\u043f\u0440</p>", "id": "22d3818d-8c69-4d72-967e-7c1f807e80b2", "type": "paragraph"}]

So the problem is that Wagtail saves the StreamField content as Unicode escape sequences. If I manually change it in phpMyAdmin to the following:
[{"value": "Тест", "id": "3343151a-edbc-4165-89f2-ce766922d68e", "type": "heading"}, {"value": "<p>Тестовый</p>", "id": "22d3818d-8c69-4d72-967e-7c1f807e80b2", "type": "paragraph"}]

search starts working. Does anyone know how to prevent Wagtail from saving StreamField content as Unicode escapes?


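You did not mention which search backend you use. Are you using Elasticsearch? I have successfully set up German-language search with Elasticsearch. It also looks like you have not added any extra fields to the index. Or did you just leave the search_fields declaration out of "BlogPage"? - Moritz
I have specified search_fields (added those lines to the question). I guess I am using the default database backend for search. What should I do to switch to Elasticsearch? Should I change the database to Elasticsearch and change the wagtailsearch configuration? - Alexey
You should look at the documentation to get started. That said, the PostgreSQL backend is easier to set up, see here - Moritz
3 Answers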

2

I don't like this workaround, but I decided to add two more fields, search_body and search_intro, and search on those instead:

class BlogPage(Page):
    date = models.DateField("Post date")
    intro = models.CharField(max_length=250)
    body = StreamField([
        ('heading', blocks.CharBlock(classname="full title")),
        ('paragraph', blocks.RichTextBlock()),
        ('image', ImageChooserBlock()),
        ('code', CodeBlock()),
    ])
    search_intro = models.CharField(max_length=250)
    search_body = models.CharField(max_length=50000)
    tags = ClusterTaggableManager(through=BlogPageTag, blank=True)

    def main_image(self):
        gallery_item = self.gallery_images.first()
        if gallery_item:
            return gallery_item.image
        else:
            return None

    def get_context(self, request):
        context = super(BlogPage, self).get_context(request)
        context['tags'] = get_all_tags()
        context['page_url'] = urllib.parse.urljoin(BASE_URL, self.url)
        return context

    def save(self, *args, **kwargs):
        if self.body.stream_data and isinstance(
                self.body.stream_data[0], tuple):
            self.search_body = ''
            for block in self.body.stream_data:
                if len(block) >= 2:
                    self.search_body += str(block[1])
        self.search_intro = self.intro.lower()
        self.search_body = self.search_body.lower()
        return super().save(*args, **kwargs)

    search_fields = Page.search_fields + [
        index.SearchField('search_intro'),
        index.SearchField('search_body'),
    ]
    ...

search/views.py:

def search(request):
    search_query = request.GET.get('query', None)
    page = request.GET.get('page', 1)

    # Search
    if search_query:
        search_results = BlogPage.objects.live().search(search_query.lower())
        query = Query.get(search_query)
    ...

0

Thank you, Alexey!

But in my case the save method ended up being called twice.

I had to use this code instead:

    def save(self, *args, **kwargs):
        search_body = ''
        if self.blog_post_body.stream_data and isinstance(
                self.blog_post_body.stream_data[0], dict):
            for block in self.blog_post_body.stream_data:
                if block.get('type', '') in ('some_header', 'some_text'):
                    search_body += str(block['value'])
        self.search_body = search_body
        super(BlogPost, self).save(*args, **kwargs)
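
Both save() variants above flatten the stream data into one plain string. As a rough sketch of that idea outside of any model, the helper below accepts either shape seen in this thread (tuples or dicts). The name build_search_text, the default block types, and the tag-stripping regex are assumptions made for illustration; none of them are part of Wagtail.

```python
import re
from html import unescape

# Naive tag stripper; good enough for simple rich-text markup like <p>...</p>.
TAG_RE = re.compile(r"<[^>]+>")


def build_search_text(stream_data, searchable_types=("heading", "paragraph")):
    """Flatten raw StreamField-like data into a lowercased search string.

    Hypothetical helper: accepts blocks either as (type, value) tuples or as
    {"type": ..., "value": ...} dicts, the two shapes used in the answers.
    """
    parts = []
    for block in stream_data:
        if isinstance(block, dict):
            block_type, value = block.get("type"), block.get("value")
        else:
            block_type, value = block[0], block[1]
        if block_type in searchable_types and value is not None:
            # Replace tags with spaces so neighbouring words do not merge.
            parts.append(unescape(TAG_RE.sub(" ", str(value))))
    # Collapse runs of whitespace and lowercase for case-insensitive matching.
    return " ".join(" ".join(parts).split()).lower()


example = [
    {"type": "heading", "value": "Тест"},
    {"type": "paragraph", "value": "<p>Тестовый</p>"},
    {"type": "image", "value": 3},
]
search_text = build_search_text(example)
```

Inside save(), the result could be assigned to self.search_body before calling super().save(). Joining blocks with a space, rather than plain concatenation, keeps the last word of one block from fusing with the first word of the next.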

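Yes, maybe your code is better, but I think this problem is related to the SQLite database. When I switched to Postgres the problem went away, so I think it is better not to use this approach at all. - Alexey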

0

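StreamField encodes its JSON with DjangoJSONEncoder, with ensure_ascii=True (the json.dumps default), so non-ASCII characters are written as "\u...." escapes. The default database search backend only does plain text matching in the database, so a query containing non-ASCII keywords will fail.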

    def get_prep_value(self, value):
        if isinstance(value, StreamValue) and not(value) and value.raw_text is not None:
            # An empty StreamValue with a nonempty raw_text attribute should have that
            # raw_text attribute written back to the db. (This is probably only useful
            # for reverse migrations that convert StreamField data back into plain text
            # fields.)
            return value.raw_text
        else:
            return json.dumps(self.stream_block.get_prep_value(value), cls=DjangoJSONEncoder)
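
To see why plain text matching fails, here is a small standalone sketch using only the standard json module. The naive_match function is a deliberately simplified stand-in for a database substring match, not Wagtail's actual query.

```python
import json

# Raw block data shaped like the row shown in the question.
stream_data = [{"type": "heading", "value": "тестовый"}]

# Default behaviour (ensure_ascii=True): Cyrillic becomes \uXXXX escapes.
stored_escaped = json.dumps(stream_data)

# With ensure_ascii=False the characters are written as-is.
stored_unicode = json.dumps(stream_data, ensure_ascii=False)


def naive_match(term, stored_text):
    """Simplified stand-in for a substring match on the stored column."""
    return term in stored_text


found_in_escaped = naive_match("тест", stored_escaped)
found_in_unicode = naive_match("тест", stored_unicode)

print(stored_escaped)
print(stored_unicode)
```

The escaped string contains only ASCII characters, so a Cyrillic search term has nothing to match against, while the unescaped form matches as expected.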

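You would need to subclass StreamField and provide a custom JSONEncoder with ensure_ascii=False. However, you also need to make sure your database can handle UTF-8 strings by default. (That should be fine with PostgreSQL.)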
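
A minimal sketch of such an encoder, assuming the standard library's json.JSONEncoder as a stand-in for DjangoJSONEncoder (which subclasses it). The class name UnicodeJSONEncoder is made up for this example.

```python
import json


class UnicodeJSONEncoder(json.JSONEncoder):
    """Encoder that always writes non-ASCII characters unescaped.

    Hypothetical name; in a Django project the base class could be
    DjangoJSONEncoder instead of json.JSONEncoder.
    """

    def __init__(self, *args, **kwargs):
        # json.dumps passes ensure_ascii=True by default; override it here.
        kwargs["ensure_ascii"] = False
        super().__init__(*args, **kwargs)


raw = [{"type": "heading", "value": "Тест"}]
encoded = json.dumps(raw, cls=UnicodeJSONEncoder)
print(encoded)
```

A StreamField subclass would then override get_prep_value and pass this class as cls in the json.dumps call quoted above. Passing ensure_ascii=False directly to that json.dumps call should have the same effect. Existing rows keep their escaped form until each page is saved again.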
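If you switch to another backend, such as the PG search backend, the text is extracted from the StreamField when the index is built (introduced by https://github.com/wagtail/wagtail/pull/982), so you will not run into this problem.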
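
For reference, a settings sketch for switching to the PostgreSQL search backend. This assumes a Wagtail release that ships the wagtail.contrib.postgres_search module and a project already running on PostgreSQL; module paths differ between Wagtail versions, so check the documentation for the version you have installed.

```python
# settings.py fragment (assumption: Wagtail with wagtail.contrib.postgres_search)

INSTALLED_APPS = [
    # ... your existing apps stay here ...
    "wagtail.contrib.postgres_search",
]

WAGTAILSEARCH_BACKENDS = {
    "default": {
        "BACKEND": "wagtail.contrib.postgres_search.backend",
    },
}
```

After changing the backend, run the migrations and then the update_index management command so that existing pages are indexed.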
