Trouble with Global Search and Chinese Characters

vrms · May 6, 2020, 11:44am

I am having trouble with the Global Search. We have many Items, Suppliers, etc with Chinese Characters and I have trouble finding them via the Global Search.

I see the same behavior on a local v11 instance as well as on a v12 instance on erpnext.com. Can anyone advise whether anything can be done on a global level to fix this (for example, character encoding at the OS or database level)?

szufisher · May 6, 2020, 12:16pm

Add the following parameters to the [mysqld] section of /etc/mysql/mariadb.cnf, for example via
sudo nano /etc/mysql/mariadb.cnf

[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2
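
As a rough illustration of why these two settings matter: a full-text index only stores words that are at least as long as the configured minimum (the defaults are 3 for innodb_ft_min_token_size and 4 for ft_min_word_len), so shorter words can never be matched. The sketch below mimics that filter in plain Python. The helper name indexed_words and the sample words are made up for illustration, and it assumes the length is counted in characters.

```python
def indexed_words(words, min_token_size):
    """Keep only the words a full-text index with this minimum size would store."""
    return [word for word in words if len(word) >= min_token_size]


# Hypothetical words taken from a document's searchable content
words = ["JL", "上海", "吴玲华", "Supplier"]

print(indexed_words(words, 3))  # with a minimum of 3, two-character words are dropped
print(indexed_words(words, 2))  # with a minimum of 2, they are kept
```

Lowering the minimum to 2 is what allows two-character terms to be indexed at all; it does not by itself split an unbroken run of Chinese characters into words.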

vrms · May 6, 2020, 12:34pm

I’ll check that out, thanks. Just curious … is that related to the length of the search string?

szufisher · May 6, 2020, 12:51pm

Yes. Try it and let me know the result.

vrms · May 8, 2020, 7:46am

this did not solve the issue. The behaviour is pretty illogical.

Some examples:

I can not find: 常琴
I can not find: 吴玲华
I can find: 上海
I can find: JL

I’ll prepare a small demo in the coming days. In order to not reveal our real data I have to prepare some sort of demo data which will take a bit.

szufisher · May 10, 2020, 9:50am

My test finding is: the search works if the searched text is at the beginning of the field, and fails if it is in the middle or at the end of the field.

It seems that extra handling is needed to split Chinese text into words.
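
A minimal sketch of why this would happen, assuming the full-text index splits content on whitespace and punctuation and that the search matches word prefixes. Both are assumptions about the search behavior, not something confirmed in this thread. The function names and the sample content are hypothetical.

```python
import re


def tokenize(content):
    # Assumption: a run of word characters without spaces is indexed as one word.
    # In Python 3, \w also matches Chinese characters.
    return re.findall(r"\w+", content)


def prefix_match(content, term):
    """Return True if any indexed word starts with the search term."""
    return any(word.startswith(term) for word in tokenize(content))


# Hypothetical indexed content, before and after word segmentation
unsegmented = "Supplier Name : 上海常琴贸易"
segmented = "Supplier Name : 上海 常琴 贸易"

print(prefix_match(unsegmented, "上海"))  # term at the start of the word
print(prefix_match(unsegmented, "常琴"))  # term in the middle of the word
print(prefix_match(segmented, "常琴"))    # term is its own word after segmentation
```

Under these assumptions, an unsegmented Chinese string is a single long word, so only a search term at its beginning can match. Inserting spaces between the words makes each of them findable.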

szufisher · May 10, 2020, 3:25pm

What I have done to solve the above problem:

Add full-text search support for Chinese word segmentation

Install the Python library pkuseg (lancopku/pkuseg-python, a toolkit for multi-domain Chinese word segmentation):
./env/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
To upgrade the library to the latest version later, run the same command again:
./env/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

Segment the field content that is written to __global_search

In any custom app's hooks.py file, add the code below:

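import re

import frappe
from frappe.utils import global_search, strip_html_tags
from six import text_type
from six.moves.html_parser import HTMLParser
import pkuseg

# Load the default pkuseg segmentation model once, when hooks.py is imported
seg = pkuseg.pkuseg()

def get_formatted_value(value, field):
	"""
	Prepare field from raw data and split Chinese text into space-separated words
	:param value: raw field value
	:param field: field definition (provides fieldtype and label)
	:return: "<label> : <segmented value>"
	"""
	if getattr(field, 'fieldtype', None) in ["Text", "Text Editor"]:
		# Note: HTMLParser.unescape was removed in Python 3.9; use html.unescape there
		h = HTMLParser()
		value = h.unescape(frappe.safe_decode(value))
		# Drop <script>/<style> blocks; the (?s) flag must be at the start of the pattern
		value = re.subn(r'(?s)<[\s]*(script|style).*?</\1>', '', text_type(value))[0]
		value = ' '.join(value.split())
	value = strip_html_tags(text_type(value))
	try:
		# Insert spaces between the segmented words so each one is indexed separately
		value = ' '.join(seg.cut(value))
	except Exception:
		# If segmentation fails, fall back to the unsegmented value
		pass
	return field.label + " : " + value

# Replace Frappe's formatter with the segmenting version
global_search.get_formatted_value = get_formatted_value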
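
To see what the ' '.join(seg.cut(value)) step contributes without installing pkuseg, here is a toy stand-in. It is not how pkuseg works internally (pkuseg relies on a trained model). It uses a swapped-in technique, greedy forward maximum matching against a small hand-made vocabulary, only to show the effect of inserting spaces between words. The function segment, its max_len parameter and the vocabulary are all hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: take the longest known word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


vocabulary = {"上海", "常琴", "贸易"}  # hypothetical demo vocabulary

# Same shape as the patched formatter: join the segments with spaces
print(" ".join(segment("上海常琴贸易", vocabulary)))
```

A real segmenter such as pkuseg handles unknown words and ambiguity far better than this dictionary lookup, which is why the hook above delegates to seg.cut.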

the end result is like this

vrms · May 12, 2020, 11:47am

First of all … thanks a ton for getting so involved with this. 非常感谢 (thank you very much)! I really appreciate it.

I am clear about these 2 steps above.

but what do you mean with this?

szufisher · May 12, 2020, 12:19pm

Those are my remarks on which changes are needed for this solution to work.

iHello · May 12, 2020, 2:13pm

Please check the Name field to confirm whether it mixes first name and last name.
Is 常琴 a first name, a last name, or first + last name?

vrms · May 12, 2020, 9:15pm

I have installed pkuseg (lancopku/pkuseg-python) as suggested, but unfortunately nothing has changed in my scenario. I have noticed that the search behaves a little weird to begin with when using Chinese characters.

In the end, I can find some strings and not others.

here is a short demo of the behavior

https://imgur.com/N6eYIS7

Any ideas what this can be?

szufisher · May 13, 2020, 2:01am

2 suggestions:

Search for 零花 instead of 吴零花.
Create a new supplier 吴玲花, then search for 吴玲花.

Share your result, then maybe I can explain something to you.

Fisher

vrms · May 13, 2020, 1:23pm

Same behavior as in the first 7 seconds of the GIF I shared yesterday, for “零花” as well as for “吴玲华”.

szufisher · May 13, 2020, 2:57pm

Have you changed this setting and reloaded MariaDB? After that, create new docs to see the result. You can also check the indexed content by running bench mariadb
and then select name, doctype, content from __global_search;

szufisher · May 13, 2020, 3:01pm

Remember to install a new custom app with the above code in its hooks.py file.

vrms · May 13, 2020, 3:04pm

oh sorry. I didn’t get that. I thought you meant, if there was any custom app, you needed to add this to that existing app’s hooks.py file. I’ll look into it and see what I can come up with.

Need to understand how to create a custom app.

szufisher · May 13, 2020, 3:05pm

No, if you already have an existing custom app, you can add the code to its hooks.py, no problem. But you have to run bench restart so that the updated code is reloaded.

Good Luck, waiting for good news.