Support non-ASCII characters in Tokenizer (or directly accept tokenized text) #5

@ndandanov


Thank you for the great and performant package!

I was experimenting with it today and found a potential limitation in the tokenizer related to language support.
Currently, it seems to support only ASCII characters:

        for ch in text.chars() {
            if ch.is_ascii_alphanumeric() {
                current.push(ch.to_ascii_lowercase());

If one wanted to use bb25 for texts in other scripts, e.g., Cyrillic, the non-ASCII characters would be stripped and the corresponding tokens would be missing.

With this issue, I would like to demonstrate this.
Perhaps the tokenizer may be improved by supporting non-ASCII characters.
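
To illustrate the idea, here is a minimal Python sketch of the same character loop with a Unicode-aware check instead of the ASCII-only one. The function name tokenize_unicode is hypothetical and this is not bb25's code; it only mirrors the structure of the Rust excerpt above. On the Rust side, the analogous change might be to use char::is_alphanumeric and char::to_lowercase, but that is untested here.

```python
def tokenize_unicode(text: str) -> list[str]:
    """Split text into lowercase runs of alphanumeric characters.

    Hypothetical helper: str.isalnum() is Unicode-aware, so Cyrillic
    letters are kept instead of being treated as separators.
    """
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            # Any non-alphanumeric character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize_unicode("BM25 (Best Matching 25) е усъвършенстван алгоритъм"))
# ['bm25', 'best', 'matching', '25', 'е', 'усъвършенстван', 'алгоритъм']
```

Unlike the ASCII-only loop, this keeps the Cyrillic words while still lowercasing and dropping punctuation.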

Alternatively, Corpus could simply support passing tokens directly when adding a new document (a simple example of how to enable this is given further down).
Or Corpus could accept a custom Tokenizer written in Python which inherits bb25's Tokenizer.

This is a short example featuring a document which contains Bulgarian text:

import bb25 as bb

corpus = bb.Corpus()

corpus.add_document("d0", "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене", [0.2] * 15)
corpus.add_document("d1", "neural networks for ranking", [0.1] * 8)
corpus.build_index()  # must be called before creating scorers

bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
print(bm25.idf("bm25"))
# Prints out: 0.0

for doc in corpus.documents():
    print(f'Document "{doc.id}"')
    print(f'- text "{doc.text}"')
    print(f'- tokens "{doc.tokens}"')
    print(f'- length (number of tokens) "{doc.length}"')
    print(f'- term frequencies "{doc.term_freq}"')
    print()

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['bm25', 'best', 'matching', '25', 'ranking']"
# - length (number of tokens) "5"
# - term frequencies "{'ranking': 1, 'bm25': 1, '25': 1, 'best': 1, 'matching': 1}"
#
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'networks': 1, 'for': 1, 'neural': 1, 'ranking': 1}"

As you can see, d0 is indexed with only 5 tokens: the ones made up entirely of ASCII characters.

If the tokenizer supported non-ASCII characters, the result would be:

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['BM25', 'Best', 'Matching', '25', 'е', 'усъвършенстван', 'алгоритъм', 'за', 'класиране', 'на', 'документи', 'ranking', 'в', 'информационното', 'търсене']"
# - length (number of tokens) "15"
# - term frequencies "{'класиране': 1, 'алгоритъм': 1, 'ranking': 1, 'усъвършенстван': 1, '25': 1, 'Matching': 1, 'е': 1, 'документи': 1, 'търсене': 1, 'информационното': 1, 'за': 1, 'BM25': 1, 'Best': 1, 'в': 1, 'на': 1}"
# 
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'neural': 1, 'ranking': 1, 'for': 1, 'networks': 1}"

And the IDF would be correct: print(bm25.idf("bm25")) would print out 1.6094379124341003.
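
As a sanity check on the two IDF values, both are consistent with the classic BM25 IDF, ln((N - df + 0.5) / (df + 0.5)), for a corpus of N = 2 documents. That bb25 uses exactly this formula is an assumption, not something confirmed in this issue. The helper name bm25_idf below is hypothetical.

```python
import math


def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """Classic BM25 IDF (assumed formula, not taken from bb25's source)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Term found in 1 of 2 documents.
print(bm25_idf(2, 1))  # 0.0
# Term found in 0 of 2 documents.
print(bm25_idf(2, 0))  # 1.6094379124341003
```

Under that assumption, 1.6094379124341003 corresponds to a document frequency of 0. That would fit the expected output above, where the stored token is 'BM25' while the query is "bm25", so case normalization of externally supplied tokens may also need attention.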

Adding support for passing tokens directly to add_document would require only small changes to Corpus:

diff --git a/src/corpus.rs b/src/corpus.rs
index e11b77d..7993289 100644
--- a/src/corpus.rs
+++ b/src/corpus.rs
@@ -35,6 +35,10 @@ impl Corpus {
 
     pub fn add_document(&mut self, doc_id: &str, text: &str, embedding: Vec<f64>) {
         let tokens = self.tokenizer.tokenize(text);
+        self.add_document_with_tokens(doc_id, text, tokens, embedding);
+    }
+
+    pub fn add_document_with_tokens(&mut self, doc_id: &str, text: &str, tokens: Vec<String>, embedding: Vec<f64>) {
         let mut term_freq = HashMap::new();
         for token in &tokens {
             *term_freq.entry(token.clone()).or_insert(0) += 1;

and the respective binding:

diff --git a/src/pybindings.rs b/src/pybindings.rs
index aafdd43..f17df5f 100644
--- a/src/pybindings.rs
+++ b/src/pybindings.rs
@@ -171,7 +171,8 @@ impl PyCorpus {
         }
     }
 
-    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>) -> PyResult<()> {
+    #[pyo3(signature = (doc_id, text, embedding, tokens=None))]
+    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>, tokens: Option<Vec<String>>) -> PyResult<()> {
         if self.shared.borrow().is_some() {
             return Err(PyRuntimeError::new_err(
                 "Corpus is frozen and cannot be modified",
@@ -181,7 +182,11 @@ impl PyCorpus {
         let Some(corpus) = inner.as_mut() else {
             return Err(PyRuntimeError::new_err("Corpus is unavailable"));
         };
-        corpus.add_document(doc_id, text, embedding);
+        if let Some(toks) = tokens {
+            corpus.add_document_with_tokens(doc_id, text, toks, embedding);
+        } else {
+            corpus.add_document(doc_id, text, embedding);
+        }
         Ok(())
     }

Then, documents can be added like this:

import bb25 as bb

corpus = bb.Corpus()

text = "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# tokenize is a user-supplied function, defined below
tokens = list(tokenize(text))
corpus.add_document(doc_id="d0", text=text, tokens=tokens, embedding=[])

where tokenize is a custom tokenization function, e.g., as simple as:

def tokenize(text: str):
    return text.split()
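
Note that text.split() only splits on whitespace, so punctuation stays attached to tokens such as "(Best" and "(ranking)". A slightly more robust sketch, still plain Python and independent of bb25, could use a Unicode-aware regular expression and lowercase the matches. The name tokenize_words is hypothetical, and \w also matches underscores, which may or may not be desired.

```python
import re

# In Python 3, \w matches Unicode word characters for str patterns.
_WORD_RE = re.compile(r"\w+")


def tokenize_words(text: str) -> list[str]:
    """Return lowercase word tokens, dropping punctuation (illustrative only)."""
    return [match.lower() for match in _WORD_RE.findall(text)]


text = "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
tokens = tokenize_words(text)
print(len(tokens))  # 15
print(tokens[:5])   # ['bm25', 'best', 'matching', '25', 'е']
```

Lowercasing here keeps the externally produced tokens aligned with lowercase queries such as "bm25".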

Hope this would be useful!

Kind regards,
Nikolay
