Support non-ASCII characters in Tokenizer (or directly accept tokenized text)

Thank you for the great and performant package!

I was experimenting with it today and found a potential limitation in the tokenizer related to language support.
Currently, it seems to keep only ASCII alphanumeric characters:
```rust
        for ch in text.chars() {
            if ch.is_ascii_alphanumeric() {
                current.push(ch.to_ascii_lowercase());
```

If one wanted to use `bb25` for other text, e.g., text written in the Cyrillic alphabet, the non-ASCII characters would be stripped and the corresponding tokens would be missing.

This issue demonstrates the problem.
The tokenizer could be improved by supporting non-ASCII characters.
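
To make the difference concrete, here is a minimal Python sketch of the character loop quoted above, with a switch between the ASCII-only check and a Unicode-aware one. The function name `simple_tokenize`, the `ascii_only` flag, and the rule that any other character ends the current token are my assumptions for illustration; this is not `bb25`'s actual code. In Rust, the analogous change would presumably use `char::is_alphanumeric` and `char::to_lowercase` instead of their ASCII variants.

```python
def simple_tokenize(text: str, ascii_only: bool = False) -> list[str]:
    """Split text into lowercase alphanumeric tokens (hypothetical helper).

    ascii_only=True imitates the ASCII-only check from the Rust snippet;
    ascii_only=False keeps any Unicode letter or digit.
    """
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        keep = ch.isalnum() and (ch.isascii() or not ascii_only)
        if keep:
            current.append(ch.lower())
        elif current:
            # Assumption: any character that is not kept ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "BM25 е алгоритъм за класиране (ranking)"
print(simple_tokenize(text, ascii_only=True))
# ['bm25', 'ranking']
print(simple_tokenize(text))
# ['bm25', 'е', 'алгоритъм', 'за', 'класиране', 'ranking']
```

With the ASCII-only check, the Cyrillic words disappear; with the Unicode-aware check, they are kept as tokens.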

Alternatively, `Corpus` could accept pre-computed tokens when adding a new document (a simple example of how to enable this is given below).
Or `Corpus` could accept a custom `Tokenizer` written in Python that inherits from `bb25`'s `Tokenizer`.

This is a short example featuring a document which contains Bulgarian text:
```python
import bb25 as bb

corpus = bb.Corpus()

corpus.add_document("d0", "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене", [0.2] * 15)
corpus.add_document("d1", "neural networks for ranking", [0.1] * 8)
corpus.build_index()  # must be called before creating scorers

bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
print(bm25.idf("bm25"))
# Prints out: 0.0

for doc in corpus.documents():
    print(f'Document "{doc.id}"')
    print(f'- text "{doc.text}"')
    print(f'- tokens "{doc.tokens}"')
    print(f'- length (number of tokens) "{doc.length}"')
    print(f'- term frequencies "{doc.term_freq}"')
    print()

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['bm25', 'best', 'matching', '25', 'ranking']"
# - length (number of tokens) "5"
# - term frequencies "{'ranking': 1, 'bm25': 1, '25': 1, 'best': 1, 'matching': 1}"
#
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'networks': 1, 'for': 1, 'neural': 1, 'ranking': 1}"
```

You can see that `d0` is treated as containing only 5 tokens: the ones that consist entirely of ASCII characters.

If the tokenizer supported non-ASCII characters, the result would be:
```python
# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['BM25', 'Best', 'Matching', '25', 'е', 'усъвършенстван', 'алгоритъм', 'за', 'класиране', 'на', 'документи', 'ranking', 'в', 'информационното', 'търсене']"
# - length (number of tokens) "15"
# - term frequencies "{'класиране': 1, 'алгоритъм': 1, 'ranking': 1, 'усъвършенстван': 1, '25': 1, 'Matching': 1, 'е': 1, 'документи': 1, 'търсене': 1, 'информационното': 1, 'за': 1, 'BM25': 1, 'Best': 1, 'в': 1, 'на': 1}"
# 
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'neural': 1, 'ranking': 1, 'for': 1, 'networks': 1}"
```
And the IDF would be correct: `print(bm25.idf("bm25"))` would print out `1.6094379124341003`.


Adding support for directly passing tokens to `add_document` would require only small changes to `Corpus`:
```patch
diff --git a/src/corpus.rs b/src/corpus.rs
index e11b77d..7993289 100644
--- a/src/corpus.rs
+++ b/src/corpus.rs
@@ -35,6 +35,10 @@ impl Corpus {
 
     pub fn add_document(&mut self, doc_id: &str, text: &str, embedding: Vec<f64>) {
         let tokens = self.tokenizer.tokenize(text);
+        self.add_document_with_tokens(doc_id, text, tokens, embedding);
+    }
+
+    pub fn add_document_with_tokens(&mut self, doc_id: &str, text: &str, tokens: Vec<String>, embedding: Vec<f64>) {
         let mut term_freq = HashMap::new();
         for token in &tokens {
             *term_freq.entry(token.clone()).or_insert(0) += 1;
```
and the respective binding:
```patch
diff --git a/src/pybindings.rs b/src/pybindings.rs
index aafdd43..f17df5f 100644
--- a/src/pybindings.rs
+++ b/src/pybindings.rs
@@ -171,7 +171,8 @@ impl PyCorpus {
         }
     }
 
-    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>) -> PyResult<()> {
+    #[pyo3(signature = (doc_id, text, embedding, tokens=None))]
+    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>, tokens: Option<Vec<String>>) -> PyResult<()> {
         if self.shared.borrow().is_some() {
             return Err(PyRuntimeError::new_err(
                 "Corpus is frozen and cannot be modified",
@@ -181,7 +182,11 @@ impl PyCorpus {
         let Some(corpus) = inner.as_mut() else {
             return Err(PyRuntimeError::new_err("Corpus is unavailable"));
         };
-        corpus.add_document(doc_id, text, embedding);
+        if let Some(toks) = tokens {
+            corpus.add_document_with_tokens(doc_id, text, toks, embedding);
+        } else {
+            corpus.add_document(doc_id, text, embedding);
+        }
         Ok(())
     }
```

Then, documents can be added like this:
```python
corpus = bb.Corpus()

text = "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
tokens = list(tokenize(text))
corpus.add_document(doc_id="d0", text=text, tokens=tokens, embedding=[])
```
where `tokenize` is a custom tokenization function, e.g., as simple as:
```python
def tokenize(text: str):
    return text.split()
```
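
Note that plain `text.split()` keeps punctuation attached to words (e.g., `(Best` and `25)`) and does not lowercase. A slightly more robust sketch, still only an illustration, uses a regular expression; the name `regex_tokenize` is hypothetical. In Python 3, `\w` matches Unicode letters and digits (and the underscore) for `str` patterns, so Cyrillic words are kept:

```python
import re

# \w+ matches runs of Unicode letters, digits, and underscores.
WORD_RE = re.compile(r"\w+")


def regex_tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word-like tokens (hypothetical helper)."""
    return WORD_RE.findall(text.lower())


print(regex_tokenize("класиране на документи (ranking)"))
# ['класиране', 'на', 'документи', 'ranking']
```

Such a function could be passed through the proposed `tokens` argument in the same way as `tokenize` above.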

Hope this would be useful!

Kind regards,
Nikolay
