Skip to content

Fix: Added check the type of tokenizer.json's model_merges#987

Open
N-E-W-T-O-N wants to merge 10 commits into
microsoft:mainfrom
N-E-W-T-O-N:merges
Open

Fix: Added check the type of tokenizer.json's model_merges#987
N-E-W-T-O-N wants to merge 10 commits into
microsoft:mainfrom
N-E-W-T-O-N:merges

Conversation

@N-E-W-T-O-N

@N-E-W-T-O-N N-E-W-T-O-N commented Aug 9, 2025

Copy link
Copy Markdown

Modern Tokenizer store merge's value as list of lists of strings. But models like GPT-2 or gemma-2 store as a list of strings
Ex Mistral

Added a check for this fix

@N-E-W-T-O-N N-E-W-T-O-N requested a review from a team as a code owner August 9, 2025 18:33
@N-E-W-T-O-N

Copy link
Copy Markdown
Author

Mistral

merges": [
      [
        "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
        "\n"
      ],
      [
        "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
        "▁"
      ],
      [

@N-E-W-T-O-N

Copy link
Copy Markdown
Author

Gpt-2

"merges":["Ġ t","Ġ a","h " ...

@N-E-W-T-O-N

N-E-W-T-O-N commented Aug 11, 2025

Copy link
Copy Markdown
Author

@shaaga @sayanshaw24

@sayanshaw24

Copy link
Copy Markdown
Collaborator

Modern Tokenizer store merge's value as list of lists of strings. But models like GPT-2 or gemma-2 store as a list of strings Ex Mistral

Added a check for this fix

awesome, thanks for this! could you add a unit test here as well to ensure expected functionality: https://github.com/microsoft/onnxruntime-extensions/tree/main/test?

@sayanshaw24

Copy link
Copy Markdown
Collaborator

/azp run onnxruntime-extensions.CI

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@N-E-W-T-O-N

Copy link
Copy Markdown
Author

Using phi-4 as an example

@N-E-W-T-O-N

N-E-W-T-O-N commented Aug 13, 2025

Copy link
Copy Markdown
Author

The following Functions is only called when use_fast=True used in passed to tokenizer

@sayanshaw24 sayanshaw24 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use test/data/phi-4-base or test/data/phi-4-mini-reasoning, we'd like to minimize tokenizer files checked into the repo; (also, FYI onnxruntime-extensions only needs the tokenizer.json and tokenizer_config.json files to load tokenizers).

Comment thread test/data/phi-4-mini-reasoning/tokenizer_config.json Outdated
@azure-pipelines

Copy link
Copy Markdown
Commenter does not have sufficient privileges for PR 987 in repo microsoft/onnxruntime-extensions

@N-E-W-T-O-N

N-E-W-T-O-N commented Aug 19, 2025

Copy link
Copy Markdown
Author

Hi @sayanshaw24
Revert back the change as requested and updated the branch .
Please merge my pr

@sayanshaw24

Copy link
Copy Markdown
Collaborator

/azp run onnxruntime-extensions.CI

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@N-E-W-T-O-N N-E-W-T-O-N left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge the latest code .Please rerun the pipeline

@snnn

snnn commented Sep 10, 2025

Copy link
Copy Markdown

/azp where

@azure-pipelines

Copy link
Copy Markdown
Azure DevOps orgs getting events for this repository:

@N-E-W-T-O-N N-E-W-T-O-N left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sayanshaw24 pr please merge my PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants