Chinese text returned by GetToc or GetTextWords contains unknown characters #170

@DHclly

The test file:
20250309第三章.pdf

MuPDF.NET version:

<PackageReference Include="MuPDF.NET" Version="3.2.5" />

I use this code:

public static void T1()
{
    Document doc = new Document(@"D:\learn\python-pdfplumber-learn\pdf-docs\20250309第三章.pdf");
    doc.SetLanguage("zh-CN");
    doc.FontInfos.Add(new FontInfo()
    {
        Name="微软雅黑",
    });

    var toc = doc.GetToc();
    var t0 = toc[0];
    var title = t0.Title;
    Console.WriteLine(title);

    var p1 = doc[0];
    var list = p1.GetTextWords(sort: true);
    foreach (var wb in list)
    {
        Console.WriteLine(wb.Text);
    }
}

Result:

[Image: screenshot of the program output]

The same file opened in WPS or Google Chrome:

[Image: screenshot of the document as rendered]
