Skip to content

First bigbio release of msgf+#14

Open
ypriverol wants to merge 141 commits intomasterfrom
dev
Open

First bigbio release of msgf+#14
ypriverol wants to merge 141 commits intomasterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented Apr 16, 2026

This pull request adds comprehensive project documentation and internal investigation records to the MS-GF+ codebase. It introduces a high-level project context file, detailed documentation of command-line flags, and a new system for tracking technical investigations and plans. The main themes are improved onboarding/documentation, enhanced transparency into known issues and technical debt, and groundwork for future development and parameter modernization.

Project documentation and onboarding:

  • Added .claude/CLAUDE.md with an overview of the MS-GF+ project, directory structure, build instructions, conventions, and performance-sensitive invariants. This serves as a central onboarding and context document for the codebase.
  • Added .claude/plans/README.md and .claude/plans/parameter-modernization-flag-inventory.md to document implementation plans, design documents, and a detailed inventory of all command-line flags and their parsing semantics. This supports the upcoming parameter modernization effort. [1] [2]

Investigations and technical debt tracking:

  • Introduced .claude/investigations/README.md as an index for tracked issues, bugs, and behaviors needing further analysis, establishing a process for technical investigations.
  • Added detailed investigations:
    • Investigation 001: MGF scan number extraction failure, documenting the parser's inability to handle PRIDE/ProteomeXchange TITLE formats and proposing regex-based fixes.
    • Investigation 002: E-value leakage to Percolator, analyzing how MS-GF+ E-value output leaks target/decoy information, impacting FDR estimation in downstream tools, and proposing output changes and CLI flags to mitigate.

ypriverol and others added 30 commits April 13, 2026 07:51
…Java 17

- Change skipTests from true to false so mvn test actually runs
- Update maven-compiler-plugin source/target from 1.8 to 17 (matches runtime)
- Add missing compile dependencies: jmzml 1.7.11, fastutil 8.5.12,
  slf4j-api 1.7.36, logback-classic 1.2.12, commons-io 2.15.1
  (master code references these classes but they were not declared)
- @ignore TestMzML test that requires Windows-specific DMS files

Result: 120 tests run, 53 active, 67 skipped, 0 failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…shold

In MZIdentMLGen.addSpectrumIdentificationResults(), change `break` to
`continue` when a match has DeNovoScore below the minimum threshold.
The `break` was incorrectly stopping emission of all subsequent matches
for that spectrum, silently dropping valid PSMs from the mzid output.

Also add null safety check for spectrum index lookup — if a spectrum
index is not found in the spectrum file, log a warning and skip
instead of throwing a NullPointerException.

Add TestMZIdentMLGen with two integration tests:
- testMzidScoreCompleteness: runs MSGF+ search, verifies every SII
  has all 4 score CVParams (RawScore, DeNovoScore, SpecEValue, EValue)
- testMzidStructuralValidity: verifies output mzid has required
  mzIdentML structure elements

Closes MSGFPlus#157

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new -msLevel CLI parameter to filter spectra by MS level.
Accepts single value (e.g., -msLevel 2) or comma-separated range
(e.g., -msLevel 2,3). Default is 2 (MS2 only).

Changes:
- ParamManager: add MS_LEVEL enum and registration
- IntRangeParameter: enable single-value parsing, fix typo
- SearchParams: add minMSLevel/maxMSLevel fields
- SpecKey: filter spectra by MS level in getSpecKeyList()
- SpectraAccessor: add setMSLevelRange(), wire to parsers
- MzMLAdapter/MzXMLSpectraMap: fix maxMSLevel to be inclusive
- MSGFPlus/MSGFDB/MSGFDBLib: wire MS level parameters
- pom.xml: remove fastutil shade filter (jmzml 1.7.11 needs full fastutil)

Tests: TestIntRangeParameter (9 tests), TestMSLevelFiltering (6 tests)

Benchmark (TMT 1.1GB, TDA):
  Baseline: 1245s, 6654 PSMs@1%FDR
  -msLevel 2: 957s (-23%), 6936 PSMs@1%FDR (+4.2%)

Closes MSGFPlus#159

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(MSGFPlus#159): add -msLevel parameter for MS level filtering
fix(MSGFPlus#157): preserve PSM scores when DeNovoScore is below threshold
fix: enable test suite and fix broken build dependencies
Remove standalone scripts, legacy tools, and unused classes that are
not referenced by the core MSGF+ search pipeline, reducing codebase
by ~22,000 lines.

Deleted entire packages:
- ims/ (9 files) — legacy IMS utilities
- ipa/ (5 files) — unused isotope pattern analysis
- msgf2d/ (8 files) — abandoned 2D scoring experiment
- msdictionary/ (7 files) — unused genome dictionary tool
- mstag/ (3 files) — unused sequence tagging
- scripts/ (6 files) — standalone CLI utilities
- msutil/test/ (3 files) — misplaced test classes
- msgf/test/ (2 files) — legacy test stubs
- msgf/analysis/ (1 file) — unused ROC generator

Cleaned mixed packages:
- misc/: removed 59 standalone scripts, kept 5 core utilities
- msgf/: removed 6 unused graph/scoring classes
- msutil/: removed 9 unused filter/annotation classes
- msdbsearch/: removed 4 standalone DB tools
- parser/: removed 9 legacy format parsers (InsPecT, Mascot, etc.)
- ui/: removed 6 legacy entry points (MSGF, MSGFLib, etc.)
- mzid/: removed 1 unused adapter stub
- msscorer/: removed 1 unused stats class
- suffixarray/: removed 1 unused mass array class

Also removed dead test methods and cleaned dangling imports.
Tests: 119 run, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite README.md:
- Full parameter reference tables covering all 30+ flags organized
  by category (core search, fragmentation, enzyme, filtering, etc.)
- Quick start examples for basic and TMT searches
- Modification file format documentation with examples
- Build-from-source instructions
- Updated requirements to Java 17+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add parameter docs to README and CI/CD workflow
Remove dead code: 150 unused classes, -22K lines
Write search results directly to TSV from in-memory objects, bypassing
mzIdentML serialization. Output is column-identical to MzIDToTsv
(verified by diff on test.mgf search). This avoids generating large
.mzid files when only TSV is needed downstream (e.g. OpenMS
MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with same score/protein/mod logic as
  MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now
includes all PSMFeatureFinder columns needed for Percolator:
ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio,
MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency,
NumMatchedMainIons, and all error statistics (MeanError/StdevError
for All and Top7, both absolute and relative).

These features were previously only available as UserParams in mzid
and were not extracted by OpenMS's addMSGFFeatures() — now they are
directly accessible as TSV columns.

The peptide modification format (M+15.995) is already compatible with
OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms
it to bracket notation M[+15.995] for AASequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX
streaming parser that extracts only the 11 fields MSGF+ needs. The new
parser builds a spectrum index in a single pass, then preloads all
spectra into memory on first random access, eliminating repeated XML
parsing during the search phase.

Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads):
- Wall time: 957s -> 853s (-10.9%)
- PSMs at 1% FDR: 6,936 (unchanged)
- Score completeness: 100% (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

- Remove jmzml (JAXB-based mzML parser) dependency from pom.xml
- Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap,
  MzMLSpectraIterator, SpectrumConverter
- Add referenceableParamGroupRef resolution to StaxMzMLParser: builds
  a map of param groups during index pass, resolves refs during
  spectrum parsing (critical for files that define polarity, MS level,
  etc. in referenceable groups)
- Move turnOffLogs() utility to StaxMzMLParser, update all callers
- Keep fastutil dependency (needed by jmzidml at runtime)

JAR size reduced from 39.5MB to 38MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports
anywhere in the codebase — a dead dependency from earlier development.
All spectrum file format parsing uses custom implementations:
mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL
(custom parsers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mzXML file format support entirely:
- Delete embedded jrap/stax library (20 files, ~5,800 lines)
- Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter
- Delete MzXMLToMgf utility and mzXML test resources (38MB)
- Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager
- Update misc/scripts/ui classes to remove mzXML code paths

mzXML is a legacy format superseded by mzML. Users with mzXML files
can convert to mzML using msconvert (ProteoWizard).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache,
  fix class-level doc (preload-all, not bounded LRU), check index before
  preloading, propagate exceptions instead of returning null
- StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted
- SpectraAccessor: throw exception instead of System.exit(-1),
  validate specFormat is non-null in constructor
- SelectSpectra: update stale .mzXML reference to .mzML
- pom.xml: fix duplicate <manifest>, remove stale comments, note
  fastutil is required by jmzidentml at runtime

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write search results directly to TSV from in-memory objects, bypassing
mzIdentML serialization. Output is column-identical to MzIDToTsv
(verified by diff on test.mgf search). This avoids generating large
.mzid files when only TSV is needed downstream (e.g. OpenMS
MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with same score/protein/mod logic as
  MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now
includes all PSMFeatureFinder columns needed for Percolator:
ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio,
MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency,
NumMatchedMainIons, and all error statistics (MeanError/StdevError
for All and Top7, both absolute and relative).

These features were previously only available as UserParams in mzid
and were not extracted by OpenMS's addMSGFFeatures() — now they are
directly accessible as TSV columns.

The peptide modification format (M+15.995) is already compatible with
OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms
it to bracket notation M[+15.995] for AASequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ConvertToMgf-based tests (class removed in PR #7) with
StaxMzMLParser and SpectraAccessor mzML parsing tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: native TSV output — bypass mzIdentML for OpenMS/Percolator pipelines
perf: replace jmzml JAXB parser with StAX-based mzML reader
* chore: add CI/release packaging and benchmark scaffolding

Split infra and repository maintenance updates into a dedicated reviewable change set, including workflow automation, Docker packaging, benchmark scripts/docs, and project documentation updates. Exclude large local benchmark artifacts and keep this PR focused on non-hot-path code organization and release hygiene.

Made-with: Cursor

* chore: keep benchmark folder local-only

Remove benchmark scripts/docs from this branch and ignore the entire benchmark directory so local benchmarking assets do not appear in review PRs.

Made-with: Cursor

* docs: keep single canonical primitives plan

Fold memory-reduction guidance into the balanced primitives plan and remove the old duplicate plan file so review and maintenance use one canonical document.

Made-with: Cursor

* chore: narrow PR1 plans to scope-only docs

Remove unrelated strategy and optimization plan documents from PR1 so this branch stays focused on infra/packaging cleanup. Keep only the plans index file in this PR.

Made-with: Cursor

* chore: remove legacy ZippedReleases folder

Delete the obsolete Windows release helper scripts and reference files under ZippedReleases from the repository.

Made-with: Cursor

* chore: remove legacy extlib dependency jar

Delete the obsolete jrap/stax legacy jar under extlib as part of repository cleanup.

Made-with: Cursor

* fix: address copilot review feedback for PR11

Align docs with actual supported legacy formats, update release pipeline to build from tag version with tests, and fix Docker build JDK requirement.

Made-with: Cursor

* chore: minor packaging/docs hygiene for PR1

Normalize ignore files, shrink Docker build context, align agent README with dev/CI, and clarify release workflow step naming.

Made-with: Cursor

* docs: trim examples folder to small referenced artifacts

Remove duplicate Tryp_Pig_Bov DB/index copies (tests use src/test/resources),
drop large unlinked Excel/PNG teaching files, and add docs/examples/README.md
so the directory purpose is obvious. Link the index from the main README.

Made-with: Cursor

* chore: remove IntelliJ IDEA tips screenshots from docs

Made-with: Cursor

* docs: replace legacy HTML manuals with Markdown

Convert docs/*.html to GitHub-flavored Markdown (pandoc), fix internal links,
add docs/README.md as the documentation index, and remove unused style.css.

Made-with: Cursor

* docs: strip leftover HTML span wrappers from converted Markdown

Made-with: Cursor
Add a workflow-dispatch benchmark pipeline on a fixed self-hosted runner profile, with public-data download, metrics emission, and baseline TSV comparison under benchmark/ci/PXD001819 for future dataset expansion.

Made-with: Cursor
Use uppercase PXD001819 naming in workflow-visible labels/artifacts and update README to state mzXML is not available in this fork.

Made-with: Cursor
- run_ci.sh: count only opening <SpectrumIdentificationItem> tags for
  sii_count (prior substring match double-counted closing tags and
  picked up SpectrumIdentificationItemRef)
- run_ci.sh: always emit peak_rss_kb and cpu_percent (NA when GNU time
  does not expose them) so metrics file format is consistent
- compare_metrics.py: support an `optional` column; optional missing/NA
  metrics warn instead of failing CI
- baseline.tsv: add optional column, mark peak_rss_kb optional, fix
  ubuntu-latest note to reference the self-hosted runner, widen
  sii_count floor to match the de-duplicated count
- README pointers: update stale references to a non-existent
  benchmark/run_pxd001819_benchmark.sh script
- benchmark/README.md: describe the actual committed CI scaffold
  instead of an uncommitted local harness layout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- extract_metrics.py: stream-parse mzIdentML with ElementTree.iterparse
  so SII counting and PSM 1% FDR counting no longer rely on line-shaped
  regex matches over XML
- run_ci.sh: use a bash array for SEARCH_ARGS (safe against future
  flags with spaces), atomic .part downloads, validate cached gzip,
  default MSGFPLUS_THREADS to 8 to match the workflow, drop the
  always-zero java_exit metric, and emit integer wall_time_sec
- workflow: pin Python via actions/setup-python@v5 so self-hosted
  runners have a known 3.11 interpreter for the helper scripts
- compare_metrics.py: add test_compare_metrics.py covering
  in-range pass, out-of-range fail, missing required/optional,
  NA, non-numeric, and empty-range rows (7 tests, all passing)
- .gitignore: drop redundant benchmark/** patterns (already covered
  by benchmark/* + ci/ allowlist); add __pycache__/ and *.pyc
- docs: describe new helper and test scripts in both READMEs
fix(benchmark): harden PXD001819 scaffold per review feedback
…ull sink

Three small follow-ups to the per-task-buffer commit that pay back the
allocation churn the buffer change shifted around:

- ConcurrentMSGFPlus: cache a static NULL_PRINT_STREAM instead of
  allocating one per RunMSGFPlus.run() invocation. Same NullOutputStream
  sink either way; we just stop newing it 12+ times per search.
- ConcurrentMSGFPlus.RunMSGFPlus: add drainResultsTo(dest) and
  getResultCount(). Drain transfers the task-local buffer's contents
  to the destination AND clears the local list, so the per-task heap
  is released immediately after merge instead of dangling on the task
  reference until the next major GC.
- MSGFPlus.runMSGFPlus: pre-size the merged ArrayList based on
  sum(t.getResultCount()) before draining, avoiding the resize-and-copy
  cycle that addAll otherwise triggers as it doubles the backing array.
  Then submittedTasks.clear() drops the strong refs to all 12 task
  instances right after the wall summary, so their per-task heap
  (specScanner, scanner, aaMass arrays, etc.) is collectible before
  the FDR / write phase.

TestConcurrentMSGFPlus: new drainsTaskLocalResultsIntoCallerBuffer
test pins the new drain semantics — destination receives all entries,
source is empty afterwards.

Astral 3-arm parity (clean re-run, post-Rancher-shutdown system):
  armA wall=848.8s targets=89479 decoys=46792
  armB wall=752.2s targets=89479 decoys=46792 (-11% vs A)
  armC wall=798.2s targets=89360 decoys=46913 (-6% vs A)
Percolator at 1% FDR: armB 35818 / armC 35767 — both bit-identical
to dev. All 8 parity numbers match exactly.

Scoped tests: 59/59 green.
…list

MSGFDB.java mirrored the same pattern PR #25 fixed in MSGFPlus.java:
each task's ScoredSpectraMap was constructed with a
Collections.synchronizedList(specKeyList.subList(...)) wrapper. The
sublist is a read-only view of the parent specKeyList, accessed only
from the single worker thread that owns the task's ScoredSpectraMap;
the synchronization wrapper served no purpose.

This is the legacy MS-GFDB CLI path, kept consistent with the
matching cleanup landed in MSGFPlus.

Deliberately NOT changed in this commit: the global resultList at
MSGFDB.java:296 is still Collections.synchronizedList(...). Unlike
MSGFPlus, the MSGFDB path still has N worker threads concurrently
calling scanner.addDBSearchResults(sharedList, ...). The previous
PR #25 commit (957f6c2) removed the `synchronized` modifier from
DBScanner.addDBSearchResults; the synchronizedList wrapper is now
what keeps individual gen.add(...) calls atomic. Removing it without
also doing the per-task-buffer refactor that MSGFPlus got would
introduce a real data race on the gen list.

The full per-task-buffer refactor for MSGFDB (matching what
MSGFPlus has via getResults() + drainResultsTo()) is a follow-up PR.
MSGFDB doesn't have an Astral-style parity benchmark to validate
against, so it deserves its own focused validation cycle rather than
piggy-backing here.
… change)

Cleanup pass over the perf/search-sync-cleanup branch. No behavior
change; tests still pass; net -43 LOC.

ConcurrentMSGFPlus.java
- TaskWallStats: 14-line final class → 2-line Java 17 record. Same
  immutable shape, less ceremony.
- Drop volatile on the wallStats field — the only reader is the main
  thread after executor.awaitTermination, which establishes happens-
  before per JLS §17.4.5. The volatile-plus-comment was self-
  contradictory.
- Trim restate-the-code Javadocs from getResults / getResultCount /
  drainResultsTo / getWallStats and the resultList field. Replaced
  with a one-line WHY comment on the wallStats field about the
  happens-before rationale.
- Use the short-form ArrayList import instead of the fully-qualified
  java.util.ArrayList<>().

MSGFPlus.java
- Promote "msgfplus.useForkJoin" to USE_FORK_JOIN_PROPERTY constant
  alongside the existing TASKS_PER_THREAD_PROPERTY pattern. Move both
  constants near DISABLE_THREADING at the top of the class to match
  CompactSuffixArray.SA_BUILD_THREADS_PROPERTY convention.
- Extract shutdownPoolNow(executor, fjp) helper. Three identical
  catch-block ternaries collapse to one call each.
- Add fjp.shutdownNow() on the FJP early-return-on-Future-failure
  path so any still-running tasks get interrupted instead of running
  to completion in the background.
- Drop the T1/T2/T3 planning-doc tags from comments — those are
  internal task-numbers from the work plan that don't belong in
  source per the project's CLAUDE.md "no comments referencing the
  task / caller" rule.
- Trim "default thread pool", "FJP-mode pool", "retain task references
  for the tail summary", and "T1: tail-imbalance summary" labels that
  restated the code. Kept the JLS §17.4.5 happens-before comment
  before the drain loop (genuine non-obvious WHY) and the deferred-
  ScoredSpectraMap-construction comment.

Skipped from the review:
- SysProps.intOr extraction across resolveTasksPerThread +
  CompactSuffixArray.resolveSortThreads — only two copies; per
  CLAUDE.md "three similar lines is better than premature
  abstraction."
- Median helper extraction — MassErrorStat.median takes List<Float>
  not List<Long>; wrapping is more code than the inline version.
- Per-task ArrayList pre-size from subListSize — would re-add a
  constructor parameter that the per-task-buffer commit deliberately
  removed.

Scoped tests green: 66/66.
The sysprop introduced in commit c3afe11 (T2) duplicated functionality
the existing CLI flag -tasks -N already provides. Both produce
numTasks = numThreads * N when no -tasks is specified explicitly.
Having two parallel knobs for the same thing is clutter, not feature.

Removed: TASKS_PER_THREAD_PROPERTY constant, resolveTasksPerThread()
helper. The default multiplier survives as DEFAULT_TASKS_PER_THREAD
constant, used inline.

Users who want a different multiplier should use -tasks -N. The
ParamManager already documents this (negative argument means "tasks
per thread").

The -Dmsgfplus.useForkJoin sysprop is kept — it's not a tuning knob,
it's an A/B measurement toggle for an alternative executor we're
not committing to. CLI exposure would imply users should reach for
it; sysprop is the right level for an opt-in evaluation flag.

Scoped tests green: 70/70.
Move classes into clearer top-level packages so the CLI entry points,
write-side outputs, and parser support are colocated:

- ui.MSGFPlus, ui.MSGFDB                          -> cli.*
- mzid.DirectPinWriter, DirectTSVWriter,
  Unimod, UnimodComposition                       -> output.*
- net.pempek.unicode.UnicodeBOMInputStream        -> parser.UnicodeBOMInputStream
- mslibsearch.ProcessedSpectrum                    deleted (no remaining refs)

Update the executable jar wiring to follow the new MSGFPlus location:
- META-INF/MANIFEST.MF Main-Class
- pom.xml mainClass entries (assembly + shade plugin)

Pure rename + import-fix; no behavior change. clean compile +
scoped tests pass (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter -- 40 tests, 0 failures, 0 errors).
- search-sync-cleanup.md: ship record for PR #25 (search-path sync
  cleanup + per-task result buffers).
- parameter-modernization.md: proposed plan to migrate the custom
  edu.ucsd.msjava.params hierarchy to picocli with a thin
  MSGFPlus-specific compat layer for legacy aliases, config files,
  and repeated mod/CustomAA entries. Status: proposed.

The parameter-modernization plan is documentation only on this
branch -- it explicitly recommends a separate refactor branch for
implementation, not bundling with this performance cleanup.
Picocli is the foundation for the parameter-modernization plan
(see .claude/plans/parameter-modernization.md): replace the custom
edu.ucsd.msjava.params hierarchy with declarative @option flags,
auto-generated help, and a thin compat layer for the legacy
config-file format.

The inventory artifact enumerates every MSGFPlus CLI flag (34 total:
27 visible, 7 hidden) plus the parsing semantics each one currently
relies on (asymmetric tolerance, dynamic-enum registries, range
inclusivity flips, trailing-! number quirk, ~13 config-file aliases,
etc.). It is the foundation for the typed cli.MSGFPlusOptions class
in the next commit.

No code changes yet -- dep + docs only.
Declarative replacement for the imperative addParameter() calls
in ParamManager.addMSGFPlusParams(). All 34 MSGFPlus flags (27
visible + 7 hidden) are declared as picocli @option fields with
typed Java fields where the Java type is natural (File, Integer,
Double, String).

Complex domain-typed flags (precursor tolerance, isotope-error
range, ms-level range, output format, precursor-cal mode, dynamic
enums for fragmentation/instrument/enzyme/protocol) are captured
as raw String/Integer for now; the Phase 1c adapter will round-trip
them through the existing params.Parameter#parse(String) hierarchy
to preserve current behavior. Phase 3 collapses that round-trip
once the old hierarchy is deleted.

No wiring change yet -- MSGFPlus.main() still uses
ParamManager.parseParams(). Adapter and main() switch land in
follow-up commits. Compile passes.
Phase 1 of parameter-modernization. All non-config-file MSGFPlus
invocations now flow through:

    argv -> picocli (MSGFPlusOptions) -> MSGFPlusOptionsAdapter -> ParamManager

instead of imperative ParamManager.parseParams. -conf still falls
through to the legacy path; Phase 2 ports the config-file reader.

The adapter round-trips each typed field through the existing
params.Parameter#parse(String) hierarchy so the downstream
SearchParams.parse(ParamManager) build path is unchanged. Phase 3
collapses that round-trip when the old hierarchy is deleted.

New equivalence test (MSGFPlusOptionsAdapterTest) asserts that the
picocli path populates the same ParamManager state as legacy
parseParams for a representative CLI corpus, including asymmetric
tolerance ("-t 0.5Da,2.5Da"). Picocli's required-flag enforcement
also verified.

Scoped test sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc,
TestRunManifestWriter, SearchParamsTest, TestPercolator,
MSGFPlusOptionsAdapterTest): 45 tests, 0 failures, 0 errors.
Drop the special-case "-conf in argv -> legacy parseParams" fallback
from MSGFPlus.parseArgs. The picocli path now handles every MSGFPlus
invocation, including config-file inputs. Two changes:

1. -s and -d are no longer picocli-required. They were already
   declared isOptional=true on the legacy ParamManager side
   (FileParameter.setAsOptional in addSpecFileParam(true) /
   addDBFileParam(true)), so requiring them at the picocli level
   was stricter than legacy. Without this change, a valid
   "-conf params.txt" invocation (where SpectrumFile / DatabaseFile
   live in the config file) would be rejected by picocli before the
   adapter could populate ParamManager.

2. The config-file overlay is unchanged. SearchParams.parseConfigParamFile
   already runs after CLI parsing and only fills in
   !commandLineParam.isValueAssigned() entries (SearchParams.java:588),
   so CLI flags continue to override config-file values transparently.

The flow is now: argv -> picocli -> MSGFPlusOptions -> adapter ->
ParamManager -> SearchParams.parse (config-file overlay happens here).
No legacy parseParams call remains in the MSGFPlus entry point.

Test: picocliPathRejectsMissingRequiredFlags replaced with
picocliPathAcceptsConfigOnlyInvocation (picocli must accept
"-conf X" without -s/-d, and the adapter must record the config
file path so SearchParams.parse can read it).

Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc,
TestRunManifestWriter, SearchParamsTest, TestPercolator,
MSGFPlusOptionsAdapterTest): 45 tests, 0 failures, 0 errors.
Net -874 / +2 lines.

Removed:
- src/main/java/edu/ucsd/msjava/cli/MSGFDB.java -- deprecated since
  2012 (version 8091, "08/06/2012"), no external references in src,
  docs, pom.xml, or manifest. The class was @deprecated and only
  referenced its own legacy entry point.
- ParamManager.addMSGFDBParams() -- only caller was the deleted
  MSGFDB.main.
- ParamManager.addMSGFParams() and addMSGFLibParams() -- no entry
  points existed at all (cli.MSGF, cli.MSGFLib never created); both
  were pure dead code accumulating since the original mzid pipeline
  removal.
- ParamManager.addOutputFileParam() / addDBFileParam(String, String,
  boolean) -- private helpers used only by the deleted methods.
- ParamNameEnum.OUTPUT_FILE -- the "-o for MSGF and MS-GFDB" variant;
  MSGFPlus uses SEARCH_OUTPUT_FILE (same key "o", different label).
  ParamManager.getOutputFileParam() now reads SEARCH_OUTPUT_FILE
  directly so SearchParams.parse continues to resolve the user's -o.
- ParamNameEnum.C13, NNET, UNIFORM_AA_PROBABILITY -- MSGFDB-only.
- docs/ms-gfdb.md -- documentation for the removed tool.
- TestMinSpectraPerThread.msgfdbEntryPointAlsoRegistersTheParam --
  test for the removed addMSGFDBParams.

The remaining 3 tests in TestMinSpectraPerThread continue to verify
MSGFPlus's MIN_SPECTRA_PER_THREAD param.

Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc,
TestRunManifestWriter, SearchParamsTest, TestPercolator,
MSGFPlusOptionsAdapterTest, TestMinSpectraPerThread): 48 tests, 0
failures, 0 errors.
Replace String/Integer holders for the four range-shaped flags with
small typed value classes carrying their own parse/converter logic:

- cli/PrecursorTolerance (Tolerance left, Tolerance right) -- handles
  both symmetric ("20ppm") and asymmetric ("0.5Da,2.5Da") forms,
  validates unit-match and non-negative, exposed via picocli converter.
- cli/IntRange (int min, int max) -- inclusive range, accepts "min,max"
  or single "n" (interpreted as "n,n"), used by -ti, -msLevel, -index.

MSGFPlusOptions fields now use these typed objects directly:
  precursorTolerance:  String -> PrecursorTolerance
  isotopeErrorRange:   String -> IntRange
  msLevel:             String -> IntRange
  specIndexRange:      String -> IntRange

The adapter still round-trips them through Parameter.parse(String)
via toString() for now -- that round-trip goes away in Phase 4c
step 3 when SearchParams.parse reads from MSGFPlusOptions directly
and the legacy params/ hierarchy is deleted.

Existing equivalence test still passes (asymmetric "0.5Da,2.5Da"
case included). Build green.
SearchParams now reads directly from MSGFPlusOptions; the legacy
ParamManager + Parameter.parse round-trip is gone.

What changed:

1. MSGFPlusOptions grew effective*() resolvers for every flag
   (with defaults: 6/40 lengths, 2/3 charges, proton mass, etc.),
   typed lookups for ActivationMethod/InstrumentType/Enzyme/Protocol
   from the existing registries, and an applyConfigFile(File) overlay
   that reproduces the legacy alias rewrites (IsotopeError ->
   IsotopeErrorRange, FragmentationMethod -> FragmentationMethodID,
   etc., 14 total) plus collects DynamicMod/StaticMod/CustomAA into
   ordered lists. CLI fields take precedence -- config-file values
   only fill in null fields.

2. SearchParams.parse(MSGFPlusOptions) replaces parse(ParamManager).
   The 35+ paramManager.getXxx() call sites translate one-for-one
   to opts.effective*() / opts.* reads. parseConfigParamFile is
   gone (its work is now opts.applyConfigFile). Spectrum-format
   "isSupported" check is now a small whitelist instead of routing
   through FileParameter.

3. AminoAcidSet.getAminoAcidSetFromModFile and the new
   getAminoAcidSetFromModEntries (replacing getAminoAcidSetFromList)
   take MSGFPlusOptions and call opts.setMaxNumModsFromMetadata when
   the loaded mod metadata declares a different value, mirroring the
   former paramManager.setMaxNumMods bidirectional handshake.

4. MSGFPlus.main no longer creates a ParamManager. argv -> picocli
   -> MSGFPlusOptions -> SearchParams.parse(opts), with picocli's
   own usage()/printToolInfo()/printJVMInfo() replacing the
   ParamManager equivalents.

5. Adapter (MSGFPlusOptionsAdapter) deleted -- the round-trip via
   Parameter.parse(String) is no longer needed.

6. Tests migrated to construct MSGFPlusOptions directly:
   SearchParamsTest, TestRunManifestWriter, TestDirectPinWriter,
   TestPrecursorCalScaffolding, TestPrecursorCalIntegration,
   TestPercolator, TestMSUtils, TestIPRG, TestCollaboration,
   TestCandidatePeptideGrid(ConsideringMetCleavage), TestSA,
   TestMinSpectraPerThread. MSGFPlusOptionsAdapterTest deleted
   (adapter is gone). TestIntRangeParameter deleted (params/
   IntRangeParameter is the next thing to go in Phase 3).

Validation: scoped sweep -- 73 tests, 0 failures, 0 errors, 5 skipped.
The legacy params/ hierarchy still compiles but has no live callers
on the MS-GF+ path; Phase 3 deletes it next.
After Phase 4c routed SearchParams + AminoAcidSet through
MSGFPlusOptions directly, the entire params/ package became
unreferenced on the live MSGFPlus path. Deleting:

- ParamManager (1,059 LOC after Phase 4a/4b cleanup) -- replaced by
  cli.MSGFPlusOptions + its effective*() resolvers and applyConfigFile().
- Parameter, NumberParameter, RangeParameter (abstract bases)
- IntParameter, FloatParameter, DoubleParameter,
  IntRangeParameter, FloatRangeParameter (typed leaf classes)
- StringParameter, FileParameter, FileListParameter (file/string types)
- ToleranceParameter (replaced by cli.PrecursorTolerance)
- EnumParameter, ObjectEnumParameter (enum machinery; the dynamic
  enum dispatch now lives inline in MSGFPlusOptions.effective*())
- ParamParser (legacy config-file reader; replaced by
  MSGFPlusOptions.applyConfigFile)
- CaseInsensitiveLinkedHashMapParam, CaseInsensitiveMap (the
  Parameter map that backed ParamManager)

Two small helper types (ParamObject interface + UserParam config-file
helper) are still consumed by msutil's runtime registries
(ActivationMethod, InstrumentType, Enzyme, Protocol). They have no
dependency on the rest of the params/ hierarchy, so they relocate
to edu.ucsd.msjava.msutil where their consumers already live.

Validation:
- Clean compile (main + tests) passes.
- Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc,
  TestRunManifestWriter, SearchParamsTest, TestPercolator,
  TestMinSpectraPerThread, TestPrecursorCalScaffolding,
  TestCandidatePeptideGrid + ConsideringMetCleavage):
  73 tests, 0 failures, 0 errors, 5 skipped.

The legacy MSGFPlus CLI surface is fully preserved via the typed
picocli @option fields + applyConfigFile()'s alias rewrites; this is
purely a maintainability cleanup that drops ~2,100 LOC of custom
parameter-parsing scaffolding.
Remove legacy text-format parsers; only MGF and mzML are retained.

- Delete MS2SpectrumParser, PklSpectrumParser, PNNLSpectrumParser,
  PNNLSpectraIterator, PNNLSpectraMap from parser/
- Delete dead-code SPTxtParser, SpectrumParserWithTitle,
  TSVParser, TSVResultParser, FullyBufferedLineReader from parser/
- Delete SpectraMapByTitle and SpectrumAccessorByTitle from msutil/
  (only callers were in the deleted SPTxtParser)
- Remove MS2/PKL/DTA_TXT/MZDATA entries from SpecFileFormat
- Prune corresponding branches from SpectraAccessor.getSpecMap()
  and getSpecItr() and getSpectrumIDFormatCvParam()
- Prune Spectrum.getSpectrumFileFormat() to mzML + MGF only
- Update SearchParams.isSupportedSpectrumFormat() to mzML + MGF only
- Update MSGFPlusOptions -s description to list only *.mzML, *.mgf
- Remove @ignore generateTRexPRMSpectra test (used deleted TSVParser)
Pure file-move + package/import update; no behaviour change.

- git mv the 6 remaining files under parser/ into mgf/
  (BufferedLineReader, BufferedRandomAccessLineReader, LineReader,
   MgfSpectrumParser, SpectrumParser, UnicodeBOMInputStream)
- Update package declaration in each moved file from
  edu.ucsd.msjava.parser → edu.ucsd.msjava.mgf
- Update import edu.ucsd.msjava.parser.* → edu.ucsd.msjava.mgf.*
  across 13 callers in fdr/, mzml/, msutil/, msscorer/, msdbsearch/
The Phase 4c review surfaced four issues that all bottomed out in
small details of the new typed-options path. Bundling the fixes plus
a regression test:

1. **CustomAA= crash (critical, was code review issue #1).**
   AminoAcidSet.getAminoAcidSetFromModEntries was prepending
   "CustomAA=" to each entry before handing it to parseConfigEntry,
   but parseConfigEntry only strips the "nummods=" prefix -- every
   other line is split on commas and modInfo[0] is parsed as a mass
   or empirical formula. With the prefix attached, modInfo[0] became
   "CustomAA=C3H5NO" which fails Double.parseDouble + Composition.getMass
   and triggers the System.exit(-1) in the caller. Any -conf file with
   a CustomAA= line crashed the process. MSGFPlusOptions.applyConfigEntry
   already strips the "Key=" prefix when populating opts.customAAs;
   the fix is to drop the literal at AminoAcidSet:826 so the bare
   value reaches parseConfigEntry, matching how staticMods/dynamicMods
   are passed at line 831. New regression test
   MSGFPlusOptionsConfigFileTest.configFileWithCustomAAParsesWithoutCrashing
   pins this with a tiny synthetic config file.

2. **-decoy default doc.** @option description now says "Default: XXX"
   (the actual value returned by effectiveDecoyPrefix and the same
   constant the legacy code used via MSGFPlus.DEFAULT_DECOY_PROTEIN_PREFIX).
   The "Default: DECOY_" string was wrong since Phase 1a.

3. **Single-file spectrum-format null check.** The single-file branch
   in SearchParams.parse used to silently store a null SpecFileFormat
   into DBSearchIOFiles when the user supplied -s file.bogus, which
   later NPE'd at MSGFPlus:305 (specFormat.getPSIName()). It now
   short-circuits with the same message the directory branch's
   isSupportedSpectrumFormat filter implies:
   "Spectrum file extension does not match a supported format
   (*.mzML, *.mgf): <name>".

4. **Unrecognized config-key URL hint.** The legacy parseConfigParamFile
   tracked an invalid-parameter counter and, after closing the file,
   printed the example-params URL hint exactly once if the count was
   non-zero. MSGFPlusOptions.applyConfigFile now restores that
   behaviour with a private unrecognizedConfigEntries counter
   incremented inside the default branch of applyConfigEntry, plus
   the same end-of-file hint.

Scoped tests pass (68 tests, 0 failures, 0 errors).
The Phase 4c MSGFPlusOptions.effectiveActivationMethod() switch
hardcoded indices 0..3 (ASWRITTEN/CID/ETD/HCD) and threw
IllegalArgumentException for index 4. The legacy
addFragMethodParam(ActivationMethod.ASWRITTEN, doNotAddMergeMode=true)
hid the registry's FUSION (slot 4), so the user-facing menu was
0..3 + UVPD at index 4. The Phase 4c rewrite silently dropped UVPD
support; this commit restores it.

- Add case 4: return ActivationMethod.UVPD; to the switch.
- Update the @option description to enumerate 4=UVPD.
- New unit test (MSGFPlusOptionsActivationMethodTest) pins the
  full 0..4 mapping plus the default and the out-of-range guard.

docs/msgfplus.md already documents -m 4 = UVPD; this brings the
code in line with the doc.
Remove stale references that no longer match the modernized code:

- README.md: Quick-Start examples now write .pin (default) or
  use -outputFormat tsv; the deleted MzIDToTsv conversion step is
  gone. The "What is MS-GF+?" paragraph and "What is different in
  this fork?" bullets describe the actual current state (mzIdentML
  output removed; spectrum input narrowed to mzML + mgf only;
  picocli-based CLI). The required-input table now shows
  *.mzML / *.mgf only and the -o default is [input].pin.
- docs/msgfplus.md: -s synopsis trimmed to "*.mzML or *.mgf" in
  three places; -allowDenseCentroidedPeaks no longer mentions
  mzXML; the duplicated "mzML, mzXML, mzML" typo is fixed.
- docs/examples/MSGFPlus_Params.txt: SpectrumFile comment trimmed
  to *.mzML / *.mgf.
- docs/readme.md: Input/Output summary now matches the fork's
  actual format support; obsolete ms-gfdb.md link removed (the
  doc and entry point were deleted in 5a2ec4e).
- docs/troubleshooting.md: FASTA-split workaround now describes
  concatenating .pin / .tsv outputs (mzIdentML and MzidMerger no
  longer apply). The OpenMS TOPPAS workaround now feeds the .pin
  via PercolatorAdapter instead of importing a non-existent .mzid.
- src/test/resources/MSGFDB_Param.txt: removed showDecoy=1 and
  uniformAAProb=auto (both ParamNameEnum entries were dropped in
  5a2ec4e and now produce "unrecognized parameter" warnings on
  every test run). Normalized ParentMassTolerance ->
  PrecursorMassTolerance and IsotopeError -> IsotopeErrorRange to
  use the canonical keys; the canonicalConfigKey aliases keep the
  old names working but the test fixture should be self-documenting.

No code changes; the existing scoped test sweep continues to pass.
The user code-reviewed the unpushed branch tip and surfaced three
behaviour regressions that all date back to the Phase 4c
ParamManager retire (commit 03f32c1) plus a few small cleanups.

P1 -- config-file keys are now matched case-insensitively, restoring
the legacy ParamManager.parseConfigParamFile semantics. The Phase 4c
switch was exact-case so test fixtures using lowercase first-letter
keys (e.g. "minCharge=2", "maxCharge=5" in MSGFDB_Param.txt) were
silently dropped to defaults instead of overriding them. The fix
lowercases canonicalConfigKey output and makes every applyConfigEntry
case label lowercase. New regression test
configFileKeysAreMatchedCaseInsensitively pins the contract with a
mix of canonical, lowercased-first-letter, and ALLCAPS forms.

P2 -- invalid enum-like CLI indices (-m 99, -inst 99, -e 99,
-protocol 99) and out-of-range numerics now produce a clean
user-facing error from SearchParams.parse instead of an
IllegalArgumentException stack trace from the resolver. validate()
is now invoked in place of validateRequired() and runs the bounds
checks up-front.

P2 -- restored the legacy IntParameter.minValue/maxValue range
checks for: -thread, -tasks, -minSpectraPerThread, -minLength,
-maxLength, -minCharge, -maxCharge, -n, -ntt, -tda, -verbose,
-addFeatures, -allowDenseCentroidedPeaks, -edgeScore,
-ignoreMetCleavage, -maxMissedCleavages, -numMods, -ccm, -u, -m,
-inst, -e, -protocol, plus the hidden flags. New unit test
validateRejectsOutOfRangeFlags pins a representative set.

Polish:
- Drop the dead Phase 1 / MSGFPlusOptionsAdapter / ParamManager
  rollout narrative from the MSGFPlusOptions class header (both
  the adapter class and ParamManager were deleted in earlier
  commits; the comment was stale).
- Collapse the unrecognizedConfigEntries field + local probe
  counter in applyConfigFile into a single counter reset at start
  and read at end -- one piece of state, simpler control flow.
- Strip the CRLF (\r before \n) on docs/examples/MSGFPlus_Params.txt
  and src/test/resources/MSGFDB_Param.txt, which git diff --check
  was flagging as trailing whitespace. The rest of the codebase is
  LF-only.

Verified:
- mvn -B -o test on the scoped sweep (77 tests, 0 failures, 0 errors,
  3 skipped).
- SearchParamsTest no longer warns "unrecognized parameter
  'minCharge=2'" / "'maxCharge=5'" when loading MSGFDB_Param.txt.
- git diff --check on this commit is clean.
Drop the lingering String + numeric backcompat for the two enum-shaped
flags whose values are real names rather than IDs (per user direction:
'unless are options like 1 2 3 we should do only string'). After this
commit:

- -outputFormat accepts only `pin` / `tsv` (case-insensitive). The
  legacy numeric forms `0` / `1` are no longer recognised; users on
  those invocations should switch to the named values. This is a
  deliberate breaking change called out in the parameter-modernization
  cleanup -- consistency over backcompat in this corner.
- -precursorCal continues to accept only `auto` / `on` / `off`
  (case-insensitive), but now via picocli's typed enum matcher rather
  than a String + fromString fallback. Invalid values fail fast at
  parse time instead of silently mapping to AUTO.

Numeric-ID flags (-m, -inst, -e, -protocol) and 0/1 boolean-style
flags (-tda, -verbose, -addFeatures, -allowDenseCentroidedPeaks,
-edgeScore, -ignoreMetCleavage, -u, -ntt) keep their integer types --
those values are IDs, not names.

Implementation:

- New cli.OutputFormat enum (PIN, TSV).
- MSGFPlusOptions.outputFormat: String -> OutputFormat.
- MSGFPlusOptions.precursorCalMode: String -> SearchParams.PrecursorCalMode.
- effectiveOutputFormat() now returns OutputFormat (was int 0/1).
- effectivePrecursorCalRaw() collapsed into effectivePrecursorCal()
  returning the typed enum.
- applyConfigEntry parses both flags via Enum.valueOf so config-file
  values like `OutputFormat=pin` and `PrecursorCal=auto` flow through
  the same case-insensitive contract as the CLI.
- SearchParams.outputFormat field: int -> OutputFormat. writePin() /
  writeTsv() helpers retained (callers in MSGFPlus.runMSGFPlus).
- SearchParams.PrecursorCalMode.fromString() deleted -- no callers
  after the resolver returns the typed enum directly.
- New static factory MSGFPlusOptions.commandLine(opts) returns a
  CommandLine with caseInsensitiveEnumValuesAllowed(true). All
  call sites (MSGFPlus.main + 5 test files) routed through it so
  enum case-insensitivity is uniform.
- docs/output.md updated to show `-outputFormat pin` / `tsv` and
  notes the numeric forms are no longer accepted.

Tests: TestDirectPinWriter.outputFormatAcceptsOnlyPinAndTsv pins the
new contract (numeric/legacy values rejected, named values accepted
case-insensitively). TestPrecursorCalScaffolding migrated to enum
constants and to a picocli rejection check for invalid values. The
old fromString-fallback test is replaced by the rejection test.

Scoped sweep: 78 tests, 0 failures, 0 errors, 4 skipped.
Expand the vNEXT entry to cover the full PR #25 modernization stack:

- CLI parser modernisation: picocli-driven MSGFPlusOptions; -conf
  flow with case-insensitive keys + 13 legacy aliases preserved;
  numeric/enum range validation surface restored.
- Breaking changes: -outputFormat numeric forms (0/1) removed in
  favour of named pin/tsv (typed enum); -precursorCal switched to
  typed enum (rejects unknown values instead of silently mapping
  to AUTO); spectrum input narrowed to mzML + mgf only (mzXML, MS2,
  PKL, _dta.txt parsers deleted); deprecated MSGFDB entry point and
  its dead MSGF/MSGFLib siblings removed.
- Internal refactor: edu.ucsd.msjava.params package deleted (~2,100
  LOC across 18 classes); package reorg (ui/ -> cli/, mzid/ -> output/,
  parser/ -> mgf/, net.pempek.unicode -> mgf); new typed value
  classes (MSGFPlusOptions, PrecursorTolerance, IntRange, OutputFormat);
  picocli 4.7.6 dep added.
- Bench gate: prior Astral 3-arm run confirmed bit-identical PSM
  target/decoy counts (89,479 / 46,792) between baseline and new
  branch in -precursorCal off mode. Table embedded in the entry.
- Earlier in cycle: precursor calibration + Percolator pin output
  bullets retained.

No code change; pure docs.
Combined sweep of cuts surfaced by the post-modernization LOC audit.
Net change: 13 files, +57 / -2,131 = -2,074 LOC.

### Deleted dead classes (-2,051 LOC, 6 files)

Verified zero callers across src/main and src/test before removal:

- msscorer/ScoringParameterGenerator (732 LOC) and
  ScoringParameterGeneratorWithErrors (880 LOC) -- standalone main()
  scoring-param tools from the pre-modernization era. Pre-built
  scoring model .tsv files are committed in resources; these
  generators are not invoked at search time and have no remaining
  consumers.
- output/Unimod (85 LOC) and output/UnimodComposition (133 LOC) --
  residue from the deleted mzIdentML write side. Atom.java's "Unimod
  mod bricks" comment is the only remaining reference and refers to
  upstream data, not the deleted classes.
- msgf/ToolLauncher (154 LOC) -- abstract launcher with no concrete
  implementations.
- msutil/ScoredString (67 LOC) -- duplicate of fdr/ScoredString with
  no live callers (all usages resolve to the fdr.* version).

### Inlined effective*() resolvers + helper collapse (-23 LOC net)

- ~20 of the trivial `field != null ? field : default` resolvers in
  MSGFPlusOptions are inlined at the SearchParams.parse call sites.
  The non-trivial registry-resolving ones (effectiveActivationMethod,
  effectiveInstrumentType, effectiveEnzyme, effectiveProtocol) and a
  handful of frequent ones (effectiveOutputFormat, effectiveMin/Max
  PeptideLength etc.) stay since their length pays for itself.
- ActivationMethodAvailability nested class collapsed -- it only
  hid one InstrumentType.getAllRegisteredInstrumentTypes().length call
  that is now inline in validate().
- SearchParams.getOutputFormat() and SearchParams.writePin() removed
  -- writePin() had two callers in MSGFPlus.java which now use
  !writeTsv() instead; getOutputFormat() had zero callers.

### stripComment dedup (-7 LOC)

The two implementations of "split-on-#-and-trim" (SearchParams.
getConfigLineWithoutComment and MSGFPlusOptions.stripComment)
collapsed: stripComment is the canonical version (package-public),
SearchParams.getConfigLineWithoutComment delegates to it, and
AminoAcidSet.parseConfigEntry calls stripComment directly.

### Validation surface expanded

MSGFPlusOptions.validate() now also rejects a -mod / ModificationFile=
path that does not exist, returning a user-facing error string. New
regression test (validateRejectsMissingModificationFile) pins both
the CLI path and the config-file path.

Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator,
TestMinSpectraPerThread, TestPrecursorCalScaffolding,
TestCandidatePeptideGrid + ConsideringMetCleavage,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest):
78 tests, 0 failures, 0 errors, 3 skipped.
…quences

Net change: 21 files, +24 / -409 = -385 LOC. Combines the audit's
medium-value pass with a parallel sweep across the codebase for
methods and fields with no live callers.

### Dependencies dropped from pom.xml

- commons-text: zero direct usages.
- commons-io: NullOutputStream replaced by Java 11's
  OutputStream.nullOutputStream() in ConcurrentMSGFPlus;
  FilenameUtils.removeExtension replaced by an inline lastIndexOf('.')
  + substring in BuildSA.

Removing commons-text also drops its transitive commons-lang3, which
was used by mgf/BufferedRandomAccessLineReader for Pair<String,
Integer>. Replaced with a small local record BomStripResult
(text, bomLength) — single-purpose, no API consumers, simpler at
the use sites (result.text() / result.bomLength() vs result.getKey()
/ result.getValue()).

### Dead code removed across packages

Verified zero remaining callers for each:

- fdr/TargetDecoyAnalysis: ~38 lines of commented-out legacy
  constructors that referenced a non-existent TargetDecoyPSMSet
  class. Pure comment cleanup.
- fdr/TSVPSMSet: 14 LOC of dead methods.
- mgf/MgfSpectrumParser: 35 LOC of dead methods.
- msgf/AAFrequencyCounter: 44 LOC of dead helpers.
- msgf/MassListComparator: 15 LOC.
- msscorer/NewRankScorer + NewScorerFactory: 31 LOC of dead
  helpers (kept all live scoring-path code per CLAUDE.md invariants).
- msutil/AminoAcidSet, Composition, CompositionFactory, IonType,
  Peptide: ~110 LOC of dead helpers and fields.
- sequences/FastaSequence + FastaSequences + ProteinFastaSequence
  + ProteinFastaSequences: ~76 LOC of dead methods.

### Other cleanup

- SearchParams.toString: drop 6 lines of commented-out spectrum-list
  output and switch StringBuffer (synchronized, single-threaded
  caller) to StringBuilder.

Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator,
TestMinSpectraPerThread, TestPrecursorCalScaffolding,
TestCandidatePeptideGrid + ConsideringMetCleavage,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest):
78 tests, 0 failures, 0 errors, 3 skipped.
Four test files (TestDirectPinWriter, TestPrecursorCalScaffolding,
TestPrecursorCalIntegration, TestRunManifestWriter) had nearly
identical buildOpts()/parsedSearchParams() helpers loading the
standard MSGFDB_Param.txt + test.mgf + human-uniprot-contaminants.fasta
fixture set. Collapse into a single shared
SearchTestFixtures.standardOpts() helper in src/test/java/edu/ucsd/
msjava/cli; the two tests that need extra fields (output file,
maxMissedCleavages) call standardOpts() and tweak the result.

Net change: -25 LOC, plus tests now stale-import-free.

Verified: scoped sweep (42 tests across the 4 affected files +
MSGFPlusOptionsConfigFileTest + MSGFPlusOptionsActivationMethodTest
+ SearchParamsTest): 0 failures, 0 errors.
Scrub 31 Java source files of five comment categories:
  - single-line javadoc restating the method/field name
  - block javadoc containing only @param/@return lines
  - /* (non-Javadoc) */ markers
  - section headers (GETTERS / SETTERS / CONSTRUCTORS / etc.)
  - commented-out code blocks (dead debug printlns, old algorithms,
    inactive fields, alternative scorer variants)

Non-obvious WHY comments, license headers, performance invariants,
CLI contract javadoc, and Picocli @option descriptions are preserved
unchanged.

LOC delta: -2262 lines, +42 lines across 31 files.

Scoped test sweep: 78 tests run, 0 failures, 0 errors, 3 skipped
(TestMSUtils×1, TestSA×2 — pre-existing skips unrelated to this change).
Continues the comment cleanup pass with a focused removal of
disabled-but-kept Java code masquerading as comments. Net change:
32 files, +1 / -323 = -322 LOC.

What was removed:

- fdr/TargetDecoyAnalysis: dead 10-line if/else block at the tail of
  getFDRMap() that re-set fdrMap edges no longer needed after the
  q-value conversion.
- fdr/ComputeFDR: 2-line commented-out specFileCol guard.
- msdbsearch/{CandidatePeptideGrid,CandidatePeptideGridConsideringMetCleavage,
  CompactFastaSequence, CompactSuffixArray, DBScanner, MassErrorStat,
  PSMFeatureFinder} and msgf/{ProfileGF, FlexAminoAcidGraph,
  ScoredSpectrumSum}: scattered // foo() / /* alt impl */ blocks
  documenting old algorithm variants and disabled debug printlns.
- msutil/{AminoAcid, Composition, Constants, IonType, VolatileAminoAcid}:
  ~90 LOC of commented-out static initializers, alternate composition
  tables, and debug-only branches.
- sequences/{FastaSequence, FastaSequences, ProteinFastaSequences}
  and suffixarray/SuffixArray: ~37 LOC of dead disabled stubs.

Genuine prose comments (TODOs, performance invariants, why-explanations,
URLs to upstream issues, the `applyShift` bit-identical-path guard)
are preserved.

Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator,
TestMinSpectraPerThread, TestPrecursorCalScaffolding,
TestCandidatePeptideGrid + ConsideringMetCleavage,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest):
78 tests, 0 failures, 0 errors, 3 skipped.
Two follow-ups from the PR #25 review thread:

1. BOM strip in BufferedLineReader (Copilot finding).
   The constructor wrapped the FileInputStream in
   UnicodeBOMInputStream but never called skipBOM() -- BOM bytes
   leaked into line 1, breaking config / mod / FASTA files saved by
   editors that prepend a UTF-8 BOM (Windows Notepad, some VS Code
   configurations, etc.). Fix: chain .skipBOM() in the super(...)
   call. The BOM-stripping path is per-instance; the wrapper still
   detects the BOM in its constructor, we just now consume it.

   New test BufferedLineReaderTest pins the contract: a file with a
   UTF-8 BOM followed by `ParentMassTolerance=20ppm` returns the
   bare key=value as line 1 (not `ParentMassTolerance=20ppm`),
   and a no-BOM file is unchanged.

2. Delete unused MSGFResult.java.
   Verified zero callers in src/main and src/test before removal.
   The reviewer flagged this as a small follow-up to land alongside
   the modernization sweep.

Verified: scoped sweep (BufferedLineReaderTest,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest,
SearchParamsTest, TestPrecursorCalScaffolding, TestRunManifestWriter,
TestDirectPinWriter, TestMSUtils, TestMisc, TestMinSpectraPerThread):
55 tests, 0 failures, 0 errors, 2 skipped.
Convert eight value-shaped types to Java records, per the audit:
small immutable carriers with no inheritance constraints, where the
record's auto-generated equals/hashCode/toString and accessor methods
replace hand-rolled boilerplate. No behavioral change; net trim of
~80 LOC plus a clearer surface.

Converted:

- cli.IntRange -- record IntRange(int min, int max). Compact
  constructor keeps the min<=max validation. parse() and
  Converter (picocli ITypeConverter) preserved.
- cli.PrecursorTolerance -- record PrecursorTolerance(Tolerance left,
  Tolerance right). Compact constructor enforces the matching-unit
  and non-negative invariants in-place; parse() / Converter retained.
- msutil.CvParamInfo -- record CvParamInfo(String accession, String
  name, String value, String unitAccession, String unitName). The
  stored hasUnit field is gone; hasUnit() is now derived from
  unitAccession != null. Compatibility getters (getAccession(),
  getValue(), getUnitAccession(), getUnitName(), getHasUnit()) keep
  existing call sites untouched.
- msutil.Annotation -- record Annotation(AminoAcid prevAA, Peptide
  peptide, AminoAcid nextAA). Setters (which had no live callers)
  removed; the parsing constructor delegates via this(...).
  isProteinNTerm/isProteinCTerm and the dotted toString are kept.
- msgf.ProfilePeak<T extends Matter> -- record. Setters were unused;
  Comparable<ProfilePeak<T>> is preserved.
- msutil.Atom -- record Atom(String code, double mass, int
  nominalMass, String name). Static atomArr table + atomMap +
  static initializer block all kept verbatim. getCode() / getName() /
  getMass() / getNominalMass() retained as compatibility wrappers.
- msdbsearch.CompactSuffixArray.RangeMetadata -- record (4 file/int
  fields). Three call sites updated from md.numEntries to
  md.numEntries() etc.
- msgf.MassListComparator.MatchedPair<T extends Matter> -- record.
  getMass1/getMass2 retained as compatibility wrappers.

Call-site updates in SearchParams.parse: tol.left/.right, isotope.min/.max,
specIdx.min/.max, ms.min/.max all switched to record accessor methods
(.left()/.right()/.min()/.max()).

Verified: scoped sweep (incl. BufferedLineReaderTest,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest,
SearchParamsTest, TestPrecursorCalScaffolding, TestRunManifestWriter,
TestDirectPinWriter, TestMinSpectraPerThread, TestMSUtils, TestSA,
TestMisc, TestCandidatePeptideGrid + ConsideringMetCleavage):
80 tests, 0 failures, 0 errors, 3 skipped.
perf(search): instrumentation + dead-sync removal + per-task buffers + ForkJoinPool toggle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants