Spatz Platform #168
Open: Bumblebee00 wants to merge 105 commits into pulp-platform:devel from Bumblebee00:spatz-integration
Commits
86cfce6
add placeholdeer code for spatz platform
Bumblebee00 0774556
code generation with generic c code
Bumblebee00 a8fd323
modified spatz c code to use proper memory allocation and copying fun…
Bumblebee00 bbf2751
tmp commit to send to badie103
Bumblebee00 b1a5868
modified Makefile and cmakefiles to build and use spatz runtime. ugly…
0acf6e7
vsim simulator runnable by deeployRunner
8e4e8c3
Removed reverence to conda environment and added commands to create v…
9cab786
typo
e3c46c4
double gvsoc temporaney configuration
651d4cf
reunited gvsoc build for spatz and other platforms
908261d
forgot comment
edc461f
forgot comment
cd13ce4
added topk generic binding (hardcoded k=10)
f72e147
added matmul softmax and topk generic bindings to spatz
93db81b
switched default simulator to gvsoc bc is faster
eee910c
added topk test network
1523825
added sparse attention test network
8b99e3a
added topk binding to generic platform
b541e17
improved generic gather node to support more than one index
e35433d
added gather binding for spatz
7eb2e35
now for any k inot the graph
8f99e67
added big attention
bfcdda6
first draft of tiling (not working)
bdcdd70
minimalloc fix
64e3874
fixed makefile adding missing things
23017e8
added yaml for my conda environment
8b46fd5
modified bindings to use snrt_dma_wait_all function
3187d64
added different dimensions of FP32/MatMul
1624642
fixed simulation staling issue
08ddf29
fixed memory levels
27f9558
added cycles indication
e77936d
added fp32 matmul kernel that uses vector instructions
0798c50
went back to not using snrt_l3alloc because its not working
8e552d7
[format] alignment fix
6a412c0
[tiling] enable tiling extension on Spatz
81dbb89
[template] Add proper allocation template and fix DMA template
eee94d5
[sw] use memcpy instead of DMA for DRAM buffer init
6e3a0f0
changed commit hash to include new version of spatz that has snrt_l3a…
8f073d1
removed redundant memcpy
ae0ef79
added gather only test
020df36
improved gather template
ea9e073
added nice dimensions of Matmul
693d1cb
improved main
3112c0f
spatz matmulfunction now splits work between cores
f209ae4
fixed name of input in graph
a8f547b
removed unnecessary print
b083883
gather tiling
213e832
topk tiling
3aca310
updated tiling to work with constant buffers
fae9d65
updated tiling to work with nodes with >1 output
9c4f9e7
fixed topk template
3e225aa
added softmax tiled (not working)
0ad2cfc
added another softmax test
5085400
added softmax function with custom exp and inv functions
aad67f8
detect quiet nan float output
97b9762
use non vector for when one dim is one
d73f9d8
fixed matmul kernel
5b6b1ca
divide on p to have double performance
7c2d7a9
added benchmarking code transformation pass
3e85ef2
include necessary for benchmark code
8942e51
improved code transformation passes for allocation of all necessary b…
7b42738
split gather workload on two cores
a1cfdd1
cleaner matmul template
3360028
initial version of dynamic dma code
61bc767
fixed sizes of gather template
da83d8b
fixed sizes of gather template
b51d1c7
fixed gather tile contraint (removed contraint on tile fo data_in)
84c1a9a
removed unnecessary prints
124b5b3
modified matmul variable to be bigger integers
5f723c4
removed very long memcopy
a1a83bb
added hw barrier
b61b850
added new tests
d57d8b6
added another main.c for spatz
be21c13
improved hack to avoid memcpy
d2f342e
new faster expf function, and softmax smartly devided between cores
82cc8a6
minor fixes
aeeef40
added new template for topk that uses min heap
4c72054
smal fix
7b6f9b2
added topk function
a297ec9
added new test networks
cb6b34a
work on matmul kernel
e817f2f
[fix] enforce 8-byte alignment for memory allocation
40661ea
[fix] 8-byte aligned tiling constraints
5238f32
[cleanup] removed deadcode, cleaned allocation template and bindings
b9c5cef
[cleanup] revert debug print
9fdf584
[cleanup] only keep necessary changes in tiling extension
d98e6c1
[cleanup] fix lint issues (partially)
f1a556b
[cleanup] revert docstring
17b3f32
cleaned topk template
e229dad
added matmul columns reduction function
9365193
better condition for column matmul
afab3a0
how to do double buffering on spatz
1ebcf9b
investigated double buffering
a103702
Merge branch 'spatz-integration' of github.com:Bumblebee00/Deeploy in…
83a67a5
removed unnecessary tests
cbdf855
modified tests
b941002
removed unised comments
c414332
necessary spatz makefile modification
c25a855
unecessary comment
7b0f848
missing slash in makefile
348eda3
removed old gather templates
855065e
removed comments
e773691
removed unnecessary matmuls
82f7071
rmoved unnecessary softmax
ba320c6
renamed .yml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -10,8 +10,18 @@ | ||
| width = int(data_in_type.referencedType.typeWidth/8) | ||
| %> | ||
| BEGIN_SINGLE_CORE | ||
| % if num_indices == 1: | ||
| Review comment (Contributor): why do we have this case here? what is the meaning of | ||
| Reply (Author): this avoids generating a loop with a single iteration, which gives cleaner C code. | ||
| for (uint32_t i=0; i<${batch}; ++i) { | ||
| memcpy(${data_out} + i * ${axis_length}, ${data_in} + i * ${batch_length} + ${index} * ${axis_length}, ${axis_length} * ${width}); | ||
| } | ||
| % else: | ||
| for (uint32_t i=0; i<${batch}; ++i) { | ||
| for (uint32_t j=0; j<${num_indices}; ++j) { | ||
| memcpy(${data_out} + i * (${num_indices} * ${axis_length}) + j * ${axis_length}, | ||
| ${data_in} + i * ${batch_length} + ${indices}[j] * ${axis_length}, | ||
| ${axis_length} * ${width}); | ||
| } | ||
| } | ||
| % endif | ||
| END_SINGLE_CORE | ||
| """) | ||
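
The review exchange above concerns the `num_indices == 1` branch. As a rough reference model (plain Python, not part of the PR; the function names and the flat-list layout are assumptions made for illustration), the two template branches can be mirrored to check that the special case only drops the inner loop and yields the same output as the general case:

```python
def gather_general(data_in, batch, batch_length, axis_length, indices):
    """Mirror of the `% else` branch: one slice copy per (batch, index) pair."""
    num_indices = len(indices)
    out = [None] * (batch * num_indices * axis_length)
    for i in range(batch):
        for j in range(num_indices):
            src = i * batch_length + indices[j] * axis_length
            dst = i * (num_indices * axis_length) + j * axis_length
            out[dst:dst + axis_length] = data_in[src:src + axis_length]
    return out


def gather_single(data_in, batch, batch_length, axis_length, index):
    """Mirror of the `num_indices == 1` branch: the inner loop is dropped."""
    out = [None] * (batch * axis_length)
    for i in range(batch):
        src = i * batch_length + index * axis_length
        dst = i * axis_length
        out[dst:dst + axis_length] = data_in[src:src + axis_length]
    return out


# Two batches, each holding three slices of length two.
data = list(range(12))
assert gather_single(data, 2, 6, 2, 1) == gather_general(data, 2, 6, 2, [1])
```

Under these assumptions the single-index branch is purely a code-size and readability choice; it does not change which elements are copied.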
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| from typing import Dict, List, Tuple | ||
| | ||
| from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation | ||
| | ||
| | ||
| referenceTemplate = NodeTemplate(""" | ||
| // TopK (Name: ${nodeName}, Op: ${nodeOp}) | ||
| BEGIN_SINGLE_CORE | ||
| // Find the top ${k_value} values and their indices | ||
| // Assumes 1D input for simplicity | ||
| typedef struct { | ||
| ${data_in_type.referencedType.typeName} value; | ||
| uint32_t index; | ||
| } topk_pair_t; | ||
| | ||
| topk_pair_t pairs[${data_in_size}]; | ||
| for (uint32_t i = 0; i < ${data_in_size}; ++i) { | ||
| pairs[i].value = ((${data_in_type.referencedType.typeName}*)${data_in})[i]; | ||
| pairs[i].index = i; | ||
| } | ||
| // Simple selection sort for top-k | ||
| for (uint32_t i = 0; i < ${k_value}; ++i) { | ||
| uint32_t max_idx = i; | ||
| for (uint32_t j = i + 1; j < ${data_in_size}; ++j) { | ||
| if (pairs[j].value > pairs[max_idx].value) { | ||
| max_idx = j; | ||
| } | ||
| } | ||
| // Swap | ||
| if (max_idx != i) { | ||
| topk_pair_t tmp = pairs[i]; | ||
| pairs[i] = pairs[max_idx]; | ||
| pairs[max_idx] = tmp; | ||
| } | ||
| // Write output | ||
| ((${values_out_type.referencedType.typeName}*)${values_out})[i] = pairs[i].value; | ||
| ((${indices_out_type.referencedType.typeName}*)${indices_out})[i] = pairs[i].index; | ||
| } | ||
| END_SINGLE_CORE | ||
| """) |
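
A minimal Python model of the selection loop in this template may help when checking its behaviour. It is an illustrative sketch, not code from the PR: `topk_selection` is a hypothetical name, and a list of `(value, index)` tuples stands in for the `pairs` array.

```python
def topk_selection(values, k):
    """Partial selection sort: k passes, each moving the largest
    remaining value to position i and emitting it with its original index."""
    pairs = [(v, i) for i, v in enumerate(values)]
    out_values, out_indices = [], []
    for i in range(k):
        max_idx = i
        for j in range(i + 1, len(pairs)):
            # Strict comparison, as in the template.
            if pairs[j][0] > pairs[max_idx][0]:
                max_idx = j
        pairs[i], pairs[max_idx] = pairs[max_idx], pairs[i]
        out_values.append(pairs[i][0])
        out_indices.append(pairs[i][1])
    return out_values, out_indices


top_values, top_indices = topk_selection([0.5, 2.0, -1.0, 2.0, 1.5], 3)
```

Each of the `k` passes scans the remaining tail, so the work grows roughly as `k` times the input size. In this model a `k` larger than the input raises `IndexError`; the template shown above contains no comparable check.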
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| from functools import partial | ||
| | ||
| from Deeploy.DeeployTypes import CodeTransformation, NodeBinding | ||
| from Deeploy.CommonExtensions.CodeTransformationPasses.MemoryAllocation import ArgumentStructGeneration, \ | ||
| MemoryManagementGeneration | ||
| from Deeploy.Targets.Spatz.CodeTransformationPasses.Benchmarking import SpatzBenchmarkInnerPass, SpatzBenchmarkOuterPass | ||
| | ||
| from Deeploy.FutureExtension.CodeTransformationPasses.FutureCodeTransformation import FutureGeneration | ||
| from Deeploy.AbstractDataTypes import PointerClass | ||
| from Deeploy.CommonExtensions.DataTypes import IntegerDataTypes, SignedIntegerDataTypes, float32_t, int8_t, int32_t | ||
| from Deeploy.Targets.Generic.TypeCheckers import GatherChecker, MatMulChecker, TopKChecker, SoftmaxChecker | ||
| | ||
| from Deeploy.CommonExtensions.CodeTransformationPasses.Closure import ClosureGeneration, MemoryAwareClosureGeneration | ||
| from Deeploy.Targets.Snitch.CodeTransformationPasses.SnitchClusterTiling import SnitchClusterTiling | ||
| from Deeploy.Targets.Snitch.CodeTransformationPasses.SnitchClusterSynch import SnitchSynchCoresPass | ||
| from Deeploy.Targets.Spatz.DMA.SpatzDma import SpatzDma | ||
| from Deeploy.Targets.Spatz.Templates import GatherTemplate, MatMulTemplate as SpatzMatMulTemplate, TopKTemplate, SoftmaxTemplate | ||
| from Deeploy.Targets.Generic.Templates import MatMulTemplate, FloatMatMulTemplate | ||
| from Deeploy.TilingExtension.CodeTransformationPasses.TilingVariableReplacement import TilingVariableReplacement, \ | ||
| TilingVariableReplacementUpdate | ||
| | ||
| TilingCallClosure = partial(ClosureGeneration, closureSuffix = "_tiling_closure") | ||
| MemoryAwareFunctionCallClosure = partial(MemoryAwareClosureGeneration, | ||
| closureSuffix = "_closure", | ||
| startRegion = "L3", | ||
| endRegion = "L1") | ||
| | ||
| BasicTransformer = CodeTransformation( | ||
| [ArgumentStructGeneration(), | ||
| MemoryManagementGeneration(), | ||
| FutureGeneration()]) | ||
| | ||
| TiledTransformer = CodeTransformation([ | ||
| TilingVariableReplacement("L1"), | ||
| TilingCallClosure(writeback = False), | ||
| SnitchSynchCoresPass(), # snrt_cluster_hw_barrier() | ||
| # SpatzBenchmarkInnerPass(), # <- attention: increases runtime and benchmarks only when tiling loop has one iteration | ||
| TilingVariableReplacementUpdate("L1"), | ||
| SnitchClusterTiling("L3", "L1", SpatzDma()), | ||
| # SpatzBenchmarkOuterPass(), # <- attention: increases runtime and benchmarks only when tiling loop has one iteration | ||
| ArgumentStructGeneration(), | ||
| MemoryManagementGeneration("L1"), | ||
| MemoryAwareFunctionCallClosure(writeback = False, generateStruct = True), | ||
| MemoryManagementGeneration("L3"), | ||
| MemoryManagementGeneration(), | ||
| ]) | ||
| | ||
| SpatzGatherBindings = [ | ||
| NodeBinding( | ||
| GatherChecker( | ||
| [PointerClass(float32_t), PointerClass(type)], | ||
| [PointerClass(float32_t)] | ||
| ), | ||
| GatherTemplate.dynamicDMAtemplate, | ||
| TiledTransformer | ||
| ) for type in IntegerDataTypes | ||
| ] | ||
| | ||
| # with tiled transformer | ||
| SpatzMatMulBindings = [ | ||
| NodeBinding(MatMulChecker([PointerClass(int8_t), PointerClass(int8_t)], [PointerClass(int32_t)]), | ||
| SpatzMatMulTemplate.spatzSIMatMulTemplate, TiledTransformer), | ||
| NodeBinding( | ||
| MatMulChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]), | ||
| SpatzMatMulTemplate.spatzFloatMatMulTemplate, TiledTransformer) | ||
| ] | ||
| | ||
| # without tiled transformer | ||
| ''' | ||
| SpatzMatMulBindings = [ | ||
| NodeBinding(MatMulChecker([PointerClass(int8_t), PointerClass(int8_t)], [PointerClass(int32_t)]), | ||
| SpatzMatMulTemplate.spatzSIMatMulTemplate, BasicTransformer), | ||
| NodeBinding( | ||
| MatMulChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]), | ||
| SpatzMatMulTemplate.spatzFloatMatMulTemplate, BasicTransformer) | ||
| ] | ||
| ''' | ||
| | ||
| SpatzTopKBindings = [ | ||
| NodeBinding( | ||
| TopKChecker( | ||
| [PointerClass(float32_t), PointerClass(int32_t)], # inputs | ||
| [PointerClass(float32_t), PointerClass(int32_t)] # outputs | ||
| ), | ||
| TopKTemplate.minHeapTemplate, | ||
| TiledTransformer, | ||
| ) | ||
| ] | ||
| | ||
| | ||
| SpatzSoftmaxBindings = [ | ||
| NodeBinding( | ||
| SoftmaxChecker([PointerClass(float32_t)], [PointerClass(float32_t)]), | ||
| SoftmaxTemplate.floatTilingTemplate, | ||
| TiledTransformer | ||
| ) | ||
| ] |
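
The bindings file builds `TilingCallClosure` and `MemoryAwareFunctionCallClosure` with `functools.partial`, fixing some constructor arguments once and supplying the rest where the pipeline is assembled. The sketch below shows that pattern in isolation; `ClosurePass` is a hypothetical stand-in written for this example and is not Deeploy's `ClosureGeneration`, whose real signature is not shown in this diff.

```python
from functools import partial


class ClosurePass:
    """Hypothetical stand-in for a closure-generating pass (not Deeploy's class)."""

    def __init__(self, closureSuffix="_closure", writeback=True):
        self.closureSuffix = closureSuffix
        self.writeback = writeback


# Pre-bind the suffix once; remaining arguments are supplied per pipeline.
TilingClosure = partial(ClosurePass, closureSuffix="_tiling_closure")

pass_a = TilingClosure(writeback=False)  # suffix fixed, writeback overridden
pass_b = TilingClosure()                 # suffix fixed, writeback left at default
```

Calling the partial object still constructs an ordinary `ClosurePass` instance, so the resulting objects can sit in a pass list next to passes created directly.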
Deeploy/Targets/Spatz/CodeTransformationPasses/Benchmarking.py (23 changes: 23 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| from Deeploy.DeeployTypes import CodeGenVerbosity, CodeTransformationPass, ExecutionBlock, NetworkContext, NodeTemplate, CodeSnippet, _NoVerbosity | ||
| | ||
| | ||
| class SpatzBenchmarkInnerPass(CodeTransformationPass): | ||
| def apply(self, ctxt: NetworkContext, executionBlock: ExecutionBlock, name: str, verbose: CodeGenVerbosity = _NoVerbosity): | ||
| if "include_benchmark" not in ctxt.globalObjects: | ||
| ctxt.hoistGlobalDefinition("include_benchmark", "#include <benchmark.h>\n") | ||
| if "include_printf" not in ctxt.globalObjects: | ||
| ctxt.hoistGlobalDefinition("include_printf", "#include \"printf.h\"\n") | ||
| tsop = NodeTemplate(""" tsop = benchmark_get_cycle();\n""") | ||
| teop = NodeTemplate(""" teop = benchmark_get_cycle();\n""") | ||
| executionBlock.codeSnippets.insert(1, CodeSnippet(tsop, {})) | ||
| executionBlock.codeSnippets.append(CodeSnippet(teop, {})) | ||
| return ctxt, executionBlock | ||
| | ||
| class SpatzBenchmarkOuterPass(CodeTransformationPass): | ||
| def apply(self, ctxt: NetworkContext, executionBlock: ExecutionBlock, name: str, verbose: CodeGenVerbosity = _NoVerbosity): | ||
| t0 = NodeTemplate(""" uint32_t t0, tsop, teop, te;\n t0 = benchmark_get_cycle();\n""") | ||
| te = NodeTemplate(f"""te = benchmark_get_cycle();if (snrt_is_dm_core()) {{printf(\"Benchmark of {name}:\\n\");\nprintf(\"data_in=%d; op=%d; data_out=%d; total=%d\\n\\n\", tsop-t0, teop-tsop, te-teop, te-t0); }}\nsnrt_cluster_hw_barrier();""") | ||
| | ||
| executionBlock.addLeft(t0, {}) | ||
| executionBlock.addRight(te, {}) | ||
| return ctxt, executionBlock |
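
The two passes differ mainly in where they place their timestamps. The sketch below replays the same list operations on a plain `deque` to show the resulting order. The snippet names are hypothetical, the middle step is only a stand-in for whatever the tiling pass wraps around the block, and mapping `addLeft`/`addRight` to `appendleft`/`append` is an assumption about `ExecutionBlock`.

```python
from collections import deque

# Hypothetical contents of an execution block before benchmarking.
block = deque(["first_snippet", "kernel_call"])

# Inner pass: insert(1, ...) lands after the first snippet, append goes last.
block.insert(1, "tsop")
block.append("teop")

# Stand-in for the tiling pass adding transfers around the block (assumed).
block.appendleft("dma_in")
block.append("dma_out")

# Outer pass: assumed equivalent of addLeft / addRight.
block.appendleft("t0")
block.append("te")

order = list(block)
print(order)
```

With this ordering, `tsop - t0` covers the input side, `teop - tsop` the operator, and `te - teop` the output side, which matches the labels printed by the outer pass.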
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Validate `k` as a constant scalar and bound it to input size.
Line 2916 assumes `k_in.values[0]` exists and is valid. If `k` is non-constant, scalar-shaped differently, or `k > data_in_size`, the generated TopK loop indexes beyond the local `pairs` buffer.
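
One way the requested check could look is sketched below. It is an assumption-laden illustration, not the reviewer's suggested fix: `validate_k` is a hypothetical helper, `k_values` stands in for the constant buffer's values (with `None` meaning "not a constant"), and rejecting `k < 1` is a choice made for this sketch.

```python
def validate_k(k_values, data_in_size):
    """Return k as an int, or raise ValueError if it cannot safely bound the loop."""
    if k_values is None:
        raise ValueError("TopK: k must be a compile-time constant")
    flat = list(k_values)
    if len(flat) != 1:
        raise ValueError(f"TopK: k must be a scalar, got {len(flat)} values")
    k = int(flat[0])
    if not 1 <= k <= data_in_size:
        raise ValueError(f"TopK: k={k} is outside [1, {data_in_size}]")
    return k


k_value = validate_k([10], 64)
```

Failing at code-generation time in this way would keep an out-of-range `k` from ever reaching the generated loop.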