Performance Engineering
A True Performance Engineering team
Incredibly skilled performance engineering team
- Most teams in industry are really Performance QA (aka Run, Report, Repeat)
We’ve made every codebase we’ve ever touched faster, no matter how brilliant the original coders
- Many examples: Amazon Cloud, GPUs, ML Training & Inference, Oracle DB, Adv Analytics, Oracle Apps, Graph Analytics, Storage, OS, HW, …
We are passionate and never satisfied with performance
- Don’t accept others’ preconceptions
- Work across the stack and at scale, targeting near-linear scaling
- Work at all levels of SW/HW, dive deep to find real performance issues
- Innovative at using things for unintended purposes
Decades of being heavily involved with HW/SW Codesign
Performance Engineering Team History
A truly unique blend of the right experiences that positions us for AI. (overview graphic and details)
Team accelerated Amazon Services (2017 - Present)
- Team accelerated more than 60 AWS services: ML, Analytics, Database, Streaming, Graviton, Trainium, Inferentia, EC2, Networking, Containers, Lambda, and the newest services.
Team accelerated Apps/Analytics and Codesigned SW-in-Silicon with Oracle acquisition (2010)
- Focused on Commercial Apps, DB IM columnar, ETL/DSL/Java Streams, Oracle Advanced Analytics with featurization of complex DB data, ML Analytics, Spark, Graph, NoSQL, Java/JVM, HW accelerators, HW/SW codesign, Cloud (SaaS, PaaS, IaaS), and preliminary TensorFlow work, all along engineering better performance across the stack.
Team merged App/DB/HPC tuning with Sun acquisition (1996)
- Sun thought big servers were boutique, but the team made more than $50B for Sun on high-end servers. Over 540 world records and innovations throughout the stack for high-end near-linear parallelism and scalability on Apps (ERP, SCM, HR, CRM), DBs (including 1st columnar DBs), and HPC.
Team codesigned 1st high-end general servers, then learned Commercial Apps/DB with Cray/SGI acquisition (1991)
- Patent with 84-way i860 SMP (FPS Matrix Co-Processor), then SPARC vector server, which led to the 1st effective 64-way SPARC SMP server. Became Oracle DB experts, then created the 1st scalable multi-TB DB and Data Warehousing + OLTP optimizations that established Cray as the fastest commercial database and app servers.
Team started at Floating Point Systems (attached HPC accelerators of the mid-80s) and stuck together through 4 acquisitions
- HPC apps, Math Library optimization, microcode VLIW SW pipelining (Eigen, sparse solvers, multi-radix FFTs, intrinsics, which led to Auto Diff innovations for non-linear solvers), MPP HW/SW codesign (Hypercube, Torus), compute intensity optimization for attached GPU
On-Prem Acceleration

Below is a list of the wide variety of performance engineering experiences that inform our AI performance tuning.
Architecture
- Enterprise Architecture
- Processor Architecture
- HW Accelerator Architecture
- Server Architecture
- System Architecture
Analytics Apps (Full-stack optimization)
- Spark ML (training/scoring asymmetry), BLAS3 opt
- Oracle Adv Analytics (training/scoring asymmetry)
- Oracle PGX Graph
- TensorFlow
- SAS, SPSS, FPSMath (made public ’89)
- Homegrown statistics packages
- Oracle Spatial
- TensorFlow, MXNet, Gluon, Numpy, R, Python, matplotlib, ...
Transactional Apps (Full-stack optimization)
- Fusion Apps (Java)
- SOA
- Oracle E-Business, SAP, PeopleSoft, Siebel, JD Edwards, Fusion Apps, Manugistics, Baan, …
Data Management
- ETL for ML, BigData SQL, SAS ETL, Informatica ETL
- Spark SQL
- Columnar In-memory: Oracle, Sybase IQ, Expressway
- Oracle NoSQL, Cassandra NoSQL, key-value
- Oracle DB, MySQL, DB2, ...
- Data Warehousing, Datamarts, In-memory Aggregation
- Kafka Streaming
Java/JVM/GC
- Java Streams (HW DAX)
- REST (Jersey, Grizzly)
- Intrinsics - inline assembly accelerators
Cross-stack examples
- Moving DB functions to Disk controllers
- Hybrid Columnar Compression
MPP/Cloud
- Oracle Cloud, MPP, 3D torus, Vector Hypercube, Dataflow machine, …
- Storage, Network, Compute optimization
- Matrix co-processor
- Attached Processors (GPU)
Parallel Performance
- Near-linear Scaling (MPP, NUMA, SMP), major restructuring of algorithms for parallelism
- Modeling/estimation
- Instrumentation (w/ myriad of tools)
- Analysis (Shortfall), Rectification
- OS: sched, thread tuning, lock splitting
CPU
- VLIW SW pipelining
- RISC/CISC optimization: RAW, etc.
- Vectorization
- SPARC, x86, i860, Cray Y-MP, FPS-VLIW, FPS XP-32, Transputer, systolic arrays, …
- In-memory accelerations (DAX, …)
HPC
- Financial Derivatives
- Signal processing, Beam-forming, …
- Structural Analysis
- Computational Chemistry
- Physics (CFD, MHD, QCD, QED, …)
- EDA
- Seismic Oil/Gas
- Gov
- Ad Hoc Customer
- MPI
- OpenMP
- OpenCL
- InfiniBand, Ethernet Clusters
Math Library
- Solvers, Eigen, mixed-radix FFT, Derivative, Seismic, conv/deconv, linear prog, conjugate gradient, Strassen matmul, Winograd, compression, simulation, signal processing…
- Out-of-Core equation solvers
- BLAS 3, 2, 1
- Intrinsics (various precision)
- Automatic Differentiation & nonlinear solvers
- High-accuracy long accumulator solvers
- Interval Math
Memory
- Compute intensity optimization
- Data vectorization
- BW, bisection BW
Network
- REST (Jersey, Grizzly)
- Small-packet optimization
- Large-packet optimization
- Structured asynchronous pipelining for MPPs
- Interrupt tuning/scalability
- IB, various network tech
- Storage optimization, Filesystem, QFS, …
Security/Crypto
- Security Kernels
- Secure Network
- Secure Filesystem
- Oracle TDE
- Oracle Data Redaction
- SSM (Silicon Secured Memory)
Virtualization
- LDoms, Zones
- Optimized Virtualized Storage, Network, & CPU
This journey gave us a truly rare blend of experiences that are critical for the next steps in AI performance.