Senior Cloud Data Engineer with production experience at Mercedes-Benz South Africa, co-engineering the MO360 global manufacturing data platform — a Microsoft partnership unifying production data across 30+ plants worldwide. Currently directing engineering at Mathe Tech, building enterprise-grade intelligent automation suites.
My core work spans modern cloud data platforms and Azure-native ELT pipelines (Databricks, PySpark, Delta Lake), event-driven serverless architecture (Azure Functions, Storage Queues, Blob Storage), and production-grade AI/ML systems. I bridge the gap between complex backend data infrastructure and real business outcomes — from 3NF relational schema design and columnstore analytics through to CI/CD-hardened cloud deployments.
- 🏭 Ex-MBSA — MO360 Data Platform, Manufacturing Reporting Services
- 🚀 Founder / Lead AI Engineer — Architecting proprietary agentic suites at Mathe Tech
- ⚡ Migrated a legacy ETL pipeline → real-time ELT, cutting cloud costs ~60%
- 🤖 Building multi-agent LLM systems with CrewAI + Anthropic Claude
- 🏗️ Architected end-to-end cloud-native platforms: Urban Pulse Analytics, VaxTrace Cloud, SchemaForge Studio
- 📊 Honours postgraduate in Data Analytics/Science (NQF 8)
- 🌍 English · Zulu · Xhosa (all fluent)
Cloud, Infrastructure & DevOps
July 2025 – Present | Cape Town, South Africa
- System Architecture: Defined and delivered the end-to-end technical lifecycle and commercial deployment strategy for a proprietary B2B Lead Generation, Full-Stack Automation, and Agentic AI Suite.
- Multi-Agent Orchestration: Designed an asynchronous multi-agent orchestration framework utilizing CrewAI patterns and Anthropic Claude 3.5 Sonnet to automate multi-source scraping (LinkedIn, Google Maps, Hunter.io) and synthesize context-aware B2B outreach models.
- Medallion Data Processing: Built a Bronze-Silver-Gold Lakehouse topology using PySpark and Delta Lake. Implemented a "Pandas Bridge" pattern for unstructured third-party data sanitization, ensuring idempotent updates via Delta Lake `MERGE INTO` semantics.
- Governance & Observability: Enforced structural data integrity using custom Data Quality Gates alongside a UUID-based Audit Spine capturing sub-second execution logs and 4-hop lineage tracking for complete "Time-Travel" forensic auditing.
- Conversational Intelligence: Built automated omnichannel WhatsApp Business chatbot funnels powered by Flask webhooks and Twilio SDK architectures, mapping context-aware knowledge layers (YAML configuration targets) to live conversation threads.
- CI/CD & Analytics: Implemented a strict 30-test `pytest` unit-testing runner automated via GitHub Actions, publishing runtime performance metrics and financial pipeline intelligence directly to a multi-page Streamlit telemetry dashboard.
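
The idempotent `MERGE INTO` behaviour described above can be illustrated without Spark. This is a minimal sketch in plain Python, assuming a hypothetical `lead_id` key and an in-memory dict standing in for the Delta table; it is not the Mathe Tech implementation.

```python
from typing import Dict, Iterable, Mapping

Row = Mapping[str, object]


def merge_into(target: Dict[str, dict], batch: Iterable[Row], key: str = "lead_id") -> Dict[str, int]:
    """Upsert `batch` into `target`, keyed on `key` (hypothetical column name).

    Matched keys are updated and unmatched keys are inserted, so replaying
    the same batch leaves `target` unchanged.
    """
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row in batch:
        k = str(row[key])
        new = dict(row)
        if k not in target:
            target[k] = new
            stats["inserted"] += 1
        elif target[k] != new:
            target[k] = new
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
    return stats


table: Dict[str, dict] = {}
batch = [
    {"lead_id": "a1", "company": "Acme", "score": 0.7},
    {"lead_id": "b2", "company": "Globex", "score": 0.4},
]
first = merge_into(table, batch)
second = merge_into(table, batch)  # replay of the same batch changes nothing
```

Replaying a batch is the case that matters under retries: the second call reports every row as unchanged and the table keeps two rows.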
Feb 2024 – May 2025 | Manufacturing Reporting Services (MRS) · MO360 Data Platform
- Co-engineered the MO360 global manufacturing data platform (strategic Microsoft partnership), unifying production data across 30+ plants worldwide to support a 20% production efficiency improvement target.
- Led ETL → ELT architectural migration: consolidated Azure Functions + ADF batch upserts into a single PySpark Structured Streaming job on Databricks — reducing pipeline cloud cost contribution from ~80% to 30–40% and achieving near real-time data freshness with exactly-once processing.
- Designed and integrated Fault-Tolerant Dead Letter Queue (DLQ) error isolation patterns within serverless ingestion runtimes to intercept and quarantine malformed transaction packets, eliminating pipeline downtime.
- Data Platform Governance & DevOps: Programmatically enforced an automated principle-of-least-privilege data access framework on ADLS Gen2 using the Azure DevOps CLI, combining macro-level Azure RBAC with fine-grained Access Control Lists (ACLs).
- Implemented comprehensive telemetry via Azure Monitor, Application Insights, and Log Analytics; developed optimised KQL queries for real-time event monitoring via Azure Event Hubs and ADX clusters.
- Operated in a fast-paced Agile/ITIL environment (JIRA + ServiceNow), maintaining 99.9% SLA for critical production reporting.
- Served as primary technical liaison between Data Scientists and Business Users across simultaneous workstreams.
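
The dead-letter isolation idea mentioned above can be sketched as follows. This is a simplified, assumption-laden illustration: the message shape, the required `id` field, and the sample payloads are made up, and an in-memory list stands in for the real dead-letter queue.

```python
import json
from typing import Callable, List, Tuple


def process_with_dlq(messages: List[str], handler: Callable[[dict], None]) -> Tuple[int, List[dict]]:
    """Run `handler` on each message; quarantine failures instead of raising."""
    processed = 0
    dead_letters: List[dict] = []
    for raw in messages:
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict) or "id" not in payload:
                raise ValueError("missing required field 'id'")
            handler(payload)
            processed += 1
        except (ValueError, KeyError) as exc:
            # json.JSONDecodeError subclasses ValueError, so bad JSON lands here too.
            dead_letters.append({"raw": raw, "error": str(exc)})
    return processed, dead_letters


seen: List[dict] = []
ok, dlq = process_with_dlq(
    ['{"id": 1, "plant": "EL"}', "not-json", '{"plant": "EL"}'],
    seen.append,
)
```

One malformed message does not stop the batch: the valid message is handled, and the two bad ones are kept with their error text for later inspection.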
Jan 2023 – Dec 2023 | Work Integrated Learning · Cape Town
- Co-engineered a production-grade data system and web platform utilizing C# and ASP.NET Web Forms to streamline donor tracking, resource allocation, and aid distribution logistics.
- Designed, normalized, and deployed an optimized relational database schema in Microsoft SQL Server leveraging advanced indexing, stored procedures, and complex views to enable low-latency reporting.
- Enforced platform security by implementing strict query parameterization to eliminate SQL Injection risks, alongside secure session state management and hashing patterns.
- Managed the end-to-end software development lifecycle using Git within a simulated corporate framework.
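
Query parameterization, as described above, can be demonstrated in a few lines. The original platform used C# and SQL Server; this sketch swaps in Python's built-in `sqlite3` module purely so it runs anywhere, and the `donor` table and names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donor (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO donor (name) VALUES (?)", [("Alice",), ("Bob",)])


def find_donors(connection: sqlite3.Connection, name: str) -> list:
    # The value is bound as a parameter, never concatenated into the SQL text.
    cur = connection.execute("SELECT id, name FROM donor WHERE name = ?", (name,))
    return cur.fetchall()


hostile = "Alice' OR '1'='1"
safe_result = find_donors(conn, hostile)    # treated as a literal, matches nothing
normal_result = find_donors(conn, "Alice")
```

Because the hostile string is bound as data, it is compared as a literal name and returns no rows instead of widening the `WHERE` clause.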
Azure Functions v4 · C# / .NET 8 · Azure Storage Queues · Blob Storage · Azure SQL · Azurite · Docker · Bicep · GitHub Actions · xUnit / FluentAssertions
Production-grade, cloud-native vaccination record processing platform built to ingest heterogeneous provider data through an asynchronous Azure Storage Queue pipeline and surface queryable status in under one second. Runs 100% locally via Docker + Azurite — zero cloud credits required.
- Event-Driven Serverless Architecture: Implemented six Azure Functions combining HTTP and Queue triggers, mirroring the serverless ingestion and DLQ patterns applied at Mercedes-Benz South Africa (MBSA). The HTTP trigger validates each request and routes it onto the queue; the Queue trigger fires on message arrival, with the Functions runtime handling queue polling so no hand-written polling loop is needed.
- Multi-Format Message Parser: Engineered `MessageParser.cs` to detect and normalise two structurally different provider formats (`Id:Center:Date:Serial` Format A and `Barcode:Date:Center:Id` Format B) behind a single ingestion interface, replicating heterogeneous data intake challenges from manufacturing telemetry pipelines.
- Three-Tier Cloud Storage Pipeline: Raw JSON archived to Blob Storage (`{year}/{month}/{day}/format{A|B}/`) for audit compliance → structured records persisted to Azure SQL via idempotent `MERGE`-based stored procedure (`usp_UpsertVaccinationRecord`) with exactly-once semantics → complete `QueueMessageLog` audit trail in SQL.
- Idempotent SQL Layer: 3NF normalised schema with MERGE-based deduplication, the same idempotency pattern as the Mercedes-Benz Delta Lakehouse migration, applied to a relational SQL context.
- One-Command Local Stack: `docker-compose up -d` starts SQL Server 2022 + Azurite + schema-init + queue-init containers. Full Azure-equivalent environment in under 60 seconds.
- CI/CD with Real Integration Tests: GitHub Actions runs build, 17-test xUnit/FluentAssertions suite, and full SQL Server + Azurite integration tests on every PR. Bicep IaC enables one-command Azure deployment when credits are available.
- Stack: Azure Functions v4 · C#/.NET 8 · Azure Storage Queues · Azure Blob Storage · Azure SQL · Azurite · Docker Compose · Azure Bicep · GitHub Actions · xUnit · FluentAssertions
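
The two-format detection described above can be sketched in Python (the project itself is C#). The detection rule here is an assumption: it presumes dates are ISO `YYYY-MM-DD` strings and picks the format by which position holds the date. The record type, field names, and sample messages are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VaccinationRecord:
    record_id: str
    center: str
    given_on: date
    reference: str      # serial (Format A) or barcode (Format B)
    source_format: str


def _is_iso_date(text: str) -> bool:
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def parse_message(raw: str) -> VaccinationRecord:
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    if _is_iso_date(parts[2]):    # Format A: Id:Center:Date:Serial
        rid, center, day, ref = parts
        fmt = "A"
    elif _is_iso_date(parts[1]):  # Format B: Barcode:Date:Center:Id
        ref, day, center, rid = parts
        fmt = "B"
    else:
        raise ValueError("no ISO date found in the second or third field")
    return VaccinationRecord(rid, center, date.fromisoformat(day), ref, fmt)


a = parse_message("1001:CPT-01:2024-03-15:SN-778")
b = parse_message("BC-5521:2024-03-16:DBN-02:1002")
```

Both inputs come out as the same record type, which is what lets downstream storage code ignore the provider format.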
SQL Server 2022 · T-SQL · MongoDB · C# / ASP.NET Core · Docker · GitHub Actions · Bicep
Production-grade database engineering platform demonstrating the complete data lifecycle — from ERD and 3NF normalisation through advanced analytical SQL, polyglot NoSQL, and a CI/CD-validated test suite.
- 20-Table 3NF Schema: Surrogate PKs, explicit FK constraints, CHECK constraints, computed columns, and a dual-layer index strategy (operational covering indexes + columnstore on the Gold analytics layer).
- 8 Stored Procedures: Full business lifecycle coverage, including atomic order placement with `UPDLOCK`/`ROWLOCK` pessimistic concurrency to prevent stock oversell, idempotent MERGE-based inventory restocking with `OUTPUT` clause for audit diffs, `SAVE TRANSACTION` partial rollback for warehouse transfers, and paginated dynamic search via parameterised `sp_executesql` (zero injection risk).
- 4 Analytics Views: Window functions throughout, with `LAG`/`LEAD` for MoM revenue growth, customer cohort retention with `ROW_NUMBER` deduplication, `NTILE(4)` product tier classification with `RANK`/`DENSE_RANK`/`PERCENT_RANK` side-by-side, and delivery SLA breach detection with `LEAD` for next-event prediction.
- Advanced Analytics Queries: `GROUP BY ROLLUP` with `GROUPING()` subtotals, `GROUP BY CUBE` for all dimension combinations, `PIVOT` for monthly category revenue, Pareto ABC classification, and `FOR JSON PATH` output for API consumption.
- Runnable Normalisation Showcase: `09_normalisation_showcase.sql` creates temp tables at each stage (UNF → 1NF → 2NF → 3NF), inserts data, and runs verification queries proving zero transitive dependencies (executable documentation).
- MongoDB Polyglot Layer: Schema-validated collections with `$jsonSchema`, TTL index for raw event auto-expiry, `$facet` returning four result sets in one pipeline pass, `$lookup` with sub-pipelines, and `$geoNear` geospatial vendor lookup.
- 11 SQL Unit Tests + CI: Constraint enforcement, 3NF integrity checks (verifying no category name stored on Product table), and FK index coverage, run against a live SQL Server 2022 container on every PR.
- Stack: SQL Server 2022 (T-SQL, ACID) · MongoDB · C#/.NET 8 · ASP.NET Core · Docker · GitHub Actions · Azure Bicep · React dashboard
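
As a rough illustration of the `LAG`-based month-over-month view mentioned above, the query below runs against SQLite through Python's `sqlite3` module rather than SQL Server (window functions need SQLite 3.25 or newer). The table, column names, and figures are invented for the demo; the `LAG(...) OVER (ORDER BY ...)` form is also valid in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue INTEGER)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 1000), ("2024-02", 1200), ("2024-03", 900)],
)

# LAG pulls the previous row's revenue within the ordered window,
# so each month can be compared with the one before it.
rows = conn.execute(
    """
    SELECT month,
           revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS mom_delta
    FROM monthly_revenue
    ORDER BY month
    """
).fetchall()
```

The first month has no predecessor, so its delta is `NULL` (`None` in Python) rather than zero.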
ASP.NET Core (C#) · React · Azure SQL · Azure Functions · Azure Storage Queues · Blob Storage · Bicep · GitHub Actions
Production-grade, full-stack analytics platform engineered to ingest, process, and visualize real-time multi-city sensor telemetry across a Medallion-style architecture.
- Medallion Data Architecture: Bronze (raw blob ingestion) → Silver (Azure SQL, 3NF normalised) → Gold (pre-aggregated analytics snapshots) — identical layering to the Mercedes-Benz MO360 Lakehouse.
- Azure Functions Pipeline: Queue-triggered function ingests sensor readings from Azure Storage Queue, calls `usp_IngestSensorReading` stored procedure with anomaly detection (3-sigma rule), archives to Blob Storage, and refreshes Gold layer via `usp_RefreshAnalyticsSnapshot` MERGE upsert.
- Advanced SQL Views: Rolling 24h/7-day averages (`ROWS BETWEEN` frames), `LAG`/`LEAD` hour-over-hour deltas, `PERCENT_RANK` daily percentile ranking, city health scorecards with conditional aggregation, and alert SLA analysis with `STRING_AGG`.
- ASP.NET Core REST API: Fully async, strict CORS, Swagger/OpenAPI docs, Application Insights telemetry.
- React Dashboard: Real-time sensor status, city health scorecards, and trend charts via Recharts — live data from the API.
- IaC & CI/CD: Bicep provisions SQL Server, Function App, Storage Account, App Service, and Application Insights. GitHub Actions CI lints SQL (sqlfluff), builds .NET, builds frontend, runs security audit, and on merge to main runs full CD with smoke test.
- Stack: C#/.NET 8 · ASP.NET Core · React · Azure Functions · Azure SQL · Azure Storage · Azurite · Docker · Bicep · GitHub Actions
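
The 3-sigma rule used for anomaly detection above is simple enough to show directly. In the project it lives inside a stored procedure; this standalone Python sketch only illustrates the rule, and the baseline readings are made up.

```python
from statistics import mean, stdev
from typing import List, Sequence


def three_sigma_outliers(history: Sequence[float], readings: Sequence[float]) -> List[float]:
    """Return readings more than 3 sample standard deviations from the history mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: anything different from it is anomalous.
        return [r for r in readings if r != mu]
    return [r for r in readings if abs(r - mu) > 3 * sigma]


baseline = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 21.0, 19.0, 20.0]
flagged = three_sigma_outliers(baseline, [20.5, 35.0, 17.0])
```

With this baseline the mean is 20 and the sample standard deviation is about 1.15, so only the 35.0 reading falls outside the 3-sigma band.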
🤖 B2B Lead Generation Multi-Agent AI Pipeline (Private — Mathe Tech)
CrewAI · Anthropic Claude 3.5 Sonnet · XGBoost · MLflow · PySpark · Delta Lake · GitHub Actions
Production-grade agentic pipeline automating B2B lead discovery, intent scoring, and personalised outreach for South African enterprises.
- Modular multi-agent system (CrewAI + Claude) automating web scraping (LinkedIn, Google Maps, Hunter.io) and personalised email copy generation.
- Bronze-Silver-Gold Medallion pipeline with idempotent `MERGE INTO` upserts, 8-check Data Quality Gate, and UUID-based 4-hop audit lineage.
- XGBoost lead-intent scorer tracked and registered in MLflow Model Registry.
- 30-test pytest suite automated via GitHub Actions CI/CD on Python 3.11/3.12.
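
A Data Quality Gate of the kind listed above can be sketched as a list of named row-level checks. The three checks below are hypothetical stand-ins (the actual eight checks are not described here), and rows that fail are quarantined with the names of the checks they failed.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Tuple[str, Callable[[Row], bool]]

# Hypothetical checks; the real gate's rules are not described in this document.
CHECKS: List[Check] = [
    ("has_lead_id", lambda r: bool(r.get("lead_id"))),
    ("email_has_at", lambda r: "@" in str(r.get("email", ""))),
    ("score_in_range", lambda r: isinstance(r.get("score"), (int, float)) and 0 <= r["score"] <= 1),
]


def quality_gate(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (passed, quarantined); quarantined rows list their failed checks."""
    passed: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        failed = [name for name, check in CHECKS if not check(row)]
        if failed:
            quarantined.append({**row, "_failed_checks": failed})
        else:
            passed.append(row)
    return passed, quarantined


good, bad = quality_gate([
    {"lead_id": "a1", "email": "x@acme.co.za", "score": 0.8},
    {"lead_id": "", "email": "no-at-sign", "score": 1.7},
])
```

Recording which checks failed, rather than only dropping the row, is what makes the quarantine useful for later debugging.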
PySpark · XGBoost · Random Forest · MLP · SHAP · MLflow · scikit-learn · imbalanced-learn
End-to-end ML pipeline over 500k synthetic SA banking records (21.6% fraud rate).
- Distributed PySpark feature engineering: log-transforms, cyclical encoding, composite risk score, velocity×amount interaction.
- XGBoost won with PR-AUC 0.7090 — outperforming RF (0.6681), Neural Network MLP (0.6894), Logistic Regression (0.5059).
- SHAP TreeExplainer revealed `composite_risk_score` (SHAP = 0.53) as a top predictor invisible to raw correlation.
- ZAR cost-benefit threshold optimisation (0.05): 99.7% recall, R123M+ net benefit per 75k transactions.
- Model logged to MLflow as `sa-fraud-xgboost-v1` for real-time Kafka pipeline deployment.
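
Cost-benefit threshold optimisation, as referenced above, amounts to scoring each candidate threshold by the money it saves or loses. The sketch below is illustrative only: the labels, scores, candidate thresholds, and rand values are toy assumptions, not the project's data or cost model.

```python
from typing import Sequence, Tuple


def net_benefit(y_true: Sequence[int], scores: Sequence[float], threshold: float,
                gain_tp: float, cost_fp: float, cost_fn: float) -> float:
    """Net value of flagging every transaction with score >= threshold."""
    total = 0.0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            total += gain_tp   # fraud caught
        elif flagged and label == 0:
            total -= cost_fp   # legitimate transaction blocked
        elif not flagged and label == 1:
            total -= cost_fn   # fraud missed
    return total


def best_threshold(y_true: Sequence[int], scores: Sequence[float],
                   candidates: Sequence[float], **costs: float) -> Tuple[float, float]:
    """Return (threshold, net_benefit) for the candidate with the highest value."""
    scored = [(net_benefit(y_true, scores, t, **costs), t) for t in candidates]
    value, threshold = max(scored)
    return threshold, value


labels = [0, 0, 1, 0, 1, 0]
probs = [0.02, 0.10, 0.30, 0.40, 0.90, 0.04]
choice = best_threshold(labels, probs, [0.05, 0.25, 0.5],
                        gain_tp=1000.0, cost_fp=50.0, cost_fn=1000.0)
```

When a missed fraud costs far more than a false alarm, the best threshold tends to sit well below 0.5, which is consistent with the low operating point reported above.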
Apache Kafka · PySpark Structured Streaming · Delta Lake · Azure Databricks · ADF · Docker · pytest
Production-grade real-time ELT architecture replicating the Mercedes-Benz pipeline pattern.
- Kafka ingest → PySpark Structured Streaming → Delta Lake `MERGE INTO` (exactly-once upsert semantics).
- Three output layers: Silver transactions (upserted by ID), real-time fraud alerts, Gold 1-min tumbling window aggregations by merchant category and province.
- 30-test pytest suite + GitHub Actions CI/CD; local stack via Docker Compose.
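
The 1-minute tumbling window aggregation in the Gold layer can be pictured with a small pure-Python sketch. It is not the Structured Streaming job: there is no watermarking or late-data handling, and the event tuple layout and sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

Event = Tuple[datetime, str, float]  # (event_time, merchant_category, amount)


def tumbling_window_totals(events: Iterable[Event],
                           window_seconds: int = 60) -> Dict[Tuple[datetime, str], float]:
    """Sum amounts per (window_start, category) using fixed, non-overlapping windows."""
    totals: Dict[Tuple[datetime, str], float] = defaultdict(float)
    for ts, category, amount in events:
        epoch = int(ts.timestamp())
        # Floor the timestamp to the start of its window.
        window_start = datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)
        totals[(window_start, category)] += amount
    return dict(totals)


def at(minute: int, second: int) -> datetime:
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)


result = tumbling_window_totals([
    (at(0, 5), "grocery", 100.0),
    (at(0, 59), "grocery", 50.0),
    (at(1, 0), "grocery", 25.0),
    (at(0, 30), "fuel", 400.0),
])
```

Each event belongs to exactly one window, so the 12:00:59 and 12:01:00 grocery events land in different buckets even though they are one second apart.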
TensorFlow/Keras · PySpark · MLflow · GRU · Attention · GitHub Actions
Deep learning system targeting "card-testing" patterns invisible to point-in-time models.
- 2-layer GRU with custom Bahdanau Attention; analysed 50-step transaction histories per customer.
- PR-AUC 0.6671 (3.70× better than random baseline); ZAR threshold optimisation: R1.69M+ net benefit per test set.
- PySpark Window functions transformed 590k+ raw transactions into 3D temporal tensors.
- Custom Attention Layer enables "Sequential SHAP" values for FSCA regulatory compliance.
- Full MLOps lifecycle: MLflow experiment tracking + GitHub Actions CI/CD.
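
Turning per-customer transaction histories into a 3D tensor, as described above, might look like the following in plain Python. This is a sketch under assumptions: it uses overlapping sliding windows and skips customers with too little history, which may differ from the PySpark implementation, and the feature values are invented.

```python
from typing import Dict, List, Sequence

Features = Sequence[float]


def build_sequences(history: Dict[str, List[Features]], steps: int) -> List[List[List[float]]]:
    """Return a nested list shaped (n_windows, steps, n_features).

    Each customer's time-ordered history is cut into overlapping windows of
    `steps` consecutive transactions; shorter histories are skipped.
    """
    windows: List[List[List[float]]] = []
    for customer_id in sorted(history):
        rows = history[customer_id]
        for start in range(len(rows) - steps + 1):
            windows.append([list(r) for r in rows[start:start + steps]])
    return windows


toy = {
    "c1": [[10.0, 0.0], [12.0, 1.0], [11.0, 0.0], [500.0, 1.0]],
    "c2": [[5.0, 0.0], [6.0, 0.0]],
}
# The project analysed 50-step histories; 3 steps keeps the demo small.
tensor = build_sequences(toy, steps=3)
```

The resulting nested list has the `(samples, timesteps, features)` layout that recurrent layers such as a GRU expect once converted to an array.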
PySpark · Delta Lake · Azure Databricks · Great Expectations patterns · SparkSQL
High-scale ELT pipeline processing 701k+ Amazon reviews.
- Schema-resilient "Pandas Bridge" ingestion handling duplicate column names and physical/logical type mismatches.
- Idempotent Silver layer: Delta Lake `MERGE INTO` + salted broadcast joins to eliminate data skew.
- Automated Data Quality Gate (5+ schema/business logic checks per batch) with pipeline halting and quarantine logging.
- UUID-based Audit Spine with sub-second job metrics and 4-hop data lineage / Time-Travel audits.
- JVM bottleneck resolution: Checkpoint Plan Truncation + Spark shuffle partition tuning.
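
One part of schema-resilient ingestion mentioned above is coping with duplicate column names. The helper below is a hypothetical sketch of that single step, not the "Pandas Bridge" itself: it renames repeats with numeric suffixes and compares names case-insensitively, since Spark resolves column names case-insensitively by default.

```python
from typing import List, Set


def dedupe_columns(names: List[str]) -> List[str]:
    """Make column names unique by suffixing repeats with _1, _2, ..."""
    used: Set[str] = set()
    result: List[str] = []
    for name in names:
        candidate, n = name, 0
        # Keep bumping the suffix until the name is free (case-insensitive).
        while candidate.lower() in used:
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate.lower())
        result.append(candidate)
    return result


columns = dedupe_columns(["id", "title", "Title", "id", "id_1"])
```

The loop also guards against a generated name colliding with a column that already exists, which is why the final `id_1` becomes `id_1_1`.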
BiLSTM · GloVe 100d · TensorFlow/Keras · PySpark · NLTK · scikit-learn
Binary sentiment classifier for customer feedback monitoring.
- Evaluated LSTM, Conv1D, and BiLSTM architectures; BiLSTM won with PR-AUC 0.9391, ROC-AUC 0.9537.
- Decision threshold optimised to 0.06 → 99.9% recall.
- Projected R279M annual net value and R7.5M churn reduction for SA e-commerce context.
PySpark · Azure Databricks · ADF · ADLS Gen2 · Delta Lake · pytest
Scalable batch pipeline for South African consumer debt portfolio — Medallion Architecture (Bronze → Silver → Gold).
- Custom data quality framework (14 checks per run); ERROR-severity failures halt pipeline via ADF `IfCondition` activity.
- Star Schema Gold layer optimised for BI query performance; 4 Gold analytical tables consumed by dashboards and ML pipelines.
- 17-test pytest suite + GitHub Actions CI/CD.
Angular 17 · Node.js · MongoDB Atlas · JWT · TLS/SSL · GitHub Actions
Full-stack secure communication platform built to mitigate high-risk governmental data breach vectors.
- Neutralised OWASP Top 10 threats: stateless JWT, Bcrypt (cost factor 12), Helmet.js secure headers.
- 5-attempt lockout + exponential backoff against distributed brute-force attacks.
- End-to-end TLS/SSL encryption; express-mongo-sanitize for NoSQL injection prevention.
- GitHub Actions CI/CD automating security audits (`npm audit`), syntax validation, and production builds.
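
The 5-attempt lockout with exponential backoff listed above can be sketched as a small state machine. The platform itself is Node.js; this Python version is only an illustration, and the 60-second base duration, the doubling rule, and the in-memory storage are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict

MAX_ATTEMPTS = 5          # lockout threshold from the description above
BASE_LOCK_SECONDS = 60.0  # hypothetical base lock duration


@dataclass
class LoginThrottle:
    failures: Dict[str, int] = field(default_factory=dict)
    locked_until: Dict[str, float] = field(default_factory=dict)

    def is_locked(self, user: str, now: float) -> bool:
        return now < self.locked_until.get(user, 0.0)

    def record_failure(self, user: str, now: float) -> float:
        """Register a failed login; return the lock duration in seconds (0 if not locked)."""
        count = self.failures.get(user, 0) + 1
        self.failures[user] = count
        if count < MAX_ATTEMPTS:
            return 0.0
        # The 5th failure locks for the base duration; each further failure doubles it.
        duration = BASE_LOCK_SECONDS * 2 ** (count - MAX_ATTEMPTS)
        self.locked_until[user] = now + duration
        return duration

    def record_success(self, user: str) -> None:
        self.failures.pop(user, None)
        self.locked_until.pop(user, None)


throttle = LoginThrottle()
durations = [throttle.record_failure("thabo", now=0.0) for _ in range(6)]
```

Doubling the lock on every failure past the threshold makes sustained guessing progressively slower, while a successful login clears the counters.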
- Azure Functions v4 — HTTP triggers, Queue triggers, Timer triggers, isolated worker model
- Azure Storage Queues — Asynchronous message decoupling, at-least-once delivery, dead-letter queue (DLQ) isolation
- Azure Blob Storage — Hierarchical namespace, lifecycle management, immutable audit archives
- Azure Event Hubs — High-throughput streaming ingestion, Kafka-protocol compatible, on-premises boundary brokers
- Azure Data Factory — Pipeline orchestration, IfCondition error routing, incremental loads
- Azure Databricks — Structured Streaming, workflow utilities, hybrid library/notebook pattern
- Azure DevOps CLI — Secure Lakehouse access frameworks, pipeline-as-code
- Azure Monitor / Application Insights / Log Analytics — Enterprise telemetry, live operational dashboards
- Azure Data Explorer (ADX) — KQL queries, ingestion lag monitoring, queue threshold alerting
- Azurite — Full local Azure Storage emulation (Blob, Queue, Table) for offline development
- Azure Bicep (IaC) — Declarative resource provisioning, zero-touch environment replication
- Error Queue Analysis & DLQ Engineering — Malformed packet quarantine, pipeline fault isolation
- PySpark (Structured Streaming + Batch) — Distributed feature engineering, Window functions, partitioning
- Delta Lake: `MERGE INTO` idempotent upserts, ACID transactions, Time-Travel audits, salted broadcast joins
- Medallion Architecture (Bronze → Silver → Gold): Multi-hop data maturity, single source of truth
- Apache Kafka — Account-keyed partitioning, micro-batch processing, exactly-once semantics
- Idempotent Pipeline Design — Exactly-once processing, duplicate-safe replay, checkpoint/truncation
- Schema-Resilient Ingestion — Pandas Bridge pattern, physical/logical type mismatch handling
- Data Quality Gates — Great Expectations patterns, 14-check frameworks, automated quarantine logging
- UUID-Based Audit Spine — Sub-second job metrics, 4-hop lineage tracking, forensic debugging
- Plan Truncation / Checkpointing — JVM memory bottleneck resolution, shuffle partition tuning
- SparkSQL / HQL — Data modelling, schema design, distributed aggregations
- Schema Design — 3NF/BCNF normalisation, surrogate PKs, FK constraint enforcement, CHECK constraints
- Stored Procedures: Pessimistic concurrency (`UPDLOCK`/`ROWLOCK`), `SAVE TRANSACTION`, `OUTPUT` clause, `OPENJSON` bulk ingest
- Idempotent Upserts: `MERGE INTO` for SQL Server and Delta Lake, deduplication under at-least-once delivery
- Analytics Views: `LAG`/`LEAD`, `NTILE`, `RANK`/`DENSE_RANK`/`PERCENT_RANK`, rolling window frames (`ROWS BETWEEN`)
- Advanced Query Patterns: `GROUP BY ROLLUP`/`CUBE`, `PIVOT`/`UNPIVOT`, recursive CTEs, `CROSS APPLY`, `FOR JSON PATH`, `STRING_AGG`
- Indexing Strategy: Covering indexes, filtered indexes, columnstore indexes, composite indexes
- Dynamic SQL: Parameterised `sp_executesql`, whitelist-safe sort injection prevention
- Concurrency & Transactions: ACID compliance, `XACT_ABORT`, deadlock prevention, optimistic vs pessimistic patterns
- KQL (Kusto Query Language): Telemetry analysis, ADX cluster monitoring, event-driven pipeline metadata
- Platforms — SQL Server 2022, Azure SQL Database, PostgreSQL, Oracle, MySQL
- MongoDB: Schema-validated collections (`$jsonSchema`), TTL indexes, aggregation pipelines, `$facet`, `$lookup` with sub-pipelines, `$geoNear` geospatial queries, text search
- HBase: NoSQL wide-column storage
- Azure Data Explorer (ADX) — Time-series analytics, KQL-native queries
- Frameworks — TensorFlow 2.x / Keras, XGBoost 2.0, scikit-learn, imbalanced-learn (SMOTE)
- Architectures — BiLSTM, GRU + Bahdanau Attention, MLP (BatchNormalization, Dropout, EarlyStopping), MobileNetV2 transfer learning
- MLOps — MLflow (experiment tracking, Model Registry, serialization), GitHub Actions CI for model validation
- Explainability — SHAP TreeExplainer (global + local), custom Attention Layer "Sequential SHAP" for regulatory compliance
- Multi-Agent LLMs: CrewAI orchestration patterns, Anthropic Claude 3.5 Sonnet API integration, context-aware inference engines
- Distributed NLP — Native Spark NLP (Regex normalization), NLTK, GloVe 100d embeddings, HTML stripping without UDF serialization overhead
- C# / .NET 8 — Strongly-typed domain models, async/await concurrency, clean architecture patterns
- ASP.NET Core — REST API design, Swagger/OpenAPI, strict CORS, Application Insights telemetry
- Azure Functions Worker Model — Isolated worker process, DI container, output bindings
- Entity Framework Core — ORM, database-first and code-first migrations
- Express.js / Node.js — Middleware security, JWT authentication flows
- Angular 17 — Component architecture, reactive forms
- React — Automated network polling, dynamic metric mapping, Recharts data visualization
- Flask / Twilio SDK — Webhook processing, WhatsApp Business API integration
- Parameterised SQL (`sp_executesql`): Zero SQL injection risk in dynamic query patterns
- OWASP Top 10 Mitigation: JWT (stateless), Bcrypt (cost factor 12), Helmet.js, account lockout + exponential backoff
- TLS/SSL Encryption — End-to-end transit security, client-to-server and server-to-database tunnels
- NoSQL Injection Prevention — express-mongo-sanitize
- PII Data Masking & Obfuscation (Data Privacy)
- Secret / Key Vault Management
- Data Lineage & Provenance — 4-hop lineage tracking, Time-Travel audits, immutable blob archives
- Cloud Platform Governance: Automated Azure DevOps pipelines enforcing RBAC and ADLS Gen2 directory ACLs
- xUnit + FluentAssertions — C# / .NET 8 unit and integration testing, round-trip validation, theory-driven test suites
- pytest (30+ test suites) — Python unit testing, integration testing, GitHub Actions automation
- Automated SQL Unit Testing — Live SQL Server 2022 container provisioning on every PR
- GitHub Actions — Multi-job CI pipelines (lint → build → test → security scan → deploy → smoke test)
- Docker / Docker Compose — Multi-container local stacks, environment parity (local-to-cloud), init containers
- sqlfluff — T-SQL linting in CI pipeline
- Azure Bicep — Declarative IaC, parameterised multi-environment deployments
- Power BI — Production reporting dashboards
- KQL (Kusto Query Language) — Event Hubs monitoring, ADX telemetry, ingestion lag analysis
- Azure Monitor / Application Insights — Live telemetry, bottleneck isolation, health dashboards
- Streamlit — Multi-page analytical dashboards with real-time telemetry and financial pipeline models
| Qualification | Institution | Year |
|---|---|---|
| Postgraduate Diploma in Data Analytics (NQF 8) | Varsity College, Cape Town | 2023–2024 |
| BCom Computer & Information Sciences — App Development (NQF 7) | Varsity College, Cape Town | 2019–2023 |
| National Senior Certificate — STEM Specialisation (NQF 4) | Parklands College, Cape Town | 2015–2019 |
- 📧 Email: siyamondemathe@gmail.com
- 📍 Location: Umhlanga Rocks, Durban, South Africa
- 💼 Open to: Senior Data Engineer · AI Engineer · Cloud Data Architect roles — remote or hybrid