➜ ~ git:(main) ✗ man naveen
NAME
naveen — turn chaos into uptime
SYNOPSIS
naveen [-d | --debug] [-k | --kernel] [-f | --fire] system
naveen [-c | --cost] [-o | --optimize] aws_account ...
naveen [-a | --architect] [-s | --scale] platform teams ...
naveen [-r | --recover] [-t deadline] corrupted_data angry_client
naveen [-S | --secure] [--zero-trust] vpc iam secrets ...
naveen --why-is-this-broken production
DESCRIPTION
I fix systems that are on fire and build platforms that don't catch fire.
Currently at Nielsen. Previously Flexcar and OYO.
INCIDENTS RESOLVED
$100M contract recovery
28 days. 10TB. 7 pipelines. 100% accuracy. Zero SLA breaches.
cgroups_v2_oomkill
EKS upgrade changed memory accounting. Page cache + kernel memory now counted toward limits.
efs_martian_packets
VPC CIDR overlap + rp_filter = silent drops. One node in hundreds. Found via dmesg.
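A minimal sketch of one check related to the entry above, not the original tooling: with strict rp_filter the kernel drops a packet whose source address would not be routed back out of the interface it arrived on, and overlapping CIDRs are one way to end up with such a route mismatch. The function name and the example ranges are hypothetical, and IPv4 ranges are assumed.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return every pair of CIDR blocks in `cidrs` that share addresses."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    for a, b in overlapping_cidrs(["10.0.0.0/16", "10.0.128.0/20", "192.168.0.0/24"]):
        print(f"overlap: {a} <-> {b}")
```

A check like this only flags candidate ranges. Which interface the reply route actually uses still has to be confirmed on the affected node.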
nfs_loopback_deadlock
Hard mount + dead server = D-state forever. Threads stuck in uninterruptible sleep.
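A minimal sketch of how D-state processes can be spotted from /proc on Linux, assuming the standard /proc/<pid>/stat layout of pid, comm in parentheses, then a one-letter state. It inspects only each process's main thread; per-thread state lives under /proc/<pid>/task/<tid>/stat. The function names are illustrative, not from the original investigation.

```python
import os


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut at the last ')' instead of splitting on whitespace.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def d_state_pids(proc="/proc"):
    """List PIDs whose main thread is in uninterruptible sleep (state D)."""
    pids = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat"), errors="replace") as f:
                if parse_state(f.read()) == "D":
                    pids.append(int(name))
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the line was unparsable
    return pids
```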
etl_cascade_corruption
One missing staging folder corrupted 28 days across 7 RT pipelines. Incremental systems assume continuity.
o_n_squared_hot_path
Linear scan inside loop. 29 billion comparisons. 78 min runtime. Fix: indexed lookup. 97% reduction.
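A minimal sketch of the pattern named in the entry above, assuming hypothetical record shapes since the original code is not shown: a linear scan nested inside a loop is replaced by a dictionary built once, so each lookup takes constant time on average.

```python
# Record shapes and names are hypothetical, chosen only to show the pattern.

def join_with_scan(orders, customers):
    """O(n * m): rescans the whole customer list for every order."""
    out = []
    for order in orders:
        for customer in customers:
            if customer["id"] == order["customer_id"]:
                out.append((order["id"], customer["name"]))
                break
    return out


def join_with_index(orders, customers):
    """O(n + m): builds a dict index once, then looks each order up in it."""
    by_id = {}
    for customer in customers:
        # Keep the first record per id, matching the scan's `break`.
        by_id.setdefault(customer["id"], customer)
    out = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is not None:
            out.append((order["id"], customer["name"]))
    return out
```

Both functions return the same pairs. Only the cost of finding each match changes.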
connection_pool_exhaustion
Missing max-lifetime. Connections never recycled. Pool exhausted over days.
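A toy sketch of what a max-lifetime setting does, not the pool library involved in the incident (the source does not name it). The `Pool` class, its parameters, and the `connect` factory are all hypothetical. Connections older than `max_lifetime` seconds are closed instead of reused, so stale connections cannot keep occupying pool slots.

```python
import time
from collections import deque


class Pool:
    """Toy pool that retires connections older than max_lifetime seconds."""

    def __init__(self, connect, max_size, max_lifetime, clock=time.monotonic):
        self._connect = connect        # hypothetical factory returning a connection
        self._max_size = max_size
        self._max_lifetime = max_lifetime
        self._clock = clock
        self._idle = deque()           # (connection, created_at) pairs
        self._in_use = 0

    def acquire(self):
        # Reuse the first idle connection that is still young enough;
        # close any that have outlived max_lifetime along the way.
        while self._idle:
            conn, created = self._idle.popleft()
            if self._clock() - created < self._max_lifetime:
                self._in_use += 1
                return conn, created
            conn.close()
        if self._in_use >= self._max_size:
            raise RuntimeError("pool exhausted")
        self._in_use += 1
        return self._connect(), self._clock()

    def release(self, conn, created):
        self._in_use -= 1
        if self._clock() - created < self._max_lifetime:
            self._idle.append((conn, created))
        else:
            conn.close()  # too old: close rather than return to the pool
```

Production pools usually expose this as a configuration value rather than code. The sketch only shows the recycling rule itself.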
vpc_endpoint_sg_drift
Two teams made changes. Private DNS override + missing SG rule. Silent API timeouts.
ssl_proxy_interception
Corporate proxy MITM + missing CA in truststore = certificate validation failed.
transitive_dependency_mismatch
JDBC driver upgrade pulled Scala 2.13 into Spark 2.12 classpath. Class loading failures.
kubernetes_ip_exhaustion
CNI warm pool defaults + auto-scaling = subnet exhausted by reservations not pods.
base_image_eol
Pinned OS version EOL. Package repos disappeared. All builds failed.
stale_lookup_race
Calendar table max date in past + MAX+1 key generation = intermittent NULL failures.
COST SAVINGS
$560K+/year total
Traffic-based autoscaling ($90K). ETL redesign ($50K). AWS governance ($360K). Infrastructure right-sizing ($50K).
traffic_autoscaling
Analyzed 27 days of traffic. 261K requests. 3-tier CronJob scaling. 43% compute reduction.
etl_platform_redesign
85% cost reduction. Right-sized Spark profiles per command. Extract/Ingest no longer using Transform-sized clusters.
aws_governance
14 dashboards. ML anomaly detection. Eliminated $30K/month spike. 20% cost reduction target.
output_formatter_fix
O(n⁴) to O(1). 2hr to 5min. 97% compute reduction. 29 billion operations eliminated.
SYSTEMS BUILT
async_job_orchestration
1000+ concurrent jobs. Atomic claiming with optimistic locking. Heartbeat monitoring. Autoscaling workers.
7_system_reconciliation
Cross-platform DQC. 7 systems integrated. Billions of records. Sub-hour deviation detection.
etl_platform_redesign
97 Scala files analyzed. 28 critical issues. Parent-child execution model. 85% cost reduction.
real_time_analytics
Real-time analytics on Druid. Petabyte scale. Sync/async routing. Hot/warm/cold storage tiers.
eks_fleet_migration
1.23 to 1.33 across 4 teams. Reverse-engineered NodeConfig. Filled AWS documentation gaps.
ENVIRONMENT
$LANG Java, Python, Scala, Go, Bash
$DATA Kafka, Spark, Druid, Airflow, PostgreSQL, DynamoDB
$MACHINES AWS (VPC, IAM, EKS, FSx, Compute, Security), Linux (storage, networking, monitoring)
$INFRA Kubernetes, Terraform, Helm, GitLab CI/CD
$OBSERVE Grafana, Prometheus, CloudWatch, dmesg, strace
EXIT CODES
0 system recovered, client happy, contract saved
1 found the bug, mass-produced documentation
137 OOMKilled — but now I know why
139 segfault traced, core dump analyzed
143 SIGTERM caught, graceful shutdown achieved
255 kernel said no, I said watch me
SEE ALSO
github.com/nkr-ops
linkedin.com/in/naveenkumarreddyk
HISTORY
2024– Nielsen
2022–2024 Flexcar
2021–2022 OYO
BUGS
Debugs problems that aren't assigned. Writes too much documentation.