The Runtime Question
Every enterprise AI project eventually confronts a fundamental question: what language do you build the application in? Python is the obvious choice for ML — the ecosystem is unmatched. But AI is typically 20% of the total system. The other 80% is authentication, authorization, role-based access control, data persistence, transaction management, audit logging, API gateway routing, and dozens of other enterprise concerns.
We have tried three approaches: all-Python, all-Java, and hybrid. After building production systems in each, we have converged on the hybrid architecture as our standard for enterprise deployments. This post explains why and describes the architecture in detail.
Why Not All-Python?
Python is extraordinary for ML workloads. But as the backbone of a large enterprise application, it has real limitations:
The GIL. CPython's Global Interpreter Lock means true CPU parallelism requires multiprocessing, not multithreading. For a web server handling hundreds of concurrent requests with CPU-bound processing, this is a genuine constraint. Yes, you can use async frameworks (FastAPI with uvicorn), and yes, Python 3.13 ships an experimental free-threaded build that removes the GIL (PEP 703). But in practice, JVM-based servers handle concurrent enterprise workloads more efficiently.
Type safety at scale. Python's type hints (via mypy/pyright) have improved dramatically, but they remain optional and unenforced at runtime. In a 200,000-line enterprise application with 15 developers, the compile-time guarantees of Java catch entire categories of bugs that Python discovers at runtime — often in production.
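To make the runtime-enforcement point concrete, here is a minimal illustration (the `risk_score` function is hypothetical, not from our codebase): Python happily accepts a mistyped argument, and the bug surfaces only in the result.

```python
def risk_score(confidence: float) -> float:
    """Scale a model confidence into a 0-10 risk score."""
    return confidence * 10

# The annotation promises a float, but nothing enforces it at runtime:
# a mistyped string argument silently repeats the string instead of raising.
assert risk_score(0.5) == 5.0
assert risk_score("0.9") == "0.9" * 10  # a 30-character string, no error
```

A static checker like mypy flags the second call, but only if every developer runs it and every dependency is annotated; the Java compiler gives the equivalent guarantee unconditionally.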
Enterprise ecosystem maturity. Spring Security, Spring Data JPA, Spring Cloud, Hibernate — the Java enterprise ecosystem is 20+ years mature. The Python equivalents (SQLAlchemy, Flask-Login, etc.) are capable but require more glue code, more configuration, and more custom solutions for enterprise patterns like distributed transactions, LDAP integration, and SAML/OIDC authentication.
Why Not All-Java?
Java's ML ecosystem is not competitive with Python's. Libraries like DJL and DL4J exist, but they lag significantly behind PyTorch, Transformers, and the broader Python ML ecosystem in model availability, community support, and documentation.
More practically: every pre-trained model, every ML paper's reference implementation, and every cutting-edge technique is available in Python first. Often exclusively in Python. Building an ML pipeline in Java means either wrapping Python libraries via Jython/GraalPython (fragile), using ONNX Runtime for inference only (limited), or reimplementing algorithms from scratch (expensive).
The Hybrid Architecture
Our standard architecture splits the system into two runtimes connected via gRPC:
The Spring Boot Application Layer owns:

- REST/GraphQL API gateway
- Authentication and authorization (Spring Security with OAuth2/OIDC)
- Business logic and workflow orchestration
- Data persistence (Spring Data JPA with PostgreSQL)
- Event streaming (Spring Cloud Stream with Kafka)
- Audit logging and compliance
- Rate limiting and request validation
- Admin dashboards and reporting
The Python ML Service Layer owns:

- LLM integration (OpenAI, Anthropic, local models)
- RAG pipeline (embedding, retrieval, re-ranking)
- Model inference (classification, NER, anomaly detection)
- Feature engineering and data preprocessing
- Model evaluation and monitoring
- Training pipeline orchestration
gRPC as the Connective Tissue
We chose gRPC over REST for inter-service communication for three reasons:
1. Protocol Buffers enforce a contract. The .proto file is the single source of truth for the API between services. Both Java and Python codegen from the same definition. Type mismatches are caught at compile time, not runtime.
2. Performance. Binary serialization (protobuf) is 5-10x faster than JSON for our typical payloads. Streaming RPCs allow the ML service to send partial results as they are computed — critical for long-running analysis tasks.
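The size difference is easy to see even without protobuf installed. The sketch below is not protobuf itself — it uses Python's stdlib `struct` module as a stand-in for binary wire formats, and the record's field names and values are made up for illustration:

```python
import json
import struct

# A RiskFinding-like record: numeric id, severity enum value, confidence.
record = {"finding_id": 421, "severity": 2, "confidence": 0.87}

# Text encoding: field names and number formatting travel on the wire.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: the schema lives in code, so only values travel.
# "<IBf" = little-endian uint32 + uint8 + float32, 9 bytes total.
as_binary = struct.pack(
    "<IBf", record["finding_id"], record["severity"], record["confidence"]
)

assert len(as_binary) < len(as_json)  # ~9 bytes vs ~50 bytes here
```

Protobuf adds field tags and varint encoding on top of this idea, but the principle — schema in code, only values on the wire — is the same.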
3. Bidirectional streaming. For real-time applications (HALO anomaly detection), the ML service pushes events to the Spring application via server-streaming RPCs. No polling required.
```protobuf
// contract_analysis.proto
syntax = "proto3";

package exosolve.lexis;

service ContractAnalysis {
  // Unary: analyze a single contract
  rpc AnalyzeContract(AnalysisRequest) returns (AnalysisResponse);

  // Server streaming: stream risk findings as they are discovered
  rpc StreamAnalysis(AnalysisRequest) returns (stream RiskFinding);

  // Unary: get model health and metrics
  rpc GetServiceHealth(HealthRequest) returns (HealthResponse);
}

message AnalysisRequest {
  string document_id = 1;
  bytes document_content = 2;
  string document_type = 3; // "pdf", "docx", "txt"
  AnalysisConfig config = 4;
}

message RiskFinding {
  string finding_id = 1;
  string clause_text = 2;
  string clause_type = 3;
  RiskSeverity severity = 4;
  float confidence = 5;
  string explanation = 6;
  repeated string source_references = 7;
}

enum RiskSeverity {
  LOW = 0;
  MEDIUM = 1;
  HIGH = 2;
  CRITICAL = 3;
}
```
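On the Python side, a server-streaming RPC like `StreamAnalysis` is implemented as a generator: each yielded message is flushed to the client as soon as it is produced. The sketch below mimics that shape without depending on the generated grpcio stubs — the dataclass stands in for the `RiskFinding` message, and the clause splitter and risk rule are placeholder logic, not our actual pipeline:

```python
from dataclasses import dataclass


@dataclass
class RiskFinding:
    finding_id: str
    clause_text: str
    severity: str
    confidence: float


def stream_analysis(document_content: bytes):
    """Yield findings one at a time, as a server-streaming handler would.

    In the real service this is a grpc servicer method; each `yield`
    becomes one RiskFinding message on the wire.
    """
    text = document_content.decode("utf-8")
    # Placeholder clause extraction: split on blank lines.
    clauses = [c.strip() for c in text.split("\n\n") if c.strip()]
    for i, clause in enumerate(clauses):
        # Placeholder risk rule: flag indemnification language.
        if "indemnif" in clause.lower():
            yield RiskFinding(
                finding_id=f"f-{i}",
                clause_text=clause,
                severity="HIGH",
                confidence=0.9,
            )


doc = b"Term: 12 months.\n\nSupplier shall indemnify Buyer against all claims."
findings = list(stream_analysis(doc))
print([f.finding_id for f in findings])  # ['f-1']
```

The client (the Spring service, via its generated stub) simply iterates over the response stream, receiving each finding as it is discovered rather than waiting for the full document to finish.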
Service Communication Pattern
The typical request flow for LEXIS AI:
1. Client sends HTTPS request to Spring Boot API gateway.
2. Spring Security validates JWT, checks permissions.
3. Spring controller validates request, loads document metadata from PostgreSQL.
4. Spring service sends gRPC request to Python ML service with document content.
5. Python service runs the analysis pipeline (clause extraction, classification, risk analysis).
6. Results stream back via gRPC server-streaming.
7. Spring service persists results, updates audit log, sends Kafka event.
8. Spring controller returns structured JSON response to client.
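In miniature, the flow above can be sketched as runnable pseudologic. Every step here is a stub — the function names are ours for illustration, not Spring or grpc APIs — and in production steps 1-4 and 7-8 run in the JVM while steps 5-6 run in the Python service:

```python
def check_jwt(token: str) -> str:
    # Stub for Spring Security JWT validation; returns the user id.
    if token != "valid-token":
        raise PermissionError("invalid token")
    return "user-42"


def load_metadata(document_id: str) -> dict:
    # Stub for a Spring Data JPA metadata lookup.
    return {"document_id": document_id, "document_type": "txt"}


def analyze_via_grpc(content: bytes):
    # Stub for the server-streaming gRPC call to the ML service.
    yield {"finding_id": "f-0", "severity": "HIGH"}


def handle_request(token: str, document_id: str, content: bytes) -> dict:
    user = check_jwt(token)                     # steps 1-2: authn/authz
    meta = load_metadata(document_id)           # step 3: load metadata
    findings = list(analyze_via_grpc(content))  # steps 4-6: gRPC round trip
    audit = [f"user={user} analyzed {document_id}"]  # step 7: audit log
    return {"meta": meta, "findings": findings, "audit": audit}  # step 8


result = handle_request("valid-token", "doc-1", b"...")
assert len(result["findings"]) == 1
```

The key property: the ML call sits behind one function boundary, so everything around it (auth, persistence, auditing) stays in the enterprise runtime.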
The Python service is stateless — it receives a document, analyzes it, and returns findings. All state management (user sessions, document storage, analysis history, permissions) lives in the Spring application.
Deployment Architecture
We deploy both runtimes in Kubernetes with distinct scaling profiles:
Spring Boot pods: Scaled based on HTTP request rate. Typically 3-8 replicas. Each pod runs on 2 vCPU / 4GB RAM. JVM warm-up is handled via readiness probes that wait for the first GC cycle to complete.
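For the JVM side, the readiness gate is plain Kubernetes configuration against Spring Boot Actuator's standard readiness endpoint. A representative fragment — the delay and threshold values here are illustrative, not our tuned numbers:

```yaml
# spring-boot pod spec fragment (illustrative values)
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 20   # allow JVM startup and warm-up
  periodSeconds: 5
  failureThreshold: 6
```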
Python ML pods: Scaled based on GPU utilization and queue depth. Typically 2-4 replicas. Each pod runs on 4 vCPU / 16GB RAM with optional GPU (NVIDIA T4) for model inference. We use Celery with Redis for async task queuing when analysis takes longer than 30 seconds.
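The queuing pattern itself is simple. The stdlib sketch below shows the shape — a bounded work queue drained by a worker — without Celery or Redis, which is what we actually run; all names are illustrative:

```python
import queue
import threading

task_queue = queue.Queue(maxsize=100)  # qsize() is the autoscaling signal
results = []


def worker():
    # Drain tasks until the sentinel None arrives.
    while True:
        doc = task_queue.get()
        if doc is None:
            task_queue.task_done()
            break
        results.append(f"analyzed {len(doc)} bytes")  # stand-in for analysis
        task_queue.task_done()


t = threading.Thread(target=worker)
t.start()

for doc in [b"contract one", b"contract two"]:
    task_queue.put(doc)  # Celery would serialize this into Redis instead
task_queue.put(None)     # sentinel to stop the worker
t.join()
print(results)
```

Celery adds persistence, retries, and cross-process distribution on top of this, which is why we reach for it once analyses exceed the 30-second threshold rather than holding a gRPC call open.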
```yaml
# python-ml-service deployment (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-analysis-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: ml-service
          image: exosolve/ml-analysis:v2.4.1
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 50051  # gRPC
              name: grpc
            - containerPort: 8081   # health/metrics
              name: metrics
          readinessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 30
          env:
            - name: MODEL_CACHE_DIR
              value: "/models"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: ml-model-cache
```
When to Use Each Runtime
Across 20+ deployments, we have developed clear heuristics:
Put it in Spring Boot if: It involves user authentication/authorization, database CRUD operations, business workflow logic, API routing, event publishing, or anything that benefits from Java's type system and enterprise library ecosystem.
Put it in Python if: It involves ML model inference, embedding generation, LLM API calls, numerical computation, data preprocessing with pandas/numpy, or anything where the Python ML ecosystem has a decisive advantage.
The grey zone: Some components could reasonably live in either runtime. Text preprocessing, simple rule-based classification, and API orchestration can go either way. Our heuristic: if the component is more likely to evolve toward ML complexity, put it in Python. If it is more likely to evolve toward enterprise integration complexity, put it in Spring Boot.
Results
The hybrid architecture adds operational complexity — two runtimes, two build pipelines, gRPC code generation, cross-runtime debugging. But the benefits are substantial: each runtime is used for what it does best, teams can specialize (enterprise Java engineers and ML Python engineers collaborate without stepping on each other), and the system scales each layer independently based on actual bottlenecks.
For our typical enterprise deployment, the Spring Boot layer handles 500+ concurrent users with 3 replicas, while the Python ML layer processes 200+ analyses per hour with 2-3 GPU-backed replicas. Both layers scale independently as load patterns change.