Portal - Outgoing HTTP Error Rate Alerts
This guide helps the support team diagnose and resolve outgoing HTTP request failures detected by the Outgoing HTTP Error Rate Alert in Grafana for the portal service.
Root Cause
Portal Outgoing HTTP error rate alerts occur when the portal service experiences failures in making HTTP requests to services or APIs. Common causes include:
- Service Unavailability: Target services (Ranger API, Scheme Server, Ops-Server, Dataserver) are down or unresponsive
- Network/Proxy Issues: Connectivity problems or proxy configuration errors preventing outbound requests
- Resource Constraints: Memory pressure or connection pool exhaustion affecting outbound calls
- Service Dependency Failures: Services returning errors or not responding as expected
- Timeout Issues: Services taking too long to respond, causing request timeouts
Solution
Step 1: Identify Failing Endpoint
Use the alert metadata in Grafana (such as `URI`, `Method`, and `Status`) to identify:
- Which portal endpoint is failing (Example: Ranger API (`/ranger/*`), Scheme Server (`/peg/*`), Ops-Server (`/ops-server/*`))
- What request caused the issue (Example: `GET /ranger/service/xusers/users` → 500 Internal Server Error)
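If the alert metadata alone is not enough, the same breakdown can be pulled from Prometheus directly. The sketch below is illustrative only: the Prometheus URL, the `http_client_requests_seconds_count` metric name, and the `service`/`uri`/`method`/`status` labels are assumptions that should be replaced with whatever the Grafana panel actually queries.

```python
# Illustrative sketch (not the exact panel query): list outgoing requests from
# the portal that returned 5xx in the last 5 minutes, grouped by URI, method,
# and status. Metric and label names are assumptions -- adjust to match the
# dashboard's underlying query.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder

query = (
    "sum by (uri, method, status) ("
    'increase(http_client_requests_seconds_count{service="portal", status=~"5.."}[5m])'
    ")"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    count = float(series["value"][1])
    if count > 0:
        print(f'{labels.get("method")} {labels.get("uri")} -> {labels.get("status")}: {count:.0f} errors')
```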
Step 2: Grafana Dashboard Checks
- Outgoing HTTP Request Rate
This panel shows how many requests per second Portal is making to services over the last 5 minutes.
What It Shows:
- Request volume to each service (Ranger, PEG, Ops-Server)
- Breakdown by HTTP status code (200, 404, 500, etc.)
- Breakdown by HTTP method (GET, POST, PUT, etc.)
- Specific endpoints being called
When to Check:
- To see if Portal is making unusually high or low requests to services
- To identify which specific endpoints are generating the most traffic
- To correlate request spikes with error increases
- Outgoing HTTP Response Time
This panel shows how long services take to respond to Portal's requests.
What It Shows:
- Average response time for each service
- Response time trends over time
- Breakdown by endpoint and status code
- Identifies slow-performing services
When to Check:
- If Portal UI feels slow or unresponsive
- To identify which service is causing performance issues
- To see if response times correlate with error rates
- Outgoing Connection Status
This panel displays the current proxy connection health between Portal and services.
What It Shows:
- Ranger Proxy: Portal → Ranger API communication status (Connected/Disconnected/NA)
- PEG Proxy: Portal → PEG/Scheme Server communication status (Connected/Disconnected/NA)
- Ops Server Proxy: Portal → Ops-Server communication status (Connected/Disconnected/NA)
Status Indicators:
- Connected (Green): Portal can successfully communicate with the service
- Disconnected (Red): Portal cannot reach the service (network/auth/service down)
- NA (Gray): Service not configured or monitoring not available
When to Check:
- If you see 404/503/504 errors in outgoing requests
- When Portal features dependent on services aren't working
- To verify which specific service is causing connectivity issues
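When the panel is unavailable, a rough equivalent of the Connected/Disconnected check is to probe each proxied route directly. The sketch below assumes the portal exposes the three services under `/ranger/*`, `/peg/*`, and `/ops-server/*` behind a placeholder base URL; the exact paths, ports, and any required authentication are deployment-specific.

```python
# Illustrative sketch: approximate the Outgoing Connection Status panel by
# probing each proxied route through the portal. Base URL and paths are
# placeholders; authentication (if required) is omitted.
import requests

PORTAL_BASE_URL = "http://portal:8080"  # placeholder

PROXY_ROUTES = {
    "Ranger Proxy": "/ranger/",          # Portal -> Ranger API
    "PEG Proxy": "/peg/",                # Portal -> PEG / Scheme Server
    "Ops Server Proxy": "/ops-server/",  # Portal -> Ops-Server
}

for name, path in PROXY_ROUTES.items():
    try:
        resp = requests.get(PORTAL_BASE_URL + path, timeout=5)
        # Any response below 500 means the route is reachable; 502/503/504
        # usually indicates the backend behind the proxy is down.
        state = "Connected" if resp.status_code < 500 else f"Disconnected ({resp.status_code})"
    except requests.RequestException as exc:
        state = f"Disconnected ({type(exc).__name__})"
    print(f"{name}: {state}")
```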
Step 3: Apply Quick Fixes Based on Common Error Patterns
| Error Code | Likely Cause | Quick Fix |
|---|---|---|
| 400 | Invalid request parameters / malformed JSON | Validate input parameters and request format |
| 401/419 | Token expired | Verify service account tokens and check service health |
| 404 | Endpoint not found / resource missing | Verify URL configuration and resource existence |
| 500 | Internal service error | Proceed to the Escalation Checklist for further investigation |
| 503/504 | Service unavailable | Check service health |
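Before applying a fix, it can help to replay the failing request from Step 1 and confirm which row of the table applies. A minimal sketch, assuming the failing call was `GET /ranger/service/xusers/users` through a placeholder portal base URL; authentication headers, if required, are omitted.

```python
# Illustrative sketch: replay the failing request identified in Step 1 and map
# the outcome onto the quick-fix table above. URL and path are placeholders.
import requests

PORTAL_BASE_URL = "http://portal:8080"         # placeholder
FAILING_PATH = "/ranger/service/xusers/users"  # from the alert metadata

try:
    resp = requests.get(PORTAL_BASE_URL + FAILING_PATH, timeout=10)
except requests.Timeout:
    print("Timed out -> treat like 503/504: check the target service's health")
except requests.ConnectionError:
    print("Connection failed -> check network/proxy configuration")
else:
    code = resp.status_code
    if code == 400:
        print("400 -> validate input parameters and request format")
    elif code in (401, 419):
        print("401/419 -> verify service account tokens")
    elif code == 404:
        print("404 -> verify URL configuration and resource existence")
    elif code == 500:
        print("500 -> escalate per the Escalation Checklist")
    elif code in (503, 504):
        print("503/504 -> check target service health")
    else:
        print(f"Received {code}; the request may have succeeded or hit an unlisted status")
```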
Escalation Checklist
If the issue cannot be resolved through the specific troubleshooting guides, escalate it to the appropriate team with the following details:
- Timestamp of the error: Include the exact time the alert was triggered
- Grafana dashboard and alert screenshots:
  - Grafana → Dashboards → Portal folder → Portal Dashboard
  - Grafana → Alerting → Alert rules → Outgoing HTTP Error Rate Alerts
- Portal Service Logs: Include portal service logs, any relevant client-side actions, or test steps that reproduce the issue
Option 1: Download Logs from Diagnostic Portal (Recommended)
1. Open Diagnostic Portal and go to Dashboard → Services Tab
2. Type "portal" in the service column search box
3. Click on the portal service to open its details page
4. Find and click on a pod that shows "active" status
5. Click the "Logs" tab on the pod details page
6. Click the "Download Logs" button to save the logs
7. If you see multiple portal pods with "active" status, repeat steps 4-6 for each one
Option 2: Manual Log Collection (If Diagnostic service is not enabled)
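The exact manual steps depend on the environment; one possible sketch is shown below, assuming the portal runs on Kubernetes, the official `kubernetes` Python client is installed, and the pods carry a hypothetical `app=portal` label in a placeholder namespace.

```python
# Rough sketch only: collect logs from all running portal pods via the
# Kubernetes API. The namespace and label selector are assumptions -- adjust
# them to your deployment (or use the equivalent kubectl commands instead).
from kubernetes import client, config

NAMESPACE = "default"          # placeholder
LABEL_SELECTOR = "app=portal"  # placeholder

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace=NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    if pod.status.phase != "Running":
        continue
    name = pod.metadata.name
    log_text = v1.read_namespaced_pod_log(name=name, namespace=NAMESPACE)
    with open(f"{name}.log", "w") as fh:
        fh.write(log_text)
    print(f"Saved logs for {name}")
```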
- Current portal configuration details: Configuration settings and deployment information
- Relevant user actions: Actions leading up to the error
For additional assistance, see How to Contact Support for detailed guidance on reaching out to the support team.
- Back to: Troubleshooting Overview