Athena Platform - End-to-End Monitoring Solution
Digital-On Tech Healthcare Technology Monitoring
Executive Summary
The Athena Platform is a comprehensive healthcare technology monitoring and analytics solution built on Grafana Enterprise. It provides real-time visibility into system health, automated compliance reporting, secure user management, and professional Digital-On branding throughout the user experience.
Platform URL: https://athena.digitalon.co.za
Organization: Digital-On Tech
Purpose: Healthcare Technology Infrastructure Monitoring
Architecture Overview
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│                         External Access                         │
│                    (Cloudflare Tunnel - TLS)                    │
│                 https://athena.digitalon.co.za                  │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────┴────────────────────────────────┐
│                      Athena Platform Host                       │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Grafana Enterprise 12.3.0                 │  │
│  │  - Monitoring Dashboards                                  │  │
│  │  - User Management                                        │  │
│  │  - Alert Management                                       │  │
│  │  - Digital-On Branding                                    │  │
│  └─────────────┬───────────────────────────────┬─────────────┘  │
│                │                               │                │
│  ┌─────────────┴─────────────┐   ┌─────────────┴─────────────┐  │
│  │ PostgreSQL 16             │   │ Redis                     │  │
│  │ - Grafana DB              │   │ - Data Source             │  │
│  │ - User Data               │   │ - Caching                 │  │
│  │ - Dashboards              │   │ - Time Series             │  │
│  └───────────────────────────┘   └───────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │             Automation & Monitoring Services              │  │
│  │                                                           │  │
│  │  • grafana-compliance.timer (hourly)                      │  │
│  │  • grafana-auto-disable.timer (every minute)              │  │
│  │  • grafana-active-users.timer (every 15 minutes)          │  │
│  │  • msmtp (Mailgun SMTP)                                   │  │
│  │  • Telegram notifications                                 │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. Grafana Enterprise
Version: 12.3.0
Purpose: Primary monitoring and visualization platform
Capabilities:
- Real-time metrics visualization
- Custom dashboard creation
- Alert management and routing
- User access control
- Query performance analysis
- Plugin ecosystem (Redis datasource, image renderer)
Customizations:
- Complete Digital-On branding (logo, colors, login page)
- Custom email templates (welcome, password reset)
- Self-registration with approval workflow
- Enhanced monitoring and compliance features
2. PostgreSQL Database
Version: 16
Purpose: Grafana backend storage
Contains:
- User accounts and authentication
- Dashboard definitions
- Alert rules and notification channels
- Plugin configurations
- Session data
- Audit logs
Benefits:
- Better performance than SQLite
- Transaction safety
- Easier backup and replication
- Query optimization capabilities
3. Redis
Purpose: Data source and caching layer
Use Cases:
- Time-series data storage
- Application caching
- Real-time metrics
- Session storage (optional)
4. Cloudflare Tunnel
Purpose: Secure external access without port forwarding
Features:
- TLS encryption end-to-end
- DDoS protection
- No firewall rules needed
- Automatic certificate management
- Traffic analytics
5. Email System (msmtp + Mailgun)
Purpose: Transactional email delivery
Configuration:
- SMTP Host: smtp.eu.mailgun.org:587
- From Address: [email protected]
- Sender Name: Athena Grafana
Email Types:
- Welcome emails (branded)
- Password reset (branded)
- Compliance alerts
- User approval notifications
6. Telegram Integration
Purpose: Real-time operational notifications
Notifications:
- Active user monitoring (15-minute pulses)
- System alerts
- Operational status updates
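The 15-minute active-user pulse could be assembled roughly as follows. This is a minimal Python sketch, not the platform's actual script; it assumes the Grafana log lines have already been reduced to one record per authenticated request, and `build_pulse`, `records`, and the sample usernames are hypothetical.

```python
from collections import Counter


def build_pulse(records, window_minutes=15, top_n=10):
    """Summarise per-user request counts into a plain-text message.

    `records` is an iterable of dicts with at least a "uname" key, one per
    authenticated request (an assumed, pre-parsed form of the Grafana log).
    """
    counts = Counter(r["uname"] for r in records)
    lines = [
        f"Athena active users (last {window_minutes} min)",
        f"Users active: {len(counts)}",
        f"Total requests: {sum(counts.values())}",
    ]
    # most_common() orders users by request count, highest first.
    for rank, (uname, n) in enumerate(counts.most_common(top_n), start=1):
        lines.append(f"{rank}. {uname}: {n}")
    return "\n".join(lines)


# Made-up sample: two users, three requests.
sample = [{"uname": "alice"}, {"uname": "bob"}, {"uname": "alice"}]
pulse = build_pulse(sample)
print(pulse)
```

Delivery would then be one call to the Telegram Bot API `sendMessage` method with a chat ID and this text; the bot token and chat ID are assumed to come from the credentials file used by the active-users service.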
Automated Workflows
User Registration & Approval Flow
  User Registers            Welcome Email Sent
         ↓                          ↓
┌────────────────┐           ┌──────────────┐
│ Signup Form    │──────────→│ Email System │
│ /signup        │           └──────────────┘
└────────┬───────┘                  │
         │                          ↓
         │                 ┌──────────────────┐
         │                 │ User receives    │
         │                 │ branded welcome  │
         │                 └──────────────────┘
         ↓
┌─────────────────────────────────────────┐
│ Account Created (Viewer role) │
└────────┬────────────────────────────────┘
│
↓ (within 1 minute)
┌─────────────────────────────────────────┐
│ Auto-Disable Script Runs │
│ • Detects new user │
│ • Disables account │
│ • Sends notification to admin │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Admin Reviews User │
│ • Checks email notification │
│ • Reviews user details │
│ • Decision: Approve or Reject │
└────────┬────────────────────────────────┘
│
↓ (if approved)
┌─────────────────────────────────────────┐
│ Admin Enables User │
│ • Via Web UI or API │
│ • User can now log in │
└─────────────────────────────────────────┘
Compliance Monitoring Flow
Every Hour (06:00-18:00 SAST)
↓
┌─────────────────────────────────────────┐
│ Compliance Script Runs │
│ • Parse Grafana logs (last 60 min) │
│ • Extract user activity │
│ • Identify accessed dashboards │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Check: Only Admin Users Active? │
│ (userId 1 or 4) │
└────────┬────────────────────────────────┘
         │
    ┌────┴────┐
    │ YES     │ NO (regular users active)
    ↓         ↓
┌───────┐ ┌──────────────────┐
│ Send  │ │ Exit silently    │
│ Alert │ │ (compliance OK)  │
└───┬───┘ └──────────────────┘
    │
    ↓
┌─────────────────────────────────────────┐
│ HTML Email to First2Lead │
│ • [email protected] │
│ • Shows admin activity │
│ • Lists dashboards accessed │
│ • Branded with Digital-On styling │
└─────────────────────────────────────────┘
Active User Monitoring Flow
Every 15 Minutes
↓
┌─────────────────────────────────────────┐
│ Active Users Script Runs │
│ • Query Grafana logs (last 15 min) │
│ • Extract authenticated requests │
│ • Count requests per user │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Aggregate User Activity │
│ • Username, user ID, org ID │
│ • Request count per user │
│ • Sort by activity level │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Send Telegram Pulse │
│ • Total users active │
│ • Total requests │
│ • Top 10 users by activity │
│ • Time window details │
└─────────────────────────────────────────┘
User Management System
User Roles & Permissions
| Role | Capabilities |
|---|---|
| Admin | Full platform access, user management, config |
| Editor | Create/edit dashboards, cannot manage users |
| Viewer | Read-only access to dashboards (default new users) |
Authentication Flow
User Access Request
↓
┌─────────────────────────────────────────┐
│ Check: Account Enabled? │
└────────┬────────────────────────────────┘
         │
    ┌────┴────┐
    │ NO      │ YES
    ↓         ↓
┌─────────┐ ┌──────────────────┐
│ Access  │ │ Check credentials│
│ Denied  │ │ (username/pass)  │
└─────────┘ └────────┬─────────┘
                     │
                ┌────┴────┐
                │ Valid?  │
                └────┬────┘
                     │
           ┌─────────┴─────────┐
           │ NO                │ YES
           ↓                   ↓
    ┌──────────────┐     ┌───────────┐
    │ Login Failed │     │ Create    │
    └──────────────┘     │ Session   │
                         └─────┬─────┘
                               ↓
                        ┌──────────────┐
                        │ Grant Access │
                        │ by Role      │
                        └──────────────┘
Password Reset Flow
User Requests Reset
↓
┌─────────────────────────────────────────┐
│ "Forgot Password?" Link │
│ User enters email │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Generate Reset Token │
│ • Unique code │
│ • 4-hour expiration │
│ • Stored in database │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Send Branded Reset Email │
│ • Digital-On styling │
│ • Reset link with code │
│ • Security warnings │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ User Clicks Link │
│ • Validates token │
│ • Checks expiration │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ User Sets New Password │
│ • Password strength validation │
│ • Token invalidated │
│ • Confirmation shown │
└─────────────────────────────────────────┘
Data Flow Architecture
Monitoring Data Flow
┌──────────────┐
│ Data Sources │
│ • Servers │
│ • Apps │
│ • Databases │
│ • APIs │
└──────┬───────┘
│ (metrics, logs, traces)
↓
┌──────────────────┐
│ Data Collection │
│ • Telegraf │
│ • Prometheus │
│ • Custom agents │
└──────┬───────────┘
│
↓
┌──────────────────┐
│ Time-Series DB │
│ • Redis │
│ • InfluxDB │
│ • Prometheus │
└──────┬───────────┘
│
↓
┌──────────────────────────────────┐
│ Grafana Enterprise │
│ • Query data sources │
│ • Transform & aggregate │
│ • Apply alert rules │
│ • Render visualizations │
└──────┬───────────────────────────┘
       │
       ├─────────────────┬─────────────────┬─────────────────┐
       ↓                 ↓                 ↓                 ↓
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ Dashboards  │ │ Alerts       │ │ Reports     │ │ APIs         │
│ (Web UI)    │ │ (Email/TG)   │ │ (PDF/Email) │ │ (External)   │
└─────────────┘ └──────────────┘ └─────────────┘ └──────────────┘
User Activity Logging Flow
User Action in Grafana
↓
┌─────────────────────────────────────────┐
│ Grafana Request Handler │
│ • HTTP request received │
│ • Session validation │
│ • Authorization check │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Logger (router_logging) │
│ • Context: userId, orgId, uname │
│ • Request: method, path, status │
│ • Performance: duration, size │
└────────┬────────────────────────────────┘
         │
         ├──────────────┬──────────────────┐
         ↓              ↓                  ↓
┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ journalctl   │ │ /var/log/   │ │ PostgreSQL   │
│ (systemd)    │ │ grafana/    │ │ (audit log)  │
└──────┬───────┘ └──────┬──────┘ └──────┬───────┘
       │                │               │
       └────────┬───────┴───────────────┘
                ↓
┌─────────────────────────────────────────┐
│ Monitoring Scripts │
│ • Compliance monitoring │
│ • Active user tracking │
│ • Audit reporting │
└─────────────────────────────────────────┘
Security Architecture
Security Layers
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Network Security │
│ • Cloudflare Tunnel (TLS 1.3) │
│ • No exposed ports │
│ • DDoS protection │
│ • Rate limiting │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Authentication │
│ • Username/password (bcrypt hashed) │
│ • Session management │
│ • Password complexity requirements │
│ • Account lockout after failed attempts │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Authorization │
│ • Role-based access control (RBAC) │
│ • Organization-level isolation │
│ • Dashboard-level permissions │
│ • Data source access control │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Application Security │
│ • Auto-disable new accounts │
│ • Admin approval required │
│ • Input validation │
│ • XSS/CSRF protection │
│ • SQL injection prevention │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Data Security │
│ • PostgreSQL authentication │
│ • Encrypted database connections │
│ • Secure credential storage │
│ • Audit logging │
└─────────────────────────────────────────────────────────────┘
Secret Management
Secrets Storage:
- /etc/default/grafana-auto-disable: Auto-disable credentials
- /etc/default/grafana-compliance: Compliance monitoring credentials
- /etc/default/grafana-active-users: Telegram bot credentials
- /etc/grafana/grafana.ini: Database and SMTP credentials
- /etc/msmtprc: Email SMTP credentials
Permissions: All secret files are 600 (root-only read/write)
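A periodic check that the secret files still carry mode 600 might look like the sketch below. It is illustrative only: `insecure_secret_files` is a hypothetical helper, and the demonstration runs against temporary files rather than the real paths listed above.

```python
import os
import stat
import tempfile


def insecure_secret_files(paths):
    """Return the paths whose mode is not exactly 0o600, or that are missing."""
    bad = []
    for path in paths:
        try:
            mode = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            bad.append(path)
            continue
        if mode != 0o600:
            bad.append(path)
    return bad


# Demonstration on temporary files (POSIX permissions assumed).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good")
    loose = os.path.join(tmp, "loose")
    for p in (good, loose):
        open(p, "w").close()
    os.chmod(good, 0o600)
    os.chmod(loose, 0o644)
    demo_result = insecure_secret_files([good, loose])
    demo_expected = [loose]
```

A stricter variant could also compare `os.stat(path).st_uid` against 0 to confirm root ownership.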
Monitoring & Observability
Platform Health Monitoring
Metrics Tracked:
- Grafana uptime and performance
- PostgreSQL connection pool status
- Redis hit/miss ratios
- Email delivery success rates
- User session counts
- Dashboard load times
- API response times
Alert Channels:
1. Email: High-priority alerts
2. Telegram: Real-time operational updates
3. Grafana UI: Dashboard-based alerts
Compliance & Audit
Compliance Monitoring:
- Hourly checks for admin-only activity
- Automated reporting to First2Lead
- Dashboard access tracking
- User activity correlation
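The hourly admin-only check can be sketched in Python as follows. This is an illustration under stated assumptions rather than the deployed script: it assumes logfmt-style request lines carrying `userId=` and `uname=` fields, treats `userId=0` as an anonymous request, and counts the 18:00 run as inside the window. The helper names and sample log lines are hypothetical; the admin IDs 1 and 4 and the 06:00-18:00 SAST window come from the compliance flow.

```python
import re
from datetime import datetime, timedelta, timezone

SAST = timezone(timedelta(hours=2))  # South Africa Standard Time (UTC+2, no DST)
ADMIN_IDS = {1, 4}                   # admin userIds named in the compliance flow

# Matches key=value or key="quoted value" pairs (logfmt-style).
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Parse one logfmt-style line into a dict of strings."""
    return {k: v.strip('"') for k, v in _PAIR.findall(line)}


def in_window(now):
    """True from 06:00 through the 18:00 hour, SAST (18:00 inclusion is assumed)."""
    return 6 <= now.astimezone(SAST).hour <= 18


def admin_only_activity(lines):
    """Return the active admin IDs if only admins were active, else an empty set."""
    active = set()
    for line in lines:
        uid = parse_line(line).get("userId", "")
        if uid.isdigit() and int(uid) > 0:  # skip anonymous requests (assumption)
            active.add(int(uid))
    return active if active and active <= ADMIN_IDS else set()


# Made-up sample lines in the assumed format.
sample = [
    'logger=context userId=1 orgId=1 uname=admin method=GET path=/d/abc status=200',
    'logger=context userId=4 orgId=1 uname=ops method=GET path=/api/search status=200',
    'logger=context userId=0 orgId=0 uname= method=GET path=/login status=200',
]
regular = 'logger=context userId=7 orgId=1 uname=nurse method=GET path=/d/xyz status=200'
result_admin_only = admin_only_activity(sample)
result_mixed = admin_only_activity(sample + [regular])
```

With these samples, `result_admin_only` is non-empty (an alert would be sent) and `result_mixed` is empty (the script would exit silently).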
Audit Logging:
- All user logins
- Dashboard access
- Configuration changes
- User management actions
- API calls
Log Retention:
- System logs: 30 days (journalctl)
- Grafana logs: 30 days (/var/log/grafana/)
- PostgreSQL logs: 7 days
- Audit logs: 90 days (database)
Operational Workflows
Daily Operations
Automated Tasks:
- ✅ User activity monitoring (every 15 minutes)
- ✅ Auto-disable new users (every minute)
- ✅ Compliance monitoring (hourly, 06:00-18:00 SAST)
- ✅ Database connection health checks
- ✅ Service status monitoring
Manual Tasks:
- Review pending user approvals (as needed)
- Review compliance alerts (daily)
- Dashboard maintenance (weekly)
- System updates (monthly)
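The selection step of the every-minute auto-disable task might be sketched as below. How the real script recognises a "new" user is not documented here, so the sketch assumes a locally kept set of already-reviewed user IDs; `users_to_disable`, `disable_path`, and the sample accounts are hypothetical. The record shape follows Grafana's user search response, and the path is Grafana's admin endpoint for disabling a user (sent as a POST).

```python
def users_to_disable(users, reviewed_ids):
    """Pick accounts that are enabled, not Grafana admins, and not yet reviewed."""
    return [
        u for u in users
        if not u.get("isDisabled", False)
        and not u.get("isAdmin", False)
        and u["id"] not in reviewed_ids
    ]


def disable_path(user_id):
    """Path of the Grafana admin endpoint that disables a user (POST)."""
    return f"/api/admin/users/{user_id}/disable"


# Shaped like entries from GET /api/users; the values are made up.
users = [
    {"id": 1, "login": "admin", "isAdmin": True, "isDisabled": False},
    {"id": 7, "login": "approved.user", "isAdmin": False, "isDisabled": False},
    {"id": 9, "login": "new.signup", "isAdmin": False, "isDisabled": False},
]
pending = users_to_disable(users, reviewed_ids={7})
paths = [disable_path(u["id"]) for u in pending]
```

Approval is the mirror image: an admin (or a script) posts to the matching `/enable` endpoint and adds the ID to the reviewed set.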
Incident Response
Alert Triggered
↓
┌─────────────────────────────────────────┐
│ Notification Sent │
│ • Telegram pulse │
│ • Email alert │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Triage │
│ • Severity assessment │
│ • Impact analysis │
│ • Initial investigation │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Diagnosis │
│ • Check system logs │
│ • Review metrics │
│ • Identify root cause │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Remediation │
│ • Apply fix │
│ • Verify resolution │
│ • Document incident │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Post-Mortem │
│ • Root cause analysis │
│ • Prevention measures │
│ • Documentation update │
└─────────────────────────────────────────┘
Scalability & Performance
Current Capacity
- Concurrent Users: 100+
- Dashboards: Unlimited
- Data Sources: 20+
- Query Performance: < 2s average
- Uptime Target: 99.5%
Performance Optimization
Database:
- Connection pooling (max 100 connections)
- Query optimization with indexes
- Regular VACUUM operations
- Monitoring slow queries
Caching:
- Redis caching for frequent queries
- Browser caching for static assets
- Dashboard cache TTL: 5 minutes
Resource Limits:
- Grafana memory: 2GB max
- PostgreSQL memory: 4GB shared buffers
- Redis memory: 1GB max
Disaster Recovery
Backup Strategy
Database Backups:
- Frequency: Daily automated backups
- Retention: 30 days
- Location: Local + offsite
- Type: Full PostgreSQL dumps
Configuration Backups:
- Grafana configuration files
- Systemd unit files
- Email templates
- Scripts and automation
Recovery Objectives:
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
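The 30-day retention and 24-hour RPO figures lend themselves to a simple automated check. The following Python sketch assumes backup timestamps are already known (for example from file modification times); `expired_backups`, `rpo_met`, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # backup retention stated above
RPO = timedelta(hours=24)       # recovery point objective stated above


def expired_backups(backup_times, now):
    """Backups older than the retention period, oldest first."""
    return sorted(t for t in backup_times if now - t > RETENTION)


def rpo_met(backup_times, now):
    """True if the newest backup is no older than the RPO."""
    return bool(backup_times) and now - max(backup_times) <= RPO


# Made-up sample: one stale backup and two recent ones.
now = datetime(2025, 3, 31, 12, 0)
backups = [
    datetime(2025, 2, 1, 2, 0),
    datetime(2025, 3, 30, 2, 0),
    datetime(2025, 3, 31, 2, 0),
]
to_prune = expired_backups(backups, now)
rpo_ok = rpo_met(backups, now)
```

With daily backups the RPO holds as long as no nightly run is missed; a failed run pushes the newest backup past 24 hours and `rpo_met` returns False.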
Failover Procedures
- Database Failure: Restore from latest backup
- Grafana Failure: Restart service, verify logs
- Network Failure: Cloudflare automatic failover
- Complete System Failure: Rebuild from documentation + backups
Integration Points
External Systems
- Mailgun (SMTP)
  - Transactional emails
  - Delivery tracking
  - Bounce handling
- Telegram
  - Bot API for notifications
  - Group messaging
  - Real-time alerts
- Cloudflare
  - DNS management
  - Tunnel service
  - Analytics
- Digital-On Support Portal
  - Support ticketing integration
  - User documentation links
  - Contact management
API Endpoints
Grafana HTTP API:
- /api/users: User management
- /api/dashboards: Dashboard operations
- /api/datasources: Data source config
- /api/admin/users: Admin operations
- /api/health: Health checks
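As an example of using these endpoints, a health probe against `/api/health` could be sketched as follows. The helper names are hypothetical, and the sample body only mirrors the general shape of Grafana's health response (a `database` field reporting `ok`); the network call is defined but not executed here.

```python
import json
import urllib.request


def database_ok(payload):
    """True if a /api/health JSON body reports database == "ok"."""
    return payload.get("database") == "ok"


def check_health(base_url, timeout=5.0):
    """GET <base_url>/api/health and report whether the database is healthy.

    urlopen raises urllib.error.HTTPError for non-2xx responses, so callers
    may want to catch it and treat it as unhealthy.
    """
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return resp.status == 200 and database_ok(json.load(resp))


# Offline demonstration with a body shaped like the health response.
sample_body = json.loads('{"database": "ok", "version": "12.3.0"}')
sample_healthy = database_ok(sample_body)
```

In use, `check_health("https://athena.digitalon.co.za")` would perform the live request.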
Technology Stack Summary
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Platform | Grafana Enterprise | 12.3.0 | Monitoring & Visualization |
| Database | PostgreSQL | 16 | Data persistence |
| Cache | Redis | Latest | Time-series & caching |
| Tunnel | Cloudflare | Latest | Secure access |
| Email | Mailgun + msmtp | Latest | Email delivery |
| Notifications | Telegram Bot API | Latest | Real-time alerts |
| OS | Ubuntu Server | Latest LTS | Host operating system |
| Init | systemd | Latest | Service management |
| Scripting | Bash | 5.2+ | Automation scripts |
Success Metrics
Key Performance Indicators (KPIs)
Platform Availability:
- Target: 99.5% uptime
- Measured: Last 30 days average
User Satisfaction:
- Login success rate > 99%
- Dashboard load time < 2s
- Zero data loss incidents
Security Compliance:
- 100% of new users require approval
- Hourly compliance monitoring active
- Zero unauthorized access incidents
Operational Efficiency:
- Automated user management
- Automated compliance reporting
- 15-minute active user visibility
- Minimal manual intervention required
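To make the availability target concrete, the 99.5% figure over the 30-day measurement window corresponds to a downtime budget of 216 minutes (3.6 hours). A short Python sketch of the arithmetic, with a hypothetical helper name:

```python
def downtime_budget_minutes(target_pct, window_days):
    """Allowed downtime in minutes for an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - target_pct) / 100


# 30 days = 43,200 minutes; 0.5% of that is 216 minutes.
budget = downtime_budget_minutes(99.5, 30)
print(budget)
```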
Future Roadmap
Planned Enhancements
- Enhanced Monitoring
  - Additional data source integrations
  - Custom plugin development
  - Advanced alerting rules
- User Experience
  - Mobile app support
  - Customizable user dashboards
  - Advanced visualization options
- Automation
  - Automated dashboard provisioning
  - Self-service user management
  - Intelligent alert correlation
- Integration
  - API-first architecture expansion
  - Webhook support for external systems
  - SSO/LDAP integration
- Analytics
  - Usage analytics dashboard
  - Performance trending
  - Capacity planning insights
Conclusion
The Athena Platform represents a complete, production-ready monitoring solution for healthcare technology infrastructure. With automated compliance monitoring, secure user management, professional branding, and comprehensive observability, it provides Digital-On Tech with the tools needed to maintain visibility and control over critical systems.
Key Strengths:
- ✅ Professional, branded user experience
- ✅ Automated compliance and security workflows
- ✅ Real-time monitoring and alerting
- ✅ Comprehensive documentation
- ✅ Production-ready and scalable
Platform Status: