Doc: detailed algorithm of the SANEF patching process (12 sections)

2026-04-17 23:32:04 +00:00 · 2026-04-17 23:32:04 +00:00 · 803016458d
commit 803016458d
parent 9a72fa7eb7
1 changed files with 641 additions and 0 deletions
--- a/SANEF_PATCHING_PROCESS.md
+++ b/SANEF_PATCHING_PROCESS.md
@@ -0,0 +1,641 @@
 # SANEF Patching Process: Detailed Algorithm
 > Technical document extracted from the PatchCenter code (April 2026).
 > Covers the full cycle: planning → execution → validation → reporting.
 ---
 ## 1. Overview
 ```
 PLANNING → CAMPAIGN → PREREQUISITES → CORRESPONDENCE → VALIDATION → EXECUTION → POST-PATCH → HISTORY
   W-2        W-1           W-1          continuous        D-1           D          D+1      continuous
 ```
 **Actors**:
 - Coordinator: creates campaigns, assigns operators, manages the schedule
 - SecOps operators (6): Khalid, Mouaad, Thierno, Paul, Joel, Ayoub
 - Application owner: validates non-prod post-patching
 - IT department (DSI): consults the dashboard and reports
 **Business rules**:
 - No patching on Fridays (weekend risk)
 - Window: 9h–21h (27 slots of 15 min per day)
 - Budget: 35 h/week for ~20 servers
 - Non-prod BEFORE prod (validation mandatory)
 - Snapshot mandatory before patching (so that rollback is possible)
 - Freeze: S01, S30, S34, S51–S53 + public holidays
 - Trafic prod: forbidden January–March (déhivernage)
 ---
 ## 2. Annual planning
 **Table**: `patch_planning`
 **Source**: `Planning Patching 2026_ayoub.xlsx` → `tools/import_planning_xlsx.py`
 ### 2.1 Structure
 | Field | Description |
 |-------|-------------|
 | `year` | 2026 |
 | `week_number` | 1–53 |
 | `week_code` | S01–S53 |
 | `cycle` | 1 (Jan–Apr), 2 (Apr–Aug), 3 (Sep–Dec) |
 | `domain_code` | FK → `domains.code` (INFRASTRUC, trafic, PEA, FL, BI, GESTION) |
 | `env_scope` | prod, hprod, all, pilot, prod_pilot |
 | `status` | open, freeze, holiday, empty |
 ### 2.2 Typical cycle
 ```
 Cycle 1 (Patch 1): S02–S15
    S02: Infrastructure HPROD
    S03: Trafic HPROD
    S04: Trafic PROD
    S05: Infrastructure PROD
    S06–S07: empty
    S08: Péage HPROD/PROD Pilot
    S09: Péage PROD
    S10: empty
    S11: FL Test/Acceptance (Recette)/Dev
    S12: FL Pre-Prod
    S13: BI + Gestion
    S14–S15: FL Prod
 Cycle 2 (Patch 2): S16–S35 (same rotation, shifted)
 Cycle 3 (Patch 3): S36–S50 (same rotation, shifted)
 ```
 ### 2.3 Planning validation algorithm
 ```
 IF year < current_year → REJECT
 IF year == current_year AND week < current_week → REJECT
 IF year == current_year AND week == current_week AND weekday > Tuesday → REJECT (too late)
 week_start = ISO Monday of the week
 week_end = Sunday
 week_code = f"S{week:02d}"
 INSERT INTO patch_planning (year, week_number, week_code, week_start, week_end,
                            cycle, domain_code, env_scope, status)
 ```
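The validation rules above can be sketched in Python as follows. This is a minimal illustration, not the PatchCenter implementation: the function name `validate_planning_week` and the returned dictionary are assumptions, and the comparison uses the ISO year and week of `today`.

```python
from datetime import date, timedelta


def validate_planning_week(year: int, week: int, today: date) -> dict:
    """Return the derived week fields, or raise ValueError if the week is rejected.

    Hypothetical helper: name and return shape are illustrative only.
    """
    cur_year, cur_week, cur_weekday = today.isocalendar()  # Monday=1 ... Sunday=7
    if year < cur_year:
        raise ValueError("year is in the past")
    if year == cur_year and week < cur_week:
        raise ValueError("week is in the past")
    if year == cur_year and week == cur_week and cur_weekday > 2:
        raise ValueError("too late: current week, after Tuesday")
    week_start = date.fromisocalendar(year, week, 1)  # ISO Monday of that week
    return {
        "week_code": f"S{week:02d}",
        "week_start": week_start,
        "week_end": week_start + timedelta(days=6),  # Sunday
    }
```

The returned fields correspond to the `week_code`, `week_start` and `week_end` columns of the `INSERT` above; the remaining columns come from the request itself.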
 ---
 ## 3. Campaign creation
 **Tables**: `campaigns`, `patch_sessions`
 **Route**: `POST /campaigns/create`
 ### 3.1 Selecting eligible servers
 ```sql
 -- Eligibility criteria (all mandatory)
 WHERE servers.os_family = 'linux'
  AND servers.etat = 'Production'
  AND servers.patch_os_owner = 'secops'
  AND servers.licence_support IN ('active', 'els')
 -- Domain/environment filter, driven by the week's planning
 -- If env_scope = 'prod'       → environment.name = 'Production'
 -- If env_scope = 'hprod'      → environment.name != 'Production'
 -- If env_scope = 'all'        → all environments of the domain
 -- If env_scope = 'prod_pilot' → Production + Pilote
 -- If domain_code = 'DMZ'      → also includes zone = 'DMZ'
 ```
 ### 3.2 hprod / prod partitioning
 ```python
 hprod_servers = [s for s in eligible if s.env_name != 'Production' and s.id not in excluded]
 prod_servers  = [s for s in eligible if s.env_name == 'Production' and s.id not in excluded]
 # Sort by (app_group, hostname) so that servers of the same application stay together
 hprod_servers.sort(key=lambda s: (s.app_group or '', s.hostname))
 prod_servers.sort(key=lambda s: (s.app_group or '', s.hostname))
 ```
 ### 3.3 Slot allocation
 ```
 27 slots/day (15 min):
  Morning   : 09h00, 09h15, ..., 12h00, 12h15  (13 slots)
  Afternoon : 14h00, 14h15, ..., 16h15, 16h30  (14 slots)
 hprod days: Monday + Tuesday      → 54 slots max
 prod days : Wednesday + Thursday  → 54 slots max
 For each server:
  day  = days[slot_index // 27]
  time = DAILY_SLOTS[slot_index % 27]
  IF pref_patch_jour != 'indifferent' → force that day
  IF pref_patch_heure != 'indifferent' → force that time
  IF default_intervenant_id exists → force that operator (forced_assignment=true)
 INSERT INTO patch_sessions (campaign_id, server_id, status='pending',
                            date_prevue, heure_prevue, intervenant_id, forced_assignment)
 ```
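The index arithmetic above (`slot_index // per_day` for the day, `slot_index % per_day` for the time) can be illustrated with a short sketch. The function `allocate_slots` and its parameters are hypothetical; the list of daily slots is passed in rather than hard-coded, and the day/time preferences and forced operators are deliberately left out.

```python
from datetime import date, time, timedelta


def allocate_slots(hostnames, week_start, day_offsets, daily_slots):
    """Assign (hostname, date, time) in order, filling one day after another.

    day_offsets: offsets from week_start, e.g. (0, 1) for Monday + Tuesday.
    daily_slots: ordered list of datetime.time values for one day.
    """
    per_day = len(daily_slots)
    capacity = per_day * len(day_offsets)
    if len(hostnames) > capacity:
        raise ValueError(f"{len(hostnames)} servers exceed {capacity} slots")
    plan = []
    for slot_index, hostname in enumerate(hostnames):
        day = week_start + timedelta(days=day_offsets[slot_index // per_day])
        plan.append((hostname, day, daily_slots[slot_index % per_day]))
    return plan


# Usage with a deliberately tiny slot list (three slots per day):
demo = allocate_slots(
    ["srv1", "srv2", "srv3", "srv4"],
    date(2026, 5, 11),           # a Monday
    (0, 1),                      # Monday + Tuesday
    [time(9, 0), time(9, 15), time(9, 30)],
)
```

Raising an error when the servers exceed the capacity is a design choice of this sketch; the source only states the 54-slot maximum per branch.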
 ### 3.4 Automatic operator assignment
 ```
 Rules by priority (table default_assignments):
  1. By server     (rule_type='server',    rule_value=hostname)
  2. By app_type   (rule_type='app_type',  rule_value=app_type)
  3. By app_group  (rule_type='app_group', rule_value=app_group)
  4. By domain     (rule_type='domain',    rule_value=domain_code)
  5. By zone       (rule_type='zone',      rule_value=zone_name)
 For each rule (priority ASC):
  MATCH unassigned sessions → SET intervenant_id = rule.user_id
 app_group auto-propagation:
  IF an operator is assigned to a server with app_group X:
    → All other servers with app_group X in the same campaign
      receive the same operator (automatic propagation)
 ```
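A minimal sketch of the priority-ordered matching and the app_group propagation, assuming sessions and rules are plain dictionaries. The names `assign_operators`, `RULE_PRIORITY` and `RULE_FIELD` are illustrative and do not come from the PatchCenter code.

```python
RULE_PRIORITY = ["server", "app_type", "app_group", "domain", "zone"]
# Session attribute that each rule_type is compared against (assumed mapping).
RULE_FIELD = {
    "server": "hostname",
    "app_type": "app_type",
    "app_group": "app_group",
    "domain": "domain_code",
    "zone": "zone_name",
}


def assign_operators(sessions, rules):
    """Fill session['intervenant_id'] in place for unassigned sessions."""
    ordered = sorted(rules, key=lambda r: RULE_PRIORITY.index(r["rule_type"]))
    for rule in ordered:
        field = RULE_FIELD[rule["rule_type"]]
        for s in sessions:
            if s.get("intervenant_id") is None and s.get(field) == rule["rule_value"]:
                s["intervenant_id"] = rule["user_id"]
    # Propagation: unassigned servers inherit the operator already assigned
    # to another server of the same app_group.
    by_group = {}
    for s in sessions:
        if s.get("app_group") and s.get("intervenant_id") is not None:
            by_group.setdefault(s["app_group"], s["intervenant_id"])
    for s in sessions:
        if s.get("intervenant_id") is None and s.get("app_group") in by_group:
            s["intervenant_id"] = by_group[s["app_group"]]
    return sessions
```

Because higher-priority rules run first and only touch unassigned sessions, a server-level rule always wins over a domain-level rule for the same server.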
 ### 3.5 Operator limits
 ```sql
 -- Table campaign_operator_limits (optional, per campaign)
 -- If max_servers > 0 AND count >= max_servers → assignment refused
 SELECT COUNT(*) FROM patch_sessions
 WHERE campaign_id = :cid AND intervenant_id = :uid AND status != 'excluded'
 ```
 ---
 ## 4. Prerequisite checks
 **Service**: `prereq_service.py`
 **Route**: `POST /campaigns/{id}/check-prereqs`
 ### 4.1 Per-server algorithm
 ```
 FOR each session (status='pending'):
  1. ELIGIBILITY
     IF licence_support = 'obsolete' → EXCLUDE (reason: EOL)
     IF etat != 'Production'         → EXCLUDE (reason: non_patchable)
  2. TCP CONNECTIVITY (port 22)
     DNS resolution with suffixes:
       "" → ".sanef.groupe" → ".sanef-rec.fr" → ".sanef.fr"
     FOR each suffix:
       socket.connect(hostname+suffix, port=22, timeout=5)
       IF OK → prereq_ssh = 'ok', BREAK
     IF none → prereq_ssh = 'ko' → EXCLUDE (reason: creneau_inadequat)
  3. ROLLBACK METHOD
     IF machine_type = 'vm'       → rollback_method = 'snapshot'
     IF machine_type = 'physical' → rollback_method = 'na'
  4. SSH CHECKS (if a key is available)
     Paramiko connection with key /opt/patchcenter/keys/id_rsa_cybglobal.pem
     a) Disk space:
        Command    : df -BM --output=target,avail | grep -E '^/ |^/var'
        Thresholds : / >= 1200 MB, /var >= 800 MB
        IF KO      → prereq_disk_ok = false
     b) Red Hat Satellite:
        Command : subscription-manager identity
        IF "not_registered" → prereq_satellite = 'ko'
        ELSE                → prereq_satellite = 'ok'
  5. RESULT
     prereq_validated = (ssh=ok AND disk_ok != false AND eligible)
     UPDATE patch_sessions SET
       prereq_ssh, prereq_satellite, rollback_method,
       prereq_disk_root_mb, prereq_disk_var_mb, prereq_disk_ok,
       prereq_validated, prereq_date = now()
  6. AUTO-EXCLUSION
     IF prereq_ssh='ko' OR prereq_disk_ok=false OR licence='obsolete':
       UPDATE patch_sessions SET
         status = 'excluded',
         exclusion_reason = '...',
         excluded_by = 'system',
         excluded_at = now()
 ```
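Steps 2 and 4a can be sketched as two small helpers. This is an illustration under assumptions: the names `first_reachable_fqdn` and `check_disk` are hypothetical, the disk check only evaluates the mount points actually present in the `df` output, and the real service may parse the output differently.

```python
import socket

DNS_SUFFIXES = ["", ".sanef.groupe", ".sanef-rec.fr", ".sanef.fr"]
MIN_FREE_MB = {"/": 1200, "/var": 800}


def first_reachable_fqdn(hostname, port=22, timeout=5):
    """Return the first hostname+suffix accepting a TCP connection, else None."""
    for suffix in DNS_SUFFIXES:
        fqdn = hostname + suffix
        try:
            with socket.create_connection((fqdn, port), timeout=timeout):
                return fqdn
        except OSError:  # covers DNS failure, connection refusal and timeout
            continue
    return None


def check_disk(df_output):
    """Parse `df -BM --output=target,avail` text into ({mount: free_mb}, disk_ok)."""
    free = {}
    for line in df_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in MIN_FREE_MB:
            free[parts[0]] = int(parts[1].rstrip("M"))
    disk_ok = all(free[m] >= MIN_FREE_MB[m] for m in free)
    return free, disk_ok
```

For example, `check_disk("/ 5000M\n/var 500M")` reports `/var` below its 800 MB threshold, which would set `prereq_disk_ok = false` and trigger the auto-exclusion of step 6.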
 ---
 ## 5. Prod ↔ non-prod correspondence
 **Table**: `server_correspondance`
 **Service**: `correspondance_service.py`
 ### 5.1 Automatic detection by hostname signature
 ```
 SANEF naming rule:
  Position 1 : arbitrary prefix
  Position 2 : environment indicator
    p, s         → Production
    r            → Acceptance (Recette)
    t            → Test
    i, o         → Pre-production
    v            → Validation
    d            → Development
  Position 3+  : application suffix
 SIGNATURE = pos1 + "_" + pos3+
  Example: "vpinfadns1" → signature "v_infadns1"
           "vdinfadns1" → signature "v_infadns1" (same signature)
           "vpinfadns1" = prod, "vdinfadns1" = dev → LINK
 Exceptions (no auto-detection):
  - Hostnames starting with "ls-" or "sp"
 FOR each signature:
  prods    = [servers with env_char ∈ {p, s}]
  nonprods = [servers with env_char ∈ {r, t, i, v, d, o}]
  IF 1 prod AND N nonprods:
    → Create N links (prod_server_id, nonprod_server_id, source='auto')
  IF 0 prod: orphans (no link)
  IF >1 prod: ambiguous (skip)
 ```
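The signature rule and the one-prod-to-N-non-prods linking can be sketched as below. The function names are illustrative, hostnames stand in for server IDs, and lower-casing the hostname is an assumption not stated in the source.

```python
from collections import defaultdict

PROD_CHARS = set("ps")
NONPROD_CHARS = set("rtivdo")
EXCLUDED_PREFIXES = ("ls-", "sp")


def signature(hostname):
    """Return (signature, env_char), or None when auto-detection does not apply."""
    h = hostname.lower()
    if len(h) < 3 or h.startswith(EXCLUDED_PREFIXES):
        return None
    return f"{h[0]}_{h[2:]}", h[1]


def auto_links(hostnames):
    """Return (prod, nonprod) pairs for signatures that have exactly one prod."""
    groups = defaultdict(lambda: {"prod": [], "nonprod": []})
    for h in hostnames:
        sig = signature(h)
        if sig is None:
            continue
        key, env_char = sig
        if env_char in PROD_CHARS:
            groups[key]["prod"].append(h)
        elif env_char in NONPROD_CHARS:
            groups[key]["nonprod"].append(h)
    links = []
    for g in groups.values():
        if len(g["prod"]) == 1:  # 0 prod: orphans; more than 1 prod: ambiguous
            links.extend((g["prod"][0], n) for n in g["nonprod"])
    return links
```

With the document's example, `signature("vpinfadns1")` and `signature("vdinfadns1")` share the key `"v_infadns1"`, so `auto_links` pairs them.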
 ### 5.2 Manual link
 ```sql
 INSERT INTO server_correspondance
  (prod_server_id, nonprod_server_id, environment_code, source, created_by)
 VALUES (:prod_id, :nonprod_id, :env_code, 'manual', :user_id)
 ```
 ---
 ## 6. Post-patching validation
 **Table**: `patch_validation`
 **Route**: `/patching/validations`
 ### 6.1 Lifecycle
 ```
                    ┌──────────────────┐
 Patching done    → │   en_attente     │ ← owner notification
                    └────────┬─────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        validated_ok    validated_ko      forced
      (OK, no issue)  (problem detected)  (forced by admin)
 ```
 ### 6.2 Prod blocking rule
 ```python
 def can_patch_prod(db, prod_server_id):
     """A prod server may be patched only if ALL its linked non-prods are validated.

     Assumes `db.execute(sql, params)` returns a DB-API cursor (e.g. sqlite3).
     """
     links = db.execute(
         "SELECT nonprod_server_id FROM server_correspondance"
         " WHERE prod_server_id = :prod_id",
         {"prod_id": prod_server_id},
     ).fetchall()
     blockers = []
     for (nonprod_id,) in links:  # no link → orphan, nothing blocks
         row = db.execute(
             "SELECT status FROM patch_validation WHERE server_id = :sid"
             " ORDER BY patch_date DESC LIMIT 1",
             {"sid": nonprod_id},
         ).fetchone()
         if row is None or row[0] not in ("validated_ok", "forced"):
             blockers.append(nonprod_id)  # this non-prod is not validated
     return (not blockers, blockers)  # (all_validated, list_of_blockers)
 ```
 ---
 ## 7. Execution: Standard Campaign mode
 **Table**: `patch_sessions`
 ### 7.1 State machine
 ```
 pending
  ├─→ excluded     (automatic prerequisite exclusion OR manual exclusion)
  ├─→ prereq_ok    (prerequisites validated)
  └─→ in_progress  (operator starts patching)
       ├─→ patched  (success)
       ├─→ failed   (failure)
       └─→ reported (validated + postponed)
 excluded  → can be restored by an admin
 patched   → terminal (crée patch_validation)
 failed    → terminal (investigation)
 reported  → terminal
 ```
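One way to enforce such a state machine is a transition table checked before every status update. The sketch below is illustrative only: two edges are assumptions that the diagram does not draw explicitly (`excluded → pending` for the admin restore, and `prereq_ok → in_progress`).

```python
TRANSITIONS = {
    "pending": {"excluded", "prereq_ok", "in_progress"},
    "prereq_ok": {"in_progress"},  # assumption: not drawn in the diagram
    "excluded": {"pending"},       # assumption: an admin restore returns to 'pending'
    "in_progress": {"patched", "failed", "reported"},
    "patched": set(),              # terminal
    "failed": set(),               # terminal
    "reported": set(),             # terminal
}


def transition(current, new):
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Keeping the allowed moves in one table makes terminal states explicit: any attempt to leave `patched`, `failed` or `reported` is rejected.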
 ### 7.2 Execution order
 ```
 Monday    → Non-prod servers 1–27
 Tuesday   → Non-prod servers 28–54
           → Application owners notified
 Wednesday → Non-prod validation (owner validates OK/KO)
           → IF all non-prods OK → green light for prod
 Wednesday → Prod servers 1–27
 Thursday  → Prod servers 28–54
 ```
 ---
 ## 8. Execution: QuickWin mode (semi-automatic)
 **Tables**: `quickwin_runs`, `quickwin_entries`, `quickwin_logs`
 **Services**: `quickwin_service.py`, `quickwin_prereq_service.py`, `quickwin_snapshot_service.py`
 ### 8.1 Run state machine
 ```
 draft → prereq → snapshot → patching → result → completed
  ↑                                                  │
  └──────────────── revert to draft ─────────────────┘
 ```
 ### 8.2 Phase 1: Creating the run
 ```python
 # Input : year, week, label, server_ids
 # Output: run_id + quickwin_entries
 # Excerpt: get_secret, db, run_id and servers come from the surrounding service code.
 reboot_pkgs = get_secret("patching_reboot_packages")
 # kernel*, glibc*, systemd*, dbus*, polkit*, linux-firmware*,
 # microcode_ctl*, tuned*, dracut*, grub2*, kexec-tools*,
 # libselinux*, selinux-policy*, shim*, mokutil*,
 # net-snmp*, NetworkManager*, network-scripts*, nss*, openssl-libs*
 for server in servers:  # one entry per requested server_id
     branch = "prod" if server.env_name == "Production" else "hprod"
     db.execute(
         "INSERT INTO quickwin_entries (run_id, server_id, branch, status,"
         " general_excludes, specific_excludes)"
         " VALUES (:run_id, :server_id, :branch, 'pending', :general, :specific)",
         {"run_id": run_id, "server_id": server.id, "branch": branch,
          "general": reboot_pkgs, "specific": server.patch_excludes},
     )
 ```
 ### 8.3 Phase 2: Prerequisites (SSE streaming)
 ```
 FOR each entry (requested branch, not excluded):
  1. DNS RESOLUTION
     Environment detected from the 2nd character of the hostname:
       prod/preprod    : try sanef.groupe, then sanef-rec.fr
       acceptance/test : try sanef-rec.fr, then sanef.groupe
     socket.getaddrinfo(fqdn, 22) → IF resolved: OK
  2. SSH CONNECTION (fallback chain)
     IF ssh_method = "ssh_psmp":
       TRY      : PSMP connection (psmp.sanef.fr, user="CYBP01336@cybsecope@{fqdn}")
       FALLBACK : direct SSH key connection
     ELSE:
       TRY      : direct SSH key connection
       FALLBACK : PSMP connection
  3. DISK SPACE (over SSH)
     sudo df / /var --output=target,pcent
     IF usage >= 90% → disk_ok = false
  4. SATELLITE / YUM
     sudo subscription-manager status
     OR sudo yum repolist
     IF 0 repos → satellite_ok = false
  5. RESULT
     prereq_ok = dns AND ssh AND satellite AND disk
     UPDATE quickwin_entries SET prereq_ok, prereq_detail, prereq_date
     EMIT SSE event → real-time display in the browser
 ```
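The environment-dependent DNS order of step 1 can be sketched as follows. The helper names are hypothetical; the sketch assumes that `r` and `t` (acceptance/test) put `sanef-rec.fr` first and that every other environment character uses the prod order, which the source does not specify. The resolver is injectable so the logic can be exercised without network access.

```python
import socket


def candidate_fqdns(hostname):
    """Return the FQDNs to try, ordered by the environment character (position 2)."""
    env_char = hostname[1:2].lower()
    domains = ["sanef.groupe", "sanef-rec.fr"]
    if env_char in ("r", "t"):  # acceptance/test: try the -rec domain first
        domains.reverse()
    return [f"{hostname}.{d}" for d in domains]


def resolve(hostname, getaddrinfo=socket.getaddrinfo):
    """Return the first FQDN that resolves, or None."""
    for fqdn in candidate_fqdns(hostname):
        try:
            getaddrinfo(fqdn, 22)
            return fqdn
        except socket.gaierror:
            continue
    return None
```

Trying the most likely domain first keeps the check fast when many entries are streamed one after another.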
 ### 8.4 Phase 3: Snapshots (VMs only)
 ```
 vCenter order depends on the branch:
  hprod : Senlis (vpgesavcs1) → Nanterre (vpmetavcs1) → DR (vpsicavcs1)
  prod  : Nanterre → Senlis → DR
 FOR each entry (prereq_ok = true):
  IF machine_type = 'physical':
    snap_done = true (no snapshot, check the Commvault backup)
    CONTINUE
  FOR each vCenter in order:
    pyVmomi connection → look up the VM by name (vcenter_vm_name or hostname)
    IF found:
      snap_name = f"QW_{run_id}_{branch}_{YYYYMMDD_HHMM}"
      Create snapshot (memory=false, quiesce=true)
      snap_done = true
      BREAK
  IF not found on any vCenter:
    snap_done = false → LOG ERROR
  EMIT SSE event
 ```
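The vCenter ordering and the snapshot naming can be sketched without any vSphere dependency. In this illustration, `take_snapshot` is a hypothetical callable standing in for the pyVmomi lookup and snapshot creation; it is expected to return `True` when the VM was found and snapshotted on the given vCenter.

```python
from datetime import datetime

VCENTER_ORDER = {
    "hprod": ["vpgesavcs1", "vpmetavcs1", "vpsicavcs1"],  # Senlis → Nanterre → DR
    "prod": ["vpmetavcs1", "vpgesavcs1", "vpsicavcs1"],   # Nanterre → Senlis → DR
}


def snapshot_name(run_id, branch, now):
    """Build the QW_{run_id}_{branch}_{YYYYMMDD_HHMM} snapshot name."""
    return f"QW_{run_id}_{branch}_{now:%Y%m%d_%H%M}"


def snapshot_entry(vm_name, run_id, branch, now, take_snapshot):
    """Try each vCenter in branch order; return (vcenter, snap_name) or None."""
    name = snapshot_name(run_id, branch, now)
    for vcenter in VCENTER_ORDER[branch]:
        if take_snapshot(vcenter, vm_name, name):  # hypothetical callable
            return vcenter, name
    return None  # not found on any vCenter: snap_done = false
```

A `None` result corresponds to the `snap_done = false → LOG ERROR` branch above.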
 ### 8.5 Phase 4: Patching (SSE streaming)
 ```
 FOR each entry (snap_done = true, requested branch):
  1. BUILD THE YUM COMMAND
     excludes = parse(general_excludes + " " + specific_excludes)
     args = " ".join(f"--exclude={pkg}" for pkg in excludes)
     cmd = f"yum update -y {args}"
     Example:
       yum update -y --exclude=kernel* --exclude=glibc* --exclude=systemd*
  2. SSH EXECUTION
     SSH connection (same chain as for the prerequisites)
     stdin, stdout, stderr = client.exec_command(cmd, timeout=600)
     output = stdout.read().decode('utf-8')
     exit_code = stdout.channel.recv_exit_status()
  3. OUTPUT ANALYSIS
     Packages counted : lines "Updating", "Installing", "Upgrading"
     Nothing to do    : "Rien à faire" or "Nothing to do"
     Reboot required  : "kernel" or "reboot" in output
  4. UPDATE
     status = "patched" if exit_code == 0 else "failed"
     UPDATE quickwin_entries SET
       status, patch_output, patch_packages_count,
       patch_packages, reboot_required, patch_date
  5. VALIDATION CREATION
     IF status = "patched":
       INSERT INTO patch_validation
         (server_id, campaign_id=run_id, campaign_type='quickwin',
          patch_date=now(), status='en_attente')
  EMIT SSE event {hostname, ok, packages, reboot, detail}
 ```
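Steps 1 and 3 can be sketched as two pure functions. This is an illustration, not the service code: the function names are hypothetical, the parser accepts both spaces and commas as separators, and each pattern is passed through `shlex.quote` so that the remote shell does not expand globs such as `kernel*` (a hardening choice of this sketch, not something the source states). Counting lines that start with the three keywords is also an assumption about how the yum output is matched.

```python
import shlex


def build_yum_command(general_excludes, specific_excludes):
    """Build the `yum update` command with one quoted --exclude per pattern."""
    raw = f"{general_excludes or ''} {specific_excludes or ''}".replace(",", " ")
    excludes = list(dict.fromkeys(raw.split()))  # de-duplicate, keep order
    args = " ".join(f"--exclude={shlex.quote(pkg)}" for pkg in excludes)
    return f"yum update -y {args}".strip()


def analyse_output(output):
    """Derive the package count and the flags described in step 3."""
    keywords = ("Updating", "Installing", "Upgrading")
    return {
        "packages_count": sum(
            line.lstrip().startswith(keywords) for line in output.splitlines()
        ),
        "nothing_to_do": "Nothing to do" in output or "Rien à faire" in output,
        "reboot_required": "kernel" in output or "reboot" in output,
    }
```

For instance, `build_yum_command("kernel*, glibc*", "foo")` yields a command in which `kernel*` and `glibc*` are single-quoted while `foo` is left bare.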
 ### 8.6 Phase 5: Moving on to prod
 ```
 BEFORE starting prod patching:
  can_start_prod(db, run_id):
    SELECT COUNT(*) FROM quickwin_entries
    WHERE run_id = :rid AND branch = 'hprod'
      AND status IN ('pending', 'in_progress')
    IF count > 0 → BLOCKED (hprod not finished)
  check_prod_validations(db, run_id):
    FOR each prod entry:
      Check that all linked non-prods are validated_ok/forced
    IF blockers > 0 → BLOCKED (missing validation)
 ```
 ---
 ## 9. Post-patching
 ### 9.1 History recording
 ```sql
 -- After each patch (standard or quickwin); :status is 'ok' or 'ko'
 INSERT INTO patch_history
  (server_id, campaign_id, intervenant_id, date_patch, status, notes, intervenant_name)
 VALUES (:sid, :cid, :uid, now(), :status, :notes, :interv_name)
 ```
 ### 9.2 Notifications
 ```
 Triggers:
  - Patching starts  → notif_debut_sent = true
  - Reboot performed → notif_reboot_sent = true
  - Patching ends    → notif_fin_sent = true
 Channel: Teams webhook (configurable in settings)
 ```
 ### 9.3 Rollback
 ```
 IF status = 'failed' AND rollback_method = 'snapshot':
  → Restore the vCenter snapshot (manually or via quickwin_snapshot_service)
  → Record rollback_justif in patch_sessions
 ```
 ---
 ## 10. Reporting
 ### 10.1 Dashboard KPIs
 | KPI | Query |
 |-----|---------|
 | Servers patched in 2026 | `COUNT(DISTINCT server_id) FROM patch_history WHERE year=2026` |
 | Patching events in 2026 | `COUNT(*) FROM patch_history WHERE year=2026` |
 | Never patched (prod) | Production + secops servers with no patch_history entry this year |
 | Coverage % | patched / patchable × 100 |
 | Latest week | `MAX(TO_CHAR(date_patch, 'IW'))` |
 ### 10.2 History (`/patching/historique`)
 Unified sources:
 - `patch_history` (xlsx imports + standard campaigns)
 - `quickwin_entries` WHERE status='patched'
 Filters: year, week, OS, zone, domain, operator, source, hostname
 ### 10.3 Integrations
 | Source | Usage |
 |--------|-------|
 | iTop | Servers, contacts, applications, domains, environments |
 | Qualys | Assets, V3 tags, agents, post-patch scans |
 | CyberArk | PSMP access for SSH on restricted servers |
 | Sentinel One | Endpoint agents (coverage comparison) |
 | AD SANEF | secops group (8 users), LDAP auth |
 ---
 ## 11. Data schema
 ```
 patch_planning ──→ campaigns ──→ patch_sessions ──→ patch_history
     (annual)       (weekly)     (1 per server)      (audit trail)
                                      │
                                      ▼
                                patch_validation
                                (non-prod → prod gate)
                                      ▲
                                      │
 quickwin_runs ──→ quickwin_entries ────┘
   (run)          (1 per server)
                       │
                       ▼
                  quickwin_logs
                   (detailed traces)
 servers ──→ domain_environments ──→ domains
   │              │                 environments
   │              │
   ├──→ zones
   ├──→ server_correspondance (prod ↔ non-prod)
   ├──→ server_ips
   ├──→ server_databases
   ├──→ qualys_assets ──→ qualys_asset_tags ──→ qualys_tags
   └──→ applications (via application_id)
 users ──→ contacts (via contact_id FK)
  │         │
  │         └──→ ldap_dn (AD source)
  │
  ├──→ default_assignments (assignment rules)
  └──→ campaign_operator_limits
 ```
 ---
 ## 12. Package exclusions (yum --exclude)
 ### 12.1 Reboot packages (global)
 ```
 kernel*, glibc*, systemd*, dbus*, polkit*, linux-firmware*,
 microcode_ctl*, tuned*, dracut*, grub2*, kexec-tools*,
 libselinux*, selinux-policy*, shim*, mokutil*,
 net-snmp*, NetworkManager*, network-scripts*, nss*, openssl-libs*
 ```
 Stored in `app_secrets['patching_reboot_packages']`.
 Applied to `quickwin_entries.general_excludes` when the run is created.
 ### 12.2 Per-server exclusions
 Field `servers.patch_excludes` (free text, space-separated).
 Managed via `/patching/config-exclusions` (UI + bulk).
 Synchronized with iTop (best effort).
 ### 12.3 Generated command
 ```bash
 yum update -y --exclude=kernel* --exclude=glibc* --exclude=systemd* \
              --exclude=<specific1> --exclude=<specific2>
 ```