
Overview

Using cert-manager to issue free certificates in Kubernetes makes certificates much easier to manage and consume. Cert-Manager is an open-source project for managing and automating certificates. If your security and certificate-feature requirements are modest, cert-manager can issue certificates from Let's Encrypt over the ACME protocol and renew them automatically.

Prerequisites

  • A Kubernetes cluster
  • kubectl and helm installed

Install Cert-Manager with Helm

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.2
# add --set installCRDs=true if you want the chart to install the CRDs as well

ACME

ACME supports two challenge types: HTTP-01 and DNS-01.

HTTP01

HTTP-01 is an HTTP-based challenge: the CA verifies control of the domain by fetching a token file over HTTP. It lets you use the free certificates Let's Encrypt provides.

DNS01

DNS-01 is a DNS-based challenge that likewise lets you use Let's Encrypt's free certificates (and is required for wildcard certificates).

With DNS-01, a TXT record must be created on the DNS server under _acme-challenge.<domain>, with a value derived from the challenge token. If the DNS provider offers an API, cert-manager can automate this (it ships solvers for common providers and supports webhook solvers for others).
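For reference, you can watch for that record with dig; a minimal sketch, assuming example.com as the domain:

dig +short TXT _acme-challenge.example.com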

Since my DNS provider offers no API, I went with HTTP-01.

ACME Issuer

An Issuer is the most important resource in cert-manager; it drives the certificate issuance process. There are two kinds:

  • ClusterIssuer: manages issuance cluster-wide.
  • Issuer: manages issuance within a single namespace.

Create a ClusterIssuer

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # You must replace this email address with your own.
    # Let's Encrypt will use this to contact you about expiring
    # certificates, and issues related to your account.
    email: user@example.com
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Secret resource that will be used to store the account's private key.
      name: example-issuer-account-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx

Notes:

  • email: the address Let's Encrypt uses for notifications such as expiry warnings; replace the example address with your own.
  • server: the ACME directory URL. https://acme-staging-v02.api.letsencrypt.org/directory is the Staging environment; https://acme-v02.api.letsencrypt.org/directory is Production. Certificates issued by Staging are not trusted by browsers, so use the production URL for real deployments.
  • privateKeySecretRef: the name of the Secret used to store the ACME account's private key.
  • solvers: the challenge type to use, HTTP01 or DNS01. With HTTP01 the ingressClassName must be set, otherwise validation fails.
  • ingressClassName: the Ingress class, i.e. which ingress controller serves the challenge traffic.
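To sanity-check the issuer after creating it (assuming the manifest above is saved as clusterissuer.yaml, a hypothetical file name):

kubectl apply -f clusterissuer.yaml
kubectl get clusterissuer letsencrypt-staging   # READY should turn True once the ACME account is registered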

Create a certificate for a Kubernetes Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # add an annotation indicating the issuer to use.
    cert-manager.io/cluster-issuer: nameOfClusterIssuer
  name: myIngress
  namespace: myIngress
spec:
  rules:
  - host: example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: myservice
            port:
              number: 80
  tls: # < placing a host in the TLS config will determine what ends up in the cert's subjectAltNames
  - hosts:
    - example.com
    secretName: myingress-cert # < cert-manager will store the created certificate in this secret.

Notes:

  • cert-manager.io/cluster-issuer: the name of the ClusterIssuer to use for issuance.
  • hosts: the domain names the certificate should cover.
  • secretName: the name of the Secret in which the issued certificate is stored.

Normally, the certificate would be issued successfully at this point and the service reachable at https://example.com. In my case, however, there was a problem.

The problem

After creating the Ingress, the certificate failed to be issued. Check the status and logs:

kubectl get Issuers,ClusterIssuers,Certificates,CertificateRequests,Orders,Challenges -n partner 
NAME READY AGE
clusterissuer.cert-manager.io/letsencrypt-staging True 42h

NAME READY SECRET AGE
certificate.cert-manager.io/myingress-cert False myingress-cert 42h

NAME APPROVED DENIED READY ISSUER REQUESTOR AGE
certificaterequest.cert-manager.io/myingress-cert-1 True False letsencrypt-staging system:serviceaccount:cert-manager:cert-manager 11h

NAME STATE AGE
order.acme.cert-manager.io/myingress-cert-1-1904933461 invalid 11h

Then look at the Challenge in detail:

kubectl describe challenges.acme.cert-manager.io/myingress-cert-1-1904933461 -n partner
Name:         myingress-cert-1-1904933461-2937188937
Namespace:    partner
Labels:       <none>
Annotations:  <none>
API Version:  acme.cert-manager.io/v1
Kind:         Challenge
Metadata:
  Creation Timestamp:  2023-12-05T13:39:58Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
  Owner References:
    API Version:           acme.cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  myingress-cert-1-1904933461
    UID:                   1691de2d-1c32-4f10-bee7-81b1d4c9f7bc
  Resource Version:        4440169
  UID:                     4092e71c-4028-47f9-8c0f-f0d5e95a4e81
Spec:
  Authorization URL:  https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/9874881604
  Dns Name:           example.com
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   ClusterIssuer
    Name:   letsencrypt-staging
  Key:      xxxxxxxxx.xxxxxxxxx
  Solver:
    http01:
      Ingress:
        Ingress Class Name:  nginx
  Token:     xxxxxxxxx
  Type:      HTTP-01
  URL:       https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/9874881604/fKmvZw
  Wildcard:  false
Status:
  Presented:   false
  Processing:  false
  Reason:      Error accepting authorization: acme: authorization error for example.com: 400 urn:ietf:params:acme:error:connection: xxx.xx.xx.xxx: Fetching http://example.com/.well-known/acme-challenge/xxxxxxxxx: Timeout during connect (likely firewall problem)
  State:       invalid
Events:        <none>

Here xxx.xx.xx.xxx is my public IP.

Checking the cert-manager logs showed that actual requests to http://example.com/.well-known/acme-challenge/xxxxxxxxx initially returned 404 rather than 400, so I suspected a network problem.

My network works like this: an internal DNS server resolves the domain directly to an internal IP, while public clients resolve it via public DNS to the public IP. The egress router port-forwards the public port to the internal one, making the internal service reachable from the internet. There is one restriction: clients on the internal network are not allowed to go out via the public IP and hairpin back in through the forwarded port.

I suspected that when cert-manager pre-validates the token, it was reaching example.com through the public address and timing out. After temporarily lifting the no-hairpin restriction, issuance succeeded.
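A quick way to reproduce this kind of hairpin failure from inside the cluster is a throwaway curl pod against the challenge URL; a sketch (the pod name and token path are illustrative):

kubectl run nettest --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sv --max-time 10 http://example.com/.well-known/acme-challenge/test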

Analysis

When cert-manager validates a domain, it accesses the domain from inside the container environment, and that lookup was answered by public DNS rather than the internal DNS. The better fix would be to have the cluster use the internal DNS for this domain, sidestepping the hairpin restriction entirely, but I have not found the right configuration yet and will revisit it later.
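One direction I intend to try (unverified so far): CoreDNS can forward a single zone to the internal DNS server, so pods would resolve the domain to the internal IP directly. A sketch of an extra server block for the coredns ConfigMap in kube-system, with 192.168.1.10 standing in for the internal DNS address:

example.com:53 {
    errors
    cache 30
    forward . 192.168.1.10
}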

Today, while upgrading Kubernetes to 1.24, I found PVC creation stuck in Pending. On inspection it was the same problem I had hit before; see: "Fixing the PVC creation failure after upgrading Kubernetes to v1.20".

I then checked /etc/kubernetes/manifests/kube-apiserver.yaml: the RemoveSelfLink=false setting added back then was indeed gone after the upgrade. I re-added it the same way and waited for the API server to restart.

This time, however, kube-apiserver refused to start.

Checking kube-apiserver showed that RemoveSelfLink=false is no longer allowed in the new version. After investigating and testing, the solution is to replace nfs-client-provisioner.

nfs-client-provisioner has to be replaced with nfs-subdir-external-provisioner. I deploy it with helm; the procedure:

  1. Add the helm repo first

    helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
    helm repo update
  2. Remove the old deployment

    helm uninstall nfs-prod
  3. Update the values file; the new file looks like this

    image:
      repository: registry.cn-shanghai.aliyuncs.com/c7n/nfs-subdir-external-provisioner
    storageClass:
      name: <name>
      archiveOnDelete: false
      defaultClass: true
    nfs:
      server: <nfs-ip>
      path: <nfs-path>
    nodeSelector: {}
  4. Redeploy (a verification sketch follows this list)

    helm install <name> nfs-subdir-external-provisioner/nfs-subdir-external-provisioner -f <value file>
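To verify the new provisioner, a throwaway PVC works well; a minimal sketch with hypothetical names (<name> must match the storageClass configured above):

kubectl get storageclass
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: <name>
  resources:
    requests:
      storage: 1Mi
EOF
kubectl get pvc pvc-test   # should reach Bound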

Tests passed.

Recently, while setting up a Bicep environment on macOS, I found that VS Code could never successfully load Bicep. The Bicep extension depends on the .NET runtime at startup; although I had installed .NET manually, the extension still would not start.

References:

https://github.com/dotnet/vscode-dotnet-runtime/blob/main/Documentation/troubleshooting-runtime.md#install-script-timeouts

https://www.azuredeveloper.cn/article/azure-tutorial-auzre-bicep-introduction

The fix is to modify the VS Code settings and add the following:

"dotnetAcquisitionExtension.existingDotnetPath": [
{"extensionId": "ms-azuretools.vscode-bicep", "path": "/usr/local/share/dotnet/dotnet"},
{"extensionId": "msazurermtools.azurerm-vscode-tools", "path": "/usr/local/share/dotnet/dotnet"}
]

After that, everything works normally.
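If the extension still fails to start, it is worth confirming that the path in the settings actually points at a working installation:

ls -l /usr/local/share/dotnet/dotnet
/usr/local/share/dotnet/dotnet --list-runtimes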

Today, while working with a Kubernetes Deployment, I found that deleting the Deployment did not automatically delete its ReplicaSet and Pods. At first I suspected a faulty worker node, but restarting it and moving the workload to another node made no difference. I then restarted the master nodes and found that two masters could not register their node on startup.

My environment has three master nodes. Checking the containers showed master01 starting normally, while the etcd containers on master02 and master03 would not start. The logs reported:

Jan 14 22:29:29 master02.k8s kubelet[1998]: E0114 22:29:29.187573    1998 kubelet.go:2291] "Error getting node" err="node \"master02.k8s\" not found"

Looking at the etcd log on master01:

2022-01-14 15:57:13.963722 I | embed: rejected connection from "192.168.203.4:45008" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")

So this was clearly caused by expired certificates. Based on my experience from the last expiry, I checked:

# kubeadm certs renew
missing subcommand; "renew" is not meant to be run on its own
To see the stack trace of this error execute with --v=5 or higher
[root@master02 kubernetes]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Sep 10, 2022 00:10 UTC 238d no
apiserver Sep 10, 2022 00:20 UTC 238d ca no
apiserver-etcd-client Sep 10, 2022 00:20 UTC 238d etcd-ca no
apiserver-kubelet-client Sep 10, 2022 00:20 UTC 238d ca no
controller-manager.conf Sep 10, 2022 00:09 UTC 238d no
etcd-healthcheck-client Dec 22, 2021 23:53 UTC <invalid> etcd-ca no
etcd-peer Dec 22, 2021 23:53 UTC <invalid> etcd-ca no
etcd-server Dec 22, 2021 23:53 UTC <invalid> etcd-ca no
front-proxy-client Sep 10, 2022 00:20 UTC 238d front-proxy-ca no
scheduler.conf Sep 10, 2022 00:09 UTC 238d no

CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 24, 2029 07:18 UTC 7y no
etcd-ca Dec 24, 2029 07:18 UTC 7y no
front-proxy-ca Dec 24, 2029 07:18 UTC 7y no

The etcd certificates had indeed expired; renew them:

# kubeadm certs renew etcd-healthcheck-client
# kubeadm certs renew etcd-peer
# kubeadm certs renew etcd-server

A recheck confirmed the new certificates were in effect; after restarting the containers and kubelet, the problem was gone.

The root cause is that master01 had renewed its certificates automatically, keeping its container environment healthy, while the certificates on the other master nodes were not renewed automatically and needed manual handling. Presumably a future version will take care of this.

Background

We subscribe to Microsoft 365 and sync on-premises AD users to Azure AD through Azure AD Connect. A mis-operation deleted a user from AD, and the user was immediately recreated before the next sync had run. As a result, that user could no longer be synced.

Root cause analysis

After a user is synced through Azure AD Connect, the AD user's sourceAnchor attribute is written to the Azure AD user's immutableId attribute, which identifies the user on both sides.

There are several possible choices for sourceAnchor; see: Azure AD Connect: Design concepts.

In current versions the system uses ConsistencyGuid as the sourceAnchor attribute; older versions defaulted to ObjectGuid.

ConsistencyGuid is normally the same as ObjectGuid. When a user is recreated, the ObjectGuid changes, so at sync time the ConsistencyGuid in AD no longer matches the immutableId in Azure AD, yet both users share the same UPN (userPrincipalName), which produces a sync conflict.

Deleting the user in Azure AD would of course solve it simply, but all of the user's mail and other data would be lost with it.

Solution

With the cause understood, the fix is straightforward: make the ConsistencyGuid in AD and the immutableId in Azure AD agree.

  1. Query the immutableId in Azure AD

    PS C:\>  Get-MsolUser -UserPrincipalName user1@abc.com | Select-Object UserprincipalName,ImmutableID

    UserPrincipalName ImmutableId
    ----------------- -----------
    user1@abc.com K0fuEdJZe0ulRQq3+WlZTA==

    You can see the user's immutableId is K0fuEdJZe0ulRQq3+WlZTA==.

  2. Convert the immutableId value
    The immutableId is the Base64 encoding of the GUID's raw bytes and must be converted; an online tool works (choose hexadecimal output), or use the PowerShell sketched after this list.

    The conversion yields a value like: 2B 47 EE 11 D2 59 7B 4B A5 45 0A B7 F9 69 59 4C

  3. Modify the ConsistencyGuid in AD
    Log on to the AD server, open ADSI Edit, find the user, open its properties, locate ms-DS-ConsistencyGuid, and change the value to the one converted in the previous step.

  4. Sync
    Wait for the next sync cycle or trigger one manually, then confirm the result is correct.
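Steps 1 through 3 can also be done entirely in PowerShell, without the online tool; a sketch, assuming the RSAT ActiveDirectory module is installed and user1 is a hypothetical sAMAccountName:

$immutableId = "K0fuEdJZe0ulRQq3+WlZTA=="
$bytes = [System.Convert]::FromBase64String($immutableId)
[System.BitConverter]::ToString($bytes)    # prints 2B-47-EE-11-D2-59-7B-4B-A5-45-0A-B7-F9-69-59-4C
Import-Module ActiveDirectory
Set-ADUser user1 -Replace @{ 'mS-DS-ConsistencyGuid' = $bytes }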

Follow-up

While solving this, I also tried modifying the immutableId in Azure AD instead, which actually feels more appropriate, but I hit two problems in the process:

  1. If the user in Azure AD is in the synced state, the value cannot be modified; you have to wait until the user becomes a non-synced user, or disable sync.
  2. In testing I did modify the immutableId in Azure AD, but a manual sync afterwards still failed, and I did not test further. In theory this approach should work.

Two commands worth recording:

  • Modify the immutableId in Azure AD

    Get-MsolUser -UserPrincipalName user1@abc.com | Set-MsolUser -ImmutableId L0b1Dn3oIkGiFLPW9fhY+Q==
  • Disable directory sync

    Set-MsolDirSyncEnabled -EnableDirSync $false

    Note: the official documentation says that after disabling sync you must wait 72 hours before re-enabling it.

Additionally, some people online suggest uninstalling and reinstalling Azure AD Connect, which also works by re-matching AD and Azure AD users; I tried this, and it did not solve the problem.

Finally, the best protection against this kind of mis-operation is to enable the AD Recycle Bin (configured in the Active Directory Administrative Center), which avoids the problem entirely.

Installing the MSOnline module on Windows Server 2016 failed with the following error:

PS C:\> Install-Module MSOnline
NuGet provider is required to continue
PowerShellGet requires NuGet provider version '2.8.5.201' or newer to interact with NuGet-based repositories. The NuGet provider must be available in 'C:\Program
Files\PackageManagement\ProviderAssemblies' or 'C:\Users\administrator.WAFERSYSTEMS1\AppData\Local\PackageManagement\ProviderAssemblies'. You can also install the NuGet provider by running
'Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force'. Do you want PowerShellGet to install and import the NuGet provider now?
[Y] Yes  [N] No  [S] Suspend  [?] Help (default is "Y"): y
WARNING: Unable to download from URI 'https://go.microsoft.com/fwlink/?LinkID=627338&clcid=0x409' to ''.
WARNING: Unable to download the list of available providers. Check your internet connection.
PackageManagement\Install-PackageProvider : No match was found for the specified search criteria for the provider 'NuGet'. The package provider requires 'PackageManagement' and 'Provider' tags. Please check if the specified package has the tags.
At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7405 char:21
+ ... $null = PackageManagement\Install-PackageProvider -Name $script:N ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (Microsoft.Power...PackageProvider:InstallPackageProvider) [Install-PackageProvider],Exception
+ FullyQualifiedErrorId : NoMatchFoundForProvider,Microsoft.PowerShell.PackageManagement.Cmdlets.InstallPackageProvider

PackageManagement\Import-PackageProvider : No match was found for the specified search criteria and provider name 'NuGet'. Try 'Get-PackageProvider -ListAvailable' to see if the provider exists on the system.
At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7411 char:21
+ ... $null = PackageManagement\Import-PackageProvider -Name $script:Nu ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidData: (NuGet:String) [Import-PackageProvider],Exception
+ FullyQualifiedErrorId : NoMatchFoundForCriteria,Microsoft.PowerShell.PackageManagement.Cmdlets.ImportPackageProvider

WARNING: Unable to download from URI 'https://go.microsoft.com/fwlink/?LinkID=627338&clcid=0x409' to ''.
WARNING: Unable to download the list of available providers. Check your internet connection.
PackageManagement\Get-PackageProvider : Unable to find package provider 'NuGet'. It may not be imported yet. Try 'Get-PackageProvider -ListAvailable'.
At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7415 char:30
+ ... tProvider = PackageManagement\Get-PackageProvider -Name $script:NuGet ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : ObjectNotFound: (Microsoft.Power...PackageProvider:GetPackageProvider) [Get-PackageProvider], Exception
+ FullyQualifiedErrorId : UnknownProviderFromActivatedList,Microsoft.PowerShell.PackageManagement.Cmdlets.GetPackageProvider

Install-Module : NuGet provider is required to interact with NuGet-based repositories. Please ensure that '2.8.5.201' or newer version of NuGet provider is installed.
At line:1 char:1
+ Install-Module msonline
+ ~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Install-Module],InvalidOperationException
+ FullyQualifiedErrorId : CouldNotInstallNuGetProvider,Install-Module

From the messages, the failure happens while installing the NuGet provider. Some searching revealed the cause: PowerShell does not use TLS 1.2 by default, and the download endpoint requires it. Force TLS 1.2 and the installation succeeds:

[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12;
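This assignment only affects the current session (add it to $PROFILE to persist it). With it in place, the failing commands from the transcript go through:

Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force
Install-Module MSOnline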

While upgrading Kubernetes from 1.20 to 1.21, I ran into two problems.

CoreDNS image pull failure

Pull the images with the following command:

> kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers     
I0910 08:00:20.155181 16619 version.go:254] remote version is much newer: v1.22.1; falling back to: stable-1.21
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-apiserver:v1.21.4
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-controller-manager:v1.21.4
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-scheduler:v1.21.4
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-proxy:v1.21.4
[config/images] Pulled registry.aliyuncs.com/google_containers/pause:3.4.1
[config/images] Pulled registry.aliyuncs.com/google_containers/etcd:3.4.13-0
failed to pull image "registry.aliyuncs.com/google_containers/coredns:v1.8.0": output: Error response from daemon: manifest for registry.aliyuncs.com/google_containers/coredns:v1.8.0 not found: manifest unknown: manifest unknown
, error: exit status 1
To see the stack trace of this error execute with --v=5 or higher

registry.aliyuncs.com/google_containers/coredns:v1.8.0 consistently failed to pull. It turns out that the image's tag is 1.8.0, not v1.8.0; pull it manually and retag it:

With the docker runtime:

docker pull registry.aliyuncs.com/google_containers/coredns:1.8.0
docker tag registry.aliyuncs.com/google_containers/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0

With containerd:

crictl pull registry.aliyuncs.com/google_containers/coredns:1.8.0
ctr -n k8s.io i tag registry.aliyuncs.com/google_containers/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0

After that, the upgrade proceeds normally.

Failed to start ContainerManager error

After the upgrade, kubelet failed to start on the worker nodes:

Failed to start ContainerManager failed to initialise top level QOS containers

Fix:

systemctl stop kubepods-burstable.slice
systemctl restart kubelet

Note that this restarts all Pods deployed on that node; once they come back, everything is normal.

Today, while upgrading vCenter 7.0, the pre-upgrade check failed; the error log showed:

2021-02-19T06:12:41.374Z - debug: initiateFileTransferFromGuest error: ServerFaultCode: Failed to authenticate with the guest operating system using the supplied credentials.
2021-02-19T06:12:41.374Z - debug: Failed to get fileTransferInfo:ServerFaultCode: Failed to authenticate with the guest operating system using the supplied credentials.
2021-02-19T06:12:41.374Z - debug: Failed to get url of file in guest vm:ServerFaultCode: Failed to authenticate with the guest operating system using the supplied credentials.

It turned out the vCenter root password had expired. SSH into vCenter, enter the shell, and run:

root@vcsa [ ~ ]# chage -l root
You are required to change your password immediately (root enforced)
chage: PAM: Authentication token is no longer valid; new one required

This confirms the root password has expired; simply change it:

root@vcsa [ ~ ]# passwd
New password:
Retype new password:
passwd: password updated successfully

Re-running the upgrade check then passed.
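To keep this from recurring, root password aging can be inspected and, if your security policy allows, relaxed:

root@vcsa [ ~ ]# chage -l root      # show the current aging policy
root@vcsa [ ~ ]# chage -M -1 root   # disable maximum-age expiry (use with care)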

Today, access to Kubernetes failed with the following error:

> kubectl get node     
error: You must be logged in to the server (Unauthorized)

It worked fine yesterday, so I suspected certificate expiry. The certificate files on the master live in /etc/kubernetes/pki and can be inspected directly:

> for item in `find /etc/kubernetes/pki -maxdepth 2 -name "*.crt"`;do openssl x509 -in $item -text -noout| grep Not;echo ======================$item===============;done

Not Before: Dec 27 07:18:44 2019 GMT
Not After : Dec 24 07:18:44 2029 GMT
======================/etc/kubernetes/pki/ca.crt===============
Not Before: Dec 27 07:18:44 2019 GMT
Not After : Dec 22 23:43:16 2021 GMT
======================/etc/kubernetes/pki/apiserver.crt===============
Not Before: Dec 27 07:18:44 2019 GMT
Not After : Dec 22 23:43:17 2021 GMT
======================/etc/kubernetes/pki/apiserver-kubelet-client.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 24 07:18:45 2029 GMT
======================/etc/kubernetes/pki/front-proxy-ca.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 22 23:43:17 2021 GMT
======================/etc/kubernetes/pki/front-proxy-client.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 24 07:18:45 2029 GMT
======================/etc/kubernetes/pki/etcd/ca.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 22 23:42:44 2021 GMT
======================/etc/kubernetes/pki/etcd/server.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 22 23:42:44 2021 GMT
======================/etc/kubernetes/pki/etcd/peer.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 22 23:42:45 2021 GMT
======================/etc/kubernetes/pki/etcd/healthcheck-client.crt===============
Not Before: Dec 27 07:18:45 2019 GMT
Not After : Dec 22 23:43:17 2021 GMT
======================/etc/kubernetes/pki/apiserver-etcd-client.crt===============

They had indeed expired; renew them:

> kubeadm alpha certs renew all
Command "all" is deprecated, please use the same command under "kubeadm certs"
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

As the output indicates, this command is deprecated in favor of kubeadm certs; that will do for now, and I'll try the new command next time.

Once done, copy /etc/kubernetes/admin.conf to ~/.kube/config and access is restored. In my case the cluster was already reachable before restarting kube-apiserver, kube-controller-manager, kube-scheduler, and etcd, but restarting them is the safer course.
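For reference, the copy step and a quick check:

cp /etc/kubernetes/admin.conf ~/.kube/config
kubectl get node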

Once connected, you can also check certificate status with kubeadm:

> kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 28, 2021 00:06 UTC 364d no
apiserver Dec 28, 2021 00:06 UTC 364d ca no
apiserver-etcd-client Dec 28, 2021 00:06 UTC 364d etcd-ca no
apiserver-kubelet-client Dec 28, 2021 00:06 UTC 364d ca no
controller-manager.conf Dec 28, 2021 00:06 UTC 364d no
etcd-healthcheck-client Dec 28, 2021 00:06 UTC 364d etcd-ca no
etcd-peer Dec 28, 2021 00:06 UTC 364d etcd-ca no
etcd-server Dec 28, 2021 00:06 UTC 364d etcd-ca no
front-proxy-client Dec 28, 2021 00:06 UTC 364d front-proxy-ca no
scheduler.conf Dec 28, 2021 00:06 UTC 364d no

CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 24, 2029 07:18 UTC 8y no
etcd-ca Dec 24, 2029 07:18 UTC 8y no
front-proxy-ca Dec 24, 2029 07:18 UTC 8y no

Today, after upgrading Kubernetes to 1.20, PVCs were stuck in Pending. Inspection showed the following error in the nfs-client-provisioner log:

provision "test/test-sql" class "nfs-storage": unexpected error getting claim reference: selfLink was empty, can't make reference

This is caused by a behavior change in Kubernetes v1.20 (selfLink is disabled by default); for details see: https://github.com/kubernetes/enhancements/issues/1164

With the cause identified, edit /etc/kubernetes/manifests/kube-apiserver.yaml directly and add the parameter:

- --feature-gates=RemoveSelfLink=false
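For context, the flag sits under the kube-apiserver container's command in the static-pod manifest; a trimmed sketch (all other flags omitted):

spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=RemoveSelfLink=false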