Add Terraform for provisioning s390x build cluster on IBM Cloud #8413
Conversation
Welcome @sudharshanibm!
Hi @sudharshanibm. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sudharshanibm. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
```
  create_duration = "180s" # Wait 3 minutes for full initialization
}

resource "null_resource" "bastion_setup" {
```
this is not very declarative - is this truly required, or can the verification be run as a one-off by someone? do we expect things to not be set up correctly via Terraform (i.e., verification here fails) periodically or at all? (if so, that seems problematic, and who or what is intended to fix that 😅)
The bastion instance's initial setup and networking configuration are handled through a cloud-init script in `user_data`. Runtime verification of the bastion's internal state is treated as an operational concern, and can be performed outside of Terraform via automation or manual checks.
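A minimal sketch of that pattern, with illustrative names and a hypothetical `cloud_init.yaml` template (none of these values are taken from the actual change set):

```
resource "ibm_is_instance" "bastion" {
  name    = "bastion-primary"
  image   = data.ibm_is_image.ubuntu.id
  profile = "bz2-2x8"
  vpc     = data.ibm_is_vpc.vpc.id
  zone    = "eu-de-1"

  primary_network_interface {
    subnet = ibm_is_subnet.subnet.id
  }

  # cloud-init runs the initial setup and networking configuration at
  # first boot, so no post-provisioning null_resource is required
  user_data = file("${path.module}/cloud_init.yaml")
}
```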
```
  vpc = data.ibm_is_vpc.vpc.id
}

data "ibm_is_security_group" "master_sg" {
```
can we use an improved alternate terminology? `main`, `primary`, `control-plane`?
updated all occurrences of `master` to `control-plane`
```
@@ -0,0 +1,129 @@
/*
```
the use of `k8s-*` throughout the file names, resource names, etc. seems to be redundant (i.e., it's defined in a K8s repo/org, so it's assumed to be for K8s; no need to re-state that in the names)
Hi @bryantbiggs, thanks for pointing that out.
I followed the same naming convention since the existing s390x setup is already connected to ArgoCD with `k8s-*`-prefixed names (ref: #8332).
```
See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
```
the names should be viewed as a hierarchy level just below the resource type, but read together. so `ibm_is_lb.k8s_load_balancer` is essentially "IBM load balancer - Kubernetes load balancer". what about something shorter and more descriptive like `public` (`ibm_is_lb.public`)?
Updated `k8s_load_balancer` to `public`
```
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
  name = "k8s-s390x-lb"
```
again, the name is very redundant. It's in a K8s repo; is this infrastructure deployed in an account that hosts resources for things outside of K8s? If yes, fine, keep the `k8s`; if not, drop it. Also, the resource is a load balancer, so no need to add that to the name (my name is Bryant Biggs, not Bryant Biggs human 😅). Lastly, I assume this is only for testing on s390x platforms, so is that adding anything to the name? Maybe it is, maybe it isn't; I don't have enough context to say. But what about something more descriptive as to its intent and aligned with resources already defined in the repo, e.g.:

```
name = "kops-infra-ci"
```

what about `s390x-ci`? or if you desire the full name, `k8s-s390x-ci`?
Note: this same thinking above applies throughout the change set
updated `k8s-s390x-lb` to `k8s-s390x-ci`
```
  health_timeout      = 2
  health_type         = "tcp"
  health_monitor_url  = "/"
  health_monitor_port = 6443
```
6443 appears in a number of places - good chance it should be a local variable to avoid mismatches
refactored to use a variable `api_server_port` (default `6443`)
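A sketch of that refactor; the listener resource shown here is illustrative, not the PR's actual code:

```
variable "api_server_port" {
  description = "Port the Kubernetes API server listens on"
  type        = number
  default     = 6443
}

# every place that previously hard-coded 6443 now shares one definition,
# so the listener and its health check can never drift apart
resource "ibm_is_lb_listener" "api" {
  lb       = ibm_is_lb.public.id
  port     = var.api_server_port
  protocol = "tcp"
}
```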
name = "k8s-s390x-subnet" | ||
} | ||
|
||
data "ibm_is_security_group" "bastion_sg" { |
data "ibm_is_security_group" "bastion_sg" { | |
data "ibm_is_security_group" "bastion" { |
Security group is already in the resource type - don't repeat in the resource name
I already have a resource `ibm_is_instance` `bastion` defined in `bastion.tf`, so reusing `bastion` here would cause a naming conflict
no - that's a different resource from this data source, so no conflict
renamed `bastion_sg` --> `bastion`
```
}

resource "ibm_is_instance" "control_plane" {
  count = var.control_plane.count
```
this should be a `for_each` so that we avoid amplifying disruptions
updated the `control_plane` to use `for_each` in `nodes.tf`
default = "eu-de-1" | ||
} | ||
|
||
variable "resource_group_name" { |
should this be a variable?
Removed the variable, thanks!
```
  }
}

variable "bastion" {
```
should this be a variable?
`bastion` is now generated in `locals`
```
  }
}

variable "control_plane" {
```
should this be a variable?
`control_plane` is now generated in `locals`
```
  }
}

variable "compute" {
```
should this be a variable?
`compute` is now generated in `locals`
```
@@ -0,0 +1,136 @@
/*
```
variables are mostly used in modules - I would argue, are these required? Should they be local variables instead (if they are used in multiple places but not intended to be configured dynamically outside of what's stored in git)?
In this case, these values are intended to be configurable by automation (Ansible playbooks) rather than hard-coded in the repo
Ansible to deploy Terraform - why?
Thanks for circling back on this. I realize my earlier explanation may have caused some confusion. What I meant is that while we do have downstream automation that consumes certain outputs for cluster setup, that doesn't mean all of the Terraform configuration itself needs to be exposed as variables.
After revisiting your feedback, I've updated the code so that only true external inputs (API key, region, secrets manager ID, counts, profiles, etc.) remain as variables. The structured definitions for the bastion, control-plane, and compute nodes are now generated in `locals`, which keeps the interface cleaner and avoids over-exposing internal implementation details.
```
}

resource "ibm_is_instance" "compute" {
  count = var.compute.count
```
likewise, use `for_each`
updated the `compute` to use `for_each` in `nodes.tf`
```
  }
}

resource "null_resource" "node_setup" {
```
this doesn't appear to be doing anything - can it be removed?
```
@@ -0,0 +1,43 @@
/*
```
outputs are intended to be consumed elsewhere - are all of these intended to be consumed somewhere else?
Yes, these outputs are consumed in our Ansible automation for installing Kubernetes. Specifically, the IPs and load balancer hostname are passed into the Ansible playbook (`ansible-playbook -v -i examples/k8s-build-cluster/hosts.yml install-k8s-ha.yaml -e @group_vars/bastion_configuration --extra-vars @group_vars/all`) to configure the bastion, control-plane, and worker nodes.
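The outputs consumed that way would look roughly like this (the resource names are assumptions, not the PR's actual identifiers):

```
output "bastion_public_ip" {
  description = "Floating IP of the bastion, used as the Ansible jump host"
  value       = ibm_is_floating_ip.bastion.address
}

output "load_balancer_hostname" {
  description = "Public hostname fronting the Kubernetes API server"
  value       = ibm_is_lb.public.hostname
}
```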
```
@@ -0,0 +1,47 @@
# _TF: IBM K8s Account Infrastructure_
This Terraform configuration sets up an organized structure for deploying various IBM Cloud resources using modules.
```
using what modules?
updated the README to explicitly list the modules being used: `vpc`, `secrets_manager`, and `resource_group`
```
<br> You need to be in the `k8s-s390x-infra-setup` directory to run the automation.

**2. Check the `versions.tf` file**
<br> Set `secret_key` and `access_key` in `versions.tf` to configure the remote S3 backend (IBM Cloud COS).
```
we should use AWS IAM roles for state access - the use of secrets like this is bound to end up getting committed at some point. Plus, sharing static secrets is not good either: no auditability
removed `access_key` and `secret_key` from `versions.tf`; instead the backend picks up HMAC credentials from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
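A sketch of an S3-compatible backend for IBM Cloud COS with no inline credentials; the bucket, key, and endpoint values are placeholders:

```
terraform {
  backend "s3" {
    bucket = "k8s-s390x-terraform-state"     # placeholder
    key    = "infra-setup/terraform.tfstate" # placeholder
    region = "eu-de"

    # COS is S3-compatible but not AWS, so AWS-specific checks are skipped
    endpoints                   = { s3 = "https://s3.eu-de.cloud-object-storage.appdomain.cloud" }
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true

    # no access_key/secret_key here: HMAC credentials are read from the
    # AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
  }
}
```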
````
**3. Initialize Terraform**
<br> Execute the following command to initialize Terraform in your project directory. This command will download the necessary provider plugins and prepare the working environment.
```
terraform init -reconfigure
```
````
why `-reconfigure`? should this be `-upgrade`?
```
**4. Check the `variables.tf` file**
<br> Open the `variables.tf` file to review all the available variables. This file lists all customizable inputs for your Terraform configuration.

`ibmcloud_api_key` is the only required variable that you must set in order to proceed. You can set this key either by adding it to your `var.tfvars` file or by exporting it as an environment variable.
```
can we get around this? if we are using an AWS IAM role for statefile access, can we place the API key in an AWS SSM parameter or Secrets Manager secret and dynamically fetch that API key via Terraform? Goal is to improve reproducibility, reduce sharing of static secrets, improve auditability, and make secret rotation painless
Agreed that dynamically fetching the IBM Cloud API key from a secret manager would be ideal for auditability and rotation. However, in this setup the Terraform run creates the Secrets Manager itself, so the key must be provided upfront via `var.tfvars` or an environment variable.
```
`ibmcloud_api_key` is the only required variable that you must set in order to proceed. You can set this key either by adding it to your `var.tfvars` file or by exporting it as an environment variable.

**Option 1:** Set in `var.tfvars` file
```
tfvars or `variables.tf` should not be required here in my opinion - just define the configuration plainly, it's not a module
You're right, since this isn't a module, ideally we wouldn't require `variables.tf` or `var.tfvars`. However, in this setup the TF run creates the Secrets Manager itself, so we need the IBM Cloud API key upfront to provision that resource. For now, the key must be provided via `var.tfvars` or an environment variable.
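The one remaining input would then be declared along these lines (a sketch, marked `sensitive` so the value never appears in plan output):

```
variable "ibmcloud_api_key" {
  description = "IBM Cloud API key used to provision everything, including the Secrets Manager itself"
  type        = string
  sensitive   = true
  # supplied via var.tfvars or the TF_VAR_ibmcloud_api_key environment variable
}
```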
```
Terraform will display a plan of the actions it will take, and you'll be prompted to confirm the execution. Type `yes` to proceed.

**6. Get Output Information**
<br> Once the infrastructure has been provisioned, use the `terraform output` command to list details about the provisioned resources.
```
for what purpose?
these outputs are used by our Ansible automation to install K8s. The IPs and load balancer hostname from Terraform are passed into the playbook (`install-k8s-ha.yaml`) to configure the bastion, control-plane, and worker nodes
```
See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_resource_group" "build_resource_group" {
```
single resource modules are no bueno - just use the resource as is in the configuration
removed the `resource_group` module and now define `ibm_resource_group` directly in `main.tf`; the other modules (`secrets_manager` and `vpc`) consume its ID from there
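A sketch of that layout, with module paths and names assumed:

```
resource "ibm_resource_group" "build" {
  name = "k8s-s390x-build" # illustrative name
}

module "vpc" {
  source            = "./modules/vpc"
  resource_group_id = ibm_resource_group.build.id
}

module "secrets_manager" {
  source            = "./modules/secrets_manager"
  resource_group_id = ibm_resource_group.build.id
}
```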
```
  z_service_cred_secret_name = "k8s-s390x-sm-service-credentials-secret"
}

# Try to find existing Secrets Manager by name
```
no no no - declarative. Terraform should not be treated as "if this else that". it's just "do this"
refactored the code to make it fully declarative: the Secrets Manager resource is always created, and downstream secrets reference it directly instead of using any conditional logic
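A sketch of the declarative shape, with illustrative names; the secret payload variable is hypothetical:

```
# always created - no "look up existing, else create" branching
resource "ibm_resource_instance" "secrets_manager" {
  name              = "k8s-s390x-secrets-manager"
  service           = "secrets-manager"
  plan              = "standard"
  location          = var.region
  resource_group_id = ibm_resource_group.build.id
}

# downstream secrets reference the managed instance directly
resource "ibm_sm_arbitrary_secret" "service_cred" {
  instance_id = ibm_resource_instance.secrets_manager.guid
  region      = var.region
  name        = "k8s-s390x-sm-service-credentials-secret"
  payload     = var.service_credentials # hypothetical input
}
```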
```
See the License for the specific language governing permissions and
limitations under the License.
*/
variable "resource_group_id" {}
```
well that's not nice! we should have the proper definition: `description`/`type`/`default`
Updated with description and type
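Presumably something along these lines:

```
variable "resource_group_id" {
  description = "ID of the resource group that owns the Secrets Manager instance"
  type        = string
}
```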
```
See the License for the specific language governing permissions and
limitations under the License.
*/
terraform {
```
we need the min required versions of Terraform and the provider
The minimum required versions are defined in `versions.tf`
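For reference, a typical `versions.tf` pinning would look like the sketch below; the exact version floors are illustrative:

```
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    ibm = {
      source  = "IBM-Cloud/ibm"
      version = ">= 1.60.0"
    }
  }
}
```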
resource "ibm_is_security_group_rule" "master_inbound_from_internal_cidr" { | ||
group = ibm_is_security_group.master_sg.id | ||
direction = "inbound" | ||
remote = "10.243.0.0/24" # adjust based on your CIDR |
??? who should adjust based on what CIDR - this is hard coded
Replaced the hardcoded CIDR `10.243.0.0/24` with a dynamic reference to the actual subnet CIDR block: `ibm_is_subnet.subnet.ipv4_cidr_block`
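That change looks roughly like this (the rule's name reflects the earlier `master` to `control-plane` rename; exact identifiers are assumed):

```
resource "ibm_is_security_group_rule" "control_plane_inbound_internal" {
  group     = ibm_is_security_group.control_plane.id
  direction = "inbound"
  # tracks the subnet definition automatically instead of a hard-coded CIDR
  remote    = ibm_is_subnet.subnet.ipv4_cidr_block
}
```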
```
limitations under the License.
*/

output "secrets_manager_id" {
```
this is not a module so shouldn't require `variables.tf` or `outputs.tf`
You're right that this folder isn't a reusable module in the TF sense. The reason I've added an `outputs.tf` here is that the `k8s-s390x-infra-setup` stage produces values (`secrets_manager_id`) that are explicitly required by the `k8s-s390x-build-cluster` stage.
As shown in the build-cluster README, we need to export this ID before running the cluster provisioning.
don't copy-paste across statefiles though - either refer to the state itself or use the AWS provider to refer to the value directly
yes, that makes sense. The intent was to make the `secrets_manager_id` available for the `build-cluster` stage, but I see your point about not copy-pasting state outputs manually.
I can switch this to use a `terraform_remote_state` data source so that the build-cluster config reads the value directly from the infra-setup state instead of requiring a manual export.
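For reference, such a lookup would look roughly like the sketch below (backend details are placeholders; as the next comment explains, this approach was ultimately not adopted):

```
data "terraform_remote_state" "infra_setup" {
  backend = "s3"
  config = {
    bucket = "k8s-s390x-terraform-state" # must match the infra-setup backend
    key    = "infra-setup/terraform.tfstate"
    region = "eu-de"
  }
}

locals {
  # read directly from the infra-setup state instead of a manual export
  secrets_manager_id = data.terraform_remote_state.infra_setup.outputs.secrets_manager_id
}
```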
Hi @bryantbiggs,
I removed `outputs.tf` and am not using `terraform_remote_state` because:
- the infra setup is one-time, while build clusters are multiple, so each cluster creation shouldn't depend on the infra state.
- to avoid state coupling, there's no need to copy outputs or access remote state, keeping the configs independent.

Also, the README has been updated with clear steps to retrieve the Secrets Manager GUID and export it as a Terraform variable for cluster provisioning.
I'd be happy to hear your suggestions or feedback on this approach.
zone = "eu-de-1" | ||
} | ||
|
||
provider "ibm" { |
min required versions of Terraform and providers
The minimum required versions are defined in `versions.tf`
```
  region = local.region
  zone   = local.zone
}
provider "ibm" {
```
why 3 copies of the same provider if their configurations are all the same? typically, aliased providers are required if you are deploying resources into multiple regions or multiple accounts, but these are all using the same API key and region
removed the duplicate provider blocks
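Leaving a single provider block along these lines (a sketch):

```
# one provider block suffices when every resource targets the same API key,
# region, and zone; aliases are only needed for multi-region/account setups
provider "ibm" {
  ibmcloud_api_key = var.ibmcloud_api_key
  region           = local.region
  zone             = local.zone
}
```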
```
locals {
  bastion_nodes = {
    for idx in range(1, var.bastion_count + 1) :
```
this is still susceptible to the issues created by using index-based resource creation (i.e., like `count`). any disruption to the first bastion will cause all instances to be replaced
thanks for pointing this out. I've updated the code to replace the `count`/`range` style with a `for_each` over a map of bastion nodes using named keys (`primary`, `secondary`, ...), which tracks bastions by stable identifiers rather than by index. if one bastion is changed or removed, only that resource is affected instead of forcing all of them to be recreated. right now I've defined a single `primary` bastion, but the pattern allows adding more bastions safely just by extending the map
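A sketch of that pattern, with illustrative profile and zone values:

```
locals {
  # named keys instead of range() indices: adding "secondary" later
  # never disturbs the existing "primary" instance
  bastion_nodes = {
    primary = { profile = "bz2-2x8", zone = "eu-de-1" }
  }
}

resource "ibm_is_instance" "bastion" {
  for_each = local.bastion_nodes

  name    = "bastion-${each.key}"
  profile = each.value.profile
  zone    = each.value.zone
  image   = data.ibm_is_image.ubuntu.id # assumed image data source
  vpc     = data.ibm_is_vpc.vpc.id

  primary_network_interface {
    subnet = ibm_is_subnet.subnet.id
  }
}
```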